Hi Linus and folks,
I've been developing a tool for detecting deadlock possibilities by
tracking wait/event rather than lock(?) acquisition order, to try to
cover all synchronization mechanisms. It's based on the v5.18-rc3 tag.
https://github.com/lgebyungchulpark/linux-dept/commits/dept1.20_on_v5.18-rc3
Benefits:
0. Works with all lock primitives.
1. Works with wait_for_completion()/complete().
2. Works with 'wait' on PG_locked.
3. Works with 'wait' on PG_writeback.
4. Works with swait/wakeup.
5. Works with waitqueue.
6. Multiple reports are allowed.
7. Deduplication control on multiple reports.
8. Withstands false positives thanks to 6.
9. Easy to tag any wait/event (see the sketch right after this list).
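As a rough illustration of 9, this is how a custom wait/event might be
tagged with the SDT API used later in this series (sdt_map_init(),
sdt_wait_prepare(), sdt_wait_finish(), sdt_event()). The structure and
function names below are made up for the example:

	#include <linux/dept_sdt.h>
	#include <linux/sched.h>

	struct my_sync {
		struct dept_map dmap;
		/* ... whatever the primitive needs ... */
	};

	static void my_sync_init(struct my_sync *s)
	{
		sdt_map_init(&s->dmap);
	}

	static void my_sync_wait(struct my_sync *s)
	{
		/* staged here, committed once the task hits __schedule() */
		sdt_wait_prepare(&s->dmap);
		schedule();
		sdt_wait_finish();
	}

	static void my_sync_wake(struct my_sync *s)
	{
		/* the event the waiter has been waiting for */
		sdt_event(&s->dmap);
		/* ... actually wake the waiter up ... */
	}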
Future work:
0. To make it more stable.
1. To separate Dept from Lockdep.
2. To improve performance in terms of time and space.
3. To use Dept as a dependency engine for Lockdep.
4. To add any missing tags of wait/event in the kernel.
5. To deduplicate stack trace.
How to interpret reports:
1. The E (event) in each context cannot be triggered because of the
W (wait) that cannot be woken (see the example after this list).
2. The stack trace that helps find the problematic code is located
in each context's detail.
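For example, in the scenario below (all names made up), Dept would
report a deadlock that Lockdep cannot see because
wait_for_completion()/complete() are not locks: context A cannot
release 'a' because its wait on 'c' cannot be woken, and context B
cannot complete 'c' because it is blocked waiting for 'a'.

	#include <linux/mutex.h>
	#include <linux/completion.h>

	static DEFINE_MUTEX(a);
	static DECLARE_COMPLETION(c);

	static void context_a(void)
	{
		mutex_lock(&a);
		wait_for_completion(&c);	/* [W] cannot be woken       */
		mutex_unlock(&a);		/* [E] so never reached      */
	}

	static void context_b(void)
	{
		mutex_lock(&a);			/* [W] blocks: held by A     */
		complete(&c);			/* [E] cannot be triggered   */
		mutex_unlock(&a);
	}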
Thanks,
Byungchul
---
Changes from v5:
1. Use pr_warn_once() rather than WARN_ONCE() on the lack of
internal resources, because the stack trace printed by WARN_*()
is overkill for reporting the shortage. (feedback from Ted, Hyeonggon)
2. Fix trivial bugs such as a missing initialization of a struct
before use.
3. Assign a different class per task when handling on-stack
variables for waitqueues and the like, which makes Dept
distinguish between on-stack variables of different tasks so
as to prevent false positives. (reported by Hyeonggon)
4. Make Dept aware of even raw_local_irq_*() to prevent false
positives. (reported by Hyeonggon)
5. Don't consider dependencies between the events that might be
triggered within __schedule() and the waits that require
__schedule() to be real ones. (reported by Hyeonggon)
6. Unstage a staged wait that has passed prepare_to_wait_event()
but has not yet reached __schedule(), if __schedule() is
encountered in between for another sleep, which is possible if
e.g. a mutex_lock() exists in 'condition' of ___wait_event()
(see the sketch after this list).
7. Turn on CONFIG_PROVE_LOCKING when CONFIG_DEPT is on, to rely
on the hardirq and softirq entrance tracing to make Dept more
portable for now.
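The case item 6 refers to looks roughly like the following; the names
are made up, and the point is only that the 'condition' itself may
sleep before the staged wait reaches __schedule():

	#include <linux/mutex.h>
	#include <linux/wait.h>

	static DEFINE_MUTEX(my_mutex);
	static DECLARE_WAIT_QUEUE_HEAD(my_wq);
	static bool my_state_ready;

	static bool my_cond(void)
	{
		bool ret;

		/* may sleep, i.e. hits __schedule() on contention */
		mutex_lock(&my_mutex);
		ret = my_state_ready;
		mutex_unlock(&my_mutex);
		return ret;
	}

	static void my_waiter(void)
	{
		/*
		 * ___wait_event() stages a wait via prepare_to_wait_event()
		 * and then evaluates my_cond().  The inner sleep in
		 * mutex_lock() must not commit that staged wait, hence
		 * the unstaging described above.
		 */
		wait_event(my_wq, my_cond());
	}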
Changes from v4:
1. Fix some bugs that produce false alarms.
2. Distinguish each syscall context from another *for arm64*.
3. Print a message rather than warn when the Dept ring buffer
gets exhausted. (feedback from Hyeonggon)
4. Explicitly describe "EXPERIMENTAL" and "Dept might produce
false positive reports" in Kconfig. (feedback from Ted)
Changes from v3:
1. Dept shouldn't create dependencies between different depths
of a class indicated by *_lock_nested(). Dept normally doesn't,
but it did once another lock class came in between. Fixed it.
(feedback from Hyeonggon)
2. Dept considered a wait a real wait once it got to __schedule(),
even if the task had already been set back to TASK_RUNNING by a
wake-up source. Fixed it so that Dept doesn't consider that case
a real wait (see the sketch after this list). (feedback from Jan Kara)
3. Stop tracking dependencies with a map once the event
associated with the map has been handled. Dept will start to
work with the map again, on the next sleep.
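The situation item 2 describes is the window in the classic open-coded
wait pattern below (my_wq and condition are placeholders):

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&my_wq, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		/*
		 * A wake-up can land here, after prepare_to_wait() but
		 * before schedule().  The task is then already back to
		 * TASK_RUNNING and __schedule() returns without really
		 * blocking, so Dept must not treat it as a real wait.
		 */
		schedule();
	}
	finish_wait(&my_wq, &wait);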
Changes from v2:
1. Disable Dept on bit_wait_table[] in sched/wait_bit.c, which was
reporting a lot of false positives; my fault. Wait/event for
bit_wait_table[] should've been tagged in a higher layer to work
better, which is future work.
(feedback from Jan Kara)
2. Disable Dept on crypto_larval's completion to prevent a false
positive.
Changes from v1:
1. Fix coding style and typo. (feedback from Steven)
2. Distinguish each work context from another in workqueue.
3. Skip checking lock acquisition with nest_lock, which is about
correct lock usage that should be checked by Lockdep.
Changes from RFC(v0):
1. Add the wait tag at __schedule() rather than at prepare_to_wait().
(feedback from Linus and Matthew)
2. Use try version at lockdep_acquire_cpus_lock() annotation.
3. Distinguish each syscall context from another.
Byungchul Park (21):
llist: Move llist_{head,node} definition to types.h
dept: Implement Dept(Dependency Tracker)
dept: Apply Dept to spinlock
dept: Apply Dept to mutex families
dept: Apply Dept to rwlock
dept: Apply Dept to wait_for_completion()/complete()
dept: Apply Dept to seqlock
dept: Apply Dept to rwsem
dept: Add proc knobs to show stats and dependency graph
dept: Introduce split map concept and new APIs for them
dept: Apply Dept to wait/event of PG_{locked,writeback}
dept: Apply SDT to swait
dept: Apply SDT to wait(waitqueue)
locking/lockdep, cpu/hotplug: Use a weaker annotation in AP thread
dept: Distinguish each syscall context from another
dept: Distinguish each work from another
dept: Disable Dept within the wait_bit layer by default
dept: Disable Dept on struct crypto_larval's completion for now
dept: Differentiate onstack maps from others of different tasks in
class
dept: Do not add dependencies between events within scheduler and
sleeps
dept: Unstage wait when tagging a normal sleep wait
arch/arm64/kernel/syscall.c | 2 +
arch/x86/entry/common.c | 4 +
crypto/api.c | 7 +-
include/linux/completion.h | 44 +-
include/linux/dept.h | 596 ++++++++
include/linux/dept_page.h | 78 +
include/linux/dept_sdt.h | 67 +
include/linux/hardirq.h | 3 +
include/linux/irqflags.h | 71 +-
include/linux/llist.h | 8 -
include/linux/lockdep.h | 186 ++-
include/linux/lockdep_types.h | 3 +
include/linux/mutex.h | 22 +
include/linux/page-flags.h | 45 +-
include/linux/pagemap.h | 7 +-
include/linux/percpu-rwsem.h | 4 +-
include/linux/rtmutex.h | 1 +
include/linux/rwlock.h | 42 +
include/linux/rwlock_api_smp.h | 8 +-
include/linux/rwlock_types.h | 1 +
include/linux/rwsem.h | 22 +
include/linux/sched.h | 7 +
include/linux/seqlock.h | 60 +-
include/linux/spinlock.h | 21 +
include/linux/spinlock_types_raw.h | 3 +
include/linux/swait.h | 4 +
include/linux/types.h | 8 +
include/linux/wait.h | 6 +-
init/init_task.c | 2 +
init/main.c | 4 +
kernel/Makefile | 1 +
kernel/cpu.c | 2 +-
kernel/dependency/Makefile | 4 +
kernel/dependency/dept.c | 2938 ++++++++++++++++++++++++++++++++++++
kernel/dependency/dept_hash.h | 10 +
kernel/dependency/dept_internal.h | 26 +
kernel/dependency/dept_object.h | 13 +
kernel/dependency/dept_proc.c | 92 ++
kernel/exit.c | 7 +
kernel/fork.c | 2 +
kernel/locking/lockdep.c | 28 +-
kernel/locking/spinlock_rt.c | 24 +-
kernel/module.c | 2 +
kernel/sched/completion.c | 12 +-
kernel/sched/core.c | 10 +
kernel/sched/swait.c | 10 +
kernel/sched/wait.c | 16 +
kernel/sched/wait_bit.c | 5 +-
kernel/workqueue.c | 3 +
lib/Kconfig.debug | 28 +
mm/filemap.c | 68 +
mm/page_ext.c | 5 +
52 files changed, 4558 insertions(+), 84 deletions(-)
create mode 100644 include/linux/dept.h
create mode 100644 include/linux/dept_page.h
create mode 100644 include/linux/dept_sdt.h
create mode 100644 kernel/dependency/Makefile
create mode 100644 kernel/dependency/dept.c
create mode 100644 kernel/dependency/dept_hash.h
create mode 100644 kernel/dependency/dept_internal.h
create mode 100644 kernel/dependency/dept_object.h
create mode 100644 kernel/dependency/dept_proc.c
--
1.9.1
Makes Dept able to track dependencies by rwsem.
Signed-off-by: Byungchul Park <[email protected]>
---
include/linux/lockdep.h | 24 ++++++++++++++++++++----
include/linux/percpu-rwsem.h | 4 +++-
include/linux/rwsem.h | 22 ++++++++++++++++++++++
3 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index b59d8f3..b0e097f 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -634,10 +634,26 @@ static inline void print_irqtrace_events(struct task_struct *curr)
dept_mutex_unlock(&(l)->dmap, i); \
} while (0)
-#define rwsem_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i)
-#define rwsem_acquire_nest(l, s, t, n, i) lock_acquire_exclusive(l, s, t, n, i)
-#define rwsem_acquire_read(l, s, t, i) lock_acquire_shared(l, s, t, NULL, i)
-#define rwsem_release(l, i) lock_release(l, i)
+#define rwsem_acquire(l, s, t, i) \
+do { \
+ lock_acquire_exclusive(l, s, t, NULL, i); \
+ dept_rwsem_lock(&(l)->dmap, s, t, NULL, "up_write", i); \
+} while (0)
+#define rwsem_acquire_nest(l, s, t, n, i) \
+do { \
+ lock_acquire_exclusive(l, s, t, n, i); \
+ dept_rwsem_lock(&(l)->dmap, s, t, (n) ? &(n)->dmap : NULL, "up_write", i);\
+} while (0)
+#define rwsem_acquire_read(l, s, t, i) \
+do { \
+ lock_acquire_shared(l, s, t, NULL, i); \
+ dept_rwsem_lock(&(l)->dmap, s, t, NULL, "up_read", i); \
+} while (0)
+#define rwsem_release(l, i) \
+do { \
+ lock_release(l, i); \
+ dept_rwsem_unlock(&(l)->dmap, i); \
+} while (0)
#define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
#define lock_map_acquire_read(l) lock_acquire_shared_recursive(l, 0, 0, NULL, _THIS_IP_)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5fda40f..9a0603d 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -21,7 +21,9 @@ struct percpu_rw_semaphore {
};
#ifdef CONFIG_DEBUG_LOCK_ALLOC
-#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname },
+#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname) .dep_map = { \
+ .name = #lockname, \
+ .dmap = DEPT_MAP_INITIALIZER(lockname) },
#else
#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname)
#endif
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efa5c32..ed4c34e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -21,6 +21,7 @@
.dep_map = { \
.name = #lockname, \
.wait_type_inner = LD_WAIT_SLEEP, \
+ .dmap = DEPT_MAP_INITIALIZER(lockname) \
},
#else
# define __RWSEM_DEP_MAP_INIT(lockname)
@@ -32,6 +33,27 @@
#include <linux/osq_lock.h>
#endif
+#ifdef CONFIG_DEPT
+#define dept_rwsem_lock(m, ne, t, n, e_fn, ip) \
+do { \
+ if (t) { \
+ dept_ecxt_enter(m, 1UL, ip, __func__, e_fn, ne); \
+ } else if (n) { \
+ dept_ecxt_enter_nokeep(m); \
+ } else { \
+ dept_wait(m, 1UL, ip, __func__, ne); \
+ dept_ecxt_enter(m, 1UL, ip, __func__, e_fn, ne); \
+ } \
+} while (0)
+#define dept_rwsem_unlock(m, ip) \
+do { \
+ dept_ecxt_exit(m, 1UL, ip); \
+} while (0)
+#else
+#define dept_rwsem_lock(m, ne, t, n, e_fn, ip) do { } while (0)
+#define dept_rwsem_unlock(m, ip) do { } while (0)
+#endif
+
/*
* For an uncontended rwsem, count and owner are the only fields a task
* needs to touch when acquiring the rwsem. So they are put next to each
--
1.9.1
Makes SDT able to track dependencies by wait(waitqueue).
Signed-off-by: Byungchul Park <[email protected]>
---
include/linux/wait.h | 6 +++++-
kernel/sched/wait.c | 16 ++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 851e07d..e637585 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -7,6 +7,7 @@
#include <linux/list.h>
#include <linux/stddef.h>
#include <linux/spinlock.h>
+#include <linux/dept_sdt.h>
#include <asm/current.h>
#include <uapi/linux/wait.h>
@@ -37,6 +38,7 @@ struct wait_queue_entry {
struct wait_queue_head {
spinlock_t lock;
struct list_head head;
+ struct dept_map dmap;
};
typedef struct wait_queue_head wait_queue_head_t;
@@ -56,7 +58,8 @@ struct wait_queue_head {
#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
.lock = __SPIN_LOCK_UNLOCKED(name.lock), \
- .head = LIST_HEAD_INIT(name.head) }
+ .head = LIST_HEAD_INIT(name.head), \
+ .dmap = DEPT_MAP_INITIALIZER(name) }
#define DECLARE_WAIT_QUEUE_HEAD(name) \
struct wait_queue_head name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -67,6 +70,7 @@ struct wait_queue_head {
do { \
static struct lock_class_key __key; \
\
+ sdt_map_init(&(wq_head)->dmap); \
__init_waitqueue_head((wq_head), #wq_head, &__key); \
} while (0)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9860bb9..d67d0dc4 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -104,6 +104,7 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
if (flags & WQ_FLAG_BOOKMARK)
continue;
+ sdt_event(&wq_head->dmap);
ret = curr->func(curr, mode, wake_flags, key);
if (ret < 0)
break;
@@ -267,6 +268,9 @@ void __wake_up_pollfree(struct wait_queue_head *wq_head)
__add_wait_queue(wq_head, wq_entry);
set_current_state(state);
spin_unlock_irqrestore(&wq_head->lock, flags);
+
+ if (state & TASK_NORMAL)
+ sdt_wait_prepare(&wq_head->dmap);
}
EXPORT_SYMBOL(prepare_to_wait);
@@ -285,6 +289,10 @@ void __wake_up_pollfree(struct wait_queue_head *wq_head)
}
set_current_state(state);
spin_unlock_irqrestore(&wq_head->lock, flags);
+
+ if (state & TASK_NORMAL)
+ sdt_wait_prepare(&wq_head->dmap);
+
return was_empty;
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -330,6 +338,9 @@ long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_en
}
spin_unlock_irqrestore(&wq_head->lock, flags);
+ if (!ret && state & TASK_NORMAL)
+ sdt_wait_prepare(&wq_head->dmap);
+
return ret;
}
EXPORT_SYMBOL(prepare_to_wait_event);
@@ -351,7 +362,9 @@ int do_wait_intr(wait_queue_head_t *wq, wait_queue_entry_t *wait)
return -ERESTARTSYS;
spin_unlock(&wq->lock);
+ sdt_wait_prepare(&wq->dmap);
schedule();
+ sdt_wait_finish();
spin_lock(&wq->lock);
return 0;
@@ -368,7 +381,9 @@ int do_wait_intr_irq(wait_queue_head_t *wq, wait_queue_entry_t *wait)
return -ERESTARTSYS;
spin_unlock_irq(&wq->lock);
+ sdt_wait_prepare(&wq->dmap);
schedule();
+ sdt_wait_finish();
spin_lock_irq(&wq->lock);
return 0;
@@ -388,6 +403,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
{
unsigned long flags;
+ sdt_wait_finish();
__set_current_state(TASK_RUNNING);
/*
* We can check for list emptiness outside the lock
--
1.9.1
On Wed, May 4, 2022 at 1:19 AM Byungchul Park <[email protected]> wrote:
>
> Hi Linus and folks,
>
> I've been developing a tool for detecting deadlock possibilities by
> tracking wait/event rather than lock(?) acquisition order to try to
> cover all synchronization mechanisms.
So what is the actual status of reports these days?
Last time I looked at some reports, it gave a lot of false positives
due to mis-understanding prepare_to_sleep().
For this all to make sense, it would need to not have false positives
(or at least a very small number of them together with a way to sanely
get rid of them), and also have a track record of finding things that
lockdep doesn't.
Maybe such reports have been sent out with the current situation, and
I haven't seen them.
Linus
A sleep is not a wait that prevents the events within __schedule(); it
rather goes through __schedule(), so all those events can still be
triggered while sleeping. Therefore they don't have any dependencies on
each other.
So distinguish the sleep type of wait from the other type, i.e.
spinning, and make Dept skip building dependencies between sleep-type
waits and the events within __schedule().
Signed-off-by: Byungchul Park <[email protected]>
---
include/linux/completion.h | 2 +-
include/linux/dept.h | 28 ++++++++++++++++++++----
include/linux/dept_page.h | 4 ++--
include/linux/dept_sdt.h | 9 +++++++-
include/linux/lockdep.h | 52 +++++++++++++++++++++++++++++++++++++++-----
include/linux/mutex.h | 2 +-
include/linux/rwlock.h | 12 +++++-----
include/linux/rwsem.h | 2 +-
include/linux/seqlock.h | 2 +-
include/linux/spinlock.h | 8 +++----
kernel/dependency/dept.c | 37 ++++++++++++++++++++++++-------
kernel/locking/spinlock_rt.c | 24 ++++++++++----------
kernel/sched/core.c | 2 ++
13 files changed, 138 insertions(+), 46 deletions(-)
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 358c656..2dade27 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -36,7 +36,7 @@ struct completion {
#define dept_wfc_wait(m, ip) \
do { \
dept_ask_event(m); \
- dept_wait(m, 1UL, ip, __func__, 0); \
+ dept_wait(m, 1UL, ip, __func__, 0, true); \
} while (0)
#define dept_wfc_complete(m, ip) dept_event(m, 1UL, ip, __func__)
#define dept_wfc_enter(m, ip) dept_ecxt_enter(m, 1UL, ip, "completion_context_enter", "complete", 0)
diff --git a/include/linux/dept.h b/include/linux/dept.h
index 3027121..28db897 100644
--- a/include/linux/dept.h
+++ b/include/linux/dept.h
@@ -170,6 +170,11 @@ struct dept_ecxt {
*/
unsigned long event_ip;
struct dept_stack *event_stack;
+
+ /*
+ * whether the event is triggered within __schedule()
+ */
+ bool in_sched;
};
struct dept_wait {
@@ -208,6 +213,11 @@ struct dept_wait {
*/
unsigned long wait_ip;
struct dept_stack *wait_stack;
+
+ /*
+ * spin or sleep
+ */
+ bool sleep;
};
struct dept_dep {
@@ -460,6 +470,11 @@ struct dept_task {
*/
bool hardirqs_enabled;
bool softirqs_enabled;
+
+ /*
+ * whether the current task is in __schedule()
+ */
+ bool in_sched;
};
#define DEPT_TASK_INITIALIZER(t) \
@@ -480,6 +495,7 @@ struct dept_task {
.missing_ecxt = 0, \
.hardirqs_enabled = false, \
.softirqs_enabled = false, \
+ .in_sched = false, \
}
extern void dept_on(void);
@@ -492,7 +508,7 @@ struct dept_task {
extern void dept_map_reinit(struct dept_map *m);
extern void dept_map_nocheck(struct dept_map *m);
-extern void dept_wait(struct dept_map *m, unsigned long w_f, unsigned long ip, const char *w_fn, int ne);
+extern void dept_wait(struct dept_map *m, unsigned long w_f, unsigned long ip, const char *w_fn, int ne, bool sleep);
extern void dept_stage_wait(struct dept_map *m, unsigned long w_f, const char *w_fn, int ne);
extern void dept_ask_event_wait_commit(unsigned long ip);
extern void dept_clean_stage(void);
@@ -502,11 +518,13 @@ struct dept_task {
extern void dept_ecxt_exit(struct dept_map *m, unsigned long e_f, unsigned long ip);
extern void dept_split_map_each_init(struct dept_map_each *me);
extern void dept_split_map_common_init(struct dept_map_common *mc, struct dept_key *k, const char *n);
-extern void dept_wait_split_map(struct dept_map_each *me, struct dept_map_common *mc, unsigned long ip, const char *w_fn, int ne);
+extern void dept_wait_split_map(struct dept_map_each *me, struct dept_map_common *mc, unsigned long ip, const char *w_fn, int ne, bool sleep);
extern void dept_event_split_map(struct dept_map_each *me, struct dept_map_common *mc, unsigned long ip, const char *e_fn);
extern void dept_ask_event_split_map(struct dept_map_each *me, struct dept_map_common *mc);
extern void dept_kernel_enter(void);
extern void dept_work_enter(void);
+extern void dept_sched_enter(void);
+extern void dept_sched_exit(void);
static inline void dept_ecxt_enter_nokeep(struct dept_map *m)
{
@@ -546,7 +564,7 @@ static inline void dept_ecxt_enter_nokeep(struct dept_map *m)
#define dept_map_reinit(m) do { } while (0)
#define dept_map_nocheck(m) do { } while (0)
-#define dept_wait(m, w_f, ip, w_fn, ne) do { (void)(w_fn); } while (0)
+#define dept_wait(m, w_f, ip, w_fn, ne, s) do { (void)(w_fn); } while (0)
#define dept_stage_wait(m, w_f, w_fn, ne) do { (void)(w_fn); } while (0)
#define dept_ask_event_wait_commit(ip) do { } while (0)
#define dept_clean_stage() do { } while (0)
@@ -556,11 +574,13 @@ static inline void dept_ecxt_enter_nokeep(struct dept_map *m)
#define dept_ecxt_exit(m, e_f, ip) do { } while (0)
#define dept_split_map_each_init(me) do { } while (0)
#define dept_split_map_common_init(mc, k, n) do { (void)(n); (void)(k); } while (0)
-#define dept_wait_split_map(me, mc, ip, w_fn, ne) do { } while (0)
+#define dept_wait_split_map(me, mc, ip, w_fn, ne, s) do { } while (0)
#define dept_event_split_map(me, mc, ip, e_fn) do { } while (0)
#define dept_ask_event_split_map(me, mc) do { } while (0)
#define dept_kernel_enter() do { } while (0)
#define dept_work_enter() do { } while (0)
+#define dept_sched_enter() do { } while (0)
+#define dept_sched_exit() do { } while (0)
#define dept_ecxt_enter_nokeep(m) do { } while (0)
#define dept_key_init(k) do { (void)(k); } while (0)
#define dept_key_destroy(k) do { (void)(k); } while (0)
diff --git a/include/linux/dept_page.h b/include/linux/dept_page.h
index d2d093d..4af3b2d 100644
--- a/include/linux/dept_page.h
+++ b/include/linux/dept_page.h
@@ -20,7 +20,7 @@
\
if (likely(me)) \
dept_wait_split_map(me, &pglocked_mc, _RET_IP_, \
- __func__, 0); \
+ __func__, 0, true); \
} while (0)
#define dept_pglocked_set_bit(f) \
@@ -46,7 +46,7 @@
\
if (likely(me)) \
dept_wait_split_map(me, &pgwriteback_mc, _RET_IP_,\
- __func__, 0); \
+ __func__, 0, true); \
} while (0)
#define dept_pgwriteback_set_bit(f) \
diff --git a/include/linux/dept_sdt.h b/include/linux/dept_sdt.h
index 49763cd..14a1720 100644
--- a/include/linux/dept_sdt.h
+++ b/include/linux/dept_sdt.h
@@ -29,7 +29,13 @@
#define sdt_wait(m) \
do { \
dept_ask_event(m); \
- dept_wait(m, 1UL, _THIS_IP_, "wait", 0); \
+ dept_wait(m, 1UL, _THIS_IP_, "wait", 0, true); \
+ } while (0)
+
+#define sdt_wait_spin(m) \
+ do { \
+ dept_ask_event(m); \
+ dept_wait(m, 1UL, _THIS_IP_, "wait", 0, false); \
} while (0)
/*
* This will be committed in __schedule() when it actually gets to
@@ -47,6 +53,7 @@
#define sdt_map_init(m) do { } while (0)
#define sdt_map_init_key(m, k) do { (void)(k); } while (0)
#define sdt_wait(m) do { } while (0)
+#define sdt_wait_spin(m) do { } while (0)
#define sdt_wait_prepare(m) do { } while (0)
#define sdt_wait_finish() do { } while (0)
#define sdt_ecxt_enter(m) do { } while (0)
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index b0e097f..b2119f4 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -575,12 +575,12 @@ static inline void print_irqtrace_events(struct task_struct *curr)
#define spin_acquire(l, s, t, i) \
do { \
lock_acquire_exclusive(l, s, t, NULL, i); \
- dept_spin_lock(&(l)->dmap, s, t, NULL, "spin_unlock", i); \
+ dept_spin_lock(&(l)->dmap, s, t, NULL, "spin_unlock", i, false);\
} while (0)
#define spin_acquire_nest(l, s, t, n, i) \
do { \
lock_acquire_exclusive(l, s, t, n, i); \
- dept_spin_lock(&(l)->dmap, s, t, (n) ? &(n)->dmap : NULL, "spin_unlock", i); \
+ dept_spin_lock(&(l)->dmap, s, t, (n) ? &(n)->dmap : NULL, "spin_unlock", i, false); \
} while (0)
#define spin_release(l, i) \
do { \
@@ -591,16 +591,16 @@ static inline void print_irqtrace_events(struct task_struct *curr)
#define rwlock_acquire(l, s, t, i) \
do { \
lock_acquire_exclusive(l, s, t, NULL, i); \
- dept_rwlock_wlock(&(l)->dmap, s, t, NULL, "write_unlock", i); \
+ dept_rwlock_wlock(&(l)->dmap, s, t, NULL, "write_unlock", i, false);\
} while (0)
#define rwlock_acquire_read(l, s, t, i) \
do { \
if (read_lock_is_recursive()) { \
lock_acquire_shared_recursive(l, s, t, NULL, i); \
- dept_rwlock_rlock(&(l)->dmap, s, t, NULL, "read_unlock", i, 0);\
+ dept_rwlock_rlock(&(l)->dmap, s, t, NULL, "read_unlock", i, 0, false);\
} else { \
lock_acquire_shared(l, s, t, NULL, i); \
- dept_rwlock_rlock(&(l)->dmap, s, t, NULL, "read_unlock", i, 1);\
+ dept_rwlock_rlock(&(l)->dmap, s, t, NULL, "read_unlock", i, 1, false);\
} \
} while (0)
#define rwlock_release(l, i) \
@@ -614,6 +614,48 @@ static inline void print_irqtrace_events(struct task_struct *curr)
dept_rwlock_runlock(&(l)->dmap, i); \
} while (0)
+#define rt_spin_acquire(l, s, t, i) \
+do { \
+ lock_acquire_exclusive(l, s, t, NULL, i); \
+ dept_spin_lock(&(l)->dmap, s, t, NULL, "spin_unlock", i, true); \
+} while (0)
+#define rt_spin_acquire_nest(l, s, t, n, i) \
+do { \
+ lock_acquire_exclusive(l, s, t, n, i); \
+ dept_spin_lock(&(l)->dmap, s, t, (n) ? &(n)->dmap : NULL, "spin_unlock", i, true);\
+} while (0)
+#define rt_spin_release(l, i) \
+do { \
+ lock_release(l, i); \
+ dept_spin_unlock(&(l)->dmap, i); \
+} while (0)
+
+#define rt_rwlock_acquire(l, s, t, i) \
+do { \
+ lock_acquire_exclusive(l, s, t, NULL, i); \
+ dept_rwlock_wlock(&(l)->dmap, s, t, NULL, "write_unlock", i, true);\
+} while (0)
+#define rt_rwlock_acquire_read(l, s, t, i) \
+do { \
+ if (read_lock_is_recursive()) { \
+ lock_acquire_shared_recursive(l, s, t, NULL, i); \
+ dept_rwlock_rlock(&(l)->dmap, s, t, NULL, "read_unlock", i, 0, true);\
+ } else { \
+ lock_acquire_shared(l, s, t, NULL, i); \
+ dept_rwlock_rlock(&(l)->dmap, s, t, NULL, "read_unlock", i, 1, true);\
+ } \
+} while (0)
+#define rt_rwlock_release(l, i) \
+do { \
+ lock_release(l, i); \
+ dept_rwlock_wunlock(&(l)->dmap, i); \
+} while (0)
+#define rt_rwlock_release_read(l, i) \
+do { \
+ lock_release(l, i); \
+ dept_rwlock_runlock(&(l)->dmap, i); \
+} while (0)
+
#define seqcount_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i)
#define seqcount_acquire_read(l, s, t, i) lock_acquire_shared_recursive(l, s, t, NULL, i)
#define seqcount_release(l, i) lock_release(l, i)
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index b699cf41..e98a912 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -84,7 +84,7 @@ struct mutex {
} else if (n) { \
dept_ecxt_enter_nokeep(m); \
} else { \
- dept_wait(m, 1UL, ip, __func__, ne); \
+ dept_wait(m, 1UL, ip, __func__, ne, true); \
dept_ecxt_enter(m, 1UL, ip, __func__, e_fn, ne); \
} \
} while (0)
diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index bbab144..68a083d 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -33,25 +33,25 @@
#define DEPT_EVT_RWLOCK_W (1UL << 1)
#define DEPT_EVT_RWLOCK_RW (DEPT_EVT_RWLOCK_R | DEPT_EVT_RWLOCK_W)
-#define dept_rwlock_wlock(m, ne, t, n, e_fn, ip) \
+#define dept_rwlock_wlock(m, ne, t, n, e_fn, ip, s) \
do { \
if (t) { \
dept_ecxt_enter(m, DEPT_EVT_RWLOCK_W, ip, __func__, e_fn, ne);\
} else if (n) { \
dept_ecxt_enter_nokeep(m); \
} else { \
- dept_wait(m, DEPT_EVT_RWLOCK_RW, ip, __func__, ne); \
+ dept_wait(m, DEPT_EVT_RWLOCK_RW, ip, __func__, ne, s); \
dept_ecxt_enter(m, DEPT_EVT_RWLOCK_W, ip, __func__, e_fn, ne);\
} \
} while (0)
-#define dept_rwlock_rlock(m, ne, t, n, e_fn, ip, q) \
+#define dept_rwlock_rlock(m, ne, t, n, e_fn, ip, q, s) \
do { \
if (t) { \
dept_ecxt_enter(m, DEPT_EVT_RWLOCK_R, ip, __func__, e_fn, ne);\
} else if (n) { \
dept_ecxt_enter_nokeep(m); \
} else { \
- dept_wait(m, (q) ? DEPT_EVT_RWLOCK_RW : DEPT_EVT_RWLOCK_W, ip, __func__, ne);\
+ dept_wait(m, (q) ? DEPT_EVT_RWLOCK_RW : DEPT_EVT_RWLOCK_W, ip, __func__, ne, s);\
dept_ecxt_enter(m, DEPT_EVT_RWLOCK_R, ip, __func__, e_fn, ne);\
} \
} while (0)
@@ -64,8 +64,8 @@
dept_ecxt_exit(m, DEPT_EVT_RWLOCK_R, ip); \
} while (0)
#else
-#define dept_rwlock_wlock(m, ne, t, n, e_fn, ip) do { } while (0)
-#define dept_rwlock_rlock(m, ne, t, n, e_fn, ip, q) do { } while (0)
+#define dept_rwlock_wlock(m, ne, t, n, e_fn, ip, s) do { } while (0)
+#define dept_rwlock_rlock(m, ne, t, n, e_fn, ip, q, s) do { } while (0)
#define dept_rwlock_wunlock(m, ip) do { } while (0)
#define dept_rwlock_runlock(m, ip) do { } while (0)
#endif
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index ed4c34e..fd86dfd5 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -41,7 +41,7 @@
} else if (n) { \
dept_ecxt_enter_nokeep(m); \
} else { \
- dept_wait(m, 1UL, ip, __func__, ne); \
+ dept_wait(m, 1UL, ip, __func__, ne, true); \
dept_ecxt_enter(m, 1UL, ip, __func__, e_fn, ne); \
} \
} while (0)
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 47c3379..ac2ac40 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -25,7 +25,7 @@
#ifdef CONFIG_DEPT
#define DEPT_EVT_ALL ((1UL << DEPT_MAX_SUBCLASSES_EVT) - 1)
-#define dept_seq_wait(m, ip) dept_wait(m, DEPT_EVT_ALL, ip, __func__, 0)
+#define dept_seq_wait(m, ip) dept_wait(m, DEPT_EVT_ALL, ip, __func__, 0, false)
#define dept_seq_writebegin(m, ip) \
do { \
dept_ecxt_enter(m, 1UL, ip, __func__, "write_seqcount_end", 0);\
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 191fb99..a78aaa3 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -96,14 +96,14 @@
#endif
#ifdef CONFIG_DEPT
-#define dept_spin_lock(m, ne, t, n, e_fn, ip) \
+#define dept_spin_lock(m, ne, t, n, e_fn, ip, s) \
do { \
if (t) { \
dept_ecxt_enter(m, 1UL, ip, __func__, e_fn, ne); \
} else if (n) { \
dept_ecxt_enter_nokeep(m); \
} else { \
- dept_wait(m, 1UL, ip, __func__, ne); \
+ dept_wait(m, 1UL, ip, __func__, ne, s); \
dept_ecxt_enter(m, 1UL, ip, __func__, e_fn, ne); \
} \
} while (0)
@@ -112,8 +112,8 @@
dept_ecxt_exit(m, 1UL, ip); \
} while (0)
#else
-#define dept_spin_lock(m, ne, t, n, e_fn, ip) do { } while (0)
-#define dept_spin_unlock(m, ip) do { } while (0)
+#define dept_spin_lock(m, ne, t, n, e_fn, ip, s) do { } while (0)
+#define dept_spin_unlock(m, ip) do { } while (0)
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
diff --git a/kernel/dependency/dept.c b/kernel/dependency/dept.c
index 2bc6259..14dc33b 100644
--- a/kernel/dependency/dept.c
+++ b/kernel/dependency/dept.c
@@ -1425,6 +1425,13 @@ static void add_dep(struct dept_ecxt *e, struct dept_wait *w)
struct dept_dep *d;
int i;
+ /*
+ * It's meaningless to track dependencies between sleeps and
+ * events triggered within __schedule().
+ */
+ if (e->in_sched && w->sleep)
+ return;
+
if (lookup_dep(fc, tc))
return;
@@ -1469,7 +1476,7 @@ static void add_dep(struct dept_ecxt *e, struct dept_wait *w)
static atomic_t wgen = ATOMIC_INIT(1);
static void add_wait(struct dept_class *c, unsigned long ip,
- const char *w_fn, int ne)
+ const char *w_fn, int ne, bool sleep)
{
struct dept_task *dt = dept_task();
struct dept_wait *w;
@@ -1485,6 +1492,7 @@ static void add_wait(struct dept_class *c, unsigned long ip,
w->wait_ip = ip;
w->wait_fn = w_fn;
w->wait_stack = get_current_stack();
+ w->sleep = sleep;
cxt = cur_cxt();
if (cxt == DEPT_CXT_HIRQ || cxt == DEPT_CXT_SIRQ)
@@ -1538,6 +1546,7 @@ static bool add_ecxt(void *obj, struct dept_class *c, unsigned long ip,
e->ecxt_stack = ip && rich_stack ? get_current_stack() : NULL;
e->event_fn = e_fn;
e->ecxt_fn = c_fn;
+ e->in_sched = dt->in_sched;
eh = dt->ecxt_held + (dt->ecxt_held_pos++);
eh->ecxt = get_ecxt(e);
@@ -1906,6 +1915,16 @@ void dept_hardirq_enter(void)
dt->cxt_id[DEPT_CXT_HIRQ] += (1UL << DEPT_CXTS_NR);
}
+void dept_sched_enter(void)
+{
+ dept_task()->in_sched = true;
+}
+
+void dept_sched_exit(void)
+{
+ dept_task()->in_sched = false;
+}
+
/*
* DEPT API
* =====================================================================
@@ -2119,7 +2138,8 @@ static struct dept_class *check_new_class(struct dept_key *local,
}
static void __dept_wait(struct dept_map *m, unsigned long w_f,
- unsigned long ip, const char *w_fn, int ne)
+ unsigned long ip, const char *w_fn, int ne,
+ bool sleep)
{
int e;
@@ -2142,12 +2162,12 @@ static void __dept_wait(struct dept_map *m, unsigned long w_f,
if (!c)
continue;
- add_wait(c, ip, w_fn, ne);
+ add_wait(c, ip, w_fn, ne, sleep);
}
}
void dept_wait(struct dept_map *m, unsigned long w_f, unsigned long ip,
- const char *w_fn, int ne)
+ const char *w_fn, int ne, bool sleep)
{
struct dept_task *dt = dept_task();
unsigned long flags;
@@ -2163,7 +2183,7 @@ void dept_wait(struct dept_map *m, unsigned long w_f, unsigned long ip,
flags = dept_enter();
- __dept_wait(m, w_f, ip, w_fn, ne);
+ __dept_wait(m, w_f, ip, w_fn, ne, sleep);
dept_exit(flags);
}
@@ -2296,7 +2316,7 @@ void dept_ask_event_wait_commit(unsigned long ip)
wg = atomic_inc_return(&wgen) ?: atomic_inc_return(&wgen);
WRITE_ONCE(m->wgen, wg);
- __dept_wait(m, w_f, ip, w_fn, ne);
+ __dept_wait(m, w_f, ip, w_fn, ne, true);
exit:
dept_exit(flags);
}
@@ -2526,7 +2546,8 @@ void dept_split_map_common_init(struct dept_map_common *mc,
void dept_wait_split_map(struct dept_map_each *me,
struct dept_map_common *mc,
- unsigned long ip, const char *w_fn, int ne)
+ unsigned long ip, const char *w_fn, int ne,
+ bool sleep)
{
struct dept_task *dt = dept_task();
struct dept_class *c;
@@ -2547,7 +2568,7 @@ void dept_wait_split_map(struct dept_map_each *me,
k = mc->keys ?: &mc->keys_local;
c = check_new_class(&mc->keys_local, k, 0, 0UL, mc->name);
if (c)
- add_wait(c, ip, w_fn, ne);
+ add_wait(c, ip, w_fn, ne, sleep);
dept_exit(flags);
}
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 48a19ed..2e1d0e5 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -51,7 +51,7 @@ static __always_inline void __rt_spin_lock(spinlock_t *lock)
void __sched rt_spin_lock(spinlock_t *lock)
{
- spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+ rt_spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
__rt_spin_lock(lock);
}
EXPORT_SYMBOL(rt_spin_lock);
@@ -59,7 +59,7 @@ void __sched rt_spin_lock(spinlock_t *lock)
#ifdef CONFIG_DEBUG_LOCK_ALLOC
void __sched rt_spin_lock_nested(spinlock_t *lock, int subclass)
{
- spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
+ rt_spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
__rt_spin_lock(lock);
}
EXPORT_SYMBOL(rt_spin_lock_nested);
@@ -67,7 +67,7 @@ void __sched rt_spin_lock_nested(spinlock_t *lock, int subclass)
void __sched rt_spin_lock_nest_lock(spinlock_t *lock,
struct lockdep_map *nest_lock)
{
- spin_acquire_nest(&lock->dep_map, 0, 0, nest_lock, _RET_IP_);
+ rt_spin_acquire_nest(&lock->dep_map, 0, 0, nest_lock, _RET_IP_);
__rt_spin_lock(lock);
}
EXPORT_SYMBOL(rt_spin_lock_nest_lock);
@@ -75,7 +75,7 @@ void __sched rt_spin_lock_nest_lock(spinlock_t *lock,
void __sched rt_spin_unlock(spinlock_t *lock)
{
- spin_release(&lock->dep_map, _RET_IP_);
+ rt_spin_release(&lock->dep_map, _RET_IP_);
migrate_enable();
rcu_read_unlock();
@@ -104,7 +104,7 @@ static __always_inline int __rt_spin_trylock(spinlock_t *lock)
ret = rt_mutex_slowtrylock(&lock->lock);
if (ret) {
- spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+ rt_spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
rcu_read_lock();
migrate_disable();
}
@@ -197,7 +197,7 @@ int __sched rt_read_trylock(rwlock_t *rwlock)
ret = rwbase_read_trylock(&rwlock->rwbase);
if (ret) {
- rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_);
+ rt_rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_);
rcu_read_lock();
migrate_disable();
}
@@ -211,7 +211,7 @@ int __sched rt_write_trylock(rwlock_t *rwlock)
ret = rwbase_write_trylock(&rwlock->rwbase);
if (ret) {
- rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_);
+ rt_rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_);
rcu_read_lock();
migrate_disable();
}
@@ -222,7 +222,7 @@ int __sched rt_write_trylock(rwlock_t *rwlock)
void __sched rt_read_lock(rwlock_t *rwlock)
{
rtlock_might_resched();
- rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_);
+ rt_rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_);
rwbase_read_lock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
rcu_read_lock();
migrate_disable();
@@ -232,7 +232,7 @@ void __sched rt_read_lock(rwlock_t *rwlock)
void __sched rt_write_lock(rwlock_t *rwlock)
{
rtlock_might_resched();
- rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_);
+ rt_rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_);
rwbase_write_lock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
rcu_read_lock();
migrate_disable();
@@ -243,7 +243,7 @@ void __sched rt_write_lock(rwlock_t *rwlock)
void __sched rt_write_lock_nested(rwlock_t *rwlock, int subclass)
{
rtlock_might_resched();
- rwlock_acquire(&rwlock->dep_map, subclass, 0, _RET_IP_);
+ rt_rwlock_acquire(&rwlock->dep_map, subclass, 0, _RET_IP_);
rwbase_write_lock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
rcu_read_lock();
migrate_disable();
@@ -253,7 +253,7 @@ void __sched rt_write_lock_nested(rwlock_t *rwlock, int subclass)
void __sched rt_read_unlock(rwlock_t *rwlock)
{
- rwlock_release(&rwlock->dep_map, _RET_IP_);
+ rt_rwlock_release(&rwlock->dep_map, _RET_IP_);
migrate_enable();
rcu_read_unlock();
rwbase_read_unlock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
@@ -262,7 +262,7 @@ void __sched rt_read_unlock(rwlock_t *rwlock)
void __sched rt_write_unlock(rwlock_t *rwlock)
{
- rwlock_release(&rwlock->dep_map, _RET_IP_);
+ rt_rwlock_release(&rwlock->dep_map, _RET_IP_);
rcu_read_unlock();
migrate_enable();
rwbase_write_unlock(&rwlock->rwbase);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5784b07..cb42f52 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6272,6 +6272,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
struct rq *rq;
int cpu;
+ dept_sched_enter();
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
@@ -6401,6 +6402,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);
}
+ dept_sched_exit();
}
void __noreturn do_task_dead(void)
--
1.9.1
On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > Linus wrote:
> > >
> > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <[email protected]> wrote:
> > > >
> > > > Hi Linus and folks,
> > > >
> > > > I've been developing a tool for detecting deadlock possibilities by
> > > > tracking wait/event rather than lock(?) acquisition order to try to
> > > > cover all synchronization mechanisms.
> > >
> > > So what is the actual status of reports these days?
> > >
> > > Last time I looked at some reports, it gave a lot of false positives
> > > due to mis-understanding prepare_to_sleep().
> >
> > Yes, it was. I handled the case in the following way:
> >
> > 1. Stage the wait at prepare_to_sleep(), which might be used at commit.
> > Which has yet to be an actual wait that Dept considers.
> > 2. If the condition for sleep is true, the wait will be committed at
> > __schedule(). The wait becomes an actual one that Dept considers.
> > 3. If the condition is false and the task gets back to TASK_RUNNING,
> > clean(=reset) the staged wait.
> >
> > That way, Dept only works with what actually hits to __schedule() for
> > the waits through sleep.
> >
> > > For this all to make sense, it would need to not have false positives
> > > (or at least a very small number of them together with a way to sanely
> >
> > Yes. I agree with you. I got rid of them that way I described above.
> >
>
> IMHO DEPT should not report what lockdep allows (Not talking about
No.
> wait events). I mean lockdep allows some kind of nested locks but
> DEPT reports them.
You have already asked exactly the same question in another LKML
thread. I answered it then, but let me explain it again.
---
CASE 1.
lock L with depth n
lock_nested L' with depth n + 1
...
unlock L'
unlock L
This case is allowed by Lockdep.
This case is allowed by DEPT cuz it's not a deadlock.
CASE 2.
lock L with depth n
lock A
lock_nested L' with depth n + 1
...
unlock L'
unlock A
unlock L
This case is allowed by Lockdep.
This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
---
The following scenario would explain why CASE 2 is problematic.
THREAD X                            THREAD Y

lock L with depth n
                                    lock L' with depth n
lock A
                                    lock A
lock_nested L' with depth n + 1
                                    lock_nested L'' with depth n + 1
...                                 ...
unlock L'                           unlock L''
unlock A                            unlock A
unlock L                            unlock L'
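Restating the scenario above as a rough kernel-style sketch; all names
are made up, and the three L locks end up in the same class because
they are initialized at a single mutex_init() call site:

	#include <linux/mutex.h>
	#include <linux/lockdep.h>

	struct node {
		struct mutex lock;
	};

	static struct node nl, nlp, nlpp;	/* L, L', L''          */
	static DEFINE_MUTEX(A);			/* a lock of another class */

	static void node_init(struct node *n)
	{
		/* single mutex_init() call site: all nodes share a class */
		mutex_init(&n->lock);
	}

	static void thread_x(void)
	{
		mutex_lock(&nl.lock);				/* L, depth n       */
		mutex_lock(&A);
		/* blocks: L' is held by thread_y() */
		mutex_lock_nested(&nlp.lock, SINGLE_DEPTH_NESTING); /* L', depth n + 1 */
		mutex_unlock(&nlp.lock);
		mutex_unlock(&A);
		mutex_unlock(&nl.lock);
	}

	static void thread_y(void)
	{
		mutex_lock(&nlp.lock);				/* L', depth n      */
		/* blocks: A is held by thread_x() */
		mutex_lock(&A);
		mutex_lock_nested(&nlpp.lock, SINGLE_DEPTH_NESTING); /* L'', depth n + 1 */
		mutex_unlock(&nlpp.lock);
		mutex_unlock(&A);
		mutex_unlock(&nlp.lock);
	}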
Yes. I need to check whether the report you shared with me is a true
positive, but either way it's not because DEPT fails to handle the
*_nested() APIs.
Byungchul
On Mon, 9 May 2022 09:16:37 +0900
Byungchul Park <[email protected]> wrote:
> CASE 2.
>
> lock L with depth n
> lock A
> lock_nested L' with depth n + 1
> ...
> unlock L'
> unlock A
> unlock L
>
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
>
> ---
>
> The following scenario would explain why CASE 2 is problematic.
>
> THREAD X                            THREAD Y
>
> lock L with depth n
>                                     lock L' with depth n
> lock A
>                                     lock A
> lock_nested L' with depth n + 1
I'm confused by what exactly you are saying is a deadlock above.
Are you saying that lock A and L' are inverted? If so, lockdep had better
detect that regardless of L. A nested lock associates the nesting with
the same type of lock. That is, in lockdep, nested tells lockdep not to
trigger on L and L', but it will not ignore that A was taken.
-- Steve
>                                     lock_nested L'' with depth n + 1
> ...                                 ...
> unlock L'                           unlock L''
> unlock A                            unlock A
> unlock L                            unlock L'
I tried DEPT-v6 applied against 5.18-rc5, and it reported the
following positive.
The reason why it's nonsense is that in context A's [W] wait:
[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928] folio_wait_bit_common+0x2fa/0x460
[ 1538.546248] __filemap_get_folio+0x3e4/0x420
[ 1538.546558] pagecache_get_page+0x11/0x40
[ 1538.546852] ext4_mb_init_group+0x80/0x2e0
[ 1538.547152] ext4_mb_good_group_nolock+0x2a3/0x2d0
... we're reading the block allocation bitmap into the page cache.
This does not correspond to a real inode, so we don't actually take
ei->i_data_sem in this case on the pseudo-inode used.
In contrast, in context B's [W] and [E] stack traces, the
folio_wait_bit is clearly associated with a page which is mapped to a
real inode:
[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485] ext4_map_blocks+0x17b/0x680
[ 1538.554772] mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122] ext4_writepages+0x798/0x990
[ 1538.555409] do_writepages+0xcf/0x1c0
[ 1538.555682] __writeback_single_inode+0x58/0x3f0
[ 1538.556014] writeback_sb_inodes+0x210/0x540
...
[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455] ext4_bio_write_page+0x400/0x560
[ 1538.559765] mpage_submit_page+0x5c/0x80
[ 1538.560051] mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409] mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764] ext4_writepages+0x798/0x990
[ 1538.561057] do_writepages+0xcf/0x1c0
[ 1538.561329] __writeback_single_inode+0x58/0x3f0
...
In any case, this will ***never*** deadlock, and it's due to DEPT
fundamentally not understanding that waits on different pages may be
for pages belonging to completely different inodes, and so there is
zero chance this would ever deadlock.
I suspect there will be similar false positives for tests (or
userspace) that use the copy_file_range(2) or sendfile(2) system calls.
I've included the full DEPT log report below.
- Ted
generic/011 [20:11:16][ 1533.411773] run fstests generic/011 at 2022-05-07 20:11:16
[ 1533.509603] DEPT_INFO_ONCE: Need to expand the ring buffer.
[ 1536.910044] DEPT_INFO_ONCE: Pool(wait) is empty.
[ 1538.533315] ===================================================
[ 1538.533793] DEPT: Circular dependency has been detected.
[ 1538.534199] 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571 Not tainted
[ 1538.534645] ---------------------------------------------------
[ 1538.535035] summary
[ 1538.535177] ---------------------------------------------------
[ 1538.535567] *** DEADLOCK ***
[ 1538.535567]
[ 1538.535854] context A
[ 1538.536008] [S] down_write(&ei->i_data_sem:0)
[ 1538.536323] [W] folio_wait_bit_common(pglocked:0)
[ 1538.536655] [E] up_write(&ei->i_data_sem:0)
[ 1538.536958]
[ 1538.537063] context B
[ 1538.537216] [S] (unknown)(pglocked:0)
[ 1538.537480] [W] down_write(&ei->i_data_sem:0)
[ 1538.537789] [E] folio_wake_bit(pglocked:0)
[ 1538.538082]
[ 1538.538184] [S]: start of the event context
[ 1538.538460] [W]: the wait blocked
[ 1538.538680] [E]: the event not reachable
[ 1538.538939] ---------------------------------------------------
[ 1538.539327] context A's detail
[ 1538.539530] ---------------------------------------------------
[ 1538.539918] context A
[ 1538.540072] [S] down_write(&ei->i_data_sem:0)
[ 1538.540382] [W] folio_wait_bit_common(pglocked:0)
[ 1538.540712] [E] up_write(&ei->i_data_sem:0)
[ 1538.541015]
[ 1538.541119] [S] down_write(&ei->i_data_sem:0):
[ 1538.541410] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.541782] stacktrace:
[ 1538.541946] ext4_map_blocks+0x17b/0x680
[ 1538.542234] ext4_getblk+0x5f/0x1f0
[ 1538.542493] ext4_bread+0xc/0x70
[ 1538.542736] ext4_append+0x48/0xf0
[ 1538.542991] ext4_init_new_dir+0xc8/0x160
[ 1538.543284] ext4_mkdir+0x19a/0x320
[ 1538.543542] vfs_mkdir+0x83/0xe0
[ 1538.543788] do_mkdirat+0x8c/0x130
[ 1538.544042] __x64_sys_mkdir+0x29/0x30
[ 1538.544319] do_syscall_64+0x40/0x90
[ 1538.544584] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.544949]
[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928] folio_wait_bit_common+0x2fa/0x460
[ 1538.546248] __filemap_get_folio+0x3e4/0x420
[ 1538.546558] pagecache_get_page+0x11/0x40
[ 1538.546852] ext4_mb_init_group+0x80/0x2e0
[ 1538.547152] ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.547496] ext4_mb_regular_allocator+0x391/0x780
[ 1538.547840] ext4_mb_new_blocks+0x44e/0x720
[ 1538.548145] ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.548455] ext4_map_blocks+0x19e/0x680
[ 1538.548743] ext4_getblk+0x5f/0x1f0
[ 1538.549006] ext4_bread+0xc/0x70
[ 1538.549250] ext4_append+0x48/0xf0
[ 1538.549505] ext4_init_new_dir+0xc8/0x160
[ 1538.549798] ext4_mkdir+0x19a/0x320
[ 1538.550058] vfs_mkdir+0x83/0xe0
[ 1538.550302] do_mkdirat+0x8c/0x130
[ 1538.550557]
[ 1538.550660] [E] up_write(&ei->i_data_sem:0):
[ 1538.550940] (N/A)
[ 1538.551071] ---------------------------------------------------
[ 1538.551459] context B's detail
[ 1538.551662] ---------------------------------------------------
[ 1538.552047] context B
[ 1538.552202] [S] (unknown)(pglocked:0)
[ 1538.552466] [W] down_write(&ei->i_data_sem:0)
[ 1538.552775] [E] folio_wake_bit(pglocked:0)
[ 1538.553071]
[ 1538.553174] [S] (unknown)(pglocked:0):
[ 1538.553422] (N/A)
[ 1538.553553]
[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485] ext4_map_blocks+0x17b/0x680
[ 1538.554772] mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122] ext4_writepages+0x798/0x990
[ 1538.555409] do_writepages+0xcf/0x1c0
[ 1538.555682] __writeback_single_inode+0x58/0x3f0
[ 1538.556014] writeback_sb_inodes+0x210/0x540
[ 1538.556324] __writeback_inodes_wb+0x4c/0xe0
[ 1538.556635] wb_writeback+0x298/0x450
[ 1538.556911] wb_do_writeback+0x29e/0x320
[ 1538.557199] wb_workfn+0x6a/0x2c0
[ 1538.557447] process_one_work+0x302/0x650
[ 1538.557743] worker_thread+0x55/0x400
[ 1538.558013] kthread+0xf0/0x120
[ 1538.558251] ret_from_fork+0x1f/0x30
[ 1538.558518]
[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455] ext4_bio_write_page+0x400/0x560
[ 1538.559765] mpage_submit_page+0x5c/0x80
[ 1538.560051] mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409] mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764] ext4_writepages+0x798/0x990
[ 1538.561057] do_writepages+0xcf/0x1c0
[ 1538.561329] __writeback_single_inode+0x58/0x3f0
[ 1538.561662] writeback_sb_inodes+0x210/0x540
[ 1538.561973] __writeback_inodes_wb+0x4c/0xe0
[ 1538.562283] wb_writeback+0x298/0x450
[ 1538.562555] wb_do_writeback+0x29e/0x320
[ 1538.562842] wb_workfn+0x6a/0x2c0
[ 1538.563095] process_one_work+0x302/0x650
[ 1538.563387] worker_thread+0x55/0x400
[ 1538.563658] kthread+0xf0/0x120
[ 1538.563895] ret_from_fork+0x1f/0x30
[ 1538.564161] ---------------------------------------------------
[ 1538.564548] information that might be helpful
[ 1538.564832] ---------------------------------------------------
[ 1538.565223] CPU: 1 PID: 46539 Comm: dirstress Not tainted 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571
[ 1538.565854] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 1538.566394] Call Trace:
[ 1538.566559] <TASK>
[ 1538.566701] dump_stack_lvl+0x4f/0x68
[ 1538.566945] print_circle.cold+0x15b/0x169
[ 1538.567218] ? print_circle+0xe0/0xe0
[ 1538.567461] cb_check_dl+0x55/0x60
[ 1538.567687] bfs+0xd5/0x1b0
[ 1538.567874] add_dep+0xd3/0x1a0
[ 1538.568083] ? __filemap_get_folio+0x3e4/0x420
[ 1538.568374] add_wait+0xe3/0x250
[ 1538.568590] ? __filemap_get_folio+0x3e4/0x420
[ 1538.568886] dept_wait_split_map+0xb1/0x130
[ 1538.569163] folio_wait_bit_common+0x2fa/0x460
[ 1538.569456] ? lock_is_held_type+0xfc/0x130
[ 1538.569733] __filemap_get_folio+0x3e4/0x420
[ 1538.570013] ? __lock_release+0x1b2/0x2c0
[ 1538.570278] pagecache_get_page+0x11/0x40
[ 1538.570543] ext4_mb_init_group+0x80/0x2e0
[ 1538.570813] ? ext4_get_group_desc+0xb2/0x200
[ 1538.571102] ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.571418] ext4_mb_regular_allocator+0x391/0x780
[ 1538.571733] ? rcu_read_lock_sched_held+0x3f/0x70
[ 1538.572044] ? trace_kmem_cache_alloc+0x2c/0xd0
[ 1538.572343] ? kmem_cache_alloc+0x1f7/0x3f0
[ 1538.572618] ext4_mb_new_blocks+0x44e/0x720
[ 1538.572896] ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.573179] ? find_held_lock+0x2b/0x80
[ 1538.573434] ext4_map_blocks+0x19e/0x680
[ 1538.573693] ext4_getblk+0x5f/0x1f0
[ 1538.573927] ext4_bread+0xc/0x70
[ 1538.574141] ext4_append+0x48/0xf0
[ 1538.574369] ext4_init_new_dir+0xc8/0x160
[ 1538.574634] ext4_mkdir+0x19a/0x320
[ 1538.574866] vfs_mkdir+0x83/0xe0
[ 1538.575082] do_mkdirat+0x8c/0x130
[ 1538.575308] __x64_sys_mkdir+0x29/0x30
[ 1538.575557] do_syscall_64+0x40/0x90
[ 1538.575795] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.576128] RIP: 0033:0x7f0960466b07
[ 1538.576367] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
[ 1538.577576] RSP: 002b:00007ffd0fa955a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 1538.578069] RAX: ffffffffffffffda RBX: 0000000000000239 RCX: 00007f0960466b07
[ 1538.578533] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffd0fa955d0
[ 1538.578995] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000010
[ 1538.579458] R10: 00007ffd0fa95345 R11: 0000000000000246 R12: 00000000000003e8
[ 1538.579923] R13: 0000000000000000 R14: 00007ffd0fa955d0 R15: 00007ffd0fa95dd0
[ 1538.580389] </TASK>
[ 1540.581382] EXT4-fs (vdb): mounted filesystem with ordered data mode. Quota mode: none.
[20:11:24] 8s
P.S. Later on the console, the test ground to a halt because DEPT
started WARNING over and over and over again....
[ 3129.686102] DEPT_WARN_ON: dt->ecxt_held_pos == DEPT_MAX_ECXT_HELD
[ 3129.686396] ? __might_fault+0x32/0x80
[ 3129.686660] WARNING: CPU: 1 PID: 107320 at kernel/dependency/dept.c:1537 add_ecxt+0x1c0/0x1d0
[ 3129.687040] ? __might_fault+0x32/0x80
[ 3129.687282] CPU: 1 PID: 107320 Comm: aio-stress Tainted: G W 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571
with multiple CPUs completely spamming the serial console. This
should probably be a WARN_ON_ONCE, or something that disables DEPT
entirely, since apparently no useful DEPT reports (or any useful
kernel work, for that matter) are going to be happening after this.
Oh, one other problem with DEPT --- it's SLOW --- the overhead is
enormous. Using kvm-xfstests[1] running "kvm-xfstests smoke", here
are some sample times:
                     LOCKDEP         DEPT
Time to first test   49 seconds      602 seconds
ext4/001             2 s             22 s
ext4/003             2 s             8 s
ext4/005             0 s             7 s
ext4/020             1 s             8 s
ext4/021             11 s            17 s
ext4/023             0 s             83 s
generic/001          4 s             76 s
generic/002          0 s             11 s
generic/003          10 s            19 s
There are some large variations; in some cases, some xfstests take 10x
as much time or more to run. In fact, when I first started the
kvm-xfstests run with DEPT, I thought something had hung and that
tests would never start. (In fact, with gce-xfstests the default
watchdog "something has gone terribly wrong with the kexec" had fired,
and I didn't get any test results using gce-xfstests at all. If DEPT
goes in without any optimizations, I'm going to have to adjust the
watchdogs timers for gce-xfstests.)
The bottom line is that at the moment, between the false positives,
and the significant overhead imposed by DEPT, I would suggest that if
DEPT ever does go in, that it should be possible to disable DEPT and
only use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
because DEPT is S - L - O - W.
[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
- Ted
P.S. Darrick and I both have disabled using LOCKDEP by default
because it slows down ext4 -g auto testing by a factor of 2, and xfs -g
auto testing by a factor of 3. So the fact that DEPT is a factor of
2x to 10x or more slower than LOCKDEP when running various xfstests
tests should be a real concern.
On Mon, May 09, 2022 at 06:28:17PM -0400, Theodore Ts'o wrote:
> Oh, one other problem with DEPT --- it's SLOW --- the overhead is
> enormous. Using kvm-xfstests[1] running "kvm-xfstests smoke", here
> are some sample times:
Yes, right. DEPT has never been optimized. It rather turns on
CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets on
because of porting issues. I have no choice but to rely on those to
develop DEPT out of tree. Of course, that's what I don't like.
Plus, for now, I'm focusing on removing false positives. Once it's
considered settled down, I will work on performance optimization. But
it should still keep relying on Lockdep CONFIGs and adding additional
overhead on it until DEPT can be developed in the tree.
>                      LOCKDEP         DEPT
> Time to first test   49 seconds      602 seconds
> ext4/001             2 s             22 s
> ext4/003             2 s             8 s
> ext4/005             0 s             7 s
> ext4/020             1 s             8 s
> ext4/021             11 s            17 s
> ext4/023             0 s             83 s
> generic/001          4 s             76 s
> generic/002          0 s             11 s
> generic/003          10 s            19 s
>
> There are some large variations; in some cases, some xfstests take 10x
> as much time or more to run. In fact, when I first started the
> kvm-xfstests run with DEPT, I thought something had hung and that
> tests would never start. (In fact, with gce-xfstests the default
> watchdog "something has gone terribly wrong with the kexec" had fired,
> and I didn't get any test results using gce-xfstests at all. If DEPT
> goes in without any optimizations, I'm going to have to adjust the
> watchdogs timers for gce-xfstests.)
Thank you for informing it. I will go for the optimization as well.
> The bottom line is that at the moment, between the false positives,
> and the significant overhead imposed by DEPT, I would suggest that if
> DEPT ever does go in, that it should be possible to disable DEPT and
> only use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
> because DEPT is S - L - O - W.
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
>
> - Ted
>
> P.S. Darrick and I both have disabled using LOCKDEP by default
> because it slows down ext4 -g auto testing by a factor 2, and xfs -g
> auto testing by a factor of 3. So the fact that DEPT is a factor of
> 2x to 10x or more slower than LOCKDEP when running various xfstests
> tests should be a real concern.
DEPT is tracking way more objects than Lockdep so it's inevitable to be
slower, but let me try to make it have the similar performance to
Lockdep.
Byungchul
On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> Yes, right. DEPT has never been optimized. It rather turns on
> CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets on
> because of porting issues. I have no choice but to rely on those to
> develop DEPT out of tree. Of course, that's what I don't like.
Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
overhead can explain only a tiny fraction of the slowdown. Consider:
if time to first test (time to boot the kernel, setup the test
environment, figure out which tests to run, etc.) is 12 seconds w/o
LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING and 602 seconds with
DEPT, you can really only blame 37 seconds out of the 602 seconds of
DEPT on unnecessary PROVE_LOCKING overhead.
So let's assume we can get rid of all of the PROVE_LOCKING overhead.
We're still talking about 12 seconds for time-to-first test without
any lock debugging, versus ** 565 ** seconds for time-to-first test
with DEPT. That's a factor of 47x for DEPT sans LOCKDEP overhead,
compared to a 4x overhead for PROVE_LOCKING.
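(Spelling out the arithmetic: 49 - 12 = 37 seconds of PROVE_LOCKING
overhead; 602 - 37 = 565 seconds left for DEPT itself; 565 / 12 ~= 47x,
versus 49 / 12 ~= 4x.)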
> Plus, for now, I'm focusing on removing false positives. Once it's
> considered settled down, I will work on performance optimization. But
> it should still keep relying on Lockdep CONFIGs and adding additional
> overhead on it until DEPT can be developed in the tree.
Well, please take a look at the false positive which I reported. I
suspect that in order to fix that particular false positive, we'll
either need to have a way to disable DEPT on waiting on all page/folio
dirty bits, or it will need to treat pages from different inodes
and/or address spaces as being entirely separate classes, instead of
collapsing all inode dirty bits, and all of various inode's mutexes
(such as ext4's i_data_sem) as being part of a single object class.
> DEPT is tracking way more objects than Lockdep so it's inevitable to be
> slower, but let me try to make it have the similar performance to
> Lockdep.
In order to eliminate some of these false positives, I suspect it's
going to increase the number of object classes that DEPT will need to
track even *more*. At which point, the cost/benefit of DEPT may get
called into question, especially if all of the false positives can't
be suppressed.
- Ted
Ted wrote:
> On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > Yes, right. DEPT has never been optimized. It rather turns on
> > CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets on
> > because of porting issue. I have no choice but to rely on those to
> > develop DEPT out of tree. Of course, that's what I don't like.
>
> Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
> overhead can explain only a tiny fraction of the slowdown. Consider:
> if time to first test (time to boot the kernel, setup the test
> environment, figure out which tests to run, etc.) is 12 seconds w/o
> LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING and 602 seconds with
> DEPT, you can really only blame 37 seconds out of the 602 seconds of
> DEPT on unnecessary PROVE_LOCKING overhead.
>
> So let's assume we can get rid of all of the PROVE_LOCKING overhead.
> We're still talking about 12 seconds for time-to-first test without
> any lock debugging, versus ** 565 ** seconds for time-to-first test
> with DEPT. That's a factor of 47x for DEPT sans LOCKDEP overhead,
> compared to a 4x overhead for PROVE_LOCKING.
Okay. I will work on it.
> > Plus, for now, I'm focusing on removing false positives. Once it's
> > considered settled down, I will work on performance optimization. But
> > it should still keep relying on Lockdep CONFIGs and adding additional
> > overhead on it until DEPT can be developed in the tree.
>
> Well, please take a look at the false positive which I reported. I
> suspect that in order to fix that particular false positive, we'll
> either need to have a way to disable DEPT on waiting on all page/folio
> dirty bits, or it will need to treat pages from different inodes
> and/or address spaces as being entirely separate classes, instead of
> collapsing all inode dirty bits, and all of various inode's mutexes
> (such as ext4's i_data_sem) as being part of a single object class.
I'd rather solve it by assigning different classes to different types of
inode. This is the right way.
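To illustrate the idea, here is a minimal sketch assuming a Lockdep-style
class-key API; the key and helper names are hypothetical, not existing
filesystem code:

#include <linux/fs.h>
#include <linux/lockdep.h>

/*
 * Minimal sketch only: re-key a per-inode lock under a
 * filesystem-specific class so that waits on different kinds of
 * inodes are not collapsed into a single class.  The key and helper
 * names are hypothetical.
 */
static struct lock_class_key myfs_i_rwsem_key;

static void myfs_set_inode_lock_class(struct inode *inode)
{
	/* Give this filesystem's inodes their own lock class. */
	lockdep_set_class(&inode->i_rwsem, &myfs_i_rwsem_key);
}

With per-type keys like this, a wait on one filesystem's inode lock would
no longer share a class with every other filesystem's.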
> > DEPT tracks way more objects than Lockdep, so it's bound to be slower,
> > but let me try to bring its performance close to Lockdep's.
>
> In order to eliminate some of these false positives, I suspect it's
> going to increase the number of object classes that DEPT will need to
> track even *more*. At which point, the cost/benefit of DEPT may get
> called into question, especially if all of the false positives can't
> be suppressed.
Look, let's talk in general terms. There's no way to get rid of false
positives entirely. It's a balancing decision between considering
potential cases and only real ones. By definition, potential is not
real; the more potential cases we consider, the higher the chance that
false positives appear.

But yes, the benefit of detecting potential deadlocks should outweigh
the risk of being bothered by false ones. Do you think a tool is
useless if it produces a few false positives? Of course it would be a
problem if there were too many, but otherwise I think it's a great tool
as long as the benefit outweighs the risk.

Don't get me wrong. That doesn't mean DEPT is perfect as it is. Its
performance should be improved and the false alarms that appear should
be removed, of course. I'm talking about the direction.

For now, there's no tool in the Linux kernel that tracks wait/event
itself - only a subset of that functionality exists. DEPT is the first
attempt at that, and it can become a useful tool if it goes in the
right direction.

I know what you are concerned about. I bet it's the false positives
that would bother you once it's merged. I will keep insisting that DEPT
shouldn't be used as a mandatory testing tool until it's considered
stable enough. But what about those who would benefit from using DEPT?
Why not think of the folks who would benefit from hints about
synchronization dependencies, especially when their subsystem requires
very complicated synchronization? Should a tool be useful only in the
final testing stage? What about its usefulness during the development
stage?

It's worth noting that DEPT works with any wait/event, so any lockup -
e.g. even one caused by a HW-SW interface or retry logic - can be
detected by DEPT once all waits and events are tagged properly. I
believe that benefit far outweighs the downside of facing false alarms.
That's just my opinion, though; I'm going to respect the majority
opinion.
Byungchul
>
> - Ted
>
On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> > On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > > Linus wrote:
> > > >
> > > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <[email protected]> wrote:
> > > > >
> > > > > Hi Linus and folks,
> > > > >
> > > > > I've been developing a tool for detecting deadlock possibilities by
> > > > > tracking wait/event rather than lock(?) acquisition order to try to
> > > > > cover all synchonization machanisms.
> > > >
> > > > So what is the actual status of reports these days?
> > > >
> > > > Last time I looked at some reports, it gave a lot of false positives
> > > > due to mis-understanding prepare_to_sleep().
> > >
> > > Yes, it was. I handled the case in the following way:
> > >
> > > 1. Stage the wait at prepare_to_sleep(), which might be used at commit.
> > > Which has yet to be an actual wait that Dept considers.
> > > 2. If the condition for sleep is true, the wait will be committed at
> > > __schedule(). The wait becomes an actual one that Dept considers.
> > > 3. If the condition is false and the task gets back to TASK_RUNNING,
> > > clean(=reset) the staged wait.
> > >
> > > That way, Dept only works with what actually hits to __schedule() for
> > > the waits through sleep.
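For illustration, here is a schematic sketch of that stage/commit/reset
flow around the usual wait-queue pattern; the dept_*() hooks shown as
comments are placeholders for where staging, committing and resetting
would happen, not DEPT's actual interface:

#include <linux/wait.h>
#include <linux/sched.h>

/* Schematic sketch only; the dept_*() names are illustrative. */
static void sleep_on_cond(struct wait_queue_head *wq, bool (*cond)(void))
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_event(wq, &wait, TASK_UNINTERRUPTIBLE);
		/* 1. Stage the wait here: not yet a wait DEPT considers. */
		/* dept_stage_wait(); */

		if (cond()) {
			/* 3. No sleep needed after all: the staged wait
			 *    gets reset instead of being recorded. */
			/* dept_clean_stage(); */
			break;
		}

		/* 2. Commit: reaching __schedule() turns the staged wait
		 *    into a real wait that DEPT tracks. */
		schedule();
	}
	finish_wait(wq, &wait);
}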
> > >
> > > > For this all to make sense, it would need to not have false positives
> > > > (or at least a very small number of them together with a way to sanely
> > >
> > > Yes. I agree with you. I got rid of them that way I described above.
> > >
> >
> > IMHO DEPT should not report what lockdep allows (Not talking about
>
> No.
>
> > wait events). I mean lockdep allows some kind of nested locks but
> > DEPT reports them.
>
> You have already asked exactly same question in another thread of
> LKML. That time I answered to it but let me explain it again.
>
> ---
>
> CASE 1.
>
> lock L with depth n
> lock_nested L' with depth n + 1
> ...
> unlock L'
> unlock L
>
> This case is allowed by Lockdep.
> This case is allowed by DEPT cuz it's not a deadlock.
>
> CASE 2.
>
> lock L with depth n
> lock A
> lock_nested L' with depth n + 1
> ...
> unlock L'
> unlock A
> unlock L
>
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
>
Yeah, we discussed this in a previous thread [1], and the case was:

  scan_mutex -> object_lock -> kmemleak_lock -> object_lock

and DEPT reported:

  object_lock -> kmemleak_lock, kmemleak_lock -> object_lock

as a deadlock.

But IIUC, what DEPT reported happens only under scan_mutex, and it is
not as simple as just not taking those locks, because an object can be
removed from the list and freed while scanning via kmemleak_free()
without kmemleak_lock and object_lock held.

I'm just still not sure that someone will fix the warning in the
future - even if the locking rule is not good - if it will not cause a
real deadlock.
> ---
>
> The following scenario would explain why CASE 2 is problematic.
>
> THREAD X                            THREAD Y
>
> lock L with depth n
>                                     lock L' with depth n
> lock A
>                                     lock A
> lock_nested L' with depth n + 1
>                                     lock_nested L'' with depth n + 1
> ...                                 ...
> unlock L'                           unlock L''
> unlock A                            unlock A
> unlock L                            unlock L'
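Written out with kernel mutex primitives, the scenario looks roughly
like the sketch below; the names are made up and only the ordering
matters. X can end up holding A while waiting for L', while Y holds L'
and waits for A - an ABBA deadlock that the *_nested() annotation hides
from Lockdep.

#include <linux/mutex.h>

/* Illustrative only: hypothetical names, real locking primitives. */
static DEFINE_MUTEX(lock_a);			/* the shared lock A */

static void thread_x(struct mutex *l, struct mutex *l_prime)
{
	mutex_lock(l);				/* L, depth n       */
	mutex_lock(&lock_a);			/* A                */
	mutex_lock_nested(l_prime, 1);		/* L', depth n + 1  */
	mutex_unlock(l_prime);
	mutex_unlock(&lock_a);
	mutex_unlock(l);
}

static void thread_y(struct mutex *l_prime, struct mutex *l_dprime)
{
	mutex_lock(l_prime);			/* L', depth n      */
	mutex_lock(&lock_a);			/* A                */
	mutex_lock_nested(l_dprime, 1);		/* L'', depth n + 1 */
	mutex_unlock(l_dprime);
	mutex_unlock(&lock_a);
	mutex_unlock(l_prime);
}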
>
> Yes. I need to check if the report you shared with me is a true one, but
> it's not because DEPT doesn't work with *_nested() APIs.
>
Sorry, it was not right to just say that DEPT doesn't work with *_nested() APIs.
> Byungchul
[1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
--
Thanks,
Hyeonggon
On Tue, 10 May 2022 08:38:38 +0900
Byungchul Park <[email protected]> wrote:
> Yes, I was talking about A and L'.
>
> > detect that regardless of L. A nested lock associates the nesting with
>
> When I checked Lockdep code, L' with depth n + 1 and L' with depth n
> have different classes in Lockdep.
If that's the case, then that's a bug in lockdep.
>
> That's why I said Lockdep cannot detect it. By any chance, has it
> changed so as to consider this case? Or am I missing something?
No, it's not that lockdep cannot detect it, it should detect it. If it is
not detecting it, then we need to fix that.
-- Steve
On Tue, May 10, 2022 at 10:12:54AM -0400, Steven Rostedt wrote:
> On Tue, 10 May 2022 08:38:38 +0900
> Byungchul Park <[email protected]> wrote:
>
> > Yes, I was talking about A and L'.
> >
> > > detect that regardless of L. A nested lock associates the nesting with
> >
> > When I checked Lockdep code, L' with depth n + 1 and L' with depth n
> > have different classes in Lockdep.
>
> If that's the case, then that's a bug in lockdep.
Yes, agreed. I should've said 'Lockdep doesn't detect it currently'
rather than 'Lockdep can't detect it'.

I also think we can make it work for this case by fixing the bug in
Lockdep.
> >
> > That's why I said Lockdep cannot detect it. By any chance, has it
> > changed so as to consider this case? Or am I missing something?
>
> No, it's not that lockdep cannot detect it, it should detect it. If it is
> not detecting it, then we need to fix that.
Yes.
Byungchul
>
> -- Steve
On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> > > On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > > > Linus wrote:
> > > > >
> > > > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <[email protected]> wrote:
> > > > > >
> > > > > > Hi Linus and folks,
> > > > > >
> > > > > > I've been developing a tool for detecting deadlock possibilities by
> > > > > > tracking wait/event rather than lock(?) acquisition order to try to
> > > > > > cover all synchonization machanisms.
> > > > >
> > > > > So what is the actual status of reports these days?
> > > > >
> > > > > Last time I looked at some reports, it gave a lot of false positives
> > > > > due to mis-understanding prepare_to_sleep().
> > > >
> > > > Yes, it was. I handled the case in the following way:
> > > >
> > > > 1. Stage the wait at prepare_to_sleep(), which might be used at commit.
> > > > Which has yet to be an actual wait that Dept considers.
> > > > 2. If the condition for sleep is true, the wait will be committed at
> > > > __schedule(). The wait becomes an actual one that Dept considers.
> > > > 3. If the condition is false and the task gets back to TASK_RUNNING,
> > > > clean(=reset) the staged wait.
> > > >
> > > > That way, Dept only works with what actually hits to __schedule() for
> > > > the waits through sleep.
> > > >
> > > > > For this all to make sense, it would need to not have false positives
> > > > > (or at least a very small number of them together with a way to sanely
> > > >
> > > > Yes. I agree with you. I got rid of them that way I described above.
> > > >
> > >
> > > IMHO DEPT should not report what lockdep allows (Not talking about
> >
> > No.
> >
> > > wait events). I mean lockdep allows some kind of nested locks but
> > > DEPT reports them.
> >
> > You have already asked exactly same question in another thread of
> > LKML. That time I answered to it but let me explain it again.
> >
> > ---
> >
> > CASE 1.
> >
> > lock L with depth n
> > lock_nested L' with depth n + 1
> > ...
> > unlock L'
> > unlock L
> >
> > This case is allowed by Lockdep.
> > This case is allowed by DEPT cuz it's not a deadlock.
> >
> > CASE 2.
> >
> > lock L with depth n
> > lock A
> > lock_nested L' with depth n + 1
> > ...
> > unlock L'
> > unlock A
> > unlock L
> >
> > This case is allowed by Lockdep.
> > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> >
>
> Yeah, in previous threads we discussed this [1]
>
> And the case was:
> scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> And dept reported:
> object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> deadlock.
>
> But IIUC - What DEPT reported happens only under scan_mutex and
> It is not simple just not to take them because the object can be removed from the
> list and freed while scanning via kmemleak_free() without kmemleak_lock and object_lock.
That should follow one of these orders (the first is sketched below):

1. kmemleak_lock -> object_lock -> object_lock(nested)
2. object_lock -> object_lock(nested) -> kmemleak_lock
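A hypothetical sketch of the first ordering, with made-up names rather
than kmemleak's actual code:

#include <linux/spinlock.h>
#include <linux/lockdep.h>

/* Hypothetical object type; only the per-object lock matters here. */
struct my_object {
	spinlock_t lock;
	/* ... object payload ... */
};

static DEFINE_SPINLOCK(global_lock);	/* stands in for kmemleak_lock */

static void scan_pair(struct my_object *a, struct my_object *b)
{
	unsigned long flags;

	spin_lock_irqsave(&global_lock, flags);			/* global first */
	spin_lock(&a->lock);					/* then object  */
	spin_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);	/* then nested  */

	/* ... compare/scan a against b ... */

	spin_unlock(&b->lock);
	spin_unlock(&a->lock);
	spin_unlock_irqrestore(&global_lock, flags);
}

The point is only that the per-object locks are always taken inside the
global lock, so a reversed pair can never be observed.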
> Just I'm still not sure that someone will fix the warning in the future - even if the
> locking rule is not good - if it will not cause a real deadlock.
There are more important things than making the code just work for
now - for example, maintainability, and communication through the code
between current developers and potential newcomers in the future.

At the very least, a comment describing why the reversed order in the
code is safe should be added. I wouldn't allow the current order in the
code if I were the maintainer.
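For instance, such a comment might look like the following; the wording
is only a suggestion:

/*
 * Lock ordering note (illustrative wording only):
 *
 * object->lock is taken before kmemleak_lock here, which is the
 * reverse of the order used on the allocation path.  This is safe
 * only because both paths run under scan_mutex, which serialises
 * them.  Do not copy this pattern into code that can run without
 * scan_mutex held.
 */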
Byungchul
> > ---
> >
> > The following scenario would explain why CASE 2 is problematic.
> >
> > THREAD X                            THREAD Y
> >
> > lock L with depth n
> >                                     lock L' with depth n
> > lock A
> >                                     lock A
> > lock_nested L' with depth n + 1
> >                                     lock_nested L'' with depth n + 1
> > ...                                 ...
> > unlock L'                           unlock L''
> > unlock A                            unlock A
> > unlock L                            unlock L'
> >
> > Yes. I need to check if the report you shared with me is a true one, but
> > it's not because DEPT doesn't work with *_nested() APIs.
> >
>
> Sorry, It was not right just to say DEPT doesn't work with _nested() APIs.
>
> > Byungchul
>
> [1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
>
> --
> Thanks,
> Hyeonggon
On Tue, May 10, 2022 at 02:37:40PM +0900, Byungchul Park wrote:
> Ted wrote:
> > On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > > DEPT tracks way more objects than Lockdep, so it's bound to be slower,
> > > but let me try to bring its performance close to Lockdep's.
> >
> > In order to eliminate some of these false positives, I suspect it's
> > going to increase the number of object classes that DEPT will need to
> > track even *more*. At which point, the cost/benefit of DEPT may get
> > called into question, especially if all of the false positives can't
> > be suppressed.
>
> Look. Let's talk in general terms. There's no way to get rid of the
> false positives all the way. It's a decision issue for *balancing*
> between considering potential cases and only real ones. Definitely,
> potential is not real. The more potential things we consider, the higher
> the chances are, that false positives appear.
>
> But yes. The advantage we'd take by detecting potential ones should be
> higher than the risk of being bothered by false ones. Do you think a
> tool is useless if it produces a few false positives? Of course, it'd
> be a problem if it's too many, but otherwise, I think it'd be a great
> tool if the advantage > the risk.
>
> Don't get me wrong here. It doesn't mean DEPT is perfect for now. The
> performance should be improved and false alarms that appear should be
> removed, of course. I'm talking about the direction.
>
> For now, there's no tool to track wait/event itself in Linux kernel -
> a subset of the functionality exists tho. DEPT is the 1st try for that
> purpose and can be a useful tool by the right direction.
>
> I know what you are concerning about. I bet it's false positives that
> are going to bother you once merged. I'll insist that DEPT shouldn't be
> used as a mandatory testing tool until considered stable enough. But
> what about ones who would take the advantage use DEPT. Why don't you
> think of folks who will take the advantage from the hints about
> dependency of synchronization esp. when their subsystem requires very
> complicated synchronization? Should a tool be useful only in a final
> testing stage? What about the usefulness during development stage?
>
> It's worth noting DEPT works with any wait/event so any lockups e.g.
> even by HW-SW interface, retry logic or the like can be detected by DEPT
> once all waits and events are tagged properly. I believe the advantage
> by that is much higher than the bad side facing false alarms. It's just
> my opinion. I'm goning to respect the majority opinion.
s/take advantage/have the benefit/g
Byungchul
On Thu, May 19, 2022 at 11:11:10AM +0100, Catalin Marinas wrote:
> On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> > On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > > > CASE 1.
> > > > >
> > > > > lock L with depth n
> > > > > lock_nested L' with depth n + 1
> > > > > ...
> > > > > unlock L'
> > > > > unlock L
> > > > >
> > > > > This case is allowed by Lockdep.
> > > > > This case is allowed by DEPT cuz it's not a deadlock.
> > > > >
> > > > > CASE 2.
> > > > >
> > > > > lock L with depth n
> > > > > lock A
> > > > > lock_nested L' with depth n + 1
> > > > > ...
> > > > > unlock L'
> > > > > unlock A
> > > > > unlock L
> > > > >
> > > > > This case is allowed by Lockdep.
> > > > > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> > > >
> > > > Yeah, in previous threads we discussed this [1]
> > > >
> > > > And the case was:
> > > > scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > > And dept reported:
> > > > object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> > > > deadlock.
> > > >
> > > > But IIUC - What DEPT reported happens only under scan_mutex and it
> > > > is not simple just not to take them because the object can be
> > > > removed from the list and freed while scanning via kmemleak_free()
> > > > without kmemleak_lock and object_lock.
>
> The above kmemleak sequence shouldn't deadlock since those locks, even
> if taken in a different order, are serialised by scan_mutex. For various
> reasons, trying to reduce the latency, I ended up with some
> fine-grained, per-object locking.
I understand why you introduced the fine-grained locks. However, the
reversed order should be avoided anyway. As Steven said, Lockdep should
also have detected this case - that is, it would have been reported if
Lockdep worked correctly.

Making a tool skip the reversed order when it's already protected by
another lock is not a technical problem. But because each lock has its
own purpose, as you explained, nobody knows whether a case that takes
kmemleak_lock and object_lock without holding scan_mutex might arise
someday.

I'm wondering how other folks think this case should be handled, though.
> For object allocation (rbtree modification) and tree search, we use
> kmemleak_lock. During scanning (which can take minutes under
> scan_mutex), we want to prevent (a) long latencies and (b) freeing the
> object being scanned. We release the locks regularly for (a) and hold
> the object->lock for (b).
>
> In another thread Byungchul mentioned:
>
> | context X                          context Y
> |
> | lock mutex A                       lock mutex A
> | lock B                             lock C
> | lock C                             lock B
> | unlock C                           unlock B
> | unlock B                           unlock C
> | unlock mutex A                     unlock mutex A
> |
> | In my opinion, lock B and lock C are unnecessary if they are always
> | along with lock mutex A. Or we should keep correct lock order across all
> | the code.
>
> If these are the only two places, yes, locks B and C would be
> unnecessary. But we have those locks acquired (not nested) on the
> allocation path (kmemleak_lock) and freeing path (object->lock). We
> don't want to block those paths while scan_mutex is held.
>
> That said, we may be able to use a single kmemleak_lock for everything.
> The object freeing path may be affected slightly during scanning but the
> code does release it every MAX_SCAN_SIZE bytes. It may even get slightly
> faster as we'd hammer a single lock (I'll do some benchmarks).
>
> But from a correctness perspective, I think the DEPT tool should be
> improved a bit to detect when such out of order locking is serialised by
> an enclosing lock/mutex.
Again, I don't think this is a technical issue.
Byungchul
>
> --
> Catalin