2023-12-13 10:18:49

by Aiqun Yu (Maria)

Subject: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

tasklist_lock is a rwlock, and there are multiple scenarios in which
its read lock is held for a long time that writers then have to wait
for. In freeze_processes/thaw_processes, for example, walking and
freezing/thawing tasks can hold the read lock of tasklist_lock for
200+ms on commercial devices. Meanwhile write_lock_irq spins with
preemption and local irqs disabled until tasklist_lock can be acquired,
which badly hurts the responsiveness of the system.
Take an example:
1. cpu0 is holding the read lock of tasklist_lock to thaw_processes.
2. cpu1 is waiting for the write lock of tasklist_lock to exec a new
thread, with preemption and local irqs disabled.
3. cpu2 is waiting for the write lock of tasklist_lock to do_exit, with
preemption and local irqs disabled.
4. cpu3 is waiting for the write lock of tasklist_lock to do_exit, with
preemption and local irqs disabled.
So introduce a write lock/unlock wrapper specifically for tasklist_lock.
The current tasklist_lock writers all use write_lock_irq to take
tasklist_lock and write_unlock_irq to release it, which means none of
the writers needs to wait on tasklist_lock with irqs already disabled.
So the write lock/unlock wrapper here simply follows the current design
of using local_irq_disable and local_irq_enable directly, and does not
take writer callers that already have irqs disabled into account.
Use write_trylock in a loop and re-enable irqs between attempts so the
cpu can respond to interrupts while the lock cannot be taken.

Signed-off-by: Maria Yu <[email protected]>
---
fs/exec.c | 10 +++++-----
include/linux/sched/task.h | 29 +++++++++++++++++++++++++++++
kernel/exit.c | 16 ++++++++--------
kernel/fork.c | 6 +++---
kernel/ptrace.c | 12 ++++++------
kernel/sys.c | 8 ++++----
security/keys/keyctl.c | 4 ++--
7 files changed, 57 insertions(+), 28 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 4aa19b24f281..030eef6852eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1086,7 +1086,7 @@ static int de_thread(struct task_struct *tsk)

for (;;) {
cgroup_threadgroup_change_begin(tsk);
- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
/*
* Do this under tasklist_lock to ensure that
* exit_notify() can't miss ->group_exec_task
@@ -1095,7 +1095,7 @@ static int de_thread(struct task_struct *tsk)
if (likely(leader->exit_state))
break;
__set_current_state(TASK_KILLABLE);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
cgroup_threadgroup_change_end(tsk);
schedule();
if (__fatal_signal_pending(tsk))
@@ -1150,7 +1150,7 @@ static int de_thread(struct task_struct *tsk)
*/
if (unlikely(leader->ptrace))
__wake_up_parent(leader, leader->parent);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
cgroup_threadgroup_change_end(tsk);

release_task(leader);
@@ -1198,13 +1198,13 @@ static int unshare_sighand(struct task_struct *me)

refcount_set(&newsighand->count, 1);

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
spin_lock(&oldsighand->siglock);
memcpy(newsighand->action, oldsighand->action,
sizeof(newsighand->action));
rcu_assign_pointer(me->sighand, newsighand);
spin_unlock(&oldsighand->siglock);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();

__cleanup_sighand(oldsighand);
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a23af225c898..6f69d9a3c868 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -50,6 +50,35 @@ struct kernel_clone_args {
* a separate lock).
*/
extern rwlock_t tasklist_lock;
+
+/*
+ * tasklist_lock is a special lock: its readers can take a good amount
+ * of time to finish, and the plain write_lock_irq API disables local
+ * irqs first and then spins until the lock can be acquired, leaving
+ * the current cpu unresponsive to interrupts for the whole wait.
+ *
+ * The current tasklist_lock writers all use write_lock_irq to take
+ * tasklist_lock and write_unlock_irq to release it, which means no
+ * writer needs to wait on tasklist_lock with irqs already disabled.
+ * So the write lock/unlock wrapper here simply follows the current
+ * design of using local_irq_disable and local_irq_enable directly,
+ * and does not handle callers with irqs already disabled.
+ */
+static inline void write_lock_tasklist_lock(void)
+{
+ while (1) {
+ local_irq_disable();
+ if (write_trylock(&tasklist_lock))
+ break;
+ local_irq_enable();
+ cpu_relax();
+ }
+}
+static inline void write_unlock_tasklist_lock(void)
+{
+ write_unlock_irq(&tasklist_lock);
+}
+
extern spinlock_t mmlist_lock;

extern union thread_union init_thread_union;
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..18b00f477079 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -251,7 +251,7 @@ void release_task(struct task_struct *p)

cgroup_release(p);

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
ptrace_release_task(p);
thread_pid = get_pid(p->thread_pid);
__exit_signal(p);
@@ -275,7 +275,7 @@ void release_task(struct task_struct *p)
leader->exit_state = EXIT_DEAD;
}

- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
seccomp_filter_release(p);
proc_flush_pid(thread_pid);
put_pid(thread_pid);
@@ -598,7 +598,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
return reaper;
}

- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();

list_for_each_entry_safe(p, n, dead, ptrace_entry) {
list_del_init(&p->ptrace_entry);
@@ -606,7 +606,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
}

zap_pid_ns_processes(pid_ns);
- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();

return father;
}
@@ -730,7 +730,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
struct task_struct *p, *n;
LIST_HEAD(dead);

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
forget_original_parent(tsk, &dead);

if (group_dead)
@@ -758,7 +758,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
/* mt-exec, de_thread() is waiting for group leader */
if (unlikely(tsk->signal->notify_count < 0))
wake_up_process(tsk->signal->group_exec_task);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();

list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
list_del_init(&p->ptrace_entry);
@@ -1172,7 +1172,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
wo->wo_stat = status;

if (state == EXIT_TRACE) {
- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
/* We dropped tasklist, ptracer could die and untrace */
ptrace_unlink(p);

@@ -1181,7 +1181,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
if (do_notify_parent(p, p->exit_signal))
state = EXIT_DEAD;
p->exit_state = state;
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
}
if (state == EXIT_DEAD)
release_task(p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..06c4b4ab9102 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2623,7 +2623,7 @@ __latent_entropy struct task_struct *copy_process(
* Make it visible to the rest of the system, but dont wake it up yet.
* Need tasklist lock for parent etc handling!
*/
- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();

/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
@@ -2714,7 +2714,7 @@ __latent_entropy struct task_struct *copy_process(
hlist_del_init(&delayed.node);
spin_unlock(&current->sighand->siglock);
syscall_tracepoint_update(p);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();

if (pidfile)
fd_install(pidfd, pidfile);
@@ -2735,7 +2735,7 @@ __latent_entropy struct task_struct *copy_process(
bad_fork_cancel_cgroup:
sched_core_free(p);
spin_unlock(&current->sighand->siglock);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
cgroup_cancel_fork(p, args);
bad_fork_put_pidfd:
if (clone_flags & CLONE_PIDFD) {
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index d8b5e13a2229..a8d7e2d06f3e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -435,7 +435,7 @@ static int ptrace_attach(struct task_struct *task, long request,
if (retval)
goto unlock_creds;

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
retval = -EPERM;
if (unlikely(task->exit_state))
goto unlock_tasklist;
@@ -479,7 +479,7 @@ static int ptrace_attach(struct task_struct *task, long request,

retval = 0;
unlock_tasklist:
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
unlock_creds:
mutex_unlock(&task->signal->cred_guard_mutex);
out:
@@ -508,7 +508,7 @@ static int ptrace_traceme(void)
{
int ret = -EPERM;

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
/* Are we already being traced? */
if (!current->ptrace) {
ret = security_ptrace_traceme(current->parent);
@@ -522,7 +522,7 @@ static int ptrace_traceme(void)
ptrace_link(current, current->real_parent);
}
}
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();

return ret;
}
@@ -588,7 +588,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
/* Architecture-specific hardware disable .. */
ptrace_disable(child);

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
/*
* We rely on ptrace_freeze_traced(). It can't be killed and
* untraced by another thread, it can't be a zombie.
@@ -600,7 +600,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
*/
child->exit_code = data;
__ptrace_detach(current, child);
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();

proc_ptrace_connector(child, PTRACE_DETACH);

diff --git a/kernel/sys.c b/kernel/sys.c
index e219fcfa112d..0b1647d3ed32 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1088,7 +1088,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
/* From this point forward we keep holding onto the tasklist lock
* so that our parent does not change from under us. -DaveM
*/
- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();

err = -ESRCH;
p = find_task_by_vpid(pid);
@@ -1136,7 +1136,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
err = 0;
out:
/* All paths lead to here, thus we are safe. -DaveM */
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
rcu_read_unlock();
return err;
}
@@ -1229,7 +1229,7 @@ int ksys_setsid(void)
pid_t session = pid_vnr(sid);
int err = -EPERM;

- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();
/* Fail if I am already a session leader */
if (group_leader->signal->leader)
goto out;
@@ -1247,7 +1247,7 @@ int ksys_setsid(void)

err = session;
out:
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
if (err > 0) {
proc_sid_connector(group_leader);
sched_autogroup_create_attach(group_leader);
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 19be69fa4d05..dd8aed20486a 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1652,7 +1652,7 @@ long keyctl_session_to_parent(void)

me = current;
rcu_read_lock();
- write_lock_irq(&tasklist_lock);
+ write_lock_tasklist_lock();

ret = -EPERM;
oldwork = NULL;
@@ -1702,7 +1702,7 @@ long keyctl_session_to_parent(void)
if (!ret)
newwork = NULL;
unlock:
- write_unlock_irq(&tasklist_lock);
+ write_unlock_tasklist_lock();
rcu_read_unlock();
if (oldwork)
put_cred(container_of(oldwork, struct cred, rcu));

base-commit: 88035e5694a86a7167d490bb95e9df97a9bb162b
--
2.17.1


2023-12-13 16:22:50

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
> +static inline void write_lock_tasklist_lock(void)
> +{
> + while (1) {
> + local_irq_disable();
> + if (write_trylock(&tasklist_lock))
> + break;
> + local_irq_enable();
> + cpu_relax();

This is a bad implementation though. You don't set the _QW_WAITING flag
so readers don't know that there's a pending writer. Also, I've seen
cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
that takes a while to wake up from.

I think the right way to fix this is to pass a boolean flag to
queued_write_lock_slowpath() to let it know whether it can re-enable
interrupts while checking whether _QW_WAITING is set.
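
(Background note, to make the _QW_WAITING discussion concrete: the
qrwlock count word packs a writer-locked byte, a writer-waiting bit and
the reader count. The constants below are quoted from
include/asm-generic/qrwlock.h as found in the tree around this time;
they are shown only as reference, not as part of any proposed change.)

	#define _QW_WAITING	0x100	/* A writer is waiting     */
	#define _QW_LOCKED	0x0ff	/* A writer holds the lock */
	#define _QW_WMASK	0x1ff	/* Writer mask             */
	#define _QR_SHIFT	9	/* Reader count shift      */
	#define _QR_BIAS	(1U << _QR_SHIFT)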

2023-12-13 18:27:33

by Eric W. Biederman

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

Matthew Wilcox <[email protected]> writes:

> On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
>> +static inline void write_lock_tasklist_lock(void)
>> +{
>> + while (1) {
>> + local_irq_disable();
>> + if (write_trylock(&tasklist_lock))
>> + break;
>> + local_irq_enable();
>> + cpu_relax();
>
> This is a bad implementation though. You don't set the _QW_WAITING flag
> so readers don't know that there's a pending writer. Also, I've seen
> cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
> that takes a while to wake up from.
>
> I think the right way to fix this is to pass a boolean flag to
> queued_write_lock_slowpath() to let it know whether it can re-enable
> interrupts while checking whether _QW_WAITING is set.

Yes. It seems to make sense to distinguish between write_lock_irq and
write_lock_irqsave and fix this for all of write_lock_irq.

Either that or someone can put in the work to start making the
tasklist_lock go away.

Eric

2023-12-15 05:53:08

by Aiqun Yu (Maria)

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock



On 12/14/2023 2:27 AM, Eric W. Biederman wrote:
> Matthew Wilcox <[email protected]> writes:
>
>> On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
>>> +static inline void write_lock_tasklist_lock(void)
>>> +{
>>> + while (1) {
>>> + local_irq_disable();
>>> + if (write_trylock(&tasklist_lock))
>>> + break;
>>> + local_irq_enable();
>>> + cpu_relax();
>>
>> This is a bad implementation though. You don't set the _QW_WAITING flag
Any better ideas and suggestions are welcome. :)
>> so readers don't know that there's a pending writer. Also, I've seen
>> cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
>> that takes a while to wake up from.
>>
>> I think the right way to fix this is to pass a boolean flag to
>> queued_write_lock_slowpath() to let it know whether it can re-enable
>> interrupts while checking whether _QW_WAITING is set.
>
> Yes. It seems to make sense to distinguish between write_lock_irq and
> write_lock_irqsave and fix this for all of write_lock_irq.
>
Let me think about this.
It seems possible, because there is a special behavior on the reader
side: a reader in interrupt context will get the lock directly,
regardless of a pending writer.

> Either that or someone can put in the work to start making the
> tasklist_lock go away.
>
> Eric
>

--
Thx and BRs,
Aiqun(Maria) Yu

2023-12-26 10:47:25

by Hillf Danton

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <[email protected]>
> Matthew Wilcox <[email protected]> writes:
> > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
> >> +static inline void write_lock_tasklist_lock(void)
> >> +{
> >> + while (1) {
> >> + local_irq_disable();
> >> + if (write_trylock(&tasklist_lock))
> >> + break;
> >> + local_irq_enable();
> >> + cpu_relax();
> >
> > This is a bad implementation though. You don't set the _QW_WAITING flag
> > so readers don't know that there's a pending writer. Also, I've seen
> > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
> > that takes a while to wake up from.
> >
> > I think the right way to fix this is to pass a boolean flag to
> > queued_write_lock_slowpath() to let it know whether it can re-enable
> > interrupts while checking whether _QW_WAITING is set.

lock(&lock->wait_lock)
enable irq
int
lock(&lock->wait_lock)

You are adding chance for recursive locking.
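
(For context on the recursion scenario sketched above: the upstream
slowpath in kernel/locking/qrwlock.c holds lock->wait_lock across the
entire spin, and that is the lock which would be taken again if the
same path were re-entered from an interrupt on that CPU. Quoted below,
lightly trimmed, purely as background:)

	void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
	{
		int cnts;

		/* Put the writer into the wait queue */
		arch_spin_lock(&lock->wait_lock);

		/* Try to acquire the lock directly if no reader is present */
		if (!(cnts = atomic_read(&lock->cnts)) &&
		    atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))
			goto unlock;

		/* Set the waiting flag to notify readers that a writer is pending */
		atomic_or(_QW_WAITING, &lock->cnts);

		/* When no more readers or writers, set the locked flag */
		do {
			cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
		} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
	unlock:
		arch_spin_unlock(&lock->wait_lock);
	}
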
>
> Yes. It seems to make sense to distinguish between write_lock_irq and
> write_lock_irqsave and fix this for all of write_lock_irq.
>
> Either that or someone can put in the work to start making the
> tasklist_lock go away.
>
> Eric

2023-12-26 20:50:17

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Tue, Dec 26, 2023 at 06:46:52PM +0800, Hillf Danton wrote:
> On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <[email protected]>
> > Matthew Wilcox <[email protected]> writes:
> > > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
> > >> +static inline void write_lock_tasklist_lock(void)
> > >> +{
> > >> + while (1) {
> > >> + local_irq_disable();
> > >> + if (write_trylock(&tasklist_lock))
> > >> + break;
> > >> + local_irq_enable();
> > >> + cpu_relax();
> > >
> > > This is a bad implementation though. You don't set the _QW_WAITING flag
> > > so readers don't know that there's a pending writer. Also, I've seen
> > > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
> > > that takes a while to wake up from.
> > >
> > > I think the right way to fix this is to pass a boolean flag to
> > > queued_write_lock_slowpath() to let it know whether it can re-enable
> > > interrupts while checking whether _QW_WAITING is set.
>
> lock(&lock->wait_lock)
> enable irq
> int
> lock(&lock->wait_lock)
>
> You are adding chance for recursive locking.

Did you bother to read queued_read_lock_slowpath() before writing this
email?

if (unlikely(in_interrupt())) {
/*
* Readers in interrupt context will get the lock immediately
* if the writer is just waiting (not holding the lock yet),
* so spin with ACQUIRE semantics until the lock is available
* without waiting in the queue.
*/
atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
return;


2023-12-27 01:41:55

by Aiqun Yu (Maria)

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock



On 12/26/2023 6:46 PM, Hillf Danton wrote:
> On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <[email protected]>
>> Matthew Wilcox <[email protected]> writes:
>>> On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
>>>> +static inline void write_lock_tasklist_lock(void)
>>>> +{
>>>> + while (1) {
>>>> + local_irq_disable();
>>>> + if (write_trylock(&tasklist_lock))
>>>> + break;
>>>> + local_irq_enable();
>>>> + cpu_relax();
>>>
>>> This is a bad implementation though. You don't set the _QW_WAITING flag
>>> so readers don't know that there's a pending writer. Also, I've seen
>>> cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
>>> that takes a while to wake up from.
>>>
>>> I think the right way to fix this is to pass a boolean flag to
>>> queued_write_lock_slowpath() to let it know whether it can re-enable
>>> interrupts while checking whether _QW_WAITING is set.
>
> lock(&lock->wait_lock)
> enable irq
> int
> lock(&lock->wait_lock)
>
> You are adding chance for recursive locking.

Thx for the comments discussing the deadlock possibility. I think the
deadlock concern can be split into below 2 scenarios:
1. queued_write_lock_slowpath being triggered in interrupt context.
tasklist_lock has no write_lock_irq(save) callers in interrupt context,
but for a common rw lock, write_lock_irq(save) usage in interrupt
context is possible, so this may introduce a state where lock->wait_lock
is released with the _QW_WAITING flag left set.
Welcome others to suggest on designs and comments.

2. queued_read_lock_slowpath can be triggered in interrupt context, and
it already handles that case to avoid a possible deadlock.
In queued_read_lock_slowpath there is a check whether the current
context is in interrupt, and if so the reader gets the lock directly
when there is only a write lock waiter (not a holder).

Pls reference[1]:
/*
* Readers come here when they cannot get the lock without waiting
*/
if (unlikely(in_interrupt())) {
/*
* Readers in interrupt context will get the lock immediately
* if the writer is just waiting (not holding the lock yet),
* so spin with ACQUIRE semantics until the lock is available
* without waiting in the queue.
*/
atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
return;
}

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/qrwlock.c
>>
>> Yes. It seems to make sense to distinguish between write_lock_irq and
>> write_lock_irqsave and fix this for all of write_lock_irq.
>>
>> Either that or someone can put in the work to start making the
>> tasklist_lock go away.
>>
>> Eric

--
Thx and BRs,
Aiqun(Maria) Yu

2023-12-27 10:15:23

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Wed, Dec 27, 2023 at 09:41:29AM +0800, Aiqun Yu (Maria) wrote:
> On 12/26/2023 6:46 PM, Hillf Danton wrote:
> > On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <[email protected]>
> > > Matthew Wilcox <[email protected]> writes:
> > > > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
> > > > > +static inline void write_lock_tasklist_lock(void)
> > > > > +{
> > > > > + while (1) {
> > > > > + local_irq_disable();
> > > > > + if (write_trylock(&tasklist_lock))
> > > > > + break;
> > > > > + local_irq_enable();
> > > > > + cpu_relax();
> > > >
> > > > This is a bad implementation though. You don't set the _QW_WAITING flag
> > > > so readers don't know that there's a pending writer. Also, I've seen
> > > > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
> > > > that takes a while to wake up from.
> > > >
> > > > I think the right way to fix this is to pass a boolean flag to
> > > > queued_write_lock_slowpath() to let it know whether it can re-enable
> > > > interrupts while checking whether _QW_WAITING is set.
> >
> > lock(&lock->wait_lock)
> > enable irq
> > int
> > lock(&lock->wait_lock)
> >
> > You are adding chance for recursive locking.
>
> Thx for the comments discussing the deadlock possibility. I think the
> deadlock concern can be split into below 2 scenarios:
> 1. queued_write_lock_slowpath being triggered in interrupt context.
> tasklist_lock has no write_lock_irq(save) callers in interrupt context,
> but for a common rw lock, write_lock_irq(save) usage in interrupt
> context is possible, so this may introduce a state where lock->wait_lock
> is released with the _QW_WAITING flag left set.
> Welcome others to suggest on designs and comments.

Hm? I am confused. You're talking about the scenario where:

- CPU B holds the lock for read
- CPU A attempts to get the lock for write in user context, fails, sets
the _QW_WAITING flag
- CPU A re-enables interrupts
- CPU A executes an interrupt handler which calls queued_write_lock()
- If CPU B has dropped the read lock in the meantime,
atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED) succeeds
- CPU A calls queued_write_unlock() which stores 0 to the lock and we
_lose_ the _QW_WAITING flag for the userspace waiter.

How do we end up with CPU A leaving the _QW_WAITING flag set?
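
(For reference while following the steps above: the queued_write_lock()
fast path referred to is, roughly, the following from
include/asm-generic/qrwlock.h; the cmpxchg only succeeds when the whole
count word is 0, i.e. no readers, no writer and no _QW_WAITING set,
otherwise the slowpath is taken. Quoted as background only.)

	static inline void queued_write_lock(struct qrwlock *lock)
	{
		int cnts = 0;
		/* Optimize for the unfair lock case where the fair flag is 0. */
		if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)))
			return;

		queued_write_lock_slowpath(lock);
	}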


2023-12-27 11:10:25

by Hillf Danton

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Tue, 26 Dec 2023 20:49:38 +0000 Matthew Wilcox <[email protected]>
> On Tue, Dec 26, 2023 at 06:46:52PM +0800, Hillf Danton wrote:
> > On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <[email protected]>
> > > Matthew Wilcox <[email protected]> writes:
> > > > I think the right way to fix this is to pass a boolean flag to
> > > > queued_write_lock_slowpath() to let it know whether it can re-enable
> > > > interrupts while checking whether _QW_WAITING is set.
> >
> > lock(&lock->wait_lock)
> > enable irq
> > int
> > lock(&lock->wait_lock)
> >
> > You are adding chance for recursive locking.
>
> Did you bother to read queued_read_lock_slowpath() before writing this email?

Nope, but it doesn't matter in this case.
>
> if (unlikely(in_interrupt())) {
> /*
> * Readers in interrupt context will get the lock immediately
> * if the writer is just waiting (not holding the lock yet),
> * so spin with ACQUIRE semantics until the lock is available
> * without waiting in the queue.
> */
> atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> return;

This is the lock acquirer for read in irq context, and it rolls out the red
carpet for write acquirer in irq, right Willy?

Feel free to ignore the following leg works.

/* Set the waiting flag to notify readers that a writer is pending */
atomic_or(_QW_WAITING, &lock->cnts);

enable irq;

/* When no more readers or writers, set the locked flag */
do {
cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));

int
atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
deadlock
disable irq;

Though the case below is safe, it looks not pretty but clumsy.

/* Set the waiting flag to notify readers that a writer is pending */
atomic_or(_QW_WAITING, &lock->cnts);

/* When no more readers or writers, set the locked flag */
do {
enable irq;

cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);

disable irq;

} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));


2023-12-28 09:07:53

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Wed, Dec 27, 2023 at 07:07:27PM +0800, Hillf Danton wrote:
> Feel free to ignore the following leg works.
>
> /* Set the waiting flag to notify readers that a writer is pending */
> atomic_or(_QW_WAITING, &lock->cnts);
>
> enable irq;
>
> /* When no more readers or writers, set the locked flag */
> do {
> cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
> } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
>
> int
> atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> deadlock
> disable irq;

That would be a buggy implementation, and would not be what I was
thinking.

> Though the case below is safe, it looks not pretty but clumsy.
>
> /* Set the waiting flag to notify readers that a writer is pending */
> atomic_or(_QW_WAITING, &lock->cnts);
>
> /* When no more readers or writers, set the locked flag */
> do {
> enable irq;
>
> cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
>
> disable irq;
>
> } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));

Why do you think it looks clumsy? It's more or less what I was
thinking.

-void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
+void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
...
do {
+ if (irq)
+ local_irq_enable();
cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+ if (irq)
+ local_irq_disable();
} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));


2023-12-28 22:40:21

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote:
> Matthew Wilcox <[email protected]> writes:
> > I think the right way to fix this is to pass a boolean flag to
> > queued_write_lock_slowpath() to let it know whether it can re-enable
> > interrupts while checking whether _QW_WAITING is set.
>
> Yes. It seems to make sense to distinguish between write_lock_irq and
> write_lock_irqsave and fix this for all of write_lock_irq.

I wasn't planning on doing anything here, but Hillf kind of pushed me into
it. I think it needs to be something like this. Compile tested only.
If it ends up getting used,

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 75b8f4601b28..1152e080c719 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -33,8 +33,8 @@
/*
* External function declarations
*/
-extern void queued_read_lock_slowpath(struct qrwlock *lock);
-extern void queued_write_lock_slowpath(struct qrwlock *lock);
+void queued_read_lock_slowpath(struct qrwlock *lock);
+void queued_write_lock_slowpath(struct qrwlock *lock, bool irq);

/**
* queued_read_trylock - try to acquire read lock of a queued rwlock
@@ -98,7 +98,21 @@ static inline void queued_write_lock(struct qrwlock *lock)
if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)))
return;

- queued_write_lock_slowpath(lock);
+ queued_write_lock_slowpath(lock, false);
+}
+
+/**
+ * queued_write_lock_irq - acquire write lock of a queued rwlock
+ * @lock : Pointer to queued rwlock structure
+ */
+static inline void queued_write_lock_irq(struct qrwlock *lock)
+{
+ int cnts = 0;
+ /* Optimize for the unfair lock case where the fair flag is 0. */
+ if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)))
+ return;
+
+ queued_write_lock_slowpath(lock, true);
}

/**
@@ -138,6 +152,7 @@ static inline int queued_rwlock_is_contended(struct qrwlock *lock)
*/
#define arch_read_lock(l) queued_read_lock(l)
#define arch_write_lock(l) queued_write_lock(l)
+#define arch_write_lock_irq(l) queued_write_lock_irq(l)
#define arch_read_trylock(l) queued_read_trylock(l)
#define arch_write_trylock(l) queued_write_trylock(l)
#define arch_read_unlock(l) queued_read_unlock(l)
diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index c0ef596f340b..897010b6ba0a 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -33,6 +33,7 @@ do { \
extern int do_raw_read_trylock(rwlock_t *lock);
extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock);
extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock);
+ extern void do_raw_write_lock_irq(rwlock_t *lock) __acquires(lock);
extern int do_raw_write_trylock(rwlock_t *lock);
extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock);
#else
@@ -40,6 +41,7 @@ do { \
# define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock)
# define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
# define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0)
+# define do_raw_write_lock_irq(rwlock) do {__acquire(lock); arch_write_lock_irq(&(rwlock)->raw_lock); } while (0)
# define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock)
# define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
#endif
diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
index dceb0a59b692..6257976dfb72 100644
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -193,7 +193,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock)
local_irq_disable();
preempt_disable();
rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
- LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
+ LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock_irq);
}

static inline void __raw_write_lock_bh(rwlock_t *lock)
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index d2ef312a8611..6c644a71b01d 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -61,9 +61,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);

/**
* queued_write_lock_slowpath - acquire write lock of a queued rwlock
- * @lock : Pointer to queued rwlock structure
+ * @lock: Pointer to queued rwlock structure
+ * @irq: True if we can enable interrupts while spinning
*/
-void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
+void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
{
int cnts;

@@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)

/* When no more readers or writers, set the locked flag */
do {
+ if (irq)
+ local_irq_enable();
cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+ if (irq)
+ local_irq_disable();
} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
unlock:
arch_spin_unlock(&lock->wait_lock);
diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
index 87b03d2e41db..bf94551d7435 100644
--- a/kernel/locking/spinlock_debug.c
+++ b/kernel/locking/spinlock_debug.c
@@ -212,6 +212,13 @@ void do_raw_write_lock(rwlock_t *lock)
debug_write_lock_after(lock);
}

+void do_raw_write_lock_irq(rwlock_t *lock)
+{
+ debug_write_lock_before(lock);
+ arch_write_lock_irq(&lock->raw_lock);
+ debug_write_lock_after(lock);
+}
+
int do_raw_write_trylock(rwlock_t *lock)
{
int ret = arch_write_trylock(&lock->raw_lock);

2023-12-29 11:37:23

by kernel test robot

Subject: Re: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

Hi Matthew,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on arnd-asm-generic/master brauner-vfs/vfs.all vfs-idmapping/for-next linus/master v6.7-rc7 next-20231222]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox/Re-PATCH-kernel-Introduce-a-write-lock-unlock-wrapper-for-tasklist_lock/20231229-062352
base: tip/locking/core
patch link: https://lore.kernel.org/r/ZY30k7OCtxrdR9oP%40casper.infradead.org
patch subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock
config: i386-randconfig-011-20231229 (https://download.01.org/0day-ci/archive/20231229/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231229/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

kernel/locking/spinlock_debug.c: In function 'do_raw_write_lock_irq':
>> kernel/locking/spinlock_debug.c:217:9: error: implicit declaration of function 'arch_write_lock_irq'; did you mean '_raw_write_lock_irq'? [-Werror=implicit-function-declaration]
217 | arch_write_lock_irq(&lock->raw_lock);
| ^~~~~~~~~~~~~~~~~~~
| _raw_write_lock_irq
cc1: some warnings being treated as errors


vim +217 kernel/locking/spinlock_debug.c

213
214 void do_raw_write_lock_irq(rwlock_t *lock)
215 {
216 debug_write_lock_before(lock);
> 217 arch_write_lock_irq(&lock->raw_lock);
218 debug_write_lock_after(lock);
219 }
220

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-02 02:21:05

by Aiqun Yu (Maria)

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock



On 12/29/2023 6:20 AM, Matthew Wilcox wrote:
> On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote:
>> Matthew Wilcox <[email protected]> writes:
>>> I think the right way to fix this is to pass a boolean flag to
>>> queued_write_lock_slowpath() to let it know whether it can re-enable
>>> interrupts while checking whether _QW_WAITING is set.
>>
>> Yes. It seems to make sense to distinguish between write_lock_irq and
>> write_lock_irqsave and fix this for all of write_lock_irq.
>
> I wasn't planning on doing anything here, but Hillf kind of pushed me into
> it. I think it needs to be something like this. Compile tested only.
> If it ends up getting used,
Happy new year!
Thx Matthew for chiming in on this. I think more thoughts will lead to
a better design.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 75b8f4601b28..1152e080c719 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -33,8 +33,8 @@
> /*
> * External function declarations
> */
> -extern void queued_read_lock_slowpath(struct qrwlock *lock);
> -extern void queued_write_lock_slowpath(struct qrwlock *lock);
> +void queued_read_lock_slowpath(struct qrwlock *lock);
> +void queued_write_lock_slowpath(struct qrwlock *lock, bool irq);
>
> /**
> * queued_read_trylock - try to acquire read lock of a queued rwlock
> @@ -98,7 +98,21 @@ static inline void queued_write_lock(struct qrwlock *lock)
> if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)))
> return;
>
> - queued_write_lock_slowpath(lock);
> + queued_write_lock_slowpath(lock, false);
> +}
> +
> +/**
> + * queued_write_lock_irq - acquire write lock of a queued rwlock
> + * @lock : Pointer to queued rwlock structure
> + */
> +static inline void queued_write_lock_irq(struct qrwlock *lock)
> +{
> + int cnts = 0;
> + /* Optimize for the unfair lock case where the fair flag is 0. */
> + if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)))
> + return;
> +
> + queued_write_lock_slowpath(lock, true);
> }
>
> /**
> @@ -138,6 +152,7 @@ static inline int queued_rwlock_is_contended(struct qrwlock *lock)
> */
> #define arch_read_lock(l) queued_read_lock(l)
> #define arch_write_lock(l) queued_write_lock(l)
> +#define arch_write_lock_irq(l) queued_write_lock_irq(l)
> #define arch_read_trylock(l) queued_read_trylock(l)
> #define arch_write_trylock(l) queued_write_trylock(l)
> #define arch_read_unlock(l) queued_read_unlock(l)
> diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
> index c0ef596f340b..897010b6ba0a 100644
> --- a/include/linux/rwlock.h
> +++ b/include/linux/rwlock.h
> @@ -33,6 +33,7 @@ do { \
> extern int do_raw_read_trylock(rwlock_t *lock);
> extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock);
> extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock);
> + extern void do_raw_write_lock_irq(rwlock_t *lock) __acquires(lock);
> extern int do_raw_write_trylock(rwlock_t *lock);
> extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock);
> #else
> @@ -40,6 +41,7 @@ do { \
> # define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock)
> # define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
> # define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0)
> +# define do_raw_write_lock_irq(rwlock) do {__acquire(lock); arch_write_lock_irq(&(rwlock)->raw_lock); } while (0)
> # define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock)
> # define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
> #endif
> diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
> index dceb0a59b692..6257976dfb72 100644
> --- a/include/linux/rwlock_api_smp.h
> +++ b/include/linux/rwlock_api_smp.h
> @@ -193,7 +193,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock)
> local_irq_disable();
> preempt_disable();
> rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> - LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
> + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock_irq);
> }
>
> static inline void __raw_write_lock_bh(rwlock_t *lock)
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index d2ef312a8611..6c644a71b01d 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -61,9 +61,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);
>
> /**
> * queued_write_lock_slowpath - acquire write lock of a queued rwlock
> - * @lock : Pointer to queued rwlock structure
> + * @lock: Pointer to queued rwlock structure
> + * @irq: True if we can enable interrupts while spinning
> */
> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
> {
> int cnts;
>
> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
>
Also a new state showed up after the current design:
1. locked flag with _QW_WAITING, while irq enabled.
2. And this state will be only in interrupt context.
3. lock->wait_lock is hold by the write waiter.
So per my understanding, a different behavior also needed to be done in
queued_write_lock_slowpath:
when (unlikely(in_interrupt())) , get the lock directly.
So needed to be done in release path. This is to address Hillf's concern
on possibility of deadlock.

Add Hillf here to merge thread. I am going to have a tested patch V2
accordingly.
Feel free to let me know your thoughts prior on that.
> /* When no more readers or writers, set the locked flag */
> do {
> + if (irq)
> + local_irq_enable();
I think write_lock_irqsave also needs to be taken into account. So
local_irq_save(flags) should be taken into account here.
> cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
> + if (irq)
> + local_irq_disable();
ditto.
> } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
> unlock:
> arch_spin_unlock(&lock->wait_lock);
> diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
> index 87b03d2e41db..bf94551d7435 100644
> --- a/kernel/locking/spinlock_debug.c
> +++ b/kernel/locking/spinlock_debug.c
> @@ -212,6 +212,13 @@ void do_raw_write_lock(rwlock_t *lock)
> debug_write_lock_after(lock);
> }
>
> +void do_raw_write_lock_irq(rwlock_t *lock)
> +{
> + debug_write_lock_before(lock);
> + arch_write_lock_irq(&lock->raw_lock);
> + debug_write_lock_after(lock);
> +}
> +
> int do_raw_write_trylock(rwlock_t *lock)
> {
> int ret = arch_write_trylock(&lock->raw_lock);

--
Thx and BRs,
Aiqun(Maria) Yu

2024-01-02 09:15:49

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Tue, Jan 02, 2024 at 10:19:47AM +0800, Aiqun Yu (Maria) wrote:
> On 12/29/2023 6:20 AM, Matthew Wilcox wrote:
> > On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote:
> > > Matthew Wilcox <[email protected]> writes:
> > > > I think the right way to fix this is to pass a boolean flag to
> > > > queued_write_lock_slowpath() to let it know whether it can re-enable
> > > > interrupts while checking whether _QW_WAITING is set.
> > >
> > > Yes. It seems to make sense to distinguish between write_lock_irq and
> > > write_lock_irqsave and fix this for all of write_lock_irq.
> >
> > I wasn't planning on doing anything here, but Hillf kind of pushed me into
> > it. I think it needs to be something like this. Compile tested only.
> > If it ends up getting used,
> Happy new year!

Thank you! I know your new year is a few weeks away still ;-)

> > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
> > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
> > {
> > int cnts;
> > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
> Also a new state showed up after the current design:
> 1. locked flag with _QW_WAITING, while irq enabled.
> 2. And this state will be only in interrupt context.
> 3. lock->wait_lock is hold by the write waiter.
> So per my understanding, a different behavior also needed to be done in
> queued_write_lock_slowpath:
> when (unlikely(in_interrupt())) , get the lock directly.

I don't think so. Remember that write_lock_irq() can only be called in
process context, and when interrupts are enabled.

> So needed to be done in release path. This is to address Hillf's concern on
> possibility of deadlock.

Hillf's concern is invalid.

> > /* When no more readers or writers, set the locked flag */
> > do {
> > + if (irq)
> > + local_irq_enable();
> I think write_lock_irqsave also needs to be taken into account. So
> local_irq_save(flags) should be taken into account here.

If we did want to support the same kind of spinning with interrupts
enabled for write_lock_irqsave(), we'd want to pass the flags in
and do local_irq_restore(), but I don't know how we'd support
write_lock_irq() if we did that -- can we rely on passing in 0 for flags
meaning "reenable" on all architectures? And ~0 meaning "don't
reenable" on all architectures?

That all seems complicated, so I didn't do that.
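
(Purely illustrative, and not part of any posted patch: one hypothetical
shape for the irqsave variant discussed above is to hand the caller's
saved flags to the slowpath so it can restore rather than
unconditionally enable. The function name and signature below are
invented for this sketch, and it keeps the same caveat raised later in
the thread, namely that irqs get re-enabled while lock->wait_lock is
still held.)

	/* Hypothetical sketch only -- not the posted patch. */
	void __lockfunc queued_write_lock_slowpath_irqsave(struct qrwlock *lock,
							   unsigned long flags)
	{
		int cnts;

		/* Writer joins the queue; note wait_lock stays held below. */
		arch_spin_lock(&lock->wait_lock);

		/* Try to acquire the lock directly if no reader is present */
		if (!(cnts = atomic_read(&lock->cnts)) &&
		    atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))
			goto unlock;

		/* Set the waiting flag to notify readers that a writer is pending */
		atomic_or(_QW_WAITING, &lock->cnts);

		/* When no more readers or writers, set the locked flag */
		do {
			/* Restore the caller's irq state instead of a blind enable */
			local_irq_restore(flags);
			cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
			local_irq_save(flags);
		} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
	unlock:
		arch_spin_unlock(&lock->wait_lock);
	}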


2024-01-03 02:59:50

by Aiqun Yu (Maria)

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock



On 1/2/2024 5:14 PM, Matthew Wilcox wrote:
> On Tue, Jan 02, 2024 at 10:19:47AM +0800, Aiqun Yu (Maria) wrote:
>> On 12/29/2023 6:20 AM, Matthew Wilcox wrote:
>>> On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote:
>>>> Matthew Wilcox <[email protected]> writes:
>>>>> I think the right way to fix this is to pass a boolean flag to
>>>>> queued_write_lock_slowpath() to let it know whether it can re-enable
>>>>> interrupts while checking whether _QW_WAITING is set.
>>>>
>>>> Yes. It seems to make sense to distinguish between write_lock_irq and
>>>> write_lock_irqsave and fix this for all of write_lock_irq.
>>>
>>> I wasn't planning on doing anything here, but Hillf kind of pushed me into
>>> it. I think it needs to be something like this. Compile tested only.
>>> If it ends up getting used,
>> Happy new year!
>
> Thank you! I know your new year is a few weeks away still ;-)
Yeah, Chinese new year will come about 5 weeks later. :)
>
>>> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
>>> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
>>> {
>>> int cnts;
>>> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
>> Also a new state showed up after the current design:
>> 1. locked flag with _QW_WAITING, while irq enabled.
>> 2. And this state will be only in interrupt context.
>> 3. lock->wait_lock is hold by the write waiter.
>> So per my understanding, a different behavior also needed to be done in
>> queued_write_lock_slowpath:
>> when (unlikely(in_interrupt())) , get the lock directly.
>
> I don't think so. Remember that write_lock_irq() can only be called in
> process context, and when interrupts are enabled.
In current kernel drivers, I can see same lock called with
write_lock_irq and write_lock_irqsave in different drivers.

And this is the scenario I am talking about:
1. cpu0 have task run and called write_lock_irq.(Not in interrupt context)
2. cpu0 hold the lock->wait_lock and re-enabled the interrupt.
* this is the new state with _QW_WAITING set, lock->wait_lock locked,
interrupt enabled. *
3. cpu0 in-interrupt context and want to do write_lock_irqsave.
4. cpu0 tried to acquire lock->wait_lock again.

I was thinking to support both write_lock_irq and write_lock_irqsave
with interrupt enabled together in queued_write_lock_slowpath.

That's why I am suggesting in write_lock_irqsave when (in_interrupt()),
instead spin for the lock->wait_lock, spin to get the lock->cnts directly.
>
>> So needed to be done in release path. This is to address Hillf's concern on
>> possibility of deadlock.
>
> Hillf's concern is invalid.
>
>>> /* When no more readers or writers, set the locked flag */
>>> do {
>>> + if (irq)
>>> + local_irq_enable();
>> I think write_lock_irqsave also needs to be taken into account. So
>> local_irq_save(flags) should be taken into account here.
>
> If we did want to support the same kind of spinning with interrupts
> enabled for write_lock_irqsave(), we'd want to pass the flags in
> and do local_irq_restore(), but I don't know how we'd support
> write_lock_irq() if we did that -- can we rely on passing in 0 for flags
> meaning "reenable" on all architectures? And ~0 meaning "don't
> reenable" on all architectures?
What about having all write_lock_irq callers also pass the real flags
from local_irq_save(flags) into queued_write_lock_slowpath?
Arch-specific valid flag values wouldn't be limited to 0/!0 then.
>
> That all seems complicated, so I didn't do that.
This is complicated. It also needs testing to verify.
A more careful design is better.

Fixed previous wrong email address. ^-^!
>

--
Thx and BRs,
Aiqun(Maria) Yu

2024-01-03 06:04:36

by Oliver Sang

Subject: Re: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock



Hello,

kernel test robot noticed "WARNING:inconsistent_lock_state" on:

commit: 30ebdbe58c5be457f329cb81487df2a9eae886b4 ("Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock")
url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox/Re-PATCH-kernel-Introduce-a-write-lock-unlock-wrapper-for-tasklist_lock/20231229-062352
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a51749ab34d9e5dec548fe38ede7e01e8bb26454
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+-----------------------------------------------------+------------+------------+
| | a51749ab34 | 30ebdbe58c |
+-----------------------------------------------------+------------+------------+
| WARNING:inconsistent_lock_state | 0 | 10 |
| inconsistent{IN-HARDIRQ-R}->{HARDIRQ-ON-W}usage | 0 | 10 |
+-----------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


[ 75.968288][ T141] WARNING: inconsistent lock state
[ 75.968550][ T141] 6.7.0-rc1-00006-g30ebdbe58c5b #1 Tainted: G W N
[ 75.968946][ T141] --------------------------------
[ 75.969208][ T141] inconsistent {IN-HARDIRQ-R} -> {HARDIRQ-ON-W} usage.
[ 75.969556][ T141] systemd-udevd/141 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 75.969889][ T141] ffff888113a9d958 (&ep->lock){+-.-}-{2:2}, at: ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31))
[ 75.970329][ T141] {IN-HARDIRQ-R} state was registered at:
[ 75.970620][ T141] __lock_acquire (kernel/locking/lockdep.c:5090)
[ 75.970873][ T141] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755)
[ 75.971113][ T141] _raw_read_lock_irqsave (include/linux/rwlock_api_smp.h:161 kernel/locking/spinlock.c:236)
[ 75.971387][ T141] ep_poll_callback (include/net/busy_poll.h:37 fs/eventpoll.c:434 fs/eventpoll.c:1178)
[ 75.971638][ T141] __wake_up_common (kernel/sched/wait.c:90)
[ 75.971894][ T141] __wake_up (include/linux/spinlock.h:406 kernel/sched/wait.c:108 kernel/sched/wait.c:127)
[ 75.972110][ T141] irq_work_single (kernel/irq_work.c:222)
[ 75.972363][ T141] irq_work_run_list (kernel/irq_work.c:251 (discriminator 3))
[ 75.972619][ T141] update_process_times (kernel/time/timer.c:2074)
[ 75.972895][ T141] tick_nohz_highres_handler (kernel/time/tick-sched.c:257 kernel/time/tick-sched.c:1516)
[ 75.973188][ T141] __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752)
[ 75.973460][ T141] hrtimer_interrupt (kernel/time/hrtimer.c:1817)
[ 75.973719][ T141] __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1083)
[ 75.974031][ T141] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
[ 75.974324][ T141] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645)
[ 75.974636][ T141] kasan_check_range (mm/kasan/generic.c:186)
[ 75.974888][ T141] trace_preempt_off (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/cpumask.h:504 include/linux/cpumask.h:1104 include/trace/events/preemptirq.h:51 kernel/trace/trace_preemptirq.c:109)
[ 75.975144][ T141] _raw_spin_lock (include/linux/spinlock_api_smp.h:133 kernel/locking/spinlock.c:154)
[ 75.975383][ T141] __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:1765)
[ 75.975683][ T141] change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:1849)
[ 75.975971][ T141] set_memory_ro (arch/x86/mm/pat/set_memory.c:2077)
[ 75.976206][ T141] module_enable_ro (kernel/module/strict_rwx.c:19 kernel/module/strict_rwx.c:47)
[ 75.976460][ T141] do_init_module (kernel/module/main.c:2572)
[ 75.976715][ T141] load_module (kernel/module/main.c:2981)
[ 75.976959][ T141] init_module_from_file (kernel/module/main.c:3148)
[ 75.977233][ T141] idempotent_init_module (kernel/module/main.c:3165)
[ 75.977514][ T141] __ia32_sys_finit_module (include/linux/file.h:45 kernel/module/main.c:3187 kernel/module/main.c:3169 kernel/module/main.c:3169)
[ 75.977796][ T141] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230)
[ 75.978060][ T141] do_fast_syscall_32 (arch/x86/entry/common.c:255)
[ 75.978315][ T141] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121)
[ 75.978644][ T141] irq event stamp: 226426
[ 75.978866][ T141] hardirqs last enabled at (226425): syscall_enter_from_user_mode_prepare (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/entry/common.c:122)
[ 75.979436][ T141] hardirqs last disabled at (226426): _raw_write_lock_irq (include/linux/rwlock_api_smp.h:193 kernel/locking/spinlock.c:326)
[ 75.979932][ T141] softirqs last enabled at (225118): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582)
[ 75.980407][ T141] softirqs last disabled at (225113): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
[ 75.980892][ T141]
[ 75.980892][ T141] other info that might help us debug this:
[ 75.981299][ T141] Possible unsafe locking scenario:
[ 75.981299][ T141]
[ 75.981676][ T141] CPU0
[ 75.981848][ T141] ----
[ 75.982019][ T141] lock(&ep->lock);
[ 75.982224][ T141] <Interrupt>
[ 75.982405][ T141] lock(&ep->lock);
[ 75.982617][ T141]
[ 75.982617][ T141] *** DEADLOCK ***
[ 75.982617][ T141]
[ 75.983028][ T141] 2 locks held by systemd-udevd/141:
[ 75.983299][ T141] #0: ffff888113a9d868 (&ep->mtx){+.+.}-{3:3}, at: ep_send_events (fs/eventpoll.c:1695)
[ 75.983758][ T141] #1: ffff888113a9d958 (&ep->lock){+-.-}-{2:2}, at: ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31))
[ 75.984215][ T141]
[ 75.984215][ T141] stack backtrace:
[ 75.984517][ T141] CPU: 1 PID: 141 Comm: systemd-udevd Tainted: G W N 6.7.0-rc1-00006-g30ebdbe58c5b #1 f53d658e8913bcc30100423f807a4e1a31eca25f
[ 75.985251][ T141] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 75.985777][ T141] Call Trace:
[ 75.985950][ T141] <TASK>
[ 75.986105][ T141] dump_stack_lvl (lib/dump_stack.c:107)
[ 75.986344][ T141] mark_lock_irq (kernel/locking/lockdep.c:4216)
[ 75.986591][ T141] ? print_usage_bug (kernel/locking/lockdep.c:4206)
[ 75.986847][ T141] ? stack_trace_snprint (kernel/stacktrace.c:114)
[ 75.987115][ T141] ? save_trace (kernel/locking/lockdep.c:586)
[ 75.987350][ T141] mark_lock (kernel/locking/lockdep.c:4677)
[ 75.987576][ T141] ? mark_lock_irq (kernel/locking/lockdep.c:4638)
[ 75.987836][ T141] mark_held_locks (kernel/locking/lockdep.c:4273)
[ 75.988077][ T141] lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4291 kernel/locking/lockdep.c:4358)
[ 75.988377][ T141] trace_hardirqs_on (kernel/trace/trace_preemptirq.c:62)
[ 75.988631][ T141] queued_write_lock_slowpath (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/locking/qrwlock.c:87)
[ 75.988926][ T141] ? queued_read_lock_slowpath (kernel/locking/qrwlock.c:68)
[ 75.989226][ T141] ? lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755)
[ 75.989473][ T141] ? lock_sync (kernel/locking/lockdep.c:5721)
[ 75.989706][ T141] do_raw_write_lock_irq (include/asm-generic/qrwlock.h:115 kernel/locking/spinlock_debug.c:217)
[ 75.989980][ T141] ? do_raw_write_lock (kernel/locking/spinlock_debug.c:215)
[ 75.990245][ T141] ? _raw_write_lock_irq (include/linux/rwlock_api_smp.h:195 kernel/locking/spinlock.c:326)
[ 75.990512][ T141] ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31))
[ 75.990749][ T141] ep_send_events (fs/eventpoll.c:1701)
[ 75.990995][ T141] ? _copy_from_iter_nocache (lib/iov_iter.c:181)
[ 75.991296][ T141] ? __ia32_sys_epoll_create (fs/eventpoll.c:1678)
[ 75.991579][ T141] ? mark_lock (arch/x86/include/asm/bitops.h:227 (discriminator 3) arch/x86/include/asm/bitops.h:239 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4655 (discriminator 3))
[ 75.991813][ T141] ep_poll (fs/eventpoll.c:1865)
[ 75.992030][ T141] ? ep_send_events (fs/eventpoll.c:1827)
[ 75.992290][ T141] do_epoll_wait (fs/eventpoll.c:2318)
[ 75.992532][ T141] __ia32_sys_epoll_wait (fs/eventpoll.c:2325)
[ 75.992810][ T141] ? clockevents_program_event (kernel/time/clockevents.c:336 (discriminator 3))
[ 75.993112][ T141] ? do_epoll_wait (fs/eventpoll.c:2325)
[ 75.993366][ T141] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230)
[ 75.993627][ T141] do_fast_syscall_32 (arch/x86/entry/common.c:255)
[ 75.993879][ T141] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121)
[ 75.994204][ T141] RIP: 0023:0xf7f55579
[ 75.994417][ T141] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
0: b8 01 10 06 03 mov $0x3061001,%eax
5: 74 b4 je 0xffffffffffffffbb
7: 01 10 add %edx,(%rax)
9: 07 (bad)
a: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi
e: 10 08 adc %cl,(%rax)
10: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
...
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24:* 89 e5 mov %esp,%ebp <-- trapping instruction
26: 0f 34 sysenter
28: cd 80 int $0x80
2a: 5d pop %rbp
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 ret
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 ret
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240103/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2024-01-03 18:18:53

by Matthew Wilcox

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock

On Wed, Jan 03, 2024 at 10:58:33AM +0800, Aiqun Yu (Maria) wrote:
> On 1/2/2024 5:14 PM, Matthew Wilcox wrote:
> > > > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
> > > > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
> > > > {
> > > > int cnts;
> > > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
> > > Also a new state showed up after the current design:
> > > 1. locked flag with _QW_WAITING, while irq enabled.
> > > 2. And this state will be only in interrupt context.
> > > 3. lock->wait_lock is hold by the write waiter.
> > > So per my understanding, a different behavior also needed to be done in
> > > queued_write_lock_slowpath:
> > > when (unlikely(in_interrupt())) , get the lock directly.
> >
> > I don't think so. Remember that write_lock_irq() can only be called in
> > process context, and when interrupts are enabled.
> In current kernel drivers, I can see same lock called with write_lock_irq
> and write_lock_irqsave in different drivers.
>
> And this is the scenario I am talking about:
> 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context)
> 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt.

Oh, I missed that it was holding the wait_lock. Yes, we also need to
release the wait_lock before spinning with interrupts disabled.

> I was thinking to support both write_lock_irq and write_lock_irqsave with
> interrupt enabled together in queued_write_lock_slowpath.
>
> That's why I am suggesting in write_lock_irqsave when (in_interrupt()),
> instead spin for the lock->wait_lock, spin to get the lock->cnts directly.

Mmm, but the interrupt could come in on a different CPU and that would
lead to it stealing the wait_lock from the CPU which is merely waiting
for the readers to go away.


2024-01-04 00:47:19

by Aiqun Yu (Maria)

Subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock



On 1/4/2024 2:18 AM, Matthew Wilcox wrote:
> On Wed, Jan 03, 2024 at 10:58:33AM +0800, Aiqun Yu (Maria) wrote:
>> On 1/2/2024 5:14 PM, Matthew Wilcox wrote:
>>>>> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
>>>>> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
>>>>> {
>>>>> int cnts;
>>>>> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
>>>> Also a new state showed up after the current design:
>>>> 1. locked flag with _QW_WAITING, while irq enabled.
>>>> 2. And this state will be only in interrupt context.
>>>> 3. lock->wait_lock is hold by the write waiter.
>>>> So per my understanding, a different behavior also needed to be done in
>>>> queued_write_lock_slowpath:
>>>> when (unlikely(in_interrupt())) , get the lock directly.
>>>
>>> I don't think so. Remember that write_lock_irq() can only be called in
>>> process context, and when interrupts are enabled.
>> In current kernel drivers, I can see same lock called with write_lock_irq
>> and write_lock_irqsave in different drivers.
>>
>> And this is the scenario I am talking about:
>> 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context)
>> 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt.
>
> Oh, I missed that it was holding the wait_lock. Yes, we also need to
> release the wait_lock before spinning with interrupts disabled.
>
>> I was thinking to support both write_lock_irq and write_lock_irqsave with
>> interrupt enabled together in queued_write_lock_slowpath.
>>
>> That's why I am suggesting in write_lock_irqsave when (in_interrupt()),
>> instead spin for the lock->wait_lock, spin to get the lock->cnts directly.
>
> Mmm, but the interrupt could come in on a different CPU and that would
> lead to it stealing the wait_lock from the CPU which is merely waiting
> for the readers to go away.
That's right.
The fairness (the queue mechanism) wouldn't be ensured for in-interrupt
waiters if we add the special design of spinning on lock->cnts directly
when in_interrupt(). A later write_lock_irqsave in interrupt context may
then get the lock earlier than a write_lock_irq() which is not in
interrupt context.

This is a side effect of the design, though a similar unfairness already
exists for the read lock. I think it is reasonable to let in_interrupt()
waiters get the lock earlier, from the point of view of whole-system
performance.
>

--
Thx and BRs,
Aiqun(Maria) Yu