2024-01-30 09:13:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

Hello,

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws, such as
the execution code accessing the tasklet item after the execution is
complete, which can lead to subtle use-after-free in certain usage
scenarios, and less-developed flush and cancel mechanisms.

Mikulas Patocka recently reported that dm-crypt and dm-verity are affected by
the access-after-completion issue and suggested adding a TASKLET_STATE_ONESHOT
flag, which selectively removes the post-completion access while significantly
limiting how the tasklet can be used, in the following thread:

http://lkml.kernel.org/r/[email protected]

Linus didn't like the approach and suggested extending workqueue to support
execution from atomic context:

http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com

As suggested, this patchset implements BH workqueues, which are like regular
workqueues but execute work items in the BH (softirq) context, and converts
several tasklet users.

- The name bh is used instead of the suggested atomic as it's more in line
with the widely used execution context interface - local_bh_enable/disable()
and friends.

- The system default BH workqueues - system_bh_wq and system_bh_highpri_wq -
are provided. As queue-wide flushing doesn't exist in tasklet, all
existing tasklet users should be able to use the system BH workqueues
without creating their own (see the conversion sketch after this list).

- BH workqueues currently use tasklet to run the work items to avoid
priority inversions involving tasklet_hi and WQ_BH | WQ_HIGHPRI. Once all
tasklet users are converted, tasklet code can be removed and BH workqueues
can take over its softirqs.
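
To make the intended usage concrete, below is a minimal, illustrative
conversion sketch. The foo_* identifiers are hypothetical and not part of
this series; the pattern simply mirrors the conversions in patches
0004-0008.

  #include <linux/workqueue.h>

  /* hypothetical example: runs in the BH (softirq) context, must not sleep */
  static void foo_bh_workfn(struct work_struct *work)
  {
          /* ... process completions, just as a tasklet callback would ... */
  }

  static DECLARE_WORK(foo_bh_work, foo_bh_workfn);

  static void foo_kick(void)
  {
          /* was: tasklet_schedule(&foo_tasklet); */
          queue_work(system_bh_wq, &foo_bh_work);
  }

  static void foo_shutdown(void)
  {
          /* was: tasklet_kill(&foo_tasklet); BH work items support regular
           * flushing/canceling, callable from contexts that may sleep */
          cancel_work_sync(&foo_bh_work);
  }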

This patchset is on top of wq/for-6.9 (aae17ebb53c ("workqueue: Avoid using
isolated cpus' timers on queue_delayed_work")) and contains the following
eight patches.

0001-workqueue-Update-lock-debugging-code.patch
0002-workqueue-Factor-out-init_cpu_worker_pool.patch
0003-workqueue-Implement-BH-workqueues-to-eventually-repl.patch
0004-backtracetest-Convert-from-tasklet-to-BH-workqueue.patch
0005-usb-core-hcd-Convert-from-tasklet-to-BH-workqueue.patch
0006-net-tcp-tsq-Convert-from-tasklet-to-BH-workqueue.patch
0007-dm-crypt-Convert-from-tasklet-to-BH-workqueue.patch
0008-dm-verity-Convert-from-tasklet-to-BH-workqueue.patch

0001-0003 prepare and implement BH workqueues.

0004-0008 convert some tasklet users to BH workqueue. The conversions are
relatively straightforward but are in descending order of confidence.

The patchset is also available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git wq-bh-v1

diffstat follows. Thanks.

Documentation/core-api/workqueue.rst |  29 ++++-
drivers/md/dm-crypt.c                |  36 -------
drivers/md/dm-verity-target.c        |   8 -
drivers/md/dm-verity.h               |   2
drivers/usb/core/hcd.c               |  23 ++--
include/linux/usb/hcd.h              |   2
include/linux/workqueue.h            |   9 +
include/net/tcp.h                    |   2
kernel/backtracetest.c               |  18 +--
kernel/workqueue.c                   | 312 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
kernel/workqueue_internal.h          |   3
net/ipv4/tcp.c                       |   2
net/ipv4/tcp_output.c                |  36 +++----
tools/workqueue/wq_dump.py           |  11 +-
14 files changed, 335 insertions(+), 158 deletions(-)

--
tejun


2024-01-30 09:15:57

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/8] backtracetest: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts backtracetest from tasklet to BH workqueue.

- Replace "irq" with "bh" in names and message to better reflect what's
happening.

- Replace completion usage with a flush_work() call.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Arjan van de Ven <[email protected]>
---
kernel/backtracetest.c | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/backtracetest.c b/kernel/backtracetest.c
index 370217dd7e39..a4181234232b 100644
--- a/kernel/backtracetest.c
+++ b/kernel/backtracetest.c
@@ -21,24 +21,20 @@ static void backtrace_test_normal(void)
dump_stack();
}

-static DECLARE_COMPLETION(backtrace_work);
-
-static void backtrace_test_irq_callback(unsigned long data)
+static void backtrace_test_bh_workfn(struct work_struct *work)
{
dump_stack();
- complete(&backtrace_work);
}

-static DECLARE_TASKLET_OLD(backtrace_tasklet, &backtrace_test_irq_callback);
+static DECLARE_WORK(backtrace_bh_work, &backtrace_test_bh_workfn);

-static void backtrace_test_irq(void)
+static void backtrace_test_bh(void)
{
- pr_info("Testing a backtrace from irq context.\n");
+ pr_info("Testing a backtrace from BH context.\n");
pr_info("The following trace is a kernel self test and not a bug!\n");

- init_completion(&backtrace_work);
- tasklet_schedule(&backtrace_tasklet);
- wait_for_completion(&backtrace_work);
+ queue_work(system_bh_wq, &backtrace_bh_work);
+ flush_work(&backtrace_bh_work);
}

#ifdef CONFIG_STACKTRACE
@@ -65,7 +61,7 @@ static int backtrace_regression_test(void)
pr_info("====[ backtrace testing ]===========\n");

backtrace_test_normal();
- backtrace_test_irq();
+ backtrace_test_bh();
backtrace_test_saved();

pr_info("====[ end of backtrace testing ]====\n");
--
2.43.0


2024-01-30 09:16:02

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 6/8] net: tcp: tsq: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.

Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
for every packet, the difference hopefully shouldn't matter.

My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: "David S. Miller" <[email protected]> (maintainer:NETWORKING [IPv4/IPv6])
Cc: David Ahern <[email protected]>
Cc: Jakub Kicinski <[email protected]> (maintainer:NETWORKING [GENERAL])
Cc: Paolo Abeni <[email protected]>
Cc: [email protected] (open list:NETWORKING [TCP])
---
include/net/tcp.h | 2 +-
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_output.c | 36 ++++++++++++++++++------------------
3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index dd78a1181031..89f3702be47a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -324,7 +324,7 @@ extern struct proto tcp_prot;
#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)

-void tcp_tasklet_init(void);
+void tcp_tsq_work_init(void);

int tcp_v4_err(struct sk_buff *skb, u32);

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1baa484d2190..d085ee5642fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4772,6 +4772,6 @@ void __init tcp_init(void)
tcp_v4_init();
tcp_metrics_init();
BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
- tcp_tasklet_init();
+ tcp_tsq_work_init();
mptcp_init();
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..d11be6eebb6e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1049,15 +1049,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
* needs to be reallocated in a driver.
* The invariant being skb->truesize subtracted from sk->sk_wmem_alloc
*
- * Since transmit from skb destructor is forbidden, we use a tasklet
+ * Since transmit from skb destructor is forbidden, we use a BH work item
* to process all sockets that eventually need to send more skbs.
- * We use one tasklet per cpu, with its own queue of sockets.
+ * We use one work item per cpu, with its own queue of sockets.
*/
-struct tsq_tasklet {
- struct tasklet_struct tasklet;
+struct tsq_work {
+ struct work_struct work;
struct list_head head; /* queue of tcp sockets */
};
-static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
+static DEFINE_PER_CPU(struct tsq_work, tsq_work);

static void tcp_tsq_write(struct sock *sk)
{
@@ -1087,14 +1087,14 @@ static void tcp_tsq_handler(struct sock *sk)
bh_unlock_sock(sk);
}
/*
- * One tasklet per cpu tries to send more skbs.
- * We run in tasklet context but need to disable irqs when
+ * One work item per cpu tries to send more skbs.
+ * We run in BH context but need to disable irqs when
* transferring tsq->head because tcp_wfree() might
* interrupt us (non NAPI drivers)
*/
-static void tcp_tasklet_func(struct tasklet_struct *t)
+static void tcp_tsq_workfn(struct work_struct *work)
{
- struct tsq_tasklet *tsq = from_tasklet(tsq, t, tasklet);
+ struct tsq_work *tsq = container_of(work, struct tsq_work, work);
LIST_HEAD(list);
unsigned long flags;
struct list_head *q, *n;
@@ -1164,15 +1164,15 @@ void tcp_release_cb(struct sock *sk)
}
EXPORT_SYMBOL(tcp_release_cb);

-void __init tcp_tasklet_init(void)
+void __init tcp_tsq_work_init(void)
{
int i;

for_each_possible_cpu(i) {
- struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
+ struct tsq_work *tsq = &per_cpu(tsq_work, i);

INIT_LIST_HEAD(&tsq->head);
- tasklet_setup(&tsq->tasklet, tcp_tasklet_func);
+ INIT_WORK(&tsq->work, tcp_tsq_workfn);
}
}

@@ -1186,11 +1186,11 @@ void tcp_wfree(struct sk_buff *skb)
struct sock *sk = skb->sk;
struct tcp_sock *tp = tcp_sk(sk);
unsigned long flags, nval, oval;
- struct tsq_tasklet *tsq;
+ struct tsq_work *tsq;
bool empty;

/* Keep one reference on sk_wmem_alloc.
- * Will be released by sk_free() from here or tcp_tasklet_func()
+ * Will be released by sk_free() from here or tcp_tsq_workfn()
*/
WARN_ON(refcount_sub_and_test(skb->truesize - 1, &sk->sk_wmem_alloc));

@@ -1212,13 +1212,13 @@ void tcp_wfree(struct sk_buff *skb)
nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
} while (!try_cmpxchg(&sk->sk_tsq_flags, &oval, nval));

- /* queue this socket to tasklet queue */
+ /* queue this socket to BH workqueue */
local_irq_save(flags);
- tsq = this_cpu_ptr(&tsq_tasklet);
+ tsq = this_cpu_ptr(&tsq_work);
empty = list_empty(&tsq->head);
list_add(&tp->tsq_node, &tsq->head);
if (empty)
- tasklet_schedule(&tsq->tasklet);
+ queue_work(system_bh_wq, &tsq->work);
local_irq_restore(flags);
return;
out:
@@ -2623,7 +2623,7 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
if (refcount_read(&sk->sk_wmem_alloc) > limit) {
/* Always send skb if rtx queue is empty or has one skb.
* No need to wait for TX completion to call us back,
- * after softirq/tasklet schedule.
+ * after softirq schedule.
* This helps when TX completions are delayed too much.
*/
if (tcp_rtx_queue_empty_or_single_skb(sk))
--
2.43.0


2024-01-30 09:16:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 5/8] usb: core: hcd: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts usb hcd from tasklet to BH workqueue.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: [email protected]
---
drivers/usb/core/hcd.c | 23 ++++++++++++-----------
include/linux/usb/hcd.h | 2 +-
2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 12b6dfeaf658..edf74458474a 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1664,9 +1664,10 @@ static void __usb_hcd_giveback_urb(struct urb *urb)
usb_put_urb(urb);
}

-static void usb_giveback_urb_bh(struct tasklet_struct *t)
+static void usb_giveback_urb_bh(struct work_struct *work)
{
- struct giveback_urb_bh *bh = from_tasklet(bh, t, bh);
+ struct giveback_urb_bh *bh =
+ container_of(work, struct giveback_urb_bh, bh);
struct list_head local_list;

spin_lock_irq(&bh->lock);
@@ -1691,9 +1692,9 @@ static void usb_giveback_urb_bh(struct tasklet_struct *t)
spin_lock_irq(&bh->lock);
if (!list_empty(&bh->head)) {
if (bh->high_prio)
- tasklet_hi_schedule(&bh->bh);
+ queue_work(system_bh_highpri_wq, &bh->bh);
else
- tasklet_schedule(&bh->bh);
+ queue_work(system_bh_wq, &bh->bh);
}
bh->running = false;
spin_unlock_irq(&bh->lock);
@@ -1706,7 +1707,7 @@ static void usb_giveback_urb_bh(struct tasklet_struct *t)
* @status: completion status code for the URB.
*
* Context: atomic. The completion callback is invoked in caller's context.
- * For HCDs with HCD_BH flag set, the completion callback is invoked in tasklet
+ * For HCDs with HCD_BH flag set, the completion callback is invoked in BH
* context (except for URBs submitted to the root hub which always complete in
* caller's context).
*
@@ -1725,7 +1726,7 @@ void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
struct giveback_urb_bh *bh;
bool running;

- /* pass status to tasklet via unlinked */
+ /* pass status to BH via unlinked */
if (likely(!urb->unlinked))
urb->unlinked = status;

@@ -1747,9 +1748,9 @@ void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
if (running)
;
else if (bh->high_prio)
- tasklet_hi_schedule(&bh->bh);
+ queue_work(system_bh_highpri_wq, &bh->bh);
else
- tasklet_schedule(&bh->bh);
+ queue_work(system_bh_wq, &bh->bh);
}
EXPORT_SYMBOL_GPL(usb_hcd_giveback_urb);

@@ -2540,7 +2541,7 @@ static void init_giveback_urb_bh(struct giveback_urb_bh *bh)

spin_lock_init(&bh->lock);
INIT_LIST_HEAD(&bh->head);
- tasklet_setup(&bh->bh, usb_giveback_urb_bh);
+ INIT_WORK(&bh->bh, usb_giveback_urb_bh);
}

struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
@@ -2926,7 +2927,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
&& device_can_wakeup(&hcd->self.root_hub->dev))
dev_dbg(hcd->self.controller, "supports USB remote wakeup\n");

- /* initialize tasklets */
+ /* initialize BHs */
init_giveback_urb_bh(&hcd->high_prio_bh);
hcd->high_prio_bh.high_prio = true;
init_giveback_urb_bh(&hcd->low_prio_bh);
@@ -3036,7 +3037,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
mutex_unlock(&usb_bus_idr_lock);

/*
- * tasklet_kill() isn't needed here because:
+ * flush_work() isn't needed here because:
* - driver's disconnect() called from usb_disconnect() should
* make sure its URBs are completed during the disconnect()
* callback
diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
index 00724b4f6e12..f698aac71de3 100644
--- a/include/linux/usb/hcd.h
+++ b/include/linux/usb/hcd.h
@@ -55,7 +55,7 @@ struct giveback_urb_bh {
bool high_prio;
spinlock_t lock;
struct list_head head;
- struct tasklet_struct bh;
+ struct work_struct bh;
struct usb_host_endpoint *completing_ep;
};

--
2.43.0


2024-01-30 09:17:14

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 7/8] dm-crypt: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts dm-crypt from tasklet to BH workqueue.

Like a regular workqueue, a BH workqueue allows freeing the currently
executing work item. Converting from tasklet to BH workqueue removes the
need for deferring bio_endio() again to a work item, which was buggy anyway.

I tested this lightly with "--perf-no_read_workqueue
--perf-no_write_workqueue" + some code modifications, but would really
appreciate it if someone who knows the code base better could take a look.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Mikulas Patocka <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: [email protected]
---
drivers/md/dm-crypt.c | 36 ++----------------------------------
1 file changed, 2 insertions(+), 34 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 855b482cbff1..619c762d4072 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -73,11 +73,8 @@ struct dm_crypt_io {
struct bio *base_bio;
u8 *integrity_metadata;
bool integrity_metadata_from_pool:1;
- bool in_tasklet:1;

struct work_struct work;
- struct tasklet_struct tasklet;
-
struct convert_context ctx;

atomic_t io_pending;
@@ -1762,7 +1759,6 @@ static void crypt_io_init(struct dm_crypt_io *io, struct crypt_config *cc,
io->ctx.r.req = NULL;
io->integrity_metadata = NULL;
io->integrity_metadata_from_pool = false;
- io->in_tasklet = false;
atomic_set(&io->io_pending, 0);
}

@@ -1771,13 +1767,6 @@ static void crypt_inc_pending(struct dm_crypt_io *io)
atomic_inc(&io->io_pending);
}

-static void kcryptd_io_bio_endio(struct work_struct *work)
-{
- struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
-
- bio_endio(io->base_bio);
-}
-
/*
* One of the bios was finished. Check for completion of
* the whole request and correctly clean up the buffer.
@@ -1800,21 +1789,6 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
kfree(io->integrity_metadata);

base_bio->bi_status = error;
-
- /*
- * If we are running this function from our tasklet,
- * we can't call bio_endio() here, because it will call
- * clone_endio() from dm.c, which in turn will
- * free the current struct dm_crypt_io structure with
- * our tasklet. In this case we need to delay bio_endio()
- * execution to after the tasklet is done and dequeued.
- */
- if (io->in_tasklet) {
- INIT_WORK(&io->work, kcryptd_io_bio_endio);
- queue_work(cc->io_queue, &io->work);
- return;
- }
-
bio_endio(base_bio);
}

@@ -2246,11 +2220,6 @@ static void kcryptd_crypt(struct work_struct *work)
kcryptd_crypt_write_convert(io);
}

-static void kcryptd_crypt_tasklet(unsigned long work)
-{
- kcryptd_crypt((struct work_struct *)work);
-}
-
static void kcryptd_queue_crypt(struct dm_crypt_io *io)
{
struct crypt_config *cc = io->cc;
@@ -2263,9 +2232,8 @@ static void kcryptd_queue_crypt(struct dm_crypt_io *io)
* it is being executed with irqs disabled.
*/
if (in_hardirq() || irqs_disabled()) {
- io->in_tasklet = true;
- tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
- tasklet_schedule(&io->tasklet);
+ INIT_WORK(&io->work, kcryptd_crypt);
+ queue_work(system_bh_wq, &io->work);
return;
}

--
2.43.0


2024-01-30 09:17:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts dm-verity from tasklet to BH workqueue.

This is a minimal conversion which doesn't rename the related names,
including the "try_verify_in_tasklet" option. If this patch is applied, a
follow-up patch renaming the remaining tasklet references would be
necessary; I couldn't decide whether the option name would need to be
updated too.

Only compile tested. I don't know how to verity.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: [email protected]
---
drivers/md/dm-verity-target.c | 8 ++++----
drivers/md/dm-verity.h | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index 14e58ae70521..911261de2d08 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -645,9 +645,9 @@ static void verity_work(struct work_struct *w)
verity_finish_io(io, errno_to_blk_status(verity_verify_io(io)));
}

-static void verity_tasklet(unsigned long data)
+static void verity_bh_work(struct work_struct *w)
{
- struct dm_verity_io *io = (struct dm_verity_io *)data;
+ struct dm_verity_io *io = container_of(w, struct dm_verity_io, bh_work);
int err;

io->in_tasklet = true;
@@ -675,8 +675,8 @@ static void verity_end_io(struct bio *bio)
}

if (static_branch_unlikely(&use_tasklet_enabled) && io->v->use_tasklet) {
- tasklet_init(&io->tasklet, verity_tasklet, (unsigned long)io);
- tasklet_schedule(&io->tasklet);
+ INIT_WORK(&io->bh_work, verity_bh_work);
+ queue_work(system_bh_wq, &io->bh_work);
} else {
INIT_WORK(&io->work, verity_work);
queue_work(io->v->verify_wq, &io->work);
diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h
index f9d522c870e6..7c16f834f31a 100644
--- a/drivers/md/dm-verity.h
+++ b/drivers/md/dm-verity.h
@@ -83,7 +83,7 @@ struct dm_verity_io {
struct bvec_iter iter;

struct work_struct work;
- struct tasklet_struct tasklet;
+ struct work_struct bh_work;

/*
* Three variably-size fields follow this struct:
--
2.43.0


2024-01-30 09:24:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/8] workqueue: Factor out init_cpu_worker_pool()

Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation for BH workqueue support.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3f2081bd05a4..f93554e479c4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7135,6 +7135,22 @@ static void __init restrict_unbound_cpumask(const char *name, const struct cpuma
cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, mask);
}

+static void __init init_cpu_worker_pool(struct worker_pool *pool, int cpu, int nice)
+{
+ BUG_ON(init_worker_pool(pool));
+ pool->cpu = cpu;
+ cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
+ cpumask_copy(pool->attrs->__pod_cpumask, cpumask_of(cpu));
+ pool->attrs->nice = nice;
+ pool->attrs->affn_strict = true;
+ pool->node = cpu_to_node(cpu);
+
+ /* alloc pool ID */
+ mutex_lock(&wq_pool_mutex);
+ BUG_ON(worker_pool_assign_id(pool));
+ mutex_unlock(&wq_pool_mutex);
+}
+
/**
* workqueue_init_early - early init for workqueue subsystem
*
@@ -7195,20 +7211,8 @@ void __init workqueue_init_early(void)
struct worker_pool *pool;

i = 0;
- for_each_cpu_worker_pool(pool, cpu) {
- BUG_ON(init_worker_pool(pool));
- pool->cpu = cpu;
- cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
- cpumask_copy(pool->attrs->__pod_cpumask, cpumask_of(cpu));
- pool->attrs->nice = std_nice[i++];
- pool->attrs->affn_strict = true;
- pool->node = cpu_to_node(cpu);
-
- /* alloc pool ID */
- mutex_lock(&wq_pool_mutex);
- BUG_ON(worker_pool_assign_id(pool));
- mutex_unlock(&wq_pool_mutex);
- }
+ for_each_cpu_worker_pool(pool, cpu)
+ init_cpu_worker_pool(pool, cpu, std_nice[i++]);
}

/* create default unbound and ordered wq attrs */
--
2.43.0


2024-01-30 09:27:16

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws, such as
the execution code accessing the tasklet item after the execution is
complete, which can lead to subtle use-after-free in certain usage
scenarios, and less-developed flush and cancel mechanisms.

This patch implements BH workqueues, which share the same semantics and
features as regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own.
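
For illustration, a minimal sketch of allocating a dedicated BH workqueue
under the constraints above (max_active must be 0 and WQ_HIGHPRI is the only
additional flag allowed); the mydrv_* names are hypothetical, and most users
should simply use system_bh[_highpri]_wq instead:

  static struct workqueue_struct *mydrv_bh_wq;

  static int __init mydrv_wq_init(void)
  {
          /* WQ_BH requires max_active == 0; only WQ_HIGHPRI may be added */
          mydrv_bh_wq = alloc_workqueue("mydrv_bh", WQ_BH | WQ_HIGHPRI, 0);
          if (!mydrv_bh_wq)
                  return -ENOMEM;
          return 0;
  }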

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
---
Documentation/core-api/workqueue.rst | 29 +++-
include/linux/workqueue.h | 9 ++
kernel/workqueue.c | 231 ++++++++++++++++++++++-----
kernel/workqueue_internal.h | 3 +
tools/workqueue/wq_dump.py | 11 +-
5 files changed, 237 insertions(+), 46 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 33c4539155d9..2d6af6c4665c 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

-Special purpose threads, called worker threads, execute the functions
-off of the queue, one after the other. If no work is queued, the
-worker threads become idle. These worker threads are managed in so
-called worker-pools.
+A work item can be executed in either a thread or the BH (softirq) context.
+
+For threaded workqueues, special purpose threads, called [k]workers, execute
+the functions off of the queue, one after the other. If no work is queued,
+the worker threads become idle. These worker threads are managed in
+worker-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
@@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.

+BH workqueues use the same framework. However, as there can only be one
+concurrent execution context, there's no need to worry about concurrency.
+Each per-CPU BH worker pool contains only one pseudo worker which represents
+the BH execution context. A BH workqueue can be considered a convenience
+interface to softirq.
+
Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
@@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will
be queued on the worklist of either normal or highpri worker-pool that
is associated to the CPU the issuer is running on.

-For any worker pool implementation, managing the concurrency level
+For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue. cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
@@ -164,6 +172,17 @@ resources, scheduled and executed.
``flags``
---------

+``WQ_BH``
+ BH workqueues can be considered a convenience interface to softirq. BH
+ workqueues are always per-CPU and all BH work items are executed in the
+ queueing CPU's softirq context in the queueing order.
+
+ All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
+ only allowed additional flag.
+
+ BH work items cannot sleep. All other features such as delayed queueing,
+ flushing and canceling are supported.
+
``WQ_UNBOUND``
Work items queued to an unbound wq are served by the special
worker-pools which host workers which are not bound to any
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 232baea90a1d..3ac044691246 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -353,6 +353,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* Documentation/core-api/workqueue.rst.
*/
enum wq_flags {
+ WQ_BH = 1 << 0, /* execute in bottom half (softirq) context */
WQ_UNBOUND = 1 << 1, /* not bound to any cpu */
WQ_FREEZABLE = 1 << 2, /* freeze during suspend */
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
@@ -392,6 +393,9 @@ enum wq_flags {
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
__WQ_ORDERED_EXPLICIT = 1 << 19, /* internal: alloc_ordered_workqueue() */
+
+ /* BH wq only allows the following flags */
+ __WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
};

enum wq_consts {
@@ -434,6 +438,9 @@ enum wq_consts {
* they are same as their non-power-efficient counterparts - e.g.
* system_power_efficient_wq is identical to system_wq if
* 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
+ *
+ * system_bh[_highpri]_wq are convenience interface to softirq. BH work items
+ * are executed in the queueing CPU's BH context in the queueing order.
*/
extern struct workqueue_struct *system_wq;
extern struct workqueue_struct *system_highpri_wq;
@@ -442,6 +449,8 @@ extern struct workqueue_struct *system_unbound_wq;
extern struct workqueue_struct *system_freezable_wq;
extern struct workqueue_struct *system_power_efficient_wq;
extern struct workqueue_struct *system_freezable_power_efficient_wq;
+extern struct workqueue_struct *system_bh_wq;
+extern struct workqueue_struct *system_bh_highpri_wq;

/**
* alloc_workqueue - allocate a workqueue
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f93554e479c4..9c972db910f6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -72,8 +72,12 @@ enum worker_pool_flags {
* Note that DISASSOCIATED should be flipped only while holding
* wq_pool_attach_mutex to avoid changing binding state while
* worker_attach_to_pool() is in progress.
+ *
+ * As there can only be one concurrent BH execution context per CPU, a
+ * BH pool is per-CPU and always DISASSOCIATED.
*/
- POOL_MANAGER_ACTIVE = 1 << 0, /* being managed */
+ POOL_BH = 1 << 0, /* is a BH pool */
+ POOL_MANAGER_ACTIVE = 1 << 1, /* being managed */
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
};

@@ -115,6 +119,14 @@ enum wq_internal_consts {
WQ_NAME_LEN = 32,
};

+/*
+ * We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
+ * MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
+ * msecs_to_jiffies() can't be an initializer.
+ */
+#define BH_WORKER_JIFFIES msecs_to_jiffies(2)
+#define BH_WORKER_RESTARTS 10
+
/*
* Structure fields follow one of the following exclusion rules.
*
@@ -441,8 +453,13 @@ static bool wq_debug_force_rr_cpu = false;
#endif
module_param_named(debug_force_rr_cpu, wq_debug_force_rr_cpu, bool, 0644);

+/* the BH worker pools */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+ bh_worker_pools);
+
/* the per-cpu worker pools */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], cpu_worker_pools);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+ cpu_worker_pools);

static DEFINE_IDR(worker_pool_idr); /* PR: idr of all pools */

@@ -476,8 +493,13 @@ struct workqueue_struct *system_power_efficient_wq __ro_after_init;
EXPORT_SYMBOL_GPL(system_power_efficient_wq);
struct workqueue_struct *system_freezable_power_efficient_wq __ro_after_init;
EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);
+struct workqueue_struct *system_bh_wq;
+EXPORT_SYMBOL_GPL(system_bh_wq);
+struct workqueue_struct *system_bh_highpri_wq;
+EXPORT_SYMBOL_GPL(system_bh_highpri_wq);

static int worker_thread(void *__worker);
+static void bh_worker_taskletfn(struct tasklet_struct *tasklet);
static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
static void show_pwq(struct pool_workqueue *pwq);
static void show_one_worker_pool(struct worker_pool *pool);
@@ -496,6 +518,11 @@ static void show_one_worker_pool(struct worker_pool *pool);
!lockdep_is_held(&wq_pool_mutex), \
"RCU, wq->mutex or wq_pool_mutex should be held")

+#define for_each_bh_worker_pool(pool, cpu) \
+ for ((pool) = &per_cpu(bh_worker_pools, cpu)[0]; \
+ (pool) < &per_cpu(bh_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+ (pool)++)
+
#define for_each_cpu_worker_pool(pool, cpu) \
for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
(pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
@@ -1184,6 +1211,14 @@ static bool kick_pool(struct worker_pool *pool)
if (!need_more_worker(pool) || !worker)
return false;

+ if (pool->flags & POOL_BH) {
+ if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ tasklet_hi_schedule(&worker->bh_tasklet);
+ else
+ tasklet_schedule(&worker->bh_tasklet);
+ return true;
+ }
+
p = worker->task;

#ifdef CONFIG_SMP
@@ -1663,8 +1698,16 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
lockdep_assert_held(&pool->lock);

if (!nna) {
- /* per-cpu workqueue, pwq->nr_active is sufficient */
- obtained = pwq->nr_active < READ_ONCE(wq->max_active);
+ /*
+ * BH workqueues always share a single execution context per CPU
+ * and don't impose any max_active limit, so tryinc always
+ * succeeds. For a per-cpu workqueue, checking pwq->nr_active is
+ * sufficient.
+ */
+ if (wq->flags & WQ_BH)
+ obtained = true;
+ else
+ obtained = pwq->nr_active < READ_ONCE(wq->max_active);
goto out;
}

@@ -2599,27 +2642,31 @@ static struct worker *create_worker(struct worker_pool *pool)

worker->id = id;

- if (pool->cpu >= 0)
- snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
- pool->attrs->nice < 0 ? "H" : "");
- else
- snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
-
- worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
- "kworker/%s", id_buf);
- if (IS_ERR(worker->task)) {
- if (PTR_ERR(worker->task) == -EINTR) {
- pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
- id_buf);
- } else {
- pr_err_once("workqueue: Failed to create a worker thread: %pe",
- worker->task);
+ if (pool->flags & POOL_BH) {
+ tasklet_setup(&worker->bh_tasklet, bh_worker_taskletfn);
+ } else {
+ if (pool->cpu >= 0)
+ snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+ pool->attrs->nice < 0 ? "H" : "");
+ else
+ snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+ worker->task = kthread_create_on_node(worker_thread, worker,
+ pool->node, "kworker/%s", id_buf);
+ if (IS_ERR(worker->task)) {
+ if (PTR_ERR(worker->task) == -EINTR) {
+ pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
+ id_buf);
+ } else {
+ pr_err_once("workqueue: Failed to create a worker thread: %pe",
+ worker->task);
+ }
+ goto fail;
}
- goto fail;
- }

- set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+ set_user_nice(worker->task, pool->attrs->nice);
+ kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+ }

/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
@@ -2635,7 +2682,8 @@ static struct worker *create_worker(struct worker_pool *pool)
* check if not woken up soon. As kick_pool() is noop if @pool is empty,
* wake it up explicitly.
*/
- wake_up_process(worker->task);
+ if (worker->task)
+ wake_up_process(worker->task);

raw_spin_unlock_irq(&pool->lock);

@@ -2977,7 +3025,8 @@ __acquires(&pool->lock)
worker->current_work = work;
worker->current_func = work->func;
worker->current_pwq = pwq;
- worker->current_at = worker->task->se.sum_exec_runtime;
+ if (worker->task)
+ worker->current_at = worker->task->se.sum_exec_runtime;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

@@ -3075,7 +3124,8 @@ __acquires(&pool->lock)
* stop_machine. At the same time, report a quiescent RCU state so
* the same condition doesn't freeze RCU.
*/
- cond_resched();
+ if (worker->task)
+ cond_resched();

raw_spin_lock_irq(&pool->lock);

@@ -3358,6 +3408,43 @@ static int rescuer_thread(void *__rescuer)
goto repeat;
}

+void bh_worker_taskletfn(struct tasklet_struct *tasklet)
+{
+ struct worker *worker = container_of(tasklet, struct worker, bh_tasklet);
+ struct worker_pool *pool = worker->pool;
+ int nr_restarts = BH_WORKER_RESTARTS;
+ unsigned long end = jiffies + BH_WORKER_JIFFIES;
+
+ raw_spin_lock_irq(&pool->lock);
+ worker_leave_idle(worker);
+
+ /*
+ * This function follows the structure of worker_thread(). See there for
+ * explanations on each step.
+ */
+ if (!need_more_worker(pool))
+ goto done;
+
+ WARN_ON_ONCE(!list_empty(&worker->scheduled));
+ worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
+
+ do {
+ struct work_struct *work =
+ list_first_entry(&pool->worklist,
+ struct work_struct, entry);
+
+ if (assign_work(work, worker, NULL))
+ process_scheduled_works(worker);
+ } while (keep_working(pool) &&
+ --nr_restarts && time_before(jiffies, end));
+
+ worker_set_flags(worker, WORKER_PREP);
+done:
+ worker_enter_idle(worker);
+ kick_pool(pool);
+ raw_spin_unlock_irq(&pool->lock);
+}
+
/**
* check_flush_dependency - check for flush dependency sanity
* @target_wq: workqueue being flushed
@@ -3430,6 +3517,7 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
struct wq_barrier *barr,
struct work_struct *target, struct worker *worker)
{
+ static __maybe_unused struct lock_class_key bh_key, thr_key;
unsigned int work_flags = 0;
unsigned int work_color;
struct list_head *head;
@@ -3439,8 +3527,13 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
* as we know for sure that this will not trigger any of the
* checks and call back into the fixup functions where we
* might deadlock.
+ *
+ * BH and threaded workqueues need separate lockdep keys to avoid
+ * spuriously triggering "inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W}
+ * usage".
*/
- INIT_WORK_ONSTACK(&barr->work, wq_barrier_func);
+ INIT_WORK_ONSTACK_KEY(&barr->work, wq_barrier_func,
+ (pwq->wq->flags & WQ_BH) ? &bh_key : &thr_key);
__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));

init_completion_map(&barr->done, &target->lockdep_map);
@@ -3546,15 +3639,31 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,

static void touch_wq_lockdep_map(struct workqueue_struct *wq)
{
+#ifdef CONFIG_LOCKDEP
+ if (wq->flags & WQ_BH)
+ local_bh_disable();
+
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
+
+ if (wq->flags & WQ_BH)
+ local_bh_enable();
+#endif
}

static void touch_work_lockdep_map(struct work_struct *work,
struct workqueue_struct *wq)
{
+#ifdef CONFIG_LOCKDEP
+ if (wq->flags & WQ_BH)
+ local_bh_disable();
+
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
+
+ if (wq->flags & WQ_BH)
+ local_bh_enable();
+#endif
}

/**
@@ -5007,10 +5116,17 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)

if (!(wq->flags & WQ_UNBOUND)) {
for_each_possible_cpu(cpu) {
- struct pool_workqueue **pwq_p =
- per_cpu_ptr(wq->cpu_pwq, cpu);
- struct worker_pool *pool =
- &(per_cpu_ptr(cpu_worker_pools, cpu)[highpri]);
+ struct pool_workqueue **pwq_p;
+ struct worker_pool __percpu *pools;
+ struct worker_pool *pool;
+
+ if (wq->flags & WQ_BH)
+ pools = bh_worker_pools;
+ else
+ pools = cpu_worker_pools;
+
+ pool = &(per_cpu_ptr(pools, cpu)[highpri]);
+ pwq_p = per_cpu_ptr(wq->cpu_pwq, cpu);

*pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL,
pool->node);
@@ -5185,6 +5301,13 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
size_t wq_size;
int name_len;

+ if (flags & WQ_BH) {
+ if (WARN_ON_ONCE(flags & ~__WQ_BH_ALLOWS))
+ return NULL;
+ if (WARN_ON_ONCE(max_active))
+ return NULL;
+ }
+
/*
* Unbound && max_active == 1 used to imply ordered, which is no longer
* the case on many machines due to per-pod pools. While
@@ -5404,6 +5527,9 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
*/
void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
{
+ /* max_active doesn't mean anything for BH workqueues */
+ if (WARN_ON(wq->flags & WQ_BH))
+ return;
/* disallow meddling with max_active for ordered workqueues */
if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
return;
@@ -5605,7 +5731,12 @@ static void pr_cont_pool_info(struct worker_pool *pool)
pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
if (pool->node != NUMA_NO_NODE)
pr_cont(" node=%d", pool->node);
- pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+ pr_cont(" flags=0x%x", pool->flags);
+ if (pool->flags & POOL_BH)
+ pr_cont(" bh%s",
+ pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ else
+ pr_cont(" nice=%d", pool->attrs->nice);
}

struct pr_cont_work_struct {
@@ -6078,13 +6209,15 @@ int workqueue_online_cpu(unsigned int cpu)
mutex_lock(&wq_pool_mutex);

for_each_pool(pool, pi) {
- mutex_lock(&wq_pool_attach_mutex);
+ /* BH pools aren't affected by hotplug */
+ if (pool->flags & POOL_BH)
+ continue;

+ mutex_lock(&wq_pool_attach_mutex);
if (pool->cpu == cpu)
rebind_workers(pool);
else if (pool->cpu < 0)
restore_unbound_workers_cpumask(pool, cpu);
-
mutex_unlock(&wq_pool_attach_mutex);
}

@@ -7206,10 +7339,16 @@ void __init workqueue_init_early(void)
pt->pod_node[0] = NUMA_NO_NODE;
pt->cpu_pod[0] = 0;

- /* initialize CPU pools */
+ /* initialize BH and CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;

+ i = 0;
+ for_each_bh_worker_pool(pool, cpu) {
+ init_cpu_worker_pool(pool, cpu, std_nice[i++]);
+ pool->flags |= POOL_BH;
+ }
+
i = 0;
for_each_cpu_worker_pool(pool, cpu)
init_cpu_worker_pool(pool, cpu, std_nice[i++]);
@@ -7245,10 +7384,14 @@ void __init workqueue_init_early(void)
system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_pwr_efficient",
WQ_FREEZABLE | WQ_POWER_EFFICIENT,
0);
+ system_bh_wq = alloc_workqueue("events_bh", WQ_BH, 0);
+ system_bh_highpri_wq = alloc_workqueue("events_bh_highpri",
+ WQ_BH | WQ_HIGHPRI, 0);
BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
!system_unbound_wq || !system_freezable_wq ||
!system_power_efficient_wq ||
- !system_freezable_power_efficient_wq);
+ !system_freezable_power_efficient_wq ||
+ !system_bh_wq || !system_bh_highpri_wq);
}

static void __init wq_cpu_intensive_thresh_init(void)
@@ -7314,9 +7457,10 @@ void __init workqueue_init(void)
* up. Also, create a rescuer for workqueues that requested it.
*/
for_each_possible_cpu(cpu) {
- for_each_cpu_worker_pool(pool, cpu) {
+ for_each_bh_worker_pool(pool, cpu)
+ pool->node = cpu_to_node(cpu);
+ for_each_cpu_worker_pool(pool, cpu)
pool->node = cpu_to_node(cpu);
- }
}

list_for_each_entry(wq, &workqueues, list) {
@@ -7327,7 +7471,16 @@ void __init workqueue_init(void)

mutex_unlock(&wq_pool_mutex);

- /* create the initial workers */
+ /*
+ * Create the initial workers. A BH pool has one pseudo worker that
+ * represents the shared BH execution context and thus doesn't get
+ * affected by hotplug events. Create the BH pseudo workers for all
+ * possible CPUs here.
+ */
+ for_each_possible_cpu(cpu)
+ for_each_bh_worker_pool(pool, cpu)
+ BUG_ON(!create_worker(pool));
+
for_each_online_cpu(cpu) {
for_each_cpu_worker_pool(pool, cpu) {
pool->flags &= ~POOL_DISASSOCIATED;
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index f6275944ada7..8da306b76ec9 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -10,6 +10,7 @@

#include <linux/workqueue.h>
#include <linux/kthread.h>
+#include <linux/interrupt.h>
#include <linux/preempt.h>

struct worker_pool;
@@ -42,6 +43,8 @@ struct worker {
struct list_head scheduled; /* L: scheduled works */

struct task_struct *task; /* I: worker task */
+ struct tasklet_struct bh_tasklet; /* I: tasklet for bh pool */
+
struct worker_pool *pool; /* A: the associated pool */
/* L: for rescuers */
struct list_head node; /* A: anchored at pool->workers */
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index bd381511bd9a..d29b918306b4 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -79,7 +79,9 @@ args = parser.parse_args()
wq_type_len = 9

def wq_type_str(wq):
- if wq.flags & WQ_UNBOUND:
+ if wq.flags & WQ_BH:
+ return f'{"bh":{wq_type_len}}'
+ elif wq.flags & WQ_UNBOUND:
if wq.flags & WQ_ORDERED:
return f'{"ordered":{wq_type_len}}'
else:
@@ -97,6 +99,7 @@ wq_pod_types = prog['wq_pod_types']
wq_affn_dfl = prog['wq_affn_dfl']
wq_affn_names = prog['wq_affn_names']

+WQ_BH = prog['WQ_BH']
WQ_UNBOUND = prog['WQ_UNBOUND']
WQ_ORDERED = prog['__WQ_ORDERED']
WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
@@ -107,6 +110,8 @@ WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']

+POOL_BH = prog['POOL_BH']
+
WQ_NAME_LEN = prog['WQ_NAME_LEN'].value_()
cpumask_str_len = len(cpumask_str(wq_unbound_cpumask))

@@ -151,10 +156,12 @@ max_ref_len = 0

for pi, pool in idr_for_each(worker_pool_idr):
pool = drgn.Object(prog, 'struct worker_pool', address=pool)
- print(f'pool[{pi:0{max_pool_id_len}}] ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
+ print(f'pool[{pi:0{max_pool_id_len}}] flags=0x{pool.flags.value_():02x} ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
print(f'idle/workers={pool.nr_idle.value_():3}/{pool.nr_workers.value_():3} ', end='')
if pool.cpu >= 0:
print(f'cpu={pool.cpu.value_():3}', end='')
+ if pool.flags & POOL_BH:
+ print(' bh', end='')
else:
print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
--
2.43.0


2024-01-30 09:42:43

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/8] workqueue: Update lock debugging code

These changes are in preparation for BH workqueues, which will execute work
items from the BH context.

- Update lock and RCU depth checks in process_one_work() so that it
remembers and checks against the starting depths and prints out the depth
changes.

- Factor out lockdep annotations in the flush paths into
touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
from __flush_work() to its callee - start_flush_work(). This brings it
closer to the wq counterpart and will allow testing the associated wq's
flags which will be needed to support BH workqueues. This is not expected
to cause any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..3f2081bd05a4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2954,6 +2954,7 @@ __acquires(&pool->lock)
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
unsigned long work_data;
+ int lockdep_start_depth, rcu_start_depth;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -3016,6 +3017,8 @@ __acquires(&pool->lock)
pwq->stats[PWQ_STAT_STARTED]++;
raw_spin_unlock_irq(&pool->lock);

+ rcu_start_depth = rcu_preempt_depth();
+ lockdep_start_depth = lockdep_depth(current);
lock_map_acquire(&pwq->wq->lockdep_map);
lock_map_acquire(&lockdep_map);
/*
@@ -3051,12 +3054,15 @@ __acquires(&pool->lock)
lock_map_release(&lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);

- if (unlikely(in_atomic() || lockdep_depth(current) > 0 ||
- rcu_preempt_depth() > 0)) {
- pr_err("BUG: workqueue leaked lock or atomic: %s/0x%08x/%d/%d\n"
- " last function: %ps\n",
- current->comm, preempt_count(), rcu_preempt_depth(),
- task_pid_nr(current), worker->current_func);
+ if (unlikely((worker->task && in_atomic()) ||
+ lockdep_depth(current) != lockdep_start_depth ||
+ rcu_preempt_depth() != rcu_start_depth)) {
+ pr_err("BUG: workqueue leaked atomic, lock or RCU: %s[%d]\n"
+ " preempt=0x%08x lock=%d->%d RCU=%d->%d workfn=%ps\n",
+ current->comm, task_pid_nr(current), preempt_count(),
+ lockdep_start_depth, lockdep_depth(current),
+ rcu_start_depth, rcu_preempt_depth(),
+ worker->current_func);
debug_show_held_locks(current);
dump_stack();
}
@@ -3538,6 +3544,19 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
return wait;
}

+static void touch_wq_lockdep_map(struct workqueue_struct *wq)
+{
+ lock_map_acquire(&wq->lockdep_map);
+ lock_map_release(&wq->lockdep_map);
+}
+
+static void touch_work_lockdep_map(struct work_struct *work,
+ struct workqueue_struct *wq)
+{
+ lock_map_acquire(&work->lockdep_map);
+ lock_map_release(&work->lockdep_map);
+}
+
/**
* __flush_workqueue - ensure that any scheduled work has run to completion.
* @wq: workqueue to flush
@@ -3557,8 +3576,7 @@ void __flush_workqueue(struct workqueue_struct *wq)
if (WARN_ON(!wq_online))
return;

- lock_map_acquire(&wq->lockdep_map);
- lock_map_release(&wq->lockdep_map);
+ touch_wq_lockdep_map(wq);

mutex_lock(&wq->mutex);

@@ -3757,6 +3775,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
struct worker *worker = NULL;
struct worker_pool *pool;
struct pool_workqueue *pwq;
+ struct workqueue_struct *wq;

might_sleep();

@@ -3780,11 +3799,14 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
pwq = worker->current_pwq;
}

- check_flush_dependency(pwq->wq, work);
+ wq = pwq->wq;
+ check_flush_dependency(wq, work);

insert_wq_barrier(pwq, barr, work, worker);
raw_spin_unlock_irq(&pool->lock);

+ touch_work_lockdep_map(work, wq);
+
/*
* Force a lock recursion deadlock when using flush_work() inside a
* single-threaded or rescuer equipped workqueue.
@@ -3794,11 +3816,9 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
* workqueues the deadlock happens when the rescuer stalls, blocking
* forward progress.
*/
- if (!from_cancel &&
- (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)) {
- lock_map_acquire(&pwq->wq->lockdep_map);
- lock_map_release(&pwq->wq->lockdep_map);
- }
+ if (!from_cancel && (wq->saved_max_active == 1 || wq->rescuer))
+ touch_wq_lockdep_map(wq);
+
rcu_read_unlock();
return true;
already_gone:
@@ -3817,9 +3837,6 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
if (WARN_ON(!work->func))
return false;

- lock_map_acquire(&work->lockdep_map);
- lock_map_release(&work->lockdep_map);
-
if (start_flush_work(work, &barr, from_cancel)) {
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
--
2.43.0


2024-01-30 09:47:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

Oops, forgot to cc Lai.

On Mon, Jan 29, 2024 at 11:11:47PM -1000, Tejun Heo wrote:
> Hello,
>
> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws such as
> the execution code accessing the tasklet item after the execution is
> complete which can lead to subtle use-after-free in certain usage scenarios
> and less-developed flush and cancel mechanisms.
>
> Mikulas Patocka recently reported that dm-crypt and dm-verity are affected by
> the access-after-completion issue and suggested adding TASKLET_STATE_ONESHOT
> flag which selectively removes post-completion access while significantly
> limiting how the tasklet can be used in the following thread:
>
> http://lkml.kernel.org/r/[email protected]
>
> Linus didn't like the approach and suggested extending workqueue to support
> execution from atomic context:
>
> http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
>
> As suggested, this patchset implements BH workqueues which are like regular
> workqueues but executes work items in the BH (softirq) context and converts
> several tasklet users.

Lai, the patchset is at:

http://lkml.kernel.org/r/[email protected]

I'd really appreciate if you can take a look.

Thanks.

--
tejun

Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

On 2024-01-29 23:11:47 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> As suggested, this patchset implements BH workqueues which are like regular
> workqueues but executes work items in the BH (softirq) context and converts
> several tasklet users.
>
> - The name bh is used instead of the suggested atomic as it's more in line
> with widely used execution context interface - local_bh_enable/disable()
> and friends.
>
> - The system default BH workqueues - system_bh_wq and system_bh_highpri_wq -
> are provided. As queue-wide flushing doesn't exist in tasklet, all
> existing tasklet users should be able to use the system BH workqueues
> without creating their own.
>
> - BH workqueues currently use tasklet to run the work items to avoid
> priority inversions involving tasklet_hi and WQ_BH | WQ_HIGHPRI. Once all
> tasklet users are converted, tasklet code can be removed and BH workqueues
> can take over its softirqs.

If one context creates multiple work items which are then moved to a
tasklet, I don't see the difference vs a workqueue with a bh_disable()
around it.
Looking at the USB changes, I would prefer to see it converted to
threaded interrupts instead of using a tasklet or a workqueue. Both
approaches (current tasklet, suggested workqueue) lose the original
context where the request was created. Having threaded interrupts would
allow keeping everything in the same "context" so you could prioritize
according to your needs.
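
A rough sketch of that threaded-interrupt pattern (the my_hcd_* names are
made up, not taken from the actual USB code):

  #include <linux/interrupt.h>

  /* hard-IRQ half: ack the hardware and defer the heavy lifting */
  static irqreturn_t my_hcd_hardirq(int irq, void *dev_id)
  {
          /* ... read/ack interrupt status ... */
          return IRQ_WAKE_THREAD;         /* wake the thread_fn below */
  }

  /* threaded half: runs in a kthread and can be prioritized like any task */
  static irqreturn_t my_hcd_thread_fn(int irq, void *dev_id)
  {
          /* ... give back completed URBs ... */
          return IRQ_HANDLED;
  }

  static int my_hcd_request_irq(int irq, void *dev)
  {
          return request_threaded_irq(irq, my_hcd_hardirq, my_hcd_thread_fn,
                                      IRQF_SHARED, "my-hcd", dev);
  }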

Sebastian

Subject: Re: [PATCH 7/8] dm-crypt: Convert from tasklet to BH workqueue

On 2024-01-29 23:11:54 [-1000], Tejun Heo wrote:
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2263,9 +2232,8 @@ static void kcryptd_queue_crypt(struct dm_crypt_io *io)

> /*
> * in_hardirq(): Crypto API's skcipher_walk_first() refuses to work in hard IRQ context.
> * irqs_disabled(): the kernel may run some IO completion from the idle thread, but
> * it is being executed with irqs disabled.
> */
> if (in_hardirq() || irqs_disabled()) {
> - io->in_tasklet = true;
> - tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
> - tasklet_schedule(&io->tasklet);
> + INIT_WORK(&io->work, kcryptd_crypt);
> + queue_work(system_bh_wq, &io->work);

Why do we need the tasklet here in the first place? Couldn't we use a
workqueue? As per the comment, the request originates in hardirq context and
is then moved to a tasklet. Couldn't it be moved to a workqueue regardless?

> return;
> }
>

Sebastian

2024-01-30 15:53:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 7/8] dm-crypt: Convert from tasklet to BH workqueue

Hello,

On Tue, Jan 30, 2024 at 11:46:45AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-01-29 23:11:54 [-1000], Tejun Heo wrote:
> > if (in_hardirq() || irqs_disabled()) {
> > - io->in_tasklet = true;
> > - tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
> > - tasklet_schedule(&io->tasklet);
> > + INIT_WORK(&io->work, kcryptd_crypt);
> > + queue_work(system_bh_wq, &io->work);
>
> Why do we need the tasklet here in the first place? Couldn't we use
> workqueue? As per comment, the request originates in hardirq and then it
> is moved to tasklet. Couldn't it be moved to workqueue regardless?

Yes, you can, and if you replace that system_bh_wq with system_wq or
system_highpri_wq, everything should be fine in terms of correctness.
However, that would mean that the work item would now always incur
scheduling latency, which can be lengthy in certain circumstances. Whether
that's an actual problem for the subsystem at hand, I have no idea.
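
Concretely, the difference is just which workqueue the same work item is
queued on (sketch reusing the identifiers from the patch):

          /* BH workqueue: runs in softirq on this CPU, no wakeup latency */
          queue_work(system_bh_wq, &io->work);

          /* regular workqueue: also correct, but the work item runs in a
           * kworker thread and may wait on the scheduler under load */
          queue_work(system_wq, &io->work);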

Thanks.

--
tejun

2024-01-30 16:17:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

Hello, Sebastian.

On Tue, Jan 30, 2024 at 11:20:11AM +0100, Sebastian Andrzej Siewior wrote:
> If one context creates multiple work item which are then moved to
> tasklet I don't see the difference vs workqueue with a bh_disable()
> around it.

The main difference is that it avoids scheduling latencies in scenarios
where softirq isn't heavily loaded.

> Looking at the USB changes, I would prefer to see them converted to
> threaded interrupts instead of using a tasklet or workqueue. Both
> approaches (current tasklet, suggested workqueue) lose the original
> context where the request was created. Having threaded interrupts would
> allow keeping everything in the same "context" so you could prioritize
> according to your needs.

That's great. If threaded IRQs or even regular workqueues are suitable, that
should be fine. This conversion is just targeted at the use cases which can
benefit from executing in the softirq context.

Thanks.

--
tejun

2024-01-30 16:38:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 5/8] usb: core: hcd: Convert from tasklet to BH workqueue

On Mon, Jan 29, 2024 at 11:11:52PM -1000, Tejun Heo wrote:
> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws. To
> replace tasklets, BH workqueue support was recently added. A BH workqueue
> behaves similarly to regular workqueues except that the queued work items
> are executed in the BH context.
>
> This patch converts usb hcd from tasklet to BH workqueue.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Alan Stern <[email protected]>
> Cc: [email protected]
> ---
> drivers/usb/core/hcd.c | 23 ++++++++++++-----------
> include/linux/usb/hcd.h | 2 +-
> 2 files changed, 13 insertions(+), 12 deletions(-)

Acked-by: Greg Kroah-Hartman <[email protected]>

2024-01-30 23:25:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

On Tue, 30 Jan 2024 at 01:13, Tejun Heo <[email protected]> wrote:
>
> This patch implements BH workqueues which share the same semantics and
> features of regular workqueues but execute their work items in the softirq
> context.

Thanks for doing this. Honestly, while I felt this was a natural thing
to do and would clean things up, every time I look at the workqueue
code I just shudder and go "I'm sure Tejun can work this out".

Patches look fine to me - I'd love to actually have the dm-crypt
people (and networking people, for that matter) verify that there are
no performance gotchas from the slightly heavier (but more generic)
workqueue interfaces.

Linus

2024-01-30 23:38:16

by Allen

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws such as
> the execution code accessing the tasklet item after the execution is
> complete which can lead to subtle use-after-free in certain usage scenarios
> and less-developed flush and cancel mechanisms.
>
> Mikulas Patocka recently reported that dm-crypt and dm-verity are affected by
> the access-after-completion issue and suggested adding TASKLET_STATE_ONESHOT
> flag which selectively removes post-completion access while significantly
> limiting how the tasklet can be used in the following thread:
>
> http://lkml.kernel.org/r/[email protected]
>
> Linus didn't like the approach and suggested extending workqueue to support
> execution from atomic context:
>
> http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
>
> As suggested, this patchset implements BH workqueues which are like regular
> workqueues but executes work items in the BH (softirq) context and converts
> several tasklet users.
>
> - The name bh is used instead of the suggested atomic as it's more in line
> with widely used execution context interface - local_bh_enable/disable()
> and friends.
>
> - The system default BH workqueues - system_bh_wq and system_bh_highpri_wq -
> are provided. As queue-wide flushing doesn't exist in tasklet, all
> existing tasklet users should be able to use the system BH workqueues
> without creating their own.
>
> - BH workqueues currently use tasklet to run the work items to avoid
> priority inversions involving tasklet_hi and WQ_BH | WQ_HIGHPRI. Once all
> tasklet users are converted, tasklet code can be removed and BH workqueues
> can take over its softirqs.
>
> This patchset is on top of wq/for-6.9 (aae17ebb53c ("workqueue: Avoid using
> isolated cpus' timers on queue_delayed_work")) and contains the following
> eight patches.
>
> 0001-workqueue-Update-lock-debugging-code.patch
> 0002-workqueue-Factor-out-init_cpu_worker_pool.patch
> 0003-workqueue-Implement-BH-workqueues-to-eventually-repl.patch
> 0004-backtracetest-Convert-from-tasklet-to-BH-workqueue.patch
> 0005-usb-core-hcd-Convert-from-tasklet-to-BH-workqueue.patch
> 0006-net-tcp-tsq-Convert-from-tasklet-to-BH-workqueue.patch
> 0007-dm-crypt-Convert-from-tasklet-to-BH-workqueue.patch
> 0008-dm-verity-Convert-from-tasklet-to-BH-workqueue.patch
>
> 0001-0003 prepare and implement BH workqueues.
>
> 0004-0008 convert some tasklet users to BH workqueue. The conversions are
> relatively straightforward but are in descending order of confidence.
>
> The patchset is also available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git wq-bh-v1
>

Thank you for doing this work.

Tested the series, things look good. I am going to run LTP and Unixbench
on the kernel.

Tested-by: Allen Pais <[email protected]>

Thanks.

> diffstat follows. Thanks.
>
> Documentation/core-api/workqueue.rst | 29 ++++-
> drivers/md/dm-crypt.c | 36 -------
> drivers/md/dm-verity-target.c | 8 -
> drivers/md/dm-verity.h | 2
> drivers/usb/core/hcd.c | 23 ++--
> include/linux/usb/hcd.h | 2
> include/linux/workqueue.h | 9 +
> include/net/tcp.h | 2
> kernel/backtracetest.c | 18 +--
> kernel/workqueue.c | 312 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
> kernel/workqueue_internal.h | 3
> net/ipv4/tcp.c | 2
> net/ipv4/tcp_output.c | 36 +++----
> tools/workqueue/wq_dump.py | 11 +-
> 14 files changed, 335 insertions(+), 158 deletions(-)
>
> --
> tejun



--
- Allen

2024-01-31 21:24:01

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue



On Mon, 29 Jan 2024, Tejun Heo wrote:

> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws. To
> replace tasklets, BH workqueue support was recently added. A BH workqueue
> behaves similarly to regular workqueues except that the queued work items
> are executed in the BH context.
>
> This patch converts dm-verity from tasklet to BH workqueue.
>
> This is a minimal conversion which doesn't rename the related names
> including the "try_verify_in_tasklet" option. If this patch is applied, a
> follow-up patch would be necessary. I couldn't decide whether the option
> name would need to be updated too.
>
> Only compile tested. I don't know how to verity.

Download the cryptsetup package with "git clone
https://gitlab.com/cryptsetup/cryptsetup" and run the testsuite:
./autogen.sh && ./configure && make && make check

> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Alasdair Kergon <[email protected]>
> Cc: Mike Snitzer <[email protected]>
> Cc: Mikulas Patocka <[email protected]>
> Cc: [email protected]
> ---
> drivers/md/dm-verity-target.c | 8 ++++----
> drivers/md/dm-verity.h | 2 +-
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
> index 14e58ae70521..911261de2d08 100644
> --- a/drivers/md/dm-verity-target.c
> +++ b/drivers/md/dm-verity-target.c
> @@ -645,9 +645,9 @@ static void verity_work(struct work_struct *w)
> verity_finish_io(io, errno_to_blk_status(verity_verify_io(io)));
> }
>
> -static void verity_tasklet(unsigned long data)
> +static void verity_bh_work(struct work_struct *w)
> {
> - struct dm_verity_io *io = (struct dm_verity_io *)data;
> + struct dm_verity_io *io = container_of(w, struct dm_verity_io, bh_work);
> int err;
>
> io->in_tasklet = true;
> @@ -675,8 +675,8 @@ static void verity_end_io(struct bio *bio)
> }
>
> if (static_branch_unlikely(&use_tasklet_enabled) && io->v->use_tasklet) {
> - tasklet_init(&io->tasklet, verity_tasklet, (unsigned long)io);
> - tasklet_schedule(&io->tasklet);
> + INIT_WORK(&io->bh_work, verity_bh_work);
> + queue_work(system_bh_wq, &io->bh_work);
> } else {
> INIT_WORK(&io->work, verity_work);
> queue_work(io->v->verify_wq, &io->work);
> diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h
> index f9d522c870e6..7c16f834f31a 100644
> --- a/drivers/md/dm-verity.h
> +++ b/drivers/md/dm-verity.h
> @@ -83,7 +83,7 @@ struct dm_verity_io {
> struct bvec_iter iter;
>
> struct work_struct work;
> - struct tasklet_struct tasklet;
> + struct work_struct bh_work;
>
> /*
> * Three variably-size fields follow this struct:

Do we really need two separate work_structs here? They are never submitted
concurrently, so I think that one would be enough. Or, am I missing
something?

Mikulas


2024-01-31 21:25:27

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 7/8] dm-crypt: Convert from tasklet to BH workqueue

This seems OK.

Reviewed-by: Mikulas Patocka <[email protected]>



On Mon, 29 Jan 2024, Tejun Heo wrote:

> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws. To
> replace tasklets, BH workqueue support was recently added. A BH workqueue
> behaves similarly to regular workqueues except that the queued work items
> are executed in the BH context.
>
> This patch converts dm-crypt from tasklet to BH workqueue.
>
> Like a regular workqueue, a BH workqueue allows freeing the currently
> executing work item. Converting from tasklet to BH workqueue removes the
> need for deferring bio_endio() again to a work item, which was buggy anyway.
>
> I tested this lightly with "--perf-no_read_workqueue
> --perf-no_write_workqueue" + some code modifications, but would really
> appreciate it if someone who knows the code base better could take a look.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Cc: Mikulas Patocka <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Alasdair Kergon <[email protected]>
> Cc: Mike Snitzer <[email protected]>
> Cc: [email protected]
> ---
> drivers/md/dm-crypt.c | 36 ++----------------------------------
> 1 file changed, 2 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 855b482cbff1..619c762d4072 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -73,11 +73,8 @@ struct dm_crypt_io {
> struct bio *base_bio;
> u8 *integrity_metadata;
> bool integrity_metadata_from_pool:1;
> - bool in_tasklet:1;
>
> struct work_struct work;
> - struct tasklet_struct tasklet;
> -
> struct convert_context ctx;
>
> atomic_t io_pending;
> @@ -1762,7 +1759,6 @@ static void crypt_io_init(struct dm_crypt_io *io, struct crypt_config *cc,
> io->ctx.r.req = NULL;
> io->integrity_metadata = NULL;
> io->integrity_metadata_from_pool = false;
> - io->in_tasklet = false;
> atomic_set(&io->io_pending, 0);
> }
>
> @@ -1771,13 +1767,6 @@ static void crypt_inc_pending(struct dm_crypt_io *io)
> atomic_inc(&io->io_pending);
> }
>
> -static void kcryptd_io_bio_endio(struct work_struct *work)
> -{
> - struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> -
> - bio_endio(io->base_bio);
> -}
> -
> /*
> * One of the bios was finished. Check for completion of
> * the whole request and correctly clean up the buffer.
> @@ -1800,21 +1789,6 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
> kfree(io->integrity_metadata);
>
> base_bio->bi_status = error;
> -
> - /*
> - * If we are running this function from our tasklet,
> - * we can't call bio_endio() here, because it will call
> - * clone_endio() from dm.c, which in turn will
> - * free the current struct dm_crypt_io structure with
> - * our tasklet. In this case we need to delay bio_endio()
> - * execution to after the tasklet is done and dequeued.
> - */
> - if (io->in_tasklet) {
> - INIT_WORK(&io->work, kcryptd_io_bio_endio);
> - queue_work(cc->io_queue, &io->work);
> - return;
> - }
> -
> bio_endio(base_bio);
> }
>
> @@ -2246,11 +2220,6 @@ static void kcryptd_crypt(struct work_struct *work)
> kcryptd_crypt_write_convert(io);
> }
>
> -static void kcryptd_crypt_tasklet(unsigned long work)
> -{
> - kcryptd_crypt((struct work_struct *)work);
> -}
> -
> static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> {
> struct crypt_config *cc = io->cc;
> @@ -2263,9 +2232,8 @@ static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> * it is being executed with irqs disabled.
> */
> if (in_hardirq() || irqs_disabled()) {
> - io->in_tasklet = true;
> - tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
> - tasklet_schedule(&io->tasklet);
> + INIT_WORK(&io->work, kcryptd_crypt);
> + queue_work(system_bh_wq, &io->work);
> return;
> }
>
> --
> 2.43.0
>


2024-01-31 21:32:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

Hello,

On Wed, Jan 31, 2024 at 10:19:07PM +0100, Mikulas Patocka wrote:
> > @@ -83,7 +83,7 @@ struct dm_verity_io {
> > struct bvec_iter iter;
> >
> > struct work_struct work;
> > - struct tasklet_struct tasklet;
> > + struct work_struct bh_work;
> >
> > /*
> > * Three variably-size fields follow this struct:
>
> Do we really need two separate work_structs here? They are never submitted
> concurrently, so I think that one would be enough. Or, am I missing
> something?

I don't know, so just did the dumb thing. If the caller always guarantees
that the work items are never queued at the same time, reusing is fine.
However, the following might be useful to keep in mind:

- work_struct is pretty small - 4 pointers.

- INIT_WORK() on a queued work item isn't gonna be pretty.

- Flushing and no-concurrent-execution guarantees are broken on INIT_WORK().
e.g. If you queue_work(), INIT_WORK(), flush_work(), the flush isn't
actually going to wait for the work item to finish. Also, if you do
queue_work(), INIT_WORK(), queue_work(), the two queued work item
instances may end up running concurrently.

Muxing a single work item carries more risks of subtle bugs, but in some
cases, the way it's used is clear (e.g. sequential chaining) and that's
fine.
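
A short sketch of the hazard described above (hypothetical code, not from
any of the patches):

#include <linux/workqueue.h>

static void fn_a(struct work_struct *work) { /* ... */ }
static void fn_b(struct work_struct *work) { /* ... */ }

static void reinit_hazard(struct work_struct *w)
{
	INIT_WORK(w, fn_a);
	queue_work(system_bh_wq, w);	/* instance A is now pending */

	INIT_WORK(w, fn_b);		/* re-initialized while A is still queued */
	flush_work(w);			/* may return before A has executed */

	queue_work(system_bh_wq, w);	/* instance B may run concurrently with A */
}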

Thanks.

--
tejun

2024-01-31 22:13:32

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue



On Wed, 31 Jan 2024, Tejun Heo wrote:

> Hello,
>
> On Wed, Jan 31, 2024 at 10:19:07PM +0100, Mikulas Patocka wrote:
> > > @@ -83,7 +83,7 @@ struct dm_verity_io {
> > > struct bvec_iter iter;
> > >
> > > struct work_struct work;
> > > - struct tasklet_struct tasklet;
> > > + struct work_struct bh_work;
> > >
> > > /*
> > > * Three variably-size fields follow this struct:
> >
> > Do we really need two separate work_structs here? They are never submitted
> > concurrently, so I think that one would be enough. Or, am I missing
> > something?
>
> I don't know, so just did the dumb thing. If the caller always guarantees
> that the work items are never queued at the same time, reusing is fine.
> However, the following might be useful to keep in mind:
>
> - work_struct is pretty small - 4 pointers.
>
> - INIT_WORK() on a queued work item isn't gonna be pretty.
>
> - Flushing and no-concurrent-execution guarantees are broken on INIT_WORK().
> e.g. If you queue_work(), INIT_WORK(), flush_work(), the flush isn't
> actually going to wait for the work item to finish. Also, if you do
> queue_work(), INIT_WORK(), queue_work(), the two queued work item
> instances may end up running concurrently.
>
> Muxing a single work item carries more risks of subtle bugs, but in some
> cases, the way it's used is clear (e.g. sequential chaining) and that's
> fine.

The code doesn't call INIT_WORK() on a queued work item and it doesn't
flush the workqueue (it only destroys it when there are no work items
running), so I think it's safe to use just one work_struct.

Mikulas


2024-01-31 23:32:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

On Wed, 31 Jan 2024 at 13:32, Tejun Heo <[email protected]> wrote:
>
> I don't know, so just did the dumb thing. If the caller always guarantees
> that the work items are never queued at the same time, reusing is fine.

So the reason I thought it would be a good cleanup to introduce that
"atomic" workqueue thing (now "bh") was that this case literally has a
switch between "use tasklets' or "use workqueues".

So it's not even about "reusing" the workqueue, it's literally a
matter of making it always just use workqueues, and the switch then
becomes just *which* workqueue to use - system or bh.

In fact, I suspect there is very little reason ever to *not* just use
the bh one, and even the switch could be removed.

Because I think the only reason the "workqueue of tasklet" choice
existed in the first place was that workqueues were the "proper" data
structure, and the tasklet case was added later as a latency hack, and
everybody knew that tasklets were deprecated.

Linus

2024-02-01 00:04:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

Hello, Linus.

On Wed, Jan 31, 2024 at 03:19:01PM -0800, Linus Torvalds wrote:
> On Wed, 31 Jan 2024 at 13:32, Tejun Heo <[email protected]> wrote:
> >
> > I don't know, so just did the dumb thing. If the caller always guarantees
> > that the work items are never queued at the same time, reusing is fine.
>
> So the reason I thought it would be a good cleanup to introduce that
> "atomic" workqueue thing (now "bh") was that this case literally has a
> switch between "use tasklets" or "use workqueues".
>
> So it's not even about "reusing" the workqueue, it's literally a
> matter of making it always just use workqueues, and the switch then
> becomes just *which* workqueue to use - system or bh.

Yeah, that's how dm-crypt got converted in the patch just before this one.
This one can probably be converted the same way. I don't see the work item
being re-initialized. It is probably better to initialize the work item
together with the enclosing struct and then just queue it when needed.
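
A sketch of that init-once pattern (illustrative names, not the actual
dm-verity identifiers):

#include <linux/workqueue.h>

struct my_io {
	struct work_struct bh_work;
	/* ... */
};

static void my_bh_work_fn(struct work_struct *w) { /* ... */ }

/* done once, when the enclosing structure is set up */
static void my_io_init(struct my_io *io)
{
	INIT_WORK(&io->bh_work, my_bh_work_fn);
}

/* the hot path only queues the already-initialized work item */
static void my_io_endio(struct my_io *io)
{
	queue_work(system_bh_wq, &io->bh_work);
}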

Mikulas, I couldn't decide what to do with the "try_verify_in_tasklet"
option and just decided to do the minimal thing, hoping that someone more
familiar with the code can take over the actual conversion. How much of a
user interface commitment is that? Should it be renamed, or would it be
better to leave it be?

Thanks.

--
tejun

2024-02-01 00:08:15

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

On Wed, Jan 31 2024 at 6:19P -0500,
Linus Torvalds <[email protected]> wrote:

> On Wed, 31 Jan 2024 at 13:32, Tejun Heo <[email protected]> wrote:
> >
> > I don't know, so just did the dumb thing. If the caller always guarantees
> > that the work items are never queued at the same time, reusing is fine.
>
> So the reason I thought it would be a good cleanup to introduce that
> "atomic" workqueue thing (now "bh") was that this case literally has a
> switch between "use tasklets" or "use workqueues".
>
> So it's not even about "reusing" the workqueue, it's literally a
> matter of making it always just use workqueues, and the switch then
> becomes just *which* workqueue to use - system or bh.

DM generally always uses dedicated workqueues instead of the system ones.

The dm-crypt tasklet's completion path did punt to the workqueue,
otherwise there was a use-after-free of the per-bio-data that included
the tasklet. And for verity there was a fallback to the workqueue if
tasklet-based verification failed. Didn't inspire confidence.

> In fact, I suspect there is very little reason ever to *not* just use
> the bh one, and even the switch could be removed.
>
> Because I think the only reason the "workqueue of tasklet" choice
> existed in the first place was that workqueues were the "proper" data
> structure, and the tasklet case was added later as a latency hack, and
> everybody knew that tasklets were deprecated.

Correct, abusing tasklets was a very contrived latency optimization.
Happy to see it all go away! (hindsight: it never should have gone in).

Mike

2024-02-01 00:19:31

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

On Wed, Jan 31 2024 at 7:04P -0500,
Tejun Heo <[email protected]> wrote:

> Hello, Linus.
>
> On Wed, Jan 31, 2024 at 03:19:01PM -0800, Linus Torvalds wrote:
> > On Wed, 31 Jan 2024 at 13:32, Tejun Heo <[email protected]> wrote:
> > >
> > > I don't know, so just did the dumb thing. If the caller always guarantees
> > > that the work items are never queued at the same time, reusing is fine.
> >
> > So the reason I thought it would be a good cleanup to introduce that
> > "atomic" workqueue thing (now "bh") was that this case literally has a
> > switch between "use tasklets" or "use workqueues".
> >
> > So it's not even about "reusing" the workqueue, it's literally a
> > matter of making it always just use workqueues, and the switch then
> > becomes just *which* workqueue to use - system or bh.
>
> Yeah, that's how the dm-crypt got converted. The patch just before this one.
> This one probably can be converted the same way. I don't see the work item
> being re-initialized. It probably is better to initialize the work item
> together with the enclosing struct and then just queue it when needed.

Sounds good.

> Mikulas, I couldn't decide what to do with the "try_verify_in_tasklet"
> option and just decided to do the minimal thing hoping that someone more
> familiar with the code can take over the actual conversion. How much of user
> interface commitment is that? Should it be renamed or would it be better to
> leave it be?

cryptsetup did add support for it, so I think it's worthwhile to
preserve the option; but it'd be fine to have it just be a
backward-compatible alias for a more appropriately named option?
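
A hedged sketch of what such an alias could look like in the target's
argument parsing (entirely hypothetical names; the real dm-verity option
handling differs):

#include <linux/string.h>

struct my_verity_cfg {
	bool use_bh_wq;
};

static void my_parse_opt(struct my_verity_cfg *v, const char *arg_name)
{
	/* accept the new name and keep "try_verify_in_tasklet" as a legacy alias */
	if (!strcasecmp(arg_name, "try_verify_in_bh") ||
	    !strcasecmp(arg_name, "try_verify_in_tasklet"))
		v->use_bh_wq = true;
}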

Mike

2024-02-01 11:03:58

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

Hello, Tejun

On Tue, Jan 30, 2024 at 5:16 PM Tejun Heo <[email protected]> wrote:

> @@ -1184,6 +1211,14 @@ static bool kick_pool(struct worker_pool *pool)
> if (!need_more_worker(pool) || !worker)
> return false;
>
> + if (pool->flags & POOL_BH) {
> + if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
> + tasklet_hi_schedule(&worker->bh_tasklet);
> + else
> + tasklet_schedule(&worker->bh_tasklet);
> + return true;
> + }

I think it is more straightforward to call bh_worker_taskletfn[_hi]()
in tasklet_action() and tasklet_hi_action() rather than adding a
worker->bh_tasklet.

raise_softirq_irqoff() can be used here (in kick_pool()) instead.

As the changelog said, once the conversion is complete, tasklet can be
removed and BH workqueues can directly take over the tasklet softirqs,
in which case bh_worker_taskletfn[_hi]() can directly take over
tasklet_action() and tasklet_hi_action().


> +
> p = worker->task;
>
> #ifdef CONFIG_SMP
> @@ -1663,8 +1698,16 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
> lockdep_assert_held(&pool->lock);
>
> if (!nna) {
> - /* per-cpu workqueue, pwq->nr_active is sufficient */
> - obtained = pwq->nr_active < READ_ONCE(wq->max_active);
> + /*
> + * BH workqueues always share a single execution context per CPU
> + * and don't impose any max_active limit, so tryinc always
> + * succeeds. For a per-cpu workqueue, checking pwq->nr_active is
> + * sufficient.
> + */
> + if (wq->flags & WQ_BH)
> + obtained = true;
> + else
> + obtained = pwq->nr_active < READ_ONCE(wq->max_active);

I think wq->max_active can be forced to be UINT_MAX or ULONG_MAX
in the max_active management code to avoid a branch here.

> goto out;
> }
>
> @@ -2599,27 +2642,31 @@ static struct worker *create_worker(struct worker_pool *pool)
>
> worker->id = id;
>
> - if (pool->cpu >= 0)
> - snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
> - pool->attrs->nice < 0 ? "H" : "");
> - else
> - snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
> -
> - worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
> - "kworker/%s", id_buf);
> - if (IS_ERR(worker->task)) {
> - if (PTR_ERR(worker->task) == -EINTR) {
> - pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
> - id_buf);
> - } else {
> - pr_err_once("workqueue: Failed to create a worker thread: %pe",
> - worker->task);
> + if (pool->flags & POOL_BH) {
> + tasklet_setup(&worker->bh_tasklet, bh_worker_taskletfn);
> + } else {
> + if (pool->cpu >= 0)
> + snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
> + pool->attrs->nice < 0 ? "H" : "");
> + else
> + snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
> +
> + worker->task = kthread_create_on_node(worker_thread, worker,
> + pool->node, "kworker/%s", id_buf);
> + if (IS_ERR(worker->task)) {
> + if (PTR_ERR(worker->task) == -EINTR) {
> + pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
> + id_buf);
> + } else {
> + pr_err_once("workqueue: Failed to create a worker thread: %pe",
> + worker->task);
> + }
> + goto fail;
> }
> - goto fail;
> - }
>
> - set_user_nice(worker->task, pool->attrs->nice);
> - kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
> + set_user_nice(worker->task, pool->attrs->nice);
> + kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
> + }
>
> /* successful, attach the worker to the pool */
> worker_attach_to_pool(worker, pool);

worker_attach_to_pool() and worker_detach_from_pool() also access
worker->task via kthread_set_per_cpu(), and luckily to_kthread()
checks for a NULL pointer there.

IMO, it is better to avoid calling kthread_set_per_cpu() for BH workers.




> @@ -5605,7 +5731,12 @@ static void pr_cont_pool_info(struct worker_pool *pool)
> pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
> if (pool->node != NUMA_NO_NODE)
> pr_cont(" node=%d", pool->node);
> - pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
> + pr_cont(" flags=0x%x", pool->flags);
> + if (pool->flags & POOL_BH)
> + pr_cont(" bh%s",
> + pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
> + else
> + pr_cont(" nice=%d", pool->attrs->nice);

There are also some "worker->task" uses in show_pwq(), show_one_worker_pool(),
and show_cpu_pool_hog() that need taking care of.

Thanks
Lai

2024-02-01 21:57:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

Hello, Lai.

On Thu, Feb 01, 2024 at 07:02:27PM +0800, Lai Jiangshan wrote:
> On Tue, Jan 30, 2024 at 5:16 PM Tejun Heo <[email protected]> wrote:
> > @@ -1184,6 +1211,14 @@ static bool kick_pool(struct worker_pool *pool)
> > if (!need_more_worker(pool) || !worker)
> > return false;
> >
> > + if (pool->flags & POOL_BH) {
> > + if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
> > + tasklet_hi_schedule(&worker->bh_tasklet);
> > + else
> > + tasklet_schedule(&worker->bh_tasklet);
> > + return true;
> > + }
>
> I think it is more straightforward to call bh_worker_taskletfn[_hi]()
> in tasklet_action() and tasklet_hi_action() rather than add a
> worker->bh_tasklet.
>
> raise_softirq_irqoff() can be used here (kick_pool()) instead.
>
> As the changelog said, once the conversion is complete, tasklet can be
> removed and BH workqueues can directly take over the tasklet softirqs,
> in which case, then, bh_worker_taskletfn[_hi]() can directly take over
> the tasklet_action() and tasklet_hi_action().

Hmmm.... maybe. Yeah, that'd also make it a tiny bit cheaper for hot paths.
Lemme see how that looks.

> I think wq->max_active can be forced to be UINT_MAX or ULONG_MAX
> in the max_active management code to avoid a branch here.

Good point. Will update.

> worker_attach_to_pool() and worker_detach_from_pool also access to
> worker->task with kthread_set_per_cpu() and luckily to_kthread()
> checks the NULL pointer for it.
>
> IMO, it is better to avoid calling kthread_set_per_cpu() for bh workers.

Note that BH worker pools are always DISASSOCIATED, so
worker_attach_to_pool() shouldn't call kthread_set_per_cpu(). Also, BH
workers and worker pools are never destroyed, so worker_detach_from_pool()
shouldn't be called either. I'll add WARN_ONs to clarify.

> > @@ -5605,7 +5731,12 @@ static void pr_cont_pool_info(struct worker_pool *pool)
> > pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
> > if (pool->node != NUMA_NO_NODE)
> > pr_cont(" node=%d", pool->node);
> > - pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
> > + pr_cont(" flags=0x%x", pool->flags);
> > + if (pool->flags & POOL_BH)
> > + pr_cont(" bh%s",
> > + pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
> > + else
> > + pr_cont(" nice=%d", pool->attrs->nice);
>
> There are also some "worker->task" in show_pwq(), show_one_worker_pool(),
> and show_cpu_pool_hog() needing taking care of.

Ah, right, I'll hunt them down.

Thanks.

--
tejun

2024-02-02 01:17:37

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws such as
the execution code accessing the tasklet item after the execution is
complete which can lead to subtle use-after-free in certain usage scenarios
and less-developed flush and cancel mechanisms.

This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.

v2: - Instead of using tasklets, hook directly into its softirq action
functions - tasklet[_hi]_action(). This is slightly cheaper and closer
to the eventual code structure we want to arrive at. Suggested by Lai.

- Lai also pointed out several places which need NULL worker->task
handling or can use clarification. Updated.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
Hello,

The git branch is also updated. Once Lai acks, I'll land the workqueue part
in wq/for-6.9 and create a separate branch for conversion. I'll apply the
safe / acked ones only.

Thanks.

Documentation/core-api/workqueue.rst | 29 ++-
include/linux/workqueue.h | 11 +
kernel/softirq.c | 3 +
kernel/workqueue.c | 290 ++++++++++++++++++++++-----
tools/workqueue/wq_dump.py | 11 +-
5 files changed, 284 insertions(+), 60 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 33c4539155d9..2d6af6c4665c 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

-Special purpose threads, called worker threads, execute the functions
-off of the queue, one after the other. If no work is queued, the
-worker threads become idle. These worker threads are managed in so
-called worker-pools.
+A work item can be executed in either a thread or the BH (softirq) context.
+
+For threaded workqueues, special purpose threads, called [k]workers, execute
+the functions off of the queue, one after the other. If no work is queued,
+the worker threads become idle. These worker threads are managed in
+worker-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
@@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.

+BH workqueues use the same framework. However, as there can only be one
+concurrent execution context, there's no need to worry about concurrency.
+Each per-CPU BH worker pool contains only one pseudo worker which represents
+the BH execution context. A BH workqueue can be considered a convenience
+interface to softirq.
+
Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
@@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will
be queued on the worklist of either normal or highpri worker-pool that
is associated to the CPU the issuer is running on.

-For any worker pool implementation, managing the concurrency level
+For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue. cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
@@ -164,6 +172,17 @@ resources, scheduled and executed.
``flags``
---------

+``WQ_BH``
+ BH workqueues can be considered a convenience interface to softirq. BH
+ workqueues are always per-CPU and all BH work items are executed in the
+ queueing CPU's softirq context in the queueing order.
+
+ All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
+ only allowed additional flag.
+
+ BH work items cannot sleep. All other features such as delayed queueing,
+ flushing and canceling are supported.
+
``WQ_UNBOUND``
Work items queued to an unbound wq are served by the special
worker-pools which host workers which are not bound to any
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 232baea90a1d..283d7891b4c4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -353,6 +353,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* Documentation/core-api/workqueue.rst.
*/
enum wq_flags {
+ WQ_BH = 1 << 0, /* execute in bottom half (softirq) context */
WQ_UNBOUND = 1 << 1, /* not bound to any cpu */
WQ_FREEZABLE = 1 << 2, /* freeze during suspend */
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
@@ -392,6 +393,9 @@ enum wq_flags {
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
__WQ_ORDERED_EXPLICIT = 1 << 19, /* internal: alloc_ordered_workqueue() */
+
+ /* BH wq only allows the following flags */
+ __WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
};

enum wq_consts {
@@ -434,6 +438,9 @@ enum wq_consts {
* they are same as their non-power-efficient counterparts - e.g.
* system_power_efficient_wq is identical to system_wq if
* 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
+ *
+ * system_bh[_highpri]_wq are convenience interface to softirq. BH work items
+ * are executed in the queueing CPU's BH context in the queueing order.
*/
extern struct workqueue_struct *system_wq;
extern struct workqueue_struct *system_highpri_wq;
@@ -442,6 +449,10 @@ extern struct workqueue_struct *system_unbound_wq;
extern struct workqueue_struct *system_freezable_wq;
extern struct workqueue_struct *system_power_efficient_wq;
extern struct workqueue_struct *system_freezable_power_efficient_wq;
+extern struct workqueue_struct *system_bh_wq;
+extern struct workqueue_struct *system_bh_highpri_wq;
+
+void workqueue_softirq_action(bool highpri);

/**
* alloc_workqueue - allocate a workqueue
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 210cf5f8d92c..547d282548a8 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -27,6 +27,7 @@
#include <linux/tick.h>
#include <linux/irq.h>
#include <linux/wait_bit.h>
+#include <linux/workqueue.h>

#include <asm/softirq_stack.h>

@@ -802,11 +803,13 @@ static void tasklet_action_common(struct softirq_action *a,

static __latent_entropy void tasklet_action(struct softirq_action *a)
{
+ workqueue_softirq_action(false);
tasklet_action_common(a, this_cpu_ptr(&tasklet_vec), TASKLET_SOFTIRQ);
}

static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
{
+ workqueue_softirq_action(true);
tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
}

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bbb1909fad94..69845af0a01a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -72,8 +72,12 @@ enum worker_pool_flags {
* Note that DISASSOCIATED should be flipped only while holding
* wq_pool_attach_mutex to avoid changing binding state while
* worker_attach_to_pool() is in progress.
+ *
+ * As there can only be one concurrent BH execution context per CPU, a
+ * BH pool is per-CPU and always DISASSOCIATED.
*/
- POOL_MANAGER_ACTIVE = 1 << 0, /* being managed */
+ POOL_BH = 1 << 0, /* is a BH pool */
+ POOL_MANAGER_ACTIVE = 1 << 1, /* being managed */
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
};

@@ -115,6 +119,14 @@ enum wq_internal_consts {
WQ_NAME_LEN = 32,
};

+/*
+ * We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
+ * MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
+ * msecs_to_jiffies() can't be an initializer.
+ */
+#define BH_WORKER_JIFFIES msecs_to_jiffies(2)
+#define BH_WORKER_RESTARTS 10
+
/*
* Structure fields follow one of the following exclusion rules.
*
@@ -443,8 +455,13 @@ static bool wq_debug_force_rr_cpu = false;
#endif
module_param_named(debug_force_rr_cpu, wq_debug_force_rr_cpu, bool, 0644);

+/* the BH worker pools */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+ bh_worker_pools);
+
/* the per-cpu worker pools */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], cpu_worker_pools);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+ cpu_worker_pools);

static DEFINE_IDR(worker_pool_idr); /* PR: idr of all pools */

@@ -478,6 +495,10 @@ struct workqueue_struct *system_power_efficient_wq __ro_after_init;
EXPORT_SYMBOL_GPL(system_power_efficient_wq);
struct workqueue_struct *system_freezable_power_efficient_wq __ro_after_init;
EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);
+struct workqueue_struct *system_bh_wq;
+EXPORT_SYMBOL_GPL(system_bh_wq);
+struct workqueue_struct *system_bh_highpri_wq;
+EXPORT_SYMBOL_GPL(system_bh_highpri_wq);

static int worker_thread(void *__worker);
static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
@@ -498,6 +519,11 @@ static void show_one_worker_pool(struct worker_pool *pool);
!lockdep_is_held(&wq_pool_mutex), \
"RCU, wq->mutex or wq_pool_mutex should be held")

+#define for_each_bh_worker_pool(pool, cpu) \
+ for ((pool) = &per_cpu(bh_worker_pools, cpu)[0]; \
+ (pool) < &per_cpu(bh_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+ (pool)++)
+
#define for_each_cpu_worker_pool(pool, cpu) \
for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
(pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
@@ -1186,6 +1212,14 @@ static bool kick_pool(struct worker_pool *pool)
if (!need_more_worker(pool) || !worker)
return false;

+ if (pool->flags & POOL_BH) {
+ if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ raise_softirq_irqoff(HI_SOFTIRQ);
+ else
+ raise_softirq_irqoff(TASKLET_SOFTIRQ);
+ return true;
+ }
+
p = worker->task;

#ifdef CONFIG_SMP
@@ -1668,7 +1702,7 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
lockdep_assert_held(&pool->lock);

if (!nna) {
- /* per-cpu workqueue, pwq->nr_active is sufficient */
+ /* BH or per-cpu workqueue, pwq->nr_active is sufficient */
obtained = pwq->nr_active < READ_ONCE(wq->max_active);
goto out;
}
@@ -2517,19 +2551,21 @@ static cpumask_t *pool_allowed_cpus(struct worker_pool *pool)
* cpu-[un]hotplugs.
*/
static void worker_attach_to_pool(struct worker *worker,
- struct worker_pool *pool)
+ struct worker_pool *pool)
{
mutex_lock(&wq_pool_attach_mutex);

/*
- * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
- * stable across this function. See the comments above the flag
- * definition for details.
+ * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains stable
+ * across this function. See the comments above the flag definition for
+ * details. BH workers are, while per-CPU, always DISASSOCIATED.
*/
- if (pool->flags & POOL_DISASSOCIATED)
+ if (pool->flags & POOL_DISASSOCIATED) {
worker->flags |= WORKER_UNBOUND;
- else
+ } else {
+ WARN_ON_ONCE(pool->flags & POOL_BH);
kthread_set_per_cpu(worker->task, pool->cpu);
+ }

if (worker->rescue_wq)
set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool));
@@ -2553,6 +2589,9 @@ static void worker_detach_from_pool(struct worker *worker)
struct worker_pool *pool = worker->pool;
struct completion *detach_completion = NULL;

+ /* there is one permanent BH worker per CPU which should never detach */
+ WARN_ON_ONCE(pool->flags & POOL_BH);
+
mutex_lock(&wq_pool_attach_mutex);

kthread_set_per_cpu(worker->task, -1);
@@ -2604,27 +2643,29 @@ static struct worker *create_worker(struct worker_pool *pool)

worker->id = id;

- if (pool->cpu >= 0)
- snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
- pool->attrs->nice < 0 ? "H" : "");
- else
- snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
-
- worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
- "kworker/%s", id_buf);
- if (IS_ERR(worker->task)) {
- if (PTR_ERR(worker->task) == -EINTR) {
- pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
- id_buf);
- } else {
- pr_err_once("workqueue: Failed to create a worker thread: %pe",
- worker->task);
+ if (!(pool->flags & POOL_BH)) {
+ if (pool->cpu >= 0)
+ snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+ pool->attrs->nice < 0 ? "H" : "");
+ else
+ snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+ worker->task = kthread_create_on_node(worker_thread, worker,
+ pool->node, "kworker/%s", id_buf);
+ if (IS_ERR(worker->task)) {
+ if (PTR_ERR(worker->task) == -EINTR) {
+ pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
+ id_buf);
+ } else {
+ pr_err_once("workqueue: Failed to create a worker thread: %pe",
+ worker->task);
+ }
+ goto fail;
}
- goto fail;
- }

- set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+ set_user_nice(worker->task, pool->attrs->nice);
+ kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+ }

/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
@@ -2640,7 +2681,8 @@ static struct worker *create_worker(struct worker_pool *pool)
* check if not woken up soon. As kick_pool() is noop if @pool is empty,
* wake it up explicitly.
*/
- wake_up_process(worker->task);
+ if (worker->task)
+ wake_up_process(worker->task);

raw_spin_unlock_irq(&pool->lock);

@@ -2982,7 +3024,8 @@ __acquires(&pool->lock)
worker->current_work = work;
worker->current_func = work->func;
worker->current_pwq = pwq;
- worker->current_at = worker->task->se.sum_exec_runtime;
+ if (worker->task)
+ worker->current_at = worker->task->se.sum_exec_runtime;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

@@ -3080,7 +3123,8 @@ __acquires(&pool->lock)
* stop_machine. At the same time, report a quiescent RCU state so
* the same condition doesn't freeze RCU.
*/
- cond_resched();
+ if (worker->task)
+ cond_resched();

raw_spin_lock_irq(&pool->lock);

@@ -3363,6 +3407,61 @@ static int rescuer_thread(void *__rescuer)
goto repeat;
}

+static void bh_worker(struct worker *worker)
+{
+ struct worker_pool *pool = worker->pool;
+ int nr_restarts = BH_WORKER_RESTARTS;
+ unsigned long end = jiffies + BH_WORKER_JIFFIES;
+
+ raw_spin_lock_irq(&pool->lock);
+ worker_leave_idle(worker);
+
+ /*
+ * This function follows the structure of worker_thread(). See there for
+ * explanations on each step.
+ */
+ if (!need_more_worker(pool))
+ goto done;
+
+ WARN_ON_ONCE(!list_empty(&worker->scheduled));
+ worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
+
+ do {
+ struct work_struct *work =
+ list_first_entry(&pool->worklist,
+ struct work_struct, entry);
+
+ if (assign_work(work, worker, NULL))
+ process_scheduled_works(worker);
+ } while (keep_working(pool) &&
+ --nr_restarts && time_before(jiffies, end));
+
+ worker_set_flags(worker, WORKER_PREP);
+done:
+ worker_enter_idle(worker);
+ kick_pool(pool);
+ raw_spin_unlock_irq(&pool->lock);
+}
+
+/*
+ * TODO: Convert all tasklet users to workqueue and use softirq directly.
+ *
+ * This is currently called from tasklet[_hi]action() and thus is also called
+ * whenever there are tasklets to run. Let's do an early exit if there's nothing
+ * queued. Once conversion from tasklet is complete, the need_more_worker() test
+ * can be dropped.
+ *
+ * After full conversion, we'll add worker->softirq_action, directly use the
+ * softirq action and obtain the worker pointer from the softirq_action pointer.
+ */
+void workqueue_softirq_action(bool highpri)
+{
+ struct worker_pool *pool =
+ &per_cpu(bh_worker_pools, smp_processor_id())[highpri];
+ if (need_more_worker(pool))
+ bh_worker(list_first_entry(&pool->workers, struct worker, node));
+}
+
/**
* check_flush_dependency - check for flush dependency sanity
* @target_wq: workqueue being flushed
@@ -3435,6 +3534,7 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
struct wq_barrier *barr,
struct work_struct *target, struct worker *worker)
{
+ static __maybe_unused struct lock_class_key bh_key, thr_key;
unsigned int work_flags = 0;
unsigned int work_color;
struct list_head *head;
@@ -3444,8 +3544,13 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
* as we know for sure that this will not trigger any of the
* checks and call back into the fixup functions where we
* might deadlock.
+ *
+ * BH and threaded workqueues need separate lockdep keys to avoid
+ * spuriously triggering "inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W}
+ * usage".
*/
- INIT_WORK_ONSTACK(&barr->work, wq_barrier_func);
+ INIT_WORK_ONSTACK_KEY(&barr->work, wq_barrier_func,
+ (pwq->wq->flags & WQ_BH) ? &bh_key : &thr_key);
__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));

init_completion_map(&barr->done, &target->lockdep_map);
@@ -3551,15 +3656,31 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,

static void touch_wq_lockdep_map(struct workqueue_struct *wq)
{
+#ifdef CONFIG_LOCKDEP
+ if (wq->flags & WQ_BH)
+ local_bh_disable();
+
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
+
+ if (wq->flags & WQ_BH)
+ local_bh_enable();
+#endif
}

static void touch_work_lockdep_map(struct work_struct *work,
struct workqueue_struct *wq)
{
+#ifdef CONFIG_LOCKDEP
+ if (wq->flags & WQ_BH)
+ local_bh_disable();
+
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
+
+ if (wq->flags & WQ_BH)
+ local_bh_enable();
+#endif
}

/**
@@ -5013,10 +5134,17 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)

if (!(wq->flags & WQ_UNBOUND)) {
for_each_possible_cpu(cpu) {
- struct pool_workqueue **pwq_p =
- per_cpu_ptr(wq->cpu_pwq, cpu);
- struct worker_pool *pool =
- &(per_cpu_ptr(cpu_worker_pools, cpu)[highpri]);
+ struct pool_workqueue **pwq_p;
+ struct worker_pool __percpu *pools;
+ struct worker_pool *pool;
+
+ if (wq->flags & WQ_BH)
+ pools = bh_worker_pools;
+ else
+ pools = cpu_worker_pools;
+
+ pool = &(per_cpu_ptr(pools, cpu)[highpri]);
+ pwq_p = per_cpu_ptr(wq->cpu_pwq, cpu);

*pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL,
pool->node);
@@ -5191,6 +5319,13 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
size_t wq_size;
int name_len;

+ if (flags & WQ_BH) {
+ if (WARN_ON_ONCE(flags & ~__WQ_BH_ALLOWS))
+ return NULL;
+ if (WARN_ON_ONCE(max_active))
+ return NULL;
+ }
+
/*
* Unbound && max_active == 1 used to imply ordered, which is no longer
* the case on many machines due to per-pod pools. While
@@ -5228,8 +5363,16 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
pr_warn_once("workqueue: name exceeds WQ_NAME_LEN. Truncating to: %s\n",
wq->name);

- max_active = max_active ?: WQ_DFL_ACTIVE;
- max_active = wq_clamp_max_active(max_active, flags, wq->name);
+ if (flags & WQ_BH) {
+ /*
+ * BH workqueues always share a single execution context per CPU
+ * and don't impose any max_active limit.
+ */
+ max_active = INT_MAX;
+ } else {
+ max_active = max_active ?: WQ_DFL_ACTIVE;
+ max_active = wq_clamp_max_active(max_active, flags, wq->name);
+ }

/* init wq */
wq->flags = flags;
@@ -5410,6 +5553,9 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
*/
void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
{
+ /* max_active doesn't mean anything for BH workqueues */
+ if (WARN_ON(wq->flags & WQ_BH))
+ return;
/* disallow meddling with max_active for ordered workqueues */
if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
return;
@@ -5611,7 +5757,24 @@ static void pr_cont_pool_info(struct worker_pool *pool)
pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
if (pool->node != NUMA_NO_NODE)
pr_cont(" node=%d", pool->node);
- pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+ pr_cont(" flags=0x%x", pool->flags);
+ if (pool->flags & POOL_BH)
+ pr_cont(" bh%s",
+ pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ else
+ pr_cont(" nice=%d", pool->attrs->nice);
+}
+
+static void pr_cont_worker_id(struct worker *worker)
+{
+ struct worker_pool *pool = worker->pool;
+
+ if (pool->flags & WQ_BH)
+ pr_cont("bh%s",
+ pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ else
+ pr_cont("%d%s", task_pid_nr(worker->task),
+ worker->rescue_wq ? "(RESCUER)" : "");
}

struct pr_cont_work_struct {
@@ -5688,10 +5851,9 @@ static void show_pwq(struct pool_workqueue *pwq)
if (worker->current_pwq != pwq)
continue;

- pr_cont("%s %d%s:%ps", comma ? "," : "",
- task_pid_nr(worker->task),
- worker->rescue_wq ? "(RESCUER)" : "",
- worker->current_func);
+ pr_cont(" %s", comma ? "," : "");
+ pr_cont_worker_id(worker);
+ pr_cont(":%ps", worker->current_func);
list_for_each_entry(work, &worker->scheduled, entry)
pr_cont_work(false, work, &pcws);
pr_cont_work_flush(comma, (work_func_t)-1L, &pcws);
@@ -5810,8 +5972,8 @@ static void show_one_worker_pool(struct worker_pool *pool)
pr_cont(" manager: %d",
task_pid_nr(pool->manager->task));
list_for_each_entry(worker, &pool->idle_list, entry) {
- pr_cont(" %s%d", first ? "idle: " : "",
- task_pid_nr(worker->task));
+ pr_cont(" %s", first ? "idle: " : "");
+ pr_cont_worker_id(worker);
first = false;
}
pr_cont("\n");
@@ -6084,13 +6246,15 @@ int workqueue_online_cpu(unsigned int cpu)
mutex_lock(&wq_pool_mutex);

for_each_pool(pool, pi) {
- mutex_lock(&wq_pool_attach_mutex);
+ /* BH pools aren't affected by hotplug */
+ if (pool->flags & POOL_BH)
+ continue;

+ mutex_lock(&wq_pool_attach_mutex);
if (pool->cpu == cpu)
rebind_workers(pool);
else if (pool->cpu < 0)
restore_unbound_workers_cpumask(pool, cpu);
-
mutex_unlock(&wq_pool_attach_mutex);
}

@@ -7047,7 +7211,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
/* did we stall? */
if (time_after(now, ts + thresh)) {
lockup_detected = true;
- if (pool->cpu >= 0) {
+ if (pool->cpu >= 0 && !(pool->flags & POOL_BH)) {
pool->cpu_stall = true;
cpu_pool_stall = true;
}
@@ -7212,10 +7376,16 @@ void __init workqueue_init_early(void)
pt->pod_node[0] = NUMA_NO_NODE;
pt->cpu_pod[0] = 0;

- /* initialize CPU pools */
+ /* initialize BH and CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;

+ i = 0;
+ for_each_bh_worker_pool(pool, cpu) {
+ init_cpu_worker_pool(pool, cpu, std_nice[i++]);
+ pool->flags |= POOL_BH;
+ }
+
i = 0;
for_each_cpu_worker_pool(pool, cpu)
init_cpu_worker_pool(pool, cpu, std_nice[i++]);
@@ -7251,10 +7421,14 @@ void __init workqueue_init_early(void)
system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_pwr_efficient",
WQ_FREEZABLE | WQ_POWER_EFFICIENT,
0);
+ system_bh_wq = alloc_workqueue("events_bh", WQ_BH, 0);
+ system_bh_highpri_wq = alloc_workqueue("events_bh_highpri",
+ WQ_BH | WQ_HIGHPRI, 0);
BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
!system_unbound_wq || !system_freezable_wq ||
!system_power_efficient_wq ||
- !system_freezable_power_efficient_wq);
+ !system_freezable_power_efficient_wq ||
+ !system_bh_wq || !system_bh_highpri_wq);
}

static void __init wq_cpu_intensive_thresh_init(void)
@@ -7320,9 +7494,10 @@ void __init workqueue_init(void)
* up. Also, create a rescuer for workqueues that requested it.
*/
for_each_possible_cpu(cpu) {
- for_each_cpu_worker_pool(pool, cpu) {
+ for_each_bh_worker_pool(pool, cpu)
+ pool->node = cpu_to_node(cpu);
+ for_each_cpu_worker_pool(pool, cpu)
pool->node = cpu_to_node(cpu);
- }
}

list_for_each_entry(wq, &workqueues, list) {
@@ -7333,7 +7508,16 @@ void __init workqueue_init(void)

mutex_unlock(&wq_pool_mutex);

- /* create the initial workers */
+ /*
+ * Create the initial workers. A BH pool has one pseudo worker that
+ * represents the shared BH execution context and thus doesn't get
+ * affected by hotplug events. Create the BH pseudo workers for all
+ * possible CPUs here.
+ */
+ for_each_possible_cpu(cpu)
+ for_each_bh_worker_pool(pool, cpu)
+ BUG_ON(!create_worker(pool));
+
for_each_online_cpu(cpu) {
for_each_cpu_worker_pool(pool, cpu) {
pool->flags &= ~POOL_DISASSOCIATED;
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index bd381511bd9a..d29b918306b4 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -79,7 +79,9 @@ args = parser.parse_args()
wq_type_len = 9

def wq_type_str(wq):
- if wq.flags & WQ_UNBOUND:
+ if wq.flags & WQ_BH:
+ return f'{"bh":{wq_type_len}}'
+ elif wq.flags & WQ_UNBOUND:
if wq.flags & WQ_ORDERED:
return f'{"ordered":{wq_type_len}}'
else:
@@ -97,6 +99,7 @@ wq_pod_types = prog['wq_pod_types']
wq_affn_dfl = prog['wq_affn_dfl']
wq_affn_names = prog['wq_affn_names']

+WQ_BH = prog['WQ_BH']
WQ_UNBOUND = prog['WQ_UNBOUND']
WQ_ORDERED = prog['__WQ_ORDERED']
WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
@@ -107,6 +110,8 @@ WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']

+POOL_BH = prog['POOL_BH']
+
WQ_NAME_LEN = prog['WQ_NAME_LEN'].value_()
cpumask_str_len = len(cpumask_str(wq_unbound_cpumask))

@@ -151,10 +156,12 @@ max_ref_len = 0

for pi, pool in idr_for_each(worker_pool_idr):
pool = drgn.Object(prog, 'struct worker_pool', address=pool)
- print(f'pool[{pi:0{max_pool_id_len}}] ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
+ print(f'pool[{pi:0{max_pool_id_len}}] flags=0x{pool.flags.value_():02x} ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
print(f'idle/workers={pool.nr_idle.value_():3}/{pool.nr_workers.value_():3} ', end='')
if pool.cpu >= 0:
print(f'cpu={pool.cpu.value_():3}', end='')
+ if pool.flags & POOL_BH:
+ print(' bh', end='')
else:
print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
--
2.43.0
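
As a usage note for the WQ_BH constraints documented above, a driver that
wants its own BH workqueue instead of system_bh_wq would allocate it
roughly like this (hypothetical driver names):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_drv_bh_wq;

static int my_drv_init(void)
{
	/* WQ_BH requires max_active == 0; WQ_HIGHPRI is the only extra flag allowed */
	my_drv_bh_wq = alloc_workqueue("my_drv_bh", WQ_BH | WQ_HIGHPRI, 0);
	if (!my_drv_bh_wq)
		return -ENOMEM;
	return 0;
}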


2024-02-04 02:22:15

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

On Fri, Feb 2, 2024 at 9:15 AM Tejun Heo <[email protected]> wrote:

Hello, Tejun

Reviewed-by: Lai Jiangshan <[email protected]>

Thanks
Lai

2024-02-04 21:30:11

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v3 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

From 4cb1ef64609f9b0254184b2947824f4b46ccab22 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Sun, 4 Feb 2024 11:28:06 -1000

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws such as
the execution code accessing the tasklet item after the execution is
complete which can lead to subtle use-after-free in certain usage scenarios
and less-developed flush and cancel mechanisms.

This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
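
For illustration only (not part of this patch), a driver with a private
tasklet would typically convert along the following lines. The struct and
function names are made up for the example; only the workqueue calls are
real:

#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct foo_dev {
        struct work_struct bh_work;     /* was: struct tasklet_struct */
        /* ... */
};

/* runs in the queueing CPU's softirq context and thus must not sleep */
static void foo_bh_work(struct work_struct *work)
{
        struct foo_dev *fd = container_of(work, struct foo_dev, bh_work);

        /* was: from_tasklet(); process completions for @fd here */
}

static void foo_init(struct foo_dev *fd)
{
        INIT_WORK(&fd->bh_work, foo_bh_work);           /* was: tasklet_setup() */
}

static irqreturn_t foo_irq(int irq, void *data)
{
        struct foo_dev *fd = data;

        queue_work(system_bh_wq, &fd->bh_work);         /* was: tasklet_schedule() */
        return IRQ_HANDLED;
}

static void foo_exit(struct foo_dev *fd)
{
        cancel_work_sync(&fd->bh_work);                 /* was: tasklet_kill() */
}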

v3: - Add missing interrupt.h include.

v2: - Instead of using tasklets, hook directly into its softirq action
functions - tasklet[_hi]_action(). This is slightly cheaper and closer
to the eventual code structure we want to arrive at. Suggested by Lai.

- Lai also pointed out several places which need NULL worker->task
handling or can use clarification. Updated.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
---
Documentation/core-api/workqueue.rst | 29 ++-
include/linux/workqueue.h | 11 +
kernel/softirq.c | 3 +
kernel/workqueue.c | 291 ++++++++++++++++++++++-----
tools/workqueue/wq_dump.py | 11 +-
5 files changed, 285 insertions(+), 60 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 33c4539155d9..2d6af6c4665c 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

-Special purpose threads, called worker threads, execute the functions
-off of the queue, one after the other. If no work is queued, the
-worker threads become idle. These worker threads are managed in so
-called worker-pools.
+A work item can be executed in either a thread or the BH (softirq) context.
+
+For threaded workqueues, special purpose threads, called [k]workers, execute
+the functions off of the queue, one after the other. If no work is queued,
+the worker threads become idle. These worker threads are managed in
+worker-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
@@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.

+BH workqueues use the same framework. However, as there can only be one
+concurrent execution context, there's no need to worry about concurrency.
+Each per-CPU BH worker pool contains only one pseudo worker which represents
+the BH execution context. A BH workqueue can be considered a convenience
+interface to softirq.
+
Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
@@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will
be queued on the worklist of either normal or highpri worker-pool that
is associated to the CPU the issuer is running on.

-For any worker pool implementation, managing the concurrency level
+For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue. cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
@@ -164,6 +172,17 @@ resources, scheduled and executed.
``flags``
---------

+``WQ_BH``
+ BH workqueues can be considered a convenience interface to softirq. BH
+ workqueues are always per-CPU and all BH work items are executed in the
+ queueing CPU's softirq context in the queueing order.
+
+ All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
+ only allowed additional flag.
+
+ BH work items cannot sleep. All other features such as delayed queueing,
+ flushing and canceling are supported.
+
``WQ_UNBOUND``
Work items queued to an unbound wq are served by the special
worker-pools which host workers which are not bound to any
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 232baea90a1d..283d7891b4c4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -353,6 +353,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* Documentation/core-api/workqueue.rst.
*/
enum wq_flags {
+ WQ_BH = 1 << 0, /* execute in bottom half (softirq) context */
WQ_UNBOUND = 1 << 1, /* not bound to any cpu */
WQ_FREEZABLE = 1 << 2, /* freeze during suspend */
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
@@ -392,6 +393,9 @@ enum wq_flags {
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
__WQ_ORDERED_EXPLICIT = 1 << 19, /* internal: alloc_ordered_workqueue() */
+
+ /* BH wq only allows the following flags */
+ __WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
};

enum wq_consts {
@@ -434,6 +438,9 @@ enum wq_consts {
* they are same as their non-power-efficient counterparts - e.g.
* system_power_efficient_wq is identical to system_wq if
* 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
+ *
+ * system_bh[_highpri]_wq are convenience interface to softirq. BH work items
+ * are executed in the queueing CPU's BH context in the queueing order.
*/
extern struct workqueue_struct *system_wq;
extern struct workqueue_struct *system_highpri_wq;
@@ -442,6 +449,10 @@ extern struct workqueue_struct *system_unbound_wq;
extern struct workqueue_struct *system_freezable_wq;
extern struct workqueue_struct *system_power_efficient_wq;
extern struct workqueue_struct *system_freezable_power_efficient_wq;
+extern struct workqueue_struct *system_bh_wq;
+extern struct workqueue_struct *system_bh_highpri_wq;
+
+void workqueue_softirq_action(bool highpri);

/**
* alloc_workqueue - allocate a workqueue
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 210cf5f8d92c..547d282548a8 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -27,6 +27,7 @@
#include <linux/tick.h>
#include <linux/irq.h>
#include <linux/wait_bit.h>
+#include <linux/workqueue.h>

#include <asm/softirq_stack.h>

@@ -802,11 +803,13 @@ static void tasklet_action_common(struct softirq_action *a,

static __latent_entropy void tasklet_action(struct softirq_action *a)
{
+ workqueue_softirq_action(false);
tasklet_action_common(a, this_cpu_ptr(&tasklet_vec), TASKLET_SOFTIRQ);
}

static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
{
+ workqueue_softirq_action(true);
tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
}

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 767971a29c7a..78b4b992e1a3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -29,6 +29,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/init.h>
+#include <linux/interrupt.h>
#include <linux/signal.h>
#include <linux/completion.h>
#include <linux/workqueue.h>
@@ -72,8 +73,12 @@ enum worker_pool_flags {
* Note that DISASSOCIATED should be flipped only while holding
* wq_pool_attach_mutex to avoid changing binding state while
* worker_attach_to_pool() is in progress.
+ *
+ * As there can only be one concurrent BH execution context per CPU, a
+ * BH pool is per-CPU and always DISASSOCIATED.
*/
- POOL_MANAGER_ACTIVE = 1 << 0, /* being managed */
+ POOL_BH = 1 << 0, /* is a BH pool */
+ POOL_MANAGER_ACTIVE = 1 << 1, /* being managed */
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
};

@@ -115,6 +120,14 @@ enum wq_internal_consts {
WQ_NAME_LEN = 32,
};

+/*
+ * We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
+ * MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
+ * msecs_to_jiffies() can't be an initializer.
+ */
+#define BH_WORKER_JIFFIES msecs_to_jiffies(2)
+#define BH_WORKER_RESTARTS 10
+
/*
* Structure fields follow one of the following exclusion rules.
*
@@ -443,8 +456,13 @@ static bool wq_debug_force_rr_cpu = false;
#endif
module_param_named(debug_force_rr_cpu, wq_debug_force_rr_cpu, bool, 0644);

+/* the BH worker pools */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+ bh_worker_pools);
+
/* the per-cpu worker pools */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], cpu_worker_pools);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+ cpu_worker_pools);

static DEFINE_IDR(worker_pool_idr); /* PR: idr of all pools */

@@ -478,6 +496,10 @@ struct workqueue_struct *system_power_efficient_wq __ro_after_init;
EXPORT_SYMBOL_GPL(system_power_efficient_wq);
struct workqueue_struct *system_freezable_power_efficient_wq __ro_after_init;
EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);
+struct workqueue_struct *system_bh_wq;
+EXPORT_SYMBOL_GPL(system_bh_wq);
+struct workqueue_struct *system_bh_highpri_wq;
+EXPORT_SYMBOL_GPL(system_bh_highpri_wq);

static int worker_thread(void *__worker);
static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
@@ -498,6 +520,11 @@ static void show_one_worker_pool(struct worker_pool *pool);
!lockdep_is_held(&wq_pool_mutex), \
"RCU, wq->mutex or wq_pool_mutex should be held")

+#define for_each_bh_worker_pool(pool, cpu) \
+ for ((pool) = &per_cpu(bh_worker_pools, cpu)[0]; \
+ (pool) < &per_cpu(bh_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+ (pool)++)
+
#define for_each_cpu_worker_pool(pool, cpu) \
for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
(pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
@@ -1186,6 +1213,14 @@ static bool kick_pool(struct worker_pool *pool)
if (!need_more_worker(pool) || !worker)
return false;

+ if (pool->flags & POOL_BH) {
+ if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ raise_softirq_irqoff(HI_SOFTIRQ);
+ else
+ raise_softirq_irqoff(TASKLET_SOFTIRQ);
+ return true;
+ }
+
p = worker->task;

#ifdef CONFIG_SMP
@@ -1668,7 +1703,7 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
lockdep_assert_held(&pool->lock);

if (!nna) {
- /* per-cpu workqueue, pwq->nr_active is sufficient */
+ /* BH or per-cpu workqueue, pwq->nr_active is sufficient */
obtained = pwq->nr_active < READ_ONCE(wq->max_active);
goto out;
}
@@ -2523,19 +2558,21 @@ static cpumask_t *pool_allowed_cpus(struct worker_pool *pool)
* cpu-[un]hotplugs.
*/
static void worker_attach_to_pool(struct worker *worker,
- struct worker_pool *pool)
+ struct worker_pool *pool)
{
mutex_lock(&wq_pool_attach_mutex);

/*
- * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
- * stable across this function. See the comments above the flag
- * definition for details.
+ * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains stable
+ * across this function. See the comments above the flag definition for
+ * details. BH workers are, while per-CPU, always DISASSOCIATED.
*/
- if (pool->flags & POOL_DISASSOCIATED)
+ if (pool->flags & POOL_DISASSOCIATED) {
worker->flags |= WORKER_UNBOUND;
- else
+ } else {
+ WARN_ON_ONCE(pool->flags & POOL_BH);
kthread_set_per_cpu(worker->task, pool->cpu);
+ }

if (worker->rescue_wq)
set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool));
@@ -2559,6 +2596,9 @@ static void worker_detach_from_pool(struct worker *worker)
struct worker_pool *pool = worker->pool;
struct completion *detach_completion = NULL;

+ /* there is one permanent BH worker per CPU which should never detach */
+ WARN_ON_ONCE(pool->flags & POOL_BH);
+
mutex_lock(&wq_pool_attach_mutex);

kthread_set_per_cpu(worker->task, -1);
@@ -2610,27 +2650,29 @@ static struct worker *create_worker(struct worker_pool *pool)

worker->id = id;

- if (pool->cpu >= 0)
- snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
- pool->attrs->nice < 0 ? "H" : "");
- else
- snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
-
- worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
- "kworker/%s", id_buf);
- if (IS_ERR(worker->task)) {
- if (PTR_ERR(worker->task) == -EINTR) {
- pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
- id_buf);
- } else {
- pr_err_once("workqueue: Failed to create a worker thread: %pe",
- worker->task);
+ if (!(pool->flags & POOL_BH)) {
+ if (pool->cpu >= 0)
+ snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+ pool->attrs->nice < 0 ? "H" : "");
+ else
+ snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+ worker->task = kthread_create_on_node(worker_thread, worker,
+ pool->node, "kworker/%s", id_buf);
+ if (IS_ERR(worker->task)) {
+ if (PTR_ERR(worker->task) == -EINTR) {
+ pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
+ id_buf);
+ } else {
+ pr_err_once("workqueue: Failed to create a worker thread: %pe",
+ worker->task);
+ }
+ goto fail;
}
- goto fail;
- }

- set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+ set_user_nice(worker->task, pool->attrs->nice);
+ kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+ }

/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
@@ -2646,7 +2688,8 @@ static struct worker *create_worker(struct worker_pool *pool)
* check if not woken up soon. As kick_pool() is noop if @pool is empty,
* wake it up explicitly.
*/
- wake_up_process(worker->task);
+ if (worker->task)
+ wake_up_process(worker->task);

raw_spin_unlock_irq(&pool->lock);

@@ -2988,7 +3031,8 @@ __acquires(&pool->lock)
worker->current_work = work;
worker->current_func = work->func;
worker->current_pwq = pwq;
- worker->current_at = worker->task->se.sum_exec_runtime;
+ if (worker->task)
+ worker->current_at = worker->task->se.sum_exec_runtime;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

@@ -3086,7 +3130,8 @@ __acquires(&pool->lock)
* stop_machine. At the same time, report a quiescent RCU state so
* the same condition doesn't freeze RCU.
*/
- cond_resched();
+ if (worker->task)
+ cond_resched();

raw_spin_lock_irq(&pool->lock);

@@ -3369,6 +3414,61 @@ static int rescuer_thread(void *__rescuer)
goto repeat;
}

+static void bh_worker(struct worker *worker)
+{
+ struct worker_pool *pool = worker->pool;
+ int nr_restarts = BH_WORKER_RESTARTS;
+ unsigned long end = jiffies + BH_WORKER_JIFFIES;
+
+ raw_spin_lock_irq(&pool->lock);
+ worker_leave_idle(worker);
+
+ /*
+ * This function follows the structure of worker_thread(). See there for
+ * explanations on each step.
+ */
+ if (!need_more_worker(pool))
+ goto done;
+
+ WARN_ON_ONCE(!list_empty(&worker->scheduled));
+ worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
+
+ do {
+ struct work_struct *work =
+ list_first_entry(&pool->worklist,
+ struct work_struct, entry);
+
+ if (assign_work(work, worker, NULL))
+ process_scheduled_works(worker);
+ } while (keep_working(pool) &&
+ --nr_restarts && time_before(jiffies, end));
+
+ worker_set_flags(worker, WORKER_PREP);
+done:
+ worker_enter_idle(worker);
+ kick_pool(pool);
+ raw_spin_unlock_irq(&pool->lock);
+}
+
+/*
+ * TODO: Convert all tasklet users to workqueue and use softirq directly.
+ *
+ * This is currently called from tasklet[_hi]action() and thus is also called
+ * whenever there are tasklets to run. Let's do an early exit if there's nothing
+ * queued. Once conversion from tasklet is complete, the need_more_worker() test
+ * can be dropped.
+ *
+ * After full conversion, we'll add worker->softirq_action, directly use the
+ * softirq action and obtain the worker pointer from the softirq_action pointer.
+ */
+void workqueue_softirq_action(bool highpri)
+{
+ struct worker_pool *pool =
+ &per_cpu(bh_worker_pools, smp_processor_id())[highpri];
+ if (need_more_worker(pool))
+ bh_worker(list_first_entry(&pool->workers, struct worker, node));
+}
+
/**
* check_flush_dependency - check for flush dependency sanity
* @target_wq: workqueue being flushed
@@ -3441,6 +3541,7 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
struct wq_barrier *barr,
struct work_struct *target, struct worker *worker)
{
+ static __maybe_unused struct lock_class_key bh_key, thr_key;
unsigned int work_flags = 0;
unsigned int work_color;
struct list_head *head;
@@ -3450,8 +3551,13 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
* as we know for sure that this will not trigger any of the
* checks and call back into the fixup functions where we
* might deadlock.
+ *
+ * BH and threaded workqueues need separate lockdep keys to avoid
+ * spuriously triggering "inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W}
+ * usage".
*/
- INIT_WORK_ONSTACK(&barr->work, wq_barrier_func);
+ INIT_WORK_ONSTACK_KEY(&barr->work, wq_barrier_func,
+ (pwq->wq->flags & WQ_BH) ? &bh_key : &thr_key);
__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));

init_completion_map(&barr->done, &target->lockdep_map);
@@ -3557,15 +3663,31 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,

static void touch_wq_lockdep_map(struct workqueue_struct *wq)
{
+#ifdef CONFIG_LOCKDEP
+ if (wq->flags & WQ_BH)
+ local_bh_disable();
+
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
+
+ if (wq->flags & WQ_BH)
+ local_bh_enable();
+#endif
}

static void touch_work_lockdep_map(struct work_struct *work,
struct workqueue_struct *wq)
{
+#ifdef CONFIG_LOCKDEP
+ if (wq->flags & WQ_BH)
+ local_bh_disable();
+
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
+
+ if (wq->flags & WQ_BH)
+ local_bh_enable();
+#endif
}

/**
@@ -5019,10 +5141,17 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)

if (!(wq->flags & WQ_UNBOUND)) {
for_each_possible_cpu(cpu) {
- struct pool_workqueue **pwq_p =
- per_cpu_ptr(wq->cpu_pwq, cpu);
- struct worker_pool *pool =
- &(per_cpu_ptr(cpu_worker_pools, cpu)[highpri]);
+ struct pool_workqueue **pwq_p;
+ struct worker_pool __percpu *pools;
+ struct worker_pool *pool;
+
+ if (wq->flags & WQ_BH)
+ pools = bh_worker_pools;
+ else
+ pools = cpu_worker_pools;
+
+ pool = &(per_cpu_ptr(pools, cpu)[highpri]);
+ pwq_p = per_cpu_ptr(wq->cpu_pwq, cpu);

*pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL,
pool->node);
@@ -5197,6 +5326,13 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
size_t wq_size;
int name_len;

+ if (flags & WQ_BH) {
+ if (WARN_ON_ONCE(flags & ~__WQ_BH_ALLOWS))
+ return NULL;
+ if (WARN_ON_ONCE(max_active))
+ return NULL;
+ }
+
/*
* Unbound && max_active == 1 used to imply ordered, which is no longer
* the case on many machines due to per-pod pools. While
@@ -5234,8 +5370,16 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
pr_warn_once("workqueue: name exceeds WQ_NAME_LEN. Truncating to: %s\n",
wq->name);

- max_active = max_active ?: WQ_DFL_ACTIVE;
- max_active = wq_clamp_max_active(max_active, flags, wq->name);
+ if (flags & WQ_BH) {
+ /*
+ * BH workqueues always share a single execution context per CPU
+ * and don't impose any max_active limit.
+ */
+ max_active = INT_MAX;
+ } else {
+ max_active = max_active ?: WQ_DFL_ACTIVE;
+ max_active = wq_clamp_max_active(max_active, flags, wq->name);
+ }

/* init wq */
wq->flags = flags;
@@ -5416,6 +5560,9 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
*/
void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
{
+ /* max_active doesn't mean anything for BH workqueues */
+ if (WARN_ON(wq->flags & WQ_BH))
+ return;
/* disallow meddling with max_active for ordered workqueues */
if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
return;
@@ -5617,7 +5764,24 @@ static void pr_cont_pool_info(struct worker_pool *pool)
pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
if (pool->node != NUMA_NO_NODE)
pr_cont(" node=%d", pool->node);
- pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+ pr_cont(" flags=0x%x", pool->flags);
+ if (pool->flags & POOL_BH)
+ pr_cont(" bh%s",
+ pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ else
+ pr_cont(" nice=%d", pool->attrs->nice);
+}
+
+static void pr_cont_worker_id(struct worker *worker)
+{
+ struct worker_pool *pool = worker->pool;
+
+ if (pool->flags & WQ_BH)
+ pr_cont("bh%s",
+ pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ else
+ pr_cont("%d%s", task_pid_nr(worker->task),
+ worker->rescue_wq ? "(RESCUER)" : "");
}

struct pr_cont_work_struct {
@@ -5694,10 +5858,9 @@ static void show_pwq(struct pool_workqueue *pwq)
if (worker->current_pwq != pwq)
continue;

- pr_cont("%s %d%s:%ps", comma ? "," : "",
- task_pid_nr(worker->task),
- worker->rescue_wq ? "(RESCUER)" : "",
- worker->current_func);
+ pr_cont(" %s", comma ? "," : "");
+ pr_cont_worker_id(worker);
+ pr_cont(":%ps", worker->current_func);
list_for_each_entry(work, &worker->scheduled, entry)
pr_cont_work(false, work, &pcws);
pr_cont_work_flush(comma, (work_func_t)-1L, &pcws);
@@ -5816,8 +5979,8 @@ static void show_one_worker_pool(struct worker_pool *pool)
pr_cont(" manager: %d",
task_pid_nr(pool->manager->task));
list_for_each_entry(worker, &pool->idle_list, entry) {
- pr_cont(" %s%d", first ? "idle: " : "",
- task_pid_nr(worker->task));
+ pr_cont(" %s", first ? "idle: " : "");
+ pr_cont_worker_id(worker);
first = false;
}
pr_cont("\n");
@@ -6090,13 +6253,15 @@ int workqueue_online_cpu(unsigned int cpu)
mutex_lock(&wq_pool_mutex);

for_each_pool(pool, pi) {
- mutex_lock(&wq_pool_attach_mutex);
+ /* BH pools aren't affected by hotplug */
+ if (pool->flags & POOL_BH)
+ continue;

+ mutex_lock(&wq_pool_attach_mutex);
if (pool->cpu == cpu)
rebind_workers(pool);
else if (pool->cpu < 0)
restore_unbound_workers_cpumask(pool, cpu);
-
mutex_unlock(&wq_pool_attach_mutex);
}

@@ -7053,7 +7218,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
/* did we stall? */
if (time_after(now, ts + thresh)) {
lockup_detected = true;
- if (pool->cpu >= 0) {
+ if (pool->cpu >= 0 && !(pool->flags & POOL_BH)) {
pool->cpu_stall = true;
cpu_pool_stall = true;
}
@@ -7218,10 +7383,16 @@ void __init workqueue_init_early(void)
pt->pod_node[0] = NUMA_NO_NODE;
pt->cpu_pod[0] = 0;

- /* initialize CPU pools */
+ /* initialize BH and CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;

+ i = 0;
+ for_each_bh_worker_pool(pool, cpu) {
+ init_cpu_worker_pool(pool, cpu, std_nice[i++]);
+ pool->flags |= POOL_BH;
+ }
+
i = 0;
for_each_cpu_worker_pool(pool, cpu)
init_cpu_worker_pool(pool, cpu, std_nice[i++]);
@@ -7257,10 +7428,14 @@ void __init workqueue_init_early(void)
system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_pwr_efficient",
WQ_FREEZABLE | WQ_POWER_EFFICIENT,
0);
+ system_bh_wq = alloc_workqueue("events_bh", WQ_BH, 0);
+ system_bh_highpri_wq = alloc_workqueue("events_bh_highpri",
+ WQ_BH | WQ_HIGHPRI, 0);
BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
!system_unbound_wq || !system_freezable_wq ||
!system_power_efficient_wq ||
- !system_freezable_power_efficient_wq);
+ !system_freezable_power_efficient_wq ||
+ !system_bh_wq || !system_bh_highpri_wq);
}

static void __init wq_cpu_intensive_thresh_init(void)
@@ -7326,9 +7501,10 @@ void __init workqueue_init(void)
* up. Also, create a rescuer for workqueues that requested it.
*/
for_each_possible_cpu(cpu) {
- for_each_cpu_worker_pool(pool, cpu) {
+ for_each_bh_worker_pool(pool, cpu)
+ pool->node = cpu_to_node(cpu);
+ for_each_cpu_worker_pool(pool, cpu)
pool->node = cpu_to_node(cpu);
- }
}

list_for_each_entry(wq, &workqueues, list) {
@@ -7339,7 +7515,16 @@ void __init workqueue_init(void)

mutex_unlock(&wq_pool_mutex);

- /* create the initial workers */
+ /*
+ * Create the initial workers. A BH pool has one pseudo worker that
+ * represents the shared BH execution context and thus doesn't get
+ * affected by hotplug events. Create the BH pseudo workers for all
+ * possible CPUs here.
+ */
+ for_each_possible_cpu(cpu)
+ for_each_bh_worker_pool(pool, cpu)
+ BUG_ON(!create_worker(pool));
+
for_each_online_cpu(cpu) {
for_each_cpu_worker_pool(pool, cpu) {
pool->flags &= ~POOL_DISASSOCIATED;
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index bd381511bd9a..d29b918306b4 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -79,7 +79,9 @@ args = parser.parse_args()
wq_type_len = 9

def wq_type_str(wq):
- if wq.flags & WQ_UNBOUND:
+ if wq.flags & WQ_BH:
+ return f'{"bh":{wq_type_len}}'
+ elif wq.flags & WQ_UNBOUND:
if wq.flags & WQ_ORDERED:
return f'{"ordered":{wq_type_len}}'
else:
@@ -97,6 +99,7 @@ wq_pod_types = prog['wq_pod_types']
wq_affn_dfl = prog['wq_affn_dfl']
wq_affn_names = prog['wq_affn_names']

+WQ_BH = prog['WQ_BH']
WQ_UNBOUND = prog['WQ_UNBOUND']
WQ_ORDERED = prog['__WQ_ORDERED']
WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
@@ -107,6 +110,8 @@ WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']

+POOL_BH = prog['POOL_BH']
+
WQ_NAME_LEN = prog['WQ_NAME_LEN'].value_()
cpumask_str_len = len(cpumask_str(wq_unbound_cpumask))

@@ -151,10 +156,12 @@ max_ref_len = 0

for pi, pool in idr_for_each(worker_pool_idr):
pool = drgn.Object(prog, 'struct worker_pool', address=pool)
- print(f'pool[{pi:0{max_pool_id_len}}] ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
+ print(f'pool[{pi:0{max_pool_id_len}}] flags=0x{pool.flags.value_():02x} ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
print(f'idle/workers={pool.nr_idle.value_():3}/{pool.nr_workers.value_():3} ', end='')
if pool.cpu >= 0:
print(f'cpu={pool.cpu.value_():3}', end='')
+ if pool.flags & POOL_BH:
+ print(' bh', end='')
else:
print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
--
2.43.0


2024-02-04 21:34:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

On Mon, Jan 29, 2024 at 11:11:47PM -1000, Tejun Heo wrote:
> 0001-workqueue-Update-lock-debugging-code.patch
> 0002-workqueue-Factor-out-init_cpu_worker_pool.patch
> 0003-workqueue-Implement-BH-workqueues-to-eventually-repl.patch
> 0004-backtracetest-Convert-from-tasklet-to-BH-workqueue.patch
> 0005-usb-core-hcd-Convert-from-tasklet-to-BH-workqueue.patch
> 0006-net-tcp-tsq-Convert-from-tasklet-to-BH-workqueue.patch
> 0007-dm-crypt-Convert-from-tasklet-to-BH-workqueue.patch
> 0008-dm-verity-Convert-from-tasklet-to-BH-workqueue.patch

Applied 0001-0003 to wq/for-6.9. Applied 0004-0005 to
wq/for-6.9-bh-conversions. Will proceed on 0006 and other conversions after
more perf testing.

Thanks.

--
tejun

2024-02-05 04:49:19

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

On Sun, 4 Feb 2024 11:28:06 -1000 Tejun Heo <[email protected]>
> +static void bh_worker(struct worker *worker)
> +{
> + struct worker_pool *pool = worker->pool;
> + int nr_restarts = BH_WORKER_RESTARTS;
> + unsigned long end = jiffies + BH_WORKER_JIFFIES;
> +
> + raw_spin_lock_irq(&pool->lock);
> + worker_leave_idle(worker);
> +
> + /*
> + * This function follows the structure of worker_thread(). See there for
> + * explanations on each step.
> + */
> + if (!need_more_worker(pool))
> + goto done;
> +
> + WARN_ON_ONCE(!list_empty(&worker->scheduled));
> + worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
> +
> + do {
> + struct work_struct *work =
> + list_first_entry(&pool->worklist,
> + struct work_struct, entry);
> +
> + if (assign_work(work, worker, NULL))
> + process_scheduled_works(worker);
> + } while (keep_working(pool) &&
> + --nr_restarts && time_before(jiffies, end));
> +
> + worker_set_flags(worker, WORKER_PREP);
> +done:
> + worker_enter_idle(worker);
> + kick_pool(pool);
> + raw_spin_unlock_irq(&pool->lock);
> +}

I see no need to exec bh works for 2ms with irq disabled.

2024-02-05 17:48:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

On Mon, Feb 05, 2024 at 12:48:17PM +0800, Hillf Danton wrote:
> I see no need to exec bh works for 2ms with irq disabled.

The irqs are enabled and locks are released while work items are executing.
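
For reference, the relevant part of process_one_work() looks roughly like
this (heavily abridged; bookkeeping and lockdep annotations omitted):

        /* entered with pool->lock held and irqs disabled */
        raw_spin_unlock_irq(&pool->lock);       /* irqs back on for the callback */

        worker->current_func(work);             /* the work item runs here */

        raw_spin_lock_irq(&pool->lock);         /* reacquired for post-run bookkeeping */

Only the per-item dispatch bookkeeping runs with the lock held and irqs
off, not the work functions themselves.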

Thanks.

--
tejun

2024-02-05 21:54:10

by Allen

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

> On Mon, Jan 29, 2024 at 11:11:47PM -1000, Tejun Heo wrote:
> > 0001-workqueue-Update-lock-debugging-code.patch
> > 0002-workqueue-Factor-out-init_cpu_worker_pool.patch
> > 0003-workqueue-Implement-BH-workqueues-to-eventually-repl.patch
> > 0004-backtracetest-Convert-from-tasklet-to-BH-workqueue.patch
> > 0005-usb-core-hcd-Convert-from-tasklet-to-BH-workqueue.patch
> > 0006-net-tcp-tsq-Convert-from-tasklet-to-BH-workqueue.patch
> > 0007-dm-crypt-Convert-from-tasklet-to-BH-workqueue.patch
> > 0008-dm-verity-Convert-from-tasklet-to-BH-workqueue.patch
>
> Applied 0001-0003 to wq/for-6.9. Applied 0004-0005 to
> wq/for-6.9-bh-conversions. Will proceed on 0006 and other conversions after
> more perf testing.
>
> Thanks.
>

Thank you. I am basing my work on the branch you have
pushed (for-6.9-bh-conversions):
https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/log/?h=for-6.9-bh-conversions

In the order of priority, I have started converting drivers/media/*,
drivers/dma/* followed by drivers/net/*
which constitutes the majority. Putting my plan out here so that the
work is not duplicated.
I will write back in a day and share the branch for review.

W.r.t the conversion, there are drivers which call
tasklet_[disable/enable](), which I suppose
can be ignored in the case of workqueues, I am not entirely sure if
this is the right approach.
Please correct me if I am wrong.


Thanks,

- Allen

2024-02-05 22:06:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

Hello, Allen.

On Mon, Feb 05, 2024 at 12:50:28PM -0800, Allen wrote:
> Thank you. I am basing my work on the branch you have
> pushed.(or-6.9-bh-conversions)
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/log/?h=for-6.9-bh-conversions
>
> In the order of priority, I have started converting drivers/media/*,
> drivers/dma/* followed by drivers/net/*
> which constitutes the majority. Putting my plan out here so that the
> work is not duplicated.
> I will write back in a day and share the branch for review.

That's great. Thanks.

> W.r.t the conversion, there are drivers which call
> tasklet_[disable/enable](), which I suppose
> can be ignored in the case of workqueues, I am not entirely sure if
> this is the right approach.
> Please correct me if I am wrong.

I don't think we can ignore them. I was just looking at tasklet_kill() and
thought we're good because that seemed to map well to cancel_work_sync().
workqueue doesn't have the counterpart for tasklet_[disable/enable](). I'll
look through them and think on it.
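
In the meantime, here's a rough sketch of one stopgap a driver could
open-code (this is not a workqueue API and all names below are made up
for the example). Note that, unlike tasklet_disable/enable(), work queued
while disabled is dropped rather than deferred, and the disable side must
run in process context as cancel_work_sync() may sleep:

#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct foo_dev {
        spinlock_t lock;
        bool bh_enabled;                /* guards the work against "disable" */
        struct work_struct bh_work;
};

static void foo_bh_disable(struct foo_dev *fd)
{
        spin_lock_bh(&fd->lock);
        fd->bh_enabled = false;
        spin_unlock_bh(&fd->lock);
        cancel_work_sync(&fd->bh_work); /* wait out anything already running */
}

static void foo_bh_enable(struct foo_dev *fd)
{
        spin_lock_bh(&fd->lock);
        fd->bh_enabled = true;
        spin_unlock_bh(&fd->lock);
}

static void foo_bh_work(struct work_struct *work)
{
        struct foo_dev *fd = container_of(work, struct foo_dev, bh_work);

        spin_lock(&fd->lock);           /* already in BH context */
        if (fd->bh_enabled) {
                /* ... do the actual BH work ... */
        }
        spin_unlock(&fd->lock);
}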

Thanks.

--
tejun

2024-02-05 22:40:04

by Allen

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

Tejun,
> On Mon, Feb 05, 2024 at 12:50:28PM -0800, Allen wrote:
> > Thank you. I am basing my work on the branch you have
> > pushed.(or-6.9-bh-conversions)
> > https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/log/?h=for-6.9-bh-conversions
> >
> > In the order of priority, I have started converting drivers/media/*,
> > drivers/dma/* followed by drivers/net/*
> > which constitutes the majority. Putting my plan out here so that the
> > work is not duplicated.
> > I will write back in a day and share the branch for review.
>
> That's great. Thanks.
>
> > W.r.t the conversion, there are drivers which call
> > tasklet_[disable/enable](), which I suppose
> > can be ignored in the case of workqueues, I am not entirely sure if
> > this is the right approach.
> > Please correct me if I am wrong.
>
> I don't think we can ignore them. I was just looking at tasklet_kill() and
> thought we're good because that seemed to map well to cancel_work_sync().
> workqueue doesn't have the counterpart for tasklet_[disable/enable](). I'll
> look through them and think on it.
>

Okay, I will look into it too. I have these rough and completely untested
functions. All I am trying to do is to match what tasklets are currently doing.

static inline void workqueue_disable(struct work_struct *work)
{
        cancel_work_sync(work);
        flush_workqueue(system_bh_wq);
        smp_mb();
}

static inline void workqueue_enable(struct work_struct *work)
{
        smp_mb__before_atomic();
        // atomic_inc(&work->data);
}

I have to figure out a better way to handle atomic_inc() in the enable
function.

Thanks.

- Allen

2024-02-07 19:03:06

by Allen

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

Tejun,

> > On Mon, Feb 05, 2024 at 12:50:28PM -0800, Allen wrote:
> > > Thank you. I am basing my work on the branch you have
> > > pushed.(or-6.9-bh-conversions)
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/log/?h=for-6.9-bh-conversions
> > >
> > > In the order of priority, I have started converting drivers/media/*,
> > > drivers/dma/* followed by drivers/net/*
> > > which constitutes the majority. Putting my plan out here so that the
> > > work is not duplicated.
> > > I will write back in a day and share the branch for review.
> >
> > That's great. Thanks.
> >

I have the first set of conversions ready. Kindly review the patchset
before I post it to the list. These convert drivers/dma/* to use the new
Atomic WQ mechanism.

Based on
https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-6.9-bh-conversions


https://github.com/allenpais/for-6.9-bh-conversions

I am holding on to the patch that converts drivers/media/*, as I haven't found
the right way to replace the tasklet_[disable/enable] APIs. The rest should be ready
in a day or two.

Thanks,
- Allen

2024-02-08 16:57:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

On Wed, Feb 07, 2024 at 11:02:37AM -0800, Allen wrote:
> https://github.com/allenpais/for-6.9-bh-conversions
>
> I am holding on to the patch that converts drivers/media/*, as I haven't found
> a right way to replace tasklet_[disable/enable] api's. The rest should be ready
> in a day or two.

Yeah, we'll need to add something to workqueue to support that. As for the
rest, looking at the code, I think tasklet_kill() should be converted to
cancel_work_sync(), not flush_work().

Thanks.

--
tejun

2024-02-08 19:21:48

by Allen

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

On Thu, Feb 8, 2024 at 8:56 AM Tejun Heo <[email protected]> wrote:
>
> On Wed, Feb 07, 2024 at 11:02:37AM -0800, Allen wrote:
> > https://github.com/allenpais/for-6.9-bh-conversions
> >
> > I am holding on to the patch that converts drivers/media/*, as I haven't found
> > a right way to replace tasklet_[disable/enable] api's. The rest should be ready
> > in a day or two.
>
> Yeah, we'll need to add something to workqueue to support that. As for the
> rest, looking at the code, I think tasklet_kill() should be converted to
> cancel_work_sync(), not flush_work().
>

Ah, thanks for pointing that out. I will update it and get that fixed.

Thanks for the review.

--
- Allen

2024-02-16 05:31:41

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 6/8] net: tcp: tsq: Convert from tasklet to BH workqueue

Hello,

On Mon, Jan 29, 2024 at 11:11:53PM -1000, Tejun Heo wrote:
> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws. To
> replace tasklets, BH workqueue support was recently added. A BH workqueue
> behaves similarly to regular workqueues except that the queued work items
> are executed in the BH context.
>
> This patch converts TCP Small Queues implementation from tasklet to BH
> workqueue.
>
> Semantically, this is an equivalent conversion and there shouldn't be any
> user-visible behavior changes. While workqueue's queueing and execution
> paths are a bit heavier than tasklet's, unless the work item is being queued
> every packet, the difference hopefully shouldn't matter.
>
> My experience with the networking stack is very limited and this patch
> definitely needs attention from someone who actually understands networking.

On Jakub's recommendation, I asked David Wei to perform production memcache
benchmark on the backported conversion patch. There was no discernible
difference before and after. Given that this is likely as hot as it gets for
the path on a real workload, the conversions hopefully shouldn't be
noticeable in terms of performance impact.

Jakub, I'd really appreciate if you could ack. David, would it be okay if I
add your Tested-by?

Thanks.

--
tejun

2024-02-16 08:27:04

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 6/8] net: tcp: tsq: Convert from tasklet to BH workqueue

On Fri, Feb 16, 2024 at 6:31 AM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Mon, Jan 29, 2024 at 11:11:53PM -1000, Tejun Heo wrote:
> > The only generic interface to execute asynchronously in the BH context is
> > tasklet; however, it's marked deprecated and has some design flaws. To
> > replace tasklets, BH workqueue support was recently added. A BH workqueue
> > behaves similarly to regular workqueues except that the queued work items
> > are executed in the BH context.
> >
> > This patch converts TCP Small Queues implementation from tasklet to BH
> > workqueue.
> >
> > Semantically, this is an equivalent conversion and there shouldn't be any
> > user-visible behavior changes. While workqueue's queueing and execution
> > paths are a bit heavier than tasklet's, unless the work item is being queued
> > every packet, the difference hopefully shouldn't matter.
> >
> > My experience with the networking stack is very limited and this patch
> > definitely needs attention from someone who actually understands networking.
>
> On Jakub's recommendation, I asked David Wei to perform production memcache
> benchmark on the backported conversion patch. There was no discernible
> difference before and after. Given that this is likely as hot as it gets for
> the path on a real workload, the conversions hopefully shouldn't be
> noticeable in terms of performance impact.
>
> Jakub, I'd really appreciate if you could ack. David, would it be okay if I
> add your Tested-by?

I presume the memcache benchmark is using small RPCs?

TSQ matters for high BDP, and is very time sensitive.

Things like slow TX completions (firing from napi poll, BH context)
can hurt TSQ.

If we add on top of these slow TX completions, an additional work
queue overhead, I really am not sure...

I would recommend tests with pfifo_fast qdisc (not FQ which has a
special override for TSQ limits)

Eventually we could add in TCP a measure of the time lost because of
TSQ, regardless of the kick implementation (tasklet or workqueue).
Measuring the delay between when a TCP socket got tcp_wfree approval
to deliver more packets and the time it finally delivered these packets
could be implemented with a bpftrace program.

2024-02-16 15:52:45

by David Wei

[permalink] [raw]
Subject: Re: [PATCH 6/8] net: tcp: tsq: Convert from tasklet to BH workqueue

On 2024-02-16 01:23, Eric Dumazet wrote:
> On Fri, Feb 16, 2024 at 6:31 AM Tejun Heo <[email protected]> wrote:
>>
>> Hello,
>>
>> On Mon, Jan 29, 2024 at 11:11:53PM -1000, Tejun Heo wrote:
>>> The only generic interface to execute asynchronously in the BH context is
>>> tasklet; however, it's marked deprecated and has some design flaws. To
>>> replace tasklets, BH workqueue support was recently added. A BH workqueue
>>> behaves similarly to regular workqueues except that the queued work items
>>> are executed in the BH context.
>>>
>>> This patch converts TCP Small Queues implementation from tasklet to BH
>>> workqueue.
>>>
>>> Semantically, this is an equivalent conversion and there shouldn't be any
>>> user-visible behavior changes. While workqueue's queueing and execution
>>> paths are a bit heavier than tasklet's, unless the work item is being queued
>>> every packet, the difference hopefully shouldn't matter.
>>>
>>> My experience with the networking stack is very limited and this patch
>>> definitely needs attention from someone who actually understands networking.
>>
>> On Jakub's recommendation, I asked David Wei to perform production memcache
>> benchmark on the backported conversion patch. There was no discernible
>> difference before and after. Given that this is likely as hot as it gets for
>> the path on a real workload, the conversions hopefully shouldn't be
>> noticeable in terms of performance impact.
>>
>> Jakub, I'd really appreciate if you could ack. David, would it be okay if I
>> add your Tested-by?

Yes, that's fine.

>
> I presume memcache benchmark is using small RPC ?

It is not a benchmark but a prod shadow, but yes the requests are small.

>
> TSQ matters for high BDP, and is very time sensitive.
>
> Things like slow TX completions (firing from napi poll, BH context)
> can hurt TSQ.
>
> If we add on top of these slow TX completions, an additional work
> queue overhead, I really am not sure...
>
> I would recommend tests with pfifo_fast qdisc (not FQ which has a
> special override for TSQ limits)
>
> Eventually we could add in TCP a measure of the time lost because of
> TSQ, regardless of the kick implementation (tasklet or workqueue).
> Measuring the delay between when a tcp socket got tcp_wfree approval
> to deliver more packets, and time it finally delivered these packets
> could be implemented with a bpftrace program.

2024-02-16 16:25:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 6/8] net: tcp: tsq: Convert from tasklet to BH workqueue

Hello, Eric. How have you been?

On Fri, Feb 16, 2024 at 09:23:00AM +0100, Eric Dumazet wrote:
..
> TSQ matters for high BDP, and is very time sensitive.
>
> Things like slow TX completions (firing from napi poll, BH context)
> can hurt TSQ.
>
> If we add on top of these slow TX completions, an additional work
> queue overhead, I really am not sure...

Just to be sure, the workqueue here is executing in the same softirq context
as tasklets. This isn't the usual workqueue which has to go through the
scheduler. The only difference would be that workqueue does a bit more work
(e.g. to manage the currently executing hashtable) than tasklet. It's
unlikely to show noticeable latency penalty in any practical case although
the extra overhead would likely be visible in targeted microbenches where
all that happens is scheduling and running noop work items.

> I would recommend tests with pfifo_fast qdisc (not FQ which has a
> special override for TSQ limits)

David, do you think this is something we can do?

> Eventually we could add in TCP a measure of the time lost because of
> TSQ, regardless of the kick implementation (tasklet or workqueue).
> Measuring the delay between when a tcp socket got tcp_wfree approval
> to deliver more packets, and time it finally delivered these packets
> could be implemented with a bpftrace program.

I don't have enough context here but it sounds like you are worried about
adding latency in that path. This conversion is unlikely to make a
noticeable difference there. The interface and semantics are workqueue's, but
the work items are being executed exactly the same way from the same
softirqs as tasklets. Would testing with pfifo_fast be sufficient to dispel
your concern?

Thanks.

--
tejun

2024-02-20 17:26:24

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 5/8] usb: core: hcd: Convert from tasklet to BH workqueue

On Mon, 29 Jan 2024, Tejun Heo wrote:

>The only generic interface to execute asynchronously in the BH context is
>tasklet; however, it's marked deprecated and has some design flaws. To
>replace tasklets, BH workqueue support was recently added. A BH workqueue
>behaves similarly to regular workqueues except that the queued work items
>are executed in the BH context.
>
>This patch converts usb hcd from tasklet to BH workqueue.

In the past this tasklet removal was held up by Mauro's device not properly
streaming - hopefully this is no longer the case. Cc'ing.

https://lore.kernel.org/all/[email protected]/

>
>Signed-off-by: Tejun Heo <[email protected]>
>Cc: Greg Kroah-Hartman <[email protected]>
>Cc: Alan Stern <[email protected]>
>Cc: [email protected]

Acked-by: Davidlohr Bueso <[email protected]>

>---
> drivers/usb/core/hcd.c | 23 ++++++++++++-----------
> include/linux/usb/hcd.h | 2 +-
> 2 files changed, 13 insertions(+), 12 deletions(-)
>
>diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
>index 12b6dfeaf658..edf74458474a 100644
>--- a/drivers/usb/core/hcd.c
>+++ b/drivers/usb/core/hcd.c
>@@ -1664,9 +1664,10 @@ static void __usb_hcd_giveback_urb(struct urb *urb)
> usb_put_urb(urb);
> }
>
>-static void usb_giveback_urb_bh(struct tasklet_struct *t)
>+static void usb_giveback_urb_bh(struct work_struct *work)
> {
>- struct giveback_urb_bh *bh = from_tasklet(bh, t, bh);
>+ struct giveback_urb_bh *bh =
>+ container_of(work, struct giveback_urb_bh, bh);
> struct list_head local_list;
>
> spin_lock_irq(&bh->lock);
>@@ -1691,9 +1692,9 @@ static void usb_giveback_urb_bh(struct tasklet_struct *t)
> spin_lock_irq(&bh->lock);
> if (!list_empty(&bh->head)) {
> if (bh->high_prio)
>- tasklet_hi_schedule(&bh->bh);
>+ queue_work(system_bh_highpri_wq, &bh->bh);
> else
>- tasklet_schedule(&bh->bh);
>+ queue_work(system_bh_wq, &bh->bh);
> }
> bh->running = false;
> spin_unlock_irq(&bh->lock);
>@@ -1706,7 +1707,7 @@ static void usb_giveback_urb_bh(struct tasklet_struct *t)
> * @status: completion status code for the URB.
> *
> * Context: atomic. The completion callback is invoked in caller's context.
>- * For HCDs with HCD_BH flag set, the completion callback is invoked in tasklet
>+ * For HCDs with HCD_BH flag set, the completion callback is invoked in BH
> * context (except for URBs submitted to the root hub which always complete in
> * caller's context).
> *
>@@ -1725,7 +1726,7 @@ void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
> struct giveback_urb_bh *bh;
> bool running;
>
>- /* pass status to tasklet via unlinked */
>+ /* pass status to BH via unlinked */
> if (likely(!urb->unlinked))
> urb->unlinked = status;
>
>@@ -1747,9 +1748,9 @@ void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
> if (running)
> ;
> else if (bh->high_prio)
>- tasklet_hi_schedule(&bh->bh);
>+ queue_work(system_bh_highpri_wq, &bh->bh);
> else
>- tasklet_schedule(&bh->bh);
>+ queue_work(system_bh_wq, &bh->bh);
> }
> EXPORT_SYMBOL_GPL(usb_hcd_giveback_urb);
>
>@@ -2540,7 +2541,7 @@ static void init_giveback_urb_bh(struct giveback_urb_bh *bh)
>
> spin_lock_init(&bh->lock);
> INIT_LIST_HEAD(&bh->head);
>- tasklet_setup(&bh->bh, usb_giveback_urb_bh);
>+ INIT_WORK(&bh->bh, usb_giveback_urb_bh);
> }
>
> struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver,
>@@ -2926,7 +2927,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
> && device_can_wakeup(&hcd->self.root_hub->dev))
> dev_dbg(hcd->self.controller, "supports USB remote wakeup\n");
>
>- /* initialize tasklets */
>+ /* initialize BHs */
> init_giveback_urb_bh(&hcd->high_prio_bh);
> hcd->high_prio_bh.high_prio = true;
> init_giveback_urb_bh(&hcd->low_prio_bh);
>@@ -3036,7 +3037,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
> mutex_unlock(&usb_bus_idr_lock);
>
> /*
>- * tasklet_kill() isn't needed here because:
>+ * flush_work() isn't needed here because:
> * - driver's disconnect() called from usb_disconnect() should
> * make sure its URBs are completed during the disconnect()
> * callback
>diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
>index 00724b4f6e12..f698aac71de3 100644
>--- a/include/linux/usb/hcd.h
>+++ b/include/linux/usb/hcd.h
>@@ -55,7 +55,7 @@ struct giveback_urb_bh {
> bool high_prio;
> spinlock_t lock;
> struct list_head head;
>- struct tasklet_struct bh;
>+ struct work_struct bh;
> struct usb_host_endpoint *completing_ep;
> };
>
>--
>2.43.0
>

2024-02-20 17:56:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 5/8] usb: core: hcd: Convert from tasklet to BH workqueue

On Tue, 20 Feb 2024 at 09:25, Davidlohr Bueso <[email protected]> wrote:
>
> In the past this tasklet removal was held up by Mauro's device not properly
> streaming - hopefully this no longer the case. Cc'ing.
>
> https://lore.kernel.org/all/[email protected]/

Oh, lovely - an actual use-case where the old tasklet code has known
requirements.

Mauro - the BH workqueue should provide the same kind of latency as
the tasklets, and it would be good to validate early that yes, this
workqueue conversion works well in practice. Since you have an actual
real-life test-case, could you give it a try?

You can find the work in

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git
refs/heads/for-6.9-bh-conversions

although it's possible that Tejun has a newer version in some other
branch. Tejun - maybe point Mauro at something he can try out if you
have updated the conversion since?

Linus

2024-02-20 18:19:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/8] usb: core: hcd: Convert from tasklet to BH workqueue

Hello,

On Tue, Feb 20, 2024 at 09:55:30AM -0800, Linus Torvalds wrote:
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git
> refs/heads/for-6.9-bh-conversions
>
> although it's possible that Tejun has a newer version in some other
> branch. Tejun - maybe point Mauro at something he can try out if you
> have updated the conversion since?

Just pushed out the following branch for testing.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-6.9-bh-conversions-test

It's the same branch but combined with the current linus#master to avoid the
rc1 wonkiness.

Thanks.

--
tejun

2024-02-20 19:37:00

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 5/8] usb: core: hcd: Convert from tasklet to BH workqueue

On Tue, 20 Feb 2024, Linus Torvalds wrote:

>Mauro - the BH workqueue should provide the same kind of latency as
>the tasklets, and it would be good to validate early that yes, this
>workqueue conversion works well in practice. Since you have an actual
>real-life test-case, could you give it a try?

In general I think it's worth pointing out that future conversions should
still aim for an equivalent in task context, and now with disable/enable_work
a lot opens up for regular wq conversions. If users/maintainers shout about
latency, then use BH wq.

Thanks,
Davidlohr

2024-02-20 19:44:47

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

On Wed, Jan 31 2024 at 7:19P -0500,
Mike Snitzer <[email protected]> wrote:

> On Wed, Jan 31 2024 at 7:04P -0500,
> Tejun Heo <[email protected]> wrote:
>
> > Hello, Linus.
> >
> > On Wed, Jan 31, 2024 at 03:19:01PM -0800, Linus Torvalds wrote:
> > > On Wed, 31 Jan 2024 at 13:32, Tejun Heo <[email protected]> wrote:
> > > >
> > > > I don't know, so just did the dumb thing. If the caller always guarantees
> > > > that the work items are never queued at the same time, reusing is fine.
> > >
> > > So the reason I thought it would be a good cleanup to introduce that
> > > "atomic" workqueue thing (now "bh") was that this case literally has a
> > > switch between "use tasklets' or "use workqueues".
> > >
> > > So it's not even about "reusing" the workqueue, it's literally a
> > > matter of making it always just use workqueues, and the switch then
> > > becomes just *which* workqueue to use - system or bh.
> >
> > Yeah, that's how the dm-crypt got converted. The patch just before this one.
> > This one probably can be converted the same way. I don't see the work item
> > being re-initialized. It probably is better to initialize the work item
> > together with the enclosing struct and then just queue it when needed.
>
> Sounds good.
>
> > Mikulas, I couldn't decide what to do with the "try_verify_in_tasklet"
> > option and just decided to do the minimal thing hoping that someone more
> > familiar with the code can take over the actual conversion. How much of user
> > interface commitment is that? Should it be renamed or would it be better to
> > leave it be?
>
> cryptsetup did add support for it, so I think it worthwhile to
> preserve the option; but it'd be fine to have it just be a backward
> compatible alias for a more appropriately named option?

Hey Tejun,

I'm not sure where things stand with the 6.9 workqueue changes to add
BH workqueue. I had a look at your various branches and I'm not
seeing where you might have staged any conversion patches (like this
dm-verity one).

I just staged various unrelated dm-verity and dm-crypt 6.8 fixes from
Mikulas that I'll be sending to Linus later this week (for v6.8-rc6).
Those changes required rebasing 'dm-6.9' because of conflicts, here is
the dm-6.9 branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.9

So we'll definitely need to rebase your changes on dm-6.9 to convert
dm-crypt and dm-verity over to your BH workqueue. Are you OK with
doing that or would you prefer I merge some 6.9 workqueue branch that
you have into dm-6.9? And then Mikulas and I work to make the required
DM target conversion changes?

However you'd like to do it: please let me know where you have the
latest 6.9 code the adds BH workqueue (and if you have DM target
conversion code that reflects your latest).

Thanks,
Mike

2024-02-20 20:06:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

Hello, Mike.

On Tue, Feb 20, 2024 at 02:44:29PM -0500, Mike Snitzer wrote:
> I'm not sure where things stand with the 6.9 workqueue changes to add
> BH workqueue. I had a look at your various branches and I'm not
> seeing where you might have staged any conversion patches (like this
> dm-verity one).

Yeah, the branch is for-6.9-bh-conversions in the wq tree but I haven't
queued the DM patches yet. Wanted to make sure the perf signal from TCP
conversion is okay first. FWIW, while Eric still has concerns, the initial
test didn't show any appreciable regression with production memcache
workload on our side.

> I just staged various unrelated dm-verity and dm-crypt 6.8 fixes from
> Mikulas that I'll be sending to Linus later this week (for v6.8-rc6).
> Those changes required rebasing 'dm-6.9' because of conflicts, here is
> the dm-6.9 branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.9
>
> So we'll definitely need to rebase your changes on dm-6.9 to convert
> dm-crypt and dm-verity over to your BH workqueue. Are you OK with
> doing that or would you prefer I merge some 6.9 workqueue branch that
> you have into dm-6.9? And then Mikulas and I work to make the required
> DM target conversion changes?

Oh, if you don't mind, it'd be best if you could pull wq/for-6.9 into dm-6.9
and do the conversions there.

Thank you.

--
tejun

2024-02-22 21:25:35

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

On Tue, Feb 20 2024 at 3:05P -0500,
Tejun Heo <[email protected]> wrote:

> Hello, Mike.
>
> On Tue, Feb 20, 2024 at 02:44:29PM -0500, Mike Snitzer wrote:
> > I'm not sure where things stand with the 6.9 workqueue changes to add
> > BH workqueue. I had a look at your various branches and I'm not
> > seeing where you might have staged any conversion patches (like this
> > dm-verity one).
>
> Yeah, the branch is for-6.9-bh-conversions in the wq tree but I haven't
> queued the DM patches yet. Wanted to make sure the perf signal from TCP
> conversion is okay first. FWIW, while Eric still has concerns, the initial
> test didn't show any appreciable regression with production memcache
> workload on our side.
>
> > I just staged various unrelated dm-verity and dm-crypt 6.8 fixes from
> > Mikulas that I'll be sending to Linus later this week (for v6.8-rc6).
> > Those changes required rebasing 'dm-6.9' because of conflicts, here is
> > the dm-6.9 branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.9
> >
> > So we'll definitely need to rebase your changes on dm-6.9 to convert
> > dm-crypt and dm-verity over to your BH workqueue. Are you OK with
> > doing that or would you prefer I merge some 6.9 workqueue branch that
> > you have into dm-6.9? And then Mikulas and I work to make the required
> > DM target conversion changes?
>
> Oh, if you don't mind, it'd be best if you could pull wq/for-6.9 into dm-6.9
> and do the conversions there.
>
> Thank you.

I've rebased and made the changes available in this dm-6.9-bh-wq branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.9-bh-wq

I left both DM conversion commits attributed to you despite the
rebase (and the churn of renames I did in the dm-verity commit); hopefully
you're fine with that, but if not I can split the renames out.

I successfully tested using the cryptsetup testsuite ('make check').
And I've also added these changes to linux-next, via linux-dm.git's
'for-next':
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

Mikulas, feel free to review/test and possibly add another patch for
dm-verity so that it only uses a single work_struct in struct dm_verity_io
(see the sketch below).

Thanks,
Mike
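
A rough sketch of that single-work_struct idea, using invented names rather
than the actual dm-verity structures (only the workqueue API calls are real):

	#include <linux/workqueue.h>

	/*
	 * Sketch only: keep one work_struct in the per-I/O struct and pick
	 * the target workqueue at queueing time, instead of carrying
	 * separate work items for the BH and non-BH paths.
	 */
	struct verity_io_sketch {
		struct work_struct work;	/* single work item for both paths */
		bool use_bh;			/* verify in softirq context? */
	};

	static void verity_work_sketch(struct work_struct *w)
	{
		/* the real code would run the verification logic from here */
	}

	static void verity_queue_io_sketch(struct verity_io_sketch *io,
					   struct workqueue_struct *verify_wq)
	{
		INIT_WORK(&io->work, verity_work_sketch);
		queue_work(io->use_bh ? system_bh_wq : verify_wq, &io->work);
	}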

2024-02-23 17:22:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 8/8] dm-verity: Convert from tasklet to BH workqueue

On Thu, Feb 22, 2024 at 04:24:45PM -0500, Mike Snitzer wrote:
> I've rebased and made the changes available in this dm-6.9-bh-wq branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.9-bh-wq
>
> I left both DM conversion commits attributed to you despite the
> rebase (and the churn of renames I did in the dm-verity commit); hopefully
> you're fine with that, but if not I can split the renames out.

Looks good to me.

Thanks.

--
tejun

2024-02-26 02:01:41

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

On Sun, Feb 04, 2024 at 11:29:46AM -1000, Tejun Heo wrote:
> >From 4cb1ef64609f9b0254184b2947824f4b46ccab22 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <[email protected]>
> Date: Sun, 4 Feb 2024 11:28:06 -1000
>
> The only generic interface to execute asynchronously in the BH context is
> tasklet; however, it's marked deprecated and has some design flaws such as
> the execution code accessing the tasklet item after the execution is
> complete which can lead to subtle use-after-free in certain usage scenarios
> and less-developed flush and cancel mechanisms.
>
> This patch implements BH workqueues which share the same semantics and
> features of regular workqueues but execute their work items in the softirq
> context. As there is always only one BH execution context per CPU, none of
> the concurrency management mechanisms applies and a BH workqueue can be
> thought of as a convenience wrapper around softirq.
>
> Except for the inability to sleep while executing and lack of max_active
> adjustments, BH workqueues and work items should behave the same as regular
> workqueues and work items.
>
> Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
> convert all tasklet users over to BH workqueues. Once the conversion is
> complete, tasklet can be removed and BH workqueues can directly take over
> the tasklet softirqs.
>
> system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
> tasklet, all existing tasklet users should be able to use the system BH
> workqueues without creating their own workqueues.
>
> v3: - Add missing interrupt.h include.
>
> v2: - Instead of using tasklets, hook directly into its softirq action
> functions - tasklet[_hi]_action(). This is slightly cheaper and closer
> to the eventual code structure we want to arrive at. Suggested by Lai.
>
> - Lai also pointed out several places which need NULL worker->task
> handling or can use clarification. Updated.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
> Tested-by: Allen Pais <[email protected]>
> Reviewed-by: Lai Jiangshan <[email protected]>
> ---
> Documentation/core-api/workqueue.rst | 29 ++-
> include/linux/workqueue.h | 11 +
> kernel/softirq.c | 3 +
> kernel/workqueue.c | 291 ++++++++++++++++++++++-----
> tools/workqueue/wq_dump.py | 11 +-
> 5 files changed, 285 insertions(+), 60 deletions(-)
>
> diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
> index 33c4539155d9..2d6af6c4665c 100644
> --- a/Documentation/core-api/workqueue.rst
> +++ b/Documentation/core-api/workqueue.rst
> @@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work
> item pointing to that function and queue that work item on a
> workqueue.
>
> -Special purpose threads, called worker threads, execute the functions
> -off of the queue, one after the other. If no work is queued, the
> -worker threads become idle. These worker threads are managed in so
> -called worker-pools.
> +A work item can be executed in either a thread or the BH (softirq) context.
> +
> +For threaded workqueues, special purpose threads, called [k]workers, execute
> +the functions off of the queue, one after the other. If no work is queued,
> +the worker threads become idle. These worker threads are managed in
> +worker-pools.
>
> The cmwq design differentiates between the user-facing workqueues that
> subsystems and drivers queue work items on and the backend mechanism
> @@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra
> worker-pools to serve work items queued on unbound workqueues - the
> number of these backing pools is dynamic.
>
> +BH workqueues use the same framework. However, as there can only be one
> +concurrent execution context, there's no need to worry about concurrency.
> +Each per-CPU BH worker pool contains only one pseudo worker which represents
> +the BH execution context. A BH workqueue can be considered a convenience
> +interface to softirq.
> +
> Subsystems and drivers can create and queue work items through special
> workqueue API functions as they see fit. They can influence some
> aspects of the way the work items are executed by setting flags on the
> @@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will
> be queued on the worklist of either normal or highpri worker-pool that
> is associated to the CPU the issuer is running on.
>
> -For any worker pool implementation, managing the concurrency level
> +For any thread pool implementation, managing the concurrency level
> (how many execution contexts are active) is an important issue. cmwq
> tries to keep the concurrency at a minimal but sufficient level.
> Minimal to save resources and sufficient in that the system is used at
> @@ -164,6 +172,17 @@ resources, scheduled and executed.
> ``flags``
> ---------
>
> +``WQ_BH``
> + BH workqueues can be considered a convenience interface to softirq. BH
> + workqueues are always per-CPU and all BH work items are executed in the
> + queueing CPU's softirq context in the queueing order.
> +
> + All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
> + only allowed additional flag.
> +
> + BH work items cannot sleep. All other features such as delayed queueing,
> + flushing and canceling are supported.
> +
> ``WQ_UNBOUND``
> Work items queued to an unbound wq are served by the special
> worker-pools which host workers which are not bound to any
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 232baea90a1d..283d7891b4c4 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -353,6 +353,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
> * Documentation/core-api/workqueue.rst.
> */
> enum wq_flags {
> + WQ_BH = 1 << 0, /* execute in bottom half (softirq) context */
> WQ_UNBOUND = 1 << 1, /* not bound to any cpu */
> WQ_FREEZABLE = 1 << 2, /* freeze during suspend */
> WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
> @@ -392,6 +393,9 @@ enum wq_flags {
> __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
> __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
> __WQ_ORDERED_EXPLICIT = 1 << 19, /* internal: alloc_ordered_workqueue() */
> +
> + /* BH wq only allows the following flags */
> + __WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
> };
>
> enum wq_consts {
> @@ -434,6 +438,9 @@ enum wq_consts {
> * they are same as their non-power-efficient counterparts - e.g.
> * system_power_efficient_wq is identical to system_wq if
> * 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
> + *
> + * system_bh[_highpri]_wq are convenience interface to softirq. BH work items
> + * are executed in the queueing CPU's BH context in the queueing order.
> */
> extern struct workqueue_struct *system_wq;
> extern struct workqueue_struct *system_highpri_wq;
> @@ -442,6 +449,10 @@ extern struct workqueue_struct *system_unbound_wq;
> extern struct workqueue_struct *system_freezable_wq;
> extern struct workqueue_struct *system_power_efficient_wq;
> extern struct workqueue_struct *system_freezable_power_efficient_wq;
> +extern struct workqueue_struct *system_bh_wq;
> +extern struct workqueue_struct *system_bh_highpri_wq;
> +
> +void workqueue_softirq_action(bool highpri);
>
> /**
> * alloc_workqueue - allocate a workqueue
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 210cf5f8d92c..547d282548a8 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -27,6 +27,7 @@
> #include <linux/tick.h>
> #include <linux/irq.h>
> #include <linux/wait_bit.h>
> +#include <linux/workqueue.h>
>
> #include <asm/softirq_stack.h>
>
> @@ -802,11 +803,13 @@ static void tasklet_action_common(struct softirq_action *a,
>
> static __latent_entropy void tasklet_action(struct softirq_action *a)
> {
> + workqueue_softirq_action(false);
> tasklet_action_common(a, this_cpu_ptr(&tasklet_vec), TASKLET_SOFTIRQ);
> }
>
> static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
> {
> + workqueue_softirq_action(true);
> tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
> }
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 767971a29c7a..78b4b992e1a3 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -29,6 +29,7 @@
> #include <linux/kernel.h>
> #include <linux/sched.h>
> #include <linux/init.h>
> +#include <linux/interrupt.h>
> #include <linux/signal.h>
> #include <linux/completion.h>
> #include <linux/workqueue.h>
> @@ -72,8 +73,12 @@ enum worker_pool_flags {
> * Note that DISASSOCIATED should be flipped only while holding
> * wq_pool_attach_mutex to avoid changing binding state while
> * worker_attach_to_pool() is in progress.
> + *
> + * As there can only be one concurrent BH execution context per CPU, a
> + * BH pool is per-CPU and always DISASSOCIATED.
> */
> - POOL_MANAGER_ACTIVE = 1 << 0, /* being managed */
> + POOL_BH = 1 << 0, /* is a BH pool */
> + POOL_MANAGER_ACTIVE = 1 << 1, /* being managed */
> POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
> };
>
> @@ -115,6 +120,14 @@ enum wq_internal_consts {
> WQ_NAME_LEN = 32,
> };
>
> +/*
> + * We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
> + * MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
> + * msecs_to_jiffies() can't be an initializer.
> + */
> +#define BH_WORKER_JIFFIES msecs_to_jiffies(2)
> +#define BH_WORKER_RESTARTS 10

Sorry, late to the party, but I wonder how this plays along with CPU
hotplug. Say we've queued a lot of BH work on a CPU and we then offline
that CPU; wouldn't that end up with some BH work items left on that CPU
not being executed?

[Cc Thomas]

Regards,
Boqun

> +
[..]
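
For reference, a minimal conversion sketch against the interface added by the
patch above. The driver names below are made up; INIT_WORK(), queue_work(),
cancel_work_sync() and system_bh_wq are the real API:

	#include <linux/container_of.h>
	#include <linux/workqueue.h>

	/* Illustrative driver state: the work_struct replaces a tasklet_struct. */
	struct drv_data {
		struct work_struct bh_work;
	};

	/* Runs in softirq context like the old tasklet callback; must not sleep. */
	static void drv_bh_workfn(struct work_struct *work)
	{
		struct drv_data *drv = container_of(work, struct drv_data, bh_work);

		/* bottom-half processing goes here */
		(void)drv;
	}

	static void drv_init(struct drv_data *drv)
	{
		INIT_WORK(&drv->bh_work, drv_bh_workfn);	/* was tasklet_setup() */
	}

	static void drv_kick(struct drv_data *drv)
	{
		queue_work(system_bh_wq, &drv->bh_work);	/* was tasklet_schedule() */
	}

	static void drv_shutdown(struct drv_data *drv)
	{
		cancel_work_sync(&drv->bh_work);		/* was tasklet_kill() */
	}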

2024-02-26 18:47:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] workqueue: Implement BH workqueues to eventually replace tasklets

Hello,

On Sun, Feb 25, 2024 at 06:00:48PM -0800, Boqun Feng wrote:
> Sorry, late to the party, but I wonder how this plays along with CPU
> hotplug. Say we've queued a lot of BH work on a CPU and we then offline
> that CPU; wouldn't that end up with some BH work items left on that CPU
> not being executed?

Ah, good point. Tasklets get transferred off an offlined CPU, but the
workqueue counterpart doesn't do that yet. I'll fix that.

Thanks.

--
tejun
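
For context, a heavily simplified sketch of the tasklet behavior referred to
above: at CPUHP_SOFTIRQ_DEAD the dead CPU's pending tasklets are spliced onto
the current CPU's list and the softirq is raised. Details (the tasklet_hi
list, return handling) are trimmed; see takeover_tasklets() in kernel/softirq.c
for the real code:

	/* Simplified: splice the offlined CPU's pending tasklets onto this CPU. */
	static int takeover_tasklets_sketch(unsigned int cpu)
	{
		/* CPU is dead, so no lock needed. */
		local_irq_disable();

		if (&per_cpu(tasklet_vec, cpu).head != per_cpu(tasklet_vec, cpu).tail) {
			*__this_cpu_read(tasklet_vec.tail) = per_cpu(tasklet_vec, cpu).head;
			__this_cpu_write(tasklet_vec.tail, per_cpu(tasklet_vec, cpu).tail);
			per_cpu(tasklet_vec, cpu).head = NULL;
			per_cpu(tasklet_vec, cpu).tail = &per_cpu(tasklet_vec, cpu).head;
		}
		raise_softirq_irqoff(TASKLET_SOFTIRQ);

		local_irq_enable();
		return 0;
	}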

2024-02-27 01:39:12

by Tejun Heo

[permalink] [raw]
Subject: [PATCH for-6.9] workqueue: Drain BH work items on hot-unplugged CPUs

Boqun pointed out that workqueues aren't handling BH work items on offlined
CPUs. Unlike tasklet which transfers out the pending tasks from
CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is
problematic. Note that this behavior is specific to BH workqueues as the
non-BH per-CPU workers just become unbound when the CPU goes offline.

This patch fixes the issue by draining the pending BH work items from an
offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context,
it's not as easy to transfer the pending work items from one pool to
another. Instead, run BH work items which execute the offlined pools on an
online CPU.

Note that this assumes that no further BH work items will be queued on the
offlined CPUs. This assumption is shared with tasklet and should be fine for
conversions. However, this issue also exists for per-CPU workqueues which
will just keep executing work items queued after CPU offline on unbound
workers and workqueue should reject per-CPU and BH work items queued on
offline CPUs. This will be addressed separately later.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Boqun Feng <[email protected]>
Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux
---
include/linux/workqueue.h | 1
kernel/softirq.c | 2 +
kernel/workqueue.c | 91 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 91 insertions(+), 3 deletions(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -458,6 +458,7 @@ extern struct workqueue_struct *system_b
extern struct workqueue_struct *system_bh_highpri_wq;

void workqueue_softirq_action(bool highpri);
+void workqueue_softirq_dead(unsigned int cpu);

/**
* alloc_workqueue - allocate a workqueue
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -932,6 +932,8 @@ static void run_ksoftirqd(unsigned int c
#ifdef CONFIG_HOTPLUG_CPU
static int takeover_tasklets(unsigned int cpu)
{
+ workqueue_softirq_dead(cpu);
+
/* CPU is dead, so no lock needed. */
local_irq_disable();

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -81,6 +81,7 @@ enum worker_pool_flags {
POOL_BH = 1 << 0, /* is a BH pool */
POOL_MANAGER_ACTIVE = 1 << 1, /* being managed */
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
+ POOL_BH_DRAINING = 1 << 3, /* draining after CPU offline */
};

enum worker_flags {
@@ -1218,7 +1219,9 @@ static struct irq_work *bh_pool_irq_work
static void kick_bh_pool(struct worker_pool *pool)
{
#ifdef CONFIG_SMP
- if (unlikely(pool->cpu != smp_processor_id())) {
+ /* see drain_dead_softirq_workfn() for BH_DRAINING */
+ if (unlikely(pool->cpu != smp_processor_id() &&
+ !(pool->flags & POOL_BH_DRAINING))) {
irq_work_queue_on(bh_pool_irq_work(pool), pool->cpu);
return;
}
@@ -3155,6 +3158,7 @@ __acquires(&pool->lock)
struct worker_pool *pool = worker->pool;
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
+ bool bh_draining = pool->flags & POOL_BH_DRAINING;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -3220,7 +3224,9 @@ __acquires(&pool->lock)

rcu_start_depth = rcu_preempt_depth();
lockdep_start_depth = lockdep_depth(current);
- lock_map_acquire(&pwq->wq->lockdep_map);
+ /* see drain_dead_softirq_workfn() */
+ if (!bh_draining)
+ lock_map_acquire(&pwq->wq->lockdep_map);
lock_map_acquire(&lockdep_map);
/*
* Strictly speaking we should mark the invariant state without holding
@@ -3253,7 +3259,8 @@ __acquires(&pool->lock)
trace_workqueue_execute_end(work, worker->current_func);
pwq->stats[PWQ_STAT_COMPLETED]++;
lock_map_release(&lockdep_map);
- lock_map_release(&pwq->wq->lockdep_map);
+ if (!bh_draining)
+ lock_map_release(&pwq->wq->lockdep_map);

if (unlikely((worker->task && in_atomic()) ||
lockdep_depth(current) != lockdep_start_depth ||
@@ -3615,6 +3622,84 @@ void workqueue_softirq_action(bool highp
bh_worker(list_first_entry(&pool->workers, struct worker, node));
}

+struct wq_drain_dead_softirq_work {
+ struct work_struct work;
+ struct worker_pool *pool;
+ struct completion done;
+};
+
+static void drain_dead_softirq_workfn(struct work_struct *work)
+{
+ struct wq_drain_dead_softirq_work *dead_work =
+ container_of(work, struct wq_drain_dead_softirq_work, work);
+ struct worker_pool *pool = dead_work->pool;
+ bool repeat;
+
+ /*
+ * @pool's CPU is dead and we want to execute its still pending work
+ * items from this BH work item which is running on a different CPU. As
+ * its CPU is dead, @pool can't be kicked and, as work execution path
+ * will be nested, a lockdep annotation needs to be suppressed. Mark
+ * @pool with %POOL_BH_DRAINING for the special treatments.
+ */
+ raw_spin_lock_irq(&pool->lock);
+ pool->flags |= POOL_BH_DRAINING;
+ raw_spin_unlock_irq(&pool->lock);
+
+ bh_worker(list_first_entry(&pool->workers, struct worker, node));
+
+ raw_spin_lock_irq(&pool->lock);
+ pool->flags &= ~POOL_BH_DRAINING;
+ repeat = need_more_worker(pool);
+ raw_spin_unlock_irq(&pool->lock);
+
+ /*
+ * bh_worker() might hit consecutive execution limit and bail. If there
+ * still are pending work items, reschedule self and return so that we
+ * don't hog this CPU's BH.
+ */
+ if (repeat) {
+ if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ queue_work(system_bh_highpri_wq, work);
+ else
+ queue_work(system_bh_wq, work);
+ } else {
+ complete(&dead_work->done);
+ }
+}
+
+/*
+ * @cpu is dead. Drain the remaining BH work items on the current CPU. It's
+ * possible to allocate dead_work per CPU and avoid flushing. However, then we
+ * have to worry about draining overlapping with CPU coming back online or
+ * nesting (one CPU's dead_work queued on another CPU which is also dead and so
+ * on). Let's keep it simple and drain them synchronously. These are BH work
+ * items which shouldn't be requeued on the same pool. Shouldn't take long.
+ */
+void workqueue_softirq_dead(unsigned int cpu)
+{
+ int i;
+
+ for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
+ struct worker_pool *pool = &per_cpu(bh_worker_pools, cpu)[i];
+ struct wq_drain_dead_softirq_work dead_work;
+
+ if (!need_more_worker(pool))
+ continue;
+
+ INIT_WORK(&dead_work.work, drain_dead_softirq_workfn);
+ dead_work.pool = pool;
+ init_completion(&dead_work.done);
+
+ if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ queue_work(system_bh_highpri_wq, &dead_work.work);
+ else
+ queue_work(system_bh_wq, &dead_work.work);
+
+ wait_for_completion(&dead_work.done);
+ }
+}
+
/**
* check_flush_dependency - check for flush dependency sanity
* @target_wq: workqueue being flushed

2024-02-29 20:38:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH for-6.9] workqueue: Drain BH work items on hot-unplugged CPUs

On Mon, Feb 26, 2024 at 03:38:55PM -1000, Tejun Heo wrote:
> Boqun pointed out that workqueues aren't handling BH work items on offlined
> CPUs. Unlike tasklet which transfers out the pending tasks from
> CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is
> problematic. Note that this behavior is specific to BH workqueues as the
> non-BH per-CPU workers just become unbound when the CPU goes offline.
>
> This patch fixes the issue by draining the pending BH work items from an
> offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context,
> it's not as easy to transfer the pending work items from one pool to
> another. Instead, run BH work items which execute the offlined pools on an
> online CPU.
>
> Note that this assumes that no further BH work items will be queued on the
> offlined CPUs. This assumption is shared with tasklet and should be fine for
> conversions. However, this issue also exists for per-CPU workqueues which
> will just keep executing work items queued after CPU offline on unbound
> workers and workqueue should reject per-CPU and BH work items queued on
> offline CPUs. This will be addressed separately later.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Boqun Feng <[email protected]>
> Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux

Applying this to wq/for-6.9.

Thanks.

--
tejun

2024-02-29 21:13:05

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH for-6.9] workqueue: Drain BH work items on hot-unplugged CPUs

On Thu, Feb 29, 2024 at 10:37:28AM -1000, Tejun Heo wrote:
> On Mon, Feb 26, 2024 at 03:38:55PM -1000, Tejun Heo wrote:
> > Boqun pointed out that workqueues aren't handling BH work items on offlined
> > CPUs. Unlike tasklet which transfers out the pending tasks from
> > CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is
> > problematic. Note that this behavior is specific to BH workqueues as the
> > non-BH per-CPU workers just become unbound when the CPU goes offline.
> >
> > This patch fixes the issue by draining the pending BH work items from an
> > offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context,
> > it's not as easy to transfer the pending work items from one pool to
> > another. Instead, run BH work items which execute the offlined pools on an
> > online CPU.
> >
> > Note that this assumes that no further BH work items will be queued on the
> > offlined CPUs. This assumption is shared with tasklet and should be fine for
> > conversions. However, this issue also exists for per-CPU workqueues which
> > will just keep executing work items queued after CPU offline on unbound
> > workers and workqueue should reject per-CPU and BH work items queued on
> > offline CPUs. This will be addressed separately later.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Reported-by: Boqun Feng <[email protected]>
> > Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux
>
> Applying this to wq/for-6.9.
>

FWIW,

Reviewed-by: Boqun Feng <[email protected]>

(I took a look yesterday but hadn't gotten the time to reply.)

Regards,
Boqun

> Thanks.
>
> --
> tejun