2018-03-06 17:29:28

by Tejun Heo

Subject: [PATCHSET] percpu_ref, RCU: Audit RCU usages in percpu_ref users

Hello,

Jann Horn found that aio was depending on the internal RCU grace
periods of percpu-ref and that it's broken because aio uses regular
RCU while percpu_ref uses sched-RCU.

Depending on percpu_ref's internal grace periods isn't a good idea
because

* The RCU type might not match.

* percpu_ref's grace periods are used to switch to atomic mode. There
is no grace period between the last put and the invocation of the
release callback. This is easy to get confused about and can lead to
subtle bugs.

* percpu_ref might not have grace periods at all depending on its
current operation mode.

This patchset audits all percpu_ref users for their RCU usages,
clarifies in the percpu_ref documentation that the internal grace periods
must not be depended upon, and introduces rcu_work to simplify
bouncing to a workqueue after an RCU grace period.
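
To make this concrete, below is a minimal sketch of the pattern the
fixes move the affected code towards (hypothetical struct foo and
foo_tree, not code from any of the patches): the lookup side uses
regular RCU, so the release path has to provide its own regular-RCU
grace period instead of leaning on whatever percpu_ref happens to do
internally.

static RADIX_TREE(foo_tree, GFP_KERNEL);  /* hypothetical RCU-protected index */

struct foo {
        struct percpu_ref ref;
        struct rcu_head rcu;
};

static struct foo *foo_lookup(unsigned long id)
{
        struct foo *foo;

        rcu_read_lock();
        foo = radix_tree_lookup(&foo_tree, id);
        if (foo && !percpu_ref_tryget_live(&foo->ref))
                foo = NULL;
        rcu_read_unlock();
        return foo;
}

static void foo_free_rcu(struct rcu_head *head)
{
        kfree(container_of(head, struct foo, rcu));
}

/* percpu_ref release callback: bounce the free through an explicit grace period */
static void foo_release(struct percpu_ref *ref)
{
        call_rcu(&container_of(ref, struct foo, ref)->rcu, foo_free_rcu);
}

With that, a reader which raced with the final put is guaranteed to
finish before kfree(), no matter which RCU flavor percpu_ref uses for
its own mode switching.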

This patchset contains the following seven patches.

0001-fs-aio-Add-explicit-RCU-grace-period-when-freeing-ki.patch
0002-fs-aio-Use-RCU-accessors-for-kioctx_table-table.patch
0003-RDMAVT-Fix-synchronization-around-percpu_ref.patch
0004-HMM-Remove-superflous-RCU-protection-around-radix-tr.patch
0005-block-Remove-superflous-rcu_read_-un-lock_sched-in-b.patch
0006-percpu_ref-Update-doc-to-dissuade-users-from-dependi.patch
0007-RCU-workqueue-Implement-rcu_work.patch

0001-0003 are fixes and tagged -stable.

0004-0005 remove (seemingly) superfluous RCU read lock usages.

0006 updates the doc and 0007 introduces rcu_work.

This patchset is also available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git percpu_ref-rcu-audit

diffstat follows. Thanks.

block/blk-core.c | 2 -
drivers/infiniband/sw/rdmavt/mr.c | 10 ++++---
fs/aio.c | 39 ++++++++++++++++-----------
include/linux/cgroup-defs.h | 2 -
include/linux/percpu-refcount.h | 18 ++++++++----
include/linux/workqueue.h | 38 ++++++++++++++++++++++++++
kernel/cgroup/cgroup.c | 21 ++++----------
kernel/workqueue.c | 54 ++++++++++++++++++++++++++++++++++++++
lib/percpu-refcount.c | 2 +
mm/hmm.c | 12 +-------
10 files changed, 145 insertions(+), 53 deletions(-)

--
tejun


2018-03-06 17:35:09

by Tejun Heo

Subject: [PATCH 1/7] fs/aio: Add explicit RCU grace period when freeing kioctx

While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat")
incorrectly removed the explicit RCU grace period before freeing
kioctx. The intention seems to have been to depend on the internal RCU
grace periods of percpu_ref; however, percpu_ref uses a different
flavor of RCU, sched-RCU. This can lead to the kioctx being freed
while RCU-protected dereferences are still in progress.

Fix it by updating free_ioctx() to go through call_rcu() explicitly.

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Jann Horn <[email protected]>
Fixes: e34ecee2ae79 ("aio: Fix a trinity splat")
Cc: Kent Overstreet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
---
fs/aio.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index a062d75..eb2e0cf 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,7 +115,8 @@ struct kioctx {
struct page **ring_pages;
long nr_pages;

- struct work_struct free_work;
+ struct rcu_head free_rcu;
+ struct work_struct free_work; /* see free_ioctx() */

/*
* signals when all in-flight requests are done
@@ -588,6 +589,12 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
return cancel(&kiocb->common);
}

+/*
+ * free_ioctx() should be RCU delayed to synchronize against the RCU
+ * protected lookup_ioctx() and also needs process context to call
+ * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
+ * ->free_work.
+ */
static void free_ioctx(struct work_struct *work)
{
struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -601,6 +608,14 @@ static void free_ioctx(struct work_struct *work)
kmem_cache_free(kioctx_cachep, ctx);
}

+static void free_ioctx_rcufn(struct rcu_head *head)
+{
+ struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
+
+ INIT_WORK(&ctx->free_work, free_ioctx);
+ schedule_work(&ctx->free_work);
+}
+
static void free_ioctx_reqs(struct percpu_ref *ref)
{
struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
@@ -609,8 +624,8 @@ static void free_ioctx_reqs(struct percpu_ref *ref)
if (ctx->rq_wait && atomic_dec_and_test(&ctx->rq_wait->count))
complete(&ctx->rq_wait->comp);

- INIT_WORK(&ctx->free_work, free_ioctx);
- schedule_work(&ctx->free_work);
+ /* Synchronize against RCU protected table->table[] dereferences */
+ call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
}

/*
@@ -838,7 +853,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
table->table[ctx->id] = NULL;
spin_unlock(&mm->ioctx_lock);

- /* percpu_ref_kill() will do the necessary call_rcu() */
+ /* free_ioctx_reqs() will do the necessary RCU synchronization */
wake_up_all(&ctx->wait);

/*
--
2.9.5


2018-03-06 17:36:21

by Tejun Heo

Subject: [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref

rvt_mregion uses percpu_ref for reference counting and RCU to protect
accesses from lkey_table. When a rvt_mregion needs to be freed, it
first gets unregistered from lkey_table and then rvt_check_refs() is
called to wait for in-flight usages before the rvt_mregion is freed.

rvt_check_refs() seems to have a couple issues.

* It has a fast exit path which tests percpu_ref_is_zero(). However,
a percpu_ref reading zero doesn't mean that the object can be
released. In fact, the ->release() callback might not even have
started executing yet. Proceeding with freeing can lead to
use-after-free.

* lkey_table is RCU protected but there is no RCU grace period in the
free path. percpu_ref uses RCU internally but it's sched-RCU whose
grace periods are different from regular RCU. Also, it generally
isn't a good idea to depend on internal behaviors like this.

To address the above issues, this patch removes the the fast exit and
adds an explicit synchronize_rcu().
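
Put differently, the teardown needs both an RCU grace period for the
lkey_table readers and confirmation that ->release() has actually run,
roughly as below (sketch only; the release callback name is made up,
the fields match rvt_mregion):

/* hypothetical percpu_ref release callback */
static void rvt_mr_release(struct percpu_ref *ref)
{
        struct rvt_mregion *mr = container_of(ref, struct rvt_mregion, refcount);

        complete(&mr->comp);
}

/* free path (@mr points to the rvt_mregion being torn down) */
percpu_ref_kill(&mr->refcount);         /* no new references */
synchronize_rcu();                      /* lkey_table readers are done */
wait_for_completion(&mr->comp);         /* ->release() has finished running */
/* only now is it safe to free @mr */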

Signed-off-by: Tejun Heo <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
---
Hello, Dennis, Mike.

I don't know RDMA at all and this patch is only compile tested. Can
you please take a careful look?

Thanks.

drivers/infiniband/sw/rdmavt/mr.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 1b2e536..cc429b5 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -489,11 +489,13 @@ static int rvt_check_refs(struct rvt_mregion *mr, const char *t)
unsigned long timeout;
struct rvt_dev_info *rdi = ib_to_rvt(mr->pd->device);

- if (percpu_ref_is_zero(&mr->refcount))
- return 0;
- /* avoid dma mr */
- if (mr->lkey)
+ if (mr->lkey) {
+ /* avoid dma mr */
rvt_dereg_clean_qps(mr);
+ /* @mr was indexed on rcu protected @lkey_table */
+ synchronize_rcu();
+ }
+
timeout = wait_for_completion_timeout(&mr->comp, 5 * HZ);
if (!timeout) {
rvt_pr_err(rdi,
--
2.9.5


2018-03-06 17:36:42

by Tejun Heo

Subject: [PATCH 2/7] fs/aio: Use RCU accessors for kioctx_table->table[]

While converting ioctx index from a list to a table, db446a08c23d
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors. This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the appropriate RCU
accessors when using the field.
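
For reference, the accessor pattern being applied is roughly the
following (sketch with a hypothetical table, not the aio code itself):

struct bar_table {
        unsigned nr;
        struct bar __rcu *slot[];               /* array of RCU-protected pointers */
};

/* writer, under the lock that serializes updates */
rcu_assign_pointer(tbl->slot[i], new_bar);      /* publish with ordering */
RCU_INIT_POINTER(tbl->slot[i], NULL);           /* clearing needs no ordering */

/* reader, inside rcu_read_lock()/rcu_read_unlock() */
bar = rcu_dereference(tbl->slot[i]);

/* NULL test without creating a dereferenceable pointer */
if (!rcu_access_pointer(tbl->slot[i]))
        /* slot is empty */;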

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Jann Horn <[email protected]>
Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
---
fs/aio.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index eb2e0cf..6bcd3fb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -68,9 +68,9 @@ struct aio_ring {
#define AIO_RING_PAGES 8

struct kioctx_table {
- struct rcu_head rcu;
- unsigned nr;
- struct kioctx *table[];
+ struct rcu_head rcu;
+ unsigned nr;
+ struct kioctx __rcu *table[];
};

struct kioctx_cpu {
@@ -330,7 +330,7 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
for (i = 0; i < table->nr; i++) {
struct kioctx *ctx;

- ctx = table->table[i];
+ ctx = rcu_dereference(table->table[i]);
if (ctx && ctx->aio_ring_file == file) {
if (!atomic_read(&ctx->dead)) {
ctx->user_id = ctx->mmap_base = vma->vm_start;
@@ -666,9 +666,9 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
while (1) {
if (table)
for (i = 0; i < table->nr; i++)
- if (!table->table[i]) {
+ if (!rcu_access_pointer(table->table[i])) {
ctx->id = i;
- table->table[i] = ctx;
+ rcu_assign_pointer(table->table[i], ctx);
spin_unlock(&mm->ioctx_lock);

/* While kioctx setup is in progress,
@@ -849,8 +849,8 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
}

table = rcu_dereference_raw(mm->ioctx_table);
- WARN_ON(ctx != table->table[ctx->id]);
- table->table[ctx->id] = NULL;
+ WARN_ON(ctx != rcu_access_pointer(table->table[ctx->id]));
+ RCU_INIT_POINTER(table->table[ctx->id], NULL);
spin_unlock(&mm->ioctx_lock);

/* free_ioctx_reqs() will do the necessary RCU synchronization */
@@ -895,7 +895,8 @@ void exit_aio(struct mm_struct *mm)

skipped = 0;
for (i = 0; i < table->nr; ++i) {
- struct kioctx *ctx = table->table[i];
+ struct kioctx *ctx =
+ rcu_dereference_protected(table->table[i], true);

if (!ctx) {
skipped++;
@@ -1084,7 +1085,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
if (!table || id >= table->nr)
goto out;

- ctx = table->table[id];
+ ctx = rcu_dereference(table->table[id]);
if (ctx && ctx->user_id == ctx_id) {
percpu_ref_get(&ctx->users);
ret = ctx;
--
2.9.5


2018-03-06 17:36:53

by Tejun Heo

Subject: [PATCH 4/7] HMM: Remove superfluous RCU protection around radix tree lookup

hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
which actually uses the RCU protection. The only caller is
hmm_devmem_pages_create() which already grabs the mutex and does
superfluous rcu_read_lock/unlock() around the function.

This doesn't add anything and just adds to confusion. Remove the RCU
protection and open-code the radix tree lookup. If this needs to
become more sophisticated in the future, let's add them back when
necessary.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
---
Hello, Jérôme.

This came up while auditing percpu_ref users for missing explicit RCU
grace periods. HMM doesn't seem to depend on RCU protection at all,
so I thought it'd be better to remove it for now. It's only compile
tested.

Thanks.

mm/hmm.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98..d4627c5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
hmm_devmem_radix_release(resource);
}

-static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
-{
- WARN_ON_ONCE(!rcu_read_lock_held());
-
- return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
-}
-
static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
{
resource_size_t key, align_start, align_size, align_end;
@@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
struct hmm_devmem *dup;

- rcu_read_lock();
- dup = hmm_devmem_find(key);
- rcu_read_unlock();
+ dup = radix_tree_lookup(&hmm_devmem_radix,
+ key >> PA_SECTION_SHIFT);
if (dup) {
dev_err(device, "%s: collides with mapping for %s\n",
__func__, dev_name(dup->device));
--
2.9.5


2018-03-06 17:37:09

by Tejun Heo

Subject: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()

3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work
reliably") added rcu_read_[un]lock_sched() to blk_queue_enter() along
with other changes but it doesn't seem to be doing anything.

blk_queue_enter() is called with @q - the pointer to the target
request_queue, so the caller obviously has to guarantee that @q can be
dereferenced, and inside the RCU-sched protected area, there's nothing
which needs further protection.

Let's remove the superfluous rcu_read_[un]lock_sched().

Signed-off-by: Tejun Heo <[email protected]>
Cc: Bart Van Assche <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
Hello, Bart.

This came up while auditing percpu_ref users for problematic RCU
usages. I couldn't understand what the RCU protection was doing.
It'd be great if you can take a look and tell me whether I missed
something.

Thanks.

block/blk-core.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f..e3b4d1849 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -827,7 +827,6 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
bool success = false;
int ret;

- rcu_read_lock_sched();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
* The code that sets the PREEMPT_ONLY flag is
@@ -840,7 +839,6 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
percpu_ref_put(&q->q_usage_counter);
}
}
- rcu_read_unlock_sched();

if (success)
return 0;
--
2.9.5


2018-03-06 17:37:15

by Tejun Heo

Subject: [PATCH 6/7] percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods

percpu_ref internally uses sched-RCU to implement the percpu -> atomic
mode switching and the documentation suggested that this could be
depended upon. This doesn't seem like a good idea.

* percpu_ref uses sched-RCU which has different grace periods from regular
RCU. Users may combine percpu_ref with regular RCU usage and
incorrectly believe that regular RCU grace periods are performed by
percpu_ref. This can lead to, for example, use-after-free due to
premature freeing.

* percpu_ref has a grace period when switching from percpu to atomic
mode. It doesn't have one between the last put and release. This
distinction is subtle and can lead to surprising bugs.

* percpu_ref allows starting in and switching to atomic mode manually
for debugging and other purposes. This means that there may not be
any grace periods from kill to release.

This patch makes it clear that the grace periods are percpu_ref's
internal implementation detail and can't be depended upon by the
users.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
include/linux/percpu-refcount.h | 18 ++++++++++++------
lib/percpu-refcount.c | 2 ++
2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 864d167..009cdf3 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -30,10 +30,14 @@
* calls io_destroy() or the process exits.
*
* In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
- * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
- * the kioctx from the proccess's list of kioctxs - after that, there can't be
- * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
- * the initial ref with percpu_ref_put().
+ * removes the kioctx from the proccess's table of kioctxs and kills percpu_ref.
+ * After that, there can't be any new users of the kioctx (from lookup_ioctx())
+ * and it's then safe to drop the initial ref with percpu_ref_put().
+ *
+ * Note that the free path, free_ioctx(), needs to go through explicit call_rcu()
+ * to synchronize with RCU protected lookup_ioctx(). percpu_ref operations don't
+ * imply RCU grace periods of any kind and if a user wants to combine percpu_ref
+ * with RCU protection, it must be done explicitly.
*
* Code that does a two stage shutdown like this often needs some kind of
* explicit synchronization to ensure the initial refcount can only be dropped
@@ -113,8 +117,10 @@ void percpu_ref_reinit(struct percpu_ref *ref);
* Must be used to drop the initial ref on a percpu refcount; must be called
* precisely once before shutdown.
*
- * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
- * percpu counters and dropping the initial ref.
+ * Switches @ref into atomic mode before gathering up the percpu counters
+ * and dropping the initial ref.
+ *
+ * There are no implied RCU grace periods between kill and release.
*/
static inline void percpu_ref_kill(struct percpu_ref *ref)
{
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 30e7dd8..9f96fa7 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -322,6 +322,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
* This function normally doesn't block and can be called from any context
* but it may block if @confirm_kill is specified and @ref is in the
* process of switching to atomic mode by percpu_ref_switch_to_atomic().
+ *
+ * There are no implied RCU grace periods between kill and release.
*/
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill)
--
2.9.5


2018-03-06 17:37:46

by Tejun Heo

Subject: [PATCH 7/7] RCU, workqueue: Implement rcu_work

There are cases where an RCU callback needs to be bounced to a
sleepable context. This is currently done by having the RCU callback
queue a work item, which can be cumbersome to write and confusing to
read.

This patch introduces rcu_work, a workqueue work variant which gets
executed after a RCU grace period, and converts the open coded
bouncing in fs/aio and kernel/cgroup.
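
As a usage sketch (hypothetical struct foo; the real conversions are in
the diff below):

struct foo {
        struct rcu_work free_rwork;
        /* ... */
};

static void foo_free_workfn(struct work_struct *work)
{
        struct foo *foo = container_of(to_rcu_work(work), struct foo, free_rwork);

        /* runs in process context, after a regular RCU grace period */
        kfree(foo);
}

/* replaces open-coded call_rcu() -> schedule_work() bouncing */
INIT_RCU_WORK(&foo->free_rwork, foo_free_workfn);
queue_rcu_work(system_wq, &foo->free_rwork);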

v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
completion of previously queued rcu callback as per Paul.

Signed-off-by: Tejun Heo <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
fs/aio.c | 21 +++++-------------
include/linux/cgroup-defs.h | 2 +-
include/linux/workqueue.h | 38 +++++++++++++++++++++++++++++++
kernel/cgroup/cgroup.c | 21 ++++++------------
kernel/workqueue.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 106 insertions(+), 30 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6bcd3fb..88d7927 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -115,8 +115,7 @@ struct kioctx {
struct page **ring_pages;
long nr_pages;

- struct rcu_head free_rcu;
- struct work_struct free_work; /* see free_ioctx() */
+ struct rcu_work free_rwork; /* see free_ioctx() */

/*
* signals when all in-flight requests are done
@@ -592,13 +591,12 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
/*
* free_ioctx() should be RCU delayed to synchronize against the RCU
* protected lookup_ioctx() and also needs process context to call
- * aio_free_ring(), so the double bouncing through kioctx->free_rcu and
- * ->free_work.
+ * aio_free_ring(). Use rcu_work.
*/
static void free_ioctx(struct work_struct *work)
{
- struct kioctx *ctx = container_of(work, struct kioctx, free_work);
-
+ struct kioctx *ctx = container_of(to_rcu_work(work), struct kioctx,
+ free_rwork);
pr_debug("freeing %p\n", ctx);

aio_free_ring(ctx);
@@ -608,14 +606,6 @@ static void free_ioctx(struct work_struct *work)
kmem_cache_free(kioctx_cachep, ctx);
}

-static void free_ioctx_rcufn(struct rcu_head *head)
-{
- struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);
-
- INIT_WORK(&ctx->free_work, free_ioctx);
- schedule_work(&ctx->free_work);
-}
-
static void free_ioctx_reqs(struct percpu_ref *ref)
{
struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
@@ -625,7 +615,8 @@ static void free_ioctx_reqs(struct percpu_ref *ref)
complete(&ctx->rq_wait->comp);

/* Synchronize against RCU protected table->table[] dereferences */
- call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
+ INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
+ queue_rcu_work(system_wq, &ctx->free_rwork);
}

/*
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9f242b8..92d7640 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -151,8 +151,8 @@ struct cgroup_subsys_state {
atomic_t online_cnt;

/* percpu_ref killing and RCU release */
- struct rcu_head rcu_head;
struct work_struct destroy_work;
+ struct rcu_work destroy_rwork;

/*
* PI: the parent css. Placed here for cache proximity to following
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index bc0cda1..b39f3a4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,7 @@
#include <linux/threads.h>
#include <linux/atomic.h>
#include <linux/cpumask.h>
+#include <linux/rcupdate.h>

struct workqueue_struct;

@@ -120,6 +121,15 @@ struct delayed_work {
int cpu;
};

+struct rcu_work {
+ struct work_struct work;
+ struct rcu_head rcu;
+
+ /* target workqueue and CPU ->rcu uses to queue ->work */
+ struct workqueue_struct *wq;
+ int cpu;
+};
+
/**
* struct workqueue_attrs - A struct for workqueue attributes.
*
@@ -151,6 +161,11 @@ static inline struct delayed_work *to_delayed_work(struct work_struct *work)
return container_of(work, struct delayed_work, work);
}

+static inline struct rcu_work *to_rcu_work(struct work_struct *work)
+{
+ return container_of(work, struct rcu_work, work);
+}
+
struct execute_work {
struct work_struct work;
};
@@ -266,6 +281,12 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
#define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func) \
__INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE)

+#define INIT_RCU_WORK(_work, _func) \
+ INIT_WORK(&(_work)->work, (_func))
+
+#define INIT_RCU_WORK_ONSTACK(_work, _func) \
+ INIT_WORK_ONSTACK(&(_work)->work, (_func))
+
/**
* work_pending - Find out whether a work item is currently pending
* @work: The work item in question
@@ -447,6 +468,8 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *dwork, unsigned long delay);
+extern bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
+ struct rcu_work *rwork);

extern void flush_workqueue(struct workqueue_struct *wq);
extern void drain_workqueue(struct workqueue_struct *wq);
@@ -463,6 +486,8 @@ extern bool flush_delayed_work(struct delayed_work *dwork);
extern bool cancel_delayed_work(struct delayed_work *dwork);
extern bool cancel_delayed_work_sync(struct delayed_work *dwork);

+extern bool flush_rcu_work(struct rcu_work *rwork);
+
extern void workqueue_set_max_active(struct workqueue_struct *wq,
int max_active);
extern struct work_struct *current_work(void);
@@ -520,6 +545,19 @@ static inline bool mod_delayed_work(struct workqueue_struct *wq,
}

/**
+ * queue_rcu_work - queue work on a workqueue after a RCU grace period
+ * @wq: workqueue to use
+ * @rwork: RCU work to queue
+ *
+ * Equivalent to queue_rcu_work_on() but tries to use the local CPU.
+ */
+static inline bool queue_rcu_work(struct workqueue_struct *wq,
+ struct rcu_work *rwork)
+{
+ return queue_rcu_work_on(WORK_CPU_UNBOUND, wq, rwork);
+}
+
+/**
* schedule_work_on - put work task on a specific cpu
* @cpu: cpu to put the work task on
* @work: job to be done
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8cda3bc..4c5d4ca0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4514,10 +4514,10 @@ static struct cftype cgroup_base_files[] = {
* and thus involve punting to css->destroy_work adding two additional
* steps to the already complex sequence.
*/
-static void css_free_work_fn(struct work_struct *work)
+static void css_free_rwork_fn(struct work_struct *work)
{
- struct cgroup_subsys_state *css =
- container_of(work, struct cgroup_subsys_state, destroy_work);
+ struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
+ struct cgroup_subsys_state, destroy_rwork);
struct cgroup_subsys *ss = css->ss;
struct cgroup *cgrp = css->cgroup;

@@ -4563,15 +4563,6 @@ static void css_free_work_fn(struct work_struct *work)
}
}

-static void css_free_rcu_fn(struct rcu_head *rcu_head)
-{
- struct cgroup_subsys_state *css =
- container_of(rcu_head, struct cgroup_subsys_state, rcu_head);
-
- INIT_WORK(&css->destroy_work, css_free_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
-}
-
static void css_release_work_fn(struct work_struct *work)
{
struct cgroup_subsys_state *css =
@@ -4621,7 +4612,8 @@ static void css_release_work_fn(struct work_struct *work)

mutex_unlock(&cgroup_mutex);

- call_rcu(&css->rcu_head, css_free_rcu_fn);
+ INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
+ queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
}

static void css_release(struct percpu_ref *ref)
@@ -4755,7 +4747,8 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
err_list_del:
list_del_rcu(&css->sibling);
err_free_css:
- call_rcu(&css->rcu_head, css_free_rcu_fn);
+ INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
+ queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
return ERR_PTR(err);
}

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bb9a519..e26c2f4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1604,6 +1604,40 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(mod_delayed_work_on);

+static void rcu_work_rcufn(struct rcu_head *rcu)
+{
+ struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);
+
+ /* read the comment in __queue_work() */
+ local_irq_disable();
+ __queue_work(rwork->cpu, rwork->wq, &rwork->work);
+ local_irq_enable();
+}
+
+/**
+ * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
+ * @cpu: CPU number to execute work on
+ * @wq: workqueue to use
+ * @rwork: work to queue
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
+ struct rcu_work *rwork)
+{
+ struct work_struct *work = &rwork->work;
+
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ rwork->wq = wq;
+ rwork->cpu = cpu;
+ call_rcu(&rwork->rcu, rcu_work_rcufn);
+ return true;
+ }
+
+ return false;
+}
+EXPORT_SYMBOL(queue_rcu_work_on);
+
/**
* worker_enter_idle - enter idle state
* @worker: worker which is entering idle state
@@ -3001,6 +3035,26 @@ bool flush_delayed_work(struct delayed_work *dwork)
}
EXPORT_SYMBOL(flush_delayed_work);

+/**
+ * flush_rcu_work - wait for a rwork to finish executing the last queueing
+ * @rwork: the rcu work to flush
+ *
+ * Return:
+ * %true if flush_rcu_work() waited for the work to finish execution,
+ * %false if it was already idle.
+ */
+bool flush_rcu_work(struct rcu_work *rwork)
+{
+ if (test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&rwork->work))) {
+ rcu_barrier();
+ flush_work(&rwork->work);
+ return true;
+ } else {
+ return flush_work(&rwork->work);
+ }
+}
+EXPORT_SYMBOL(flush_rcu_work);
+
static bool __cancel_work(struct work_struct *work, bool is_dwork)
{
unsigned long flags;
--
2.9.5


2018-03-06 17:54:36

by Bart Van Assche

Subject: Re: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()

On Tue, 2018-03-06 at 09:33 -0800, Tejun Heo wrote:
> 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work
> reliably") added rcu_read_[un]lock_sched() to blk_queue_enter() along
> with other changes but it doesn't seem to be doing anything.
>
> blk_queue_enter() is called with @q - the pointer to the target
> request_queue, so the caller obviously has to guarantee that @q can be
> dereferenced, and inside the RCU-sched protected area, there's nothing
> which needs further protection.
>
> Let's remove the superfluous rcu_read_[un]lock_sched().
>
> [ ... ]
>
> This came up while auditing percpu_ref users for problematic RCU
> usages. I couldn't understand what the RCU protection was doing.
> It'd be great if you can take a look and tell me whether I missed
> something.

Hello Tejun,

I think the rcu_read_lock() and rcu_read_unlock() are really necessary in this
code. In the LWN article "The RCU-barrier menagerie" it is explained that RCU
can be used to enforce write ordering globally, provided that the code reading
the ordered writes does so under an RCU read lock (https://lwn.net/Articles/573497/). See
also the following comment in scsi_device_quiesce():

/*
* Ensure that the effect of blk_set_preempt_only() will be visible
* for percpu_ref_tryget() callers that occur after the queue
* unfreeze even if the queue was already frozen before this function
* was called. See also https://lwn.net/Articles/573497/.
*/
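
The pairing can be sketched like this (hypothetical flag standing in
for the PREEMPT_ONLY logic, not the actual block/SCSI code):

static bool flag;

static void writer(void)
{
        WRITE_ONCE(flag, true);
        synchronize_sched();            /* waits for all sched-RCU readers */
        /* every reader now either saw the flag or has already finished */
}

static bool reader(void)
{
        bool ok = false;

        rcu_read_lock_sched();          /* the section this patch removes */
        if (!READ_ONCE(flag))
                ok = true;              /* writer's synchronize_sched() waits for us */
        rcu_read_unlock_sched();
        return ok;
}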

Since this patch introduces a subtle and hard to debug race condition, please
drop this patch.

Thanks,

Bart.


2018-03-06 18:01:02

by Jerome Glisse

Subject: Re: [PATCH 4/7] HMM: Remove superfluous RCU protection around radix tree lookup

On Tue, Mar 06, 2018 at 09:33:13AM -0800, Tejun Heo wrote:
> hmm_devmem_find() requires rcu_read_lock_held() but there's nothing
> which actually uses the RCU protection. The only caller is
> hmm_devmem_pages_create() which already grabs the mutex and does
> superfluous rcu_read_lock/unlock() around the function.
>
> This doesn't add anything and just adds to confusion. Remove the RCU
> protection and open-code the radix tree lookup. If this needs to
> become more sophisticated in the future, let's add them back when
> necessary.
>
> Signed-off-by: Tejun Heo <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> Cc: [email protected]
> Cc: Linus Torvalds <[email protected]>

> ---
> Hello, Jérôme.
>
> This came up while auditing percpu_ref users for missing explicit RCU
> grace periods. HMM doesn't seem to depend on RCU protection at all,
> so I thought it'd be better to remove it for now. It's only compile
> tested.

Good catch, this is a leftover from old logic. I have more cleanup queued up
now that I am about to post nouveau patches to use all this. Thanks for
fixing this.

>
> Thanks.
>
> mm/hmm.c | 12 ++----------
> 1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98..d4627c5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -845,13 +845,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
> hmm_devmem_radix_release(resource);
> }
>
> -static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
> -{
> - WARN_ON_ONCE(!rcu_read_lock_held());
> -
> - return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
> -}
> -
> static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> {
> resource_size_t key, align_start, align_size, align_end;
> @@ -892,9 +885,8 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
> struct hmm_devmem *dup;
>
> - rcu_read_lock();
> - dup = hmm_devmem_find(key);
> - rcu_read_unlock();
> + dup = radix_tree_lookup(&hmm_devmem_radix,
> + key >> PA_SECTION_SHIFT);
> if (dup) {
> dev_err(device, "%s: collides with mapping for %s\n",
> __func__, dev_name(dup->device));
> --
> 2.9.5
>

2018-03-06 18:32:13

by Linus Torvalds

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Tue, Mar 6, 2018 at 9:33 AM, Tejun Heo <[email protected]> wrote:
>
> This patch introduces rcu_work, a workqueue work variant which gets
> executed after a RCU grace period, and converts the open coded
> bouncing in fs/aio and kernel/cgroup.

So I like the concept, but I have two comments:

- can we split this patch up, so that if somebody bisects a problem
to it, we'll see if it's cgroup or aio that triggers it?

- this feels wrong:

> +struct rcu_work {
> + struct work_struct work;
> + struct rcu_head rcu;
> +
> + /* target workqueue and CPU ->rcu uses to queue ->work */
> + struct workqueue_struct *wq;
> + int cpu;
> +};

That "int cpu" really doesn't feel like it makes sense for an
rcu_work. The call_rcu() part fundamentally will happen on any CPU,
and sure, it could then schedule the work on something else, but that
doesn't sound like a particularly sound interface.

So I'd like to either just make the thing always just use
WORK_CPU_UNBOUND, or hear some kind of (handwaving ok) explanation for
why something else would ever make sense. If the action is
fundamentally delayed by RCU, why would it make a difference which CPU
it runs on?

One reason for that is that I get this feeling that the multiple
stages of waiting *might* be unnecessary. Are there no situations
where a "rcu_work" might just end up devolving to be just a regular
work? Or maybe situations where the rcu callback is done in process
context, and the work can just be done immediately? I'm a tiny bit
worried about queueing artifacts, where we end up having tons of
resources in flight.

But this really is just a "this feels wrong". I have no real hard
technical reason.

Linus

2018-03-07 02:51:15

by Lai Jiangshan

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <[email protected]> wrote:

> +/**
> + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
> + * @cpu: CPU number to execute work on
> + * @wq: workqueue to use
> + * @rwork: work to queue

For many people, "RCU grace period" is clear enough, but not ALL.

So please make it a little more clear that it just queues work after
a *Normal* RCU grace period. it supports only one RCU variant.


> + *
> + * Return: %false if @work was already on a queue, %true otherwise.
> + */

I'm afraid this will be a hard-using API.

The user can't find a plan B when it returns false, especially when
the user expects the work must be called at least once again
after an RCU grace period.

And the error-prone part of it is that, like other queue_work() functions,
the return value of it is often ignored and makes the problem worse.

So, since workqueue.c provides this API, it should handle this
problem. For example, by calling call_rcu() again in this case, but
everything will be much more complex: a synchronization is needed
for "calling call_rcu() again" and allowing the work item called
twice after the last queue_rcu_work() is not workqueue style.

Some would argue that the delayed_work has the same problem
when the user expects the work must be called at least once again
after a period of time. But time interval is easy to detect, the user
can check the time and call the queue_delayed_work() again
when needed which is also a frequent design pattern. And
for rcu, it is hard to use this design pattern since it is hard
to detect (new) rcu grace period without using call_rcu().

I would not provide this API. it is not a NACK. I'm just trying
expressing my thinking about the API. I'd rather RCU be changed
and RCU callbacks are changed to be sleepable. But a complete
overhaul cleanup on the whole source tree for compatibility
is needed at first, an even more complex job.



> +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
> + struct rcu_work *rwork)
> +{
> + struct work_struct *work = &rwork->work;
> +
> + if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> + rwork->wq = wq;
> + rwork->cpu = cpu;
> + call_rcu(&rwork->rcu, rcu_work_rcufn);
> + return true;
> + }
> +
> + return false;
> +}
> +EXPORT_SYMBOL(queue_rcu_work_on);
> +

2018-03-07 15:41:35

by Dennis Dalessandro

Subject: Re: [PATCH 3/7] RDMAVT: Fix synchronization around percpu_ref

On 3/6/2018 12:33 PM, Tejun Heo wrote:
> rvt_mregion uses percpu_ref for reference counting and RCU to protect
> accesses from lkey_table. When a rvt_mregion needs to be freed, it
> first gets unregistered from lkey_table and then rvt_check_refs() is
> called to wait for in-flight usages before the rvt_mregion is freed.
>
> rvt_check_refs() seems to have a couple issues.
>
> * It has a fast exit path which tests percpu_ref_is_zero(). However,
> a percpu_ref reading zero doesn't mean that the object can be
> released. In fact, the ->release() callback might not even have
> started executing yet. Proceeding with freeing can lead to
> use-after-free.
>
> * lkey_table is RCU protected but there is no RCU grace period in the
> free path. percpu_ref uses RCU internally but it's sched-RCU whose
> grace periods are different from regular RCU. Also, it generally
> isn't a good idea to depend on internal behaviors like this.
>
> To address the above issues, this patch removes the the fast exit and

Typo above too many "the".

> adds an explicit synchronize_rcu().
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Dennis Dalessandro <[email protected]>
> Cc: Mike Marciniszyn <[email protected]>
> Cc: [email protected]
> Cc: Linus Torvalds <[email protected]>
> ---
> Hello, Dennis, Mike.
>
> I don't know RDMA at all and this patch is only compile tested. Can
> you please take a careful look?

Looks good to me, passes my basic sanity tests. Thanks!

Acked-by: Dennis Dalessandro <[email protected]>

2018-03-07 16:24:52

by Peter Zijlstra

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Wed, Mar 07, 2018 at 06:54:08AM -0800, Paul E. McKenney wrote:
> One downside of allowing RCU callback functions to sleep is that
> one poorly written callback can block a bunch of other ones.

Not to mention that we really want the RCU callbacks to be simple and
short. Needing fancy things in the callback really should be the
exception not the rule.

2018-03-07 17:17:26

by Paul E. McKenney

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Wed, Mar 07, 2018 at 10:49:49AM +0800, Lai Jiangshan wrote:
> On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <[email protected]> wrote:
>
> > +/**
> > + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
> > + * @cpu: CPU number to execute work on
> > + * @wq: workqueue to use
> > + * @rwork: work to queue
>
> For many people, "RCU grace period" is clear enough, but not ALL.
>
> So please make it a little more clear that it just queues work after
> a *Normal* RCU grace period. it supports only one RCU variant.
>
>
> > + *
> > + * Return: %false if @work was already on a queue, %true otherwise.
> > + */
>
> I'm afraid this will be a hard-using API.
>
> The user can't find a plan B when it returns false, especially when
> the user expects the work must be called at least once again
> after an RCU grace period.
>
> And the error-prone part of it is that, like other queue_work() functions,
> the return value of it is often ignored and makes the problem worse.
>
> So, since workqueue.c provides this API, it should handle this
> problem. For example, by calling call_rcu() again in this case, but
> everything will be much more complex: a synchronization is needed
> for "calling call_rcu() again" and allowing the work item called
> twice after the last queue_rcu_work() is not workqueue style.

I confess that I had not thought of this aspect, but doesn't the
rcu_barrier() in v2 of this patchset guarantee that it has passed
the RCU portion of the overall wait time? Given that, what I am
missing is how this differs from flush_work() on a normal work item.

So could you please show me the sequence of events that leads to this
problem? (Again, against v2 of this patch, which replaces the
synchronize_rcu() with rcu_barrier().)

> Some would argue that the delayed_work has the same problem
> when the user expects the work must be called at least once again
> after a period of time. But time interval is easy to detect, the user
> can check the time and call the queue_delayed_work() again
> when needed which is also a frequent design pattern. And
> for rcu, it is hard to use this design pattern since it is hard
> to detect (new) rcu grace period without using call_rcu().
>
> I would not provide this API. it is not a NACK. I'm just trying
> expressing my thinking about the API. I'd rather RCU be changed
> and RCU callbacks are changed to be sleepable. But a complete
> overhaul cleanup on the whole source tree for compatibility
> is needed at first, an even more complex job.

One downside of allowing RCU callback functions to sleep is that
one poorly written callback can block a bunch of other ones.
One advantage of Tejun's approach is that such delays only affect
the workqueues, which are already designed to handle such delays.

Thanx, Paul

> > +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
> > + struct rcu_work *rwork)
> > +{
> > + struct work_struct *work = &rwork->work;
> > +
> > + if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> > + rwork->wq = wq;
> > + rwork->cpu = cpu;
> > + call_rcu(&rwork->rcu, rcu_work_rcufn);
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +EXPORT_SYMBOL(queue_rcu_work_on);
> > +
>


2018-03-07 17:59:18

by Paul E. McKenney

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Wed, Mar 07, 2018 at 05:23:17PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 07, 2018 at 06:54:08AM -0800, Paul E. McKenney wrote:
> > One downside of allowing RCU callback functions to sleep is that
> > one poorly written callback can block a bunch of other ones.
>
> Not to mention that we really want the RCU callbacks to be simple and
> short. Needing fancy things in the callback really should be the
> exception not the rule.

Very much agreed with that as well!

Besides, Tejun's queue_rcu_work() provides this functionality and
seems pretty straightforward to use.

Thanx, Paul


2018-03-08 00:31:02

by Lai Jiangshan

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Wed, Mar 7, 2018 at 10:54 PM, Paul E. McKenney
<[email protected]> wrote:
> On Wed, Mar 07, 2018 at 10:49:49AM +0800, Lai Jiangshan wrote:
>> On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <[email protected]> wrote:
>>
>> > +/**
>> > + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
>> > + * @cpu: CPU number to execute work on
>> > + * @wq: workqueue to use
>> > + * @rwork: work to queue
>>
>> For many people, "RCU grace period" is clear enough, but not ALL.
>>
>> So please make it a little more clear that it just queues work after
>> a *Normal* RCU grace period. it supports only one RCU variant.
>>
>>
>> > + *
>> > + * Return: %false if @work was already on a queue, %true otherwise.
>> > + */
>>
>> I'm afraid this will be a hard-using API.
>>
>> The user can't find a plan B when it returns false, especially when
>> the user expects the work must be called at least once again
>> after an RCU grace period.
>>
>> And the error-prone part of it is that, like other queue_work() functions,
>> the return value of it is often ignored and makes the problem worse.
>>
>> So, since workqueue.c provides this API, it should handle this
>> problem. For example, by calling call_rcu() again in this case, but
>> everything will be much more complex: a synchronization is needed
>> for "calling call_rcu() again" and allowing the work item called
>> twice after the last queue_rcu_work() is not workqueue style.
>
> I confess that I had not thought of this aspect, but doesn't the
> rcu_barrier() in v2 of this patchset guarantee that it has passed
> the RCU portion of the overall wait time? Given that, what I am
> missing is how this differs from flush_work() on a normal work item.
>
> So could you please show me the sequence of events that leads to this
> problem? (Again, against v2 of this patch, which replaces the
> synchronize_rcu() with rcu_barrier().)

I mentioned a subtle use case that user would think it is supported
since the comment doesn't disallow it.

It is clear that the user expects
the work must be called at least once after the API returns
the work must be called after an RCU grace period

But in the case when the user expects the work must be called
at least once again after "queue_rcu_work() + an RCU grace period",
the API is not competent to it if the work is queued.
Although the user can detect it by the return value of
queue_rcu_work(), the user hardly makes his expectation
happen by adding appropriate code.

>
>> Some would argue that the delayed_work has the same problem
>> when the user expects the work must be called at least once again
>> after a period of time. But time interval is easy to detect, the user
>> can check the time and call the queue_delayed_work() again
>> when needed which is also a frequent design pattern. And
>> for rcu, it is hard to use this design pattern since it is hard
>> to detect (new) rcu grace period without using call_rcu().
>>
>> I would not provide this API. it is not a NACK. I'm just trying
>> expressing my thinking about the API. I'd rather RCU be changed
>> and RCU callbacks are changed to be sleepable. But a complete
>> overhaul cleanup on the whole source tree for compatibility
>> is needed at first, an even more complex job.
>
> One downside of allowing RCU callback functions to sleep is that
> one poorly written callback can block a bunch of other ones.
> One advantage of Tejun's approach is that such delays only affect
> the workqueues, which are already designed to handle such delays.
>
> Thanx, Paul
>
>> > +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
>> > + struct rcu_work *rwork)
>> > +{
>> > + struct work_struct *work = &rwork->work;
>> > +
>> > + if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>> > + rwork->wq = wq;
>> > + rwork->cpu = cpu;
>> > + call_rcu(&rwork->rcu, rcu_work_rcufn);
>> > + return true;
>> > + }
>> > +
>> > + return false;
>> > +}
>> > +EXPORT_SYMBOL(queue_rcu_work_on);
>> > +
>>
>

2018-03-08 17:31:07

by Paul E. McKenney

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

On Thu, Mar 08, 2018 at 08:29:53AM +0800, Lai Jiangshan wrote:
> On Wed, Mar 7, 2018 at 10:54 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Wed, Mar 07, 2018 at 10:49:49AM +0800, Lai Jiangshan wrote:
> >> On Wed, Mar 7, 2018 at 1:33 AM, Tejun Heo <[email protected]> wrote:
> >>
> >> > +/**
> >> > + * queue_rcu_work_on - queue work on specific CPU after a RCU grace period
> >> > + * @cpu: CPU number to execute work on
> >> > + * @wq: workqueue to use
> >> > + * @rwork: work to queue
> >>
> >> For many people, "RCU grace period" is clear enough, but not ALL.
> >>
> >> So please make it a little more clear that it just queues work after
> >> a *Normal* RCU grace period. it supports only one RCU variant.
> >>
> >>
> >> > + *
> >> > + * Return: %false if @work was already on a queue, %true otherwise.
> >> > + */
> >>
> >> I'm afraid this will be a hard-using API.
> >>
> >> The user can't find a plan B when it returns false, especially when
> >> the user expects the work must be called at least once again
> >> after an RCU grace period.
> >>
> >> And the error-prone part of it is that, like other queue_work() functions,
> >> the return value of it is often ignored and makes the problem worse.
> >>
> >> So, since workqueue.c provides this API, it should handle this
> >> problem. For example, by calling call_rcu() again in this case, but
> >> everything will be much more complex: a synchronization is needed
> >> for "calling call_rcu() again" and allowing the work item called
> >> twice after the last queue_rcu_work() is not workqueue style.
> >
> > I confess that I had not thought of this aspect, but doesn't the
> > rcu_barrier() in v2 of this patchset guarantee that it has passed
> > the RCU portion of the overall wait time? Given that, what I am
> > missing is how this differs from flush_work() on a normal work item.
> >
> > So could you please show me the sequence of events that leads to this
> > problem? (Again, against v2 of this patch, which replaces the
> > synchronize_rcu() with rcu_barrier().)
>
> I mentioned a subtle use case that user would think it is supported
> since the comment doesn't disallow it.
>
> It is clear that the user expects
> the work must be called at least once after the API returns

I would have said "before the API returns" rather than "after the
API returns". What am I missing here?

> the work must be called after an RCU grace period
>
> But in the case when the user expects the work must be called
> at least once again after "queue_rcu_work() + an RCU grace period",
> the API is not competent to it if the work is queued.
> Although the user can detect it by the return value of
> queue_rcu_work(), the user hardly makes his expectation
> happen by adding appropriate code.

Beyond a certain point, I need to defer to Tejun on this, but I thought
that a false return meant that the work executed before the flush call.
Here I am assuming that the caller knows that the queue_rcu_work()
already happened and that another queue_rcu_work() won't be invoked on
that same work item until the flush completes. Without this sort of
assumption, I agree that there could be problems, but without these
assumptions it seems to me that you would have exactly the same problems
with the non-RCU APIs.

What am I missing?

Thanx, Paul

> >> Some would argue that the delayed_work has the same problem
> >> when the user expects the work must be called at least once again
> >> after a period of time. But time interval is easy to detect, the user
> >> can check the time and call the queue_delayed_work() again
> >> when needed which is also a frequent design pattern. And
> >> for rcu, it is hard to use this design pattern since it is hard
> >> to detect (new) rcu grace period without using call_rcu().
> >>
> >> I would not provide this API. it is not a NACK. I'm just trying
> >> expressing my thinking about the API. I'd rather RCU be changed
> >> and RCU callbacks are changed to be sleepable. But a complete
> >> overhaul cleanup on the whole source tree for compatibility
> >> is needed at first, an even more complex job.
> >
> > One downside of allowing RCU callback functions to sleep is that
> > one poorly written callback can block a bunch of other ones.
> > One advantage of Tejun's approach is that such delays only affect
> > the workqueues, which are already designed to handle such delays.
> >
> > Thanx, Paul
> >
> >> > +bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
> >> > + struct rcu_work *rwork)
> >> > +{
> >> > + struct work_struct *work = &rwork->work;
> >> > +
> >> > + if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >> > + rwork->wq = wq;
> >> > + rwork->cpu = cpu;
> >> > + call_rcu(&rwork->rcu, rcu_work_rcufn);
> >> > + return true;
> >> > + }
> >> > +
> >> > + return false;
> >> > +}
> >> > +EXPORT_SYMBOL(queue_rcu_work_on);
> >> > +
> >>
> >
>


2018-03-09 15:38:34

by Tejun Heo

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

Hello, Linus.

On Tue, Mar 06, 2018 at 10:30:29AM -0800, Linus Torvalds wrote:
> - can we split this patch up, so that if somebody bisects a problem
> to it, we'll see if it's cgroup or aio that triggers it?

Will do.

> So I'd like to either just make the thing always just use
> WORK_CPU_UNBOUND, or hear some kind of (handwaving ok) explanation for
> why something else would ever make sense. If the action is
> fundamentally delayed by RCU, why would it make a difference which CPU
> it runs on?

It was mostly for consistency with other interfaces. Let's drop
queue_rcu_work_on() for now. If ever necessary, we can easily add it back
later.

Thanks.

--
tejun

2018-03-09 16:23:18

by Tejun Heo

Subject: Re: [PATCH 7/7] RCU, workqueue: Implement rcu_work

Hello,

On Thu, Mar 08, 2018 at 08:29:53AM +0800, Lai Jiangshan wrote:
> I mentioned a subtle use case that user would think it is supported
> since the comment doesn't disallow it.
>
> It is clear that the user expects
> the work must be called at least once after the API returns
> the work must be called after an RCU grace period
>
> But in the case when the user expects the work must be called
> at least once again after "queue_rcu_work() + an RCU grace period",
> the API is not competent to it if the work is queued.
> Although the user can detect it by the return value of
> queue_rcu_work(), the user hardly makes his expectation
> happen by adding appropriate code.

We should definitely document it better but it isn't any different
from delayed_work, and I don't see a reason to deviate.

Thanks.

--
tejun

2018-03-14 18:47:53

by Tejun Heo

Subject: Re: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()

Hello, Bart.

On Tue, Mar 06, 2018 at 05:52:50PM +0000, Bart Van Assche wrote:
> I think the rcu_read_lock() and rcu_read_unlock() are really necessary in this
> code. In the LWN article "The RCU-barrier menagerie" it is explained that RCU
> can be used to enforce write ordering globally if the code that reads the writes

Yeah but, for it to be meaningful, just like barriers, it has to be
paired with something on the other side.

> that are ordered with an RCU read lock (https://lwn.net/Articles/573497/). See
> also the following comment in scsi_device_quiesce():
>
> /*
> * Ensure that the effect of blk_set_preempt_only() will be visible
> * for percpu_ref_tryget() callers that occur after the queue
> * unfreeze even if the queue was already frozen before this function
> * was called. See also https://lwn.net/Articles/573497/.
> */
>
> Since this patch introduces a subtle and hard to debug race condition, please
> drop this patch.

Hah, the pairing is between scsi_device_quiesce() and
blk_queue_enter()? But that doesn't make sense either because
scsi_device_quiesce() is doing regular synchronize_rcu() and
blk_queue_enter() is doing rcu_read_lock_sched(). They don't
interlock with each other in any way.

So, the right thing to do here would be somehow moving the RCU
synchronization into blk_set_preempt_only() and switching to regular
RCU protection in blk_queue_enter(). The code as-is isn't really
doing anything.

I'll drop this patch from the series for now. Let's revisit it later.

Thanks.

--
tejun

2018-03-14 20:07:45

by Bart Van Assche

Subject: Re: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()

On Wed, 2018-03-14 at 11:46 -0700, [email protected] wrote:
> > that are ordered with an RCU read lock (https://lwn.net/Articles/573497/). See
> > also the following comment in scsi_device_quiesce():
> >
> > /*
> > * Ensure that the effect of blk_set_preempt_only() will be visible
> > * for percpu_ref_tryget() callers that occur after the queue
> > * unfreeze even if the queue was already frozen before this function
> > * was called. See also https://lwn.net/Articles/573497/.
> > */
> >
> > Since this patch introduces a subtle and hard to debug race condition, please
> > drop this patch.
>
> Hah, the pairing is between scsi_device_quiesce() and
> blk_queue_enter()? But that doesn't make sense either because
> scsi_device_quiesce() is doing regular synchronize_rcu() and
> blk_queue_enter() is doing rcu_read_lock_sched(). They don't
> interlock with each other in any way.

Can you clarify this further? From <linux/rcupdate.h>:

static inline void synchronize_rcu(void)
{
synchronize_sched();
}

> So, the right thing to do here would be somehow moving the RCU
> synchronization into blk_set_preempt_only() and switching to regular
> RCU protection in blk_queue_enter(). The code as-is isn't really
> doing anything.

Since the RCU protection in blk_queue_enter() surrounds a percpu_ref_tryget_live()
call and since percpu_ref_tryget_live() calls rcu_read_lock/unlock_sched() itself,
I don't think that it is allowed to switch to regular RCU protection in
blk_queue_enter() without modifying the RCU implementation.

Bart.


2018-03-14 20:10:39

by Peter Zijlstra

Subject: Re: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()

On Wed, Mar 14, 2018 at 08:05:30PM +0000, Bart Van Assche wrote:
> Can you clarify this further? From <linux/rcupdate.h>:
>
> static inline void synchronize_rcu(void)
> {
> synchronize_sched();
> }

You'll find that is for !CONFIG_PREEMPT_RCU.

2018-03-14 20:15:28

by Bart Van Assche

Subject: Re: [PATCH 5/7] block: Remove superfluous rcu_read_[un]lock_sched() in blk_queue_enter()

On Wed, 2018-03-14 at 21:08 +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 08:05:30PM +0000, Bart Van Assche wrote:
> > Can you clarify this further? From <linux/rcupdate.h>:
> >
> > static inline void synchronize_rcu(void)
> > {
> > synchronize_sched();
> > }
>
> You'll find that is for !CONFIG_PREEMPT_RCU.

Thanks for the clarification. I will submit a fix for the SCSI code.

Bart.