The down/up_read_non_owner() calls are a nasty hack in the rwsem API.
They were once removed, but then resurrected for use with bcache. Not
only is the non_owner variant an abomination to the rest of the rwsem
API, it also prevents bcache from ever being compiled with PREEMPT_RT,
as the RT kernel requires all rwsems to have an owner.
Instead of keeping the down/up_read_non_owner() around, it is better to
modify the one user of it and have it do something a bit differently.
From reading the bcache code, the reason for the non_owner usage is
that a request is made, and writers must wait for that request to
finish before they can continue. But because the request is completed
by another task, the rwsem can not be held for read and then released
on completion.
Instead of abusing the rwsem code for this, I added a refcount
"nr_requests" to the cached_dev structure as well as a "write_waiters"
wait queue. When a request is to be made, the rwsem is still taken for
read, but this time with an owner. The refcount is incremented and,
before exiting the function, the rwsem is released.
The writer will then take the rwsem for write and check the refcount;
if it is not zero, it will release the rwsem, wait in wait_event()
for the refcount to become zero, and then try again.
This should be a suitable replacement for the abuse of the rwsem
non_owner API.
Signed-off-by: Steven Rostedt <[email protected]>
Index: linux-trace.git/drivers/md/bcache/bcache.h
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/bcache.h
+++ linux-trace.git/drivers/md/bcache/bcache.h
@@ -486,6 +486,15 @@ struct cached_dev {
atomic_t running;
/*
+ * Requests in progress.
+ */
+ atomic_t nr_requests;
+ /*
+ * Writers waiting for requests to finish.
+ */
+ wait_queue_head_t write_waiters;
+
+ /*
* Writes take a shared lock from start to finish; scanning for dirty
* data to refill the rb tree requires an exclusive lock.
*/
Index: linux-trace.git/drivers/md/bcache/request.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/request.c
+++ linux-trace.git/drivers/md/bcache/request.c
@@ -998,7 +998,9 @@ static void cached_dev_write_complete(st
struct search *s = container_of(cl, struct search, cl);
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
- up_read_non_owner(&dc->writeback_lock);
+ if (atomic_dec_return(&dc->nr_requests) == 0)
+ wake_up(&dc->write_waiters);
+
cached_dev_bio_complete(cl);
}
@@ -1024,7 +1026,8 @@ static void request_write(struct cached_
bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys, &start, &end);
check_should_skip(dc, s);
- down_read_non_owner(&dc->writeback_lock);
+ down_read(&dc->writeback_lock);
+ atomic_inc(&dc->nr_requests);
if (bch_keybuf_check_overlapping(&dc->writeback_keys, &start, &end)) {
s->op.skip = false;
@@ -1053,6 +1056,7 @@ static void request_write(struct cached_
}
out:
closure_call(&s->op.cl, bch_insert_data, NULL, cl);
+ up_read(&dc->writeback_lock);
continue_at(cl, cached_dev_write_complete, NULL);
skip:
s->op.skip = true;
Index: linux-trace.git/drivers/md/bcache/super.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/super.c
+++ linux-trace.git/drivers/md/bcache/super.c
@@ -1081,6 +1081,8 @@ static void register_bdev(struct cache_s
dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
get_page(sb_page);
+ init_waitqueue_head(&dc->write_waiters);
+
if (cached_dev_init(dc, sb->block_size << 9))
goto err;
Index: linux-trace.git/drivers/md/bcache/writeback.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/writeback.c
+++ linux-trace.git/drivers/md/bcache/writeback.c
@@ -121,6 +121,20 @@ static void dirty_init(struct keybuf_key
bch_bio_map(bio, NULL);
}
+/*
+ * If requests are in transit, we must wait for them to finish.
+ */
+static void down_writeback_lock(struct cached_dev *dc)
+{
+again:
+ down_write(&dc->writeback_lock);
+ if (atomic_read(&dc->nr_requests)) {
+ up_write(&dc->writeback_lock);
+ wait_event(dc->write_waiters, !atomic_read(&dc->nr_requests));
+ goto again;
+ }
+}
+
static void refill_dirty(struct closure *cl)
{
struct cached_dev *dc = container_of(cl, struct cached_dev,
@@ -134,7 +148,7 @@ static void refill_dirty(struct closure
!dc->writeback_running)
closure_return(cl);
- down_write(&dc->writeback_lock);
+ down_writeback_lock(dc);
if (!atomic_read(&dc->has_dirty)) {
SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
On Tue, Aug 20, 2013 at 11:16:02AM -0400, Steven Rostedt wrote:
>
> The down/up_read_non_owner() calls are a nasty hack in the rwsem API.
> They were once removed, but then resurrected for use with bcache. Not
> only is the non_owner variant an abomination to the rest of the rwsem
> API, it also prevents bcache from ever being compiled with PREEMPT_RT,
> as the RT kernel requires all rwsems to have an owner.
>
> Instead of keeping the down/up_read_non_owner() around, it is better to
> modify the one user of it and have it do something a bit differently.
>
> From reading the bcache code, the reason for the non_owner usage is
> that a request is made, and writers must wait for that request to
> finish before they can continue. But because the request is completed
> by another task, the rwsem can not be held for read and then released
> on completion.
>
> Instead of abusing the rwsem code for this, I added a refcount
> "nr_requests" to the cached_dev structure as well as a "write_waiters"
> wait queue. When a request is to be made, the rwsem is still taken for
> read, but this time with an owner. The refcount is incremented and,
> before exiting the function, the rwsem is released.
>
> The writer will then take the rwsem for write and check the refcount;
> if it is not zero, it will release the rwsem, wait in wait_event()
> for the refcount to become zero, and then try again.
I _really_ disagree with this approach.
I get that there's a problem, but the bcache code REALLY IS USING THE
RWSEM AS A LOCK; the answer isn't to open code the lock!
Apologies to Christoph for getting distracted and not responding when
you started to explain what the issues were for RT. I'm not really
convinced they're that insurmountable (one of the issues was debugging,
which the _non_owner() stuff always handled just fine), but I'll take it
on faith that this usage is incompatible with rwsems + the RT
functionality since I haven't actually had time to dig into it.
So assuming that's the case, IMO the sanest thing to do is make a new
type of lock - "rwsem_non_process" or somesuch - and use _that_ in
bcache. Not open coding the lock.
It can even live in the bcache code if we want since there currently
wouldn't be any other users, I don't really care. But open coding it?
Come on... makes me wonder what other code in the kernel is open coding
locks because it couldn't release it in the same process context that
took the lock for whatever reason.
Also, nack this patch because it increases the number of atomic ops to
shared cachelines in our fast path. If it does end up being open coded,
I'll make a more efficient version.
On Tue, 20 Aug 2013 19:19:00 -0700
Kent Overstreet <[email protected]> wrote:
> On Tue, Aug 20, 2013 at 11:16:02AM -0400, Steven Rostedt wrote:
> >
> I _really_ disagree with this approach.
I had a feeling you would love it.
>
> I get that there's a problem, but the bcache code REALLY IS USING THE
> RWSEM AS A LOCK; the answer isn't to open code the lock!
Actually, it is using it as a semaphore. The problem with Linux
was that it only had spin locks and semaphores. People would use the
semaphore as either a completion or a mutex. Luckily, we created both
of those and replaced 90% of all semaphores with either a mutex or a
completion.
The rwsem sorta has the same issue, but it was pretty much always used
as a read/write sleeping lock; it was not used as a semaphore. Yet it
seems that bcache really does use it as a semaphore here, as it isn't
doing just mutual exclusion. It's a special kind of completion, but the
completion is done in a strange way.
>
> Apologies to Christoph for getting distracted and not responding when
> you started to explain what the issues were for RT. I'm not really
> convinced they're that insurmountable (one of the issues was debugging,
> which the _non_owner() stuff always handled just fine), but I'll take it
> on faith that this usage is incompatible with rwsems + the RT
> functionality since I haven't actually had time to dig into it.
The thing is, RT requires priority inheritance. When a task blocks on
an rwsem, it has to boost the owner of that rwsem; otherwise it risks
unbounded priority inversion.
I have a hack that actually implements a workaround for non_owner, but
it adds overhead to all locks to do so.
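To make this concrete, here is a toy model of the boosting step
(made-up types and names, nothing like the actual rt_mutex code):

struct task { int prio; };	/* lower number = higher priority */

struct pi_lock {
	struct task *owner;	/* NULL when unlocked; a non_owner
				 * lock has nothing to boost here */
};

/* Called when "waiter" blocks on "lock": hand the waiter's priority
 * to the owner so a medium-priority task cannot preempt the owner
 * and stall the waiter indefinitely. */
static void pi_boost_on_block(struct pi_lock *lock, struct task *waiter)
{
	if (lock->owner && lock->owner->prio > waiter->prio)
		lock->owner->prio = waiter->prio;
	/* ...then sleep until the owner drops the lock and has its
	 * original priority restored... */
}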
>
> So assuming that's the case, IMO the sanest thing to do is make a new
> type of lock - "rwsem_non_process" or somesuch - and use _that_ in
> bcache. Not open coding the lock.
I actually started that, but decided this was the easier patch to send.
Don't worry, I was expecting this email. No surprises here ;-)
This email was taken from Andrew Morton's play book (I see you Cc'd
him; I forgot to). It's one of those patches where "it's so bad that
the owner of the code will fix the issue out of fear that this patch
may make it in".
>
> It can even live in the bcache code if we want since there currently
> wouldn't be any other users, I don't really care. But open coding it?
> Come on... makes me wonder what other code in the kernel is open coding
> locks because it couldn't release it in the same process context that
> took the lock for whatever reason.
Most cases just have completions. This looks like a case of "don't do
something while a transaction is taking place", which can usually get
away with a wait event.
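For reference, that pattern looks something like this (simplified
sketch; the two function names are made up, the completion API is the
standard one):

#include <linux/completion.h>

static DECLARE_COMPLETION(req_done);

/* Submitting side: kick off work that finishes elsewhere, then wait. */
static void submit_and_wait(void)
{
	/* ...start a request that completes in another context... */
	wait_for_completion(&req_done);
}

/* Completing side: may run in a different task or even irq context. */
static void request_finished(void)
{
	complete(&req_done);
}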
>
> Also, nack this patch because it increases the number of atomic ops to
> shared cachelines in our fast path. If it does end up being open coded,
> I'll make a more efficient version.
Perhaps I can whip up a "struct rwsem_non_owner" and have a very
limited API for that, and then you can use that.
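Something along these lines, maybe (pure sketch, nothing like this
exists today; an RT-aware implementation would live behind the same
type):

#include <linux/rwsem.h>

/* Confine the non_owner semantics to one dedicated type instead of
 * letting them leak into the general rwsem API. */
struct rwsem_non_owner {
	struct rw_semaphore sem;
};

static inline void rwsem_no_down_read(struct rwsem_non_owner *s)
{
	down_read_non_owner(&s->sem);
}

/* May be called from a different task than the one that locked. */
static inline void rwsem_no_up_read(struct rwsem_non_owner *s)
{
	up_read_non_owner(&s->sem);
}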
Thanks,
-- Steve
On Tue, Aug 20, 2013 at 10:41:53PM -0400, Steven Rostedt wrote:
> > I get that there's a problem, but the bcache code REALLY IS USING THE
> > RWSEM AS A LOCK; the answer isn't to open code the lock!
>
> Actually, it is using it as a semaphore. The problem with Linux
> was that it only had spin locks and semaphores. People would use the
> semaphore as either a completion or a mutex. Luckily, we created both
> of those and replaced 90% of all semaphores with either a mutex or a
> completion.
>
> The rwsem sorta has the same issue, but it was pretty much always used
> as a read/write sleeping lock; it was not used as a semaphore. Yet it
> seems that bcache really does use it as a semaphore here, as it isn't
> doing just mutual exclusion. It's a special kind of completion, but the
> completion is done in a strange way.
I think we're mostly quibbling over semantics here, and it's not that
big a deal - that said, I don't really get your distinction between a
semaphore and a mutex; I'd distinguish them by saying a semaphore can
count an arbitrary number of resources (the number it's initialized to),
and a mutex is always initialized to one - the fact that in the kernel
mutexes must be released by the same process that took them (while
important for PI and debugging) is not fundamental.
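Concretely, with the standard kernel APIs (the variable names here are
made up):

#include <linux/semaphore.h>
#include <linux/mutex.h>

static struct semaphore pool_sem;	/* can count N resources */
static DEFINE_MUTEX(cfg_lock);		/* always "initialized to one" */

static void example(void)
{
	sema_init(&pool_sem, 4);	/* up to four holders at once */

	down(&pool_sem);		/* may be upped by another task */
	up(&pool_sem);

	mutex_lock(&cfg_lock);		/* must unlock in the same task */
	mutex_unlock(&cfg_lock);
}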
"rw semaphore" never made any sense to me; they're not counting
anything, like regular semaphores. They're just sleepable rw locks.
So it _sounds_ like you're saying bcache is using it as a semaphore,
because it's using it as a lock we don't release in the original
context? In analogy to Linux mutexes and semaphores? Not quite sure
what you mean.
> > Apologies to Christoph for getting distracted and not responding when
> > you started to explain what the issues were for RT. I'm not really
> > convinced they're that insurmountable (one of the issues was debugging,
> > which the _non_owner() stuff always handled just fine), but I'll take it
> > on faith that this usage is incompatible with rwsems + the RT
> > functionality since I haven't actually had time to dig into it.
>
> The thing is, RT requires priority inheritance. When a task blocks on
> an rwsem, it has to boost the owner of that rwsem; otherwise it risks
> unbounded priority inversion.
>
> I have a hack that actually implements a workaround for non_owner, but
> it adds overhead to all locks to do so.
Ok - I'd just be curious why the PI bit can't be factored out into a
wrapper like how the debug stuff is handled with the _non_owner bits,
but I can certainly believe that.
> > So assuming that's the case, IMO the sanest thing to do is make a new
> > type of lock - "rwsem_non_process" or somesuch - and use _that_ in
> > bcache. Not open coding the lock.
>
> I actually started that, but decided this was the easier patch to send.
> Don't worry, I was expecting this email. No surprises here ;-)
>
> This email was taken from Andrew Morton's play book (I see you Cc'd
> him; I forgot to). It's one of those patches where "it's so bad that
> the owner of the code will fix the issue out of fear that this patch
> may make it in".
Ah, I see :)
> > It can even live in the bcache code if we want since there currently
> > wouldn't be any other users, I don't really care. But open coding it?
> > Come on... makes me wonder what other code in the kernel is open coding
> > locks because it couldn't release it in the same process context that
> > took the lock for whatever reason.
>
> Most cases just have completions. This looks like a case of "don't do
> something while a transaction is taking place", which can usually get
> away with a wait event.
Eh, only in the sense that it's easy to implement mutexes and rwlocks
with wait_event().
To simplify the algorithm just a bit (only leaving out some
optimizations), bcache is using it to protect an rb tree, which
contains "things that are undergoing background writeback".
For foreground writes, we've got to check if the write overlaps with
anything in this rb tree. If so, we force _this_ write to writeback;
background writeback will write stale data to the backing device, but
that's fine because the data in the cache will still be marked as dirty.
To add stuff to this rb tree - i.e. for background writeback to start
flushing some dirty data - it's got to make sure it's not going to
overwrite some in-progress foreground writethrough write.
Tracking every outstanding foreground write would be expensive, so
foreground writes take a read lock on the rb tree (both to check it for
anything they overlap with, and for their duration), and background
writeback takes a write lock when it goes to scan for dirty data to add
to the rb tree.
Even if you complain it's not _just_ protecting the rb tree, we're still
using it for mutual exclusion, plain and simple, and that's all a lock
is.
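In rough code, the pattern is something like this (made-up helper
names, not the actual bcache functions):

#include <linux/rwsem.h>
#include <linux/bio.h>

static DECLARE_RWSEM(writeback_lock);

/* Foreground write: shared lock, held for the duration of the write;
 * the write completes in another context, hence _non_owner today. */
static void foreground_write_submit(struct bio *bio)
{
	down_read_non_owner(&writeback_lock);
	if (overlaps_writeback_rbtree(bio))	/* hypothetical */
		force_to_writeback(bio);	/* hypothetical */
	submit_the_write(bio);			/* hypothetical */
}

static void foreground_write_complete(struct bio *bio)
{
	up_read_non_owner(&writeback_lock);	/* different task */
}

/* Background writeback: exclusive lock while scanning for dirty data
 * to add to the rb tree, so no foreground write can race with it. */
static void refill_dirty_keys(void)
{
	down_write(&writeback_lock);
	scan_and_add_dirty_keys();		/* hypothetical */
	up_write(&writeback_lock);
}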
> > Also, nack this patch because it increases the number of atomic ops to
> > shared cachelines in our fast path. If it does end up being open coded,
> > I'll make a more efficient version.
>
> Perhaps I can whip up a "struct rwsem_non_owner" and have a very
> limited API for that, and then you can use that.
That would be perfect :)
On Tue, 20 Aug 2013 20:36:58 -0700
Kent Overstreet <[email protected]> wrote:
> I think we're mostly quibbling over semantics here, and it's not that
> big a deal - that said, I don't really get your distinction between a
> semaphore and a mutex; I'd distinguish them by saying a semaphore can
> count an arbitrary number of resources (the number it's initialized to),
> and a mutex is always initialized to one - the fact that in the kernel
> mutexes must be released by the same process that took them (while
> important for PI and debugging) is not fundamental.
Understood. As I usually am focused on PI and debugging, it may be
more fundamental to me than to others.
>
> "rw semaphore" never made any sense to me; they're not counting
> anything, like regular semaphores. They're just sleepable rw locks.
Me neither.
>
> So it _sounds_ like you're saying bcache is using it as a semaphore,
> because it's using it as a lock we don't release in the original
> context? In analogy to Linux mutexes and semaphores? Not quite sure
> what you mean.
Actually, I'm thinking you are using it more as a completion than a
semaphore.
> > I have a hack that actually implements a workaround for non_owner, but
> > it adds overhead to all locks to do so.
>
> Ok - I'd just be curious why the PI bit can't be factored out into a
> wrapper like how the debug stuff is handled with the _non_owner bits,
> but I can certainly believe that.
The problem is that on RT all sleeping locks go through rt_mutex; the
locks are replaced with versions that incorporate PI.
> To simplify the algorithm just a bit (only leaving out some
> optimizations), bcache is using it to protect an rb tree, which
> contains "things that are undergoing background writeback".
I think this is what makes bcache unique in the kernel, and I don't
think there are open coded locks elsewhere.
>
> For foreground writes, we've got to check if the write overlaps with
> anything in this rb tree. If so, we force _this_ write to writeback;
> background writeback will write stale data to the backing device, but
> that's fine because the data in the cache will still be marked as dirty.
>
> To add stuff to this rb tree - i.e. for background writeback to start
> flushing some dirty data - it's got to make sure it's not going to
> overwrite some in-progress foreground writethrough write.
>
> Tracking every outstanding foreground write would be expensive, so
> foreground writes take a read lock on the rb tree (both to check it for
> anything they overlap with, and for their duration), and background
> writeback takes a write lock when it goes to scan for dirty data to add
> to the rb tree.
>
> Even if you complain it's not _just_ protecting the rb tree, we're still
> using it for mutual exclusion, plain and simple, and that's all a lock
> is.
Well, rwlocks and rwsems are not mutually exclusive ;-) The read side
has multiple accessors. Mutual exclusion would also mean a single task
needing exclusive access to a resource, but as things are done in the
background, that is not the case. But we are arguing semantics here
and not helping with a solution.
>
>
> > > Also, nack this patch because it increases the number of atomic ops to
> > > shared cachelines in our fast path. If it does end up being open coded,
> > > I'll make a more efficient version.
> >
> > Perhaps I can whip up a "struct rwsem_non_owner" and have a very
> > limited API for that, and then you can use that.
>
> That would be perfect :)
OK, I'll see what I can do.
-- Steve