It seems the recent kernel is slower at mounting hard disks than older
kernels. I haven't bisected down to exactly when this started, as it might
already have been reported or solved. I'm on the latest commit, but it
doesn't seem to be fixed yet.
commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5
Author: Al Viro <[email protected]>
Date: Sun Nov 18 19:19:00 2012 +0000
fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()
Thanks,
Jeff
On Mon 19-11-12 08:33:25, Jeff Chua wrote:
> It seems the recent kernel is slower mounting hard disk than older
> kernels. I've not bisect down to exact when this happen as it might
> already been reported or solved. I'm on the latest commit, but it
> doesn't seems to be fixed yet.
>
> commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5
> Author: Al Viro <[email protected]>
> Date: Sun Nov 18 19:19:00 2012 +0000
>
> fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()
I haven't heard about such a problem so far. What filesystem are you using?
Can you quantify 'is slower'? Bisecting would be welcome of course.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara <[email protected]> wrote:
> I haven't heard about such problem so far. What filesystem are you using?
I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seem to be slower
than before. It seems to be fs independent.
> Can you quantify 'is slower'? Bisecting would be welcome of course.
I haven't measured it, but it seems to be about 1 sec instead of 0.3 sec to mount.
I'll start bisecting :(
So, since I have 6 mounts on start up, it now takes about 4 seconds longer to boot up.
Thanks,
Jeff
Doesn't sound like a fsdevel issue since it seems to be independent of
filesystems. More like some generic block layer thing. Adding Jens
(and quoting the whole thing)
Jens, any ideas? Most of your stuff came in after -rc2, which would
fit with the fact that most of the slowdown seems to be after -rc2
according to Jeff.
Jeff, more bisecting would be good, though.
Linus
On Thu, Nov 22, 2012 at 4:30 AM, Jeff Chua <[email protected]> wrote:
> On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua <[email protected]> wrote:
>> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara <[email protected]> wrote:
>>> I haven't heard about such problem so far. What filesystem are you using?
>>
>> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower
>> than before. Seems to be fs independent.
>>
>>> Can you quantify 'is slower'? Bisecting would be welcome of course.
>>
>> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount.
>> I'll start bisecting:(
>> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up.
>
> I started bisecting, and it seems to have multiple points of slowing
> down. The latest kernel is really slow mounting /dev/sda1 (fs
> independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel
> 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm
> verifying them again.
>
>
> Jeff
On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua <[email protected]> wrote:
> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara <[email protected]> wrote:
>> I haven't heard about such problem so far. What filesystem are you using?
>
> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower
> than before. Seems to be fs independent.
>
>> Can you quantify 'is slower'? Bisecting would be welcome of course.
>
> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount.
> I'll start bisecting:(
> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up.
I started bisecting, and there seem to be multiple points of slowdown.
The latest kernel is really slow mounting /dev/sda1 (fs
independent) ... 0.529 sec. Kernel 3.6.0 took 0.012 sec, kernel
3.7.0-rc2 took 0.168 sec. Again, these are very early results. I'm
verifying them again.
Jeff
On 2012-11-22 20:21, Linus Torvalds wrote:
> Doesn't sound like a fsdevel issue since it seems to be independent of
> filesystems. More like some generic block layer thing. Adding Jens
> (and quoting the whole thing)
>
> Jens, any ideas? Most of your stuff came in after -rc2, which would
> fit with the fact that most of the slowdown seems to be after -rc2
> according to Jeff.
No ideas. Looking at what went in from my side, only the rq plug sorting
is a core change, and that should not cause any change in behaviour for
a single device. That's commit 975927b9.
> Jeff, more bisecting would be good, though.
Probably required, yes...
--
Jens Axboe
On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
> On 2012-11-22 20:21, Linus Torvalds wrote:
>> Doesn't sound like a fsdevel issue since it seems to be independent of
>> filesystems. More like some generic block layer thing. Adding Jens
>> (and quoting the whole thing)
>>
>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>> fit with the fact that most of the slowdown seems to be after -rc2
>> according to Jeff.
>
> No ideas. Looking at what went in from my side, only the rq plug sorting
> is a core change, and that should not cause any change in behaviour for
> a single device. That's commit 975927b9.
>
>> Jeff, more bisecting would be good, though.
>
> Probably required, yes...
This one slows mount from 0.012s to 0.168s.
commit 62ac665ff9fc07497ca524bd20d6a96893d11071
Author: Mikulas Patocka <[email protected]>
Date: Wed Sep 26 07:46:43 2012 +0200
blockdev: turn a rw semaphore into a percpu rw semaphore
There were a couple more changes to percpu-rw-semaphores after
3.7.0-rc2 and those slow mount further, from 0.168s to 0.500s. I don't
know for sure, but I'm suspecting these. Still bisecting.
commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
Author: Mikulas Patocka <[email protected]>
Date: Mon Oct 22 19:37:47 2012 -0400
percpu-rw-semaphores: use light/heavy barriers
commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
Author: Mikulas Patocka <[email protected]>
Date: Mon Oct 22 19:39:16 2012 -0400
percpu-rw-semaphores: use rcu_read_lock_sched
commit 1a25b1c4ce189e3926f2981f3302352a930086db
Author: Mikulas Patocka <[email protected]>
Date: Mon Oct 15 17:20:17 2012 -0400
Lock splice_read and splice_write functions
I couldn't revert these on the latest kernel, but I'm still trying.
Jeff
On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua <[email protected]> wrote:
> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
>> On 2012-11-22 20:21, Linus Torvalds wrote:
>>> Doesn't sound like a fsdevel issue since it seems to be independent of
>>> filesystems. More like some generic block layer thing. Adding Jens
>>> (and quoting the whole thing)
>>>
>>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>>> fit with the fact that most of the slowdown seems to be after -rc2
>>> according to Jeff.
>>
>> No ideas. Looking at what went in from my side, only the rq plug sorting
>> is a core change, and that should not cause any change in behaviour for
>> a single device. That's commit 975927b9.
>>
>>> Jeff, more bisecting would be good, though.
>>
>> Probably required, yes...
>
>
> This one slows mount from 0.012s to 0.168s.
>
> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
> Author: Mikulas Patocka <[email protected]>
> Date: Wed Sep 26 07:46:43 2012 +0200
>
> blockdev: turn a rw semaphore into a percpu rw semaphore
>
>
> There were couple of more changes to percpu-rw-semaphores after
> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
> really know, but I'm suspecting these. Still bisecting.
>
> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
> Author: Mikulas Patocka <[email protected]>
> Date: Mon Oct 22 19:37:47 2012 -0400
>
> percpu-rw-semaphores: use light/heavy barriers
>
>
> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
> Author: Mikulas Patocka <[email protected]>
> Date: Mon Oct 22 19:39:16 2012 -0400
>
> percpu-rw-semaphores: use rcu_read_lock_sched
I reverted these 3 patches and mount time is now 0.012s.
# time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> commit 1a25b1c4ce189e3926f2981f3302352a930086db
> Author: Mikulas Patocka <[email protected]>
> Date: Mon Oct 15 17:20:17 2012 -0400
>
> Lock splice_read and splice_write functions
Looks like this one is not the problem. But I reverted it anyway
because it's part of the same chunk.
Happy Thanksgiving!
Jeff
On Sat, Nov 24, 2012 at 7:31 AM, Jeff Chua <[email protected]> wrote:
> On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua <[email protected]> wrote:
>> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
>>> On 2012-11-22 20:21, Linus Torvalds wrote:
>>>> Doesn't sound like a fsdevel issue since it seems to be independent of
>>>> filesystems. More like some generic block layer thing. Adding Jens
>>>> (and quoting the whole thing)
>>>>
>>>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>>>> fit with the fact that most of the slowdown seems to be after -rc2
>>>> according to Jeff.
>>>
>>> No ideas. Looking at what went in from my side, only the rq plug sorting
>>> is a core change, and that should not cause any change in behaviour for
>>> a single device. That's commit 975927b9.
>>>
>>>> Jeff, more bisecting would be good, though.
>>>
>>> Probably required, yes...
>>
>>
>> This one slows mount from 0.012s to 0.168s.
>>
>> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
>> Author: Mikulas Patocka <[email protected]>
>> Date: Wed Sep 26 07:46:43 2012 +0200
>>
>> blockdev: turn a rw semaphore into a percpu rw semaphore
>>
>>
>> There were couple of more changes to percpu-rw-semaphores after
>> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
>> really know, but I'm suspecting these. Still bisecting.
>>
>> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
>> Author: Mikulas Patocka <[email protected]>
>> Date: Mon Oct 22 19:37:47 2012 -0400
>>
>> percpu-rw-semaphores: use light/heavy barriers
>>
>>
>> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
>> Author: Mikulas Patocka <[email protected]>
>> Date: Mon Oct 22 19:39:16 2012 -0400
>>
>> percpu-rw-semaphores: use rcu_read_lock_sched
>
>
> I reverted these 3 patches and mount time is now 0.012s.
>
> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
>> commit 1a25b1c4ce189e3926f2981f3302352a930086db
>> Author: Mikulas Patocka <[email protected]>
>> Date: Mon Oct 15 17:20:17 2012 -0400
>>
>> Lock splice_read and splice_write functions
>
> Looks like this one is not the problem. But I reverted it anyway
> because it's part of the same chunk.
>
>
> Happy Thanksgiving!
I'm on the latest git, 3.7.0-rc6 (5e351cdc998db82935d1248a053a1be37d1160fd),
with the above patches reverted, and mount time is now 0.012s.
Jeff
On Sat, 24 Nov 2012, Jeff Chua wrote:
> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
> > On 2012-11-22 20:21, Linus Torvalds wrote:
> >> Doesn't sound like a fsdevel issue since it seems to be independent of
> >> filesystems. More like some generic block layer thing. Adding Jens
> >> (and quoting the whole thing)
> >>
> >> Jens, any ideas? Most of your stuff came in after -rc2, which would
> >> fit with the fact that most of the slowdown seems to be after -rc2
> >> according to Jeff.
> >
> > No ideas. Looking at what went in from my side, only the rq plug sorting
> > is a core change, and that should not cause any change in behaviour for
> > a single device. That's commit 975927b9.
> >
> >> Jeff, more bisecting would be good, though.
> >
> > Probably required, yes...
>
>
> This one slows mount from 0.012s to 0.168s.
>
> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
> Author: Mikulas Patocka <[email protected]>
> Date: Wed Sep 26 07:46:43 2012 +0200
>
> blockdev: turn a rw semaphore into a percpu rw semaphore
>
>
> There were couple of more changes to percpu-rw-semaphores after
> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
> really know, but I'm suspecting these. Still bisecting.
The problem there is that you either use normal rw-semaphores and slow down
I/O, or you use percpu rw-semaphores, which don't slow down I/O but do slow
down mount.
So it's better to slow down mount.
(If you don't use any semaphore at all, as in the 3.6 kernel and before,
there is a race condition that can crash the kernel if someone does a mount
and a direct I/O read on the same device at the same time.)
You can improve mount time if you change all occurrences of
synchronize_sched() in include/linux/percpu-rwsem.h to
synchronize_sched_expedited().
But some people say that synchronize_sched_expedited() is bad for real-time
latency. (Could there be something like: if (realtime) synchronize_sched();
else synchronize_sched_expedited(); ?)
Mikulas
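[Editorial note: for illustration only, the conditional Mikulas floats above might look like the sketch below. The realtime_mode knob and the helper are hypothetical and do not exist in the kernel; the only real functions used are synchronize_sched() and synchronize_sched_expedited().]

#include <linux/rcupdate.h>

/*
 * Hypothetical sketch of the "if (realtime) ... else ..." idea above.
 * realtime_mode stands in for an imagined CONFIG_REALTIME option or
 * /proc/sys/kernel/realtime switch; neither it nor this helper exists.
 */
static bool realtime_mode;

static inline void percpu_rwsem_sync(void)
{
	if (realtime_mode)
		synchronize_sched();		/* gentler on RT latency, but slow */
	else
		synchronize_sched_expedited();	/* much faster, helps mount time */
}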
On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
> So it's better to slow down mount.
I am quite proud of the Linux boot time when pitted against other OSes. Even
with 10 partitions, Linux can boot up in just a few seconds, but now
you're saying that we need to do this semaphore check at boot up. By
doing so, it's adding an extra 4 seconds to boot up.
What about moving the locking mechanism to the "mount" program itself?
Wouldn't that be more feasible?
As for the case of simultaneous mounts, it's usually the administrator
who's doing something bad. I would say this is not a kernel issue.
Jeff
On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>> So it's better to slow down mount.
>
> I am quite proud of the linux boot time pitting against other OS. Even
> with 10 partitions. Linux can boot up in just a few seconds, but now
> you're saying that we need to do this semaphore check at boot up. By
> doing so, it's inducing additional 4 seconds during boot up.
By the way, I'm using a pretty fast SSD (Samsung PM830) and a fast CPU
(2.8GHz). For those on a slower hard disk or a slower CPU, I wonder what
kind of degradation this would cause, or would it be just the same?
Thanks,
Jeff
On 2012-11-27 06:57, Jeff Chua wrote:
> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>> So it's better to slow down mount.
>>
>> I am quite proud of the linux boot time pitting against other OS. Even
>> with 10 partitions. Linux can boot up in just a few seconds, but now
>> you're saying that we need to do this semaphore check at boot up. By
>> doing so, it's inducing additional 4 seconds during boot up.
>
> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> kind of degradation would this cause or just the same?
It'd likely be the same slowdown time-wise, but as a percentage it
would appear smaller on a slower disk.
Could you please test Mikulas' suggestion of changing
synchronize_sched() in include/linux/percpu-rwsem.h to
synchronize_sched_expedited()?
linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
tree. It would be a good data point if you could test that, too.
In any case, the slowdown definitely isn't acceptable. Fixing an
obscure issue like block sizes changing while O_DIRECT is in flight
definitely does NOT warrant a mount slowdown.
--
Jens Axboe
On 2012-11-27 08:38, Jens Axboe wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>> So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?
>
> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> tree. It would be a good data point it you could test that, too.
>
> In any case, the slow down definitely isn't acceptable. Fixing an
> obscure issue like block sizes changing while O_DIRECT is in flight
> definitely does NOT warrant a mount slow down.
Here's Oleg's patch; it might be easier for you than switching to
linux-next. Please try that.
From: Oleg Nesterov <[email protected]>
Subject: percpu_rw_semaphore: reimplement to not block the readers unnecessarily
Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely. Even if the "write" section was not actually started
or if it was already finished.
With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.
percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.
Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.
Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Anton Arapov <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/percpu-rwsem.h | 85 +++-------------------
lib/Makefile | 2
lib/percpu-rwsem.c | 123 +++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 72 deletions(-)
diff -puN include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily include/linux/percpu-rwsem.h
--- a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
#define _LINUX_PERCPU_RWSEM_H
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
struct percpu_rw_semaphore {
- unsigned __percpu *counters;
- bool locked;
- struct mutex mtx;
+ unsigned int __percpu *fast_read_ctr;
+ struct mutex writer_mutex;
+ struct rw_semaphore rw_sem;
+ atomic_t slow_read_ctr;
+ wait_queue_head_t write_waitq;
};
-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
- rcu_read_lock_sched();
- if (unlikely(p->locked)) {
- rcu_read_unlock_sched();
- mutex_lock(&p->mtx);
- this_cpu_inc(*p->counters);
- mutex_unlock(&p->mtx);
- return;
- }
- this_cpu_inc(*p->counters);
- rcu_read_unlock_sched();
- light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
- light_mb(); /* B, between read of the data and write to p->counter, paired with C */
- this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
- unsigned total = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
- return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
- mutex_lock(&p->mtx);
- p->locked = true;
- synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
- while (__percpu_count(p->counters))
- msleep(1);
- heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
- heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
- p->locked = false;
- mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
- p->counters = alloc_percpu(unsigned);
- if (unlikely(!p->counters))
- return -ENOMEM;
- p->locked = false;
- mutex_init(&p->mtx);
- return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
- free_percpu(p->counters);
- p->counters = NULL; /* catch use after free bugs */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
+
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
#endif
diff -puN lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily lib/Makefile
--- a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
idr.o int_sqrt.o extable.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o earlycpio.o
+ is_single_threaded.o plist.o decompress.o earlycpio.o percpu-rwsem.o
lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff -puN /dev/null lib/percpu-rwsem.c
--- /dev/null
+++ a/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+ brw->fast_read_ctr = alloc_percpu(int);
+ if (unlikely(!brw->fast_read_ctr))
+ return -ENOMEM;
+
+ mutex_init(&brw->writer_mutex);
+ init_rwsem(&brw->rw_sem);
+ atomic_set(&brw->slow_read_ctr, 0);
+ init_waitqueue_head(&brw->write_waitq);
+ return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+ free_percpu(brw->fast_read_ctr);
+ brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
+{
+ bool success = false;
+
+ preempt_disable();
+ if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+ __this_cpu_add(*brw->fast_read_ctr, val);
+ success = true;
+ }
+ preempt_enable();
+
+ return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, +1)))
+ return;
+
+ down_read(&brw->rw_sem);
+ atomic_inc(&brw->slow_read_ctr);
+ up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, -1)))
+ return;
+
+ /* false-positive is possible but harmless */
+ if (atomic_dec_and_test(&brw->slow_read_ctr))
+ wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+ unsigned int sum = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ sum += per_cpu(*brw->fast_read_ctr, cpu);
+ per_cpu(*brw->fast_read_ctr, cpu) = 0;
+ }
+
+ return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+ /* also blocks update_fast_ctr() which checks mutex_is_locked() */
+ mutex_lock(&brw->writer_mutex);
+
+ /*
+ * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+ * so that update_fast_ctr() can't succeed.
+ *
+ * 2. Ensures we see the result of every previous this_cpu_add() in
+ * update_fast_ctr().
+ *
+ * 3. Ensures that if any reader has exited its critical section via
+ * fast-path, it executes a full memory barrier before we return.
+ */
+ synchronize_sched();
+
+ /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+ atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+ /* block the new readers completely */
+ down_write(&brw->rw_sem);
+
+ /* wait for all readers to complete their percpu_up_read() */
+ wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+ /* allow the new readers, but only the slow-path */
+ up_write(&brw->rw_sem);
+
+ /* insert the barrier before the next fast-path in down_read */
+ synchronize_sched();
+
+ mutex_unlock(&brw->writer_mutex);
+}
--
Jens Axboe
Jens,
Limited access now at Incheon Airport. Will try the patch out when I arrive.
Thanks,
Jeff
On 11/27/12, Jens Axboe <[email protected]> wrote:
> On 2012-11-27 08:38, Jens Axboe wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]>
>>> wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]>
>>>> wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>>
>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point it you could test that, too.
>>
>> In any case, the slow down definitely isn't acceptable. Fixing an
>> obscure issue like block sizes changing while O_DIRECT is in flight
>> definitely does NOT warrant a mount slow down.
>
> Here's Olegs patch, might be easier for you than switching to
> linux-next. Please try that.
>
> From: Oleg Nesterov <[email protected]>
> Subject: percpu_rw_semaphore: reimplement to not block the readers
> unnecessarily
>
> Currently the writer does msleep() plus synchronize_sched() 3 times to
> acquire/release the semaphore, and during this time the readers are
> blocked completely. Even if the "write" section was not actually started
> or if it was already finished.
>
> With this patch down_write/up_write does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use the
> slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> Reviewed-by: Paul E. McKenney <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Mikulas Patocka <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> Cc: Ananth N Mavinakayanahalli <[email protected]>
> Cc: Anton Arapov <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> include/linux/percpu-rwsem.h | 85 +++-------------------
> lib/Makefile | 2
> lib/percpu-rwsem.c | 123 +++++++++++++++++++++++++++++++++
> 3 files changed, 138 insertions(+), 72 deletions(-)
>
> diff -puN
> include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> include/linux/percpu-rwsem.h
> ---
> a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
> #define _LINUX_PERCPU_RWSEM_H
>
> #include <linux/mutex.h>
> +#include <linux/rwsem.h>
> #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>
> struct percpu_rw_semaphore {
> - unsigned __percpu *counters;
> - bool locked;
> - struct mutex mtx;
> + unsigned int __percpu *fast_read_ctr;
> + struct mutex writer_mutex;
> + struct rw_semaphore rw_sem;
> + atomic_t slow_read_ctr;
> + wait_queue_head_t write_waitq;
> };
>
> -#define light_mb() barrier()
> -#define heavy_mb() synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> - rcu_read_lock_sched();
> - if (unlikely(p->locked)) {
> - rcu_read_unlock_sched();
> - mutex_lock(&p->mtx);
> - this_cpu_inc(*p->counters);
> - mutex_unlock(&p->mtx);
> - return;
> - }
> - this_cpu_inc(*p->counters);
> - rcu_read_unlock_sched();
> - light_mb(); /* A, between read of p->locked and read of data, paired with
> D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> - light_mb(); /* B, between read of the data and write to p->counter, paired
> with C */
> - this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> - unsigned total = 0;
> - int cpu;
> -
> - for_each_possible_cpu(cpu)
> - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> - return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> - mutex_lock(&p->mtx);
> - p->locked = true;
> - synchronize_sched(); /* make sure that all readers exit the
> rcu_read_lock_sched region */
> - while (__percpu_count(p->counters))
> - msleep(1);
> - heavy_mb(); /* C, between read of p->counter and write to data, paired
> with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> - heavy_mb(); /* D, between write to data and write to p->locked, paired
> with A */
> - p->locked = false;
> - mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> - p->counters = alloc_percpu(unsigned);
> - if (unlikely(!p->counters))
> - return -ENOMEM;
> - p->locked = false;
> - mutex_init(&p->mtx);
> - return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> - free_percpu(p->counters);
> - p->counters = NULL; /* catch use after free bugs */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> +
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>
> #endif
> diff -puN
> lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> lib/Makefile
> ---
> a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
> idr.o int_sqrt.o extable.o \
> sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> - is_single_threaded.o plist.o decompress.o earlycpio.o
> + is_single_threaded.o plist.o decompress.o earlycpio.o percpu-rwsem.o
>
> lib-$(CONFIG_MMU) += ioremap.o
> lib-$(CONFIG_SMP) += cpumask.o
> diff -puN /dev/null lib/percpu-rwsem.c
> --- /dev/null
> +++ a/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> + brw->fast_read_ctr = alloc_percpu(int);
> + if (unlikely(!brw->fast_read_ctr))
> + return -ENOMEM;
> +
> + mutex_init(&brw->writer_mutex);
> + init_rwsem(&brw->rw_sem);
> + atomic_set(&brw->slow_read_ctr, 0);
> + init_waitqueue_head(&brw->write_waitq);
> + return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> + free_percpu(brw->fast_read_ctr);
> + brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int
> val)
> +{
> + bool success = false;
> +
> + preempt_disable();
> + if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> + __this_cpu_add(*brw->fast_read_ctr, val);
> + success = true;
> + }
> + preempt_enable();
> +
> + return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> + if (likely(update_fast_ctr(brw, +1)))
> + return;
> +
> + down_read(&brw->rw_sem);
> + atomic_inc(&brw->slow_read_ctr);
> + up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> + if (likely(update_fast_ctr(brw, -1)))
> + return;
> +
> + /* false-positive is possible but harmless */
> + if (atomic_dec_and_test(&brw->slow_read_ctr))
> + wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> + unsigned int sum = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + sum += per_cpu(*brw->fast_read_ctr, cpu);
> + per_cpu(*brw->fast_read_ctr, cpu) = 0;
> + }
> +
> + return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr
> counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new
> readers,
> + * then waits until the slow counter becomes zero.
> + */
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> + mutex_lock(&brw->writer_mutex);
> +
> + /*
> + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> + * so that update_fast_ctr() can't succeed.
> + *
> + * 2. Ensures we see the result of every previous this_cpu_add() in
> + * update_fast_ctr().
> + *
> + * 3. Ensures that if any reader has exited its critical section via
> + * fast-path, it executes a full memory barrier before we return.
> + */
> + synchronize_sched();
> +
> + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> + /* block the new readers completely */
> + down_write(&brw->rw_sem);
> +
> + /* wait for all readers to complete their percpu_up_read() */
> + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> + /* allow the new readers, but only the slow-path */
> + up_write(&brw->rw_sem);
> +
> + /* insert the barrier before the next fast-path in down_read */
> + synchronize_sched();
> +
> + mutex_unlock(&brw->writer_mutex);
> +}
>
> --
> Jens Axboe
>
>
On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>> So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?
Tested. It seems as fast as before, but maybe a "tick" slower; just
perception. I was getting pretty much 0.012s with everything reverted.
With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
So, it's good.
> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> tree. It would be a good data point it you could test that, too.
Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
# time mount /dev/sda1 /mnt; sync; sync; umount /mnt
So, here's the comparison ...
0.500s 3.7.0-rc7
0.168s 3.7.0-rc2
0.012s 3.6.0
0.013s 3.7.0-rc7 + synchronize_sched_expedited()
0.350s 3.7.0-rc7 + Oleg's patch.
Thanks,
Jeff.
On 2012-11-27 11:06, Jeff Chua wrote:
> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>
> Tested. It seems as fast as before, but may be a "tick" slower. Just
> perception. I was getting pretty much 0.012s with everything reverted.
> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> So, it's good.
Excellent
>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point it you could test that, too.
>
> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
the real issue, which is having to do synchronize_sched() in the first
place.
> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
> So, here's the comparison ...
>
> 0.500s 3.7.0-rc7
> 0.168s 3.7.0-rc2
> 0.012s 3.6.0
> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> 0.350s 3.7.0-rc7 + Oleg's patch.
I wonder how many of them are due to changing to the same block size.
Does the below patch make a difference?
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..f041c56 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
if (size < bdev_logical_block_size(bdev))
return -EINVAL;
- /* Prevent starting I/O or mapping the device */
- percpu_down_write(&bdev->bd_block_size_semaphore);
-
/* Check that the block device is not memory mapped */
mapping = bdev->bd_inode->i_mapping;
mutex_lock(&mapping->i_mmap_mutex);
if (mapping_mapped(mapping)) {
mutex_unlock(&mapping->i_mmap_mutex);
- percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
mutex_unlock(&mapping->i_mmap_mutex);
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
- sync_blockdev(bdev);
- bdev->bd_block_size = size;
- bdev->bd_inode->i_blkbits = blksize_bits(size);
- kill_bdev(bdev);
+ /* Prevent starting I/O */
+ percpu_down_write(&bdev->bd_block_size_semaphore);
+ if (bdev->bd_block_size != size) {
+ sync_blockdev(bdev);
+ bdev->bd_block_size = size;
+ bdev->bd_inode->i_blkbits = blksize_bits(size);
+ kill_bdev(bdev);
+ }
+ percpu_up_write(&bdev->bd_block_size_semaphore);
}
- percpu_up_write(&bdev->bd_block_size_semaphore);
-
return 0;
}
@@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
+ struct address_space *mapping = file->f_mapping;
int ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
+ mutex_lock(&mapping->i_mmap_mutex);
ret = generic_file_mmap(file, vma);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
+ mutex_unlock(&mapping->i_mmap_mutex);
return ret;
}
--
Jens Axboe
On Tue, 27 Nov 2012, Jens Axboe wrote:
> On 2012-11-27 11:06, Jeff Chua wrote:
> > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
> >> On 2012-11-27 06:57, Jeff Chua wrote:
> >>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
> >>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
> >>>>> So it's better to slow down mount.
> >>>>
> >>>> I am quite proud of the linux boot time pitting against other OS. Even
> >>>> with 10 partitions. Linux can boot up in just a few seconds, but now
> >>>> you're saying that we need to do this semaphore check at boot up. By
> >>>> doing so, it's inducing additional 4 seconds during boot up.
> >>>
> >>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >>> kind of degradation would this cause or just the same?
> >>
> >> It'd likely be the same slow down time wise, but as a percentage it
> >> would appear smaller on a slower disk.
> >>
> >> Could you please test Mikulas' suggestion of changing
> >> synchronize_sched() in include/linux/percpu-rwsem.h to
> >> synchronize_sched_expedited()?
> >
> > Tested. It seems as fast as before, but may be a "tick" slower. Just
> > perception. I was getting pretty much 0.012s with everything reverted.
> > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> > So, it's good.
>
> Excellent
>
> >> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> >> tree. It would be a good data point it you could test that, too.
> >
> > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>
> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
> the real issue, which is having to do synchronize_sched() in the first
> place.
>
> > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> >
> >
> > So, here's the comparison ...
> >
> > 0.500s 3.7.0-rc7
> > 0.168s 3.7.0-rc2
> > 0.012s 3.6.0
> > 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> > 0.350s 3.7.0-rc7 + Oleg's patch.
>
> I wonder how many of them are due to changing to the same block size.
> Does the below patch make a difference?
This patch is wrong because you must check if the device is mapped while
holding bdev->bd_block_size_semaphore (because
bdev->bd_block_size_semaphore prevents new mappings from being created).
I'm sending another patch that has the same effect.
Note that the ext[234] filesystems set the blocksize to 1024 temporarily during
mount, so it doesn't help much (it only helps for other filesystems, such
as jfs). For ext[234], you have a device with a default block size of 4096; the
filesystem sets the block size to 1024 during mount, reads the superblock, and
sets it back to 4096.
If you want, you can fix ext[234] to avoid this behavior.
Mikulas
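[Editorial note: as an aside, the ext[234] mount-time pattern Mikulas describes looks roughly like the sketch below. It is a simplified illustration, not actual fs/ext2/super.c code; example_fill_super() and read_log_block_size() are made-up names, and the latter stands in for parsing s_log_block_size out of the on-disk superblock. The helpers sb_min_blocksize(), sb_set_blocksize(), sb_bread() and brelse() are the real kernel interfaces.]

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* hypothetical stand-in for parsing s_log_block_size from bh->b_data */
static unsigned long read_log_block_size(struct buffer_head *bh)
{
	(void)bh;	/* a real implementation would look at bh->b_data */
	return 2;	/* pretend the superblock says 4096-byte blocks */
}

/*
 * Simplified sketch of why an ext[234] mount ends up calling
 * set_blocksize() twice: the on-disk superblock sits at a fixed
 * 1024-byte offset, so the block size is first dropped to 1024 to read
 * it, then raised to the filesystem's real block size.
 */
static int example_fill_super(struct super_block *sb, unsigned long sb_block)
{
	struct buffer_head *bh;
	unsigned long real_size;

	/* 1st set_blocksize(): go to 1024 so the superblock can be read */
	if (!sb_min_blocksize(sb, 1024))
		return -EINVAL;

	bh = sb_bread(sb, sb_block);
	if (!bh)
		return -EIO;

	/* e.g. 1024 << s_log_block_size from the on-disk superblock */
	real_size = 1024UL << read_log_block_size(bh);

	if (sb->s_blocksize != real_size) {
		brelse(bh);
		/* 2nd set_blocksize(): switch to the filesystem's real size */
		if (!sb_set_blocksize(sb, real_size))
			return -EINVAL;
		bh = sb_bread(sb, sb_block);
		if (!bh)
			return -EIO;
	}

	brelse(bh);
	return 0;
}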
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..f041c56 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
> if (size < bdev_logical_block_size(bdev))
> return -EINVAL;
>
> - /* Prevent starting I/O or mapping the device */
> - percpu_down_write(&bdev->bd_block_size_semaphore);
> -
> /* Check that the block device is not memory mapped */
> mapping = bdev->bd_inode->i_mapping;
> mutex_lock(&mapping->i_mmap_mutex);
> if (mapping_mapped(mapping)) {
> mutex_unlock(&mapping->i_mmap_mutex);
> - percpu_up_write(&bdev->bd_block_size_semaphore);
> return -EBUSY;
> }
> mutex_unlock(&mapping->i_mmap_mutex);
>
> /* Don't change the size if it is same as current */
> if (bdev->bd_block_size != size) {
> - sync_blockdev(bdev);
> - bdev->bd_block_size = size;
> - bdev->bd_inode->i_blkbits = blksize_bits(size);
> - kill_bdev(bdev);
> + /* Prevent starting I/O */
> + percpu_down_write(&bdev->bd_block_size_semaphore);
> + if (bdev->bd_block_size != size) {
> + sync_blockdev(bdev);
> + bdev->bd_block_size = size;
> + bdev->bd_inode->i_blkbits = blksize_bits(size);
> + kill_bdev(bdev);
> + }
> + percpu_up_write(&bdev->bd_block_size_semaphore);
> }
>
> - percpu_up_write(&bdev->bd_block_size_semaphore);
> -
> return 0;
> }
>
> @@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
>
> static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + struct address_space *mapping = file->f_mapping;
> int ret;
> - struct block_device *bdev = I_BDEV(file->f_mapping->host);
> -
> - percpu_down_read(&bdev->bd_block_size_semaphore);
>
> + mutex_lock(&mapping->i_mmap_mutex);
> ret = generic_file_mmap(file, vma);
> -
> - percpu_up_read(&bdev->bd_block_size_semaphore);
> + mutex_unlock(&mapping->i_mmap_mutex);
>
> return ret;
> }
>
> --
> Jens Axboe
>
block_dev: don't take the write lock if block size doesn't change
Taking the write lock has a big performance impact on the whole system
(because of synchronize_sched_expedited). This patch avoids taking the
write lock if the block size doesn't change (i.e. when mounting
filesystem with block size equal to the default block size).
The logic to test if the block device is mapped was moved to a separate
function is_bdev_mapped to avoid code duplication.
Signed-off-by: Mikulas Patocka <[email protected]>
---
fs/block_dev.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
Index: linux-3.7-rc7/fs/block_dev.c
===================================================================
--- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.000000000 +0100
+++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 04:13:53.000000000 +0100
@@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device
}
EXPORT_SYMBOL(invalidate_bdev);
-int set_blocksize(struct block_device *bdev, int size)
+static int is_bdev_mapped(struct block_device *bdev)
{
- struct address_space *mapping;
+ int ret_val;
+ struct address_space *mapping = bdev->bd_inode->i_mapping;
+ mutex_lock(&mapping->i_mmap_mutex);
+ ret_val = mapping_mapped(mapping);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return ret_val;
+}
+int set_blocksize(struct block_device *bdev, int size)
+{
/* Size must be a power of two, and between 512 and PAGE_SIZE */
if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
return -EINVAL;
@@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b
if (size < bdev_logical_block_size(bdev))
return -EINVAL;
+ /*
+ * If the block size doesn't change, don't take the write lock.
+ * We check for is_bdev_mapped anyway, for consistent behavior.
+ */
+ if (size == bdev->bd_block_size)
+ return is_bdev_mapped(bdev) ? -EBUSY : 0;
+
/* Prevent starting I/O or mapping the device */
percpu_down_write(&bdev->bd_block_size_semaphore);
/* Check that the block device is not memory mapped */
- mapping = bdev->bd_inode->i_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
- if (mapping_mapped(mapping)) {
- mutex_unlock(&mapping->i_mmap_mutex);
+ if (is_bdev_mapped(bdev)) {
percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
- mutex_unlock(&mapping->i_mmap_mutex);
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
On Tue, 27 Nov 2012, Jeff Chua wrote:
> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
> > On 2012-11-27 06:57, Jeff Chua wrote:
> >> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
> >>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
> >>>> So it's better to slow down mount.
> >>>
> >>> I am quite proud of the linux boot time pitting against other OS. Even
> >>> with 10 partitions. Linux can boot up in just a few seconds, but now
> >>> you're saying that we need to do this semaphore check at boot up. By
> >>> doing so, it's inducing additional 4 seconds during boot up.
> >>
> >> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >> kind of degradation would this cause or just the same?
> >
> > It'd likely be the same slow down time wise, but as a percentage it
> > would appear smaller on a slower disk.
> >
> > Could you please test Mikulas' suggestion of changing
> > synchronize_sched() in include/linux/percpu-rwsem.h to
> > synchronize_sched_expedited()?
>
> Tested. It seems as fast as before, but may be a "tick" slower. Just
> perception. I was getting pretty much 0.012s with everything reverted.
> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> So, it's good.
>
>
> > linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> > tree. It would be a good data point it you could test that, too.
>
> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>
> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
> So, here's the comparison ...
>
> 0.500s 3.7.0-rc7
> 0.168s 3.7.0-rc2
> 0.012s 3.6.0
> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> 0.350s 3.7.0-rc7 + Oleg's patch.
>
>
> Thanks,
> Jeff.
OK, I'm sending two patches to reduce mount times. If it is possible to
put them into 3.7.0, put them there.
Mikulas
---
percpu-rwsem: use synchronize_sched_expedited
Use synchronize_sched_expedited() instead of synchronize_sched()
to improve mount speed.
This patch improves mount time from 0.500s to 0.013s.
Note: if realtime people complain about the use of
synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
that they introduce an option CONFIG_REALTIME or
/proc/sys/kernel/realtime and turn off these *_expedited functions if
the option is enabled (i.e. turn synchronize_sched_expedited into
synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
Signed-off-by: Mikulas Patocka <[email protected]>
---
include/linux/percpu-rwsem.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-3.7-rc7/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/percpu-rwsem.h 2012-11-28 02:41:03.000000000 +0100
+++ linux-3.7-rc7/include/linux/percpu-rwsem.h 2012-11-28 02:41:15.000000000 +0100
@@ -13,7 +13,7 @@ struct percpu_rw_semaphore {
};
#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+#define heavy_mb() synchronize_sched_expedited()
static inline void percpu_down_read(struct percpu_rw_semaphore *p)
{
@@ -51,7 +51,7 @@ static inline void percpu_down_write(str
{
mutex_lock(&p->mtx);
p->locked = true;
- synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
+ synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */
while (__percpu_count(p->counters))
msleep(1);
heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
On 2012-11-28 04:57, Mikulas Patocka wrote:
>
>
> On Tue, 27 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-27 11:06, Jeff Chua wrote:
>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>>>> On 2012-11-27 06:57, Jeff Chua wrote:
>>>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>>>>> So it's better to slow down mount.
>>>>>>
>>>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>>>> you're saying that we need to do this semaphore check at boot up. By
>>>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>>>
>>>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>>>> kind of degradation would this cause or just the same?
>>>>
>>>> It'd likely be the same slow down time wise, but as a percentage it
>>>> would appear smaller on a slower disk.
>>>>
>>>> Could you please test Mikulas' suggestion of changing
>>>> synchronize_sched() in include/linux/percpu-rwsem.h to
>>>> synchronize_sched_expedited()?
>>>
>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>> perception. I was getting pretty much 0.012s with everything reverted.
>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>> So, it's good.
>>
>> Excellent
>>
>>>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>>>> tree. It would be a good data point it you could test that, too.
>>>
>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>
>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>> the real issue, which is having to do synchronize_sched() in the first
>> place.
>>
>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>
>>>
>>> So, here's the comparison ...
>>>
>>> 0.500s 3.7.0-rc7
>>> 0.168s 3.7.0-rc2
>>> 0.012s 3.6.0
>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>
>> I wonder how many of them are due to changing to the same block size.
>> Does the below patch make a difference?
>
> This patch is wrong because you must check if the device is mapped while
> holding bdev->bd_block_size_semaphore (because
> bdev->bd_block_size_semaphore prevents new mappings from being created)
No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
> I'm sending another patch that has the same effect.
>
>
> Note that ext[234] filesystems set blocksize to 1024 temporarily during
> mount, so it doesn't help much (it only helps for other filesystems, such
> as jfs). For ext[234], you have a device with default block size 4096, the
> filesystem sets block size to 1024 during mount, reads the super block and
> sets it back to 4096.
That is true, hence I was hesitant to think it'd actually help. In any
case, basically any block device will have at least one blocksize
transition when being mounted for the first time. I wonder if we just
shouldn't default to having a 4kb soft block size to avoid that one,
though it is working around the issue to some degree.
--
Jens Axboe
On Wed, Nov 28, 2012 at 4:33 PM, Jens Axboe <[email protected]> wrote:
> On 2012-11-28 04:57, Mikulas Patocka wrote:
>>
>>
>> On Tue, 27 Nov 2012, Jens Axboe wrote:
>>
>>> On 2012-11-27 11:06, Jeff Chua wrote:
>>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>>>>> On 2012-11-27 06:57, Jeff Chua wrote:
>>>>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>>>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>>>>>> So it's better to slow down mount.
>>>>>>>
>>>>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>>>>> you're saying that we need to do this semaphore check at boot up. By
>>>>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>>>>
>>>>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>>>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>>>>> kind of degradation would this cause or just the same?
>>>>>
>>>>> It'd likely be the same slow down time wise, but as a percentage it
>>>>> would appear smaller on a slower disk.
>>>>>
>>>>> Could you please test Mikulas' suggestion of changing
>>>>> synchronize_sched() in include/linux/percpu-rwsem.h to
>>>>> synchronize_sched_expedited()?
>>>>
>>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>>> perception. I was getting pretty much 0.012s with everything reverted.
>>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>>> So, it's good.
>>>
>>> Excellent
>>>
>>>>> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
>>>>> tree. It would be a good data point if you could test that, too.
>>>>
>>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>>
>>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>>> the real issue, which is having to do synchronize_sched() in the first
>>> place.
>>>
>>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>>
>>>>
>>>> So, here's the comparison ...
>>>>
>>>> 0.500s 3.7.0-rc7
>>>> 0.168s 3.7.0-rc2
>>>> 0.012s 3.6.0
>>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>>
>>> I wonder how many of them are due to changing to the same block size.
>>> Does the below patch make a difference?
>>
>> This patch is wrong because you must check if the device is mapped while
>> holding bdev->bd_block_size_semaphore (because
>> bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
>> I'm sending another patch that has the same effect.
>>
>>
>> Note that ext[234] filesystems set blocksize to 1024 temporarily during
>> mount, so it doesn't help much (it only helps for other filesystems, such
>> as jfs). For ext[234], you have a device with default block size 4096, the
>> filesystem sets block size to 1024 during mount, reads the super block and
>> sets it back to 4096.
>
> That is true, hence I was hesitant to think it'll actually help. In any
> case, basically any block device will have at least one blocksize
> transition when being mounted for the first time. I wonder if we just
> shouldn't default to having a 4kb soft block size to avoid that one,
> though it is working around the issue to some degree.
I tested on reiserfs. It helped. 0.012s as in 3.6.0, but as Mikulas
mentioned, it didn't really improve much for ext2.
Jeff.
On Wed, Nov 28, 2012 at 11:59 AM, Mikulas Patocka <[email protected]> wrote:
>
>
> On Tue, 27 Nov 2012, Jeff Chua wrote:
>
>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>> > On 2012-11-27 06:57, Jeff Chua wrote:
>> >> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>> >>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>> >>>> So it's better to slow down mount.
>> >>>
>> >>> I am quite proud of the Linux boot time pitted against other OSes. Even
>> >>> with 10 partitions, Linux can boot up in just a few seconds, but now
>> >>> you're saying that we need to do this semaphore check at boot up. By
>> >>> doing so, it's inducing additional 4 seconds during boot up.
>> >>
>> >> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> >> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> >> kind of degradation would this cause or just the same?
>> >
>> > It'd likely be the same slow down time wise, but as a percentage it
>> > would appear smaller on a slower disk.
>> >
>> > Could you please test Mikulas' suggestion of changing
>> > synchronize_sched() in include/linux/percpu-rwsem.h to
>> > synchronize_sched_expedited()?
>>
>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>> perception. I was getting pretty much 0.012s with everything reverted.
>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>> So, it's good.
>>
>>
>> > linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
>> > tree. It would be a good data point if you could test that, too.
>>
>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>
>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>
>>
>> So, here's the comparison ...
>>
>> 0.500s 3.7.0-rc7
>> 0.168s 3.7.0-rc2
>> 0.012s 3.6.0
>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>
>>
>> Thanks,
>> Jeff.
>
> OK, I'm sending two patches to reduce mount times. If it is possible to
> put them into 3.7.0, put them there.
>
> Mikulas
>
> ---
>
> percpu-rwsem: use synchronize_sched_expedited
>
> Use synchronize_sched_expedited() instead of synchronize_sched()
> to improve mount speed.
>
> This patch improves mount time from 0.500s to 0.013s.
>
> Note: if realtime people complain about the use of
> synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> that they introduce an option CONFIG_REALTIME or
> /proc/sys/kernel/realtime and turn off these *_expedited functions if
> the option is enabled (i.e. turn synchronize_sched_expedited into
> synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> include/linux/percpu-rwsem.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-3.7-rc7/include/linux/percpu-rwsem.h
> ===================================================================
> --- linux-3.7-rc7.orig/include/linux/percpu-rwsem.h 2012-11-28 02:41:03.000000000 +0100
> +++ linux-3.7-rc7/include/linux/percpu-rwsem.h 2012-11-28 02:41:15.000000000 +0100
> @@ -13,7 +13,7 @@ struct percpu_rw_semaphore {
> };
>
> #define light_mb() barrier()
> -#define heavy_mb() synchronize_sched()
> +#define heavy_mb() synchronize_sched_expedited()
>
> static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> {
> @@ -51,7 +51,7 @@ static inline void percpu_down_write(str
> {
> mutex_lock(&p->mtx);
> p->locked = true;
> - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> + synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */
> while (__percpu_count(p->counters))
> msleep(1);
> heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
Mikulas,
Tested this one and this is good! Back to 3.6.0 behavior.
As for the 2nd patch (block_dev.c), it didn't really make any
difference for ext2/3/4, but for reiserfs, it does. So, won't just the
patch above (synchronize_sched_expedited) be good enough?
Thanks,
Jeff
On Wed, Nov 28, 2012 at 12:01 PM, Mikulas Patocka <[email protected]> wrote:
> block_dev: don't take the write lock if block size doesn't change
>
> Taking the write lock has a big performance impact on the whole system
> (because of synchronize_sched_expedited). This patch avoids taking the
> write lock if the block size doesn't change (i.e. when mounting
> filesystem with block size equal to the default block size).
>
> The logic to test if the block device is mapped was moved to a separate
> function is_bdev_mapped to avoid code duplication.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> fs/block_dev.c | 25 ++++++++++++++++++-------
> 1 file changed, 18 insertions(+), 7 deletions(-)
>
> Index: linux-3.7-rc7/fs/block_dev.c
> ===================================================================
> --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.000000000 +0100
> +++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 04:13:53.000000000 +0100
> @@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device
> }
> EXPORT_SYMBOL(invalidate_bdev);
>
> -int set_blocksize(struct block_device *bdev, int size)
> +static int is_bdev_mapped(struct block_device *bdev)
> {
> - struct address_space *mapping;
> + int ret_val;
> + struct address_space *mapping = bdev->bd_inode->i_mapping;
> + mutex_lock(&mapping->i_mmap_mutex);
> + ret_val = mapping_mapped(mapping);
> + mutex_unlock(&mapping->i_mmap_mutex);
> + return ret_val;
> +}
>
> +int set_blocksize(struct block_device *bdev, int size)
> +{
> /* Size must be a power of two, and between 512 and PAGE_SIZE */
> if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> return -EINVAL;
> @@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b
> if (size < bdev_logical_block_size(bdev))
> return -EINVAL;
>
> + /*
> + * If the block size doesn't change, don't take the write lock.
> + * We check for is_bdev_mapped anyway, for consistent behavior.
> + */
> + if (size == bdev->bd_block_size)
> + return is_bdev_mapped(bdev) ? -EBUSY : 0;
> +
> /* Prevent starting I/O or mapping the device */
> percpu_down_write(&bdev->bd_block_size_semaphore);
>
> /* Check that the block device is not memory mapped */
> - mapping = bdev->bd_inode->i_mapping;
> - mutex_lock(&mapping->i_mmap_mutex);
> - if (mapping_mapped(mapping)) {
> - mutex_unlock(&mapping->i_mmap_mutex);
> + if (is_bdev_mapped(bdev)) {
> percpu_up_write(&bdev->bd_block_size_semaphore);
> return -EBUSY;
> }
> - mutex_unlock(&mapping->i_mmap_mutex);
>
> /* Don't change the size if it is same as current */
> if (bdev->bd_block_size != size) {
This patch didn't really make any difference for ext2/3/4 but for
reiserfs it does.
With the synchronize_sched_expedited() patch applied, it didn't make
any difference.
Thanks,
Jeff
On Wed, 28 Nov 2012, Jens Axboe wrote:
> On 2012-11-28 04:57, Mikulas Patocka wrote:
> >
> > This patch is wrong because you must check if the device is mapped while
> > holding bdev->bd_block_size_semaphore (because
> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
Hmm, it was wrong before the patch and it is wrong after the patch too.
The problem is that ->mmap method doesn't do the actual mapping, the
caller of ->mmap (mmap_region) does it. So we must actually catch
mmap_region and protect it with the lock, not ->mmap.
Mikulas
---
Introduce a method to catch mmap_region
For block devices, we must prevent the device from being mapped while
block size is being changed.
This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.
The problem is that the generic_file_mmap method does almost nothing, it
only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient to prevent the race condition. The race condition can
happen for example this way:
process 1:
Starts mapping a block device, it goes to mmap_region, calls
file->f_op->mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.
process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.
process 1:
Continues in mmap_region, successfully finishes the mapping.
process 2:
Continues in set_blocksize, changes the block size while the block
device is mapped. (which is incorrect and may result in a crash or
misbehavior).
To fix this possible race condition, we need to catch the function that
does real mapping - mmap_region. This patch adds a new method,
file_operations->mmap_region. mmap_region calls the method
file_operations->mmap_region if it is non-NULL, otherwise, it calls
generic_mmap_region.
Signed-off-by: Mikulas Patocka <[email protected]>
---
fs/block_dev.c | 12 ++++++++----
include/linux/fs.h | 4 ++++
include/linux/mm.h | 3 +++
mm/mmap.c | 10 ++++++++++
4 files changed, 25 insertions(+), 4 deletions(-)
Index: linux-3.7-rc7/include/linux/fs.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/fs.h 2012-11-28 17:45:55.000000000 +0100
+++ linux-3.7-rc7/include/linux/fs.h 2012-11-28 18:02:45.000000000 +0100
@@ -27,6 +27,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/mm_types.h>
#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1528,6 +1529,9 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ unsigned long (*mmap_region)(struct file *, unsigned long,
+ unsigned long, unsigned long, vm_flags_t,
+ unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
Index: linux-3.7-rc7/include/linux/mm.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/mm.h 2012-11-28 17:45:55.000000000 +0100
+++ linux-3.7-rc7/include/linux/mm.h 2012-11-28 17:47:12.000000000 +0100
@@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s
extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff);
extern unsigned long do_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
Index: linux-3.7-rc7/mm/mmap.c
===================================================================
--- linux-3.7-rc7.orig/mm/mmap.c 2012-11-28 17:45:55.000000000 +0100
+++ linux-3.7-rc7/mm/mmap.c 2012-11-28 17:57:40.000000000 +0100
@@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff)
{
+ if (file && file->f_op->mmap_region)
+ return file->f_op->mmap_region(file, addr, len, flags, vm_flags,
+ pgoff);
+ return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
+}
+
+unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff)
+{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int correct_wcount = 0;
Index: linux-3.7-rc7/fs/block_dev.c
===================================================================
--- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 17:45:56.000000000 +0100
+++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 17:47:12.000000000 +0100
@@ -1658,14 +1658,17 @@ ssize_t blkdev_aio_write(struct kiocb *i
}
EXPORT_SYMBOL_GPL(blkdev_aio_write);
-static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+static unsigned long blkdev_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ vm_flags_t vm_flags,
+ unsigned long pgoff)
{
- int ret;
+ unsigned long ret;
struct block_device *bdev = I_BDEV(file->f_mapping->host);
percpu_down_read(&bdev->bd_block_size_semaphore);
- ret = generic_file_mmap(file, vma);
+ ret = generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
percpu_up_read(&bdev->bd_block_size_semaphore);
@@ -1737,7 +1740,8 @@ const struct file_operations def_blk_fop
.write = do_sync_write,
.aio_read = blkdev_aio_read,
.aio_write = blkdev_aio_write,
- .mmap = blkdev_mmap,
+ .mmap_region = blkdev_mmap_region,
+ .mmap = generic_file_mmap,
.fsync = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
#ifdef CONFIG_COMPAT
No, this is crap.
We don't introduce random hooks like this just because the block layer
has shit-for-brains and cannot be bothered to do things right.
The fact is, the whole locking in the block layer open routine is
total and utter crap. It doesn't lock the right thing, even with your
change *anyway* (or with the change Jens had). Absolutely nothing in
"mmap_region()" cares at all about the block-size anywhere - it's
generic, after all - so locking around it is f*cking pointless. There
is no way in hell that the caller of ->mmap can *ever* care about the
block size, since it never even looks at it.
Don't do random crap like this.
Why does the code think that mmap matters so much anyway? As you say,
the mmap itself does *nothing*. It has no impact for the block size.
Linus
On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka <[email protected]> wrote:
>
>
> On Wed, 28 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-28 04:57, Mikulas Patocka wrote:
>> >
>> > This patch is wrong because you must check if the device is mapped while
>> > holding bdev->bd_block_size_semaphore (because
>> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>>
>> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
> Hmm, it was wrong before the patch and it is wrong after the patch too.
>
> The problem is that ->mmap method doesn't do the actual mapping, the
> caller of ->mmap (mmap_region) does it. So we must actually catch
> mmap_region and protect it with the lock, not ->mmap.
On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.
... and here I was, trying to massage a reply into form that would be
at least borderline printable... Anyway, this is certainly the wrong
way to go. We want to catch if the damn thing is mapped and mapping_mapped()
leaves a race? Fine, so let's not use mapping_mapped() at all. Have a
private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
added to it. Incrementing/decrementing a per-block_device atomic counter,
with increment side done under your rwsem, to make sure that 0->positive
transition doesn't change in critical section of set_blocksize(). And don't
use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
local copy.
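
A condensed sketch of the approach Al describes (the names follow the full
patch Mikulas posts further down in the thread):

static void blkdev_vm_open(struct vm_area_struct *area)
{
	atomic_inc(&I_BDEV(area->vm_file->f_mapping->host)->bd_mmap_count);
}

static void blkdev_vm_close(struct vm_area_struct *area)
{
	atomic_dec(&I_BDEV(area->vm_file->f_mapping->host)->bd_mmap_count);
}

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct block_device *bdev = I_BDEV(file->f_mapping->host);

	file_accessed(file);
	/* copy of generic_file_vm_ops plus the open/close above */
	vma->vm_ops = &def_blk_vm_ops;

	/* the 0 -> 1 transition happens under the rwsem, so it cannot
	 * race with the counter check in set_blocksize() */
	percpu_down_read(&bdev->bd_block_size_semaphore);
	atomic_inc(&bdev->bd_mmap_count);
	percpu_up_read(&bdev->bd_block_size_semaphore);
	return 0;
}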
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.
>
> Linus
mmap_region() doesn't care about the block size. But a lot of
page-in/page-out code does.
The problem is that once the block device is mapped, page faults or page
writeback can happen anytime - so the simplest solution is to not allow
the block device being mapped while we change block size.
The function set_blocksize takes bd_block_size_semaphore for write (that
blocks read/write/mmap), then it calls sync_blockdev (now we are sure that
there is no more writeback), then it changes the block size, then it calls
kill_bdev (now we are sure that there are no more any pages with buffers
with the old blocksize).
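
A simplified sketch of that sequence (with the rwsem patches applied; not a
verbatim copy of set_blocksize()):

	percpu_down_write(&bdev->bd_block_size_semaphore);
	/* read(), write() and mmap() of the raw device are now blocked */

	sync_blockdev(bdev);		/* no more writeback in flight */
	bdev->bd_block_size = size;
	bdev->bd_inode->i_blkbits = blksize_bits(size);
	kill_bdev(bdev);		/* drop the buffers with the old size */

	percpu_up_write(&bdev->bd_block_size_semaphore);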
If you want to allow to change block size while a block device is mapped,
you'd have to add explicit locks around all mm callbacks (so that the
block size can't change while the callback is in progress) - and even
then, there are some unsolvable cases - i.e. what are you going to do if
the user mlocks a mapped block device and you change block size of that
device? - you can't drop the pages (that would violate mlock semantics)
and you can leave them there (because they have buffers with wrong size).
If you don't like what I sent, propose a different solution.
Mikulas
On Wed, Nov 28, 2012 at 11:43 AM, Al Viro <[email protected]> wrote:
> Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it.
That sounds more reasonable.
However, I suspect the *most* reasonable thing to do is to just remove
the whole damn thing. We really shouldn't care about mmap. If somebody
does a mmap on a block device, and somebody else then changes the
block size, why-ever should we bother to go through any contortions at
*all* to make that kind of insane behavior do anything sane at all.
Just let people mmap things. Then just let the normal page cache
invalidation work right. In fact, it is entirely possible that we
could/should just not even invalidate the page cache at all, just make
sure that the buffer heads attached to any pages get disconnected. No?
Linus
On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka <[email protected]> wrote:
>
> mmap_region() doesn't care about the block size. But a lot of
> page-in/page-out code does.
That seems a bogus argument.
mmap() is in *no* way special. The exact same thing happens for
regular read/write. Yet somehow the mmap code is special-cased, while
the normal read-write code is not.
I suspect it might be *easier* to trigger some issues with mmap, but
that still isn't a good enough reason to special-case it. We don't add
locking to one place just because that one place shows some race
condition more easily. We fix the locking.
So for example, maybe the code that *actually* cares about the buffer
size (the stuff that allocates buffers in fs/buffer.c) needs to take
that new percpu read lock. Basically, any caller of
"alloc_page_buffers()/create_empty_buffers()" or whatever.
I also wonder whether we need it *at*all*. I suspect that we could
easily have multiple block-sizes these days for the same block device.
It *used* to be (millions of years ago, when dinosaurs roamed the
earth) that the block buffers were global and shared with all users of
a partition. But that hasn't been true since we started using the page
cache, and I suspect that some of the block size changing issues are
simply entirely stale.
Yeah, yeah, there could be some coherency issues if people write to
the block device through different block sizes, but I think we have
those coherency issues anyway. The page-cache is not coherent across
different mapping inodes anyway.
So I really suspect that some of this is "legacy logic". Or at least
perhaps _should_ be.
Linus
On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
<[email protected]> wrote:
>
> mmap() is in *no* way special. The exact same thing happens for
> regular read/write. Yet somehow the mmap code is special-cased, while
> the normal read-write code is not.
I just double-checked, because it's been a long time since I actually
looked at the code.
But yeah, block device read/write uses the pure page cache functions.
IOW, it has the *exact* same IO engine as mmap() would have.
So here's my suggestion:
- get rid of *all* the locking in aio_read/write and the splice paths
- get rid of all the stupid mmap games
- instead, add them to the functions that actually use
"blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.
That's a fairly limited number of functions:
blkdev_{read,write}page(), blkdev_direct_IO() and
blkdev_write_{begin,end}()
Doesn't that sound simpler? And more logical: it protects the actual
places that use the block size of the device.
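
A minimal sketch of what that would look like for one of them (the actual
patch is not quoted in this thread; block_read_full_page() and
blkdev_get_block() are the helpers blkdev_readpage() already uses):

static int blkdev_readpage(struct file *file, struct page *page)
{
	struct block_device *bdev = I_BDEV(page->mapping->host);
	int ret;

	percpu_down_read(&bdev->bd_block_size_semaphore);
	ret = block_read_full_page(page, blkdev_get_block);
	percpu_up_read(&bdev->bd_block_size_semaphore);

	return ret;
}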
I dunno. Maybe there is some fundamental reason why the above is
broken, but it seems to be a much simpler approach. Sure, you need to
guarantee that the people who get the write-lock cannot possibly cause
IO while holding it, but since the only reason to get the write lock
would be to change the block size, that should be pretty simple, no?
Yeah, yeah, I'm probably missing something fundamental, but the above
sounds like the simple approach to fixing things. Aiming for having
the block size read-lock be taken by the things that pass in the
block-size itself.
It would be nice for things to be logical and straightforward.
Linus
On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds
<[email protected]> wrote:
>
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?
Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
do unspeakable things to your family and pets.
Linus
On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
<[email protected]> wrote:
>
> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> do unspeakable things to your family and pets.
Btw, *if* this approach works, I suspect we could just switch the
bd_block_size_semaphore semaphore to be a regular rw-sem.
Why? Because now it's no longer ever gotten in the cached IO paths, we
only get it when we're doing much more expensive things (ie actual IO,
and buffer head allocations etc etc). As long as we just work with the
page cache, we never get to the whole lock at all.
Which means that the whole percpu-optimized thing is likely no longer
all that relevant.
But that's an independent thing, and it's only true *if* my patch
works. It looks fine on paper, but maybe there's something
fundamentally broken about it.
One big change my patch does is to move the sync_bdev/kill_bdev to
*after* changing the block size. It does that so that it can guarantee
that any old data (which didn't see the new block size) will be
sync'ed even if there is new IO coming in as we change the block size.
The old code locked the whole sync() region, which doesn't work with
my approach, since the sync will do IO and would thus cause potential
deadlocks while holding the rwsem for writing.
So with this patch, as the block size changes, you can actually have
some old pages with the old block size *and* some different new pages
with the new block size all at the same time. It should all be
perfectly fine, but it's worth pointing out.
(It probably won't trigger in practice, though, since doing IO while
somebody else is changing the blocksize is fundamentally an odd thing
to do, but whatever. I also suspect that we *should* perhaps use the
inode->i_sem thing to serialize concurrent block size changes, but
that's again an independent issue)
Linus
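
A sketch of the ordering described above (the actual patch is not quoted
here): only the size change itself sits under the write lock, and the sync
plus invalidation happen after it.

	percpu_down_write(&bdev->bd_block_size_semaphore);
	bdev->bd_block_size = size;
	bdev->bd_inode->i_blkbits = blksize_bits(size);
	percpu_up_write(&bdev->bd_block_size_semaphore);

	/* flush pages that still carry the old block size, then drop
	 * their buffers; new IO already sees the new size */
	sync_blockdev(bdev);
	kill_bdev(bdev);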
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > mmap() is in *no* way special. The exact same thing happens for
> > regular read/write. Yet somehow the mmap code is special-cased, while
> > the normal read-write code is not.
>
> I just double-checked, because it's been a long time since I actually
> looked at the code.
>
> But yeah, block device read/write uses the pure page cache functions.
> IOW, it has the *exact* same IO engine as mmap() would have.
>
> So here's my suggestion:
>
> - get rid of *all* the locking in aio_read/write and the splice paths
> - get rid of all the stupid mmap games
>
> - instead, add them to the functions that actually use
> "blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.
>
> That's a fairly limited number of functions:
> blkdev_{read,write}page(), blkdev_direct_IO() and
> blkdev_write_{begin,end}()
>
> Doesn't that sound simpler? And more logical: it protects the actual
> places that use the block size of the device.
>
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?
>
> Yeah, yeah, I'm probably missing something fundamental, but the above
> sounds like the simple approach to fixing things. Aiming for having
> the block size read-lock be taken by the things that pass in the
> block-size itself.
>
> It would be nice for things to be logical and straightforward.
>
> Linus
The problem with this approach is that it is very easy to miss points
where it is assumed that the block size doesn't change - and if you miss a
point, it results in a hidden bug that has little chance of being
found.
For example, __block_write_full_page and __block_write_begin do
if (!page_has_buffers(page)) { create_empty_buffers... }
and then they do
WARN_ON(bh->b_size != blocksize)
err = get_block(inode, block, bh, 1)
... so if the buffers were left over from some previous call to
create_empty_buffers with a different blocksize, that WARN_ON is triggered.
And it's not only a harmless warning - now bh->b_size is left set to the
old block size, but bh->b_blocknr is set to a number, that was calculated
according to the new block size - and when you submit that buffer with
submit_bh, it is written to the wrong place!
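Concretely, the on-disk position submit_bh() computes depends on both
fields, so a b_blocknr in units of the new block size paired with a stale
b_size lands the I/O at the wrong byte offset (the line below is from
memory of 3.7-era fs/buffer.c):

	/* byte offset = b_blocknr * b_size: with b_blocknr counted in new
	 * 4096-byte blocks but b_size still 1024, block 8 goes to byte
	 * 8 * 1024 = 8192 instead of 8 * 4096 = 32768 */
	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);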
Now, prove that there are no more bugs like this.
Locking the whole read/write/mmap operations is crude, but at least it can
be done without thorough review of all the memory management code.
Mikulas
On Wed, 28 Nov 2012, Al Viro wrote:
> On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> > No, this is crap.
> >
> > We don't introduce random hooks like this just because the block layer
> > has shit-for-brains and cannot be bothered to do things right.
> >
> > The fact is, the whole locking in the block layer open routine is
> > total and utter crap. It doesn't lock the right thing, even with your
> > change *anyway* (or with the change Jens had). Absolutely nothing in
> > "mmap_region()" cares at all about the block-size anywhere - it's
> > generic, after all - so locking around it is f*cking pointless. There
> > is no way in hell that the caller of ->mmap can *ever* care about the
> > block size, since it never even looks at it.
> >
> > Don't do random crap like this.
> >
> > Why does the code think that mmap matters so much anyway? As you say,
> > the mmap itself does *nothing*. It has no impact for the block size.
>
> ... and here I was, trying to massage a reply into form that would be
> at least borderline printable... Anyway, this is certainly the wrong
> way to go. We want to catch if the damn thing is mapped and mapping_mapped()
> leaves a race? Fine, so let's not use mapping_mapped() at all. Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it. Incrementing/decrementing a per-block_device atomic counter,
> with increment side done under your rwsem, to make sure that 0->positive
> transition doesn't change in critical section of set_blocksize(). And don't
> use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
> local copy.
This sounds sensible. I'm sending this patch.
Mikulas
---
Do a proper locking for mmap and block size change
For block devices, we must prevent the device from being mapped while
block size is being changed.
This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.
The problem is that the generic_file_mmap method does almost nothing, it
only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient to prevent the race condition. The race condition can
happen for example this way:
process 1:
Starts mapping a block device, it goes to mmap_region, calls
file->f_op->mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.
process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.
process 1:
Continues in mmap_region, successfully finishes the mapping.
process 2:
Continues in set_blocksize, changes the block size while the block
device is mapped. (which is incorrect and may result in a crash or
misbehavior).
To fix this possible race condition, we introduce a counter
bd_mmap_count that counts the number of vmas that map the block device.
bd_mmap_count is incremented in blkdev_mmap and in blkdev_vm_open, it is
decremented in blkdev_vm_close. We don't allow changing block size while
bd_mmap_count is non-zero.
Signed-off-by: Mikulas Patocka <[email protected]>
---
fs/block_dev.c | 49 ++++++++++++++++++++++++++++++++-----------------
include/linux/fs.h | 3 +++
2 files changed, 35 insertions(+), 17 deletions(-)
Index: linux-3.7-rc7/fs/block_dev.c
===================================================================
--- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 21:13:12.000000000 +0100
+++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 21:33:23.000000000 +0100
@@ -35,6 +35,7 @@ struct bdev_inode {
struct inode vfs_inode;
};
+static const struct vm_operations_struct def_blk_vm_ops;
static const struct address_space_operations def_blk_aops;
static inline struct bdev_inode *BDEV_I(struct inode *inode)
@@ -114,16 +115,6 @@ void invalidate_bdev(struct block_device
}
EXPORT_SYMBOL(invalidate_bdev);
-static int is_bdev_mapped(struct block_device *bdev)
-{
- int ret_val;
- struct address_space *mapping = bdev->bd_inode->i_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
- ret_val = mapping_mapped(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
- return ret_val;
-}
-
int set_blocksize(struct block_device *bdev, int size)
{
/* Size must be a power of two, and between 512 and PAGE_SIZE */
@@ -136,16 +127,16 @@ int set_blocksize(struct block_device *b
/*
* If the block size doesn't change, don't take the write lock.
- * We check for is_bdev_mapped anyway, for consistent behavior.
+ * We check for bd_mmap_count anyway, for consistent behavior.
*/
if (size == bdev->bd_block_size)
- return is_bdev_mapped(bdev) ? -EBUSY : 0;
+ return atomic_read(&bdev->bd_mmap_count) ? -EBUSY : 0;
/* Prevent starting I/O or mapping the device */
percpu_down_write(&bdev->bd_block_size_semaphore);
/* Check that the block device is not memory mapped */
- if (is_bdev_mapped(bdev)) {
+ if (atomic_read(&bdev->bd_mmap_count)) {
percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
@@ -491,6 +482,7 @@ static void bdev_i_callback(struct rcu_h
static void bdev_destroy_inode(struct inode *inode)
{
+ BUG_ON(atomic_read(&I_BDEV(inode)->bd_mmap_count));
call_rcu(&inode->i_rcu, bdev_i_callback);
}
@@ -507,6 +499,7 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&bdev->bd_holder_disks);
#endif
inode_init_once(&ei->vfs_inode);
+ atomic_set(&bdev->bd_mmap_count, 0);
/* Initialize mutex for freeze. */
mutex_init(&bdev->bd_fsfreeze_mutex);
}
@@ -1660,16 +1653,28 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
- int ret;
struct block_device *bdev = I_BDEV(file->f_mapping->host);
+ file_accessed(file);
+ vma->vm_ops = &def_blk_vm_ops;
+
percpu_down_read(&bdev->bd_block_size_semaphore);
+ atomic_inc(&bdev->bd_mmap_count);
+ percpu_up_read(&bdev->bd_block_size_semaphore);
- ret = generic_file_mmap(file, vma);
+ return 0;
+}
- percpu_up_read(&bdev->bd_block_size_semaphore);
+static void blkdev_vm_open(struct vm_area_struct *area)
+{
+ struct block_device *bdev = I_BDEV(area->vm_file->f_mapping->host);
+ atomic_inc(&bdev->bd_mmap_count);
+}
- return ret;
+static void blkdev_vm_close(struct vm_area_struct *area)
+{
+ struct block_device *bdev = I_BDEV(area->vm_file->f_mapping->host);
+ atomic_dec(&bdev->bd_mmap_count);
}
static ssize_t blkdev_splice_read(struct file *file, loff_t *ppos,
@@ -1719,6 +1724,16 @@ static int blkdev_releasepage(struct pag
return try_to_free_buffers(page);
}
+static const struct vm_operations_struct def_blk_vm_ops = {
+ .open = blkdev_vm_open,
+ .close = blkdev_vm_close,
+#ifdef CONFIG_MMU
+ .fault = filemap_fault,
+ .page_mkwrite = filemap_page_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+#endif
+};
+
static const struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
Index: linux-3.7-rc7/include/linux/fs.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/fs.h 2012-11-28 21:15:42.000000000 +0100
+++ linux-3.7-rc7/include/linux/fs.h 2012-11-28 21:17:59.000000000 +0100
@@ -458,6 +458,9 @@ struct block_device {
*/
unsigned long bd_private;
+ /* The number of vmas */
+ atomic_t bd_mmap_count;
+
/* The counter of freeze processes */
int bd_fsfreeze_count;
/* Mutex for freeze */
On Wed, 28 Nov 2012, Jeff Chua wrote:
> On Wed, Nov 28, 2012 at 12:01 PM, Mikulas Patocka <[email protected]> wrote:
> > block_dev: don't take the write lock if block size doesn't change
> >
> > Taking the write lock has a big performance impact on the whole system
> > (because of synchronize_sched_expedited). This patch avoids taking the
> > write lock if the block size doesn't change (i.e. when mounting
> > filesystem with block size equal to the default block size).
> >
> > The logic to test if the block device is mapped was moved to a separate
> > function is_bdev_mapped to avoid code duplication.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> >
> > ---
> > fs/block_dev.c | 25 ++++++++++++++++++-------
> > 1 file changed, 18 insertions(+), 7 deletions(-)
> >
> > Index: linux-3.7-rc7/fs/block_dev.c
> > ===================================================================
> > --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.000000000 +0100
> > +++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 04:13:53.000000000 +0100
> > @@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device
> > }
> > EXPORT_SYMBOL(invalidate_bdev);
> >
> > -int set_blocksize(struct block_device *bdev, int size)
> > +static int is_bdev_mapped(struct block_device *bdev)
> > {
> > - struct address_space *mapping;
> > + int ret_val;
> > + struct address_space *mapping = bdev->bd_inode->i_mapping;
> > + mutex_lock(&mapping->i_mmap_mutex);
> > + ret_val = mapping_mapped(mapping);
> > + mutex_unlock(&mapping->i_mmap_mutex);
> > + return ret_val;
> > +}
> >
> > +int set_blocksize(struct block_device *bdev, int size)
> > +{
> > /* Size must be a power of two, and between 512 and PAGE_SIZE */
> > if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> > return -EINVAL;
> > @@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b
> > if (size < bdev_logical_block_size(bdev))
> > return -EINVAL;
> >
> > + /*
> > + * If the block size doesn't change, don't take the write lock.
> > + * We check for is_bdev_mapped anyway, for consistent behavior.
> > + */
> > + if (size == bdev->bd_block_size)
> > + return is_bdev_mapped(bdev) ? -EBUSY : 0;
> > +
> > /* Prevent starting I/O or mapping the device */
> > percpu_down_write(&bdev->bd_block_size_semaphore);
> >
> > /* Check that the block device is not memory mapped */
> > - mapping = bdev->bd_inode->i_mapping;
> > - mutex_lock(&mapping->i_mmap_mutex);
> > - if (mapping_mapped(mapping)) {
> > - mutex_unlock(&mapping->i_mmap_mutex);
> > + if (is_bdev_mapped(bdev)) {
> > percpu_up_write(&bdev->bd_block_size_semaphore);
> > return -EBUSY;
> > }
> > - mutex_unlock(&mapping->i_mmap_mutex);
> >
> > /* Don't change the size if it is same as current */
> > if (bdev->bd_block_size != size) {
>
>
> This patch didn't really make any difference for ext2/3/4 but for
> reiserfs it does.
>
> With the synchronize_sched_expedited() patch applied, it didn't make
> any difference.
>
> Thanks,
> Jeff
It could make a difference for computers with many cores -
synchronize_sched_expedited() is expensive there because it synchronizes
all active cores, so if we can avoid it, just do it.
Mikulas
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> > do unspeakable things to your family and pets.
>
> Btw, *if* this approach works, I suspect we could just switch the
> bd_block_size_semaphore semaphore to be a regular rw-sem.
>
> Why? Because now it's no longer ever gotten in the cached IO paths, we
> only get it when we're doing much more expensive things (ie actual IO,
> and buffer head allocations etc etc). As long as we just work with the
> page cache, we never get to the whole lock at all.
>
> Which means that the whole percpu-optimized thing is likely no longer
> all that relevant.
Using a normal rw-semaphore degrades direct-IO performance on raw devices; I
measured that (45.1s with normal rw-semaphore vs. 42.8s with
percpu-rw-semaphore). It could be measured on ramdisk or high-performance
SSDs.
Mikulas
On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka <[email protected]> wrote:
>
> The problem with this approach is that it is very easy to miss points
> where it is assumed that the block size doesn't change - and if you miss a
> point, it results in a hidden bug that has little chance of being
> found.
Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw
stones in glass houses, shall we?
The whole reason for this long thread (and several threads before it)
is that your model isn't working and is causing problems. I already
pointed out how bogus your arguments about mmap() locking were, and
then you have the gall to talk about potential bugs, when I have
pointed you to *actual* bugs, and actual mistakes.
> For example, __block_write_full_page and __block_write_begin do
> if (!page_has_buffers(page)) { create_empty_buffers... }
> and then they do
> WARN_ON(bh->b_size != blocksize)
> err = get_block(inode, block, bh, 1)
Right. And none of this is new.
> ... so if the buffers were left over from some previous call to
> create_empty_buffers with a different blocksize, that WARN_ON is trigged.
None of this can happen.
> Locking the whole read/write/mmap operations is crude, but at least it can
> be done without thorough review of all the memory management code.
Umm. Which you clearly didn't do, and raised totally crap arguments for.
In contrast, I have a very simple argument for the correctness of my
patch: every single user of the "get_block[s]()" interface now takes
the lock for as long as get_block[s]() is passed off to somebody else.
And since get_block[s]() is the only way to create those empty
buffers, I think I pretty much proved exactly what you ask for.
And THAT is the whole point and advantage of making locking sane. Sane
locking you can actually *think* about!
In contrast, locking around "mmap()" is absolutely *guaranteed* to be
insane, because mmap() doesn't actually do any of the IO that the lock
is supposed to protect against!
So Mikulas, quite frankly, your arguments argue against you. When you
say "Locking the whole read/write/mmap operations is crude, but at
least it can
be done without thorough", you are doubly correct: it *is* crude, and
it clearly *was* done without thought, since it's a f*cking idiotic
AND INCORRECT thing to do.
Seriously. Locking around "mmap()" is insane. It leads to insane
semantics (the whole EBUSY thing is purely because of that problem)
and it leads to bad code (your "let's add a new "mmap_region" hook is
just disgusting, and while Al's idea of doing it in the existing
"->open" method is at least not nasty, it's definitely extra code and
complexity).
There are serious *CORRECTNESS* advantages to simplicity and
directness. And locking at the right point is definitely very much
part of that.
Anyway, as far as block size goes, we have exactly two cases:
- random IO that does not care about the block size, and will just do
whatever the current block size is (ie normal anonymous accesses to
the block device). This is the case that needs the locking - but it
only needs it around the individual page operations, ie exactly where
I put it. In fact, they can happily deal with different block sizes
for different pages, they don't really care.
- mounted filesystems etc that require a particular block size and
set it at mount time, and they have exclusivity rules
The second case is the case that actually calls set_blocksize(), and
if "kill_bdev()" doesn't get rid of the old blocksizes, then they have
always been in trouble, and would always _continue_ to be in trouble,
regardless of locking.
Linus
On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds
<[email protected]> wrote:
>
>> For example, __block_write_full_page and __block_write_begin do
>> if (!page_has_buffers(page)) { create_empty_buffers... }
>> and then they do
>> WARN_ON(bh->b_size != blocksize)
>> err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.
.. which, btw, is not to say that *other* things aren't new. They are.
The change to actually change the block device buffer size before then
calling "sync_bdev()" is definitely a real change, and as mentioned, I
have not tested the patch in any way. If any block device driver were
to actually compare the IO size they get against the bdev->block_size
thing, they'd see very different behavior (ie they'd see the new block
size as they are asked to write out the old blocks with the old block
size).
So it does change semantics, no question about that. I don't think any
block device does it, though.
A bigger issue is for things that emulate what blkdev.c does, and
don't do the locking. I see code in md/bitmap.c that seems a bit
suspicious, for example. That said, it's not *new* breakage, and the
"lock at mmap/read/write() time" approach doesn't fix it either (since
the mapping will be different for the underlying MD device). So I do
think that we should take a look at all the users of
"alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
do to protect the block-size, but I think that's an independent issue
from the raw device access case in fs/block_dev.c..
I guess I have to actually test my patch. I don't have very
interesting test-cases, though.
Linus
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> > For example, __block_write_full_page and __block_write_begin do
> > if (!page_has_buffers(page)) { create_empty_buffers... }
> > and then they do
> > WARN_ON(bh->b_size != blocksize)
> > err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.
>
> > ... so if the buffers were left over from some previous call to
> > create_empty_buffers with a different blocksize, that WARN_ON is triggered.
>
> None of this can happen.
It can happen. Take your patch (the one that moves bd_block_size_semaphore
into blkdev_readpage, blkdev_writepage and blkdev_write_begin).
Insert msleep(1000) into set_blocksize, just before sync_blockdev.
Run this program:
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
static char array[4096];
int main(void)
{
	int h;

	/* start with a 1-sector (512 B) zero device */
	system("dmsetup remove test 2>/dev/null");
	if (system("dmsetup create test --table '0 1 zero'")) exit(1);
	h = open("/dev/mapper/test", O_RDWR);
	if (h < 0) perror("open"), exit(1);
	/* read through the page cache so buffers with the current (small)
	 * block size exist */
	if (pread(h, array, 512, 0) != 512) perror("pread"), exit(1);
	/* grow the device to 8 sectors (4 KiB) so the 4096-byte pwrite
	 * below can succeed */
	if (system("dmsetup load test --table '0 8 zero'")) exit(1);
	if (system("dmsetup suspend test")) exit(1);
	if (system("dmsetup resume test")) exit(1);
	/* change the block size in the background; the msleep(1000) added
	 * to set_blocksize parks it just before sync_blockdev */
	if (system("blockdev --setbsz 2048 /dev/mapper/test &")) exit(1);
	usleep(500000);
	/* write while the block size change is in flight, hitting the
	 * stale buffers */
	if (pwrite(h, array, 4096, 0) != 4096) perror("pwrite"), exit(1);
	return 0;
}
--- it triggers WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.300000] Backtrace:
[ 1243.330000] [<0000000040230ba8>] block_write_begin+0x70/0xd0
[ 1243.400000] [<00000000402350cc>] blkdev_write_begin+0xb4/0x208
[ 1243.480000] [<00000000401a9f10>] generic_file_buffered_write+0x248/0x348
[ 1243.570000] [<00000000401ac8c4>] __generic_file_aio_write+0x1fc/0x388
[ 1243.660000] [<0000000040235e74>] blkdev_aio_write+0x64/0xf0
[ 1243.740000] [<00000000401f2108>] do_sync_write+0xd0/0x128
[ 1243.810000] [<00000000401f2930>] vfs_write+0xa0/0x180
[ 1243.880000] [<00000000401f2ecc>] sys_pwrite64+0xb4/0xd8
[ 1243.950000] [<0000000040122104>] parisc_pwrite64+0x1c/0x28
[ 1244.020000] [<0000000040106060>] syscall_exit+0x0/0x14
I'm not saying that your approach is wrong, you just have to carefuly
review all memory management code for assumptions that block size
doesn't change.
Mikulas
[ Sorry, I was offline for a while driving kids around ]
On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka <[email protected]> wrote:
>
> It can happen. Take your patch (the one that moves bd_block_size_semaphore
> into blkdev_readpage, blkdev_writepage and blkdev_write_begin).
Interesting. The code *has* the block size (it's in "bh->b_size"), but
it actually then uses the inode blocksize instead, and verifies the
two against each other. It could just have used the block size
directly (and then used the inode i_blkbits only when no buffers
existed), avoiding that dependency entirely..
It actually does the same thing (with the same verification) in
__block_write_full_page() and (_without_ the verification) in
__block_commit_write().
Ho humm. All of those places actually do hold the rwsem for reading,
it's just that I don't want to hold it for writing over the sync..
Need to think on this,
Linus
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> A bigger issue is for things that emulate what blkdev.c does, and
> don't do the locking. I see code in md/bitmap.c that seems a bit
> suspicious, for example. That said, it's not *new* breakage, and the
> "lock at mmap/read/write() time" approach doesn't fix it either (since
> the mapping will be different for the underlying MD device). So I do
> think that we should take a look at all the users of
> "alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
> do to protect the block-size, but I think that's an independent issue
> from the raw device access case in fs/block_dev.c..
>
> I guess I have to actually test my patch. I don't have very
> interesting test-cases, though.
>
> Linus
Yes, the md code looks suspicious. It disables write access in
deny_bitmap_write_access (that function looks buggy on its own - what
happens if i_writecount == 0 or if it is already negative on entry?).
So we could disallow changing block size if i_writecount < 0, that should
do the trick.
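
A one-line sketch of that check (illustration only, not a posted patch):

	/* in set_blocksize(): refuse while e.g. md has denied writes by
	 * making i_writecount negative */
	if (atomic_read(&bdev->bd_inode->i_writecount) < 0)
		return -EBUSY;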
Mikulas
On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds
<[email protected]> wrote:
>
> Interesting. The code *has* the block size (it's in "bh->b_size"), but
> it actually then uses the inode blocksize instead, and verifies the
> two against each other. It could just have used the block size
> directly (and then used the inode i_blkbits only when no buffers
> existed), avoiding that dependency entirely..
Looking more at this code, that really would be the nicest solution.
There's two cases for the whole get_block() thing:
- filesystems. The block size will not change randomly, and
"get_block()" seriously depends on the block size.
- the raw device. The block size *will* change, but to simplify the
problem, "get_block()" is a 1:1 mapping, so it doesn't even care about
the block size because it will always return "bh->b_blocknr = nr".
So we *could* just say that all the fs/buffer.c code should use
"inode->i_blkbits" for creating buffers (because that's the size new
buffers should always use), but use "bh->b_size" for any *existing*
buffer use.
And looking at it, it's even simple. Except for one *very* annoying
thing: several users really don't want the size of the buffer, they
really do want the *shift* of the buffer size.
In fact, that single issue seems to be the reason why
"inode->i_blkbits" is really used in fs/buffer.c.
Otherwise it would be fairly trivial to just make the pattern be just a simple
	if (!page_has_buffers(page))
		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
	head = page_buffers(page);
	blocksize = head->b_size;
and just use the blocksize that way, without any other games. All
done, no silly WARN_ON() to verify against some global block-size, and
the fs/buffer.c code would be perfectly simple, and would have no
problem at all with multiple different blocksizes in different pages
(the page lock serializes the buffers and thus the blocksize at the
per-page level).
But the fact that the code wants to do things like
block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
seriously seems to be the main thing that keeps us using
'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
enough to hurt (not everywhere, but on some machines).
Very annoying.
Linus
On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
<[email protected]> wrote:
>
> But the fact that the code wants to do things like
>
> block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>
> seriously seems to be the main thing that keeps us using
> 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> enough to hurt (not everywhere, but on some machines).
>
> Very annoying.
Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
whole ilog2 thing, but at the same time, in other cases it actually
seems to improve code generation (ie gets rid of the whole unnecessary
two dereferences through page->mapping->host just to get the block
size, when we have it in the buffer-head that we have to touch
*anyway*).
Comments? Again, untested.
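
For reference, "the whole ilog2 thing" presumably boils down to deriving the
shift from the buffer size instead of reading inode->i_blkbits, along these
lines (a sketch; the actual patch is not quoted in this thread):

static inline int block_size_bits(unsigned int blocksize)
{
	return ilog2(blocksize);
}

	/* then, in the __block_write_full_page()-style code: */
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);
	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);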
And I notice that Al Viro hasn't been cc'd, which is sad, since he's
been involved in much of fs/block_dev.c.
Al - this is an independent patch to fs/buffer.c to make
fs/block_dev.c able to change the block size of a block device while
there is IO in progress that may still use the old block size. The
discussion has been on fsdevel and lkml, but you may have missed it...
Linus
On Wed, Nov 28, 2012 at 10:16:21PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
>
> Comments? Again, untested.
>
> And I notice that Al Viro hasn't been cc'd, which is sad, since he's
> been involved in much of fs/block_dev.c.
>
> Al - this is an independent patch to fs/buffer.c to make
> fs/block_dev.c able to change the block size of a block device while
> there is IO in progress that may still use the old block size. The
> discussion has been on fsdevel and lkml, but you may have missed it...
Umm... set_blocksize() is calling kill_bdev(), which does
truncate_inode_pages(mapping, 0). What's going to happen to data in
the dirty pages? IO in progress is not the only thing to worry about...
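
kill_bdev() at this point is roughly the following (sketch from memory; the
truncate_inode_pages() call is the one being pointed at), so dirty pages are
simply thrown away:

void kill_bdev(struct block_device *bdev)
{
	struct address_space *mapping = bdev->bd_inode->i_mapping;

	if (mapping->nrpages == 0)
		return;

	invalidate_bh_lrus();
	truncate_inode_pages(mapping, 0);	/* dirty pages included */
}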
On Thu, Nov 29, 2012 at 06:25:19AM +0000, Al Viro wrote:
> Umm... set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0). What's going to happen to data in
> the dirty pages? IO in progress is not the only thing to worry about...
Note that sync_blockdev() a few lines prior to that is good only if we
have no other processes doing write(2) (or dirtying the mmapped pages,
for that matter). The window isn't too wide, but...
On Wed, Nov 28, 2012 at 10:25 PM, Al Viro <[email protected]> wrote:
>
> Umm... set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0). What's going to happen to data in
> the dirty pages? IO in progress is not the only thing to worry about...
Hmm. Yes. I think it works by virtue of "if you change the blocksize
while there is active IO, you're insane and you deserve whatever you
get".
It shouldn't even be fundamentally hard to make it work, although I
suspect it would be more code than it would be worth. The sane model
would be to not use truncate_inode_pages(), but instead just walk the
pages and get rid of the buffer heads with the wrong size. Preferably
*combining* that with the sync_blockdev().
We have no real reason to even invalidate the page cache, it's just
the buffers we want to get rid of.
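
A sketch of that page walk (illustration only, assuming the pages were
already written back via sync_blockdev(); try_to_free_buffers() refuses to
free dirty or busy buffers anyway):

static void drop_stale_buffers(struct block_device *bdev, unsigned int new_size)
{
	struct address_space *mapping = bdev->bd_inode->i_mapping;
	struct pagevec pvec;
	pgoff_t index = 0;
	unsigned int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = page->index + 1;
			lock_page(page);
			if (page_has_buffers(page) &&
			    page_buffers(page)->b_size != new_size)
				try_to_free_buffers(page);
			unlock_page(page);
		}
		pagevec_release(&pvec);
	}
}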
But I suspect it's true that none of that is really *worth* it,
considering that nobody likely wants to do any concurrent IO. We don't
want to crash, or corrupt the data structures, but I suspect "you get
what you deserve" might actually be the right model ;)
So the current "sync_blockdev()+kill_bdev()" takes care of the *sane*
case (we flush any data that happened *before* the block size change),
and any concurrent writes with block-size changes are "good luck with
that".
Linus
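For reference, the "sane case" ordering being described corresponds roughly
to the sketch below (simplified, not the exact fs/block_dev.c code, and it
deliberately ignores concurrent writers):

/* Simplified sketch of the ordering: flush data written at the old block
 * size, install the new size, then drop the stale pages and buffers.
 * Not the exact fs/block_dev.c code. */
static int set_blocksize_sketch(struct block_device *bdev, int size)
{
	if (bdev->bd_block_size != size) {
		sync_blockdev(bdev);		/* write back old-size data */
		bdev->bd_block_size = size;
		bdev->bd_inode->i_blkbits = blksize_bits(size);
		kill_bdev(bdev);		/* drop pages and old-size buffers */
	}
	return 0;
}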
On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <[email protected]> wrote:
>
> Note that sync_blockdev() a few lines prior to that is good only if we
> have no other processes doing write(2) (or dirtying the mmapped pages,
> for that matter). The window isn't too wide, but...
So with Mikulas' patches, the write actually would block (at write
level) due to the locking. The mmap'ed pages may be around and
flushed, but the logic to not allow currently *active* mmaps (with the
rather nasty random -EBUSY return value) should mean that there is no
race.
Or rather, there's a race, but it results in that EBUSY thing.
With my simplified locking, the sync_blockdev() is right before the
kill_bdev() (not a few lines prior), and in a perfect world they'd
actually be one single operation ("write back and invalidate pages
with the wrong block-size"). But they aren't.
Linus
On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <[email protected]> wrote:
> >
> > Note that sync_blockdev() a few lines prior to that is good only if we
> > have no other processes doing write(2) (or dirtying the mmapped pages,
> > for that matter). The window isn't too wide, but...
>
> So with Mikulas' patches, the write actually would block (at write
> level) due to the locking. The mmap'ed patches may be around and
> flushed, but the logic to not allow currently *active* mmaps (with the
> rather nasty random -EBUSY return value) should mean that there is no
> race.
>
> Or rather, there's a race, but it results in that EBUSY thing.
Same as with fs mounted on it, or the sucker having been claimed for
RAID array, etc. Frankly, I'm more than slightly tempted to make
bdev mmap() just claim the sodding device exclusive for as long as
it's mmapped...
In principle, I agree, but... I still have nightmares from mmap/truncate
races way back. You are stepping into what used to be a really nasty
minefield. I'll look into that, but it's *definitely* not -rc8 fodder.
On Thu, Nov 29, 2012 at 2:45 PM, Al Viro <[email protected]> wrote:
> On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
>> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <[email protected]> wrote:
>> >
>> > Note that sync_blockdev() a few lines prior to that is good only if we
>> > have no other processes doing write(2) (or dirtying the mmapped pages,
>> > for that matter). The window isn't too wide, but...
>>
>> So with Mikulas' patches, the write actually would block (at write
>> level) due to the locking. The mmap'ed patches may be around and
>> flushed, but the logic to not allow currently *active* mmaps (with the
>> rather nasty random -EBUSY return value) should mean that there is no
>> race.
>>
>> Or rather, there's a race, but it results in that EBUSY thing.
>
> Same as with fs mounted on it, or the sucker having been claimed for
> RAID array, etc. Frankly, I'm more than slightly tempted to make
> bdev mmap() just claim the sodding device exclusive for as long as
> it's mmapped...
>
> In principle, I agree, but... I still have nightmares from mmap/truncate
> races way back. You are stepping into what used to be a really nasty
> minefield. I'll look into that, but it's *definitely* not -rc8 fodder.
Just let me know which relevant patch(es) you want me to test or break.
Thanks,
Jeff
On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
>
> Comments? Again, untested.
Jumping in based on Linus original patch, which is doing something like
this:
set_blocksize() {
block new calls to writepage, prepare/commit_write
set the block size
unblock
< --- can race in here and find bad buffers --->
sync_blockdev()
kill_bdev()
< --- now we're safe --- >
}
We could add a second semaphore and a page_mkwrite call:
set_blocksize() {
block new calls to prepare/commit_write and page_mkwrite(), but
leave writepage unblocked.
sync_blockdev()
<--- now we're safe. There are no dirty pages and no ways to
make new ones --->
block new calls to readpage (writepage too for good luck?)
kill_bdev()
set the block size
unblock readpage/writepage
unblock prepare/commit_write and page_mkwrite
}
Another way to look at things:
As Linus said in a different email, we don't need to drop the pages, just
the buffers. Once we've blocked prepare/commit_write,
there is no way to make a partially up to date page with dirty data.
We may make fully uptodate dirty pages, but for those we can
just create dirty buffers for the whole page.
As long as we had prepare/commit write blocked while we ran
sync_blockdev, we can blindly detach any buffers that are the wrong size
and just make new ones.
This may or may not apply to loop.c, I'd have to read that more
carefully.
-chris
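A rough sketch of that "drop only the wrong-size buffers" idea follows; the
helper name is made up, and it assumes sync_blockdev() has already run with
prepare/commit_write blocked, so no dirty buffers remain:

/* Hypothetical helper: drop a locked page's buffers if they were created
 * at a different block size, leaving the page itself (and its data) in
 * the page cache so new buffers can be attached at the new size. */
static void drop_wrong_size_buffers(struct page *page, unsigned int blocksize)
{
	struct buffer_head *head;

	if (!page_has_buffers(page))
		return;

	head = page_buffers(page);
	if (head->b_size != blocksize)
		try_to_free_buffers(page);	/* fails harmlessly if busy/dirty */
}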
On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka <[email protected]> wrote:
>
> This sounds sensible. I'm sending this patch.
This looks much better.
I think I'll apply this for 3.7 (since it's too late to do anything
fancier), and then for 3.8 I will rip out all the locking entirely,
because looking at the fs/buffer.c patch I wrote up, it's all totally
unnecessary.
Adding an ACCESS_ONCE() to the read of the i_blkbits value (when
creating new buffers) simply makes the whole locking thing pointless.
Just make the page lock protect the block size, and make it per-page,
and we're done.
No RCU grace period crap, no expedited mess, no nothing.
Linus
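In fs/buffer.c terms that amounts to something like the helper below (a
sketch in the spirit of the patch being discussed, not necessarily its exact
final form):

/* Sketch: create a page's buffers using a single ACCESS_ONCE() read of
 * i_blkbits, so a racing set_blocksize() can at worst hand us a stale
 * but self-consistent size; the page lock makes it per-page. */
static struct buffer_head *page_buffers_sketch(struct page *page,
					       struct inode *inode,
					       unsigned long b_state)
{
	BUG_ON(!PageLocked(page));

	if (!page_has_buffers(page))
		create_empty_buffers(page,
				     1 << ACCESS_ONCE(inode->i_blkbits),
				     b_state);
	return page_buffers(page);
}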
On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote:
> On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> > On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > But the fact that the code wants to do things like
> > >
> > > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> > >
> > > seriously seems to be the main thing that keeps us using
> > > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > > enough to hurt (not everywhere, but on some machines).
> > >
> > > Very annoying.
> >
> > Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> > whole ilog2 thing, but at the same time, in other cases it actually
> > seems to improve code generation (ie gets rid of the whole unnecessary
> > two dereferences through page->mapping->host just to get the block
> > size, when we have it in the buffer-head that we have to touch
> > *anyway*).
> >
> > Comments? Again, untested.
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
> block new calls to writepage, prepare/commit_write
> set the block size
> unblock
>
> < --- can race in here and find bad buffers --->
>
> sync_blockdev()
> kill_bdev()
>
> < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:
>
> set_blocksize() {
>
> block new calls to prepare/commit_write and page_mkwrite(), but
> leave writepage unblocked.
>
> sync_blockev()
>
> <--- now we're safe. There are no dirty pages and no ways to
> make new ones --->
>
> block new calls to readpage (writepage too for good luck?)
>
> kill_bdev()
Whoops, kill_bdev needs the page lock, which sends us into ABBA when
readpage does the down_read. So, slight modification: unblock
readpage/writepage before the kill_bdev. The risk is that readpage can
then find buffers with the wrong size, so it would need to be changed
to discard them.
The patch below is based on Linus' original and doesn't deal with the
readpage race. But it does get the rest of the idea across. It boots
and survives banging on blockdev --setbsz with mkfs, but I definitely
wouldn't trust it.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);
int set_blocksize(struct block_device *bdev, int size)
{
- struct address_space *mapping;
-
/* Size must be a power of two, and between 512 and PAGE_SIZE */
if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
if (size < bdev_logical_block_size(bdev))
return -EINVAL;
- /* Prevent starting I/O or mapping the device */
- percpu_down_write(&bdev->bd_block_size_semaphore);
-
- /* Check that the block device is not memory mapped */
- mapping = bdev->bd_inode->i_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
- if (mapping_mapped(mapping)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- percpu_up_write(&bdev->bd_block_size_semaphore);
- return -EBUSY;
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
+ /* block all modifications via writing and page_mkwrite */
+ percpu_down_write(&bdev->bd_block_size_semaphore);
+
+ /* write everything that was dirty */
sync_blockdev(bdev);
+
+ /* block readpage and writepage */
+ percpu_down_write(&bdev->bd_page_semaphore);
+
bdev->bd_block_size = size;
bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+ /* we can't call kill_bdev with the page_semaphore down
+ * because we'll deadlock against readpage.
+ * The block_size_semaphore should prevent any new
+ * pages from being dirty, but readpage can jump
+ * in once we up the bd_page_sem and find a
+ * page with buffers from the old size.
+ *
+ * The kill_bdev call below is going to get rid
+ * of those buffers, but we do have a race here.
+ * readpage needs to deal with it and verify
+ * any buffers on the page are the right size
+ */
+ percpu_up_write(&bdev->bd_page_semaphore);
+
+ /* drop all the pages and all the buffers */
kill_bdev(bdev);
- }
- percpu_up_write(&bdev->bd_block_size_semaphore);
+ /* open the gates and let everyone back in */
+ percpu_up_write(&bdev->bd_block_size_semaphore);
+ }
return 0;
}
@@ -233,9 +243,14 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
+ struct block_device *bdev = I_BDEV(inode);
+ ssize_t ret;
- return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
+ percpu_down_read(&bdev->bd_block_size_semaphore);
+ ret = __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
nr_segs, blkdev_get_blocks, NULL, NULL, 0);
+ percpu_up_read(&bdev->bd_block_size_semaphore);
+ return ret;
}
int __sync_blockdev(struct block_device *bdev, int wait)
@@ -358,20 +373,48 @@ EXPORT_SYMBOL(thaw_bdev);
static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
{
- return block_write_full_page(page, blkdev_get_block, wbc);
+ int ret;
+ struct inode *inode = page_mapping(page)->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ percpu_down_read(&bdev->bd_page_semaphore);
+ ret = block_write_full_page(page, blkdev_get_block, wbc);
+ percpu_up_read(&bdev->bd_page_semaphore);
+
+ return ret;
}
static int blkdev_readpage(struct file * file, struct page * page)
{
- return block_read_full_page(page, blkdev_get_block);
+ int ret;
+ struct inode *inode = page_mapping(page)->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ percpu_down_read(&bdev->bd_page_semaphore);
+ ret = block_read_full_page(page, blkdev_get_block);
+ percpu_up_read(&bdev->bd_page_semaphore);
+
+ return ret;
}
static int blkdev_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
- return block_write_begin(mapping, pos, len, flags, pagep,
+ int ret;
+ struct inode *inode = mapping->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ percpu_down_read(&bdev->bd_block_size_semaphore);
+ ret = block_write_begin(mapping, pos, len, flags, pagep,
blkdev_get_block);
+ /*
+ * If we had an error, we need to release the size
+ * semaphore, because write_end() will never be called.
+ */
+ if (ret)
+ percpu_up_read(&bdev->bd_block_size_semaphore);
+ return ret;
}
static int blkdev_write_end(struct file *file, struct address_space *mapping,
@@ -379,8 +422,11 @@ static int blkdev_write_end(struct file *file, struct address_space *mapping,
struct page *page, void *fsdata)
{
int ret;
- ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ struct inode *inode = mapping->host;
+ struct block_device *bdev = I_BDEV(inode);
+ ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ percpu_up_read(&bdev->bd_block_size_semaphore);
unlock_page(page);
page_cache_release(page);
@@ -464,6 +510,11 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
kmem_cache_free(bdev_cachep, ei);
return NULL;
}
+ if (unlikely(percpu_init_rwsem(&ei->bdev.bd_page_semaphore))) {
+ percpu_free_rwsem(&ei->bdev.bd_block_size_semaphore);
+ kmem_cache_free(bdev_cachep, ei);
+ return NULL;
+ }
return &ei->vfs_inode;
}
@@ -474,6 +525,7 @@ static void bdev_i_callback(struct rcu_head *head)
struct bdev_inode *bdi = BDEV_I(inode);
percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+ percpu_free_rwsem(&bdi->bdev.bd_page_semaphore);
kmem_cache_free(bdev_cachep, bdi);
}
@@ -1596,16 +1648,7 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
- ssize_t ret;
- struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
-
- ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
-
- return ret;
+ return generic_file_aio_read(iocb, iov, nr_segs, pos);
}
EXPORT_SYMBOL_GPL(blkdev_aio_read);
@@ -1620,7 +1663,6 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
struct file *file = iocb->ki_filp;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
struct blk_plug plug;
ssize_t ret;
@@ -1628,8 +1670,6 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
blk_start_plug(&plug);
- percpu_down_read(&bdev->bd_block_size_semaphore);
-
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
if (ret > 0 || ret == -EIOCBQUEUED) {
ssize_t err;
@@ -1639,61 +1679,42 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
ret = err;
}
- percpu_up_read(&bdev->bd_block_size_semaphore);
-
blk_finish_plug(&plug);
return ret;
}
EXPORT_SYMBOL_GPL(blkdev_aio_write);
-static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+int blkdev_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ struct block_device *bdev = I_BDEV(inode);
int ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
percpu_down_read(&bdev->bd_block_size_semaphore);
-
- ret = generic_file_mmap(file, vma);
-
+ ret = filemap_page_mkwrite(vma, vmf);
percpu_up_read(&bdev->bd_block_size_semaphore);
-
return ret;
}
-static ssize_t blkdev_splice_read(struct file *file, loff_t *ppos,
- struct pipe_inode_info *pipe, size_t len,
- unsigned int flags)
-{
- ssize_t ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
-
- ret = generic_file_splice_read(file, ppos, pipe, len, flags);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
-
- return ret;
-}
+static const struct vm_operations_struct blk_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = blkdev_page_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
-static ssize_t blkdev_splice_write(struct pipe_inode_info *pipe,
- struct file *file, loff_t *ppos, size_t len,
- unsigned int flags)
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
- ssize_t ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
+ struct address_space *mapping = file->f_mapping;
- ret = generic_file_splice_write(pipe, file, ppos, len, flags);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
- return ret;
+ file_accessed(file);
+ vma->vm_ops = &blk_vm_ops;
+ return 0;
}
-
/*
* Try to release a page associated with block device when the system
* is under memory pressure.
@@ -1708,6 +1729,8 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
return try_to_free_buffers(page);
}
+
+
static const struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
@@ -1732,8 +1755,8 @@ const struct file_operations def_blk_fops = {
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_blkdev_ioctl,
#endif
- .splice_read = blkdev_splice_read,
- .splice_write = blkdev_splice_write,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
};
int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..4f0043d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -463,6 +463,7 @@ struct block_device {
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
/* A semaphore that prevents I/O while block size is being changed */
+ struct percpu_rw_semaphore bd_page_semaphore;
struct percpu_rw_semaphore bd_block_size_semaphore;
};
On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason <[email protected]> wrote:
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
> block new calls to writepage, prepare/commit_write
> set the block size
> unblock
>
> < --- can race in here and find bad buffers --->
>
> sync_blockdev()
> kill_bdev()
>
> < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:
Yeah, we could be fancy, but the more I think about it, the less I can
say I care.
After all, the only things that do the whole set_blocksize() thing should be:
- filesystems at mount-time
- things like loop/md at block device init time.
and quite frankly, if there are any *concurrent* writes with either of
the above, I really *really* don't think we should care. I mean,
seriously.
So the _only_ real reason for the locking in the first place is to
make sure of internal kernel consistency. We do not want to oops or
corrupt memory if people do odd things. But we really *really* don't
care if somebody writes to a partition at the same time as somebody
else mounts it. Not enough to do extra work to please insane people.
It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
The whole sequence always used to be unlocked. The locking is entirely
new. There is certainly not any legacy users that can possibly rely on
"I did writes at the same time as the mount with no serialization, and
it worked". It never has worked.
So I think this is a case of "perfect is the enemy of good".
Especially since I think that with the fs/buffer.c approach, we don't
actually need any locking at all at higher levels.
Linus
On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason <[email protected]> wrote:
> >
> > Jumping in based on Linus original patch, which is doing something like
> > this:
> >
> > set_blocksize() {
> > block new calls to writepage, prepare/commit_write
> > set the block size
> > unblock
> >
> > < --- can race in here and find bad buffers --->
> >
> > sync_blockdev()
> > kill_bdev()
> >
> > < --- now we're safe --- >
> > }
> >
> > We could add a second semaphore and a page_mkwrite call:
>
> Yeah, we could be fancy, but the more I think about it, the less I can
> say I care.
>
> After all, the only things that do the whole set_blocksize() thing should be:
>
> - filesystems at mount-time
>
> - things like loop/md at block device init time.
>
> and quite frankly, if there are any *concurrent* writes with either of
> the above, I really *really* don't think we should care. I mean,
> seriously.
>
> So the _only_ real reason for the locking in the first place is to
> make sure of internal kernel consistency. We do not want to oops or
> corrupt memory if people do odd things. But we really *really* don't
> care if somebody writes to a partition at the same time as somebody
> else mounts it. Not enough to do extra work to please insane people.
>
> It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
> The whole sequence always used to be unlocked. The locking is entirely
> new. There is certainly not any legacy users that can possibly rely on
> "I did writes at the same time as the mount with no serialization, and
> it worked". It never has worked.
>
> So I think this is a case of "perfect is the enemy of good".
> Especially since I think that with the fs/buffer.c approach, we don't
> actually need any locking at all at higher levels.
The bigger question is do we have users that expect to be able to set
the blocksize after mmaping the block device (no writes required)? I
actually feel a little bad for taking up internet bandwidth asking, but
it is a change in behaviour.
Regardless, changing mmap for a race in the page cache is just backwards, and
with the current 3.7 code, we can still trigger the race with fadvise ->
readpage in the middle of set_blocksize().
Obviously nobody does any of this, otherwise we'd have tons of reports
from those handy WARN_ONs in fs/buffer.c. So it's definitely hard to be
worried one way or another.
-chris
On Thu, Nov 29, 2012 at 9:51 AM, Chris Mason <[email protected]> wrote:
>
> The bigger question is do we have users that expect to be able to set
> the blocksize after mmaping the block device (no writes required)? I
> actually feel a little bad for taking up internet bandwidth asking, but
> it is a change in behaviour.
Yeah, it is. That said, I don't think people will really notice.
Nobody mmap's block devices outside of some databases, afaik, and
nobody sane mounts a partition at the same time a DB is using it. So I
think the new EBUSY check is *ugly*, but I don't realistically believe
that it is a problem. The ugliness of the locking is why I'm not a
huge fan of it, but if it works I can live with it.
But yes, the mmap tests are new with the locking, and could in theory
be problematic if somebody reports that it breaks anything.
And like the locking, they'd just go away if we just do the
fs/buffer.c approach instead. Because doing things in fs/buffer.c
simply means that we don't care (and serialization is provided by the
page lock on a per-page basis, which is what mmap relies on anyway).
So doing the per-page fs/buffer.c approach (along with the
"ACCESS_ONCE()" on inode->i_blkbits to make sure we get *one*
consistent value, even if we don't care *which* value it is) would
basically revert to all the old semantics. The only thing it would
change is that we wouldn't see oopses.
(And in theory, it would allow us to actively mix-and-match different
block sizes for a block device, but realistically I don't think there
are any actual users of that - although I could imagine that a
filesystem would use a smaller block size for file tail-blocks etc,
and still want to use the fs/buffer.c code, so it's *possible* that it
would be useful, but filesystems have been able to do things like that
by just doing their buffers by hand anyway, so it's not really
fundamentally new, just a possible generalization of code)
Linus
On Thu, 29 Nov 2012, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka <[email protected]> wrote:
> >
> > This sounds sensible. I'm sending this patch.
>
> This looks much better.
>
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
>
> Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.
>
> No RCU grace period crap, no expedited mess, no nothing.
>
> Linus
Yes.
If you remove that percpu rw lock, you also need to rewrite the direct
i/o code.
In theory, block device direct i/o doesn't need buffer block size at all.
But in practice, it shares a lot of code with filesystem direct i/o, it
reads the block size multiple times and it crashes if it changes.
Mikulas
On Thu, Nov 29, 2012 at 10:23 AM, Mikulas Patocka <[email protected]> wrote:
>
>
> If you remove that percpu rw lock, you also need to rewrite direct i/o
> code.
>
> In theory, block device direct i/o doesn't need buffer block size at all.
> But in practice, it shares a lot of code with filesystem direct i/o, it
> reads the block size multiple times and it crashes if it changes.
If it's a filesystem, then the size will never change while it is mounted.
So only the direct-block-device case needs to be worried about, no?
And that uses __generic_file_aio_write() and friends, which in turn
use the readpage/writepage functions.
So for block devices, it should be sufficient to make
readpage/writepage (with the writing obviously having all the
"write_begin/write_end/fullpage" variants) be safe as far as I can
see.
Linus
On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
<[email protected]> wrote:
>
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
>
> Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.
There's a 'block-dev' branch in my git tree, if you guys want to play
around with it.
It actually reverts fs/block_dev.c back to the 3.6 state (except for
some whitespace damage that I refused to re-introduce), so that part
of the changes should be pretty safe and well tested.
The fs/buffer.c changes, of course, are new. It's largely the same
patch I already sent out, with a small helper function to simplify it,
and to keep the whole ACCESS_ONCE() thing in just a single place.
That branch may be re-based in case I get reports or acks or whatever,
so don't rely on it (or if you do, please let me know, and I'll stop
editing it).
The fact that I could just revert the fs/block_dev.c part to a known
state makes me wonder if this might be safe for 3.7 after all (the
fs/buffer.c changes all *look* safe). That way we'd not have to worry
about any new semantics (whether they be EBUSY or any possible locking
slowdowns or RT issues). But I'll think about it, and it would be good
for people to double-check my fs/buffer.c stuff.
Mikulas, does that pass your testing?
Linus
On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > I think I'll apply this for 3.7 (since it's too late to do anything
> > fancier), and then for 3.8 I will rip out all the locking entirely,
> > because looking at the fs/buffer.c patch I wrote up, it's all totally
> > unnecessary.
> >
> > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> > creating new buffers) simply makes the whole locking thing pointless.
> > Just make the page lock protect the block size, and make it per-page,
> > and we're done.
>
> There's a 'block-dev' branch in my git tree, if you guys want to play
> around with it.
>
> It actually reverts fs/block-dev.c back to the 3.6 state (except for
> some whitespace damage that I refused to re-introduce), so that part
> of the changes should be pretty safe and well tested.
>
> The fs/buffer.c changes, of course, are new. It's largely the same
> patch I already sent out, with a small helper function to simplify it,
> and to keep the whole ACCESS_ONCE() thing in just a single place.
The fs/buffer.c part makes sense during a quick read. But
fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real
bugs there.
-chris
On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason <[email protected]> wrote:
>
> The fs/buffer.c part makes sense during a quick read. But
> fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real
> bugs there.
Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
incestuously playing games that should be in fs/buffer.c. I thought it
was doing the sane thing with the page cache.
(I now realize that Mikulas was talking about this mess, while I
thought he was talking about the AIO code which is largely sane).
Linus
On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason <[email protected]> wrote:
> >
> > The fs/buffer.c part makes sense during a quick read. But
> > fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real
> > bugs there.
>
> Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
> incestuously playing games that should be in fs/buffer.c. I thought it
> was doing the sane thing with the page cache.
>
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).
It was all a trick to get you to say the AIO code was sane.
It looks like we could use the private copy of i_blkbits that DIO is
already recording.
blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
really don't get why that isn't byte based instead. DIO is already
doing the shift & mask game.
I think only clean_blockdev_aliases is intentionally using the inode's
i_blkbits, but again that shouldn't be changing for filesystems so it
seems safe to use the DIO copy.
-chris
On Thu, Nov 29, 2012 at 11:26 AM, Linus Torvalds
<[email protected]> wrote:
>
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).
Oh wow.
The direct-IO code really doesn't seem to care at all. I don't think
it needs locking either (it seems to do everything with a private
buffer-head), and the problem appears solely to be that it reads
i_blkbits multiple times, so changing it just happens to confuse the
direct-io code.
If it were to read it only once, and then use that value, it looks
like it should all JustWork(tm).
And the right thing to do would seem to just add it to the
"dio_submit" structure, that we already have. And it already *has* a
blkbits field, but that's the "IO blocksize", not the "getblocks
blocksize", if I read that mess correctly.
Of course, it then *ALREADY* has that "blkfactor" thing, which is the
difference between i_blkbits and blkbits, so it effectively *does* have
i_blkbits already in the dio_submit structure. But despite it all, it
keeps re-reading i_blkbits.
Christ. That code is a mess.
Linus
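A sketch of that "read it once" idea, with illustrative names rather than
the real dio_submit layout:

/* Illustrative only: snapshot inode->i_blkbits once when the direct-IO
 * request is set up, and use the snapshot from then on, so a concurrent
 * set_blocksize() can no longer confuse the submission path. */
struct dio_sketch {
	unsigned blkbits;	/* IO unit used for bio submission */
	unsigned i_blkbits;	/* snapshot of inode->i_blkbits */
	unsigned blkfactor;	/* i_blkbits - blkbits */
};

static void dio_init_sketch(struct dio_sketch *sdio, struct inode *inode,
			    unsigned blkbits)
{
	sdio->i_blkbits = ACCESS_ONCE(inode->i_blkbits);
	sdio->blkbits = blkbits;
	sdio->blkfactor = sdio->i_blkbits - blkbits;
}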
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason <[email protected]> wrote:
>
> blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
> really don't get why that isn't byte based instead. DIO is already
> doing the shift & mask game.
The blkdev_get_blocks() thing is just sad.
The underlying data structure is actually byte-based (it's
"i_size_read(bdev->bd_inode"), but we convert it to a block-based
number.
Oops.
Linus
On Thu, Nov 29, 2012 at 11:55 AM, Linus Torvalds
<[email protected]> wrote:
>
> The blkdev_get_blocks() this is just sad.
>
> The underlying data structure is actually byte-based (it's
> "i_size_read(bdev->bd_inode"), but we convert it to a block-based
> number.
>
> Oops.
Oh, it's even worse than that. The DIO code ends up passing in buffer
heads that have sizes bigger than the inode's block size, which can cause
problems at the end of the disk. So blkdev_get_blocks() knows about
it, and will then "fix" that and shrink them down. The games with
"max_block" are hilarious.
In a really sad way.
That whole blkdev_get_blocks() function is pure and utter shit.
Linus
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason <[email protected]> wrote:
>
> It was all a trick to get you to say the AIO code was sane.
It's only sane compared to the DIO code.
That said, I hate AIO much less these days now that we've largely merged
the code with the regular IO. It's still a horrible interface, but at
least it is no longer a really disgusting separate implementation in
the kernel of that horrible interface.
So yeah, I guess AIO really is pretty sane these days.
> It looks like we could use the private copy of i_blkbits that DIO is
> already recording.
Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.
I've pushed out two more commits to the 'block-dev' branch at
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev
in case anybody wants to take a look.
It is - as usual - entirely untested. It compiles, and I *think* that
blkdev_get_blocks() makes a whole lot more sense this way - as you
said, it should be byte-based (although it actually does the block
number conversion because I worried about overflow - probably
unnecessarily).
Comments?
Linus
On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason <[email protected]> wrote:
> >
> > It was all a trick to get you to say the AIO code was sane.
>
> It's only sane compared to the DIO code.
>
> That said, I hate AIO much less these days that we've largely merged
> the code with the regular IO. It's still a horrible interface, but at
> least it is no longer a really disgusting separate implementation in
> the kernel of that horrible interface.
>
> So yeah, I guess AIO really is pretty sane these days.
>
> > It looks like we could use the private copy of i_blkbits that DIO is
> > already recording.
>
> Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.
>
> I've pushed out two more commits to the 'block-dev' branch at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev
>
> in case anybody wants to take a look.
>
> It is - as usual - entirely untested. It compiles, and I *think* that
> blkdev_get_blocks() makes a whole lot more sense this way - as you
> said, it should be byte-based (although it actually does the block
> number conversion because I worried about overflow - probably
> unnecessarily).
>
> Comments?
Your blkdev_get_blocks emails were great reading while at the dentist,
thanks for helping me pass the time.
Just reading the new blkdev_get_blocks, it looks like we're mixing
shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
it has no relation at all to the actual block size of the device. The
interface is abusing b_size to ask for as large a mapping as possible.
Most importantly, it has no relation to the fs_startblk that we pass in,
which is based on inode->i_blkbits.
So your new check in blkdev_get_blocks:
if (iblock >= end_block) {
is wrong because iblock and end_block are based on different sizes. I
think we have to do the eof checks inside fs/direct-io.c or change the
get_blocks interface completely.
I really thought fs/direct-io.c was already doing eof checks, but I'm
reading harder.
-chris
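One way to sidestep the unit mismatch is to do the EOF clamp in bytes; the
sketch below is illustrative only, not the code on the block-dev branch:

/* Illustrative sketch: convert the i_blkbits-based block index to a byte
 * offset and clamp the requested mapping against i_size in bytes, so the
 * oversized b_size the DIO code passes in no longer matters. */
static int blkdev_get_blocks_sketch(struct inode *inode, sector_t iblock,
				    struct buffer_head *bh, int create)
{
	loff_t isize = i_size_read(inode);
	loff_t offset = (loff_t)iblock << inode->i_blkbits;

	if (offset >= isize)
		return -EIO;			/* past end of device */

	bh->b_bdev = I_BDEV(inode);
	bh->b_blocknr = iblock;
	bh->b_size = min_t(loff_t, bh->b_size, isize - offset);
	set_buffer_mapped(bh);
	return 0;
}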
On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason <[email protected]> wrote:
>
> Just reading the new blkdev_get_blocks, it looks like we're mixing
> shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
> it has no relation at all to the actual block size of the device. The
> interface is abusing b_size to ask for as large a mapping as possible.
Ugh. That's a big violation of how buffer-heads are supposed to work:
the block number is very much defined to be in multiples of b_size
(see for example "submit_bh()" that turns it into a sector number).
But you're right. The direct-IO code really *is* violating that, and
knows that get_block() ends up being defined in i_blkbits regardless
of b_size.
What a crock. That direct-IO code is hack-upon-hack. Whoever wrote it
should be shot.
I think the only sane way to fix is is to pass in the block size to
get_blocks(). Which we admittedly should have done long ago, so that's
not a bad fix, but without actually looking at what it involves, I
think it's going to be pretty big patch. All the filesystems that
support the interface need to update it, even if they can then ignore
it, because direct-IO does all these hacks only for the raw device.
And I think it will improve the interface, but damn, direct-IO is
still horrible for playing these kinds of games.
Linus
On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
<[email protected]> wrote:
>
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.
It turns out fs/ioctl.c does the same - it fills in the buffer head
with some random bh->b_size too. I think it's not even a power of two
in that case.
And I guess it's understandable - they don't actually *use* the
buffer, they just want the offset. So the b_size field really is just
random crap to the users of the get_block interfaces, since they've
never cared before.
Ugh, this was definitely a dark and disgusting underbelly of the VFS
layer. We've not had to really touch it for a *looong* time..
Linus
On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
Mikulas Patocka <[email protected]> wrote:
> percpu-rwsem: use synchronize_sched_expedited
>
> Use synchronize_sched_expedited() instead of synchronize_sched()
> to improve mount speed.
>
> This patch improves mount time from 0.500s to 0.013s.
>
> Note: if realtime people complain about the use
> synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> that they introduce an option CONFIG_REALTIME or
> /proc/sys/kernel/realtime and turn off these *_expedited functions if
> the option is enabled (i.e. turn synchronize_sched_expedited into
> synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
>
> Signed-off-by: Mikulas Patocka <[email protected]>
So I read through this thread but I really didn't see a clear
description of why mount() got slower. The changelog for 4b05a1c74d1
is spectacularly awful :(
Apparently the slowdown occurred because a blockdev mount patch
62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
down_write() path.
And using synchronize_sched_expedited() rather than synchronize_sched()
makes percpu_down_write() somewhat less slow. Correct?
Why is it OK to use synchronize_sched_expedited() here? If it's
faster, why can't we use synchronize_sched_expedited() everywhere and
zap synchronize_sched()?
Oleg, this has implications for your
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch.
I've changed that to use synchronize_sched_expedited() everywhere -
please have a think and a retest. Note that elsewhere in this thread,
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch
was found to slow mount down again somewhat, because it adds an
additional synchronize_sched[_expedited]() to the percpu_down_write()
path.
From: Oleg Nesterov <[email protected]>
Subject: percpu_rw_semaphore: reimplement to not block the readers unnecessarily
Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely. Even if the "write" section was not actually started
or if it was already finished.
With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.
percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.
Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.
Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Anton Arapov <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/percpu-rwsem.h | 85 +++-------------------
lib/Makefile | 2
lib/percpu-rwsem.c | 123 +++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 72 deletions(-)
diff -puN include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily include/linux/percpu-rwsem.h
--- a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
#define _LINUX_PERCPU_RWSEM_H
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
struct percpu_rw_semaphore {
- unsigned __percpu *counters;
- bool locked;
- struct mutex mtx;
+ unsigned int __percpu *fast_read_ctr;
+ struct mutex writer_mutex;
+ struct rw_semaphore rw_sem;
+ atomic_t slow_read_ctr;
+ wait_queue_head_t write_waitq;
};
-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched_expedited()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
- rcu_read_lock_sched();
- if (unlikely(p->locked)) {
- rcu_read_unlock_sched();
- mutex_lock(&p->mtx);
- this_cpu_inc(*p->counters);
- mutex_unlock(&p->mtx);
- return;
- }
- this_cpu_inc(*p->counters);
- rcu_read_unlock_sched();
- light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
- light_mb(); /* B, between read of the data and write to p->counter, paired with C */
- this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
- unsigned total = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
- return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
- mutex_lock(&p->mtx);
- p->locked = true;
- synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */
- while (__percpu_count(p->counters))
- msleep(1);
- heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
- heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
- p->locked = false;
- mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
- p->counters = alloc_percpu(unsigned);
- if (unlikely(!p->counters))
- return -ENOMEM;
- p->locked = false;
- mutex_init(&p->mtx);
- return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
- free_percpu(p->counters);
- p->counters = NULL; /* catch use after free bugs */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
+
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
#endif
diff -puN lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily lib/Makefile
--- a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/lib/Makefile
@@ -9,7 +9,7 @@ endif
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o timerqueue.o\
- idr.o int_sqrt.o extable.o \
+ idr.o int_sqrt.o extable.o percpu-rwsem.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o earlycpio.o kobject_uevent.o
diff -puN /dev/null lib/percpu-rwsem.c
--- /dev/null
+++ a/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+ brw->fast_read_ctr = alloc_percpu(int);
+ if (unlikely(!brw->fast_read_ctr))
+ return -ENOMEM;
+
+ mutex_init(&brw->writer_mutex);
+ init_rwsem(&brw->rw_sem);
+ atomic_set(&brw->slow_read_ctr, 0);
+ init_waitqueue_head(&brw->write_waitq);
+ return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+ free_percpu(brw->fast_read_ctr);
+ brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
+{
+ bool success = false;
+
+ preempt_disable();
+ if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+ __this_cpu_add(*brw->fast_read_ctr, val);
+ success = true;
+ }
+ preempt_enable();
+
+ return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, +1)))
+ return;
+
+ down_read(&brw->rw_sem);
+ atomic_inc(&brw->slow_read_ctr);
+ up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, -1)))
+ return;
+
+ /* false-positive is possible but harmless */
+ if (atomic_dec_and_test(&brw->slow_read_ctr))
+ wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+ unsigned int sum = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ sum += per_cpu(*brw->fast_read_ctr, cpu);
+ per_cpu(*brw->fast_read_ctr, cpu) = 0;
+ }
+
+ return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+ /* also blocks update_fast_ctr() which checks mutex_is_locked() */
+ mutex_lock(&brw->writer_mutex);
+
+ /*
+ * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+ * so that update_fast_ctr() can't succeed.
+ *
+ * 2. Ensures we see the result of every previous this_cpu_add() in
+ * update_fast_ctr().
+ *
+ * 3. Ensures that if any reader has exited its critical section via
+ * fast-path, it executes a full memory barrier before we return.
+ */
+ synchronize_sched_expedited();
+
+ /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+ atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+ /* block the new readers completely */
+ down_write(&brw->rw_sem);
+
+ /* wait for all readers to complete their percpu_up_read() */
+ wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+ /* allow the new readers, but only the slow-path */
+ up_write(&brw->rw_sem);
+
+ /* insert the barrier before the next fast-path in down_read */
+ synchronize_sched_expedited();
+
+ mutex_unlock(&brw->writer_mutex);
+}
_
On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
>
> It turns out fs/ioctl.c does the same - it fills in the buffer head
> with some random bh->b_size too. I think it's not even a power of two
> in that case.
>
> And I guess it's understandable - they don't actually *use* the
> buffer, they just want the offset. So the b_size field really is just
> random crap to the users of the get_block interfaces, since they've
> never cared before.
>
> Ugh, this was definitely a dark and disgusting underbelly of the VFS
> layer. We've not had to really touch it for a *looong* time..
I searched through filemap.c for the magic i_size check that would let
us get away with ignoring i_blkbits in get_blocks, but it's just not
there. The whole fallback-to-buffered scheme seems to rely on
get_blocks checking for i_size. I really hope I'm just missing
something.
If we're going to change this, I'd vote for something non-bh based. I
didn't check every single FS, but I don't think direct-IO really wants
or needs buffer heads at all.
One less wart in direct-io.c would really be nice, but I'm assuming
it'll take us at least one full release to hammer out a shiny new
get_blocks. Passing i_blkbits would be more mechanical, since all the
filesystems would just ignore it.
-chris
On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason <[email protected]> wrote:
>
> I searched through filemap.c for the magic i_size check that would let
> us get away with ignoring i_blkbits in get_blocks, but its just not
> there. The whole fallback-to-buffered scheme seems to rely on
> get_blocks checking for i_size. I really hope I'm just missing
> something.
So generic_write_checks() limits the size to i_size for writes (and
for "isblk").
Sure, then it will do the buffered part after that, but that should
all be fine anyway, since by then we use the normal page cache.
For reads, generic_file_aio_read() will check pos < size, but doesn't
seem to actually limit the size of the iovec.
I'm not sure why it doesn't just do "iov_shorten()".
Anyway, having looked at actually passing in the block size to
get_block(), I can say that is a horrible idea. There are tons of
get_block functions (for various filesystems), and *none* of them
really want the block size, because they tend to work on block
indexes. And if they do want the block size, they'll just get it from
the inode or sb, since they are filesystems and it's all stable.
So the *only* one of the places that would want the block size is
fs/block_dev.c. And the callers really already seem to do the i_size
check, although they sometimes do it badly. And since there are fewer
callers than there are get_block() implementations, I think we should
just fix the callers and be done with it.
Those generic_file_aio_read/write() functions in fs/direct-io.c really
just seem to be badly written. The fact that they may depend on the
i_size check in get_blocks() is sad, but I think we should fix it and
just remove the check for block devices. That's going to simplify so
much..
I updated the 'block-dev' branch to have that simpler fs/block_dev.c
model instead. I'll look at the iovec shortening later. It's a
non-fast-forward thing, look out!
(I actually think we should just add the max-offset check to
rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
and we could make it do a max_offset check as well).
Linus
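A sketch of that max-offset clamp, in the spirit of iov_shorten() (the
helper name is made up):

/* Illustrative helper: trim an iovec so the total request never extends
 * past 'limit' (e.g. the block device size), returning the new number of
 * segments. */
static unsigned long iov_clamp_sketch(struct iovec *iov, unsigned long nr_segs,
				      loff_t pos, loff_t limit)
{
	size_t budget = pos < limit ? limit - pos : 0;
	unsigned long seg, used = 0;

	for (seg = 0; seg < nr_segs; seg++) {
		if (iov[seg].iov_len > budget)
			iov[seg].iov_len = budget;
		if (iov[seg].iov_len)
			used = seg + 1;
		budget -= iov[seg].iov_len;
	}
	return used;
}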
On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason <[email protected]> wrote:
> >
> > I searched through filemap.c for the magic i_size check that would let
> > us get away with ignoring i_blkbits in get_blocks, but its just not
> > there. The whole fallback-to-buffered scheme seems to rely on
> > get_blocks checking for i_size. I really hope I'm just missing
> > something.
>
> So generic_write_checks() limits the size to i_size at for writes (and
> for "isblk").
Great, that's what I was missing.
>
> Sure, then it will do the buffered part after that, but that should
> all be fine anyway, since by then we use the normal page cache.
>
> For reads, generic_file_aio_read() will check pos < size, but doesn't
> seem to actually limit the size of the iovec.
I couldn't explain that either.
>
> I'm not sure why it doesn't just do "iov_shorten()".
>
> Anyway, having looked at actually passing in the block size to
> get_block(), I can say that is a horrible idea. There are tons of
> get_block functions (for various filesystems), and *none* of them
> really want the block size, because they tend to work on block
> indexes. And if they do want the block size, they'll just get it from
> the inode or sb, since they are filesystems and it's all stable.
>
> So the *only* of the places that would want the block size is
> fs/block_dev.c. And the callers really already seem to do the i_size
> check, although they sometimes do it badly. And since there are fewer
> callers than there are get_block() implementations, I think we should
> just fix the callers and be done with it.
>
> Those generic_file_aio_read/write() functions in fs/direct-io.c really
> just seem to be badly written. The fact that they may depend on the
> i_size check in get_blocks() is sad, but I think we should fix it and
> just remove the check for block devices. That's going to simplify so
> much..
>
> I updated the 'block-dev' branch to have that simpler fs/block_dev.c
> model instead. I'll look at the iovec shortening later. It's a
> non-fast-forward thing, look out!
>
> (I actually think we should just add the max-offset check to
> rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
> and we could make it do a max_offset check as well).
This is definitely easier, and I can't see any reason not to do it. I'm
used to get_block being expensive and so it didn't even cross my mind.
We can benchmark things just to make sure.
-chris
On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason <[email protected]> wrote:
> >
> > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
> > it has no relation at all to the actual block size of the device. The
> > interface is abusing b_size to ask for as large a mapping as possible.
>
> Ugh. That's a big violation of how buffer-heads are supposed to work:
> the block number is very much defined to be in multiples of b_size
> (see for example "submit_bh()" that turns it into a sector number).
>
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.
Same with mpage_readpages(), so it's not just direct IO that has
this problem....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, 29 Nov 2012, Andrew Morton wrote:
> On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
> Mikulas Patocka <[email protected]> wrote:
>
> > percpu-rwsem: use synchronize_sched_expedited
> >
> > Use synchronize_sched_expedited() instead of synchronize_sched()
> > to improve mount speed.
> >
> > This patch improves mount time from 0.500s to 0.013s.
> >
> > Note: if realtime people complain about the use
> > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> > that they introduce an option CONFIG_REALTIME or
> > /proc/sys/kernel/realtime and turn off these *_expedited functions if
> > the option is enabled (i.e. turn synchronize_sched_expedited into
> > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
>
> So I read through this thread but I really didn't see a clear
> description of why mount() got slower. The changelog for 4b05a1c74d1
> is spectacularly awful :(
>
>
> Apparently the slowdown occurred because a blockdev mount patch
> 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
> semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
> down_write() path.
>
> And using synchronize_sched_expedited() rather than synchronize_sched()
> makes percpu_down_write() somewhat less slow. Correct?
Yes.
> Why is it OK to use synchronize_sched_expedited() here? If it's
> faster, why can't we use synchronize_sched_expedited() everywhere and
> zap synchronize_sched()?
Because synchronize_sched_expedited sends interrupts to all processors and
it is bad for realtime workloads.
Peter Zijlstra once complained when I used synchronize_rcu_expedited in
bdi_remove_from_list (but he left it there).
I suggest that if it really hurts real-time response for someone, let them
introduce a switch to turn it into a non-expedited call.
Mikulas
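For context, here is a simplified sketch of the write path being discussed;
the field and helper names are approximate, not the exact 3.7-era source.
The grace-period wait is what dominates mount time, and swapping it for the
expedited variant is the whole point of the patch.

#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/delay.h>

struct percpu_rw_semaphore_sketch {
	unsigned int __percpu *counters;	/* per-CPU reader counts */
	bool locked;				/* a writer is present */
	struct mutex mtx;
};

/* Hypothetical helper: sum the per-CPU reader counts. */
static unsigned int per_cpu_reader_count(unsigned int __percpu *counters)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(counters, cpu);
	return sum;
}

static void percpu_down_write_sketch(struct percpu_rw_semaphore_sketch *p)
{
	mutex_lock(&p->mtx);
	p->locked = true;

	/*
	 * Wait for a grace period so that every reader either sees
	 * locked == true or has already bumped its per-CPU counter.  This is
	 * the expensive step: synchronize_sched() can take a large fraction
	 * of a second, while synchronize_sched_expedited() IPIs the other
	 * CPUs and returns quickly -- hence the mount-time difference.
	 */
	synchronize_sched_expedited();

	/* Then wait for the remaining readers to drain. */
	while (per_cpu_reader_count(p->counters))
		msleep(1);
}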
On Thu, Nov 29, 2012 at 10:00:53PM -0500, Mikulas Patocka wrote:
> On Thu, 29 Nov 2012, Andrew Morton wrote:
> > On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
> > Mikulas Patocka <[email protected]> wrote:
> >
> > > percpu-rwsem: use synchronize_sched_expedited
> > >
> > > Use synchronize_sched_expedited() instead of synchronize_sched()
> > > to improve mount speed.
> > >
> > > This patch improves mount time from 0.500s to 0.013s.
> > >
> > > Note: if realtime people complain about the use of
> > > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> > > that they introduce an option CONFIG_REALTIME or
> > > /proc/sys/kernel/realtime and turn off these *_expedited functions if
> > > the option is enabled (i.e. turn synchronize_sched_expedited into
> > > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
> > >
> > > Signed-off-by: Mikulas Patocka <[email protected]>
> >
> > So I read through this thread but I really didn't see a clear
> > description of why mount() got slower. The changelog for 4b05a1c74d1
> > is spectacularly awful :(
> >
> >
> > Apparently the slowdown occurred because a blockdev mount patch
> > 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
> > semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
> > down_write() path.
> >
> > And using synchronize_sched_expedited() rather than synchronize_sched()
> > makes percpu_down_write() somewhat less slow. Correct?
>
> Yes.
>
> > Why is it OK to use synchronize_sched_expedited() here? If it's
> > faster, why can't we use synchronize_sched_expedited() everywhere and
> > zap synchronize_sched()?
>
> Because synchronize_sched_expedited sends interrupts to all processors and
> it is bad for realtime workloads.
>
> Peter Zijlstra once complained when I used synchronize_rcu_expedited in
> bdi_remove_from_list (but he left it there).
>
> I suggest that if it really hurts real-time response for someone, let them
> introduce a switch to turn it into a non-expedited call.
Once Frederic's adaptive-ticks work reaches mainline, it will be possible
to avoid the IPIs to CPUs that are executing in user mode, in addition to
the current code's avoiding sending IPIs to CPUs that are idle. That said,
it will still be necessary to send IPIs to CPUs that are executing in
the kernel.
So things will get better, but won't be perfect. Sort of like this was
real life or something. ;-)
Thanx, Paul
On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
> On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason <[email protected]> wrote:
> > >
> > > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
> > > it has no relation at all to the actual block size of the device. The
> > > interface is abusing b_size to ask for as large a mapping as possible.
> >
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
>
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem....
I guess the good news is that block devices don't have readpages. The
bad news would be that we can't put readpages in without much bigger
changes.
-chris
On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
>
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem....
The mpage code may actually fall back to BHs.
I have a version of the direct I/O code that uses the iomap_ops from the
multi-page write code that you originally started. It uses the new op
as primary interface for direct I/O and provides a helper for
filesystems that still use buffer heads internally. I'll try to dust it
off and send out a version for the current kernel.
On Fri, Nov 30, 2012 at 6:31 AM, Chris Mason <[email protected]> wrote:
> On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
>>
>> Same with mpage_readpages(), so it's not just direct IO that has
>> this problem....
>
> I guess the good news is that block devices don't have readpages. The
> bad news would be that we can't put readpages in without much bigger
> changes.
Well, the new block-dev branch no longer cares. It basically says "ok,
we use inode->i_blkbits, but for raw device accesses we know it's
unstable, so we'll just not use it".
So both mpage_readpages and direct-IO should be happy. And it actually
removed redundant code, so it's all good.
Linus
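A hedged sketch of the approach Linus describes for the raw-block-device
path: sample the block size once per operation and derive everything from
the local copy, so a racing set_blocksize() cannot change the units in the
middle of one request.  The helper below is illustrative only, not the
actual block-dev branch code.

#include <linux/fs.h>
#include <linux/compiler.h>

/*
 * Illustrative only: read inode->i_blkbits exactly once and use that
 * snapshot for every unit conversion in the request, instead of re-reading
 * it (for raw block devices it can change underneath us).
 */
static sector_t blkdev_block_to_sector_sketch(struct inode *inode,
					      sector_t iblock)
{
	unsigned int blkbits = ACCESS_ONCE(inode->i_blkbits);	/* one snapshot */

	/* All further arithmetic uses the local copy, never i_blkbits again. */
	return iblock << (blkbits - 9);
}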
On 11/29, Andrew Morton wrote:
>
> On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
> Mikulas Patocka <[email protected]> wrote:
>
> > percpu-rwsem: use synchronize_sched_expedited
> >
> > Use synchronize_sched_expedited() instead of synchronize_sched()
> > to improve mount speed.
> >
> > This patch improves mount time from 0.500s to 0.013s.
> >
> > Note: if realtime people complain about the use of
> > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> > that they introduce an option CONFIG_REALTIME or
> > /proc/sys/kernel/realtime and turn off these *_expedited functions if
> > the option is enabled (i.e. turn synchronize_sched_expedited into
> > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
>
> So I read through this thread but I really didn't see a clear
> description of why mount() got slower. The changelog for 4b05a1c74d1
> is spectacularly awful :(
>
>
> Apparently the slowdown occurred because a blockdev mount patch
> 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
> semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
> down_write() path.
>
> And using synchronize_sched_expedited() rather than synchronize_sched()
> makes percpu_down_write() somewhat less slow. Correct?
>
> Why is it OK to use synchronize_sched_expedited() here? If it's
> faster, why can't we use synchronize_sched_expedited() everywhere and
> zap synchronize_sched()?
synchronize_sched_expedited() is much faster, but it is heavyweight:
try_stop_cpus() affects the whole system.
We briefly discussed this possibility with Paul; personally, I think it
would be better to add percpu_rw_semaphore->use_expedited or add
percpu_down_write_expedited().
But this change was already applied. I hope we can do this later.
> Oleg, this has implications for your
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch.
> I've changed that to use synchronize_sched_expedited() everywhere -
> please have a think and a retest.
Thanks a lot Andrew. The fix looks obviously fine, but I'll try to test
anyway.
> Note that elsewhere in this thread,
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch
> was found to slow mount down again somewhat, because it adds an
> additional synchronize_sched[_expedited]() to the percpu_down_write()
> path.
Cough... this patch _removes_ one synchronize_sched[_expedited]() and
removes msleep ;) Although the main point was not-block-the-readers.
And I think we can remove another one, but this is not trivial and
percpu_free_rwsem() should be might_sleep().
Oleg.
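One way to read Oleg's suggestion, building on the earlier sketch types:
keep the default write path non-expedited and let the few callers that care
(like the blockdev mount path) opt in.  The entry point and the 'expedite'
flag below are hypothetical; nothing with these names was merged.

/* Sketch of the opt-in variant Oleg mentions; names are made up. */
static void percpu_down_write_choice_sketch(struct percpu_rw_semaphore_sketch *p,
					    bool expedite)
{
	mutex_lock(&p->mtx);
	p->locked = true;

	if (expedite)
		synchronize_sched_expedited();	/* fast, but IPIs every CPU */
	else
		synchronize_sched();		/* slower, but RT-friendly */

	while (per_cpu_reader_count(p->counters))
		msleep(1);
}

/* The blockdev mount path would pass true; everyone else would pass false. */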
On Fri, Nov 30, 2012 at 11:36:01AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
> > > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > > the block number is very much defined to be in multiples of b_size
> > > (see for example "submit_bh()" that turns it into a sector number).
> > >
> > > But you're right. The direct-IO code really *is* violating that, and
> > > knows that get_block() ends up being defined in i_blkbits regardless
> > > of b_size.
> >
> > Same with mpage_readpages(), so it's not just direct IO that has
> > this problem....
>
> The mpage code may actually fall back to BHs.
>
> I have a version of the direct I/O code that uses the iomap_ops from the
> multi-page write code that you originally started. It uses the new op
> as primary interface for direct I/O and provides a helper for
> filesystems that still use buffer heads internally. I'll try to dust it
> off and send out a version for the current kernel.
So it was based on this interface?
(I went looking for this code on google a couple of days ago so I
could point at it and say "we should be using an iomap structure,
not buffer heads", but it looks like I never posted it to fsdevel or
the xfs lists...)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..e247d62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ enum positive_aop_returns {
struct page;
struct address_space;
struct writeback_control;
+struct iomap;
struct iov_iter {
const struct iovec *iov;
@@ -614,6 +615,9 @@ struct address_space_operations {
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
+
+ int (*iomap)(struct address_space *mapping, loff_t pos,
+ ssize_t length, struct iomap *iomap, int cmd);
};
/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
new file mode 100644
index 0000000..7708614
--- /dev/null
+++ b/include/linux/iomap.h
@@ -0,0 +1,45 @@
+#ifndef _IOMAP_H
+#define _IOMAP_H
+
+/* ->iomap a_op command types */
+#define IOMAP_READ 0x01 /* read the current mapping starting at the
+ given position, trimmed to a maximum length.
+ FS's should use this to obtain and lock
+ resources within this range */
+#define IOMAP_RESERVE 0x02 /* reserve space for an allocation that spans
+ the given iomap */
+#define IOMAP_ALLOCATE 0x03 /* allocate space in a given iomap - must have
+ first been reserved */
+#define IOMAP_UNRESERVE 0x04 /* return unused reserved space for the given
+ iomap and used space. This will always be
+ called after a IOMAP_READ so as to allow the
+ FS to release held resources. */
+
+/* types of block ranges for multipage write mappings. */
+#define IOMAP_HOLE 0x01 /* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC 0x02 /* delayed allocation blocks */
+#define IOMAP_MAPPED 0x03 /* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN 0x04 /* blocks allocated @blkno in unwritten state */
+
+struct iomap {
+ sector_t blkno; /* first sector of mapping */
+ loff_t offset; /* file offset of mapping, bytes */
+ ssize_t length; /* length of mapping, bytes */
+ int type; /* type of mapping */
+ void *priv; /* fs private data associated with map */
+};
+
+static inline bool
+iomap_needs_allocation(struct iomap *iomap)
+{
+ return iomap->type == IOMAP_HOLE;
+}
+
+/* multipage write interfaces use iomaps */
+typedef int (*mpw_actor_t)(struct address_space *mapping, void *src,
+ loff_t pos, ssize_t len, struct iomap *iomap);
+
+ssize_t multipage_write_segment(struct address_space *mapping, void *src,
+ loff_t pos, ssize_t length, mpw_actor_t actor);
+
+#endif /* _IOMAP_H */
Cheers,
Dave.
--
Dave Chinner
[email protected]
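To show how the posted interface would be consumed, here is a minimal,
hypothetical ->iomap implementation for the IOMAP_READ case.  Only the
struct iomap fields and IOMAP_* constants come from the patch above; the
filesystem name and the block-lookup helper are made up.

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/iomap.h>	/* the header proposed in the patch above */

/*
 * Made-up helper: looks up the extent covering 'pos'; returns 1 if the
 * range is mapped, 0 if it is a hole, or a negative errno.  *blkno and
 * *len describe the extent found.
 */
int examplefs_find_extent(struct inode *inode, loff_t pos,
			  sector_t *blkno, ssize_t *len);

static int examplefs_iomap(struct address_space *mapping, loff_t pos,
			   ssize_t length, struct iomap *iomap, int cmd)
{
	struct inode *inode = mapping->host;
	sector_t blkno = 0;
	ssize_t len = length;
	int ret;

	if (cmd != IOMAP_READ)
		return -EOPNOTSUPP;	/* this sketch only handles lookups */

	ret = examplefs_find_extent(inode, pos, &blkno, &len);
	if (ret < 0)
		return ret;

	iomap->offset = pos;
	iomap->length = min_t(ssize_t, len, length);	/* trim to the request */
	iomap->priv   = NULL;

	if (ret == 0) {			/* hole: nothing allocated here */
		iomap->type  = IOMAP_HOLE;
		iomap->blkno = 0;
	} else {
		iomap->type  = IOMAP_MAPPED;
		iomap->blkno = blkno;
	}
	return 0;
}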
On Sat, Dec 01, 2012 at 09:40:41AM +1100, Dave Chinner wrote:
> So it was based on this interface?
Based on that. I had dropped the inode operation as it's not really
a generic operation but a callback for either the buffered I/O code
or direct I/O and should be treated as such. I've also split the
single multiplexer function into individual ones, but the underlying
data structure and fundamental operations are the same.
> (I went looking for this code on google a couple of days ago so I
> could point at it and say "we should be using an iomap structure,
> not buffer heads", but it looks like I never posted it to fsdevel or
> the xfs lists...)
Your version definitely was up on your kernel.org XFS tree; that's what
I started from.
I'll have a long plane ride tonight; let's see if I can get the direct
I/O version updated.
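As a rough illustration of the "helper for filesystems that still use
buffer heads" Christoph mentions, a shim could run the filesystem's
existing get_block routine once and translate the result into the
struct iomap from Dave's patch.  This is a guess at the shape, not
Christoph's actual code.

#include <linux/kernel.h>
#include <linux/buffer_head.h>
#include <linux/iomap.h>	/* struct iomap from Dave's posted patch */

/*
 * Guess at a buffer_head -> iomap shim: call get_block() once for the
 * block covering 'pos' and express the answer as an iomap.
 */
static int iomap_from_get_block(struct inode *inode, loff_t pos,
				ssize_t length, get_block_t get_block,
				struct iomap *iomap)
{
	struct buffer_head bh = { .b_size = 1 << inode->i_blkbits };
	sector_t block = pos >> inode->i_blkbits;
	int ret;

	ret = get_block(inode, block, &bh, 0);	/* 0 = read, no allocation */
	if (ret)
		return ret;

	iomap->offset = (loff_t)block << inode->i_blkbits;
	iomap->length = min_t(ssize_t, bh.b_size, length);
	iomap->priv   = NULL;

	if (!buffer_mapped(&bh)) {
		iomap->type  = buffer_delay(&bh) ? IOMAP_DELALLOC : IOMAP_HOLE;
		iomap->blkno = 0;
	} else {
		iomap->type  = buffer_unwritten(&bh) ? IOMAP_UNWRITTEN
						     : IOMAP_MAPPED;
		iomap->blkno = (sector_t)bh.b_blocknr << (inode->i_blkbits - 9);
	}
	return 0;
}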