2015-08-14 17:21:55

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 0/8] change sb_writers to use percpu_rw_semaphore

On 08/13, Jan Kara wrote:
>
> Regarding the routing, ideally Al Viro should take these as a VFS
> maintainer.

Al, could you take these patches?

Only cosmetic changes in V3 to address the comments from Jan, I
preserved his acks.

In case you missed all the spam I sent before, let me repeat that
the awful (and currently unneeded) 7/8 will be reverted later. We
need it to ensure that other percpu_rw_semaphore changes routed
via another tree won't break fs/super.c. After that we will add
rcu_sync_dtor(s_writers->rw_sem) into deactivate_locked_super()
and revert this horror.

3/8 documents the lockdep problems we currently have. This is fixed
by the patch below but it depends on xfs ILOCK fixes from Dave, so
I will send it later. Plus another patch which removes the "trylock"
hack in __sb_start_write().

Oleg.

arch/Kconfig | 1 -
fs/btrfs/transaction.c | 8 +--
fs/super.c | 184 +++++++++++++++++++---------------------
fs/xfs/xfs_aops.c | 6 +-
include/linux/fs.h | 23 +++---
include/linux/percpu-rwsem.h | 20 +++++
init/Kconfig | 1 -
kernel/locking/Makefile | 3 +-
kernel/locking/percpu-rwsem.c | 13 +++
lib/Kconfig | 3 -
10 files changed, 136 insertions(+), 126 deletions(-)

--------------------------------------------------------------------------
[PATCH v3 9/8] don't fool lockdep in freeze_super() and thaw_super() paths

sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
false-positives. Now that xfs was fixed by Dave we can remove it and
change freeze_super() and thaw_super() to run with s_writers.rw_sem
locks held; we add two trivial helpers for that, sb_freeze_release()
and sb_freeze_acquire().

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/super.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 4350ff4..91c9756 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1213,25 +1213,34 @@ EXPORT_SYMBOL(__sb_start_write);
static void sb_wait_write(struct super_block *sb, int level)
{
percpu_down_write(sb->s_writers.rw_sem + level-1);
- /*
- * We are going to return to userspace and forget about this lock, the
- * ownership goes to the caller of thaw_super() which does unlock.
- *
- * FIXME: we should do this before return from freeze_super() after we
- * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
- * should re-acquire these locks before s_op->unfreeze_fs(sb). However
- * this leads to lockdep false-positives, so currently we do the early
- * release right after acquire.
- */
- percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
}

-static void sb_freeze_unlock(struct super_block *sb)
+/*
+ * We are going to return to userspace and forget about these locks, the
+ * ownership goes to the caller of thaw_super() which does unlock().
+ */
+static void sb_freeze_release(struct super_block *sb)
+{
+ int level;
+
+ for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+ percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+/*
+ * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
+ */
+static void sb_freeze_acquire(struct super_block *sb)
{
int level;

for (level = 0; level < SB_FREEZE_LEVELS; ++level)
percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+static void sb_freeze_unlock(struct super_block *sb)
+{
+ int level;

for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
percpu_up_write(sb->s_writers.rw_sem + level);
@@ -1327,6 +1336,7 @@ int freeze_super(struct super_block *sb)
* sees write activity when frozen is set to SB_FREEZE_COMPLETE.
*/
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+ sb_freeze_release(sb);
up_write(&sb->s_umount);
return 0;
}
@@ -1353,11 +1363,14 @@ int thaw_super(struct super_block *sb)
goto out;
}

+ sb_freeze_acquire(sb);
+
if (sb->s_op->unfreeze_fs) {
error = sb->s_op->unfreeze_fs(sb);
if (error) {
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
+ sb_freeze_release(sb);
up_write(&sb->s_umount);
return error;
}
--
1.5.5.1


2015-08-14 17:22:10

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 1/8] introduce __sb_writers_{acquired,release}() helpers

Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/btrfs/transaction.c | 8 ++------
fs/xfs/xfs_aops.c | 6 ++----
include/linux/fs.h | 5 +++++
3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5628e25..6dca4e9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1620,9 +1620,7 @@ static void do_async_commit(struct work_struct *work)
* Tell lockdep about it.
*/
if (ac->newtrans->type & __TRANS_FREEZABLE)
- rwsem_acquire_read(
- &ac->root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
- 0, 1, _THIS_IP_);
+ __sb_writers_acquired(ac->root->fs_info->sb, SB_FREEZE_FS);

current->journal_info = ac->newtrans;

@@ -1661,9 +1659,7 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
* async commit thread will be the one to unlock it.
*/
if (ac->newtrans->type & __TRANS_FREEZABLE)
- rwsem_release(
- &root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
- 1, _THIS_IP_);
+ __sb_writers_release(root->fs_info->sb, SB_FREEZE_FS);

schedule_work(&ac->work);

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a56960d..8034c78 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -119,8 +119,7 @@ xfs_setfilesize_trans_alloc(
* We may pass freeze protection with a transaction. So tell lockdep
* we released it.
*/
- rwsem_release(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
- 1, _THIS_IP_);
+ __sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
/*
* We hand off the transaction to the completion thread now, so
* clear the flag here.
@@ -171,8 +170,7 @@ xfs_setfilesize_ioend(
* Similarly for freeze protection.
*/
current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
- rwsem_acquire_read(&VFS_I(ip)->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
- 0, 1, _THIS_IP_);
+ __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);

return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e..78ac768 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1362,6 +1362,11 @@ extern struct timespec current_fs_time(struct super_block *sb);
void __sb_end_write(struct super_block *sb, int level);
int __sb_start_write(struct super_block *sb, int level, bool wait);

+#define __sb_writers_acquired(sb, lev) \
+ rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
+#define __sb_writers_release(sb, lev) \
+ rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
+
/**
* sb_end_write - drop write access to a superblock
* @sb: the super we wrote to
--
1.5.5.1

2015-08-14 17:24:05

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 2/8] fix the broken lockdep logic in __sb_start_write()

1. wait_event(frozen < level) without rwsem_acquire_read() is just
wrong from lockdep perspective. If we are going to deadlock
because the caller is buggy, lockdep can't detect this problem.

2. __sb_start_write() can race with thaw_super() + freeze_super(),
and after "goto retry" the 2nd acquire_freeze_lock() is wrong.

3. The "tell lockdep we are doing trylock" hack doesn't look nice.

I think this is correct, but this logic should be more explicit.
Yes, the recursive read_lock() is fine if we hold the lock on a
higher level. But we do not need to fool lockdep. If we can not
deadlock in this case then try-lock must not fail and we can use
use wait == F throughout this code.

Note: as Dave Chinner explains, the "trylock" hack and the fat comment
can be probably removed. But this needs a separate change and it will
be trivial: just kill __sb_start_write() and rename do_sb_start_write()
back to __sb_start_write().

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/super.c | 73 ++++++++++++++++++++++++++++++++---------------------------
1 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 928c20f..d0fdd49 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1158,38 +1158,11 @@ void __sb_end_write(struct super_block *sb, int level)
}
EXPORT_SYMBOL(__sb_end_write);

-#ifdef CONFIG_LOCKDEP
-/*
- * We want lockdep to tell us about possible deadlocks with freezing but
- * it's it bit tricky to properly instrument it. Getting a freeze protection
- * works as getting a read lock but there are subtle problems. XFS for example
- * gets freeze protection on internal level twice in some cases, which is OK
- * only because we already hold a freeze protection also on higher level. Due
- * to these cases we have to tell lockdep we are doing trylock when we
- * already hold a freeze protection for a higher freeze level.
- */
-static void acquire_freeze_lock(struct super_block *sb, int level, bool trylock,
+static int do_sb_start_write(struct super_block *sb, int level, bool wait,
unsigned long ip)
{
- int i;
-
- if (!trylock) {
- for (i = 0; i < level - 1; i++)
- if (lock_is_held(&sb->s_writers.lock_map[i])) {
- trylock = true;
- break;
- }
- }
- rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, trylock, ip);
-}
-#endif
-
-/*
- * This is an internal function, please use sb_start_{write,pagefault,intwrite}
- * instead.
- */
-int __sb_start_write(struct super_block *sb, int level, bool wait)
-{
+ if (wait)
+ rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
retry:
if (unlikely(sb->s_writers.frozen >= level)) {
if (!wait)
@@ -1198,9 +1171,6 @@ retry:
sb->s_writers.frozen < level);
}

-#ifdef CONFIG_LOCKDEP
- acquire_freeze_lock(sb, level, !wait, _RET_IP_);
-#endif
percpu_counter_inc(&sb->s_writers.counter[level-1]);
/*
* Make sure counter is updated before we check for frozen.
@@ -1211,8 +1181,45 @@ retry:
__sb_end_write(sb, level);
goto retry;
}
+
+ if (!wait)
+ rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
return 1;
}
+
+/*
+ * This is an internal function, please use sb_start_{write,pagefault,intwrite}
+ * instead.
+ */
+int __sb_start_write(struct super_block *sb, int level, bool wait)
+{
+ bool force_trylock = false;
+ int ret;
+
+#ifdef CONFIG_LOCKDEP
+ /*
+ * We want lockdep to tell us about possible deadlocks with freezing
+ * but it's it bit tricky to properly instrument it. Getting a freeze
+ * protection works as getting a read lock but there are subtle
+ * problems. XFS for example gets freeze protection on internal level
+ * twice in some cases, which is OK only because we already hold a
+ * freeze protection also on higher level. Due to these cases we have
+ * to use wait == F (trylock mode) which must not fail.
+ */
+ if (wait) {
+ int i;
+
+ for (i = 0; i < level - 1; i++)
+ if (lock_is_held(&sb->s_writers.lock_map[i])) {
+ force_trylock = true;
+ break;
+ }
+ }
+#endif
+ ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
+ WARN_ON(force_trylock & !ret);
+ return ret;
+}
EXPORT_SYMBOL(__sb_start_write);

/**
--
1.5.5.1

2015-08-14 17:22:16

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 3/8] document rwsem_release() in sb_wait_write()

Not only we need to avoid the warning from lockdep_sys_exit(), the
caller of freeze_super() can never release this lock. Another thread
can do this, so there is another reason for rwsem_release().

Plus the comment should explain why we have to fool lockdep.

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/super.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index d0fdd49..89b58fb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,11 +1236,17 @@ static void sb_wait_write(struct super_block *sb, int level)
{
s64 writers;

+ rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
/*
- * We just cycle-through lockdep here so that it does not complain
- * about returning with lock to userspace
+ * We are going to return to userspace and forget about this lock, the
+ * ownership goes to the caller of thaw_super() which does unlock.
+ *
+ * FIXME: we should do this before return from freeze_super() after we
+ * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
+ * should re-acquire these locks before s_op->unfreeze_fs(sb). However
+ * this leads to lockdep false-positives, so currently we do the early
+ * release right after acquire.
*/
- rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);

do {
--
1.5.5.1

2015-08-14 17:22:23

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 4/8] percpu-rwsem: introduce percpu_down_read_trylock()

Add percpu_down_read_trylock(), it will have the user soon.

Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/percpu-rwsem.h | 1 +
kernel/locking/percpu-rwsem.c | 13 +++++++++++++
2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3e58226..3ebf982 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -17,6 +17,7 @@ struct percpu_rw_semaphore {
};

extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern int percpu_down_read_trylock(struct percpu_rw_semaphore *);
extern void percpu_up_read(struct percpu_rw_semaphore *);

extern void percpu_down_write(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 2c54c64..4bc2127 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -73,6 +73,19 @@ void percpu_down_read(struct percpu_rw_semaphore *brw)
__up_read(&brw->rw_sem);
}

+int percpu_down_read_trylock(struct percpu_rw_semaphore *brw)
+{
+ if (unlikely(!update_fast_ctr(brw, +1))) {
+ if (!__down_read_trylock(&brw->rw_sem))
+ return 0;
+ atomic_inc(&brw->slow_read_ctr);
+ __up_read(&brw->rw_sem);
+ }
+
+ rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 1, _RET_IP_);
+ return 1;
+}
+
void percpu_up_read(struct percpu_rw_semaphore *brw)
{
rwsem_release(&brw->rw_sem.dep_map, 1, _RET_IP_);
--
1.5.5.1

2015-08-14 17:22:21

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 5/8] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()

Add percpu_rwsem_release() and percpu_rwsem_acquire() for the users
which need to return to userspace with percpu-rwsem lock held and/or
pass the ownership to another thread.

TODO: change percpu_rwsem_release() to use rwsem_clear_owner(). We can
either fold kernel/locking/rwsem.h into include/linux/rwsem.h, or add
the non-inline percpu_rwsem_clear_owner().

Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/percpu-rwsem.h | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3ebf982..06af654 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -33,4 +33,23 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
__percpu_init_rwsem(brw, #brw, &rwsem_key); \
})

+
+#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
+
+static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
+ bool read, unsigned long ip)
+{
+ lock_release(&sem->rw_sem.dep_map, 1, ip);
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+ if (!read)
+ sem->rw_sem.owner = NULL;
+#endif
+}
+
+static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
+ bool read, unsigned long ip)
+{
+ lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
+}
+
#endif
--
1.5.5.1

2015-08-14 17:22:25

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 6/8] percpu-rwsem: kill CONFIG_PERCPU_RWSEM

Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional
user of percpu_rw_semaphore.

Signed-off-by: Oleg Nesterov <[email protected]>
---
arch/Kconfig | 1 -
init/Kconfig | 1 -
kernel/locking/Makefile | 3 +--
lib/Kconfig | 3 ---
4 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb..94d6471 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -87,7 +87,6 @@ config KPROBES_ON_FTRACE

config UPROBES
def_bool n
- select PERCPU_RWSEM
help
Uprobes is the user-space counterpart to kprobes: they
enable instrumentation applications (such as 'perf probe')
diff --git a/init/Kconfig b/init/Kconfig
index b9b824b..dc24dec 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -938,7 +938,6 @@ config NUMA_BALANCING_DEFAULT_ENABLED
menuconfig CGROUPS
bool "Control Group support"
select KERNFS
- select PERCPU_RWSEM
help
This option adds support for grouping sets of processes together, for
use with process control subsystems such as Cpusets, CFS, memory
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index de7a416..a05a29f 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -1,5 +1,5 @@

-obj-y += mutex.o semaphore.o rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o

ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
@@ -24,6 +24,5 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
-obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
diff --git a/lib/Kconfig b/lib/Kconfig
index 601965a..597aa66 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -53,9 +53,6 @@ config GENERIC_IO
config STMP_DEVICE
bool

-config PERCPU_RWSEM
- bool
-
config ARCH_USE_CMPXCHG_LOCKREF
bool

--
1.5.5.1

2015-08-14 17:22:52

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 7/8] shift percpu_counter_destroy() into destroy_super_work()

Of course, this patch is ugly as hell. It will be (partially)
reverted later. We add it to ensure that other WIP changes in
percpu_rw_semaphore won't break fs/super.c.

We do not even need this change right now, percpu_free_rwsem()
is fine in atomic context. But we are going to change this, it
will be might_sleep() after we merge the rcu_sync() patches.

And even after that we do not really need destroy_super_work(),
we will kill it in any case. Instead, destroy_super_rcu() should
just check that rss->cb_state == CB_IDLE and do call_rcu() again
in the (very unlikely) case this is not true.

So this is just the temporary kludge which helps us to avoid the
conflicts with the changes which will be (hopefully) routed via
rcu tree.

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/super.c | 23 +++++++++++++++++++----
include/linux/fs.h | 3 ++-
2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 89b58fb..75436e2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -135,6 +135,24 @@ static unsigned long super_cache_count(struct shrinker *shrink,
return total_objects;
}

+static void destroy_super_work(struct work_struct *work)
+{
+ struct super_block *s = container_of(work, struct super_block,
+ destroy_work);
+ int i;
+
+ for (i = 0; i < SB_FREEZE_LEVELS; i++)
+ percpu_counter_destroy(&s->s_writers.counter[i]);
+ kfree(s);
+}
+
+static void destroy_super_rcu(struct rcu_head *head)
+{
+ struct super_block *s = container_of(head, struct super_block, rcu);
+ INIT_WORK(&s->destroy_work, destroy_super_work);
+ schedule_work(&s->destroy_work);
+}
+
/**
* destroy_super - frees a superblock
* @s: superblock to free
@@ -143,16 +161,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
*/
static void destroy_super(struct super_block *s)
{
- int i;
list_lru_destroy(&s->s_dentry_lru);
list_lru_destroy(&s->s_inode_lru);
- for (i = 0; i < SB_FREEZE_LEVELS; i++)
- percpu_counter_destroy(&s->s_writers.counter[i]);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
kfree(s->s_options);
- kfree_rcu(s, rcu);
+ call_rcu(&s->rcu, destroy_super_rcu);
}

/**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 78ac768..6addccc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/workqueue.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1346,7 +1347,7 @@ struct super_block {
struct list_lru s_dentry_lru ____cacheline_aligned_in_smp;
struct list_lru s_inode_lru ____cacheline_aligned_in_smp;
struct rcu_head rcu;
-
+ struct work_struct destroy_work;
/*
* Indicates how deep in a filesystem stack this SB is
*/
--
1.5.5.1

2015-08-14 17:22:32

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3 8/8] change sb_writers to use percpu_rw_semaphore

We can remove everything from struct sb_writers except frozen
and add the array of percpu_rw_semaphore's instead.

This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
it for get_super_thawed(). We will probably remove it later.

This change tries to address the following problems:

- Firstly, __sb_start_write() looks simply buggy. It does
__sb_end_write() if it sees ->frozen, but if it migrates
to another CPU before percpu_counter_dec(), sb_wait_write()
can wrongly succeed if there is another task which holds
the same "semaphore": sb_wait_write() can miss the result
of the previous percpu_counter_inc() but see the result
of this percpu_counter_dec().

- As Dave Hansen reports, it is suboptimal. The trivial
microbenchmark that writes to a tmpfs file in a loop runs
12% faster if we change this code to rely on RCU and kill
the memory barriers.

- This code doesn't look simple. It would be better to rely
on the generic locking code.

According to Dave, this change adds the same performance
improvement.

Note: with this change both freeze_super() and thaw_super() will do
synchronize_sched_expedited() 3 times. This is just ugly. But:

- This will be "fixed" by the rcu_sync changes we are going
to merge. After that freeze_super()->percpu_down_write()
will use synchronize_sched(), and thaw_super() won't use
synchronize() at all.

This doesn't need any changes in fs/super.c.

- Once we merge rcu_sync changes, we can also change super.c
so that all wb_write->rw_sem's will share the single ->rss
in struct sb_writes, then freeze_super() will need only one
synchronize_sched().

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/super.c | 111 ++++++++++++++--------------------------------------
include/linux/fs.h | 19 +++------
2 files changed, 36 insertions(+), 94 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 75436e2..4350ff4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -142,7 +142,7 @@ static void destroy_super_work(struct work_struct *work)
int i;

for (i = 0; i < SB_FREEZE_LEVELS; i++)
- percpu_counter_destroy(&s->s_writers.counter[i]);
+ percpu_free_rwsem(&s->s_writers.rw_sem[i]);
kfree(s);
}

@@ -193,13 +193,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
goto fail;

for (i = 0; i < SB_FREEZE_LEVELS; i++) {
- if (percpu_counter_init(&s->s_writers.counter[i], 0,
- GFP_KERNEL) < 0)
+ if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
+ sb_writers_name[i],
+ &type->s_writers_key[i]))
goto fail;
- lockdep_init_map(&s->s_writers.lock_map[i], sb_writers_name[i],
- &type->s_writers_key[i], 0);
}
- init_waitqueue_head(&s->s_writers.wait);
init_waitqueue_head(&s->s_writers.wait_unfrozen);
s->s_bdi = &noop_backing_dev_info;
s->s_flags = flags;
@@ -1161,47 +1159,10 @@ out:
*/
void __sb_end_write(struct super_block *sb, int level)
{
- percpu_counter_dec(&sb->s_writers.counter[level-1]);
- /*
- * Make sure s_writers are updated before we wake up waiters in
- * freeze_super().
- */
- smp_mb();
- if (waitqueue_active(&sb->s_writers.wait))
- wake_up(&sb->s_writers.wait);
- rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
+ percpu_up_read(sb->s_writers.rw_sem + level-1);
}
EXPORT_SYMBOL(__sb_end_write);

-static int do_sb_start_write(struct super_block *sb, int level, bool wait,
- unsigned long ip)
-{
- if (wait)
- rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
-retry:
- if (unlikely(sb->s_writers.frozen >= level)) {
- if (!wait)
- return 0;
- wait_event(sb->s_writers.wait_unfrozen,
- sb->s_writers.frozen < level);
- }
-
- percpu_counter_inc(&sb->s_writers.counter[level-1]);
- /*
- * Make sure counter is updated before we check for frozen.
- * freeze_super() first sets frozen and then checks the counter.
- */
- smp_mb();
- if (unlikely(sb->s_writers.frozen >= level)) {
- __sb_end_write(sb, level);
- goto retry;
- }
-
- if (!wait)
- rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
- return 1;
-}
-
/*
* This is an internal function, please use sb_start_{write,pagefault,intwrite}
* instead.
@@ -1209,7 +1170,7 @@ retry:
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
bool force_trylock = false;
- int ret;
+ int ret = 1;

#ifdef CONFIG_LOCKDEP
/*
@@ -1225,13 +1186,17 @@ int __sb_start_write(struct super_block *sb, int level, bool wait)
int i;

for (i = 0; i < level - 1; i++)
- if (lock_is_held(&sb->s_writers.lock_map[i])) {
+ if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
force_trylock = true;
break;
}
}
#endif
- ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
+ if (wait && !force_trylock)
+ percpu_down_read(sb->s_writers.rw_sem + level-1);
+ else
+ ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
+
WARN_ON(force_trylock & !ret);
return ret;
}
@@ -1243,15 +1208,11 @@ EXPORT_SYMBOL(__sb_start_write);
* @level: type of writers we wait for (normal vs page fault)
*
* This function waits until there are no writers of given type to given file
- * system. Caller of this function should make sure there can be no new writers
- * of type @level before calling this function. Otherwise this function can
- * livelock.
+ * system.
*/
static void sb_wait_write(struct super_block *sb, int level)
{
- s64 writers;
-
- rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
+ percpu_down_write(sb->s_writers.rw_sem + level-1);
/*
* We are going to return to userspace and forget about this lock, the
* ownership goes to the caller of thaw_super() which does unlock.
@@ -1262,24 +1223,18 @@ static void sb_wait_write(struct super_block *sb, int level)
* this leads to lockdep false-positives, so currently we do the early
* release right after acquire.
*/
- rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
-
- do {
- DEFINE_WAIT(wait);
+ percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
+}

- /*
- * We use a barrier in prepare_to_wait() to separate setting
- * of frozen and checking of the counter
- */
- prepare_to_wait(&sb->s_writers.wait, &wait,
- TASK_UNINTERRUPTIBLE);
+static void sb_freeze_unlock(struct super_block *sb)
+{
+ int level;

- writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
- if (writers)
- schedule();
+ for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+ percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);

- finish_wait(&sb->s_writers.wait, &wait);
- } while (writers);
+ for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+ percpu_up_write(sb->s_writers.rw_sem + level);
}

/**
@@ -1338,20 +1293,14 @@ int freeze_super(struct super_block *sb)
return 0;
}

- /* From now on, no new normal writers can start */
sb->s_writers.frozen = SB_FREEZE_WRITE;
- smp_wmb();
-
/* Release s_umount to preserve sb_start_write -> s_umount ordering */
up_write(&sb->s_umount);
-
sb_wait_write(sb, SB_FREEZE_WRITE);
+ down_write(&sb->s_umount);

/* Now we go and block page faults... */
- down_write(&sb->s_umount);
sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
- smp_wmb();
-
sb_wait_write(sb, SB_FREEZE_PAGEFAULT);

/* All writers are done so after syncing there won't be dirty data */
@@ -1359,7 +1308,6 @@ int freeze_super(struct super_block *sb)

/* Now wait for internal filesystem counter */
sb->s_writers.frozen = SB_FREEZE_FS;
- smp_wmb();
sb_wait_write(sb, SB_FREEZE_FS);

if (sb->s_op->freeze_fs) {
@@ -1368,7 +1316,7 @@ int freeze_super(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem freeze failed\n");
sb->s_writers.frozen = SB_UNFROZEN;
- smp_wmb();
+ sb_freeze_unlock(sb);
wake_up(&sb->s_writers.wait_unfrozen);
deactivate_locked_super(sb);
return ret;
@@ -1400,8 +1348,10 @@ int thaw_super(struct super_block *sb)
return -EINVAL;
}

- if (sb->s_flags & MS_RDONLY)
+ if (sb->s_flags & MS_RDONLY) {
+ sb->s_writers.frozen = SB_UNFROZEN;
goto out;
+ }

if (sb->s_op->unfreeze_fs) {
error = sb->s_op->unfreeze_fs(sb);
@@ -1413,12 +1363,11 @@ int thaw_super(struct super_block *sb)
}
}

-out:
sb->s_writers.frozen = SB_UNFROZEN;
- smp_wmb();
+ sb_freeze_unlock(sb);
+out:
wake_up(&sb->s_writers.wait_unfrozen);
deactivate_locked_super(sb);
-
return 0;
}
EXPORT_SYMBOL(thaw_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6addccc..fb2cb4a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1,7 +1,6 @@
#ifndef _LINUX_FS_H
#define _LINUX_FS_H

-
#include <linux/linkage.h>
#include <linux/wait.h>
#include <linux/kdev_t.h>
@@ -31,6 +30,7 @@
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
#include <linux/workqueue.h>
+#include <linux/percpu-rwsem.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1247,16 +1247,9 @@ enum {
#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)

struct sb_writers {
- /* Counters for counting writers at each level */
- struct percpu_counter counter[SB_FREEZE_LEVELS];
- wait_queue_head_t wait; /* queue for waiting for
- writers / faults to finish */
- int frozen; /* Is sb frozen? */
- wait_queue_head_t wait_unfrozen; /* queue for waiting for
- sb to be thawed */
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map lock_map[SB_FREEZE_LEVELS];
-#endif
+ int frozen; /* Is sb frozen? */
+ wait_queue_head_t wait_unfrozen; /* for get_super_thawed() */
+ struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
};

struct super_block {
@@ -1364,9 +1357,9 @@ void __sb_end_write(struct super_block *sb, int level);
int __sb_start_write(struct super_block *sb, int level, bool wait);

#define __sb_writers_acquired(sb, lev) \
- rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
+ percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
#define __sb_writers_release(sb, lev) \
- rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
+ percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)

/**
* sb_end_write - drop write access to a superblock
--
1.5.5.1

2015-08-15 07:17:17

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] change sb_writers to use percpu_rw_semaphore

On Fri, Aug 14, 2015 at 07:19:35PM +0200, Oleg Nesterov wrote:
> On 08/13, Jan Kara wrote:
> >
> > Regarding the routing, ideally Al Viro should take these as a VFS
> > maintainer.
>
> Al, could you take these patches?

I can live with that. Do you have it as a branch in a public git tree?

2015-08-15 12:05:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] change sb_writers to use percpu_rw_semaphore

On 08/15, Al Viro wrote:
>
> On Fri, Aug 14, 2015 at 07:19:35PM +0200, Oleg Nesterov wrote:
> > On 08/13, Jan Kara wrote:
> > >
> > > Regarding the routing, ideally Al Viro should take these as a VFS
> > > maintainer.
> >
> > Al, could you take these patches?
>
> I can live with that. Do you have it as a branch in a public git tree?

Just created

git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc sb_writers_pcpu_rwsem

Based on Linus's tree.

Thanks!

Oleg.
---

The following changes since commit 45e38cff4fce8d6871b5fa5e734e4dc9814d6056:
Linus Torvalds (1):
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc sb_writers_pcpu_rwsem

Oleg Nesterov (8):
introduce __sb_writers_{acquired,release}() helpers
fix the broken lockdep logic in __sb_start_write()
document rwsem_release() in sb_wait_write()
percpu-rwsem: introduce percpu_down_read_trylock()
percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
percpu-rwsem: kill CONFIG_PERCPU_RWSEM
shift percpu_counter_destroy() into destroy_super_work()
change sb_writers to use percpu_rw_semaphore

arch/Kconfig | 1 -
fs/btrfs/transaction.c | 8 +--
fs/super.c | 171 ++++++++++++++++++-----------------------
fs/xfs/xfs_aops.c | 6 +-
include/linux/fs.h | 23 +++---
include/linux/percpu-rwsem.h | 20 +++++
init/Kconfig | 1 -
kernel/locking/Makefile | 3 +-
kernel/locking/percpu-rwsem.c | 13 +++
lib/Kconfig | 3 -
10 files changed, 123 insertions(+), 126 deletions(-)

2015-08-16 13:47:22

by Arthur Marsh

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] change sb_writers to use percpu_rw_semaphore

Oleg Nesterov wrote on 15/08/15 02:49:
> On 08/13, Jan Kara wrote:
>>
>> Regarding the routing, ideally Al Viro should take these as a VFS
>> maintainer.
>
> Al, could you take these patches?
>
> Only cosmetic changes in V3 to address the comments from Jan, I
> preserved his acks.
>
> In case you missed all the spam I sent before, let me repeat that
> the awful (and currently unneeded) 7/8 will be reverted later. We
> need it to ensure that other percpu_rw_semaphore changes routed
> via another tree won't break fs/super.c. After that we will add
> rcu_sync_dtor(s_writers->rw_sem) into deactivate_locked_super()
> and revert this horror.
>
> 3/8 documents the lockdep problems we currently have. This is fixed
> by the patch below but it depends on xfs ILOCK fixes from Dave, so
> I will send it later. Plus another patch which removes the "trylock"
> hack in __sb_start_write().
>
> Oleg.

Would these patches address what I've seen in the last day or so using
Linus' git head kernel and seeing problems like:

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.2.0-rc6+ (root@victoria) (gcc version
5.2.1 20150808 (Debian 5.2.1-15) ) #11 SMP PREEMPT Sun Aug 16 07:27:00
ACST 2015
...
[ 6000.096107] INFO: task basename:7796 blocked for more than 120 seconds.
[ 6000.096116] Not tainted 4.2.0-rc6+ #11
[ 6000.096120] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 6000.096123] basename D e7b5b180 0 7796 6936 0x00000000
[ 6000.096132] c0379a84 00000086 c11127a5 e7b5b180 e7b5b5ec 2e0a5fb9
00000557 f5f0b310
[ 6000.096143] f330b180 e7b5b180 c037a000 f5f0b300 7fffffff c0379a90
c155b740 00000000
[ 6000.096154] c0379b04 c155fa1d 00000046 c11127a5 00000246 00000000
c0379ab0 c10a569b
[ 6000.096164] Call Trace:
[ 6000.096174] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
[ 6000.096179] [<c155b740>] schedule+0x30/0x80
[ 6000.096184] [<c155fa1d>] schedule_timeout+0x2cd/0x5c0
[ 6000.096188] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
[ 6000.096193] [<c10a569b>] ? trace_hardirqs_on+0xb/0x10
[ 6000.096198] [<c10d701c>] ? ktime_get+0xac/0x1a0
[ 6000.096202] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
[ 6000.096206] [<c155ad99>] io_schedule_timeout+0x89/0xf0
[ 6000.096209] [<c155beb0>] ? bit_wait+0x40/0x40
[ 6000.096213] [<c155bed5>] bit_wait_io+0x25/0x50
[ 6000.096216] [<c155bbc9>] __wait_on_bit+0x49/0x70
[ 6000.096219] [<c155beb0>] ? bit_wait+0x40/0x40
[ 6000.096223] [<c155bc4d>] out_of_line_wait_on_bit+0x5d/0x70
[ 6000.096226] [<c155beb0>] ? bit_wait+0x40/0x40
[ 6000.096230] [<c109b210>] ? autoremove_wake_function+0x40/0x40
[ 6000.096236] [<c11ed5be>] bh_submit_read+0x7e/0x90
[ 6000.096265] [<f8321354>] ext4_get_branch+0xa4/0x110 [ext4]
[ 6000.096286] [<f8321f14>] ext4_ind_map_blocks+0xd4/0xe30 [ext4]
[ 6000.096291] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
[ 6000.096295] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
[ 6000.096300] [<c155ed53>] ? down_read+0x33/0x50
[ 6000.096315] [<f82ddc9d>] ext4_map_blocks+0x29d/0x4f0 [ext4]
[ 6000.096319] [<c10a548b>] ? mark_held_locks+0x5b/0x90
[ 6000.096323] [<c10a55ec>] ? trace_hardirqs_on_caller+0x12c/0x1d0
[ 6000.096337] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
[ 6000.096358] [<f832d37b>] ext4_mpage_readpages+0x30b/0x8c0 [ext4]
[ 6000.096372] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
[ 6000.096377] [<c1156930>] ? __alloc_pages_nodemask+0x9c0/0xa40
[ 6000.096383] [<c107ed46>] ? preempt_count_sub+0x26/0x70
[ 6000.096397] [<f82db052>] ext4_readpages+0x32/0x40 [ext4]
[ 6000.096411] [<f82db020>] ? do_journal_get_write_access+0xb0/0xb0 [ext4]
[ 6000.096416] [<c115c376>] __do_page_cache_readahead+0x2e6/0x370
[ 6000.096420] [<c115c233>] ? __do_page_cache_readahead+0x1a3/0x370
[ 6000.096426] [<c114fb85>] filemap_fault+0x505/0x570
[ 6000.096430] [<c117bb6f>] ? __do_fault+0x2f/0x80
[ 6000.096435] [<c117bb6f>] __do_fault+0x2f/0x80
[ 6000.096439] [<c1560ca7>] ? _raw_spin_unlock+0x27/0x50
[ 6000.096443] [<c117f412>] handle_mm_fault+0xb22/0x11d0
[ 6000.096448] [<c104aa7a>] __do_page_fault+0x16a/0x500
[ 6000.096452] [<c104ae10>] ? __do_page_fault+0x500/0x500
[ 6000.096456] [<c104ae31>] do_page_fault+0x21/0x30
[ 6000.096460] [<c156282b>] error_code+0x5f/0x64
[ 6000.096464] [<c104ae10>] ? __do_page_fault+0x500/0x500
[ 6000.096468] 2 locks held by basename/7796:
[ 6000.096470] #0: (&mm->mmap_sem){++++++}, at: [<c104aa25>]
__do_page_fault+0x115/0x500
[ 6000.096479] #1: (&ei->i_data_sem){++++..}, at: [<f82ddd9b>]
ext4_map_blocks+0x39b/0x4f0 [ext4]
[ 6000.096500] INFO: task hddtemp:7797 blocked for more than 120 seconds.
[ 6000.096503] Not tainted 4.2.0-rc6+ #11
[ 6000.096505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 6000.096508] hddtemp D e896d100 0 7797 5140 0x00000000
[ 6000.096514] c02c3a84 00000086 e896d588 e896d100 e896d56c 00000001
c02c3a84 f5f0b310
[ 6000.096525] c176fb00 e896d100 c02c4000 f5f0b300 7fffffff c02c3a90
c155b740 00000000
[ 6000.096535] c02c3b04 c155fa1d 00000046 c11127a5 00000246 00000000
c02c3ab0 c10a569b
[ 6000.096546] Call Trace:
[ 6000.096550] [<c155b740>] schedule+0x30/0x80
[ 6000.096554] [<c155fa1d>] schedule_timeout+0x2cd/0x5c0
[ 6000.096558] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
[ 6000.096562] [<c10a569b>] ? trace_hardirqs_on+0xb/0x10
[ 6000.096566] [<c10d701c>] ? ktime_get+0xac/0x1a0
[ 6000.096569] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
[ 6000.096574] [<c155ad99>] io_schedule_timeout+0x89/0xf0
[ 6000.096577] [<c109ad07>] ? prepare_to_wait_exclusive+0x47/0x80
[ 6000.096581] [<c155beb0>] ? bit_wait+0x40/0x40
[ 6000.096584] [<c155bed5>] bit_wait_io+0x25/0x50
[ 6000.096587] [<c155bd12>] __wait_on_bit_lock+0x32/0x80
[ 6000.096591] [<c155bdbd>] out_of_line_wait_on_bit_lock+0x5d/0x70
[ 6000.096595] [<c155beb0>] ? bit_wait+0x40/0x40
[ 6000.096598] [<c109b210>] ? autoremove_wake_function+0x40/0x40
[ 6000.096602] [<c11ea166>] bh_uptodate_or_lock+0x66/0x70
[ 6000.096623] [<f8321349>] ext4_get_branch+0x99/0x110 [ext4]
[ 6000.096643] [<f8321f14>] ext4_ind_map_blocks+0xd4/0xe30 [ext4]
[ 6000.096647] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
[ 6000.096651] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
[ 6000.096656] [<c155ed53>] ? down_read+0x33/0x50
[ 6000.096671] [<f82ddc9d>] ext4_map_blocks+0x29d/0x4f0 [ext4]
[ 6000.096675] [<c10a548b>] ? mark_held_locks+0x5b/0x90
[ 6000.096679] [<c10a55ec>] ? trace_hardirqs_on_caller+0x12c/0x1d0
[ 6000.096693] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
[ 6000.096713] [<f832d37b>] ext4_mpage_readpages+0x30b/0x8c0 [ext4]
[ 6000.096727] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
[ 6000.096732] [<c1156930>] ? __alloc_pages_nodemask+0x9c0/0xa40
[ 6000.096747] [<f82db052>] ext4_readpages+0x32/0x40 [ext4]
[ 6000.096761] [<f82db020>] ? do_journal_get_write_access+0xb0/0xb0 [ext4]
[ 6000.096766] [<c115c376>] __do_page_cache_readahead+0x2e6/0x370
[ 6000.096770] [<c115c233>] ? __do_page_cache_readahead+0x1a3/0x370
[ 6000.096775] [<c114fb85>] filemap_fault+0x505/0x570
[ 6000.096779] [<c117bb6f>] ? __do_fault+0x2f/0x80
[ 6000.096783] [<c117bb6f>] __do_fault+0x2f/0x80
[ 6000.096787] [<c1560ca7>] ? _raw_spin_unlock+0x27/0x50
[ 6000.096791] [<c117f412>] handle_mm_fault+0xb22/0x11d0
[ 6000.096796] [<c104aa7a>] __do_page_fault+0x16a/0x500
[ 6000.096800] [<c104ae10>] ? __do_page_fault+0x500/0x500
[ 6000.096803] [<c104ae31>] do_page_fault+0x21/0x30
[ 6000.096807] [<c156282b>] error_code+0x5f/0x64
[ 6000.096811] [<c104ae10>] ? __do_page_fault+0x500/0x500
[ 6000.096815] 2 locks held by hddtemp/7797:
[ 6000.096817] #0: (&mm->mmap_sem){++++++}, at: [<c104aa25>]
__do_page_fault+0x115/0x500
[ 6000.096825] #1: (&ei->i_data_sem){++++..}, at: [<f82ddd9b>]
ext4_map_blocks+0x39b/0x4f0 [ext4]

2015-08-17 11:37:52

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] change sb_writers to use percpu_rw_semaphore

On 08/16, Arthur Marsh wrote:
>
> Would these patches address what I've seen in the last day or so using
> Linus' git head kernel and seeing problems like:

No, this series shouldn't make any difference.



> [ 0.000000] Linux version 4.2.0-rc6+ (root@victoria) (gcc version
> 5.2.1 20150808 (Debian 5.2.1-15) ) #11 SMP PREEMPT Sun Aug 16 07:27:00
> ACST 2015
> ...
> [ 6000.096107] INFO: task basename:7796 blocked for more than 120 seconds.
> [ 6000.096116] Not tainted 4.2.0-rc6+ #11
> [ 6000.096120] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 6000.096123] basename D e7b5b180 0 7796 6936 0x00000000
> [ 6000.096132] c0379a84 00000086 c11127a5 e7b5b180 e7b5b5ec 2e0a5fb9
> 00000557 f5f0b310
> [ 6000.096143] f330b180 e7b5b180 c037a000 f5f0b300 7fffffff c0379a90
> c155b740 00000000
> [ 6000.096154] c0379b04 c155fa1d 00000046 c11127a5 00000246 00000000
> c0379ab0 c10a569b
> [ 6000.096164] Call Trace:
> [ 6000.096174] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
> [ 6000.096179] [<c155b740>] schedule+0x30/0x80
> [ 6000.096184] [<c155fa1d>] schedule_timeout+0x2cd/0x5c0
> [ 6000.096188] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
> [ 6000.096193] [<c10a569b>] ? trace_hardirqs_on+0xb/0x10
> [ 6000.096198] [<c10d701c>] ? ktime_get+0xac/0x1a0
> [ 6000.096202] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
> [ 6000.096206] [<c155ad99>] io_schedule_timeout+0x89/0xf0
> [ 6000.096209] [<c155beb0>] ? bit_wait+0x40/0x40
> [ 6000.096213] [<c155bed5>] bit_wait_io+0x25/0x50
> [ 6000.096216] [<c155bbc9>] __wait_on_bit+0x49/0x70
> [ 6000.096219] [<c155beb0>] ? bit_wait+0x40/0x40
> [ 6000.096223] [<c155bc4d>] out_of_line_wait_on_bit+0x5d/0x70
> [ 6000.096226] [<c155beb0>] ? bit_wait+0x40/0x40
> [ 6000.096230] [<c109b210>] ? autoremove_wake_function+0x40/0x40
> [ 6000.096236] [<c11ed5be>] bh_submit_read+0x7e/0x90
> [ 6000.096265] [<f8321354>] ext4_get_branch+0xa4/0x110 [ext4]
> [ 6000.096286] [<f8321f14>] ext4_ind_map_blocks+0xd4/0xe30 [ext4]
> [ 6000.096291] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
> [ 6000.096295] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
> [ 6000.096300] [<c155ed53>] ? down_read+0x33/0x50
> [ 6000.096315] [<f82ddc9d>] ext4_map_blocks+0x29d/0x4f0 [ext4]
> [ 6000.096319] [<c10a548b>] ? mark_held_locks+0x5b/0x90
> [ 6000.096323] [<c10a55ec>] ? trace_hardirqs_on_caller+0x12c/0x1d0
> [ 6000.096337] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
> [ 6000.096358] [<f832d37b>] ext4_mpage_readpages+0x30b/0x8c0 [ext4]
> [ 6000.096372] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
> [ 6000.096377] [<c1156930>] ? __alloc_pages_nodemask+0x9c0/0xa40
> [ 6000.096383] [<c107ed46>] ? preempt_count_sub+0x26/0x70
> [ 6000.096397] [<f82db052>] ext4_readpages+0x32/0x40 [ext4]
> [ 6000.096411] [<f82db020>] ? do_journal_get_write_access+0xb0/0xb0 [ext4]
> [ 6000.096416] [<c115c376>] __do_page_cache_readahead+0x2e6/0x370
> [ 6000.096420] [<c115c233>] ? __do_page_cache_readahead+0x1a3/0x370
> [ 6000.096426] [<c114fb85>] filemap_fault+0x505/0x570
> [ 6000.096430] [<c117bb6f>] ? __do_fault+0x2f/0x80
> [ 6000.096435] [<c117bb6f>] __do_fault+0x2f/0x80
> [ 6000.096439] [<c1560ca7>] ? _raw_spin_unlock+0x27/0x50
> [ 6000.096443] [<c117f412>] handle_mm_fault+0xb22/0x11d0
> [ 6000.096448] [<c104aa7a>] __do_page_fault+0x16a/0x500
> [ 6000.096452] [<c104ae10>] ? __do_page_fault+0x500/0x500
> [ 6000.096456] [<c104ae31>] do_page_fault+0x21/0x30
> [ 6000.096460] [<c156282b>] error_code+0x5f/0x64
> [ 6000.096464] [<c104ae10>] ? __do_page_fault+0x500/0x500
> [ 6000.096468] 2 locks held by basename/7796:
> [ 6000.096470] #0: (&mm->mmap_sem){++++++}, at: [<c104aa25>]
> __do_page_fault+0x115/0x500
> [ 6000.096479] #1: (&ei->i_data_sem){++++..}, at: [<f82ddd9b>]
> ext4_map_blocks+0x39b/0x4f0 [ext4]
> [ 6000.096500] INFO: task hddtemp:7797 blocked for more than 120 seconds.
> [ 6000.096503] Not tainted 4.2.0-rc6+ #11
> [ 6000.096505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 6000.096508] hddtemp D e896d100 0 7797 5140 0x00000000
> [ 6000.096514] c02c3a84 00000086 e896d588 e896d100 e896d56c 00000001
> c02c3a84 f5f0b310
> [ 6000.096525] c176fb00 e896d100 c02c4000 f5f0b300 7fffffff c02c3a90
> c155b740 00000000
> [ 6000.096535] c02c3b04 c155fa1d 00000046 c11127a5 00000246 00000000
> c02c3ab0 c10a569b
> [ 6000.096546] Call Trace:
> [ 6000.096550] [<c155b740>] schedule+0x30/0x80
> [ 6000.096554] [<c155fa1d>] schedule_timeout+0x2cd/0x5c0
> [ 6000.096558] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
> [ 6000.096562] [<c10a569b>] ? trace_hardirqs_on+0xb/0x10
> [ 6000.096566] [<c10d701c>] ? ktime_get+0xac/0x1a0
> [ 6000.096569] [<c11127a5>] ? __delayacct_blkio_start+0x15/0x20
> [ 6000.096574] [<c155ad99>] io_schedule_timeout+0x89/0xf0
> [ 6000.096577] [<c109ad07>] ? prepare_to_wait_exclusive+0x47/0x80
> [ 6000.096581] [<c155beb0>] ? bit_wait+0x40/0x40
> [ 6000.096584] [<c155bed5>] bit_wait_io+0x25/0x50
> [ 6000.096587] [<c155bd12>] __wait_on_bit_lock+0x32/0x80
> [ 6000.096591] [<c155bdbd>] out_of_line_wait_on_bit_lock+0x5d/0x70
> [ 6000.096595] [<c155beb0>] ? bit_wait+0x40/0x40
> [ 6000.096598] [<c109b210>] ? autoremove_wake_function+0x40/0x40
> [ 6000.096602] [<c11ea166>] bh_uptodate_or_lock+0x66/0x70
> [ 6000.096623] [<f8321349>] ext4_get_branch+0x99/0x110 [ext4]
> [ 6000.096643] [<f8321f14>] ext4_ind_map_blocks+0xd4/0xe30 [ext4]
> [ 6000.096647] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
> [ 6000.096651] [<c10a6290>] ? __lock_acquire+0x910/0x16a0
> [ 6000.096656] [<c155ed53>] ? down_read+0x33/0x50
> [ 6000.096671] [<f82ddc9d>] ext4_map_blocks+0x29d/0x4f0 [ext4]
> [ 6000.096675] [<c10a548b>] ? mark_held_locks+0x5b/0x90
> [ 6000.096679] [<c10a55ec>] ? trace_hardirqs_on_caller+0x12c/0x1d0
> [ 6000.096693] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
> [ 6000.096713] [<f832d37b>] ext4_mpage_readpages+0x30b/0x8c0 [ext4]
> [ 6000.096727] [<f82db052>] ? ext4_readpages+0x32/0x40 [ext4]
> [ 6000.096732] [<c1156930>] ? __alloc_pages_nodemask+0x9c0/0xa40
> [ 6000.096747] [<f82db052>] ext4_readpages+0x32/0x40 [ext4]
> [ 6000.096761] [<f82db020>] ? do_journal_get_write_access+0xb0/0xb0 [ext4]
> [ 6000.096766] [<c115c376>] __do_page_cache_readahead+0x2e6/0x370
> [ 6000.096770] [<c115c233>] ? __do_page_cache_readahead+0x1a3/0x370
> [ 6000.096775] [<c114fb85>] filemap_fault+0x505/0x570
> [ 6000.096779] [<c117bb6f>] ? __do_fault+0x2f/0x80
> [ 6000.096783] [<c117bb6f>] __do_fault+0x2f/0x80
> [ 6000.096787] [<c1560ca7>] ? _raw_spin_unlock+0x27/0x50
> [ 6000.096791] [<c117f412>] handle_mm_fault+0xb22/0x11d0
> [ 6000.096796] [<c104aa7a>] __do_page_fault+0x16a/0x500
> [ 6000.096800] [<c104ae10>] ? __do_page_fault+0x500/0x500
> [ 6000.096803] [<c104ae31>] do_page_fault+0x21/0x30
> [ 6000.096807] [<c156282b>] error_code+0x5f/0x64
> [ 6000.096811] [<c104ae10>] ? __do_page_fault+0x500/0x500
> [ 6000.096815] 2 locks held by hddtemp/7797:
> [ 6000.096817] #0: (&mm->mmap_sem){++++++}, at: [<c104aa25>]
> __do_page_fault+0x115/0x500
> [ 6000.096825] #1: (&ei->i_data_sem){++++..}, at: [<f82ddd9b>]
> ext4_map_blocks+0x39b/0x4f0 [ext4]
>
>