2016-10-28 16:58:08

by Tejun Heo

Subject: [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait

Hello,

When there's heavy metadata operation traffic on ext4, the journal
fills up quickly and the majority of filesystem users end up blocking
on journal->j_checkpoint_mutex with a stacktrace similar to the
following.

[<ffffffff8c32e758>] __jbd2_log_wait_for_space+0xb8/0x1d0
[<ffffffff8c3285f6>] add_transaction_credits+0x286/0x2a0
[<ffffffff8c32876c>] start_this_handle+0x10c/0x400
[<ffffffff8c328c5b>] jbd2__journal_start+0xdb/0x1e0
[<ffffffff8c30ee5d>] __ext4_journal_start_sb+0x6d/0x120
[<ffffffff8c2d713e>] __ext4_new_inode+0x64e/0x1330
[<ffffffff8c2e9bf0>] ext4_create+0xc0/0x1c0
[<ffffffff8c2570fd>] path_openat+0x124d/0x1380
[<ffffffff8c258501>] do_filp_open+0x91/0x100
[<ffffffff8c2462d0>] do_sys_open+0x130/0x220
[<ffffffff8c2463de>] SyS_open+0x1e/0x20
[<ffffffff8c7ec5b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[<ffffffffffffffff>] 0xffffffffffffffff

Because the sleeps on the mutex aren't accounted as iowait, the system
doesn't show the usual signs of being bogged down by IOs - both iowait
and /proc/stat:procs_blocked stay misleadingly low. While propagation
of iowait through locking constructs is far from being strict, heavy
contention on j_checkpoint_mutex is easy to trigger, is obviously
iowait, and getting it right can help users in tracking down the issue
quite a bit.

Due to the way io_schedule() is implemented, it is currently hairy to
add an io variant to an existing interface - the schedule() call
itself, which is usually buried deep, would have to be replaced with
io_schedule(). As we already have current->in_iowait to mark the task
as sleeping for iowait, this can be made easy by breaking up
io_schedule() into multiple steps so that the preparation and marking
can be done before calling an existing interface and the actual iowait
accounting can be done from inside the scheduler.
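
For illustration, here is a minimal sketch of the resulting pattern -
only io_schedule_prepare()/io_schedule_finish() come from this series;
my_wait_for_space() and struct my_resource are made-up stand-ins for an
existing blocking interface and its argument.

/*
 * Illustrative sketch only.  my_wait_for_space() stands in for any
 * existing blocking interface whose internal schedule() call is buried
 * too deep to replace directly.
 */
static void my_wait_for_space_io(struct my_resource *res)
{
	int token;

	token = io_schedule_prepare();	/* mark the task in_iowait, flush block plug */
	my_wait_for_space(res);		/* sleeps via a plain schedule() internally */
	io_schedule_finish(token);	/* restore the previous in_iowait state */
}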

What do you think?

This patchset contains the following four patches.

0001-sched-move-IO-scheduling-accounting-from-io_schedule.patch
0002-sched-separate-out-io_schedule_prepare-and-io_schedu.patch
0003-mutex-add-mutex_lock_io.patch
0004-jbd2-use-mutex_lock_io-for-journal-j_checkpoint_mute.patch

0001-0002 implement io_schedule_prepare/finish().
0003 implements mutex_lock_io() using io_schedule_prepare/finish().
0004 uses mutex_lock_io() on journal->j_checkpoint_mutex.

This patchset is also available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mutex_lock_io

Thanks, diffstat follows.

fs/jbd2/commit.c | 2 -
fs/jbd2/journal.c | 14 ++++++-------
include/linux/mutex.h | 4 +++
include/linux/sched.h | 8 ++-----
kernel/locking/mutex.c | 24 ++++++++++++++++++++++
kernel/sched/core.c | 52 +++++++++++++++++++++++++++++++++++++------------
6 files changed, 79 insertions(+), 25 deletions(-)

--
tejun


2016-10-28 16:58:11

by Tejun Heo

Subject: [PATCH 3/4] mutex: add mutex_lock_io()

We sometimes end up propagating IO blocking through mutexes; however,
because there currently is no way of annotating mutex sleeps as
iowait, there are cases where iowait and /proc/stat:procs_blocked
report misleading numbers obscuring the actual state of the system.

This patch adds mutex_lock_io() so that mutex sleeps can be marked as
iowait in those cases.
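
As an illustration of usage, here is a hedged sketch of a call site;
the device structure and flush path are hypothetical, and the real user
in this series is journal->j_checkpoint_mutex in the next patch.

/*
 * Hypothetical caller.  Waiting on this mutex effectively waits for IO
 * issued by the current holder, so the sleep should count as iowait.
 */
static void my_flush_device(struct my_device *dev)
{
	mutex_lock_io(&dev->flush_mutex);	/* sleep here is accounted as iowait */
	/* ... issue the flush and wait for it to complete ... */
	mutex_unlock(&dev->flush_mutex);
}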

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/mutex.h | 4 ++++
kernel/locking/mutex.c | 24 ++++++++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 2cb7531..5d77ddd 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -142,10 +142,12 @@ extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
unsigned int subclass);
extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
unsigned int subclass);
+extern void mutex_lock_io_nested(struct mutex *lock, unsigned int subclass);

#define mutex_lock(lock) mutex_lock_nested(lock, 0)
#define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
#define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+#define mutex_lock_io(lock) mutex_lock_io_nested(lock, 0)

#define mutex_lock_nest_lock(lock, nest_lock) \
do { \
@@ -157,11 +159,13 @@ do { \
extern void mutex_lock(struct mutex *lock);
extern int __must_check mutex_lock_interruptible(struct mutex *lock);
extern int __must_check mutex_lock_killable(struct mutex *lock);
+extern void mutex_lock_io(struct mutex *lock);

# define mutex_lock_nested(lock, subclass) mutex_lock(lock)
# define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
# define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
+# define mutex_lock_io_nested(lock, subclass) mutex_lock_io(lock)
#endif

/*
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..a37709c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -653,6 +653,20 @@ mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)

EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);

+void __sched
+mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
+{
+ int token;
+
+ might_sleep();
+
+ token = io_schedule_prepare();
+ __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
+ subclass, NULL, _RET_IP_, NULL, 0);
+ io_schedule_finish(token);
+}
+EXPORT_SYMBOL_GPL(mutex_lock_io_nested);
+
static inline int
ww_mutex_deadlock_injection(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
@@ -816,6 +830,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
}
EXPORT_SYMBOL(mutex_lock_killable);

+void __sched mutex_lock_io(struct mutex *lock)
+{
+ int token;
+
+ token = io_schedule_prepare();
+ mutex_lock(lock);
+ io_schedule_finish(token);
+}
+EXPORT_SYMBOL_GPL(mutex_lock_io);
+
__visible void __sched
__mutex_lock_slowpath(atomic_t *lock_count)
{
--
2.7.4


2016-10-28 16:58:12

by Tejun Heo

Subject: [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex

When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked. While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.

Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Mingbo Wan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/jbd2/commit.c | 2 +-
fs/jbd2/journal.c | 14 +++++++-------
2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 31f8ca0..e772881 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -392,7 +392,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
/* Do we need to erase the effects of a prior jbd2_journal_flush? */
if (journal->j_flags & JBD2_FLUSHED) {
jbd_debug(3, "super block updated\n");
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
/*
* We hold j_checkpoint_mutex so tail cannot change under us.
* We don't need any special data guarantees for writing sb
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 927da49..0aa7d06 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -944,7 +944,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
*/
void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
{
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
if (tid_gt(tid, journal->j_tail_sequence))
__jbd2_update_log_tail(journal, tid, block);
mutex_unlock(&journal->j_checkpoint_mutex);
@@ -1304,7 +1304,7 @@ static int journal_reset(journal_t *journal)
journal->j_flags |= JBD2_FLUSHED;
} else {
/* Lock here to make assertions happy... */
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
/*
* Update log tail information. We use WRITE_FUA since new
* transaction will start reusing journal space and so we
@@ -1691,7 +1691,7 @@ int jbd2_journal_destroy(journal_t *journal)
spin_lock(&journal->j_list_lock);
while (journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
err = jbd2_log_do_checkpoint(journal);
mutex_unlock(&journal->j_checkpoint_mutex);
/*
@@ -1713,7 +1713,7 @@ int jbd2_journal_destroy(journal_t *journal)

if (journal->j_sb_buffer) {
if (!is_journal_aborted(journal)) {
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);

write_lock(&journal->j_state_lock);
journal->j_tail_sequence =
@@ -1954,7 +1954,7 @@ int jbd2_journal_flush(journal_t *journal)
spin_lock(&journal->j_list_lock);
while (!err && journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
err = jbd2_log_do_checkpoint(journal);
mutex_unlock(&journal->j_checkpoint_mutex);
spin_lock(&journal->j_list_lock);
@@ -1964,7 +1964,7 @@ int jbd2_journal_flush(journal_t *journal)
if (is_journal_aborted(journal))
return -EIO;

- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
if (!err) {
err = jbd2_cleanup_journal_tail(journal);
if (err < 0) {
@@ -2024,7 +2024,7 @@ int jbd2_journal_wipe(journal_t *journal, int write)
err = jbd2_journal_skip_recovery(journal);
if (write) {
/* Lock to make assertions happy... */
- mutex_lock(&journal->j_checkpoint_mutex);
+ mutex_lock_io(&journal->j_checkpoint_mutex);
jbd2_mark_journal_empty(journal, WRITE_FUA);
mutex_unlock(&journal->j_checkpoint_mutex);
}
--
2.7.4


2016-10-28 16:58:10

by Tejun Heo

Subject: [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish()

Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().

Because task_struct->in_iowait is a single bit, the caller of
io_schedule_prepare() needs to record and then pass its state to
io_schedule_finish() to be safe regarding nesting. While this isn't
the prettiest, these functions are mostly going to be used by core
functions and we don't want to use more space for ->in_iowait.
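
To illustrate why the token is needed, here is a sketch of nested use
with hypothetical callers (not part of the patch); the inner pair must
restore, not clear, ->in_iowait.

/*
 * Hypothetical nested use.  The token carries the previous ->in_iowait
 * value, so the inner io_schedule_finish() restores 1 instead of
 * unconditionally clearing it.
 */
static void inner_wait(void)
{
	int token = io_schedule_prepare();	/* ->in_iowait already 1, token = 1 */

	schedule();				/* this sleep is accounted as iowait */
	io_schedule_finish(token);		/* restores 1, outer marking stays intact */
}

static void outer_wait(void)
{
	int token = io_schedule_prepare();	/* ->in_iowait: 0 -> 1, token = 0 */

	inner_wait();				/* nests another prepare/finish pair */
	io_schedule_finish(token);		/* ->in_iowait back to 0 */
}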

While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/sched.h | 8 +++-----
kernel/sched/core.c | 33 ++++++++++++++++++++++++++++-----
2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..c025f77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -441,12 +441,10 @@ extern signed long schedule_timeout_idle(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);

+extern int __must_check io_schedule_prepare(void);
+extern void io_schedule_finish(int token);
extern long io_schedule_timeout(long timeout);
-
-static inline void io_schedule(void)
-{
- io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
-}
+extern void io_schedule(void);

void __noreturn do_task_dead(void);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f6baa38..30d3185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5067,25 +5067,48 @@ int __sched yield_to(struct task_struct *p, bool preempt)
}
EXPORT_SYMBOL_GPL(yield_to);

+int io_schedule_prepare(void)
+{
+ int old_iowait = current->in_iowait;
+
+ current->in_iowait = 1;
+ blk_schedule_flush_plug(current);
+
+ return old_iowait;
+}
+
+void io_schedule_finish(int token)
+{
+ current->in_iowait = token;
+}
+
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
*/
long __sched io_schedule_timeout(long timeout)
{
- int old_iowait = current->in_iowait;
+ int token;
long ret;

- current->in_iowait = 1;
- blk_schedule_flush_plug(current);
-
+ token = io_schedule_prepare();
ret = schedule_timeout(timeout);
- current->in_iowait = old_iowait;
+ io_schedule_finish(token);

return ret;
}
EXPORT_SYMBOL(io_schedule_timeout);

+void io_schedule(void)
+{
+ int token;
+
+ token = io_schedule_prepare();
+ schedule();
+ io_schedule_finish(token);
+}
+EXPORT_SYMBOL(io_schedule);
+
/**
* sys_sched_get_priority_max - return maximum RT priority.
* @policy: scheduling class.
--
2.7.4

2016-10-28 16:58:09

by Tejun Heo

Subject: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule(). This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside
schedule().

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

1. Mark the task as scheduling for IO.
2. Flush out plugged IOs.
3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling. This patch moves #3 into
__schedule() so that later patches can separate out preparation and
finish steps from io_schedule().

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jens Axboe <[email protected]>
---
kernel/sched/core.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..f6baa38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
unsigned long *switch_count;
struct pin_cookie cookie;
struct rq *rq;
- int cpu;
+ int cpu, in_iowait;

cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
+ in_iowait = prev->in_iowait;
+
+ if (in_iowait) {
+ delayacct_blkio_start();
+ atomic_inc(&rq->nr_iowait);
+ }

schedule_debug(prev);

@@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
}

balance_callback(rq);
+
+ if (in_iowait) {
+ atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
+ }
}

void __noreturn do_task_dead(void)
@@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to);
long __sched io_schedule_timeout(long timeout)
{
int old_iowait = current->in_iowait;
- struct rq *rq;
long ret;

current->in_iowait = 1;
blk_schedule_flush_plug(current);

- delayacct_blkio_start();
- rq = raw_rq();
- atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
current->in_iowait = old_iowait;
- atomic_dec(&rq->nr_iowait);
- delayacct_blkio_end();

return ret;
}
--
2.7.4

2016-10-28 18:27:12

by Peter Zijlstra

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> unsigned long *switch_count;
> struct pin_cookie cookie;
> struct rq *rq;
> - int cpu;
> + int cpu, in_iowait;
>
> cpu = smp_processor_id();
> rq = cpu_rq(cpu);
> prev = rq->curr;
> + in_iowait = prev->in_iowait;
> +
> + if (in_iowait) {
> + delayacct_blkio_start();
> + atomic_inc(&rq->nr_iowait);
> + }
>
> schedule_debug(prev);
>
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> }
>
> balance_callback(rq);
> +
> + if (in_iowait) {
> + atomic_dec(&rq->nr_iowait);
> + delayacct_blkio_end();
> + }
> }
>
> void __noreturn do_task_dead(void)

Urgh, can't say I like this much. It moves two branches into the
schedule path.

Nor do I really like the idea of having to annotate special mutexes for
the iowait crap.

I'll think more after KS/LPC etc..

2016-10-28 19:07:02

by Peter Zijlstra

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

On Fri, Oct 28, 2016 at 08:27:12PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> > unsigned long *switch_count;
> > struct pin_cookie cookie;
> > struct rq *rq;
> > - int cpu;
> > + int cpu, in_iowait;
> >
> > cpu = smp_processor_id();
> > rq = cpu_rq(cpu);
> > prev = rq->curr;
> > + in_iowait = prev->in_iowait;
> > +
> > + if (in_iowait) {
> > + delayacct_blkio_start();
> > + atomic_inc(&rq->nr_iowait);
> > + }
> >
> > schedule_debug(prev);
> >
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> > }
> >
> > balance_callback(rq);
> > +
> > + if (in_iowait) {
> > + atomic_dec(&rq->nr_iowait);
> > + delayacct_blkio_end();
> > + }
> > }
> >
> > void __noreturn do_task_dead(void)
>
> Urgh, can't say I like this much. It moves two branches into the
> schedule path.
>
> Nor do I really like the idea of having to annotate special mutexes for
> the iowait crap.
>
> I'll think more after KS/LPC etc..

One alternative is to inherit the iowait state of the task we block on.
That'll not get rid of the branches much, but it will remove the new
mutex APIs.

2016-10-28 19:12:32

by Tejun Heo

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

Hello, Peter.

On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> One alternative is to inherit the iowait state of the task we block on.
> That'll not get rid of the branches much, but it will remove the new
> mutex APIs.

Yeah, thought about that briefly but we don't necessarily track mutex
or other synchronization construct owners, things get gnarly with
rwsems (the inode ones sometimes end up in a similar situation), and
we'll probably end up dealing with some surprising propagations down
the line. That said, getting such automatic propagation working would
be great.

Thanks.

--
tejun

2016-10-29 03:21:26

by Peter Zijlstra

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> Hello, Peter.
>
> On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > One alternative is to inherit the iowait state of the task we block on.
> > That'll not get rid of the branches much, but it will remove the new
> > mutex APIs.
>
> Yeah, thought about that briefly but we don't necessarily track mutex

This one I actually fixed and should be in -next. And it would be
sufficient to cover the use case here.

> or other synchronization construct owners, things get gnarly with
> rwsems (the inode ones sometimes end up in a similar situation), and
> we'll probably end up dealing with some surprising propagations down
> the line.

rwsems could be done for writers only.

2016-10-31 16:46:05

by Tejun Heo

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

Hello,

On Sat, Oct 29, 2016 at 05:21:26AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> > Hello, Peter.
> >
> > On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > > One alternative is to inherit the iowait state of the task we block on.
> > > That'll not get rid of the branches much, but it will remove the new
> > > mutex APIs.
> >
> > Yeah, thought about that briefly but we don't necessarily track mutex
>
> This one I actually fixed and should be in -next. And it would be
> sufficient to cover the use case here.

Tracking the owners of mutexes and rwsems does help quite a bit. I
don't think it's as simple as inheriting io sleep state from the
current owner tho. The owner might be running or in a non-IO sleep
when others try to grab the mutex. It is an option to ignore those
cases but this would have a real possibility to lead to surprising
results in some corner cases. If we choose to propagate dynamically,
it becomes a much more complex problem and I don't think it'd be
justifiable.

Unless there can be a simple enough and reliable solution, I think
it'd be better to stick with explicit marking.

Thanks.

--
tejun

2016-11-03 15:33:45

by Pavankumar Kondeti

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

Hi Tejun,

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1..f6baa38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> unsigned long *switch_count;
> struct pin_cookie cookie;
> struct rq *rq;
> - int cpu;
> + int cpu, in_iowait;
>
> cpu = smp_processor_id();
> rq = cpu_rq(cpu);
> prev = rq->curr;
> + in_iowait = prev->in_iowait;
> +
> + if (in_iowait) {
> + delayacct_blkio_start();
> + atomic_inc(&rq->nr_iowait);
> + }
>
> schedule_debug(prev);
>
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> }
>
> balance_callback(rq);
> +
> + if (in_iowait) {
> + atomic_dec(&rq->nr_iowait);
> + delayacct_blkio_end();
> + }
> }

I think the nr_iowait update can go wrong here.

When the task migrates to a different CPU upon wakeup, this rq points
to a different CPU from the one on which nr_iowait was incremented
before.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2016-11-08 22:51:02

by Tejun Heo

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

Hello,

On Thu, Nov 03, 2016 at 09:03:45PM +0530, Pavan Kondeti wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 94732d1..f6baa38 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> > unsigned long *switch_count;
> > struct pin_cookie cookie;
> > struct rq *rq;
> > - int cpu;
> > + int cpu, in_iowait;
> >
> > cpu = smp_processor_id();
> > rq = cpu_rq(cpu);
> > prev = rq->curr;
> > + in_iowait = prev->in_iowait;
> > +
> > + if (in_iowait) {
> > + delayacct_blkio_start();
> > + atomic_inc(&rq->nr_iowait);
> > + }
> >
> > schedule_debug(prev);
> >
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> > }
> >
> > balance_callback(rq);
> > +
> > + if (in_iowait) {
> > + atomic_dec(&rq->nr_iowait);
> > + delayacct_blkio_end();
> > + }
> > }
>
> I think the nr_iowait update can go wrong here.
>
> When the task migrates to a different CPU upon wakeup, this rq points
> to a different CPU from the one on which nr_iowait was incremented
> before.

Ah, you're right, it should remember the original rq.

Thanks.

--
tejun

2016-12-06 21:29:38

by Tejun Heo

Subject: [PATCH v2 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule(). This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside
schedule().

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

1. Mark the task as scheduling for IO.
2. Flush out plugged IOs.
3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling. This patch moves #3 into
__schedule() so that later patches can separate out preparation and
finish steps from io_schedule().

v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait
to avoid misattributing the count after the task gets migrated to
another CPU. Noticed by Pavan.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Pavan Kondeti <[email protected]>
---
kernel/sched/core.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3335,12 +3335,18 @@ static void __sched notrace __schedule(b
struct task_struct *prev, *next;
unsigned long *switch_count;
struct pin_cookie cookie;
- struct rq *rq;
- int cpu;
+ struct rq *rq, *prev_rq;
+ int cpu, in_iowait;

cpu = smp_processor_id();
- rq = cpu_rq(cpu);
+ rq = prev_rq = cpu_rq(cpu);
prev = rq->curr;
+ in_iowait = prev->in_iowait;
+
+ if (in_iowait) {
+ delayacct_blkio_start();
+ atomic_inc(&rq->nr_iowait);
+ }

schedule_debug(prev);

@@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(b
}

balance_callback(rq);
+
+ if (in_iowait) {
+ atomic_dec(&prev_rq->nr_iowait);
+ delayacct_blkio_end();
+ }
}

void __noreturn do_task_dead(void)
@@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to);
long __sched io_schedule_timeout(long timeout)
{
int old_iowait = current->in_iowait;
- struct rq *rq;
long ret;

current->in_iowait = 1;
blk_schedule_flush_plug(current);

- delayacct_blkio_start();
- rq = raw_rq();
- atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
current->in_iowait = old_iowait;
- atomic_dec(&rq->nr_iowait);
- delayacct_blkio_end();

return ret;
}

2016-12-06 21:30:32

by Tejun Heo

Subject: Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

Hello,

On Mon, Oct 31, 2016 at 10:45:56AM -0600, Tejun Heo wrote:
> Tracking the owners of mutexes and rwsems does help quite a bit. I
> don't think it's as simple as inheriting io sleep state from the
> current owner tho. The owner might be running or in a non-IO sleep
> when others try to grab the mutex. It is an option to ignore those
> cases but this would have a real possibility to lead to surprising
> results in some corner cases. If we choose to propagate dynamically,
> it becomes a much more complex problem and I don't think it'd be
> justifiable.
>
> Unless there can be a simple enough and reliable solution, I think
> it'd be better to stick with explicit marking.

Just posted the fixed version for the first patch. Any more thoughts
on this?

Thanks.

--
tejun

2016-12-07 09:35:10

by Peter Zijlstra

Subject: Re: [PATCH v2 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

On Tue, Dec 06, 2016 at 04:29:35PM -0500, Tejun Heo wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3335,12 +3335,18 @@ static void __sched notrace __schedule(b
> struct task_struct *prev, *next;
> unsigned long *switch_count;
> struct pin_cookie cookie;
> - struct rq *rq;
> - int cpu;
> + struct rq *rq, *prev_rq;
> + int cpu, in_iowait;
>
> cpu = smp_processor_id();
> - rq = cpu_rq(cpu);
> + rq = prev_rq = cpu_rq(cpu);
> prev = rq->curr;
> + in_iowait = prev->in_iowait;
> +
> + if (in_iowait) {
> + delayacct_blkio_start();
> + atomic_inc(&rq->nr_iowait);
> + }
>
> schedule_debug(prev);
>
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(b
> }
>
> balance_callback(rq);
> +
> + if (in_iowait) {
> + atomic_dec(&prev_rq->nr_iowait);
> + delayacct_blkio_end();
> + }
> }
>
> void __noreturn do_task_dead(void)

So I really dislike this, it mucks with the fast paths for something of
dubious utility.

At the very least move this to the actual blocking paths such that
regular context switches don't see this overhead (also delayacct is
horrific crap).

A little something like so... completely untested.

---
kernel/sched/core.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b08fb257856..935bcd91f4a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;

+ if (p->in_iowait) {
+ delayacct_blkio_end();
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
}
+
+#else /* CONFIG_SMP */
+
+ if (p->in_iowait) {
+ delayacct_blkio_end();
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
#endif /* CONFIG_SMP */

ttwu_queue(p, cpu, wake_flags);
@@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie

trace_sched_waking(p);

- if (!task_on_rq_queued(p))
+ if (!task_on_rq_queued(p)) {
+ if (p->in_iowait) {
+ delayacct_blkio_end();
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+ }

ttwu_do_wakeup(rq, p, 0, cookie);
ttwu_stat(p, smp_processor_id(), 0);
@@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(void)
return sum;
}

+/*
+ * IO-wait accounting, and how its mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait account is to account the idle time that we could
+ * have spend running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've ran at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
unsigned long nr_iowait(void)
{
unsigned long i, sum = 0;
@@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void)
return sum;
}

+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
unsigned long nr_iowait_cpu(int cpu)
{
struct rq *this = cpu_rq(cpu);
@@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(bool preempt)
deactivate_task(rq, prev, DEQUEUE_SLEEP);
prev->on_rq = 0;

+ if (prev->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+
/*
* If a worker went to sleep, notify and ask workqueue
* whether it wants to wake up a task to maintain

2016-12-07 20:48:41

by Tejun Heo

Subject: [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler

Hello,

Yeah, that works. Here's v3 based on your patch. The other patches
still apply correctly.

Thanks.

------ 8< ------
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule(). This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

1. Mark the task as scheduling for IO.
2. Flush out plugged IOs.
3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling. This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

v3: Replaced with PeterZ's implementation which performs nr_iowait
accounting in the sleep and wakeup paths to avoid unnecessarily
burdening non-sleeping paths in __schedule().

v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait
to avoid misattributing the count after the task gets migrated to
another CPU. Noticed by Pavan.

Signed-off-by: Tejun Heo <[email protected]>
Patch-originally-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Pavan Kondeti <[email protected]>
---
kernel/sched/core.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 61 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, un
p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;

+ if (p->in_iowait) {
+ delayacct_blkio_end();
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
}
+
+#else /* CONFIG_SMP */
+
+ if (p->in_iowait) {
+ delayacct_blkio_end();
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
#endif /* CONFIG_SMP */

ttwu_queue(p, cpu, wake_flags);
@@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct

trace_sched_waking(p);

- if (!task_on_rq_queued(p))
+ if (!task_on_rq_queued(p)) {
+ if (p->in_iowait) {
+ delayacct_blkio_end();
+ atomic_dec(&rq->nr_iowait);
+ }
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+ }

ttwu_do_wakeup(rq, p, 0, cookie);
ttwu_stat(p, smp_processor_id(), 0);
@@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(v
return sum;
}

+/*
+ * IO-wait accounting, and how its mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait account is to account the idle time that we could
+ * have spend running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've ran at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
unsigned long nr_iowait(void)
{
unsigned long i, sum = 0;
@@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void)
return sum;
}

+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
unsigned long nr_iowait_cpu(int cpu)
{
struct rq *this = cpu_rq(cpu);
@@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(b
deactivate_task(rq, prev, DEQUEUE_SLEEP);
prev->on_rq = 0;

+ if (prev->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+
/*
* If a worker went to sleep, notify and ask workqueue
* whether it wants to wake up a task to maintain
@@ -5063,19 +5123,13 @@ EXPORT_SYMBOL_GPL(yield_to);
long __sched io_schedule_timeout(long timeout)
{
int old_iowait = current->in_iowait;
- struct rq *rq;
long ret;

current->in_iowait = 1;
blk_schedule_flush_plug(current);

- delayacct_blkio_start();
- rq = raw_rq();
- atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
current->in_iowait = old_iowait;
- atomic_dec(&rq->nr_iowait);
- delayacct_blkio_end();

return ret;
}