2012-10-11 12:34:12

by Alex Bligh

Subject: Local DoS through write heavy I/O on CFQ & Deadline

We have noticed significant I/O scheduling issues on both the CFQ and the
deadline scheduler where a non-root user can starve any other process of
any I/O for minutes at a time. The problem is more serious using CFQ but is
still an effective local DoS vector using Deadline.

A simple way to generate the problem is:

dd if=/dev/zero bs=1M count=50000 | dd of=myfile bs=1M count=50000

(note: two dd's are used to avoid any alleged optimisation of a dd
writing directly from /dev/zero). zcat-ing a large file with stdout
redirected to a file produces a similar effect. Using ionice to set idle
priority makes no difference.

To instrument the problem we produced a python script which performs a
MySQL select and update every 10 seconds, and times the execution of the
update. This normally takes milliseconds, but under the user-generated
load conditions described we can push it to indefinite (on CFQ) and to
over a minute (on deadline). Postgres is affected in a similar manner
(i.e. it is not MySQL specific). Simultaneously we captured the output
of 'vmstat 1 2' and /proc/meminfo, with appropriate timestamps.
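For reference, the probe can be approximated without a database at all:
timing a small fsync'd write shows the same stalls, since fsync is where
the latency surfaces. The sketch below is illustrative only (it is not
the original script, and the path is a placeholder):

```python
import os
import time

def probe_sync_write_latency(path, payload=b"x" * 4096):
    # Time one small fsync'd write; under heavy background writeback
    # this is where the multi-second (or longer) stalls show up.
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return time.monotonic() - start

latency = probe_sync_write_latency("/tmp/latency_probe")
print(time.strftime("%Y-%m-%d %H:%M:%S"), f"{latency:.6f}s")
# The original instrumentation repeated a probe like this every 10
# seconds alongside the dd workload, logging each latency.
```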

We have reproduced this on multiple hardware environments, using 3.2
(/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24").
Anecdotally we believe the situation has worsened since 3.0.

We believe the problem is that dirty-page writeout is starving the system
of any other I/O; the process concerned should be penalised to allow
other processes I/O time, but this is not happening.

Full info, including logs and scripts can be found at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

I believe this represents a local DoS vector as an unprivileged user can
effectively stall any root owned process that is performing I/O.

--
Alex Bligh

2012-10-11 13:41:35

by Alan

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

> We have reproduced this on multiple hardware environments, using 3.2
> (/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24").
> Anecdotally we believe the situation has worsened since 3.0.

I've certainly seen this on 3.0 and 3.2, but do you still see it on
3.5/6 ?

2012-10-12 12:57:57

by Alex Bligh

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

Alan,

--On 11 October 2012 14:46:34 +0100 Alan Cox <[email protected]>
wrote:

>> We have reproduced this on multiple hardware environments, using 3.2
>> (/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24").
>> Anecdotally we believe the situation has worsened since 3.0.
>
> I've certainly seen this on 3.0 and 3.2, but do you still see it on
> 3.5/6 ?

We've just tested this. We see exactly the same issue on 3.6.1 using
the current build of the Ubuntu Quantal kernel, which is:

Linux version 3.6.1-030601-generic (apw@gomeisa) (gcc version 4.6.3
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201210071322 SMP Sun Oct 7 17:23:28 UTC
2012

More details (including full logs for that kernel) at:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

--
Alex Bligh

2012-10-12 13:30:48

by Michal Hocko

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On Thu 11-10-12 13:23:32, Alex Bligh wrote:
> We have noticed significant I/O scheduling issues on both the CFQ and the
> deadline scheduler where a non-root user can starve any other process of
> any I/O for minutes at a time. The problem is more serious using CFQ but is
> still an effective local DoS vector using Deadline.
>
> A simple way to generate the problem is:
>
> dd if=/dev/zero bs=1M count=50000 | dd of=myfile bs=1M count=50000
>
[...]
>
> Full info, including logs and scripts can be found at:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

You seem to have 8G of RAM and dirty_ratio=20 resp.
dirty_background_ratio=10, which means roughly 1.5G worth of dirty data
can accumulate before the writer gets throttled, which is a lot.
Background writeback starts at around 800M, which is probably not
sufficient either. Have you tried setting dirty_bytes to a reasonable
value (wrt. your storage)?
--
Michal Hocko
SUSE Labs

2012-10-12 14:48:39

by Alex Bligh

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline



--On 12 October 2012 15:30:45 +0200 Michal Hocko <[email protected]> wrote:

>> Full info, including logs and scripts can be found at:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521
>
> You seem to have 8G of RAM and dirty_ratio=20 resp.
> dirty_background_ratio=10 which means that 1.5G worth of dirty data
> until writer gets throttled which is a lot. Background writeback starts
> at 800M which is probably not sufficient as well. Have you tried to set
> dirty_bytes at a reasonable value (wrt. to your storage)?

This is for an appliance install where we have no idea how much
memory the box has in advance, other than 'at least 4G', so it
is difficult to tune by default.

However, I don't think that would solve the problem, as the zcat/dd
can always generate data faster than it can be written to disk unless
or until it is throttled, which it never is. Isn't the only thing that
is going to change that it ends up triggering the writeback earlier?

Happy to test etc - what would you suggest, dirty_ratio=5,
dirty_background_ratio=2 ?

--
Alex Bligh

2012-10-12 14:58:42

by Michal Hocko

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On Fri 12-10-12 15:48:34, Alex Bligh wrote:
>
>
> --On 12 October 2012 15:30:45 +0200 Michal Hocko <[email protected]> wrote:
>
> >>Full info, including logs and scripts can be found at:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521
> >
> >You seem to have 8G of RAM and dirty_ratio=20 resp.
> >dirty_background_ratio=10 which means that 1.5G worth of dirty data
> >until writer gets throttled which is a lot. Background writeback starts
> >at 800M which is probably not sufficient as well. Have you tried to set
> >dirty_bytes at a reasonable value (wrt. to your storage)?
>
> This is for an appliance install where we have no idea how much
> memory the box has in advance other than 'at least 4G' so it
> is difficult to tune by default.
>
> However, I don't think that would solve the problem as the zcat/dd
> can always generate data faster than it can be written to disk unless
> or until it is throttled, which it never is.

Once dirty_ratio (resp. dirty_bytes) limit is hit then the process which
writes gets throttled. If this is not the case then there is a bug in
the throttling code.

> Isn't the only thing that is going to change that it ends up
> triggering the writeback earlier?

Set the limit lowe?

> Happy to test etc - what would you suggest, dirty_ratio=5,
> dirty_background_ratio=2 ?

These are measured in percentage. On the other hand, if you use
dirty_bytes resp. dirty_background_bytes then you get absolute numbers
independent of the amount of memory.
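To make the difference concrete, here is roughly what those knobs work
out to on the reporter's 8G box (a sketch only; the kernel actually
computes the limits against dirtyable memory, which is somewhat less
than MemTotal):

```python
# Approximate dirty-limit arithmetic for an 8 GiB machine with the
# default vm.dirty_ratio=20 and vm.dirty_background_ratio=10.
mem_total = 8 * 1024**3  # bytes

def dirty_limit_bytes(mem_bytes, ratio_percent):
    return mem_bytes * ratio_percent // 100

throttle_at = dirty_limit_bytes(mem_total, 20)    # writer throttled here
background_at = dirty_limit_bytes(mem_total, 10)  # background writeback starts

print(throttle_at // 2**20, "MiB before the writer is throttled")
print(background_at // 2**20, "MiB before background writeback starts")
# Setting vm.dirty_bytes / vm.dirty_background_bytes instead pins these
# thresholds to absolute values regardless of how much RAM the box has.
```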

--
Michal Hocko
SUSE Labs

2012-10-12 16:29:57

by Alex Bligh

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

Michal,

--On 12 October 2012 16:58:39 +0200 Michal Hocko <[email protected]> wrote:

> Once dirty_ratio (resp. dirty_bytes) limit is hit then the process which
> writes gets throttled. If this is not the case then there is a bug in
> the throttling code.

I believe that is the problem.

>> Isn't the only thing that is going to change that it ends up
>> triggering the writeback earlier?
>
> Set the limit lowe?

I think you mean 'lower'. If I do that, what I think will happen
is that it will start the write-back earlier, but the writeback
once started will not keep up with the generation of data, possibly
because the throttling isn't going to work. Note that for
instance using ionice to set priority or class to 'idle'
has no effect. So, to test my hypothesis ...

>> Happy to test etc - what would you suggest, dirty_ratio=5,
>> dirty_background_ratio=2 ?
>
> These are measured in percentage. On the other hand if you use
> dirty_bytes resp. dirty_background_bytes then you get absolute numbers
> independent on the amount of memory.

... what would you suggest I set any of these to in order to test
(assuming the same box) so that it's 'low enough' that if it still
hangs, it's a bug rather than simply 'not low enough'? It's
an 8G box and clearly I'm happy to set either the _ratio or _bytes
entries.

--
Alex Bligh

2012-10-13 13:53:13

by Hillf Danton

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

Hi Alex,

On Sat, Oct 13, 2012 at 12:29 AM, Alex Bligh <[email protected]> wrote:
> Michal,
>
> --On 12 October 2012 16:58:39 +0200 Michal Hocko <[email protected]> wrote:
>
>> Once dirty_ratio (resp. dirty_bytes) limit is hit then the process which
>> writes gets throttled. If this is not the case then there is a bug in
>> the throttling code.
>
>
> I believe that is the problem.

Take a look at the "wait for writeback" problem please.

Linux 3.0+ Disk performance problem - wrong pdflush behaviour
https://lkml.org/lkml/2012/10/10/412

Hillf

2012-10-13 19:33:33

by Alex Bligh

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline



--On 13 October 2012 21:53:09 +0800 Hillf Danton <[email protected]> wrote:

> Take a look at the "wait for writeback" problem please.
>
> Linux 3.0+ Disk performance problem - wrong pdflush behaviour
> https://lkml.org/lkml/2012/10/10/412

I'm guessing that's related but may not be the whole story. My
test case is rather simpler, and Viktor says that with the
patch causing his regression reverted, "After I've set the dirty_bytes
over the file size the writes are never blocked.". That suggests
to me that in order to avoid write blocking he needs dirty_bytes
larger than the file size. As the bytes written in my test case
exceed RAM, that's going to be an issue, as dirty_bytes is always
going to be hit; I think in Viktor's case he is trying to avoid
it being hit at all.

Or perhaps I have the wrong end of the stick.

--
Alex Bligh

2012-10-14 02:43:09

by Hillf Danton

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On Sun, Oct 14, 2012 at 3:33 AM, Alex Bligh <[email protected]> wrote:
> Or perhaps I have the wrong end of the stick.

Never mind, just a friendly link :)

2012-10-14 21:17:39

by Dave Chinner

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On Thu, Oct 11, 2012 at 01:23:32PM +0100, Alex Bligh wrote:
> We have noticed significant I/O scheduling issues on both the CFQ and the
> deadline scheduler where a non-root user can starve any other process of
> any I/O for minutes at a time. The problem is more serious using CFQ but is
> still an effective local DoS vector using Deadline.
>
> A simple way to generate the problem is:
>
> dd if=/dev/zero bs=1M count=50000 | dd of=myfile bs=1M count=50000
>
> (note use of 2 dd's is to avoid alleged optimisation of the writing dd
> from /dev/zero). zcat-ing a large file with stdout redirected to a file
> produces a similar error. Using ionice to set idle priority makes no
> difference.
>
> To instrument the problem we produced a python script which does a MySQL
> select and update every 10 seconds, and time the execution of the update.
> This is normally milliseconds, but under user generated load conditions, we
> can take this to indefinite (on CFQ) and over a minute (on deadline).
> Postgres is affected in a similar manner (i.e. it is not MySQL specific).
> Simultaneously we have captured the output of 'vmstat 1 2' and
> /proc/meminfo, with appropriate timestamps.

Well, mysql is stuck in fsync(), so of course it's going to have
problems with write latency:

[ 3840.268303] [<ffffffff812650d5>] jbd2_log_wait_commit+0xb5/0x130
[ 3840.268308] [<ffffffff8108aa50>] ? add_wait_queue+0x60/0x60
[ 3840.268313] [<ffffffff81211248>] ext4_sync_file+0x208/0x2d0

And postgres gets stuck there too. So what you are seeing is likely
an ext4 problem, not an IO scheduler problem.

Suggestion: try the same test with XFS. If the problem still exists,
then it *might* be an ioscheduler problem. If it goes away, then
it's an ext4 problem.

Cheers,

Dave.

--
Dave Chinner
[email protected]

2012-10-15 08:17:21

by Michal Hocko

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On Fri 12-10-12 17:29:50, Alex Bligh wrote:
> Michal,
>
> --On 12 October 2012 16:58:39 +0200 Michal Hocko <[email protected]> wrote:
>
> >Once dirty_ratio (resp. dirty_bytes) limit is hit then the process which
> >writes gets throttled. If this is not the case then there is a bug in
> >the throttling code.
>
> I believe that is the problem.
>
> >>Isn't the only thing that is going to change that it ends up
> >>triggering the writeback earlier?
> >
> >Set the limit lowe?
>
> I think you mean 'lower'. If I do that, what I think will happen
> is that it will start the write-back earlier,

Yes this is primarily controlled by dirty_background_{bytes|ratio}.

> but the writeback once started will not keep up with the generation of
> data, possibly because the throttling isn't going to work.

This would be good to confirm.

> Note that for instance using ionice to set priority or class to 'idle'
> has no effect. So, to test my hypothesis ...

This has been tested with the original dirty_ratio configuration, right?

> >>Happy to test etc - what would you suggest, dirty_ratio=5,
> >>dirty_background_ratio=2 ?
> >
> >These are measured in percentage. On the other hand if you use
> >dirty_bytes resp. dirty_background_bytes then you get absolute numbers
> >independent on the amount of memory.
>
> ... what would you suggest I set any of these to in order to test
> (assuming the same box) so that it's 'low enough' that if it still
> hangs, it's a bug, rather than it's simply 'not low enough'. It's
> an 8G box and clearly I'm happy to set either the _ratio or _bytes
> entries.

I would use the _bytes variants, as you then have better control over
the amount of dirty data that can accumulate. You will need to
experiment a bit to tune this. Maybe somebody with more IO experience
can help you more with this.
I think what you see is related to your filesystem as well. Other
processes probably wait for fsync, but the amount of dirty data is so
big that it takes a really long time to finish.
--
Michal Hocko
SUSE Labs

2012-10-18 21:28:49

by Jan Kara

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

Hello,

On Fri 12-10-12 17:29:50, Alex Bligh wrote:
> --On 12 October 2012 16:58:39 +0200 Michal Hocko <[email protected]> wrote:
Let me explain a couple of things...

> >Once dirty_ratio (resp. dirty_bytes) limit is hit then the process which
> >writes gets throttled. If this is not the case then there is a bug in
> >the throttling code.
> I believe that is the problem.
So I believe the throttling works. What write throttling does is
throttle the process when more than dirty_bytes (or dirty_ratio) memory
would become dirty. That clearly works, as otherwise your testcase would
drive the machine out of memory. Now whenever some memory is cleaned by
writeback, your process is allowed to continue, so there is always
enough data to write.

Actually, the throttling is somewhat more clever and doesn't allow a dirty
hog (like your dd test) to use all of the dirtiable limit. Instead the hog
is throttled somewhat earlier, leaving some dirtiable memory to other
processes as well. Seeing that the mysql process gets blocked during
fsync(2) (and not during the write itself), this mechanism works right as
well.
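The loop described above can be caricatured in a few lines. This is a toy
model with made-up rates, not kernel code, and it ignores the per-task
scaling of the limit:

```python
def simulate_throttling(writer_rate, writeback_rate, limit, steps):
    # Toy model: a writer dirties pages until the dirty total reaches
    # the limit, at which point it is paused; the flusher cleans pages
    # at a fixed rate and the writer resumes once the total drops.
    dirty = 0
    throttled = 0
    for _ in range(steps):
        if dirty < limit:
            dirty += writer_rate       # buffered write into page cache
        else:
            throttled += 1             # writer paused (balance_dirty_pages)
        dirty = max(0, dirty - writeback_rate)  # writeback cleans pages
    return dirty, throttled

# A writer far faster than the disk is repeatedly paused, and dirty
# memory hovers around the limit instead of growing without bound.
dirty, throttled = simulate_throttling(
    writer_rate=100, writeback_rate=30, limit=800, steps=1000)
print(dirty, throttled)
```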

> >>Isn't the only thing that is going to change that it ends up
> >>triggering the writeback earlier?
> >
> >Set the limit lowe?
>
> I think you mean 'lower'. If I do that, what I think will happen
> is that it will start the write-back earlier, but the writeback
> once started will not keep up with the generation of data, possibly
> because the throttling isn't going to work. Note that for
> instance using ionice to set priority or class to 'idle'
> has no effect. So, to test my hypothesis ...
Yeah, ionice has its limitations. The problem is that all buffered
writes happen just into memory (so completely independently of ionice
settings). Subsequent writing of dirty memory to disk happens via the
flusher thread, which is a kernel process and doesn't know anything
about the IO priority set for the task which created the file. If you
wrote the file with oflag=direct or oflag=sync you would see that ionice
works as expected.

Now what *is* your problem is ext4 behaviour (proper list CCed), as Dave
Chinner correctly noted. Apparently the journal thread is not able to
commit a transaction for a long time. I've tried to reproduce your
results (admittedly replacing mysql with a simplistic "dd if=/dev/zero
of=file2 bs=1M count=1 conv=fsync") but I failed. fsync always returns
in a couple of seconds... What ext4 mount options do you use?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-10-18 22:14:11

by Chris Friesen

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On 10/18/2012 03:28 PM, Jan Kara wrote:

> Yeah, ionice has its limitations. The problem is that all buffered
> writes happen just into memory (so completely independently of ionice
> settings). Subsequent writing of dirty memory to disk happens using flusher
> thread which is a kernel process and it doesn't know anything about IO
> priority set for task which created the file. If you wrote the file with
> oflag=direct or oflag=sync you would see that ionice works as expected.

Has anyone looked at storing the ionice value with the buffered write
request such that the actual writes to disk could be sorted by priority
and done with the ionice level of the original caller?

Chris

2012-10-18 22:24:53

by Jan Kara

Subject: Re: Local DoS through write heavy I/O on CFQ & Deadline

On Thu 18-10-12 16:13:58, Chris Friesen wrote:
> On 10/18/2012 03:28 PM, Jan Kara wrote:
>
> > Yeah, ionice has its limitations. The problem is that all buffered
> >writes happen just into memory (so completely independently of ionice
> >settings). Subsequent writing of dirty memory to disk happens using flusher
> >thread which is a kernel process and it doesn't know anything about IO
> >priority set for task which created the file. If you wrote the file with
> >oflag=direct or oflag=sync you would see that ionice works as expected.
>
> Has anyone looked at storing the ionice value with the buffered
> write request such that the actual writes to disk could be sorted by
> priority and done with the ionice level of the original caller?
There's no such thing as a "buffered write request" in the kernel. When a
buffered write happens, data is just copied into the page cache. We could
attach a tag to each modified page in the page cache, but that would get
really expensive.

Essentially the same problem happens with cgroups, where buffered writes
are not accounted either. There we considered attaching a tag to inodes
(which doesn't work well if processes from different cgroups / with
different IO priority write to the same inode, but that's not that
common), which is reasonably cheap. But then you have to build smarts
into the flusher thread to prioritize inodes according to tags (you
cannot really let the flusher thread just submit IO with that priority,
because when it gets blocked it starves writeback with possibly higher
priority). Alternatively you could have a separate flusher thread per
cgroup / IO priority. That is easier from a code point of view, but
throughput suffers because of limited merging of IO. So all in all the
problem is known but hard to tackle.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR