Hello!
On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
is that the x86-64 kernel has the following problem:
When I copy large files to any storage device, be it my HDD with ext4 partitions
or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
then flushes them some time later (quite unpredictably though) or immediately upon
invoking "sync".
How can I disable this memory cache altogether (or at least minimize caching)? When
running the i686 kernel with the same configuration I don't observe this effect - files get
written out almost immediately (for instance "sync" takes less than a second, whereas
on x86-64 it can take a dozen _minutes_ depending on the file size and storage
performance).
I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
- firstly this command is detrimental to the performance of my PC, secondly, it won't help
in this instance.
Swap is totally disabled, usually my memory is entirely free.
My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
Please, advise.
Best regards,
Artem
On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <[email protected]> wrote:
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".
Yeah, I think we default to a 10% "dirty background memory" (and
allow up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
of dirty memory for writeout before we even start writing, and twice
that before we start *waiting* for it.
On 32-bit x86, we only count the memory in the low 1GB (really
actually up to about 890MB), so "10% dirty" really means just about
90MB of buffering (and a "hard limit" of ~180MB of dirty).
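The arithmetic above can be sketched in shell. This is a rough estimate only: it applies the ratios to the whole memory size passed in, whereas the kernel applies them to dirtyable memory, so real thresholds come out somewhat lower. The helper name dirty_limits_mb and its arguments are made up for illustration, and the 10/20 defaults are the ones assumed in this message.

```shell
# dirty_limits_mb TOTAL_MB [BG_RATIO] [HARD_RATIO]
# Hypothetical helper: prints the background and hard dirty limits in MB.
dirty_limits_mb() {
    total_mb=$1
    bg_ratio=${2:-10}     # assumed default of vm.dirty_background_ratio
    hard_ratio=${3:-20}   # assumed default of vm.dirty_ratio
    echo "background=$((total_mb * bg_ratio / 100))MB hard=$((total_mb * hard_ratio / 100))MB"
}

dirty_limits_mb $((16 * 1024))   # 64-bit machine with 16GB counted
dirty_limits_mb 890              # 32-bit machine, ~890MB of low memory counted
```

The two calls reproduce the roughly 1.6GB/3.2GB and 90MB/180MB figures quoted above.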
And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
come from the old days of less memory (and perhaps servers that don't
much care), and the fact that x86-32 ends up having much lower limits
even if you end up having more memory.
You can easily tune it:
echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes
or similar. But you're right, we need to make the defaults much saner.
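To keep such limits across reboots, one option is a sysctl configuration fragment. The sketch below only prints the two lines; the helper name dirty_sysctl_conf and the file name in the usage comment are assumptions, not something proposed in this thread. vm.dirty_background_bytes and vm.dirty_bytes are the sysctl names of the two /proc/sys/vm files used above.

```shell
# dirty_sysctl_conf BACKGROUND_MB HARD_MB
# Hypothetical helper: prints sysctl.conf-style lines with byte-based dirty limits.
dirty_sysctl_conf() {
    bg_mb=$1
    hard_mb=$2
    printf 'vm.dirty_background_bytes = %d\n' $((bg_mb * 1024 * 1024))
    printf 'vm.dirty_bytes = %d\n' $((hard_mb * 1024 * 1024))
}

# Same 16MB / 48MB limits as the echo commands above.
dirty_sysctl_conf 16 48

# Possible use, as root (file name is an assumption):
#   dirty_sysctl_conf 16 48 > /etc/sysctl.d/90-dirty.conf
```

Writing a *_bytes value takes precedence over the corresponding *_ratio value, which then reads back as 0.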
Wu? Andrew? Comments?
Linus
Oct 25, 2013 02:18:50 PM, Linus Torvalds wrote:
On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote:
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>
>Yeah, I think we default to a 10% "dirty background memory" (and
>allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
>of dirty memory for writeout before we even start writing, and twice
>that before we start *waiting* for it.
>
>On 32-bit x86, we only count the memory in the low 1GB (really
>actually up to about 890MB), so "10% dirty" really means just about
>90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
>And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
>come from the old days of less memory (and perhaps servers that don't
>much care), and the fact that x86-32 ends up having much lower limits
>even if you end up having more memory.
>
>You can easily tune it:
>
> echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
> echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes
>
>or similar. But you're right, we need to make the defaults much saner.
>
>Wu? Andrew? Comments?
>
My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
more) this value becomes unrealistic (13GB) and I've already had some
unpleasant effects due to it.
I.e. when I dump a large MySQL database (its dump weighs around 10GB)
- it appears on the disk almost immediately, but then, later, when the kernel
decides to flush it to the disk, the server almost stalls and other IO requests
take a lot more time to complete even though mysqldump is run with ionice -c3,
so the use of ionice has no real effect.
Artem
On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[email protected]> wrote:
>
> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> more) this value becomes unrealistic (13GB) and I've already had some
> unpleasant effects due to it.
Right. The percentage notion really goes back to the days when we
typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
you wouldn't want to have more than one megabyte of dirty data, but if
you were "Mr Moneybags" and could afford 64MB, you might want to have
up to 8MB dirty!!
Things have changed.
So I would suggest we change the defaults. Or perhaps make the rule be
that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
The modern way of expressing the dirty limits is to give the actual
absolute byte amounts, but we default to the legacy ratio mode.
Linus
On Fri 131025, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[email protected]> wrote:
> >
> > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > more) this value becomes unrealistic (13GB) and I've already had some
> > unpleasant effects due to it.
>
> Right. The percentage notion really goes back to the days when we
> typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> you wouldn't want to have more than one megabyte of dirty data, but if
> you were "Mr Moneybags" and could afford 64MB, you might want to have
> up to 8MB dirty!!
>
> Things have changed.
>
> So I would suggest we change the defaults. Or perhaps make the rule be
> that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
>
> The modern way of expressing the dirty limits is to give the actual
> absolute byte amounts, but we default to the legacy ratio mode.
>
> Linus
Is it currently possible to somehow set the above values per block device?
I want the default behaviour for almost everything, but DVD drives in DVD+RW
packet writing mode may easily take several minutes in case of a sync.
Karl
On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote:
> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> more) this value becomes unrealistic (13GB) and I've already had some
> unpleasant effects due to it.
What I think would make sense is to dynamically measure the speed of
writeback, so that we can set these limits as a function of the device
speed. It's already the case that the writeback limits don't make
sense on a slow USB 2.0 storage stick; I suspect that for really huge
RAID arrays or very fast flash devices, it doesn't make much sense
either.
The problem is that if you have a system that has *both* a USB stick
_and_ a fast flash/RAID storage array both needing writeback, this
doesn't work well --- but what we have right now doesn't work all that
well anyway.
- Ted
On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" <[email protected]> wrote:
> What I think would make sense is to dynamically measure the speed of
> writeback, so that we can set these limits as a function of the device
> speed.
We attempt to do this now - have a look through struct backing_dev_info.
Apparently all this stuff isn't working as desired (and perhaps as designed)
in this case. Will take a look after a return to normalcy ;)
On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
<[email protected]> wrote:
>
> Apparently all this stuff isn't working as desired (and perhaps as designed)
> in this case. Will take a look after a return to normalcy ;)
It definitely doesn't work. I can trivially reproduce problems by just
having a cheap (==slow) USB key with an ext3 filesystem, and doing a
git clone to it. The end result is not pretty, and that's actually not
even a huge amount of data.
Linus
On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
<[email protected]> wrote:
> Hello!
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".
>
> How can I disable this memory cache altogether (or at least minimize caching)? When
> running the i686 kernel with the same configuration I don't observe this effect - files get
> written out almost immediately (for instance "sync" takes less than a second, whereas
> on x86-64 it can take a dozen of _minutes_ depending on a file size and storage
> performance).
What exactly is bothering you about this? The amount of memory used or the
time until data is flushed?
If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
look.
This defaults to 30 seconds (3000 centisecs).
You could make it smaller (providing you also shrink
dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
more quickly.
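Shrinking both timers by the same factor could look like the sketch below. It only prints the resulting sysctl lines; the helper name scale_flush_timers is hypothetical, and the 500 centisecs passed for dirty_writeback_centisecs is the usual default rather than a value stated in this message.

```shell
# scale_flush_timers EXPIRE_CS WRITEBACK_CS DIVISOR
# Hypothetical helper: divides both timers by the same factor and prints
# sysctl.conf-style lines, so the two values keep their original proportion.
scale_flush_timers() {
    expire=$1
    writeback=$2
    divisor=$3
    echo "vm.dirty_expire_centisecs = $((expire / divisor))"
    echo "vm.dirty_writeback_centisecs = $((writeback / divisor))"
}

# 30s expiry and 5s writeback interval, both made five times shorter.
scale_flush_timers 3000 500 5

# The current values can be read with:
#   cat /proc/sys/vm/dirty_expire_centisecs /proc/sys/vm/dirty_writeback_centisecs
```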
NeilBrown
>
> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
> in this instance.
>
> Swap is totally disabled, usually my memory is entirely free.
>
> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>
> Please, advise.
>
> Best regards,
>
> Artem
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Fri, 25 Oct 2013, NeilBrown wrote:
> On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
> <[email protected]> wrote:
>
>> Hello!
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>>
>> How can I disable this memory cache altogether (or at least minimize caching)? When
>> running the i686 kernel with the same configuration I don't observe this effect - files get
>> written out almost immediately (for instance "sync" takes less than a second, whereas
>> on x86-64 it can take a dozen of _minutes_ depending on a file size and storage
>> performance).
>
> What exactly is bothering you about this? The amount of memory used or the
> time until data is flushed?
actually, I think the problem is more the impact of the huge write later on.
David Lang
> If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
> look.
> This defaults to 30 seconds (3000 centisecs).
> You could make it smaller (providing you also shrink
> dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
> more quickly.
>
> NeilBrown
>
>
>>
>> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
>> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
>> in this instance.
>>
>> Swap is totally disabled, usually my memory is entirely free.
>>
>> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>>
>> Please, advise.
>>
>> Best regards,
>>
>> Artem
On Fri, 25 Oct 2013, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[email protected]> wrote:
>>
>> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
>> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
>> more) this value becomes unrealistic (13GB) and I've already had some
>> unpleasant effects due to it.
>
> Right. The percentage notion really goes back to the days when we
> typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> you wouldn't want to have more than one megabyte of dirty data, but if
> you were "Mr Moneybags" and could afford 64MB, you might want to have
> up to 8MB dirty!!
>
> Things have changed.
>
> So I would suggest we change the defaults. Or perhaps make the rule be
> that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
If you go this direction, allow ratios larger than 100%, some people may be
willing to have huge amounts of dirty data on large memory machines (if the load
is extremely bursty, they don't have other needs for I/O, or they have a very
fast storage system, as a few examples)
David Lang
Oct 25, 2013 05:26:45 PM, david wrote:
On Fri, 25 Oct 2013, NeilBrown wrote:
>
>>
>> What exactly is bothering you about this? The amount of memory used or the
>> time until data is flushed?
>
>actually, I think the problem is more the impact of the huge write later on.
Exactly. And not being able to use applications which show you IO performance
like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
my life without being able to see the progress of a copying operation. With the current
dirty cache there's no way to understand how your storage media actually behaves.
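One way to at least observe how much data is still waiting to reach the device is to watch the Dirty and Writeback counters in /proc/meminfo. This is a minimal sketch; the helper name dirty_kb is made up, and the counters are system-wide rather than per device.

```shell
# dirty_kb: reads /proc/meminfo-style text on stdin and prints only the
# "Dirty:" and "Writeback:" counters (values are in kB).
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %s kB\n", $1, $2 }'
}

# Usage while a copy is running:
#   while sleep 1; do dirty_kb < /proc/meminfo; done
```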
Hopefully this issue won't dissolve into obscurity and someone will actually make
up a plan (and a patch) how to make dirty write cache behave in a sane manner
considering the fact that there are devices with very different write speeds and
requirements. It'd be even better, if I could specify dirty cache as a mount option
(though sane defaults or semi-automatic values based on runtime estimates
won't hurt).
Per device dirty cache seems like a nice idea. I, for one, would like to disable it
altogether or make it an absolute minimum for things like USB flash drives - because
I don't care about multithreaded performance or delayed allocation on such devices -
I'm interested in my data reaching my USB stick ASAP - because it's how most people
use them.
Regards,
Artem
On Friday, 25 October 2013 18:26:23 Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> >actually, I think the problem is more the impact of the huge write later
> >on.
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but I
> cannot imagine my life without being able to see the progress of a copying
> operation. With the current dirty cache there's no way to understand how
> your storage media actually behaves.
This is a problem I also have been suffering for a long time. It's not so much
how much and when the system syncs dirty data, but how unresponsive the
desktop becomes when it happens (usually, with rsync + large files). Most
programs become completely unresponsive, especially if they have a large memory
consumption (e.g. the browser). I need to pause rsync and wait until the
system writes out all dirty data if I want to do simple things like scrolling
or do any action that uses I/O, otherwise I need to wait minutes.
I have 16 GB of RAM and excluding the browser (which usually uses about half
of a GB) and KDE itself, there are no memory hogs, so it seems like it's
something that shouldn't happen. I can understand that I/O operations are
laggy when there is some other intensive I/O ongoing, but right now the system
becomes completely unresponsive. If I am unlucky and Konsole also becomes
unresponsive, I need to switch to a VT (which also takes time).
I haven't reported it before in part because I didn't know how to do it, "my
browser stalls" is not a very useful description and I didn't know what kind
of data I'm supposed to report.
On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
<[email protected]> wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this? The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
>
> Exactly. And not being able to use applications which show you IO performance
> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> my life without being able to see the progress of a copying operation. With the current
> dirty cache there's no way to understand how your storage media actually behaves.
So fix Midnight Commander. If you want the copy to be actually finished when
it says it is finished, then it needs to call 'fsync()' at the end.
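From the shell, a copy that only returns once the data has been flushed can be approximated with dd, whose conv=fsync option calls fsync() on the output file before exiting. This is a sketch assuming GNU dd (for the 1M block size suffix); the helper name copy_durable is made up.

```shell
# copy_durable SRC DST
# Hypothetical helper: copies SRC to DST and returns only after dd has
# fsync()ed DST, so "finished" means the data was handed to the device.
copy_durable() {
    dd if="$1" of="$2" bs=1M conv=fsync 2>/dev/null
}

# Usage (paths are examples only):
#   copy_durable big.iso /mnt/usb/big.iso && echo "really done"
```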
>
> Hopefully this issue won't dissolve into obscurity and someone will actually make
> up a plan (and a patch) how to make dirty write cache behave in a sane manner
> considering the fact that there are devices with very different write speeds and
> requirements. It'd be even better, if I could specify dirty cache as a mount option
> (though sane defaults or semi-automatic values based on runtime estimates
> won't hurt).
>
> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> altogether or make it an absolute minimum for things like USB flash drives - because
> I don't care about multithreaded performance or delayed allocation on such devices -
> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> use them.
>
As has already been said, you can substantially disable the cache by tuning
down various values in /proc/sys/vm/.
Have you tried?
NeilBrown
Oct 26, 2013 02:44:07 AM, neil wrote:
On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
>>
>> Exactly. And not being able to use applications which show you IO performance
>> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
>> my life without being able to see the progress of a copying operation. With the current
>> dirty cache there's no way to understand how your storage media actually behaves.
>
>So fix Midnight Commander. If you want the copy to be actually finished when
>it says it is finished, then it needs to call 'fsync()' at the end.
This sounds like a very bad joke. How are applications supposed to show and
calculate an _average_ write speed if there are no kernel calls/ioctls to actually
make the kernel flush dirty buffers _during_ copying? Actually it's a good way to
solve this problem in user space - alas, even if such calls are implemented, user
space will start using them only in 2018 if not further from that.
>>
>> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
>> altogether or make it an absolute minimum for things like USB flash drives - because
>> I don't care about multithreaded performance or delayed allocation on such devices -
>> I'm interested in my data reaching my USB stick ASAP - because it's how most people
>> use them.
>>
>
>As has already been said, you can substantially disable the cache by tuning
>down various values in /proc/sys/vm/.
>Have you tried?
I don't understand who you are replying to. I asked about per device settings, you are
again referring me to system wide settings - they don't look that good if we're talking
about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense
to allocate 20% of physical RAM for things which don't belong to it in the first place.
I don't know any other OS which has a similar behaviour.
And like people (including me) have already mentioned, such a huge dirty cache can
stall their PCs/servers for a considerable amount of time.
Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also
not everyone in this world has a UPS - which means such a huge buffer can lead to a
serious data loss in case of a power blackout.
Regards,
Artem
On Fri, 25 Oct 2013 21:03:44 +0000 (UTC) "Artem S. Tashkinov"
<[email protected]> wrote:
> Oct 26, 2013 02:44:07 AM, neil wrote:
> On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
> >>
> >> Exactly. And not being able to use applications which show you IO performance
> >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> >> my life without being able to see the progress of a copying operation. With the current
> >> dirty cache there's no way to understand how your storage media actually behaves.
> >
> >So fix Midnight Commander. If you want the copy to be actually finished when
> >it says it is finished, then it needs to call 'fsync()' at the end.
>
> This sounds like a very bad joke. How applications are supposed to show and
> calculate an _average_ write speed if there are no kernel calls/ioctls to actually
> make the kernel flush dirty buffers _during_ copying? Actually it's a good way to
> solve this problem in user space - alas, even if such calls are implemented, user
> space will start using them only in 2018 if not further from that.
But there is a way to flush dirty buffers *during* copies.
man 2 sync_file_range
If giving precise feedback is of paramount importance to you, then this would
be the interface to use.
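sync_file_range() is a C interface with no direct shell equivalent, so the sketch below swaps in a coarser technique: it copies in fixed-size chunks and lets dd fsync() the destination after each chunk, which gives progress figures that reflect data actually flushed. It assumes GNU dd (conv=fsync, iflag=fullblock); the helper name copy_with_progress and the 8MB default chunk size are made up.

```shell
# copy_with_progress SRC DST [CHUNK_MB]
# Hypothetical helper: chunked copy with an fsync() after every chunk.
copy_with_progress() {
    src=$1
    dst=$2
    chunk_mb=${3:-8}
    size=$(( $(wc -c < "$src") ))
    chunk=$((chunk_mb * 1024 * 1024))
    : > "$dst"                      # create or truncate the destination
    i=0
    while [ $((i * chunk)) -lt "$size" ]; do
        # Copy chunk number i to the same offset, then flush the file.
        dd if="$src" of="$dst" bs="$chunk" count=1 skip="$i" seek="$i" \
           iflag=fullblock conv=notrunc,fsync 2>/dev/null || return 1
        i=$((i + 1))
        done_bytes=$((i * chunk))
        [ "$done_bytes" -gt "$size" ] && done_bytes=$size
        echo "flushed $done_bytes / $size bytes"
    done
}
```

Flushing after every chunk trades throughput for accurate feedback; a larger chunk size reduces the cost at the price of coarser progress steps.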
>
> >>
> >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> >> altogether or make it an absolute minimum for things like USB flash drives - because
> >> I don't care about multithreaded performance or delayed allocation on such devices -
> >> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> >> use them.
> >>
> >
> >As has already been said, you can substantially disable the cache by tuning
> >down various values in /proc/sys/vm/.
> >Have you tried?
>
> I don't understand who you are replying to. I asked about per device settings, you are
> again referring me to system wide settings - they don't look that good if we're talking
> about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense
> to allocate 20% of physical RAM for things which don't belong to it in the first place.
Sorry, missed the per-device bit.
You could try playing with
/sys/class/bdi/XX:YY/max_ratio
where XX:YY is the major/minor number of the device, so 8:0 for /dev/sda.
Wind it right down for slow devices and you might get something like what you
want.
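A small sketch of how the sysfs path for a given disk could be derived. It relies on /sys/class/block/NAME/dev containing the MAJOR:MINOR pair, and it only prints the command instead of running it; the helper names bdi_path and limit_bdi are hypothetical. The device name should be the whole disk (sdb), not a partition (sdb1).

```shell
# bdi_path MAJOR:MINOR
# Prints the max_ratio file for a backing device, e.g. 8:0 for /dev/sda.
bdi_path() {
    echo "/sys/class/bdi/$1/max_ratio"
}

# limit_bdi DISK RATIO
# Dry run: looks up the device number of DISK and prints the command
# that would cap its share of the dirty cache at RATIO percent.
limit_bdi() {
    devno=$(cat "/sys/class/block/$1/dev") || return 1
    echo "echo $2 > $(bdi_path "$devno")"
}

# Usage (device name is an example):
#   limit_bdi sdb 1      # review the printed command, then run it as root
```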
>
> I don't know any other OS which has a similar behaviour.
I don't know about the internal details of any other OS, so I cannot really
comment.
>
> And like people (including me) have already mentioned, such a huge dirty cache can
> stall their PCs/servers for a considerable amount of time.
Yes. But this is a different issue.
There are two very different issues that should be kept separate.
One is that when "cp" or similar completes, the data hasn't all been written out
yet. It typically takes another 30 seconds before the flush will complete.
You seemed to primarily complain about this, so that is what I originally
addressed. That is where the "dirty_*_centisecs" values apply.
The other, quite separate, issue is that Linux will cache more dirty data
than it can write out in a reasonable time. All the tuning parameters refer
to the amount of data (whether as a percentage of RAM or as a number of
bytes), but what people really care about is a number of seconds.
As you might imagine, estimating how long it will take to write out a certain
amount of data is highly non-trivial. The relationship between megabytes and
seconds can be non-linear and can change over time.
Caching nothing at all can hurt a lot of workloads. Caching too much can
obviously hurt too. Caching "5 seconds" worth of data would be ideal, but
would be incredibly difficult to implement.
Keeping a sliding estimate of device throughput for each device might be
feasible, and using that to automatically adjust the "max_ratio" value (or
some related internal thing) might be a 70% solution.
Certainly it would be an interesting project for someone.
>
> Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also
> not everyone in this world has an UPS - which means such a huge buffer can lead to a
> serious data loss in case of a power blackout.
I don't have a desk (just a lap), but I use Linux on all my computers and
I've never really noticed the problem. Maybe I'm just very patient, or maybe
I don't work with large data sets and slow devices.
However I don't think data-loss is really a related issue. Any process that
cares about data safety *must* use fsync at appropriate places. This has
always been true.
NeilBrown
>
> Regards,
>
> Artem
On Fri, Oct 25, 2013 at 02:29:37AM -0700, Andrew Morton wrote:
> On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" <[email protected]> wrote:
>
> > What I think would make sense is to dynamically measure the speed of
> > writeback, so that we can set these limits as a function of the device
> > speed.
>
> We attempt to do this now - have a look through struct backing_dev_info.
To be exact, it's backing_dev_info.write_bandwidth which is estimated
in bdi_update_write_bandwidth() and exported as "BdiWriteBandwidth" in
debugfs file bdi.stats.
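For reference, a sketch of pulling that figure out of the stats text. It assumes debugfs is mounted at /sys/kernel/debug, that the per-device file is /sys/kernel/debug/bdi/MAJOR:MINOR/stats, and that the line has the form "BdiWriteBandwidth: <number> kBps"; the helper name bdi_write_bandwidth is made up.

```shell
# bdi_write_bandwidth: reads bdi stats text on stdin and prints the
# estimated write bandwidth (the number following "BdiWriteBandwidth:").
bdi_write_bandwidth() {
    awk '$1 == "BdiWriteBandwidth:" { print $2 }'
}

# Usage, as root (8:0 is /dev/sda on a typical system):
#   bdi_write_bandwidth < /sys/kernel/debug/bdi/8:0/stats
```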
> Apparently all this stuff isn't working as desired (and perhaps as designed)
> in this case. Will take a look after a return to normalcy ;)
Right. The write bandwidth is only estimated and used once the
background dirty threshold is reached and hence the disk is actively
doing writeback IO -- which is the case in which we can do a reasonable
estimation of the writeback bandwidth.
Note that this estimated BdiWriteBandwidth may better be named
"writeback" bandwidth because it may change depending on the workload
at the time -- eg. sequential vs. random writes; whether there are
parallel reads or direct IO competing for disk time.
BdiWriteBandwidth is only designed for use by the dirty throttling
logic and is not generally useful/reliable for other purposes.
It's a bit late and I'd like to take the original question along as an
exercise for tomorrow's flights. :)
Thanks,
Fengguang
On Fri, Oct 25, 2013 at 05:18:42AM -0400, Theodore Ts'o wrote:
> On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote:
> > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > more) this value becomes unrealistic (13GB) and I've already had some
> > unpleasant effects due to it.
>
> What I think would make sense is to dynamically measure the speed of
> writeback, so that we can set these limits as a function of the device
> speed. It's already the case that the writeback limits don't make
> sense on a slow USB 2.0 storage stick; I suspect that for really huge
> RAID arrays or very fast flash devices, it doesn't make much sense
> either.
>
> The problem is that if you have a system that has *both* a USB stick
> _and_ a fast flash/RAID storage array both needing writeback, this
> doesn't work well --- but what we have right now doesn't work all that
> well anyway.
Ted, when trying to follow up your email, I got a crazy idea and it'd
be better to throw it out rather than carrying it to bed. :)
We could do per-bdi dirty thresholds - which has been proposed 1-2
times before by different people.
The per-bdi dirty thresholds could be auto set by the kernel this way:
start it with an initial value of 100MB. When reached, put all the
100MB dirty data to IO and get an estimation of the write bandwidth.
From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
where N is the seconds of dirty data we'd like to cache in memory.
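The N * bdi_write_bandwidth rule is simple arithmetic; a sketch with the two device speeds mentioned earlier in the thread (a 3MB/sec flash drive and a 500MB/sec SSD) and N = 5 seconds. The helper name bdi_dirty_threshold_mb is hypothetical, and this is not kernel code.

```shell
# bdi_dirty_threshold_mb BANDWIDTH_MB_PER_S SECONDS
# Prints the per-device dirty threshold in MB for the proposed rule
# threshold = N * bdi_write_bandwidth.
bdi_dirty_threshold_mb() {
    echo $(($1 * $2))
}

bdi_dirty_threshold_mb 3 5      # slow USB stick
bdi_dirty_threshold_mb 500 5    # fast SSD
```

With these inputs the slow stick would be allowed 15MB of dirty data and the SSD 2500MB, instead of both sharing one global limit.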
Thanks,
Fengguang
On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote:
> On Friday, 25 October 2013 18:26:23 Artem S. Tashkinov wrote:
> > Oct 25, 2013 05:26:45 PM, david wrote:
> > >actually, I think the problem is more the impact of the huge write later
> > >on.
> > Exactly. And not being able to use applications which show you IO
> > performance like Midnight Commander. You might prefer to use "cp -a" but I
> > cannot imagine my life without being able to see the progress of a copying
> > operation. With the current dirty cache there's no way to understand how
> > your storage media actually behaves.
>
>
> This is a problem I also have been suffering for a long time. It's not so much
> how much and when the system syncs dirty data, but how unresponsive the
> desktop becomes when it happens (usually, with rsync + large files). Most
> programs become completely unresponsive, especially if they have a large memory
> consumption (e.g. the browser). I need to pause rsync and wait until the
> system writes out all dirty data if I want to do simple things like scrolling
> or do any action that uses I/O, otherwise I need to wait minutes.
That's a problem. And it's kind of independent of the dirty threshold
-- if you are doing large file copies in the background, it will lead
to continuous disk writes and stalls anyway -- the large dirty threshold
merely delays the write IO time.
> I have 16 GB of RAM and excluding the browser (which usually uses about half
> of a GB) and KDE itself, there are no memory hogs, so it seems like it's
> something that shouldn't happen. I can understand that I/O operations are
> laggy when there is some other intensive I/O ongoing, but right now the system
> becomes completely unresponsive. If I am unlucky and Konsole also becomes
> unresponsive, I need to switch to a VT (which also takes time).
>
> I haven't reported it before in part because I didn't know how to do it, "my
> browser stalls" is not a very useful description and I didn't know what kind
> of data I'm supposed to report.
What's the kernel you are running? And it's writing to a hard disk?
The stalls are most likely caused by either one of
1) write IO starves read IO
2) direct page reclaim blocked when
- trying to writeout PG_dirty pages
- trying to lock PG_writeback pages
Which may be confirmed by running
ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
or
echo w > /proc/sysrq-trigger # and check dmesg
during the stalls. The latter command works more reliably.
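To pick out the interesting rows of the ps output, one can filter for tasks in uninterruptible sleep (state D), which is where tasks blocked on IO or page locks show up. A minimal sketch; the helper name filter_blocked is made up, and it assumes the column order of the ps command above (STAT is the fourth column).

```shell
# filter_blocked: reads the output of
#   ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
# on stdin and keeps the header plus tasks whose STAT starts with "D".
filter_blocked() {
    awk 'NR == 1 || $4 ~ /^D/'
}

# Usage during a stall:
#   ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | filter_blocked
```

The WCHAN column of the remaining rows names the kernel function each task is sleeping in.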
Thanks,
Fengguang
On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:
>
> Ted, when trying to follow up your email, I got a crazy idea and it'd
> be better to throw it out rather than carrying it to bed. :)
>
> We could do per-bdi dirty thresholds - which has been proposed 1-2
> times before by different people.
>
> The per-bdi dirty thresholds could be auto set by the kernel this way:
> start it with an initial value of 100MB. When reached, put all the
> 100MB dirty data to IO and get an estimation of the write bandwidth.
> From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> where N is the seconds of dirty data we'd like to cache in memory.
Sure, although I wonder if it would be worth it to calculate some kind of
rolling average of the write bandwidth while we are doing writeback,
so if it turns out we got unlucky with the contents of the first 100MB
of dirty data (it could be either highly random or highly sequential)
then we'll eventually correct to the right level.
This means that VM would have to keep dirty page counters for each BDI
--- which I thought we weren't doing right now, which is why we have a
global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I
have cause and effect reversed? :-)
- Ted
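The idea discussed in this exchange (start at 100MB, then set the threshold to N seconds of measured bandwidth, smoothed by a rolling average) can be sketched as below. This is only an illustration of the proposal, not the kernel's writeback code; the class name, the exponentially weighted average and the `alpha` smoothing factor are all assumptions made for the example.

```python
class BandwidthEstimator:
    """Rolling estimate of one device's write bandwidth and its dirty threshold."""

    def __init__(self, initial_threshold=100 * 2**20, seconds_of_dirty=3.0, alpha=0.25):
        self.bandwidth = None                       # bytes/second, unknown at first
        self.initial_threshold = initial_threshold  # the "start with 100MB" value
        self.seconds_of_dirty = seconds_of_dirty    # the "N" from the proposal
        self.alpha = alpha                          # weight given to each new sample

    def record(self, bytes_written, elapsed_seconds):
        """Fold one completed writeback burst into the rolling average."""
        sample = bytes_written / elapsed_seconds
        if self.bandwidth is None:
            self.bandwidth = sample                 # first measurement seeds the average
        else:
            self.bandwidth += self.alpha * (sample - self.bandwidth)

    def dirty_threshold(self):
        """Bytes of dirty data to allow: N seconds' worth at the estimated speed."""
        if self.bandwidth is None:
            return self.initial_threshold
        return int(self.seconds_of_dirty * self.bandwidth)
```

With a small `alpha` an unlucky first 100MB (very random or very sequential) is corrected only slowly; a larger `alpha` reacts faster but makes the threshold jumpier.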
On Fri 2013-10-25 10:32:16, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> <[email protected]> wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case. Will take a look after a return to normalcy ;)
>
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.
Hmm, I'd expect the result to be "dead USB key". Putting
ext3 on a cheap flash device normally just kills the device :-(.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sat, Oct 26, 2013 at 4:32 AM, Pavel Machek <[email protected]> wrote:
>
> Hmm, I'd expect the result to be "dead USB key". Putting
> ext3 on a cheap flash device normally just kills the device :-(.
Not my experience. It may be true for some really cheap devices, but
normal USB keys seem to just get really slow, probably due to having
had their flash rewrite algorithm tuned for FAT accesses.
I *do* suspect that to see the really bad behavior, you don't write
just one large file to it, but many smaller ones. "git clone" will
check out all the kernel tree files, obviously.
Linus
On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> On Fri 131025, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[email protected]> wrote:
> > >
> > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > > more) this value becomes unrealistic (13GB) and I've already had some
> > > unpleasant effects due to it.
> >
> > Right. The percentage notion really goes back to the days when we
> > typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> > you wouldn't want to have more than one megabyte of dirty data, but if
> > you were "Mr Moneybags" and could afford 64MB, you might want to have
> > up to 8MB dirty!!
> >
> > Things have changed.
> >
> > So I would suggest we change the defaults. Or perhaps make the rule be
> > that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> > semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
> >
> > The modern way of expressing the dirty limits is to give the actual
> > absolute byte amounts, but we default to the legacy ratio mode..
> >
> > Linus
>
> Is it currently possible to somehow set above values per block device?
Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
the maximum proportion the device's dirty data can take from the total
amount. The caveat currently is that this setting only takes effect after
we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
total because that is an amount of dirty data when we start to throttle
processes. So if the device you'd like to limit is the only one which is
currently written to, the limiting doesn't have a big effect.
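To put a number on that caveat, here is a small sketch that computes the point where throttling starts and writes a per-device cap. It assumes the 10%/20% default ratios mentioned earlier in the thread and takes the dirtyable memory as a plain argument (the kernel derives it from free and reclaimable pages, so real values are somewhat lower than total RAM). Both function names are hypothetical.

```python
from pathlib import Path


def throttle_onset_bytes(dirtyable_bytes, background_ratio=10, dirty_ratio=20):
    """Dirty data at which processes start being throttled: the midpoint of the limits."""
    return dirtyable_bytes * (background_ratio + dirty_ratio) // 200


def set_max_ratio(device, percent, sysfs_root="/sys"):
    """Write a cap to /sys/block/<device>/bdi/max_ratio (needs root on a real system)."""
    if not 0 <= percent <= 100:
        raise ValueError("max_ratio is a percentage")
    path = Path(sysfs_root) / "block" / device / "bdi" / "max_ratio"
    path.write_text(f"{percent}\n")
    return path


# Usage (as root), e.g. for the DVD writer discussed in this thread:
#   set_max_ratio("sr0", 1)
#   throttle_onset_bytes(16 * 10**9)   # about 2.4GB before max_ratio starts to matter
```

The second call is the interesting one: on a 16GB machine with default ratios, a single slow device can accumulate on the order of gigabytes before the per-device cap is even consulted.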
Andrew has queued up a patch series from Maxim Patlasov which removes this
caveat, but currently there is no way for the admin to switch that on from
userspace. I'd like to have that tunable from userspace for exactly the
cases you describe below.
> I want default behaviour for almost everything but DVD drives in DVD+RW
> packet writing mode may easily take several minutes in case of a sync.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 25-10-13 19:37:53, Ted Tso wrote:
> On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:
> >
> > Ted, when trying to follow up your email, I got a crazy idea and it'd
> > be better to throw it out rather than carrying it to bed. :)
> >
> > We could do per-bdi dirty thresholds - which has been proposed 1-2
> > times before by different people.
> >
> > The per-bdi dirty thresholds could be auto set by the kernel this way:
> > start it with an initial value of 100MB. When reached, put all the
> > 100MB dirty data to IO and get an estimation of the write bandwidth.
> > From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> > where N is the seconds of dirty data we'd like to cache in memory.
>
> Sure, although I wonder if it would be worth it to calculate some kind of
> rolling average of the write bandwidth while we are doing writeback,
> so if it turns out we got unlucky with the contents of the first 100MB
> of dirty data (it could be either highly random or highly sequential)
> then we'll eventually correct to the right level.
We already average the measured throughput over a longer time window,
using a kind of rolling-average algorithm.
> This means that VM would have to keep dirty page counters for each BDI
> --- which I thought we weren't doing right now, which is why we have a
> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I
> have cause and effect reversed? :-)
And we do currently keep the number of dirty & under writeback pages per
BDI. We have global limits because mm wants to limit the total number of dirty
pages (as those are harder to free). It doesn't care as much to which device
these pages belong (although it probably should care a bit more because
there are huge differences in how quickly different devices can get rid
of dirty pages).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <[email protected]> wrote:
> Andrew has queued up a patch series from Maxim Patlasov which removes this
> caveat but currently we don't have a way admin can switch that from
> userspace. But I'd like to have that tunable from userspace exactly for the
> cases as you describe below.
This?
commit 5a53748568f79641eaf40e41081a2f4987f005c2
Author: Maxim Patlasov <[email protected]>
AuthorDate: Wed Sep 11 14:22:46 2013 -0700
Commit: Linus Torvalds <[email protected]>
CommitDate: Wed Sep 11 15:58:04 2013 -0700
mm/page-writeback.c: add strictlimit feature
That's already in mainline, for 3.12.
On Fri 25-10-13 18:26:23, Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this? The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
>
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but
> I cannot imagine my life without being able to see the progress of a
> copying operation. With the current dirty cache there's no way to
> understand how your storage media actually behaves.
Large writes shouldn't stall your desktop, that's certain and we must fix
that. I don't find the problem with copy progress indicators that
pressing...
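For anyone who does want to see how much data is still sitting in the cache during a copy, the Dirty and Writeback counters in /proc/meminfo give a rough picture. A minimal sketch, assuming the usual "Name:   value kB" layout of that file; the function name is made up for the example.

```python
def dirty_and_writeback_kb(meminfo_text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            values[name] = int(rest.split()[0])   # first token after the colon is the number
    return values


# Usage: poll once a second while copying and watch the numbers drain.
#   import time
#   while True:
#       with open("/proc/meminfo") as f:
#           print(dirty_and_writeback_kb(f.read()))
#       time.sleep(1)
```

Note that these are system-wide totals, so they do not say which device the dirty pages belong to.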
> Hopefully this issue won't dissolve into obscurity and someone will
> actually make up a plan (and a patch) how to make dirty write cache
> behave in a sane manner considering the fact that there are devices with
> very different write speeds and requirements. It'd be even better if I
> could specify dirty cache as a mount option (though sane defaults or
> semi-automatic values based on runtime estimates won't hurt).
>
> Per device dirty cache seems like a nice idea, I, for one, would like to
> disable it altogether or make it an absolute minimum for things like USB
> flash drives - because I don't care about multithreaded performance or
> delayed allocation on such devices - I'm interested in my data reaching
> my USB stick ASAP - because it's how most people use them.
See my other emails in this thread. There are ways to tune the amount of
dirty data allowed per device. Currently the result isn't very satisfactory
but we should have something usable after the next merge window.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> <[email protected]> wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case. Will take a look after a return to normalcy ;)
>
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.
I'll try to reproduce this tomorrow so that I can have a look at where
exactly we are stuck. But in the last few releases problems like this were
caused by problems in reclaim, which got fed up with seeing lots of dirty
/ under-writeback pages and ended up stuck waiting for IO to finish. Mel
has been tweaking the logic here and there but maybe it hasn't got fixed
completely. Mel, do you know about any outstanding issues?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 29-10-13 13:43:46, Andrew Morton wrote:
> On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <[email protected]> wrote:
>
> > Andrew has queued up a patch series from Maxim Patlasov which removes this
> > caveat but currently we don't have a way admin can switch that from
> > userspace. But I'd like to have that tunable from userspace exactly for the
> > cases as you describe below.
>
> This?
>
> commit 5a53748568f79641eaf40e41081a2f4987f005c2
> Author: Maxim Patlasov <[email protected]>
> AuthorDate: Wed Sep 11 14:22:46 2013 -0700
> Commit: Linus Torvalds <[email protected]>
> CommitDate: Wed Sep 11 15:58:04 2013 -0700
>
> mm/page-writeback.c: add strictlimit feature
>
> That's already in mainline, for 3.12.
Yes, I should have checked the code...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara <[email protected]> wrote:
> On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
>>
>> It definitely doesn't work. I can trivially reproduce problems by just
>> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
>> git clone to it. The end result is not pretty, and that's actually not
>> even a huge amount of data.
>
> I'll try to reproduce this tomorrow so that I can have a look where
> exactly are we stuck. But in last few releases problems like this were
> caused by problems in reclaim which got fed up by seeing lots of dirty
> / under writeback pages and ended up stuck waiting for IO to finish. Mel
> has been tweaking the logic here and there but maybe it hasn't got fixed
> completely. Mel, do you know about any outstanding issues?
I'm not sure this has ever worked, and in the last few years the
common desktop memory size has continued to grow.
For servers and "serious" desktops, having tons of dirty data doesn't
tend to be as much of a problem, because those environments are pretty
much defined by also having fairly good IO subsystems, and people
seldom use crappy USB devices for more than doing things like reading
pictures off them etc. And you'd not even see the problem under any
such load.
But it's actually really easy to reproduce by just taking your average
USB key and trying to write to it. I just did it with a random ISO
image, and it's _painful_. And it's not that it's painful for doing
most other things in the background, but if you just happen to run
anything that does "sync" (and it happens in scripts), the thing just
comes to a screeching halt. For minutes.
Same obviously goes with trying to eject/unmount the media etc.
We've had this problem before with the whole "ratio of dirty memory"
thing. It was a mistake. It made sense (and came from) back in the
days when people had 16MB or 32MB of RAM, and the concept of "let's
limit dirty memory to x% of that" was actually fairly reasonable. But
that "x%" doesn't make much sense any more. x% of 16GB (which is quite
the reasonable amount of memory for any modern desktop) is a huge
thing, and in the meantime the performance of disks has gone up a lot
(largely thanks to SSD's), but the *minimum* performance of disks
hasn't really improved all that much (largely thanks to USB ;).
So how about we just admit that the whole "ratio" thing was a big
mistake, and tell people that if they want to set a dirty limit, they
should do so in bytes? Which we already really do, but we default to
that ratio nevertheless. Which is why I'd suggest we just say "the
ratio works fine up to a certain amount, and makes no sense past it".
Why not make that "the ratio works fine up to a certain amount, and
makes no sense past it" be part of the calculations? We actually
*have* exactly that on HIGHMEM machines, where we have this
configuration option of "vm_highmem_is_dirtyable" that defaults to
off. It just doesn't trigger on nonhighmem machines (today: "64-bit").
So I would suggest that we just expose that "vm_highmem_is_dirtyable"
on 64-bit too, and just say that anything over 1GB is highmem. That
means that 32-bit and 64-bit environments will basically act the same,
and I think it makes the defaults a bit saner.
Limiting the amount of dirty memory to 100MB/200MB (for "start
background writing" and "wait synchronously" respectively) even if you
happen to have 16GB of memory sounds like a good idea. Sure, it might
make some benchmarks a bit slower, but it will at least avoid the
"wait forever" symptom. And if you really have a very studly IO
subsystem, the fact that it starts writing out earlier won't really be
a problem.
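A rough sketch of what this proposal does to the defaults, assuming the 10%/20% ratios and, for simplicity, treating all of RAM as dirtyable (the kernel works from dirtyable memory, which is smaller). The function name and the `dirtyable_cap` parameter are invented for the illustration; this is not the attached patch.

```python
GIB = 2**30


def default_dirty_limits(ram_bytes, dirtyable_cap=None,
                         background_ratio=10, dirty_ratio=20):
    """Return (background_limit, hard_limit) in bytes for the given RAM size.

    With dirtyable_cap set, anything above the cap is treated like highmem
    and does not count towards the ratio calculation.
    """
    dirtyable = ram_bytes if dirtyable_cap is None else min(ram_bytes, dirtyable_cap)
    return (dirtyable * background_ratio // 100,
            dirtyable * dirty_ratio // 100)


# 16GiB machine, current behaviour: roughly 1.6GiB / 3.2GiB
current = default_dirty_limits(16 * GIB)
# Same machine with everything above 1GiB ignored: roughly 102MiB / 205MiB
capped = default_dirty_limits(16 * GIB, dirtyable_cap=1 * GIB)
```

Machines with 1GiB of RAM or less get the same limits either way, so the cap only changes behaviour where the ratio has already stopped making sense.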
After all, there are two reasons to do delayed writes:
 - temp-files may not be written out at all.
   Quite frankly, if you have multi-hundred-megabyte tempfiles, you've
   got issues
 - coalescing writes improves throughput
   There are very much diminishing returns, and the big return is to
   make sure that we write things out in a good order, which a 100MB
   buffer should make more than possible.
So I really think that it's insane to default to 1.6GB of dirty data
before you even start writing it out if you happen to have 16GB of
memory.
And again: if your benchmark is to create a kernel tree and then
immediately delete it, and you used to do that without doing any
actual IO, then yes, the attached patch will make that go much slower.
But for that benchmark, maybe you should just set the dirty limits (in
bytes) by hand, rather than expect the default kernel values to prefer
benchmarks over sanity?
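Setting the limits in bytes by hand goes through the vm.dirty_background_bytes and vm.dirty_bytes sysctls. A minimal sketch of doing that from a script, assuming root privileges; the helper name and the `proc_root` parameter (there only so the function can be tried against a scratch directory) are made up for the example.

```python
from pathlib import Path


def set_dirty_limits(background_bytes, dirty_bytes, proc_root="/proc"):
    """Write absolute dirty limits to <proc_root>/sys/vm/dirty_*bytes."""
    if not 0 < background_bytes < dirty_bytes:
        raise ValueError("need 0 < background_bytes < dirty_bytes")
    vm = Path(proc_root) / "sys" / "vm"
    # Writing a *_bytes value makes the kernel ignore the matching *_ratio.
    (vm / "dirty_background_bytes").write_text(f"{background_bytes}\n")
    (vm / "dirty_bytes").write_text(f"{dirty_bytes}\n")


# Usage (as root): start background writeback at 100MiB, block writers at 200MiB.
#   set_dirty_limits(100 * 2**20, 200 * 2**20)
```

The values do not survive a reboot; making them permanent would mean putting the same two settings into the distribution's sysctl configuration.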
Suggested patch attached. Comments?
Linus
On Tue, Oct 29, 2013 at 1:43 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <[email protected]> wrote:
>
>> Andrew has queued up a patch series from Maxim Patlasov which removes this
>> caveat but currently we don't have a way admin can switch that from
>> userspace. But I'd like to have that tunable from userspace exactly for the
>> cases as you describe below.
>
> This?
>
> mm/page-writeback.c: add strictlimit feature
>
> That's already in mainline, for 3.12.
Nothing currently actually *sets* the BDI_CAP_STRICTLIMIT flag, though.
So it's a potential fix, but it's certainly not a fix now.
Linus
On Tue 29-10-13 14:33:53, Linus Torvalds wrote:
> On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara <[email protected]> wrote:
> > On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> >>
> >> It definitely doesn't work. I can trivially reproduce problems by just
> >> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> >> git clone to it. The end result is not pretty, and that's actually not
> >> even a huge amount of data.
> >
> > I'll try to reproduce this tomorrow so that I can have a look where
> > exactly are we stuck. But in last few releases problems like this were
> > caused by problems in reclaim which got fed up by seeing lots of dirty
> > / under writeback pages and ended up stuck waiting for IO to finish. Mel
> > has been tweaking the logic here and there but maybe it hasn't got fixed
> > completely. Mel, do you know about any outstanding issues?
>
> I'm not sure this has ever worked, and in the last few years the
> common desktop memory size has continued to grow.
>
> For servers and "serious" desktops, having tons of dirty data doesn't
> tend to be as much of a problem, because those environments are pretty
> much defined by also having fairly good IO subsystems, and people
> seldom use crappy USB devices for more than doing things like reading
> pictures off them etc. And you'd not even see the problem under any
> such load.
>
> But it's actually really easy to reproduce by just taking your average
> USB key and trying to write to it. I just did it with a random ISO
> image, and it's _painful_. And it's not that it's painful for doing
> most other things in the background, but if you just happen to run
> anything that does "sync" (and it happens in scripts), the thing just
> comes to a screeching halt. For minutes.
Yes, I agree that caching more than a couple of seconds' worth of writeback
for a device isn't good.
> Same obviously goes with trying to eject/unmount the media etc.
>
> We've had this problem before with the whole "ratio of dirty memory"
> thing. It was a mistake. It made sense (and came from) back in the
> days when people had 16MB or 32MB of RAM, and the concept of "let's
> limit dirty memory to x% of that" was actually fairly reasonable. But
> that "x%" doesn't make much sense any more. x% of 16GB (which is quite
> the reasonable amount of memory for any modern desktop) is a huge
> thing, and in the meantime the performance of disks have gone up a lot
> (largely thanks to SSD's), but the *minimum* performance of disks
> hasn't really improved all that much (largely thanks to USB ;).
>
> So how about we just admit that the whole "ratio" thing was a big
> mistake, and tell people that if they want to set a dirty limit, they
> should do so in bytes? Which we already really do, but we default to
> that ratio nevertheless. Which is why I'd suggest we just say "the
> ratio works fine up to a certain amount, and makes no sense past it".
>
> Why not make that "the ratio works fine up to a certain amount, and
> makes no sense past it" be part of the calculations. We actually
> *have* exactly that on HIGHMEM machines, where we have this
> configuration option of "vm_highmem_is_dirtyable" that defaults to
> off. It just doesn't trigger on nonhighmem machines (today: "64-bit").
>
> So I would suggest that we just expose that "vm_highmem_is_dirtyable"
> on 64-bit too, and just say that anything over 1GB is highmem. That
> means that 32-bit and 64-bit environments will basically act the same,
> and I think it makes the defaults a bit saner.
>
> Limiting the amount of dirty memory to 100MB/200MB (for "start
> background writing" and "wait synchronously" respectively) even if you
> happen to have 16GB of memory sounds like a good idea. Sure, it might
> make some benchmarks a bit slower, but it will at least avoid the
> "wait forever" symptom. And if you really have a very studly IO
> subsystem, the fact that it starts writing out earlier won't really be
> a problem.
So I think we both realize this is only about what the default should be.
There will always be people who have loads which benefit from setting dirty
limits high but I agree they are a minority. The reason why we left the
limits at what they are now despite them making less and less sense is that
we didn't want to break user expectations. If we cap the dirty limits as
you suggest, I bet we'll get some user complaints and the "don't break users"
policy thus tells me we shouldn't make such changes ;)
Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
but I think we should experiment with numbers a bit to check whether we
didn't miss something.
> After all, there are two reasons to do delayed writes:
>
> - temp-files may not be written out at all.
>
> Quite frankly, if you have multi-hundred-megabyte tempfiles, you've
> got issues
Actually people do stuff like this e.g. when generating ISO images before
burning them.
> - coalescing writes improves throughput
>
> There are very much diminishing returns, and the big return is to
> make sure that we write things out in a good order, which a 100MB
> buffer should make more than possible.
True.
There is one more aspect:
- transforming random writes into mostly sequential writes
Various userspace programs use simple memory-mapped databases which do
random writes into their data files. The less often you write these back, the
better (at least from a throughput POV). I'm not sure how large these
files are in total on an average user's desktop, but my guess would be that
100 MB *should* be enough for them. Can anyone with a GNOME / KDE desktop try
running with limits set this low for some time?
> so I really think that it's insane to default to 1.6GB of dirty data
> before you even start writing it out if you happen to have 16GB of
> memory.
>
> And again: if your benchmark is to create a kernel tree and then
> immediately delete it, and you used to do that without doing any
> actual IO, then yes, the attached patch will make that go much slower.
> But for that benchmark, maybe you should just set the dirty limits (in
> bytes) by hand, rather than expect the default kernel values to prefer
> benchmarks over sanity?
>
> Suggested patch attached. Comments?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara <[email protected]> wrote:
>
> So I think we both realize this is only about what the default should be.
Yes. Most people will use the defaults, but there will always be
people who tune things for particular loads.
In fact, I think we have gone much too far in saying "all policy in
user space", because the fact is, user space isn't very good at
policy. Especially not at reacting to complex situations with
different devices. From what I've seen, "policy in user space" has
resulted in exactly two modes:
- user space does something stupid and wrong (example: "nice -19 X"
to work around some scheduler oddities)
- user space does nothing at all, and the kernel people say "hey,
user space _could_ set this value Xyz, so it's not our problem, and
it's policy, so we shouldn't touch it".
I think we in the kernel should say "our defaults should be what
everybody sane can use, and they should work fine on average". With
"policy in user space" being for crazy people that do really odd
things and can really spare the time to tune for their particular
issue.
So the "policy in user space" should be about *overriding* kernel
policy choices, not about the kernel never having them.
And this kind of "you can have many different devices and they act
quite differently" is a good example of something complicated that
user space really doesn't have a great model for. And we actually have
much better possible information in the kernel than user space ever is
likely to have.
> Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
> but I think we should experiment with numbers a bit to check whether we
> didn't miss something.
Sure. That said, the patch I suggested basically makes the numbers be
at least roughly comparable across different architectures. So it's
been at least somewhat tested, even if 16GB x86-32 machines are
hopefully pretty rare (but I hear about people installing 32-bit on
modern machines much too often).
>> - temp-files may not be written out at all.
>>
>> Quite frankly, if you have multi-hundred-megabyte tempfiles, you've
>> got issues
> Actually people do stuff like this e.g. when generating ISO images before
> burning them.
Yes, but then the temp-file is long-lived enough that it *will* hit
the disk anyway. So it's only the "create temporary file and pretty
much immediately delete it" case that changes behavior (ie compiler
assembly files etc).
If the temp-file is for something like burning an ISO image, the
burning part is slow enough that the temp-file will hit the disk
regardless of when we start writing it.
> There is one more aspect:
> - transforming random writes into mostly sequential writes
Sure. And I think that if you have a big database, that's when you do
end up tweaking the dirty limits.
That said, I'd certainly like it even *more* if the limits really were
per-BDI, and the global limit was in addition to the per-bdi ones.
Because when you have a USB device that gets maybe 10MB/s on
contiguous writes, and 100kB/s on random 4k writes, I think it would
make more sense to make the "start writeout" limits be 1MB/2MB, not
100MB/200MB. So my patch doesn't even take it far enough, it's just a
"let's not be ridiculous". The per-BDI limits don't seem quite ready
for prime time yet, though. Even the new "strict" limits seems to be
more about "trusted filesystems" than about really sane writeback
limits.
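The arithmetic behind those numbers can be sketched as follows, taking the 10MB/s and 100kB/s figures above at face value and assuming, as a simplification, that the device drains dirty data at a constant rate. The helper name is made up for the example.

```python
KB = 10**3
MB = 10**6


def flush_seconds(dirty_bytes, bytes_per_second):
    """Time a sync would take if the device drained dirty data at a steady rate."""
    return dirty_bytes / bytes_per_second


sequential = 10 * MB     # contiguous writes on the USB stick
random_4k = 100 * KB     # random 4k writes on the same stick

flush_seconds(1600 * MB, sequential)   # 160 s: today's 1.6GB background limit, best case
flush_seconds(100 * MB, sequential)    # 10 s: the proposed 100MB limit, best case
flush_seconds(100 * MB, random_4k)     # 1000 s: the same 100MB if the writes are random
flush_seconds(1 * MB, random_4k)       # 10 s: a 1MB per-device limit, worst case
```

So a 100MB global cap keeps the sequential case tolerable, but only a limit in the low megabytes bounds the wait for a slow device doing random writes.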
Fengguang, comments?
(And I added Maxim to the cc, since he's the author of the strict
mode, and while it is currently limited to FUSE, he did mention USB
storage in the commit message..).
Linus
Oct 30, 2013 02:41:01 AM, Jack wrote:
On Fri 25-10-13 19:37:53, Ted Tso wrote:
>> Sure, although I wonder if it would be worth it to calculate some kind of
>> rolling average of the write bandwidth while we are doing writeback,
>> so if it turns out we got unlucky with the contents of the first 100MB
>> of dirty data (it could be either highly random or highly sequential)
>> then we'll eventually correct to the right level.
> We already do average measured throughput over a longer time window and
>have kind of rolling average algorithm doing some averaging.
>
>> This means that VM would have to keep dirty page counters for each BDI
>> --- which I thought we weren't doing right now, which is why we have a
>> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I
>> have cause and effect reversed? :-)
> And we do currently keep the number of dirty & under writeback pages per
>BDI. We have global limits because mm wants to limit the total number of dirty
>pages (as those are harder to free). It doesn't care as much to which device
>these pages belong (although it probably should care a bit more because
>there are huge differences between how quickly can different devices get rid
>of dirty pages).
This might sound like an absolutely stupid question which makes no sense at
all, so I want to apologize for it in advance, but since the Linux kernel lacks
revoke(), does that mean that dirty buffers will always occupy the kernel memory
if I for instance remove my USB stick before the kernel has had the time to flush
these buffers?
On Tue, Oct 29, 2013 at 09:57:56PM +0100, Jan Kara wrote:
> On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> > <[email protected]> wrote:
> > >
> > > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > > in this case. Will take a look after a return to normalcy ;)
> >
> > It definitely doesn't work. I can trivially reproduce problems by just
> > having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> > git clone to it. The end result is not pretty, and that's actually not
> > even a huge amount of data.
>
> I'll try to reproduce this tomorrow so that I can have a look where
> exactly are we stuck. But in last few releases problems like this were
> caused by problems in reclaim which got fed up by seeing lots of dirty
> / under writeback pages and ended up stuck waiting for IO to finish. Mel
> has been tweaking the logic here and there but maybe it hasn't got fixed
> completely. Mel, do you know about any outstanding issues?
>
Yeah, there are still a few. The work in that general area dealt with
such problems as dirty pages reaching the end of the LRU (excessive CPU
usage), calling wait_on_page_writeback from reclaim context (random
processes stalling even though there was not much memory pressure),
desktop applications stalling randomly (second quick write stalling on
stable writeback). The systemtap script caught those types of areas and I
believe they are fixed up.
There are still problems though. If all dirty pages were backed by a slow
device then dirty limiting is still eventually going to cause stalls in
dirty page balancing. If there is a global sync then the shit can really
hit the fan if it all gets stuck waiting on something like journal space.
Applications that are very fsync happy can still get stalled for long
periods of time behind slower writers as they wait for the IO to flush.
When all this happens there may still be spikes in CPU usage if it scans
the dirty pages excessively without sleeping.
Consciously or unconsciously my desktop applications generally do not fall
foul of these problems. At least one of the desktop environments can stall
because it calls fsync on history and preference files constantly but I
cannot remember which one or if it has been fixed since. I did have a problem
with gnome-terminal as it depended on a library that implemented scrollback
buffering by writing single-line files to /tmp and then truncating them
which would "freeze" the terminal under IO. I now use tmpfs for /tmp to
get around this. When I'm writing to USB sticks I think it tends to stay
between the point where background writing starts and dirty throttling
occurs so I rarely notice any major problems. I'm probably unconsciously
avoiding doing any write-heavy work while a USB stick is plugged in.
Addressing this goes back to tuning dirty ratio or replacing it. Tuning
it always falls foul of "works for one person and not another" and fails
utterly when there is storage with different speeds. We talked about this a
few months ago but I still suspect that we will have to bite the bullet and
tune based on "do not dirty more data than it takes N seconds to writeback"
using per-bdi writeback estimations. It's just not that trivial to implement
as the writeback speeds can change for a variety of reasons (multiple IO
sources, random vs sequential etc). Hence at one point we think we are
within our target window and then get it completely wrong. Dirty ratio
is a hard guarantee, dirty writeback estimation is best-effort that will
go wrong in some cases.
--
Mel Gorman
SUSE Labs
On Wed 30-10-13 10:07:08, Artem S. Tashkinov wrote:
> Oct 30, 2013 02:41:01 AM, Jack wrote:
> On Fri 25-10-13 19:37:53, Ted Tso wrote:
> >> Sure, although I wonder if it would be worth it to calculate some kind of
> >> rolling average of the write bandwidth while we are doing writeback,
> >> so if it turns out we got unlucky with the contents of the first 100MB
> >> of dirty data (it could be either highly random or highly sequential)
> >> then we'll eventually correct to the right level.
> > We already do average measured throughput over a longer time window and
> >have kind of rolling average algorithm doing some averaging.
> >
> >> This means that VM would have to keep dirty page counters for each BDI
> >> --- which I thought we weren't doing right now, which is why we have a
> >> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I
> >> have cause and effect reversed? :-)
> > And we do currently keep the number of dirty & under writeback pages per
> >BDI. We have global limits because mm wants to limit the total number of dirty
> >pages (as those are harder to free). It doesn't care as much to which device
> >these pages belong (although it probably should care a bit more because
> >there are huge differences between how quickly can different devices get rid
> >of dirty pages).
>
> This might sound like an absolutely stupid question which makes no sense at
> all, so I want to apologize for it in advance, but since the Linux kernel lacks
> revoke(), does that mean that dirty buffers will always occupy the kernel memory
> if I for instance remove my USB stick before the kernel has had the time to flush
> these buffers?
That's actually a good question. And the answer is that currently when we
hit EIO while writing out dirty data, we just throw away that data. Not
an ideal solution for some cases but it solves the problem with unwriteable
data...
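
For applications this means a plain write() returning success says nothing
about the data reaching the device. As a rough sketch in Python (the helper
name write_durably is made up for this example, and I am assuming the usual
behaviour where a writeback failure is reported by fsync() as EIO; error
reporting details vary between kernel versions):

```python
import errno
import os


def write_durably(path, data):
    """Write data and return True only if fsync() reports no error."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Force writeback now so that a failure shows up here instead of
        # the dirty pages being dropped silently later.
        os.fsync(fd)
        return True
    except OSError as exc:
        if exc.errno in (errno.EIO, errno.ENOSPC):
            return False  # data may not have reached the device
        raise
    finally:
        os.close(fd)
```

If this returns False (for instance because the stick was pulled), the caller
should treat the file as lost and not assume a retry of fsync() will help.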
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 131029, Jan Kara wrote:
> On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> > On Fri 131025, Linus Torvalds wrote:
....
> > Is it currently possible to somehow set above values per block device?
> Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
> the maximum proportion the device's dirty data can take from the total
> amount. The caveat currently is that this setting only takes effect after
> we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
> total because that is an amount of dirty data when we start to throttle
> processes. So if the device you'd like to limit is the only one which is
> currently written to, the limiting doesn't have a big effect.
Thanks for the info - that's what I was looking for.
You are right that the limiting doesn't have a big effect right now:
on my 4x speed DVD+RW on /dev/sr0, x86_64, 4GB,
Fedora19:
max_ratio set to 100 - about 500MB buffered, sync time 2:10 min.
max_ratio set to 1 - about 330MB buffered, sync time 1:23 min.
... way too much buffering.
(measured with strace -tt -ewrite dd if=/dev/zero of=bigfile bs=1M count=1000
by looking at the timestamps).
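
A back-of-the-envelope check of these numbers, sketched in Python. Two
assumptions: the machine runs with the defaults mentioned earlier in the
thread (dirty_background_ratio=10, dirty_ratio=20), and the ratios are applied
to the full 4GB, although the kernel really applies them to dirtyable memory,
which is somewhat less. The function names are made up for this example.

```python
MB = 10 ** 6
MiB = 2 ** 20


def freerun_threshold(dirtyable_bytes, background_ratio, dirty_ratio):
    """Dirty data below which processes are not throttled at all."""
    return dirtyable_bytes * (background_ratio + dirty_ratio) / 2 / 100


def effective_rate(buffered_bytes, sync_seconds):
    """Average device throughput implied by a sync of buffered data."""
    return buffered_bytes / sync_seconds


# Upper bound for a 4GB machine with the assumed 10/20 defaults.
threshold_mib = freerun_threshold(4096 * MiB, 10, 20) / MiB  # about 614 MiB

# The two measurements above: 2:10 min = 130 s, 1:23 min = 83 s.
rate_100 = effective_rate(500 * MB, 130) / MB  # about 3.8 MB/s
rate_1 = effective_rate(330 * MB, 83) / MB     # about 4.0 MB/s
```

So both runs stay below the point where throttling starts (roughly 600MB
here), which fits the explanation that max_ratio has little effect when only
one device is being written to. The drive itself manages about 4 MB/s in
both cases.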
Karl
....
Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR