2013-03-24 05:31:41

by Fredrik Tolf

Subject: I/O blocked while dirty pages are being flushed

Dear list,

I've got an mmapped file (a Berkeley DB region file) with an access
pattern such that it gets some 10-40 MBs of dirtied pages a couple of
times per minute. When the VM comes around to flush these pages to disk,
that causes loads of problems. Since the dirty pages are rather
interspersed in the file, the flusher posts batches of some 3000-5000
write requests to the disk queue, and since I'm using normal hard drives,
this might sometimes take 10-30 seconds to complete.
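
The access pattern described above can be imitated with a small sketch: dirty a number of randomly scattered pages in an mmapped file, then time how long a synchronous flush takes. This is only an illustration. The function name, the file sizes, and the use of an explicit msync() (instead of waiting for the background flusher) are assumptions and not part of the original setup.

```python
import mmap
import os
import random
import tempfile
import time

PAGE = mmap.PAGESIZE


def dirty_scattered_pages(path, file_pages, dirty_pages, seed=0):
    """Dirty `dirty_pages` randomly chosen pages of `path` and time the flush.

    Hypothetical helper: returns (number of pages dirtied, flush seconds).
    """
    size = file_pages * PAGE
    with open(path, "r+b") as f:
        f.truncate(size)  # make sure the file covers the whole mapping
        with mmap.mmap(f.fileno(), size) as m:
            rng = random.Random(seed)
            chosen = rng.sample(range(file_pages), dirty_pages)
            for page in chosen:
                m[page * PAGE] = 1  # writing one byte dirties the whole page
            start = time.monotonic()
            m.flush()  # msync(): returns once the dirty pages are written back
            return len(chosen), time.monotonic() - start


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        count, seconds = dirty_scattered_pages(path, 1024, 128)
        print(f"flushed {count} scattered pages in {seconds:.3f}s")
    finally:
        os.remove(path)
```

On rotating disks the flush time grows with how widely the dirtied pages are spread across the file, since each one may need its own seek.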

While this flush is running, I find that many a process goes into disk
sleep waiting for the flush to complete. This includes the process
manipulating the mmapped file whenever it tries to redirty a page
currently waiting to be flushed, but also, for instance, programs that
write() to log files (since, I guess, the buffer page backing the last
written portion of the log file is being flushed). The common culprits,
then, are sleep_on_page and sleep_on_buffer. All these processes commonly
block for up to several tens of seconds, then, which gets me all kinds of
trouble, as I'm sure you can see.
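
One way to observe this is to list the processes currently in disk sleep (state D) together with the kernel function they are waiting in. The following is a minimal sketch, assuming a Linux /proc filesystem that exposes /proc/<pid>/stat and /proc/<pid>/wchan; the helper names are made up for illustration, and wchan may read as 0 without sufficient privileges.

```python
import glob


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes():
    """Yield (pid, comm, wchan) for every process in uninterruptible sleep."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
            with open(path[:-len("stat")] + "wchan") as f:
                wchan = f.read().strip()
        except OSError:
            continue  # the process exited while we were looking at it
        if state == "D":
            yield pid, comm, wchan


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(pid, comm, wchan)
```

Run repeatedly during a flush, this should show the stalled processes with wchan values such as sleep_on_page or sleep_on_buffer on the kernels discussed here.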

I'd like to hear your opinion on this case. Is Berkeley DB at fault for
causing these kinds of access patterns? Is the kernel at fault for
blocking all these processes needlessly? Is the hardware at fault for
being so hopelessly slow and I should get with the times and get me some
SSDs? Or am I at fault for not finding the obvious configuration settings
to avoid the problem? :)

I'm inclined to think that the kernel is at fault for blocking the
processes needlessly. If the contents of the pages being flushed need to
be preserved until the write is completed, shouldn't they be copied when
written to, rather than blocking the writer for who-knows-how-long? It
seems that if the kernel doesn't do this, then I'm always put at the mercy
of the hardware, and as long as I have free memory, I shouldn't have to
be.

However, I could also see that Berkeley DB is somehow at fault for this
kind of access, causing such massive disk writes, and that perhaps it
should be using SysV SHM regions or such instead of disk-backed files?
Would it be possible, perhaps, to get these files treated more like
anonymous memory, their contents not being flushed back to disk unless
necessary?

It is worth noting, also, that this seems to be a situation introduced
somewhere between 2.6.26 and 2.6.32, because I started noticing it when I
upgraded from Debian 5.0 to 6.0. I've since tried it on 3.2.0, 3.5.4 and
3.7.1, and it appears in every version. However, I can't easily go back
and bisect, because the new init scripts don't support kernels older than
2.6.32, unfortunately.

I'm sorry, also, if this is the completely wrong list for such
discussions, but I couldn't find another one to match better.

Thanks for reading my wall of text!

--

Fredrik Tolf


2013-03-24 06:56:10

by Eric Wong

Subject: Re: I/O blocked while dirty pages are being flushed

Fredrik Tolf <[email protected]> wrote:
> It is worth noting, also, that this seems to be a situation
> introduced somewhere between 2.6.26 and 2.6.32, because I started
> noticing it when I upgraded from Debian 5.0 to 6.0. I've since tried
> it on 3.2.0, 3.5.4 and 3.7.1, and it appears in every version.
> However, I can't easily go back and bisect, because the new init
> scripts don't support kernels older than 2.6.32, unfortunately.

I'm not sure about Debian-specific changes to the kernel, but
in the stock kernel, the dirty*ratios changes could affect you:

before 2.6.22    dirty_ratio=40  dirty_background_ratio=10
2.6.22-2.6.29    dirty_ratio=10  dirty_background_ratio=5
2.6.30-...       dirty_ratio=20  dirty_background_ratio=10

So try lowering these sysctls to 2.6.26 levels (or lower) and see if
that helps.

Fwiw, I usually use dirty_ratio=2 dirty_background_ratio=1 on servers
with a few gigs of RAM (or appropriately low dirty*bytes values).
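
To get a feel for what these percentages mean in bytes, here is a rough sketch. It assumes the sysctls are readable under /proc/sys/vm, and it treats the amount of "dirtyable" memory as a plain input; the kernel derives that figure from free and reclaimable pages rather than from total RAM, so the numbers are only approximate. The function names are hypothetical.

```python
def read_vm_sysctl(name):
    """Read an integer sysctl such as dirty_ratio from /proc/sys/vm."""
    with open("/proc/sys/vm/" + name) as f:
        return int(f.read())


def dirty_thresholds(dirtyable_bytes, dirty_ratio, dirty_background_ratio):
    """Return (background, hard) dirty-page thresholds in bytes.

    background: roughly where background writeback starts.
    hard: roughly where writers themselves start being throttled.
    """
    background = dirtyable_bytes * dirty_background_ratio // 100
    hard = dirtyable_bytes * dirty_ratio // 100
    return background, hard


if __name__ == "__main__":
    dirtyable = 4 * 1024**3  # assumed: about 4 GiB of dirtyable memory
    ratio = read_vm_sysctl("dirty_ratio")
    background_ratio = read_vm_sysctl("dirty_background_ratio")
    background, hard = dirty_thresholds(dirtyable, ratio, background_ratio)
    print(f"background writeback from ~{background >> 20} MiB dirty")
    print(f"writers throttled from ~{hard >> 20} MiB dirty")
```

With the 2.6.30 defaults, a machine with a few gigabytes of dirtyable memory can accumulate hundreds of megabytes of dirty pages before writers are throttled.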

Lowering dirty*ratio helps servers get more consistent performance under
constant I/O pressure and aggressively throttles processes before a
large amount of dirty pages becomes a problem (as you've noticed).

High dirty*ratio is good for some bursty desktop workloads and some
benchmarks, though...

ref: commit 07db59bd6b0f279c31044cba6787344f63be87ea
ref: commit 1b5e62b42b55c509eea04c3c0f25e42c8b35b564


Heck, on a particularly bad server (2.6.18, pre-dirty_*bytes sysctl)
with lots of RAM and horrible disk throughput (~10M/s), I set both
dirty_writeback_centisecs and dirty_expire_centisecs to 100 to get
acceptable performance for writing HTTP access logs.

2013-03-24 10:27:38

by Bart Van Assche

Subject: Re: I/O blocked while dirty pages are being flushed

On 03/24/13 06:12, Fredrik Tolf wrote:
> While this flush is running, I find that many a process goes into disk
> sleep waiting for the flush to complete. This includes the process
> manipulating the mmapped file whenever it tries to redirty a page
> currently waiting to be flushed, but also, for instance, programs that
> write() to log files (since, I guess, the buffer page backing the last
> written portion of the log file is being flushed).

Had you already encountered this article: Jonathan Corbet, The trouble
with stable pages, March 13, 2012 (http://lwn.net/Articles/486311/) ?

Bart.

2013-03-25 07:31:52

by Fredrik Tolf

Subject: Re: I/O blocked while dirty pages are being flushed

On Sun, 24 Mar 2013, Bart Van Assche wrote:
> On 03/24/13 06:12, Fredrik Tolf wrote:
>> While this flush is running, I find that many a process goes into disk
>> sleep waiting for the flush to complete. This includes the process
>> manipulating the mmapped file whenever it tries to redirty a page
>> currently waiting to be flushed, but also, for instance, programs that
>> write() to log files (since, I guess, the buffer page backing the last
>> written portion of the log file is being flushed).
>
> Had you already encountered this article: Jonathan Corbet, The trouble with
> stable pages, March 13, 2012 (http://lwn.net/Articles/486311/) ?

I had not, but that certainly does seem to be the exact problem I'm
having. Thanks for the link; it was a very interesting read! Does anyone
know if any progress has been made since about any resolution of the
situation?

I notice linked mail threads with people saying that they have simply
removed the calls to wait_on_page_writeback to resolve their problems. Can
this still be considered safe, or have other systems than this
block-device checksumming started depending on stable pages since? Like
software RAID, for instance?

--

Fredrik Tolf

2013-03-25 07:34:55

by Fredrik Tolf

Subject: Re: I/O blocked while dirty pages are being flushed

On Sun, 24 Mar 2013, Eric Wong wrote:
> Fredrik Tolf <[email protected]> wrote:
>> It is worth noting, also, that this seems to be a situation
>> introduced somewhere between 2.6.26 and 2.6.32, because I started
>> noticing it when I upgraded from Debian 5.0 to 6.0. I've since tried
>> it on 3.2.0, 3.5.4 and 3.7.1, and it appears in every version.
>> However, I can't easily go back and bisect, because the new init
>> scripts don't support kernels older than 2.6.32, unfortunately.
>
> So try lowering these sysctls to 2.6.26 levels (or lower) and see if
> that helps.

Thanks for the tip, but since the page dirtying happens in fast bursts for
me, rather than gradually over time, that just caused the same sizes of
writes to happen more often instead, which only made it worse. :)

I'll continue investigating the stable-page route, instead, since that
seems to be my exact problem.

--

Fredrik Tolf

2013-04-09 22:11:36

by Jan Kara

Subject: Re: I/O blocked while dirty pages are being flushed

On Mon 25-03-13 08:31:43, Fredrik Tolf wrote:
> On Sun, 24 Mar 2013, Bart Van Assche wrote:
> >On 03/24/13 06:12, Fredrik Tolf wrote:
> >>While this flush is running, I find that many a process goes into disk
> >>sleep waiting for the flush to complete. This includes the process
> >>manipulating the mmapped file whenever it tries to redirty a page
> >>currently waiting to be flushed, but also, for instance, programs that
> >>write() to log files (since, I guess, the buffer page backing the last
> >>written portion of the log file is being flushed).
> >
> >Had you already encountered this article: Jonathan Corbet, The
> >trouble with stable pages, March 13, 2012
> >(http://lwn.net/Articles/486311/) ?
>
> I had not, but that certainly does seem to be the exact problem I'm
> having. Thanks for the link; it was a very interesting read! Does
> anyone know if any progress has been made since about any resolution
> of the situation?

Patches to fix the situation are already merged in Linus' tree and will
be in 3.9. So stay tuned...

> I notice linked mail threads with people saying that they have
> simply removed the calls to wait_on_page_writeback to resolve their
> problems. Can this still be considered safe, or have other systems
> than this block-device checksumming started depending on stable
> pages since? Like software RAID, for instance?

It should be safe. No one except HW really relies on stable pages (or
better, they do the copying if they do need it).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR