2010-11-02 11:52:59

by Sanjoy Mahajan

Subject: Re: 2.6.36 io bring the system to its knees

Chris Mason <[email protected]> wrote:

> > This has the appearance of some really bad IO or VM latency
> > problem. Unfixed and present in stable kernel versions going from
> > years ago all the way to v2.6.36.
>
> Hmmm, the workload you're describing here has two special parts.
> First it dramatically overloads the disk, and then it has guis doing
> things waiting for the disk.

I think I see this same issue every few days when I back up my hard
drive to a USB hard drive using rsync. While the backup is running, the
interactive response is bad. A reproducible measurement of the badness
is starting an rxvt with F8 (bound to "rxvt &" in my .twmrc). Often it
takes 8 seconds for the window to appear (as it just did about 2 minutes
ago)! (Starting a subsequent rxvt is quick.)

The command for running the backup:

rsync -av --delete /etc /home /media/usbdrive/bak > /tmp/homebackup.log
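# (-a archive mode, -v verbose; --delete removes files from the backup
#  that no longer exist in the source)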

The hardware is a T60 w/ Intel graphics and wireless, 1.5 GB RAM, and a
5400 rpm 160 GB hard drive w/ ext3 filesystems, running vanilla 2.6.36.
There's not much memory pressure: the swap is mostly empty, and the
biggest consumer is usually a Firefox eating 500 MB of RAM. Even Emacs
at 50 MB is in the noise compared to the Firefox.

Here's the 'free' output:

             total       used       free     shared    buffers     cached
Mem:       1545292    1500288      45004          0      92848     713988
-/+ buffers/cache:     693452     851840
Swap:      2000088      22680    1977408

What tests or probes are worth running when the problem reappears in
order to find the root cause?
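
For instance, I could capture something like the following while the
stall is happening (assuming iostat from sysstat and the usual vmstat
are installed), though I don't know which numbers would point at the
culprit:

# per-device utilization and request latencies, one sample per second
iostat -x 1 > /tmp/iostat.log &
# run queue, swap, and block-I/O activity, one sample per second
vmstat 1 > /tmp/vmstat.log &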

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb


2010-11-02 13:15:05

by Chris Mason

Subject: Re: 2.6.36 io bring the system to its knees

On Tue, Nov 02, 2010 at 07:47:15AM -0400, Sanjoy Mahajan wrote:
> Chris Mason <[email protected]> wrote:
>
> > > This has the appearance of some really bad IO or VM latency
> > > problem. Unfixed and present in stable kernel versions going from
> > > years ago all the way to v2.6.36.
> >
> > Hmmm, the workload you're describing here has two special parts.
> > First it dramatically overloads the disk, and then it has guis doing
> > things waiting for the disk.
>
> I think I see this same issue every few days when I back up my hard
> drive to a USB hard drive using rsync. While the backup is running, the
> interactive response is bad. A reproducible measurement of the badness
> is starting an rxvt with F8 (bound to "rxvt &" in my .twmrc). Often it
> takes 8 seconds for the window to appear (as it just did about 2 minutes
> ago)! (Starting a subsequent rxvt is quick.)

So this sounds like the backup is just thrashing your cache. Latencies
starting an app are less surprising than latencies where a running app
doesn't respond at all.

Does rsync have the option to do an fadvise DONTNEED?

-chris

2010-11-04 16:05:57

by Sanjoy Mahajan

Subject: Re: 2.6.36 io bring the system to its knees

> So this sounds like the backup is just thrashing your cache.

I think it's more than that. Starting an rxvt shouldn't take 8 seconds,
even with a cold cache. Actually, it does take a while, so you do have
a point. I just did

echo 3 > /proc/sys/vm/drop_caches
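# (3 = free the page cache plus the dentries and inodes)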

and then started rxvt. That takes about 3 seconds (which seems long,
but I don't know where that slowness lies), of which maybe 0.25
seconds is loading and running 'date':

$ time rxvt -e date
real 0m2.782s
user 0m0.148s
sys 0m0.032s

The 8-second delay during the rsync must have at least two causes: (1)
the cache is wiped out, and (2) the rxvt binary cannot be paged in
quickly because the disk is doing lots of other I/O.

Can the system somehow know that paging in the rxvt binary and shared
libraries is interactive I/O, because it was started by an interactive
process, and therefore should take priority over the rsync?

> Does rsync have the option to do an fadvise DONTNEED?

I couldn't find one. It would be good to have a solution that is
independent of the backup app. (The 'locate' cron job thrashes the
interactive response in the same way.)
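
One app-independent thing I could try (assuming the disk uses the CFQ
I/O scheduler, which the idle class requires) is running the backup at
idle I/O priority, so it gets the disk only when nothing else wants it:

# class 3 = idle: rsync's I/O proceeds only when the disk is otherwise idle
ionice -c3 rsync -av --delete /etc /home /media/usbdrive/bak > /tmp/homebackup.log

Though that throttles the backup rather than prioritizing the
interactive reads.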

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

2010-11-04 23:35:38

by Steven Barrett

Subject: Re: 2.6.36 io bring the system to its knees

On 11/04/2010 11:05 AM, Sanjoy Mahajan wrote:
>> So this sounds like the backup is just thrashing your cache.
>
> I think it's more than that. Starting an rxvt shouldn't take 8 seconds,
> even with a cold cache. Actually, it does take a while, so you do have
> a point. I just did
>
> echo 3 > /proc/sys/vm/drop_caches
>
> and then started rxvt. That takes about 3 seconds (which seems long,
> but I don't know where that slowness lies), of which maybe 0.25
> seconds is loading and running 'date':
>
> $ time rxvt -e date
> real 0m2.782s
> user 0m0.148s
> sys 0m0.032s
>
> The 8-second delay during the rsync must have at least two causes: (1)
> the cache is wiped out, and (2) the rxvt binary cannot be paged in
> quickly because the disk is doing lots of other I/O.
>
> Can the system somehow know that paging in the rxvt binary and shared
> libraries is interactive I/O, because it was started by an interactive
> process, and therefore should take priority over the rsync?
>
>> Does rsync have the option to do an fadvise DONTNEED?
>
> I couldn't find one. It would be good to have a solution that is
> independent of the backup app. (The 'locate' cron job thrashes the
> interactive response in the same way.)

I'm definitely no expert in Linux's file cache management, but from
what I've experienced... isn't the real problem that the "interactive"
processes, like your web browser or file manager, lose their inode and
dentry cache when rsync runs? Then, while rsync is busy reading and
writing to the disk, clicking on an interactive application forces it
to re-read from the disk whatever it lost to rsync, competing with the
rsync that is still thrashing the inode/dentry cache.

This is a major problem even when my system has lots of RAM (4 GB on
this laptop).

What has helped me, however, is reducing vm.vfs_cache_pressure to a
smaller value (25 here) so that Linux prefers to retain the current
inode/dentry cache rather than suddenly give it up to a greedy,
I/O-heavy program. The only side effect is that file copying is a
little slower than usual... totally worth it, though.
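
For reference, this is how I set it here; the sysctl.conf line makes
the setting survive a reboot:

# apply immediately (as root)
sysctl -w vm.vfs_cache_pressure=25
# make it persistent across reboots
echo 'vm.vfs_cache_pressure = 25' >> /etc/sysctl.conf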


Steven Barrett