2009-04-28 04:56:04

by Elladan

Subject: Swappiness vs. mmap() and interactive response

Hi,

So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
and then I did the following (with XFS over LVM):

mv /500gig/of/data/on/disk/one /disk/two

This quickly caused the system to. grind.. to... a.... complete..... halt.
Basically every UI operation, including the mouse in Xorg, started experiencing
multiple second lag and delays. This made the system essentially unusable --
for example, just flipping to the window where the "mv" command was running
took 10 seconds on more than one occasion. Basically a "click and get coffee"
interface.

There was no particular kernel CPU load -- the SATA DMA seemed fine.

If I actively used the GUI, then the pieces I was using would work better, but
they'd start experiencing astonishing latency again if I just let the UI sit
for a little while. From this, I diagnosed that the problem was probably
related to the VM paging out my GUI.

Next, I set the following:

echo 0 > /proc/sys/vm/swappiness

... hoping it would prevent paging out of the UI in favor of file data that's
only used once. It did appear to help to a small degree, but not much. The
system is still effectively unusable while a file copy is going on.

From this, I diagnosed that most likely, the kernel was paging out all my
application file mmap() data (such as my executables and shared libraries) in
favor of total garbage VM load from the file copy.

I don't know how to verify that this is true definitively. Are there some
magic numbers in /proc I can look at? However, I did run latencytop, and it
showed massive 2000+ msec latency in the page fault handler, as well as in
various operations such as XFS read.

Could this be something else? There were some long delays in latencytop from
various apps doing fsync as well, but it seems unlikely that this would destroy
latency in Xorg, and again, latency improved whenever I touched an app, for
that app.

Is there any way to fix this, short of rewriting the VM myself? For example,
is there some way I could convince this VM that pages with active mappings are
valuable?

Thanks.


2009-04-28 05:35:42

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

(cc to linux-mm and Rik)


> Hi,
>
> So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> and then I did the following (with XFS over LVM):
>
> mv /500gig/of/data/on/disk/one /disk/two
>
> This quickly caused the system to. grind.. to... a.... complete..... halt.
> Basically every UI operation, including the mouse in Xorg, started experiencing
> multiple second lag and delays. This made the system essentially unusable --
> for example, just flipping to the window where the "mv" command was running
> took 10 seconds on more than one occasion. Basically a "click and get coffee"
> interface.

I have some question and request.

1. please post your /proc/meminfo
2. Do above copy make tons swap-out? IOW your disk read much faster than write?
3. cache limitation of memcgroup solve this problem?
4. Which disk have your /bin and /usr/bin?



>
> There was no particular kernel CPU load -- the SATA DMA seemed fine.
>
> If I actively used the GUI, then the pieces I was using would work better, but
> they'd start experiencing astonishing latency again if I just let the UI sit
> for a little while. From this, I diagnosed that the problem was probably
> related to the VM paging out my GUI.
>
> Next, I set the following:
>
> echo 0 > /proc/sys/vm/swappiness
>
> ... hoping it would prevent paging out of the UI in favor of file data that's
> only used once. It did appear to help to a small degree, but not much. The
> system is still effectively unusable while a file copy is going on.
>
> From this, I diagnosed that most likely, the kernel was paging out all my
> application file mmap() data (such as my executables and shared libraries) in
> favor of total garbage VM load from the file copy.
>
> I don't know how to verify that this is true definitively. Are there some
> magic numbers in /proc I can look at? However, I did run latencytop, and it
> showed massive 2000+ msec latency in the page fault handler, as well as in
> various operations such as XFS read.
>
> Could this be something else? There were some long delays in latencytop from
> various apps doing fsync as well, but it seems unlikely that this would destroy
> latency in Xorg, and again, latency improved whenever I touched an app, for
> that app.
>
> Is there any way to fix this, short of rewriting the VM myself? For example,
> is there some way I could convince this VM that pages with active mappings are
> valuable?
>
> Thanks.


2009-04-28 06:37:01

by Elladan

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 02:35:29PM +0900, KOSAKI Motohiro wrote:
> (cc to linux-mm and Rik)
>
> > Hi,
> >
> > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> > and then I did the following (with XFS over LVM):
> >
> > mv /500gig/of/data/on/disk/one /disk/two
> >
> > This quickly caused the system to. grind.. to... a.... complete..... halt.
> > Basically every UI operation, including the mouse in Xorg, started experiencing
> > multiple second lag and delays. This made the system essentially unusable --
> > for example, just flipping to the window where the "mv" command was running
> > took 10 seconds on more than one occasion. Basically a "click and get coffee"
> > interface.
>
> I have some question and request.
>
> 1. please post your /proc/meminfo
> 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> 3. cache limitation of memcgroup solve this problem?
> 4. Which disk have your /bin and /usr/bin?

I'll answer these out of order if you don't mind.

2. Do above copy make tons swap-out? IOW your disk read much faster than write?

The disks should be roughly similar. However:

sda is the read disk, sdb is the write disk. Here are a few snippets from iostat -xm 10:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 67.70 0.00 373.10 0.20 48.47 0.00 265.90 1.94 5.21 2.10 78.32
sdb 0.00 1889.60 0.00 139.80 0.00 52.52 769.34 35.01 250.45 5.17 72.28
---
sda 5.30 0.00 483.80 0.30 60.65 0.00 256.59 1.59 3.28 1.65 79.72
sdb 0.00 3632.70 0.00 171.10 0.00 61.10 731.39 117.09 709.66 5.84 100.00
---
sda 51.20 0.00 478.10 1.00 65.79 0.01 281.27 2.48 5.18 1.96 93.72
sdb 0.00 2104.60 0.00 174.80 0.00 62.84 736.28 108.50 613.64 5.72 100.00
--
sda 153.20 0.00 349.40 0.20 60.99 0.00 357.30 4.47 13.19 2.85 99.80
sdb 0.00 1766.50 0.00 158.60 0.00 59.89 773.34 110.07 672.25 6.30 99.96

This data seems to indicate the IO performance varies, but the reader is usually faster.

4. Which disk have your /bin and /usr/bin?

sda, the reader.

3. cache limitation of memcgroup solve this problem?

I was unable to get this to work -- do you have some documentation handy?

1. please post your /proc/meminfo

$ cat /proc/meminfo
MemTotal: 3467668 kB
MemFree: 20164 kB
Buffers: 204 kB
Cached: 2295232 kB
SwapCached: 4012 kB
Active: 639608 kB
Inactive: 2620880 kB
Active(anon): 608104 kB
Inactive(anon): 360812 kB
Active(file): 31504 kB
Inactive(file): 2260068 kB
Unevictable: 8 kB
Mlocked: 8 kB
SwapTotal: 4194296 kB
SwapFree: 4186968 kB
Dirty: 147280 kB
Writeback: 8424 kB
AnonPages: 961280 kB
Mapped: 39016 kB
Slab: 81904 kB
SReclaimable: 59044 kB
SUnreclaim: 22860 kB
PageTables: 20548 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5928128 kB
Committed_AS: 1770348 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 281908 kB
VmallocChunk: 34359449059 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 44928 kB
DirectMap2M: 3622912 kB

2009-04-28 06:52:43

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

Hi

> 3. cache limitation of memcgroup solve this problem?
>
> I was unable to get this to work -- do you have some documentation handy?

Do you have the kernel source tarball?
Documentation/cgroups/memory.txt explains the usage.



2009-04-28 07:27:21

by Elladan

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 03:52:29PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > 3. cache limitation of memcgroup solve this problem?
> >
> > I was unable to get this to work -- do you have some documentation handy?
>
> Do you have the kernel source tarball?
> Documentation/cgroups/memory.txt explains the usage.

Thank you. My documentation was out of date.

I created a cgroup with limited memory and placed a copy command in it, and the
latency problem seems to essentially go away. However, I'm also a bit
suspicious that my test might have become invalid, since my IO performance
seems to have dropped somewhat too.
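
For reference, the setup used for that test was along these lines; a sketch
assuming the cgroup-v1 memory controller of this kernel generation, with the
mount point, group name, and 512M limit being illustrative choices rather than
anything prescribed by the documentation:

mount -t cgroup -o memory none /cgroup              # mount the memory controller
mkdir /cgroup/copyjob
echo 512M > /cgroup/copyjob/memory.limit_in_bytes   # cap page cache + anon charged to the group
echo $$ > /cgroup/copyjob/tasks                     # move this shell (and its children) into the group
mv /500gig/of/data/on/disk/one /disk/two            # the copy now reclaims within its own limit

With the limit in place, the streaming file pages get reclaimed from inside the
group instead of pushing the rest of the desktop's memory out.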

So, am I right in concluding that this more or less implicates bad page
replacement as the culprit? After I dropped vm caches and let my working set
re-form, the memory cgroup seems to be effective at keeping a large pool of
memory free from file pressure.
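
(For completeness, "dropped vm caches" above refers to the drop_caches knob; a
minimal sketch, run as root:)

sync                                  # drop_caches only discards clean pages, so flush dirty data first
echo 3 > /proc/sys/vm/drop_caches     # 1 = page cache, 2 = dentries and inodes, 3 = both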

2009-04-28 07:44:44

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

> On Tue, Apr 28, 2009 at 03:52:29PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > 3. cache limitation of memcgroup solve this problem?
> > >
> > > I was unable to get this to work -- do you have some documentation handy?
> >
> > Do you have the kernel source tarball?
> > Documentation/cgroups/memory.txt explains the usage.
>
> Thank you. My documentation was out of date.
>
> I created a cgroup with limited memory and placed a copy command in it, and the
> latency problem seems to essentially go away. However, I'm also a bit
> suspicious that my test might have become invalid, since my IO performance
> seems to have dropped somewhat too.
>
> So, am I right in concluding that this more or less implicates bad page
> replacement as the culprit? After I dropped vm caches and let my working set
> re-form, the memory cgroup seems to be effective at keeping a large pool of
> memory free from file pressure.

Hmm..
It seems your result means bad page replacement is occurring, but I
haven't actually seen such a result in my environment.

Hmm, I think I need to set up an environment that reproduces your trouble.

Thanks.


2009-04-28 07:49:06

by Peter Zijlstra

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
> (cc to linux-mm and Rik)
>
>
> > Hi,
> >
> > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> > and then I did the following (with XFS over LVM):
> >
> > mv /500gig/of/data/on/disk/one /disk/two
> >
> > This quickly caused the system to. grind.. to... a.... complete..... halt.
> > Basically every UI operation, including the mouse in Xorg, started experiencing
> > multiple second lag and delays. This made the system essentially unusable --
> > for example, just flipping to the window where the "mv" command was running
> > took 10 seconds on more than one occasion. Basically a "click and get coffee"
> > interface.
>
> I have some question and request.
>
> 1. please post your /proc/meminfo
> 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> 3. cache limitation of memcgroup solve this problem?
> 4. Which disk have your /bin and /usr/bin?
>

FWIW I fundamentally object to 3 as being a solution.

I still think the idea of read-ahead driven drop-behind is a good one,
alas last time we brought that up people thought differently.

2009-04-28 07:58:36

by Balbir Singh

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 1:18 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
>> (cc to linux-mm and Rik)
>>
>>
>> > Hi,
>> >
>> > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
>> > and then I did the following (with XFS over LVM):
>> >
>> > mv /500gig/of/data/on/disk/one /disk/two
>> >
>> > This quickly caused the system to. grind.. to... a.... complete..... halt.
>> > Basically every UI operation, including the mouse in Xorg, started experiencing
>> > multiple second lag and delays. This made the system essentially unusable --
>> > for example, just flipping to the window where the "mv" command was running
>> > took 10 seconds on more than one occasion. Basically a "click and get coffee"
>> > interface.
>>
>> I have some question and request.
>>
>> 1. please post your /proc/meminfo
>> 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
>> 3. cache limitation of memcgroup solve this problem?
>> 4. Which disk have your /bin and /usr/bin?
>>
>
> FWIW I fundamentally object to 3 as being a solution.
>

memcgroup were not created to solve latency problems, but they do
isolate memory and if that helps latency, I don't see why that is a
problem. I don't think isolating applications that we think are not
important and interfere or consume more resources than desired is a
bad solution.

> I still think the idea of read-ahead driven drop-behind is a good one,
> alas last time we brought that up people thought differently.

I vaguely remember the patches, but can't recollect the details.

Balbir

2009-04-28 08:03:38

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

> > 1. please post your /proc/meminfo
> > 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> > 3. cache limitation of memcgroup solve this problem?
> > 4. Which disk have your /bin and /usr/bin?
> >
>
> FWIW I fundamentally object to 3 as being a solution.

Yes, I also think so.


> I still think the idea of read-ahead driven drop-behind is a good one,
> alas last time we brought that up people thought differently.

Hmm.
Sorry, I can't recall this patch. Do you have any pointer or URL?


2009-04-28 08:11:58

by Peter Zijlstra

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, 2009-04-28 at 13:28 +0530, Balbir Singh wrote:
> On Tue, Apr 28, 2009 at 1:18 PM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
> >> (cc to linux-mm and Rik)
> >>
> >>
> >> > Hi,
> >> >
> >> > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> >> > and then I did the following (with XFS over LVM):
> >> >
> >> > mv /500gig/of/data/on/disk/one /disk/two
> >> >
> >> > This quickly caused the system to. grind.. to... a.... complete..... halt.
> >> > Basically every UI operation, including the mouse in Xorg, started experiencing
> >> > multiple second lag and delays. This made the system essentially unusable --
> >> > for example, just flipping to the window where the "mv" command was running
> >> > took 10 seconds on more than one occasion. Basically a "click and get coffee"
> >> > interface.
> >>
> >> I have some question and request.
> >>
> >> 1. please post your /proc/meminfo
> >> 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> >> 3. cache limitation of memcgroup solve this problem?
> >> 4. Which disk have your /bin and /usr/bin?
> >>
> >
> > FWIW I fundamentally object to 3 as being a solution.
> >
>
> memcgroup were not created to solve latency problems, but they do
> isolate memory and if that helps latency, I don't see why that is a
> problem. I don't think isolating applications that we think are not
> important and interfere or consume more resources than desired is a
> bad solution.

So being able to isolate is a good excuse for poor replacement these
days?

Also, exactly because it's isolated/limited, it's sub-optimal.


> > I still think the idea of read-ahead driven drop-behind is a good one,
> > alas last time we brought that up people thought differently.
>
> I vaguely remember the patches, but can't recollect the details.

A quick google gave me this:

http://lkml.org/lkml/2007/7/21/219

2009-04-28 08:25:26

by Kamezawa Hiroyuki

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, 28 Apr 2009 10:11:32 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, 2009-04-28 at 13:28 +0530, Balbir Singh wrote:
> > On Tue, Apr 28, 2009 at 1:18 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
> > >> (cc to linux-mm and Rik)
> > >>
> > >>
> > >> > Hi,
> > >> >
> > >> > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> > >> > and then I did the following (with XFS over LVM):
> > >> >
> > >> > mv /500gig/of/data/on/disk/one /disk/two
> > >> >
> > >> > This quickly caused the system to. grind.. to... a.... complete..... halt.
> > >> > Basically every UI operation, including the mouse in Xorg, started experiencing
> > >> > multiple second lag and delays. This made the system essentially unusable --
> > >> > for example, just flipping to the window where the "mv" command was running
> > >> > took 10 seconds on more than one occasion. Basically a "click and get coffee"
> > >> > interface.
> > >>
> > >> I have some question and request.
> > >>
> > >> 1. please post your /proc/meminfo
> > >> 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> > >> 3. cache limitation of memcgroup solve this problem?
> > >> 4. Which disk have your /bin and /usr/bin?
> > >>
> > >
> > > FWIW I fundamentally object to 3 as being a solution.
> > >
> >
> > memcgroup were not created to solve latency problems, but they do
> > isolate memory and if that helps latency, I don't see why that is a
> > problem. I don't think isolating applications that we think are not
> > important and interfere or consume more resources than desired is a
> > bad solution.
>
> So being able to isolate is a good excuse for poor replacement these
> days?
>
Only while the kernel can't tell what's going on and what's wanted.

Thanks,
-Kame

2009-04-28 08:26:53

by Balbir Singh

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 1:41 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2009-04-28 at 13:28 +0530, Balbir Singh wrote:
>> On Tue, Apr 28, 2009 at 1:18 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
>> >> (cc to linux-mm and Rik)
>> >>
>> >>
>> >> > Hi,
>> >> >
>> >> > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
>> >> > and then I did the following (with XFS over LVM):
>> >> >
>> >> > mv /500gig/of/data/on/disk/one /disk/two
>> >> >
>> >> > This quickly caused the system to. grind.. to... a.... complete..... halt.
>> >> > Basically every UI operation, including the mouse in Xorg, started experiencing
>> >> > multiple second lag and delays. This made the system essentially unusable --
>> >> > for example, just flipping to the window where the "mv" command was running
>> >> > took 10 seconds on more than one occasion. Basically a "click and get coffee"
>> >> > interface.
>> >>
>> >> I have some question and request.
>> >>
>> >> 1. please post your /proc/meminfo
>> >> 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
>> >> 3. cache limitation of memcgroup solve this problem?
>> >> 4. Which disk have your /bin and /usr/bin?
>> >>
>> >
>> > FWIW I fundamentally object to 3 as being a solution.
>> >
>>
>> memcgroup were not created to solve latency problems, but they do
>> isolate memory and if that helps latency, I don't see why that is a
>> problem. I don't think isolating applications that we think are not
>> important and interfere or consume more resources than desired is a
>> bad solution.
>
> So being able to isolate is a good excuse for poor replacement these
> days?
>

Nope.. I am not saying that. Poor replacement needs to be fixed, but
unfortunately that is very dependent on the nature of the workload:
poor for one might be good for another, though of course there is always
a middle ground based on our understanding of the desired behaviour.
Having said that, isolating unimportant tasks might be a trade-off that
works; it *does not* replace the good default algorithms we need, but it
provides manual control of an otherwise auto-piloted system. With
virtualization, mixed workloads are becoming more common on the system.

Providing the swappiness knob for example is needed because sometimes
the user does know what he/she needs.

> Also, exactly because it's isolated/limited, it's sub-optimal.
>
>
>> > I still think the idea of read-ahead driven drop-behind is a good one,
>> > alas last time we brought that up people thought differently.
>>
>> I vaguely remember the patches, but can't recollect the details.
>
> A quick google gave me this:
>
> http://lkml.org/lkml/2007/7/21/219

Thanks! That was quick

2009-04-28 09:14:27

by Fengguang Wu

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 09:48:39AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
> > (cc to linux-mm and Rik)
> >
> >
> > > Hi,
> > >
> > > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> > > and then I did the following (with XFS over LVM):
> > >
> > > mv /500gig/of/data/on/disk/one /disk/two
> > >
> > > This quickly caused the system to. grind.. to... a.... complete..... halt.
> > > Basically every UI operation, including the mouse in Xorg, started experiencing
> > > multiple second lag and delays. This made the system essentially unusable --
> > > for example, just flipping to the window where the "mv" command was running
> > > took 10 seconds on more than one occasion. Basically a "click and get coffee"
> > > interface.
> >
> > I have some question and request.
> >
> > 1. please post your /proc/meminfo
> > 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> > 3. cache limitation of memcgroup solve this problem?
> > 4. Which disk have your /bin and /usr/bin?
> >
>
> FWIW I fundamentally object to 3 as being a solution.
>
> I still think the idea of read-ahead driven drop-behind is a good one,
> alas last time we brought that up people thought differently.

The semi-drop-behind is a great idea for the desktop - to put just
accessed pages to end of LRU. However I'm still afraid it vastly
changes the caching behavior and wont work well as expected in server
workloads - shall we verify this?

Back to this big-cp-hurts-responsibility issue. Background write
requests can easily pass the io scheduler's obstacles and fill up
the disk queue. Now every read request will have to wait 10+ writes
- leading to 10x slow down of major page faults.

I reach this conclusion based on recent CFQ code reviews. Will bring up
a queue depth limiting patch for more exercises..
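
For anyone wanting to poke at this in the meantime, the relevant block-layer
knobs are visible in sysfs; a sketch using the sdb device from the report above
(values shown are defaults or purely illustrative, not recommendations):

cat /sys/block/sdb/queue/scheduler           # e.g. "noop anticipatory deadline [cfq]"
cat /sys/block/sdb/queue/nr_requests         # request pool size, 128 by default
echo 32 > /sys/block/sdb/queue/nr_requests   # experiment: keep fewer writes queued ahead of reads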

Thanks,
Fengguang

2009-04-28 09:27:19

by Fengguang Wu

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 05:09:16PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 09:48:39AM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-04-28 at 14:35 +0900, KOSAKI Motohiro wrote:
> > > (cc to linux-mm and Rik)
> > >
> > >
> > > > Hi,
> > > >
> > > > So, I just set up Ubuntu Jaunty (using Linux 2.6.28) on a quad core phenom box,
> > > > and then I did the following (with XFS over LVM):
> > > >
> > > > mv /500gig/of/data/on/disk/one /disk/two
> > > >
> > > > This quickly caused the system to. grind.. to... a.... complete..... halt.
> > > > Basically every UI operation, including the mouse in Xorg, started experiencing
> > > > multiple second lag and delays. This made the system essentially unusable --
> > > > for example, just flipping to the window where the "mv" command was running
> > > > took 10 seconds on more than one occasion. Basically a "click and get coffee"
> > > > interface.
> > >
> > > I have some question and request.
> > >
> > > 1. please post your /proc/meminfo
> > > 2. Do above copy make tons swap-out? IOW your disk read much faster than write?
> > > 3. cache limitation of memcgroup solve this problem?
> > > 4. Which disk have your /bin and /usr/bin?
> > >
> >
> > FWIW I fundamentally object to 3 as being a solution.
> >
> > I still think the idea of read-ahead driven drop-behind is a good one,
> > alas last time we brought that up people thought differently.
>
> The semi-drop-behind is a great idea for the desktop - to put just
> accessed pages to end of LRU. However I'm still afraid it vastly
> changes the caching behavior and wont work well as expected in server
> workloads - shall we verify this?
>
> Back to this big-cp-hurts-responsibility issue. Background write
> requests can easily pass the io scheduler's obstacles and fill up
> the disk queue. Now every read request will have to wait 10+ writes
> - leading to 10x slow down of major page faults.
>
> I reach this conclusion based on recent CFQ code reviews. Will bring up
> a queue depth limiting patch for more exercises..

Sorry - just realized that Elladan's root fs lies on sda - the read side.

Then why should a single read stream cause 2000ms major fault delays?
The 'await' value for sda is <10ms, not even close to 2000ms:

> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
> sda 67.70 0.00 373.10 0.20 48.47 0.00 265.90 1.94 5.21 2.10 78.32
> sdb 0.00 1889.60 0.00 139.80 0.00 52.52 769.34 35.01 250.45 5.17 72.28
> ---
> sda 5.30 0.00 483.80 0.30 60.65 0.00 256.59 1.59 3.28 1.65 79.72
> sdb 0.00 3632.70 0.00 171.10 0.00 61.10 731.39 117.09 709.66 5.84 100.00
> ---
> sda 51.20 0.00 478.10 1.00 65.79 0.01 281.27 2.48 5.18 1.96 93.72
> sdb 0.00 2104.60 0.00 174.80 0.00 62.84 736.28 108.50 613.64 5.72 100.00
> --
> sda 153.20 0.00 349.40 0.20 60.99 0.00 357.30 4.47 13.19 2.85 99.80
> sdb 0.00 1766.50 0.00 158.60 0.00 59.89 773.34 110.07 672.25 6.30 99.96


Thanks,
Fengguang

2009-04-28 12:08:43

by Theodore Ts'o

Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 05:09:16PM +0800, Wu Fengguang wrote:
> The semi-drop-behind is a great idea for the desktop - to put just
> accessed pages to end of LRU. However I'm still afraid it vastly
> changes the caching behavior and wont work well as expected in server
> workloads - shall we verify this?
>
> Back to this big-cp-hurts-responsibility issue. Background write
> requests can easily pass the io scheduler's obstacles and fill up
> the disk queue. Now every read request will have to wait 10+ writes
> - leading to 10x slow down of major page faults.
>
> I reach this conclusion based on recent CFQ code reviews. Will bring up
> a queue depth limiting patch for more exercises..

We can muck with the I/O scheduler, but another thing to consider is
whether the VM should be more aggressively throttling writes in this
case; it sounds like the big cp in this case may be dirtying pages so
aggressively that it's driving other (more useful) pages out of the
page cache --- if the target disk is slower than the source disk (for
example, backing up a SATA primary disk to a USB-attached backup disk)
no amount of drop-behind is going to help the situation.

So that leaves three areas for exploration:

* Write-throttling
* Drop-behind
* background writes pushing aside foreground reads

Hmm, note that although the original bug reporter is running Ubuntu
Jaunty, and hence 2.6.28, this problem is going to get *worse* with
2.6.30, since we have the ext3 data=ordered latency fixes which will
write out any journal activity and, worse, any synchronous commits
(i.e., caused by fsync) will force out all of the dirty pages with
WRITE_SYNC priority. So with a heavy load, I suspect this is going to
be more of a VM issue, and especially figuring out how to tune more
aggressive write-throttling may be key here.
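
For the write-throttling angle specifically, the existing knobs live under
/proc/sys/vm; a sketch of checking and tightening them (the numbers are purely
illustrative):

cat /proc/sys/vm/dirty_ratio                  # % of memory a dirtier may fill before it is throttled
cat /proc/sys/vm/dirty_background_ratio       # % at which background writeback (pdflush) kicks in
echo 5 > /proc/sys/vm/dirty_ratio             # throttle heavy writers sooner
echo 2 > /proc/sys/vm/dirty_background_ratio  # start flushing earlier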

- Ted

2009-04-28 15:30:22

by Rik van Riel

Subject: Re: Swappiness vs. mmap() and interactive response

When there is a lot of streaming IO going on, we do not want
to scan or evict pages from the working set. The old VM used
to skip any mapped page, but still evict indirect blocks and
other data that is useful to cache.

This patch adds logic to skip scanning the anon lists and
the active file list if most of the file pages are on the
inactive file list (where streaming IO pages live), while
at the lowest scanning priority.

If the system is not doing a lot of streaming IO, eg. the
system is running a database workload, then more often used
file pages will be on the active file list and this logic
is automatically disabled.

Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/vmscan.c | 18 ++++++++++++++++--
2 files changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc8-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc8-mm1.orig/include/linux/mmzone.h 2008-07-07 15:41:32.000000000 -0400
+++ linux-2.6.26-rc8-mm1/include/linux/mmzone.h 2008-07-15 14:58:50.000000000 -0400
@@ -453,6 +453,7 @@ static inline int zone_is_oom_locked(con
* queues ("queue_length >> 12") during an aging round.
*/
#define DEF_PRIORITY 12
+#define PRIO_CACHE_ONLY DEF_PRIORITY+1

/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
Index: linux-2.6.26-rc8-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/mm/vmscan.c 2008-07-07 15:41:33.000000000 -0400
+++ linux-2.6.26-rc8-mm1/mm/vmscan.c 2008-07-15 15:10:05.000000000 -0400
@@ -1481,6 +1481,20 @@ static unsigned long shrink_zone(int pri
}
}

+ /*
+ * If there is a lot of sequential IO going on, most of the
+ * file pages will be on the inactive file list. We start
+ * out by reclaiming those pages, without putting pressure on
+ * the working set. We only do this if the bulk of the file pages
+ * are not in the working set (on the active file list).
+ */
+ if (priority == PRIO_CACHE_ONLY &&
+ (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]))
+ for_each_evictable_lru(l)
+ /* Scan only the inactive_file list. */
+ if (l != LRU_INACTIVE_FILE)
+ nr[l] = 0;
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1609,7 +1623,7 @@ static unsigned long do_try_to_free_page
}
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (priority = PRIO_CACHE_ONLY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
@@ -1771,7 +1785,7 @@ loop_again:
for (i = 0; i < pgdat->nr_zones; i++)
temp_priority[i] = DEF_PRIORITY;

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (priority = PRIO_CACHE_ONLY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;


Attachments:
evict-cache-first.patch (2.92 kB)

2009-04-28 23:29:45

by Rik van Riel

Subject: [PATCH] vmscan: evict use-once pages first

When the file LRU lists are dominated by streaming IO pages,
evict those pages first, before considering evicting other
pages.

This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:
1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted

The pages freed in this way can either be reused for streaming
IO, or allocated for something else. If the pages are used for
streaming IO, this pageout pattern continues. Otherwise, we will
fall back to the normal pageout pattern.

Signed-off-by: Rik van Riel <[email protected]>

---
Elladan, does this patch fix the issue you are seeing?

Peter, Kosaki, Ted, does this patch look good to you?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac9577..4c0304e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1489,6 +1489,21 @@ static void shrink_zone(int priority, struct zone *zone,
nr[l] = scan;
}

+ /*
+ * When the system is doing streaming IO, memory pressure here
+ * ensures that active file pages get deactivated, until more
+ * than half of the file pages are on the inactive list.
+ *
+ * Once we get to that situation, protect the system's working
+ * set from being evicted by disabling active file page aging
+ * and swapping of swap backed pages. We still do background
+ * aging of anonymous pages.
+ */
+ if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]) {
+ nr[LRU_ACTIVE_FILE] = 0;
+ nr[LRU_INACTIVE_ANON] = 0;
+ }
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {

2009-04-29 03:37:58

by Elladan

Subject: Re: [PATCH] vmscan: evict use-once pages first

Rik,

This patch appears to significantly improve application latency while a large
file copy runs. I'm not seeing behavior that implies continuous bad page
replacement.

I'm still seeing some general lag, which I attribute to general filesystem
slowness. For example, latencytop sees many events like these:

down xfs_buf_lock _xfs_buf_find xfs_buf_get_flags 1475.8 msec 5.9 %

xfs_buf_iowait xfs_buf_iostart xfs_buf_read_flags 1740.9 msec 2.6 %

Writing a page to disk 1042.9 msec 43.7 %

It also occasionally sees long page faults:

Page fault 2068.3 msec 21.3 %

I guess XFS (and the elevator) is just doing a poor job managing latency
(particularly poor since all the IO on /usr/bin is on the reader disk).
Notable:

Creating block layer request 451.4 msec 14.4 %

Thank you,
Elladan

On Tue, Apr 28, 2009 at 07:29:07PM -0400, Rik van Riel wrote:
> When the file LRU lists are dominated by streaming IO pages,
> evict those pages first, before considering evicting other
> pages.
>
> This should be safe from deadlocks or performance problems
> because only three things can happen to an inactive file page:
> 1) referenced twice and promoted to the active list
> 2) evicted by the pageout code
> 3) under IO, after which it will get evicted or promoted
>
> The pages freed in this way can either be reused for streaming
> IO, or allocated for something else. If the pages are used for
> streaming IO, this pageout pattern continues. Otherwise, we will
> fall back to the normal pageout pattern.
>
> Signed-off-by: Rik van Riel <[email protected]>
>
> ---
> Elladan, does this patch fix the issue you are seeing?
>
> Peter, Kosaki, Ted, does this patch look good to you?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eac9577..4c0304e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1489,6 +1489,21 @@ static void shrink_zone(int priority, struct zone *zone,
> nr[l] = scan;
> }
>
> + /*
> + * When the system is doing streaming IO, memory pressure here
> + * ensures that active file pages get deactivated, until more
> + * than half of the file pages are on the inactive list.
> + *
> + * Once we get to that situation, protect the system's working
> + * set from being evicted by disabling active file page aging
> + * and swapping of swap backed pages. We still do background
> + * aging of anonymous pages.
> + */
> + if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]) {
> + nr[LRU_ACTIVE_FILE] = 0;
> + nr[LRU_INACTIVE_ANON] = 0;
> + }
> +
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {

2009-04-29 05:51:26

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

Hi

> On Tue, Apr 28, 2009 at 05:09:16PM +0800, Wu Fengguang wrote:
> > The semi-drop-behind is a great idea for the desktop - to put just
> > accessed pages to end of LRU. However I'm still afraid it vastly
> > changes the caching behavior and wont work well as expected in server
> > workloads - shall we verify this?
> >
> > Back to this big-cp-hurts-responsibility issue. Background write
> > requests can easily pass the io scheduler's obstacles and fill up
> > the disk queue. Now every read request will have to wait 10+ writes
> > - leading to 10x slow down of major page faults.
> >
> > I reach this conclusion based on recent CFQ code reviews. Will bring up
> > a queue depth limiting patch for more exercises..
>
> We can muck with the I/O scheduler, but another thing to consider is
> whether the VM should be more aggressively throttling writes in this
> case; it sounds like the big cp in this case may be dirtying pages so
> aggressively that it's driving other (more useful) pages out of the
> page cache --- if the target disk is slower than the source disk (for
> example, backing up a SATA primary disk to a USB-attached backup disk)
> no amount of drop-behind is going to help the situation.
>
> So that leaves three areas for exploration:
>
> * Write-throttling
> * Drop-behind
> * background writes pushing aside foreground reads
>
> Hmm, note that although the original bug reporter is running Ubuntu
> Jaunty, and hence 2.6.28, this problem is going to get *worse* with
> 2.6.30, since we have the ext3 data=ordered latency fixes which will
> > write out any journal activity and, worse, any synchronous commits
> (i.e., caused by fsync) will force out all of the dirty pages with
> WRITE_SYNC priority. So with a heavy load, I suspect this is going to
> be more of a VM issue, and especially figuring out how to tune more
> aggressive write-throttling may be key here.

Firstly, I'd like to report my reproduction test result.

test environment: no lvm, copy ext3 to ext3 (not mv), no change swappiness,
CFQ is used, userland is Fedora10, mmotm(2.6.30-rc1 + mm patch),
CPU opteronx4, mem 4G

mouse move lag:               not happened
window move lag:              not happened
Mapped page decrease rapidly: not happened (I guess these pages stay on the
                              active list on my system)
page fault large latency:     happened (latencytop displays >200ms)


So, I don't doubt the vm replacement logic now, but I need to investigate more.
I plan to try the following things today and tomorrow.

- XFS
- LVM
- another io scheduler (thanks Ted, good view point)
- Rik's new patch



2009-04-29 06:40:24

by Andrew Morton

Subject: Re: Swappiness vs. mmap() and interactive response

On Wed, 29 Apr 2009 14:51:07 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:

> Hi
>
> > On Tue, Apr 28, 2009 at 05:09:16PM +0800, Wu Fengguang wrote:
> > > The semi-drop-behind is a great idea for the desktop - to put just
> > > accessed pages to end of LRU. However I'm still afraid it vastly
> > > changes the caching behavior and wont work well as expected in server
> > > workloads - shall we verify this?
> > >
> > > Back to this big-cp-hurts-responsibility issue. Background write
> > > requests can easily pass the io scheduler's obstacles and fill up
> > > the disk queue. Now every read request will have to wait 10+ writes
> > > - leading to 10x slow down of major page faults.
> > >
> > > I reach this conclusion based on recent CFQ code reviews. Will bring up
> > > a queue depth limiting patch for more exercises..
> >
> > We can muck with the I/O scheduler, but another thing to consider is
> > whether the VM should be more aggressively throttling writes in this
> > case; it sounds like the big cp in this case may be dirtying pages so
> > aggressively that it's driving other (more useful) pages out of the
> > page cache --- if the target disk is slower than the source disk (for
> > example, backing up a SATA primary disk to a USB-attached backup disk)
> > no amount of drop-behind is going to help the situation.
> >
> > So that leaves three areas for exploration:
> >
> > * Write-throttling
> > * Drop-behind
> > * background writes pushing aside foreground reads
> >
> > Hmm, note that although the original bug reporter is running Ubuntu
> > Jaunty, and hence 2.6.28, this problem is going to get *worse* with
> > 2.6.30, since we have the ext3 data=ordered latency fixes which will
> > > write out any journal activity and, worse, any synchronous commits
> > (i.e., caused by fsync) will force out all of the dirty pages with
> > WRITE_SYNC priority. So with a heavy load, I suspect this is going to
> > be more of a VM issue, and especially figuring out how to tune more
> > aggressive write-throttling may be key here.
>
> Firstly, I'd like to report my reproduction test result.
>
> test environment: no lvm, copy ext3 to ext3 (not mv), no change swappiness,
> CFQ is used, userland is Fedora10, mmotm(2.6.30-rc1 + mm patch),
> CPU opteronx4, mem 4G
>
> mouse move lag:               not happened
> window move lag:              not happened
> Mapped page decrease rapidly: not happened (I guess these pages stay on the
>                               active list on my system)
> page fault large latency:     happened (latencytop displays >200ms)

hm. The last two observations appear to be inconsistent.

Elladan, have you checked to see whether the Mapped: number in
/proc/meminfo is decreasing?
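
(A minimal way to capture that over time while the copy runs, for example:)

while sleep 10; do date +%T; grep -E '^(Mapped|Active\(file\)|Inactive\(file\)):' /proc/meminfo; done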

>
> So, I don't doubt the vm replacement logic now, but I need to investigate more.
> I plan to try the following things today and tomorrow.
>
> - XFS
> - LVM
> - another io scheduler (thanks Ted, good view point)
> - Rik's new patch

It's not clear that we know what's happening yet, is it? It's such a
gross problem that you'd think that even our testing would have found
it by now :(

Elladan, do you know if earlier kernels (2.6.26 or thereabouts) had
this severe a problem?

(notes that we _still_ haven't unbusted prev_priority)

2009-04-29 06:42:47

by Peter Zijlstra

Subject: Re: [PATCH] vmscan: evict use-once pages first

On Tue, 2009-04-28 at 19:29 -0400, Rik van Riel wrote:

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eac9577..4c0304e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1489,6 +1489,21 @@ static void shrink_zone(int priority, struct zone *zone,
> nr[l] = scan;
> }
>
> + /*
> + * When the system is doing streaming IO, memory pressure here
> + * ensures that active file pages get deactivated, until more
> + * than half of the file pages are on the inactive list.
> + *
> + * Once we get to that situation, protect the system's working
> + * set from being evicted by disabling active file page aging
> + * and swapping of swap backed pages. We still do background
> + * aging of anonymous pages.
> + */
> + if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]) {
> + nr[LRU_ACTIVE_FILE] = 0;
> + nr[LRU_INACTIVE_ANON] = 0;
> + }
> +

Isn't there a hole where LRU_*_FILE << LRU_*_ANON and we now stop
shrinking INACTIVE_ANON even though it makes sense to.

2009-04-29 07:50:37

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

>> Mapped page decrease rapidly: not happened (I guess these pages stay on the
>>                               active list on my system)
>> page fault large latency:     happened (latencytop displays >200ms)
>
> hm. The last two observations appear to be inconsistent.

It means existing processes don't slow down, but new process creation is very slow.


> Elladan, have you checked to see whether the Mapped: number in
> /proc/meminfo is decreasing?
>
>>
>> So, I don't doubt the vm replacement logic now, but I need to investigate more.
>> I plan to try the following things today and tomorrow.
>>
>>  - XFS
>>  - LVM
>>  - another io scheduler (thanks Ted, good view point)
>>  - Rik's new patch
>
> It's not clear that we know what's happening yet, is it? It's such a
> gross problem that you'd think that even our testing would have found
> it by now :(

Yes, it's unclear. But various tests can drill down to the reason, I think.


> Elladan, do you know if earlier kernels (2.6.26 or thereabouts) had
> this severe a problem?
>
> (notes that we _still_ haven't unbusted prev_priority)

2009-04-29 07:50:54

by KOSAKI Motohiro

Subject: Re: Swappiness vs. mmap() and interactive response

one mistake

> mouse move lag:               not happened
> window move lag:              not happened
> Mapped page decrease rapidly: not happened (I guess these pages stay on the
>                               active list on my system)
> page fault large latency:     happened (latencytop displays >200ms)

^^^^^^^^^

>1200ms

sorry.

2009-04-29 13:31:33

by Rik van Riel

Subject: Re: [PATCH] vmscan: evict use-once pages first

Peter Zijlstra wrote:
> On Tue, 2009-04-28 at 19:29 -0400, Rik van Riel wrote:
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index eac9577..4c0304e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1489,6 +1489,21 @@ static void shrink_zone(int priority, struct zone *zone,
>> nr[l] = scan;
>> }
>>
>> + /*
>> + * When the system is doing streaming IO, memory pressure here
>> + * ensures that active file pages get deactivated, until more
>> + * than half of the file pages are on the inactive list.
>> + *
>> + * Once we get to that situation, protect the system's working
>> + * set from being evicted by disabling active file page aging
>> + * and swapping of swap backed pages. We still do background
>> + * aging of anonymous pages.
>> + */
>> + if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]) {
>> + nr[LRU_ACTIVE_FILE] = 0;
>> + nr[LRU_INACTIVE_ANON] = 0;
>> + }
>> +
>
> Isn't there a hole where LRU_*_FILE << LRU_*_ANON and we now stop
> shrinking INACTIVE_ANON even though it makes sense to.

Only temporarily, until the number of active file pages
is larger than the number of inactive ones.

Think of it as reducing the frequency of shrinking anonymous
pages while the system is near the threshold.

--
All rights reversed.

2009-04-29 15:47:36

by Rik van Riel

Subject: [PATCH] vmscan: evict use-once pages first (v2)

When the file LRU lists are dominated by streaming IO pages,
evict those pages first, before considering evicting other
pages.

This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:
1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted

The pages freed in this way can either be reused for streaming
IO, or allocated for something else. If the pages are used for
streaming IO, this pageout pattern continues. Otherwise, we will
fall back to the normal pageout pattern.

Signed-off-by: Rik van Riel <[email protected]>
---
On Wed, 29 Apr 2009 08:42:29 +0200
Peter Zijlstra <[email protected]> wrote:

> Isn't there a hole where LRU_*_FILE << LRU_*_ANON and we now stop
> shrinking INACTIVE_ANON even though it makes sense to.

Peter, after looking at this again, I believe that the get_scan_ratio
logic should take care of protecting the anonymous pages, so we can
get away with this following, less intrusive patch.

Elladan, does this smaller patch still work as expected?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac9577..4471dcb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1489,6 +1489,18 @@ static void shrink_zone(int priority, struct zone *zone,
nr[l] = scan;
}

+ /*
+ * When the system is doing streaming IO, memory pressure here
+ * ensures that active file pages get deactivated, until more
+ * than half of the file pages are on the inactive list.
+ *
+ * Once we get to that situation, protect the system's working
+ * set from being evicted by disabling active file page aging.
+ * The logic in get_scan_ratio protects anonymous pages.
+ */
+ if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE])
+ nr[LRU_ACTIVE_FILE] = 0;
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {

2009-04-29 16:08:09

by KOSAKI Motohiro

Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

Hi

Looks better than the previous version, but I have one question.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eac9577..4471dcb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1489,6 +1489,18 @@ static void shrink_zone(int priority, struct zone *zone,
>                        nr[l] = scan;
>        }
>
> +       /*
> +        * When the system is doing streaming IO, memory pressure here
> +        * ensures that active file pages get deactivated, until more
> +        * than half of the file pages are on the inactive list.
> +        *
> +        * Once we get to that situation, protect the system's working
> +        * set from being evicted by disabling active file page aging.
> +        * The logic in get_scan_ratio protects anonymous pages.
> +        */
> +       if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE])
> +               nr[LRU_ACTIVE_FILE] = 0;
> +
>        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>                                        nr[LRU_INACTIVE_FILE]) {
>                for_each_evictable_lru(l) {

We handle the active_anon vs inactive_anon ratio in shrink_list().
Why do you insert this logic into shrink_zone()?

2009-04-29 16:10:37

by Peter Zijlstra

Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Wed, 2009-04-29 at 11:47 -0400, Rik van Riel wrote:
> When the file LRU lists are dominated by streaming IO pages,
> evict those pages first, before considering evicting other
> pages.
>
> This should be safe from deadlocks or performance problems
> because only three things can happen to an inactive file page:
> 1) referenced twice and promoted to the active list
> 2) evicted by the pageout code
> 3) under IO, after which it will get evicted or promoted
>
> The pages freed in this way can either be reused for streaming
> IO, or allocated for something else. If the pages are used for
> streaming IO, this pageout pattern continues. Otherwise, we will
> fall back to the normal pageout pattern.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> On Wed, 29 Apr 2009 08:42:29 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > Isn't there a hole where LRU_*_FILE << LRU_*_ANON and we now stop
> > shrinking INACTIVE_ANON even though it makes sense to.
>
> Peter, after looking at this again, I believe that the get_scan_ratio
> logic should take care of protecting the anonymous pages, so we can
> get away with this following, less intrusive patch.
>
> Elladan, does this smaller patch still work as expected?

Provided of course that it actually fixes Elladan's issue, this looks
good to me.

Acked-by: Peter Zijlstra <[email protected]>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eac9577..4471dcb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1489,6 +1489,18 @@ static void shrink_zone(int priority, struct zone *zone,
> nr[l] = scan;
> }
>
> + /*
> + * When the system is doing streaming IO, memory pressure here
> + * ensures that active file pages get deactivated, until more
> + * than half of the file pages are on the inactive list.
> + *
> + * Once we get to that situation, protect the system's working
> + * set from being evicted by disabling active file page aging.
> + * The logic in get_scan_ratio protects anonymous pages.
> + */
> + if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE])
> + nr[LRU_ACTIVE_FILE] = 0;
> +
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {
>

2009-04-29 16:18:53

by Rik van Riel

Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

KOSAKI Motohiro wrote:
> Hi
>
> Looks better than the previous version, but I have one question.
>
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index eac9577..4471dcb 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1489,6 +1489,18 @@ static void shrink_zone(int priority, struct zone *zone,
>> nr[l] = scan;
>> }
>>
>> + /*
>> + * When the system is doing streaming IO, memory pressure here
>> + * ensures that active file pages get deactivated, until more
>> + * than half of the file pages are on the inactive list.
>> + *
>> + * Once we get to that situation, protect the system's working
>> + * set from being evicted by disabling active file page aging.
>> + * The logic in get_scan_ratio protects anonymous pages.
>> + */
>> + if (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE])
>> + nr[LRU_ACTIVE_FILE] = 0;
>> +
>> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>> nr[LRU_INACTIVE_FILE]) {
>> for_each_evictable_lru(l) {
>>
>
> We handle the active_anon vs inactive_anon ratio in shrink_list().
> Why do you insert this logic into shrink_zone()?
>
Good question. I guess that at lower priority levels, we get to scan
a lot more pages and we could go from having too many inactive
file pages to not having enough in one invocation of shrink_zone().

That makes shrink_list() the better place to implement this, even if
it means doing this comparison more often.

I'll send a new patch this afternoon.

2009-04-29 17:06:40

by Christoph Hellwig

Subject: Re: [PATCH] vmscan: evict use-once pages first

On Tue, Apr 28, 2009 at 08:36:51PM -0700, Elladan wrote:
> Rik,
>
> This patch appears to significantly improve application latency while a large
> file copy runs. I'm not seeing behavior that implies continuous bad page
> replacement.
>
> I'm still seeing some general lag, which I attribute to general filesystem
> slowness. For example, latencytop sees many events like these:
>
> down xfs_buf_lock _xfs_buf_find xfs_buf_get_flags 1475.8 msec 5.9 %

This actually is contention on the buffer lock, and most likely
happens because it's trying to access a buffer that's beeing read
in currently.

>
> xfs_buf_iowait xfs_buf_iostart xfs_buf_read_flags 1740.9 msec 2.6 %

That's an actual metadata read.

> Writing a page to disk 1042.9 msec 43.7 %
>
> It also occasionally sees long page faults:
>
> Page fault 2068.3 msec 21.3 %
>
> I guess XFS (and the elevator) is just doing a poor job managing latency
> (particularly poor since all the IO on /usr/bin is on the reader disk).

The filesystem doesn't really decide which priorities to use, except
for some use of WRITE_SYNC, which is rather minimal in XFS in
2.6.28.

> Creating block layer request 451.4 msec 14.4 %

I guess that's a wait in get_request because we're above nr_requests..

2009-04-29 17:15:09

by Rik van Riel

Subject: [PATCH] vmscan: evict use-once pages first (v3)

When the file LRU lists are dominated by streaming IO pages,
evict those pages first, before considering evicting other
pages.

This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:
1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted

The pages freed in this way can either be reused for streaming
IO, or allocated for something else. If the pages are used for
streaming IO, this pageout pattern continues. Otherwise, we will
fall back to the normal pageout pattern.

Signed-off-by: Rik van Riel <[email protected]>

---
On Thu, 30 Apr 2009 01:07:51 +0900
KOSAKI Motohiro <[email protected]> wrote:

> We handle the active_anon vs inactive_anon ratio in shrink_list().
> Why do you insert this logic into shrink_zone()?

Kosaki, this implementation mirrors the anon side of things precisely.
Does this look good?

Elladan, this patch should work just like the second version. Please
let me know how it works for you.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..dbfe7ba 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,6 +94,7 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
int priority);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
+int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
@@ -239,6 +240,12 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
return 1;
}

+static inline int
+mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
+{
+ return 1;
+}
+
static inline unsigned long
mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
enum lru_list lru)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..026cb5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -578,6 +578,17 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
return 0;
}

+int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
+{
+ unsigned long active;
+ unsigned long inactive;
+
+ inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
+ active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+
+ return (active > inactive);
+}
+
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac9577..a73f675 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1348,12 +1348,48 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
return low;
}

+static int inactive_file_is_low_global(struct zone *zone)
+{
+ unsigned long active, inactive;
+
+ active = zone_page_state(zone, NR_ACTIVE_FILE);
+ inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+
+ return (active > inactive);
+}
+
+/**
+ * inactive_file_is_low - check if file pages need to be deactivated
+ * @zone: zone to check
+ * @sc: scan control of this context
+ *
+ * When the system is doing streaming IO, memory pressure here
+ * ensures that active file pages get deactivated, until more
+ * than half of the file pages are on the inactive list.
+ *
+ * Once we get to that situation, protect the system's working
+ * set from being evicted by disabling active file page aging.
+ *
+ * This uses a different ratio than the anonymous pages, because
+ * the page cache uses a use-once replacement algorithm.
+ */
+static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
+{
+ int low;
+
+ if (scanning_global_lru(sc))
+ low = inactive_file_is_low_global(zone);
+ else
+ low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
+ return low;
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
int file = is_file_lru(lru);

- if (lru == LRU_ACTIVE_FILE) {
+ if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}

2009-04-30 00:39:57

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v3)

> When the file LRU lists are dominated by streaming IO pages,
> evict those pages first, before considering evicting other
> pages.
>
> This should be safe from deadlocks or performance problems
> because only three things can happen to an inactive file page:
> 1) referenced twice and promoted to the active list
> 2) evicted by the pageout code
> 3) under IO, after which it will get evicted or promoted
>
> The pages freed in this way can either be reused for streaming
> IO, or allocated for something else. If the pages are used for
> streaming IO, this pageout pattern continues. Otherwise, we will
> fall back to the normal pageout pattern.
>
> Signed-off-by: Rik van Riel <[email protected]>
>
> ---
> On Thu, 30 Apr 2009 01:07:51 +0900
> KOSAKI Motohiro <[email protected]> wrote:
>
> > we handle the active_anon vs inactive_anon ratio in shrink_list().
> > Why do you insert this logic into shrink_zone()?
>
> Kosaki, this implementation mirrors the anon side of things precisely.
> Does this look good?
>
> Elladan, this patch should work just like the second version. Please
> let me know how it works for you.

Looks good to me, thanks.
However, I don't hit the issue Rik described on my machine, so I hope Elladan
will report his test results.


2009-04-30 04:15:40

by Elladan

[permalink] [raw]
Subject: Re: Swappiness vs. mmap() and interactive response

On Tue, Apr 28, 2009 at 11:34:55PM -0700, Andrew Morton wrote:
> On Wed, 29 Apr 2009 14:51:07 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
>
> > Hi
> >
> > > On Tue, Apr 28, 2009 at 05:09:16PM +0800, Wu Fengguang wrote:
> > > > The semi-drop-behind is a great idea for the desktop - putting just-
> > > > accessed pages at the end of the LRU. However, I'm still afraid it vastly
> > > > changes the caching behavior and won't work as expected in server
> > > > workloads - shall we verify this?
> > > >
> > > > Back to this big-cp-hurts-responsiveness issue. Background write
> > > > requests can easily pass the io scheduler's obstacles and fill up
> > > > the disk queue. Now every read request will have to wait for 10+ writes
> > > > - leading to a 10x slowdown of major page faults.
> > > >
> > > > I reached this conclusion based on recent CFQ code reviews. I will bring
> > > > up a queue-depth-limiting patch for further experimentation.
> > >
> > > We can muck with the I/O scheduler, but another thing to consider is
> > > whether the VM should be more aggressively throttling writes in this
> > > case; it sounds like the big cp in this case may be dirtying pages so
> > > aggressively that it's driving other (more useful) pages out of the
> > > page cache --- if the target disk is slower than the source disk (for
> > > example, backing up a SATA primary disk to a USB-attached backup disk)
> > > no amount of drop-behind is going to help the situation.
> > >
> > > So that leaves three areas for exploration:
> > >
> > > * Write-throttling
> > > * Drop-behind
> > > * background writes pushing aside foreground reads
> > >
> > > Hmm, note that although the original bug reporter is running Ubuntu
> > > Jaunty, and hence 2.6.28, this problem is going to get *worse* with
> > > 2.6.30, since we have the ext3 data=ordered latency fixes which will
> > > write out any journal activity, and worse, any synchronous commits
> > > (i.e., caused by fsync) will force out all of the dirty pages with
> > > WRITE_SYNC priority. So with a heavy load, I suspect this is going to
> > > be more of a VM issue, and especially figuring out how to tune more
> > > aggressive write-throttling may be key here.
> >
> > Firstly, I'd like to report my reproduction test results.
> >
> > test environment: no lvm, copy ext3 to ext3 (not mv), no change to swappiness,
> >     CFQ is used, userland is Fedora 10, mmotm (2.6.30-rc1 + mm patch),
> >     CPU opteron x4, mem 4G
> >
> > mouse move lag: did not happen
> > window move lag: did not happen
> > Mapped pages decrease rapidly: did not happen (I guess these pages stay in
> >     the active list on my system)
> > page fault large latency: happened (latencytop displays >200ms)
>
> hm. The last two observations appear to be inconsistent.
>
> Elladan, have you checked to see whether the Mapped: number in
> /proc/meminfo is decreasing?

Yes, Mapped decreases while a large file copy is ongoing. It increases again
if I use the GUI.

> > So for now, I don't doubt the vm replacement logic,
> > but I need to investigate more.
> > I plan to try the following things today and tomorrow:
> >
> > - XFS
> > - LVM
> > - another io scheduler (thanks Ted, good viewpoint)
> > - Rik's new patch
>
> It's not clear that we know what's happening yet, is it? It's such a
> gross problem that you'd think that even our testing would have found
> it by now :(
>
> Elladan, do you know if earlier kernels (2.6.26 or thereabouts) had
> this severe a problem?

No, I don't know about older kernels.

Also, just to add a bit: I'm having some difficulty reproducing the extremely
severe latency I was seeing right off. It's not difficult for me to reproduce
latencies that are painful, but not on the order of 10 second response. Maybe
3 or 4 seconds at most. I didn't have a stopwatch handy originally though, so
it's somewhat subjective, but I wonder if there's some element of the load that
I'm missing.

I had a theory about why this might be: my original repro was copying data
which I believe had been written once, but never read. Plus, I was using
relatime. However, on second thought this doesn't work -- there's only 8000
files, and a re-test with atime turned on isn't much different than with
relatime.

The other possibility is that there was some other background IO load spike,
which I didn't notice at the time. I don't know what that would be though,
unless it was one of gnome's indexing jobs (I didn't see one, though).

-Elladan

2009-04-30 04:47:09

by Andrew Morton

[permalink] [raw]
Subject: Re: Swappiness vs. mmap() and interactive response

On Wed, 29 Apr 2009 21:14:39 -0700 Elladan <[email protected]> wrote:

> > Elladan, have you checked to see whether the Mapped: number in
> > /proc/meminfo is decreasing?
>
> Yes, Mapped decreases while a large file copy is ongoing. It increases again
> if I use the GUI.

OK. If that's still happening to an appreciable extent after you've
increased /proc/sys/vm/swappiness then I'd wager that we have a
bug/regression in that area.

Local variable `scan' in shrink_zone() is vulnerable to multiplicative
overflows on large zones, but I doubt if you have enough memory to
trigger that bug.


From: Andrew Morton <[email protected]>

Local variable `scan' can overflow on zones which are larger than

(2G * 4k) / 100 = 80GB.

Making it 64-bit on 64-bit will fix that up.

Cc: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/vmscan.c~vmscan-avoid-multiplication-overflow-in-shrink_zone mm/vmscan.c
--- a/mm/vmscan.c~vmscan-avoid-multiplication-overflow-in-shrink_zone
+++ a/mm/vmscan.c
@@ -1479,7 +1479,7 @@ static void shrink_zone(int priority, st

for_each_evictable_lru(l) {
int file = is_file_lru(l);
- int scan;
+ unsigned long scan;

scan = zone_nr_pages(zone, sc, l);
if (priority) {
_
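
For illustration, here is a minimal user-space sketch of the overflow described
above. The 100GB zone size and the worst-case scaling factor of 100 are made-up
numbers for the example, not values taken from this report:

/*
 * Hypothetical sketch: 'scan' holds a page count, and scaling it by a
 * percentage (worst case 100) overflows a 32-bit signed int once the
 * zone exceeds about 2^31 / 100 pages -- roughly the 80GB figure above
 * with 4k pages.
 */
#include <stdio.h>

int main(void)
{
        long long zone_bytes = 100LL << 30;     /* a made-up 100GB zone */
        int scan = (int)(zone_bytes / 4096);    /* ~26M pages, still fits in an int */

        /* signed overflow: formally undefined, wraps negative in practice */
        printf("int scan:           %d\n", scan * 100 / 100);
        /* what the fix effectively does on a 64-bit kernel */
        printf("unsigned long scan: %lu\n", (unsigned long)scan * 100 / 100);
        return 0;
}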

2009-04-30 04:55:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Swappiness vs. mmap() and interactive response

> On Wed, 29 Apr 2009 21:14:39 -0700 Elladan <[email protected]> wrote:
>
> > > Elladan, have you checked to see whether the Mapped: number in
> > > /proc/meminfo is decreasing?
> >
> > Yes, Mapped decreases while a large file copy is ongoing. It increases again
> > if I use the GUI.
>
> OK. If that's still happening to an appreciable extent after you've
> increased /proc/sys/vm/swappiness then I'd wager that we have a
> bug/regression in that area.
>
> Local variable `scan' in shrink_zone() is vulnerable to multiplicative
> overflows on large zones, but I doubt if you have enough memory to
> trigger that bug.
>
>
> From: Andrew Morton <[email protected]>
>
> Local variable `scan' can overflow on zones which are larger than
>
> (2G * 4k) / 100 = 80GB.
>
> Making it 64-bit on 64-bit will fix that up.

Agghh, thanks for the bugfix.

Note: his meminfo indicates his machine has 3.5GB of RAM, so this
patch doesn't fix his problem.



>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Lee Schermerhorn <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff -puN mm/vmscan.c~vmscan-avoid-multiplication-overflow-in-shrink_zone mm/vmscan.c
> --- a/mm/vmscan.c~vmscan-avoid-multiplication-overflow-in-shrink_zone
> +++ a/mm/vmscan.c
> @@ -1479,7 +1479,7 @@ static void shrink_zone(int priority, st
>
> for_each_evictable_lru(l) {
> int file = is_file_lru(l);
> - int scan;
> + unsigned long scan;
>
> scan = zone_nr_pages(zone, sc, l);
> if (priority) {
> _
>
>


2009-04-30 04:56:34

by Elladan

[permalink] [raw]
Subject: Re: Swappiness vs. mmap() and interactive response

On Wed, Apr 29, 2009 at 09:43:32PM -0700, Andrew Morton wrote:
> On Wed, 29 Apr 2009 21:14:39 -0700 Elladan <[email protected]> wrote:
>
> > > Elladan, have you checked to see whether the Mapped: number in
> > > /proc/meminfo is decreasing?
> >
> > Yes, Mapped decreases while a large file copy is ongoing. It increases again
> > if I use the GUI.
>
> OK. If that's still happening to an appreciable extent after you've
> increased /proc/sys/vm/swappiness then I'd wager that we have a
> bug/regression in that area.
>
> Local variable `scan' in shrink_zone() is vulnerable to multiplicative
> overflows on large zones, but I doubt if you have enough memory to
> trigger that bug.

No, I only have 4GB.

This appears to happen with swappiness set to 0 or 60.

-Elladan

2009-04-30 07:21:45

by Elladan

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Wed, Apr 29, 2009 at 11:47:08AM -0400, Rik van Riel wrote:
> When the file LRU lists are dominated by streaming IO pages,
> evict those pages first, before considering evicting other
> pages.
>
> This should be safe from deadlocks or performance problems
> because only three things can happen to an inactive file page:
> 1) referenced twice and promoted to the active list
> 2) evicted by the pageout code
> 3) under IO, after which it will get evicted or promoted
>
> The pages freed in this way can either be reused for streaming
> IO, or allocated for something else. If the pages are used for
> streaming IO, this pageout pattern continues. Otherwise, we will
> fall back to the normal pageout pattern.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> On Wed, 29 Apr 2009 08:42:29 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > Isn't there a hole where LRU_*_FILE << LRU_*_ANON and we now stop
> > shrinking INACTIVE_ANON even though it makes sense to.
>
> Peter, after looking at this again, I believe that the get_scan_ratio
> logic should take care of protecting the anonymous pages, so we can
> get away with this following, less intrusive patch.
>
> Elladan, does this smaller patch still work as expected?

Rik, since the third patch doesn't work on 2.6.28 (without disabling a lot of
code), I went ahead and tested this patch.

The system does seem relatively responsive with this patch for the most part,
with occasional lag. I don't see much evidence at least over the course of a
few minutes that it pages out applications significantly. It seems about
equivalent to the first patch.

Given Andrew Morton's request that I track the Mapped: field in /proc/meminfo,
I went ahead and did that with this patch built into a kernel. Compared to the
standard Ubuntu kernel, this patch keeps significantly more Mapped memory
around, and it shrinks at a slower rate after the test runs for a while.
Eventually, it seems to reach a steady state.
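
(For reference, this kind of logging needs nothing more than a periodic read of
/proc/meminfo; the snippet below is a hypothetical example of such a logger,
not the exact procedure used for the numbers in this report:)

/*
 * Hypothetical helper: print the Mapped: line from /proc/meminfo every
 * five seconds.  Stop it with ^C.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char line[256];

        for (;;) {
                FILE *f = fopen("/proc/meminfo", "r");

                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f))
                        if (!strncmp(line, "Mapped:", 7))
                                fputs(line, stdout);
                fclose(f);
                fflush(stdout);
                sleep(5);
        }
}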

For example, with your patch, Mapped will often go for 30 seconds without
changing significantly. Without your patch, it continuously lost about
500-1000K every 5 seconds, and then jumped up again significantly when I
touched Firefox or other applications. I do see some of that behavior with
your patch too, but it's much less significant.

When I first initiated the background load, Mapped did rapidly decrease from
about 85000K to 47000K. It seems to have reached a fairly steady state since
then. I would guess this implies that the VM paged out parts of my executable
set that aren't touched very often, but isn't applying further pressure to my
active pages? Also, for example, after letting the test run for a while, I
scrolled around some tabs in Firefox that I hadn't used since the test began,
and experienced significant lag.

This seems ok (not disastrous, anyway). I suspect desktop users would
generally prefer the VM were extremely aggressive about keeping their
executables paged in though, much more so than this patch provides (and note how
popular swappiness=0 seems to be). Paging applications back in seems to
introduce a large amount of UI latency, even if the VM keeps it to a sane level
as with this patch. Also, I don't see many desktop workloads where paging out
applications to grow the data cache is ever helpful -- practically all desktop
workloads where you get a lot of IO involve streaming, not data that might
possibly fit in RAM. If I'm just copying a bunch of files around, I'd prefer
that even "worthless" pages such as the parts of Firefox that are only used
during load time or during rare config requests (and would thus not appear to
be part of my working set short-term) stay in cache, so I can get the maximum
interactive performance from my application.

Thank you,
Elladan

2009-04-30 08:12:38

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v3)

On Wed, Apr 29, 2009 at 01:14:36PM -0400, Rik van Riel wrote:
> When the file LRU lists are dominated by streaming IO pages,
> evict those pages first, before considering evicting other
> pages.
>
> This should be safe from deadlocks or performance problems
> because only three things can happen to an inactive file page:
> 1) referenced twice and promoted to the active list
> 2) evicted by the pageout code
> 3) under IO, after which it will get evicted or promoted
>
> The pages freed in this way can either be reused for streaming
> IO, or allocated for something else. If the pages are used for
> streaming IO, this pageout pattern continues. Otherwise, we will
> fall back to the normal pageout pattern.
>
> Signed-off-by: Rik van Riel <[email protected]>

Although Elladan didn't test this exact patch, he reported on v2 that
the general idea of scanning active files only when they exceed the
inactive set works.

Acked-by: Johannes Weiner <[email protected]>

2009-04-30 12:00:17

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Swappiness vs. mmap() and interactive response

> test environment: no lvm, copy ext3 to ext3 (not mv), no change to swappiness,
>     CFQ is used, userland is Fedora 10, mmotm (2.6.30-rc1 + mm patch),
>     CPU opteron x4, mem 4G
>
> mouse move lag: did not happen
> window move lag: did not happen
> Mapped pages decrease rapidly: did not happen (I guess these pages stay in
>     the active list on my system)
> page fault large latency: happened (latencytop displays >1200ms)
>
>
> So for now, I don't doubt the vm replacement logic,
> but I need to investigate more.
> I plan to try the following things today and tomorrow:
>
> - XFS
> - LVM
> - another io scheduler (thanks Ted, good viewpoint)
> - Rik's new patch

Hmm, the AS io-scheduler doesn't cause such large latency in my environment.
Elladan, can you try the AS scheduler? (add the boot option "elevator=as")

2009-04-30 13:08:39

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

Elladan wrote:

>> Elladan, does this smaller patch still work as expected?

> The system does seem relatively responsive with this patch for the most part,
> with occasional lag. I don't see much evidence at least over the course of a
> few minutes that it pages out applications significantly. It seems about
> equivalent to the first patch.

OK, good to hear that.

> This seems ok (not disastrous, anyway). I suspect desktop users would
> generally prefer the VM were extremely aggressive about keeping their
> executables paged in though,

I agree that desktop users would probably prefer something even
more aggressive. However, we do need to balance this against
other workloads, where inactive file pages need to be given a
fair chance to be referenced twice and promoted to the active
file list.

Because of that, I have chosen a patch with a minimal risk of
regressions on any workload.

--
All rights reversed.

2009-04-30 13:47:43

by Elladan

[permalink] [raw]
Subject: Re: Swappiness vs. mmap() and interactive response

On Thu, Apr 30, 2009 at 08:59:59PM +0900, KOSAKI Motohiro wrote:
> > test environment: no lvm, copy ext3 to ext3 (not mv), no change to swappiness,
> >     CFQ is used, userland is Fedora 10, mmotm (2.6.30-rc1 + mm patch),
> >     CPU opteron x4, mem 4G
> >
> > mouse move lag: did not happen
> > window move lag: did not happen
> > Mapped pages decrease rapidly: did not happen (I guess these pages stay in
> >     the active list on my system)
> > page fault large latency: happened (latencytop displays >1200ms)
> >
> >
> > So for now, I don't doubt the vm replacement logic,
> > but I need to investigate more.
> > I plan to try the following things today and tomorrow:
> >
> > - XFS
> > - LVM
> > - another io scheduler (thanks Ted, good viewpoint)
> > - Rik's new patch
>
> Hmm, the AS io-scheduler doesn't cause such large latency in my environment.
> Elladan, can you try the AS scheduler? (add the boot option "elevator=as")

I switched at runtime with /sys/block/sd[ab]/queue/scheduler, using Rik's
second patch for page replacement. It was hard to tell if this made much
difference in latency, as reported by latencytop. Both schedulers sometimes
show outliers up to 1400msec or so, and the average latency looks like it may
be similar.

Thanks,
Elladan

2009-04-30 14:01:35

by Elladan

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, Apr 30, 2009 at 09:08:06AM -0400, Rik van Riel wrote:
> Elladan wrote:
>
>>> Elladan, does this smaller patch still work as expected?
>
>> The system does seem relatively responsive with this patch for the most part,
>> with occasional lag. I don't see much evidence at least over the course of a
>> few minutes that it pages out applications significantly. It seems about
>> equivalent to the first patch.
>
> OK, good to hear that.
>
>> This seems ok (not disastrous, anyway). I suspect desktop users would
>> generally prefer the VM were extremely aggressive about keeping their
>> executables paged in though,
>
> I agree that desktop users would probably prefer something even
> more aggressive. However, we do need to balance this against
> other workloads, where inactive file pages need to be given a
> fair chance to be referenced twice and promoted to the active
> file list.
>
> Because of that, I have chosen a patch with a minimal risk of
> regressions on any workload.

I agree, this seems to work well as a bugfix for a general-purpose system.

I'm just not sure that a general-purpose page replacement algorithm actually
serves most desktop users well. I remember using some kludges back in the
2.2/2.4 days to try to force eviction of application pages when my system was
low on RAM on occasion, but for desktop use that naive VM actually seemed
to generally have fewer latency problems.

Plus, since hard disks haven't been improving in speed (except for the surge in
SSDs), but RAM and CPU have been increasing dramatically, any paging or
swapping activity just becomes more and more noticeable.

Thanks,
Elladan

2009-05-01 00:51:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, 30 Apr 2009 00:20:58 -0700
Elladan <[email protected]> wrote:

> > Elladan, does this smaller patch still work as expected?
>
> Rik, since the third patch doesn't work on 2.6.28 (without disabling a lot of
> code), I went ahead and tested this patch.
>
> The system does seem relatively responsive with this patch for the most part,
> with occasional lag. I don't see much evidence at least over the course of a
> few minutes that it pages out applications significantly. It seems about
> equivalent to the first patch.
>
> Given Andrew Morton's request that I track the Mapped: field in /proc/meminfo,
> I went ahead and did that with this patch built into a kernel. Compared to the
> standard Ubuntu kernel, this patch keeps significantly more Mapped memory
> around, and it shrinks at a slower rate after the test runs for a while.
> Eventually, it seems to reach a steady state.
>
> For example, with your patch, Mapped will often go for 30 seconds without
> changing significantly. Without your patch, it continuously lost about
> 500-1000K every 5 seconds, and then jumped up again significantly when I
> touched Firefox or other applications. I do see some of that behavior with
> your patch too, but it's much less significant.

Were you able to tell whether altering /proc/sys/vm/swappiness appropriately
regulated the rate at which the mapped page count decreased?

Thanks.

2009-05-01 01:01:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, 30 Apr 2009 17:45:36 -0700
Andrew Morton <[email protected]> wrote:

> Were you able to tell whether altering /proc/sys/vm/swappiness
> appropriately regulated the rate at which the mapped page count
> decreased?

That should not make a difference at all for mapped file
pages, after the change was merged that makes the VM ignore
the referenced bit of mapped active file pages.

Ever since the split LRU code was merged, all that the
swappiness controls is the aggressiveness of file vs
anonymous LRU scanning.
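
(As a rough illustration of that split: the sketch below is a simplification,
not the actual get_scan_ratio() code, which also weights each list by its
recently scanned/rotated history:)

/*
 * Simplified sketch of how swappiness biases the file vs anon scan
 * split.  anon_prio/file_prio follow the kernel's swappiness weighting;
 * the rest is an approximation.
 */
#include <stdio.h>

static void scan_split(unsigned long swappiness,
                       unsigned long anon_pages, unsigned long file_pages,
                       unsigned long *pct_anon, unsigned long *pct_file)
{
        unsigned long anon_prio = swappiness;           /* 0..100   */
        unsigned long file_prio = 200 - swappiness;     /* 100..200 */
        unsigned long ap = anon_prio * anon_pages;
        unsigned long fp = file_prio * file_pages;

        *pct_anon = 100 * ap / (ap + fp + 1);
        *pct_file = 100 - *pct_anon;
}

int main(void)
{
        unsigned long a, f;

        scan_split(0, 100000, 100000, &a, &f);
        printf("swappiness=0:  scan %lu%% anon, %lu%% file\n", a, f);
        scan_split(60, 100000, 100000, &a, &f);
        printf("swappiness=60: scan %lu%% anon, %lu%% file\n", a, f);
        return 0;
}

With equal-sized lists, swappiness=0 puts essentially all of the scanning
pressure on the file LRUs, which is where the mapped executables and
libraries live, so lowering swappiness cannot protect them.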

Currently the kernel has no effective code to protect the
page cache working set from streaming IO. Elladan's bug
report shows that we do need some kind of protection...

--
All rights reversed.

2009-05-01 01:18:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, 30 Apr 2009 20:59:36 -0400
Rik van Riel <[email protected]> wrote:

> On Thu, 30 Apr 2009 17:45:36 -0700
> Andrew Morton <[email protected]> wrote:
>
> > Were you able to tell whether altering /proc/sys/vm/swappiness
> > appropriately regulated the rate at which the mapped page count
> > decreased?
>
> That should not make a difference at all for mapped file
> pages, after the change was merged that makes the VM ignore
> the referenced bit of mapped active file pages.
>
> Ever since the split LRU code was merged, all that the
> swappiness controls is the aggressiveness of file vs
> anonymous LRU scanning.

Which would cause exactly the problem Elladan saw?

> Currently the kernel has no effective code to protect the
> page cache working set from streaming IO. Elladan's bug
> report shows that we do need some kind of protection...

Seems to me that reclaim should treat swapcache-backed mapped pages in
a similar fashion to file-backed mapped pages?

2009-05-01 01:52:12

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, 30 Apr 2009 18:13:40 -0700
Andrew Morton <[email protected]> wrote:

> On Thu, 30 Apr 2009 20:59:36 -0400
> Rik van Riel <[email protected]> wrote:
>
> > On Thu, 30 Apr 2009 17:45:36 -0700
> > Andrew Morton <[email protected]> wrote:
> >
> > > Were you able to tell whether altering /proc/sys/vm/swappiness
> > > appropriately regulated the rate at which the mapped page count
> > > decreased?
> >
> > That should not make a difference at all for mapped file
> > pages, after the change was merged that makes the VM ignore
> > the referenced bit of mapped active file pages.
> >
> > Ever since the split LRU code was merged, all that the
> > swappiness controls is the aggressiveness of file vs
> > anonymous LRU scanning.
>
> Which would cause exactly the problem Elladan saw?

Yes. It was not noticeable in the initial split LRU code,
but after we decided to ignore the referenced bit on active
file pages and deactivate pages regardless, it has gotten
exacerbated.

That change was very good for scalability, so we should not
undo it. However, we do need to put something in place to
protect the working set from streaming IO.

> > Currently the kernel has no effective code to protect the
> > page cache working set from streaming IO. Elladan's bug
> > report shows that we do need some kind of protection...
>
> > Seems to me that reclaim should treat swapcache-backed mapped pages in
> a similar fashion to file-backed mapped pages?

Swapcache-backed pages are not on the same set of LRUs as
file-backed mapped pages.

Furthermore, there is no streaming IO on the anon LRUs like
there is on the file LRUs. Only the file LRUs need (and want)
use-once replacement, which means that we only need special
protection of the working set for file-backed pages.

When we implement working set protection, we might as well
do it for frequently accessed unmapped pages too. There is
no reason to restrict this protection to mapped pages.

--
All rights reversed.

2009-05-01 02:59:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, 30 Apr 2009 21:50:34 -0400 Rik van Riel <[email protected]> wrote:

> > Which would cause exactly the problem Elladan saw?
>
> Yes. It was not noticeable in the initial split LRU code,
> but after we decided to ignore the referenced bit on active
> file pages and deactivate pages regardless, it has gotten
> exacerbated.
>
> That change was very good for scalability, so we should not
> undo it. However, we do need to put something in place to
> protect the working set from streaming IO.
>
> > > Currently the kernel has no effective code to protect the
> > > page cache working set from streaming IO. Elladan's bug
> > > report shows that we do need some kind of protection...
> >
> > Seems to me that reclaim should treat swapcache-backed mapped pages in
> > a similar fashion to file-backed mapped pages?
>
> Swapcache-backed pages are not on the same set of LRUs as
> file-backed mapped pages.

yup.

> Furthermore, there is no streaming IO on the anon LRUs like
> there is on the file LRUs. Only the file LRUs need (and want)
> use-once replacement, which means that we only need special
> protection of the working set for file-backed pages.

OK.

> When we implement working set protection, we might as well
> do it for frequently accessed unmapped pages too. There is
> no reason to restrict this protection to mapped pages.

Well. Except for empirical observation, which tells us that biasing
reclaim to prefer to retain mapped memory produces a better result.

2009-05-01 03:11:19

by Elladan

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict use-once pages first (v2)

On Thu, Apr 30, 2009 at 05:45:36PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2009 00:20:58 -0700
> Elladan <[email protected]> wrote:
>
> > > Elladan, does this smaller patch still work as expected?
> >
> > Rik, since the third patch doesn't work on 2.6.28 (without disabling a lot of
> > code), I went ahead and tested this patch.
> >
> > The system does seem relatively responsive with this patch for the most part,
> > with occasional lag. I don't see much evidence at least over the course of a
> > few minutes that it pages out applications significantly. It seems about
> > equivalent to the first patch.
> >
> > Given Andrew Morton's request that I track the Mapped: field in /proc/meminfo,
> > I went ahead and did that with this patch built into a kernel. Compared to the
> > standard Ubuntu kernel, this patch keeps significantly more Mapped memory
> > around, and it shrinks at a slower rate after the test runs for a while.
> > Eventually, it seems to reach a steady state.
> >
> > For example, with your patch, Mapped will often go for 30 seconds without
> > changing significantly. Without your patch, it continuously lost about
> > 500-1000K every 5 seconds, and then jumped up again significantly when I
> > touched Firefox or other applications. I do see some of that behavior with
> > your patch too, but it's much less significant.
>
> Were you able to tell whether altering /proc/sys/vm/swappiness appropriately
> regulated the rate at which the mapped page count decreased?

I don't believe so. I tested with swappiness=0 and =60, and in each case the
mapped pages continued to decrease. I don't know at what rate though. If
you'd like more precise data, I can rerun the test with appropriate logging. I
admit my "Hey, latency is terrible and mapped pages is decreasing" testing is
somewhat unscientific.

I get the impression that VM regressions happen fairly regularly. Does anyone
have good unit tests for this? It seems like a difficult problem, since it's
partly based on access patterns and partly on timing.

-J