2010-04-13 01:20:39

by Dave Chinner

Subject: [PATCH] mm: disallow direct reclaim page writeback

From: Dave Chinner <[email protected]>

When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns. This problem was recently encountered on x86_64 systems with
8k stacks running XFS with simple storage configurations.

Writeback from direct reclaim also adversely affects background writeback. The
background flusher threads should already be taking care of cleaning dirty
pages, and direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO patterns from
the background flusher threads to be upset by LRU-order writeback from
pageout() which can be effectively random IO. Having competing sources of IO
trying to clean pages on the same backing device reduces throughput by
increasing the amount of seeks that the backing device has to do to write back
the pages.

Hence for direct reclaim we should not allow ->writepages to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.

Reported-by: John Berthels <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
---
mm/vmscan.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..5321ac4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }

/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
{
struct scan_control sc = {
.gfp_mask = gfp_mask,
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
struct zone *zone, int nid)
{
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.swappiness = swappiness,
@@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
{
struct zonelist *zonelist;
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+ .may_writepage = (current_is_kswapd() &&
+ (zone_reclaim_mode & RECLAIM_WRITE)),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
--
1.6.5


2010-04-13 08:31:31

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Hi

> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.

Ummm..
This patch is hard to ack. This patch's pros/cons seem to be:

Pros:
1) prevent XFS stack overflow
2) improve io workload performance

Cons:
3) TOTALLY kill lumpy reclaim (i.e. high order allocation)

So, if we only needed to consider IO workloads there would be no downside,
but we can't.

I think (1) is an XFS issue; XFS should take care of it itself. But (2) is
really a VM issue. Right now our VM does overly aggressive pageout() and
decreases IO throughput. I've heard about this issue from Chris (cc to him).
I'd like to fix this, but we can never kill pageout() completely because we
can't assume users don't run high-order allocation workloads.
(Perhaps Mel's memory compaction code will improve things enough that we
can kill lumpy reclaim in the future, but that's another story.)

Thanks.


>
> Reported-by: John Berthels <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
>


2010-04-13 09:58:37

by Mel Gorman

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>

It's already known that the VM requesting specific pages be cleaned and
reclaimed is a bad IO pattern but unfortunately it is still required by
lumpy reclaim. This change would appear to break that although I haven't
tested it to be 100% sure.

Even without high-order considerations, this patch would appear to make
fairly large changes to how direct reclaim behaves. It would no longer
wait on page writeback, for example, so direct reclaim will return sooner
than it did, potentially going OOM if there were a lot of dirty pages and
it made no progress during direct reclaim.

> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
>

If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
instead of GFP_KERNEL.
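
For illustration, a minimal sketch of that convention (hypothetical helper,
not taken from any particular filesystem): GFP_NOFS clears __GFP_FS, so any
reclaim triggered by the allocation cannot re-enter the filesystem through
->writepage:

#include <linux/gfp.h>
#include <linux/slab.h>

/* allocation made while holding fs locks: reclaim must not re-enter the fs */
static void *fs_alloc_under_lock(size_t size)
{
	return kmalloc(size, GFP_NOFS);
}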

> Reported-by: John Berthels <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-13 10:29:48

by Dave Chinner

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 05:31:25PM +0900, KOSAKI Motohiro wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
>
> Ummm..
> This patch is harder to ack. This patch's pros/cons seems
>
> Pros:
> 1) prevent XFS stack overflow
> 2) improve io workload performance
>
> Cons:
> 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
>
> So, If we only need to consider io workload this is no downside. but
> it can't.
>
> I think (1) is XFS issue. XFS should care it itself.

The filesystem is irrelevant, IMO.

The traces from the reporter showed that we've got close to a 2k
stack footprint for memory allocation to direct reclaim and then we
can put the entire writeback path on top of that. This is roughly
3.5k for XFS, and then depending on the storage subsystem
configuration and transport can be another 2k of stack needed below
XFS.

IOWs, if we completely ignore the filesystem stack usage, there's
still up to 4k of stack needed in the direct reclaim path. Given
that one of the stack traces supplied show direct reclaim being
entered with over 3k of stack already used, pretty much any
filesystem is capable of blowing an 8k stack.

So, this is not an XFS issue, even though XFS is the first to
uncover it. Don't shoot the messenger....

> but (2) is really
> VM issue. Now our VM makes too agressive pageout() and decrease io
> throughput. I've heard this issue from Chris (cc to him). I'd like to
> fix this.

I didn't expect this to be easy. ;)

I had a good look at what the code was doing before I wrote the
patch, and IMO, there is no good reason for issuing IO from direct
reclaim.

My reasoning is as follows - consider a system with a typical
sata disk and the machine is low on memory and in direct reclaim.

direct reclaim is taking pages off the end of the LRU and writing
them one at a time from there. It is scanning thousands of pages
and it triggers IO on the dirty ones it comes across.
This is done with no regard to the IO patterns it generates - it can
(and frequently does) result in completely random single page IO
patterns hitting the disk, and as a result cleaning pages happens
really, really slowly. If we are in an OOM situation, the machine
will grind to a halt as it struggles to clean maybe 1MB of RAM per
second.

On the other hand, if the IO is well formed then the disk might be
capable of 100MB/s. The background flusher threads and filesystems
try very hard to issue well formed IOs, so the difference in the
rate that memory can be cleaned may be a couple of orders of
magnitude.

(Of course, the difference will typically be somewhere in between
these two extremes, but I'm simply trying to illustrate how big
the difference in performance can be.)

IOWs, the background flusher threads are there to clean memory by
issuing IO as efficiently as possible. Direct reclaim is very
efficient at reclaiming clean memory, but it really, really sucks at
cleaning dirty memory in a predictable and deterministic manner. It
is also much more likely to hit worst case IO patterns than the
background flusher threads.

Hence I think that direct reclaim should be deferring to the
background flusher threads for cleaning memory and not trying to be
doing it itself.
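
For reference, the gate this relies on in shrink_page_list() looks roughly
like the following (paraphrased and abridged, not the verbatim code). With
the patch applied, direct reclaim always has sc->may_writepage == 0, so
dirty pages are simply kept on the LRU for the flusher threads instead of
being written from reclaim context:

	if (PageDirty(page)) {
		if (!may_enter_fs)
			goto keep_locked;
		if (!sc->may_writepage)
			goto keep_locked;

		/* only reached when reclaim is allowed to issue writeback */
		switch (pageout(page, mapping, sync_writeback)) {
		case PAGE_KEEP:
			goto keep_locked;
		case PAGE_ACTIVATE:
			goto activate_locked;
		default:
			/* PAGE_SUCCESS / PAGE_CLEAN handling elided */
			break;
		}
	}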

> but we never kill pageout() completely because we can't
> assume users don't run high order allocation workload.

I think that lumpy reclaim will still work just fine.

Lumpy reclaim appears to be using IO as a method of slowing
down the reclaim cycle - the congestion_wait() call will still
function as it does now if the background flusher threads are active
and causing congestion. I don't see why lumpy reclaim specifically
needs to be issuing IO to make it work - if the congestion_wait() is
not waiting long enough then wait longer - don't issue IO to extend
the wait time.

Also, there doesn't appear to be anything special about the chunks of
pages it's issuing IO on and waiting for, either. They are simply
the last N pages on the LRU that could be grabbed so they have no
guarantee of contiguity, so the IO it issues does nothing specific
to help higher order allocations to succeed.

Hence it really seems to me that the effectiveness of lumpy reclaim
is determined mostly by the effectiveness of the IO subsystem - the
faster the IO subsystem cleans pages, the less time lumpy reclaim
will block and the faster it will free pages. From this observation
and the fact that issuing IO only from the bdi flusher threads will
have the same effect (improves IO subsystem effectiveness), it seems
to me that lumpy reclaim should not be adversely affected by this
change.

Of course, the code is a maze of twisty passages, so I probably
missed something important. Hopefully someone can tell me what. ;)

FWIW, the biggest problem here is that I have absolutely no clue on
how to test what the impact on lumpy reclaim really is. Does anyone
have a relatively simple test that can be run to determine what the
impact is?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-13 11:19:10

by Dave Chinner

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 10:58:15AM +0100, Mel Gorman wrote:
> On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
>
> It's already known that the VM requesting specific pages be cleaned and
> reclaimed is a bad IO pattern but unfortunately it is still required by
> lumpy reclaim. This change would appear to break that although I haven't
> tested it to be 100% sure.

How do you test it? I'd really like to be able to test this myself....

> Even without high-order considerations, this patch would appear to make
> fairly large changes to how direct reclaim behaves. It would no longer
> wait on page writeback for example so direct reclaim will return sooner

AFAICT it still waits for pages under writeback in exactly the same manner
it does now. shrink_page_list() does the following completely
separately to the sc->may_writepage flag:

666 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
667 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
668
669 if (PageWriteback(page)) {
670 /*
671 * Synchronous reclaim is performed in two passes,
672 * first an asynchronous pass over the list to
673 * start parallel writeback, and a second synchronous
674 * pass to wait for the IO to complete. Wait here
675 * for any page for which writeback has already
676 * started.
677 */
678 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
679 wait_on_page_writeback(page);
680 else
681 goto keep_locked;
682 }

So if the page is under writeback, PAGEOUT_IO_SYNC is set and
we can enter the fs, it will still wait for writeback to complete
just like it does now.

However, the current code only uses PAGEOUT_IO_SYNC in lumpy
reclaim, so for most typical workloads direct reclaim does not wait
on page writeback, either. Hence, this patch doesn't appear to
change the actions taken on a page under writeback in direct
reclaim....
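
For context, the spot that selects PAGEOUT_IO_SYNC - roughly paraphrased
from shrink_inactive_list() of that era, not the verbatim code - only fires
for lumpy reclaim after an unproductive asynchronous pass:

	if (nr_freed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* second, synchronous pass: start the IO and wait for it */
		nr_freed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
	}

so ordinary order-0 direct reclaim never takes the PAGEOUT_IO_SYNC path.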

> than it did potentially going OOM if there were a lot of dirty pages and
> it made no progress during direct reclaim.

I did a fair bit of low/small memory testing. This is a subjective
observation, but I definitely seemed to get less severe OOM
situations and better overall responsiveness with this patch compared
to when direct reclaim was doing writeback.

> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
> >
>
> If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> instead of GFP_KERNEL.

This problem is not a filesystem recursion problem which is, as I
understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
code that uses significant stack before trying to allocate memory
that is the problem, e.g. a select() system call:

Depth Size Location (47 entries)
----- ---- --------
0) 7568 16 mempool_alloc_slab+0x16/0x20
1) 7552 144 mempool_alloc+0x65/0x140
2) 7408 96 get_request+0x124/0x370
3) 7312 144 get_request_wait+0x29/0x1b0
4) 7168 96 __make_request+0x9b/0x490
5) 7072 208 generic_make_request+0x3df/0x4d0
6) 6864 80 submit_bio+0x7c/0x100
7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
....
32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
33) 3120 384 shrink_page_list+0x65e/0x840
34) 2736 528 shrink_zone+0x63f/0xe10
35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
36) 2096 128 try_to_free_pages+0x77/0x80
37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
38) 1728 48 alloc_pages_current+0x8c/0xe0
39) 1680 16 __get_free_pages+0xe/0x50
40) 1664 48 __pollwait+0xca/0x110
41) 1616 32 unix_poll+0x28/0xc0
42) 1584 16 sock_poll+0x1d/0x20
43) 1568 912 do_select+0x3d6/0x700
44) 656 416 core_sys_select+0x18c/0x2c0
45) 240 112 sys_select+0x4f/0x110
46) 128 128 system_call_fastpath+0x16/0x1b

There's 1.6k of stack used before memory allocation is called, 3.1k
used there before ->writepage is entered, XFS used 3.5k, and
if the mempool needed to allocate a page it would have blown the
stack. If there was any significant storage subsystem (add dm, md
and/or scsi of some kind), it would have blown the stack.

Basically, there is not enough stack space available to allow direct
reclaim to enter ->writepage _anywhere_ according to the stack usage
profiles we are seeing here....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-13 11:39:37

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Hi

> > Pros:
> > 1) prevent XFS stack overflow
> > 2) improve io workload performance
> >
> > Cons:
> > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> >
> > So, If we only need to consider io workload this is no downside. but
> > it can't.
> >
> > I think (1) is XFS issue. XFS should care it itself.
>
> The filesystem is irrelevant, IMO.
>
> The traces from the reporter showed that we've got close to a 2k
> stack footprint for memory allocation to direct reclaim and then we
> can put the entire writeback path on top of that. This is roughly
> 3.5k for XFS, and then depending on the storage subsystem
> configuration and transport can be another 2k of stack needed below
> XFS.
>
> IOWs, if we completely ignore the filesystem stack usage, there's
> still up to 4k of stack needed in the direct reclaim path. Given
> that one of the stack traces supplied show direct reclaim being
> entered with over 3k of stack already used, pretty much any
> filesystem is capable of blowing an 8k stack.
>
> So, this is not an XFS issue, even though XFS is the first to
> uncover it. Don't shoot the messenger....

Thanks for the explanation. I hadn't noticed that direct reclaim consumes
2k of stack. I'll investigate it and try to put it on a diet.
But XFS's 3.5k stack consumption is too large too; please put it on a diet as well.


> > but (2) is really
> > VM issue. Now our VM makes too agressive pageout() and decrease io
> > throughput. I've heard this issue from Chris (cc to him). I'd like to
> > fix this.
>
> I didn't expect this to be easy. ;)
>
> I had a good look at what the code was doing before I wrote the
> patch, and IMO, there is no good reason for issuing IO from direct
> reclaim.
>
> My reasoning is as follows - consider a system with a typical
> sata disk and the machine is low on memory and in direct reclaim.
>
> direct reclaim is taking pages of the end of the LRU and writing
> them one at a time from there. It is scanning thousands of pages
> pages and it triggers IO on on the dirty ones it comes across.
> This is done with no regard to the IO patterns it generates - it can
> (and frequently does) result in completely random single page IO
> patterns hitting the disk, and as a result cleaning pages happens
> really, really slowly. If we are in a OOM situation, the machine
> will grind to a halt as it struggles to clean maybe 1MB of RAM per
> second.
>
> On the other hand, if the IO is well formed then the disk might be
> capable of 100MB/s. The background flusher threads and filesystems
> try very hard to issue well formed IOs, so the difference in the
> rate that memory can be cleaned may be a couple of orders of
> magnitude.
>
> (Of course, the difference will typically be somewhere in between
> these two extremes, but I'm simply trying to illustrate how big
> the difference in performance can be.)
>
> IOWs, the background flusher threads are there to clean memory by
> issuing IO as efficiently as possible. Direct reclaim is very
> efficient at reclaiming clean memory, but it really, really sucks at
> cleaning dirty memory in a predictable and deterministic manner. It
> is also much more likely to hit worst case IO patterns than the
> background flusher threads.
>
> Hence I think that direct reclaim should be deferring to the
> background flusher threads for cleaning memory and not trying to be
> doing it itself.

Well, you seem to keep discussing the IO workload. I don't disagree
on that point.

For example, if only order-0 reclaim skipped pageout(), we would get the
above benefit too.



> > but we never kill pageout() completely because we can't
> > assume users don't run high order allocation workload.
>
> I think that lumpy reclaim will still work just fine.
>
> Lumpy reclaim appears to be using IO as a method of slowing
> down the reclaim cycle - the congestion_wait() call will still
> function as it does now if the background flusher threads are active
> and causing congestion. I don't see why lumpy reclaim specifically
> needs to be issuing IO to make it work - if the congestion_wait() is
> not waiting long enough then wait longer - don't issue IO to extend
> the wait time.

Lumpy reclaim is for allocating high-order pages. It not only reclaims
the page at the LRU tail, but also its PFN neighbourhood. The PFN
neighbourhood often contains newly dirtied pages, so we force pageout()
to clean and discard them.

When a high-order allocation occurs, we don't only need to free a large
enough amount of memory, we also need to free a large enough contiguous
memory block.

If we needed to consider _only_ IO throughput, waiting for the flusher
thread might perhaps be faster, but we also need to consider reclaim
latency. I'm worried about that point too.



> Also, there doesn't appear to be anything special about the chunks of
> pages it's issuing IO on and waiting for, either. They are simply
> the last N pages on the LRU that could be grabbed so they have no
> guarantee of contiguity, so the IO it issues does nothing specific
> to help higher order allocations to succeed.

It does. Lumpy reclaim doesn't grab the last N pages; instead it grabs a
contiguous memory chunk. Please see isolate_lru_pages().
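
As a simplified sketch of that idea - this is not the actual kernel code,
and the helper name is made up - isolate_lru_pages() effectively does
something like the following for lumpy reclaim: having taken a page off
the LRU tail, it also tries to isolate the other pages in the naturally
aligned order-sized block around it, dirty or not, so that a contiguous
chunk can be freed:

	#include <linux/mm.h>
	#include <linux/swap.h>

	/* hypothetical helper illustrating the PFN-neighbourhood scan */
	static void isolate_pfn_neighbourhood(struct page *page,
					      unsigned int order,
					      struct list_head *dst)
	{
		unsigned long pfn = page_to_pfn(page);
		unsigned long start = pfn & ~((1UL << order) - 1);
		unsigned long end = start + (1UL << order);
		unsigned long cursor;

		for (cursor = start; cursor < end; cursor++) {
			struct page *cursor_page;

			if (!pfn_valid(cursor))
				break;
			cursor_page = pfn_to_page(cursor);
			/* neighbours are often dirty - hence the pageout() */
			if (__isolate_lru_page(cursor_page, ISOLATE_BOTH, 0) == 0)
				list_add(&cursor_page->lru, dst);
		}
	}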

>
> Hence it really seems to me that the effectiveness of lumpy reclaim
> is determined mostly by the effectiveness of the IO subsystem - the
> faster the IO subsystem cleans pages, the less time lumpy reclaim
> will block and the faster it will free pages. From this observation
> and the fact that issuing IO only from the bdi flusher threads will
> have the same effect (improves IO subsystem effectiveness), it seems
> to me that lumpy reclaim should not be adversely affected by this
> change.
>
> Of course, the code is a maze of twisty passages, so I probably
> missed something important. Hopefully someone can tell me what. ;)
>
> FWIW, the biggest problem here is that I have absolutely no clue on
> how to test what the impact on lumpy reclaim really is. Does anyone
> have a relatively simple test that can be run to determine what the
> impact is?

So, can you please run two workloads concurrently?
- Normal IO workload (fio, iozone, etc..)
- echo $NUM > /proc/sys/vm/nr_hugepages

The most typical high-order allocations are caused by brutal wireless LAN
drivers (or some cheap LAN cards). But sadly, if the test depends on
specific hardware, our discussion could easily turn into a messy maze, so
I'd prefer to use the hugepage feature instead.


Thanks.


2010-04-13 14:37:22

by Dave Chinner

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > > Pros:
> > > 1) prevent XFS stack overflow
> > > 2) improve io workload performance
> > >
> > > Cons:
> > > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> > >
> > > So, If we only need to consider io workload this is no downside. but
> > > it can't.
> > >
> > > I think (1) is XFS issue. XFS should care it itself.
> >
> > The filesystem is irrelevant, IMO.
> >
> > The traces from the reporter showed that we've got close to a 2k
> > stack footprint for memory allocation to direct reclaim and then we
> > can put the entire writeback path on top of that. This is roughly
> > 3.5k for XFS, and then depending on the storage subsystem
> > configuration and transport can be another 2k of stack needed below
> > XFS.
> >
> > IOWs, if we completely ignore the filesystem stack usage, there's
> > still up to 4k of stack needed in the direct reclaim path. Given
> > that one of the stack traces supplied show direct reclaim being
> > entered with over 3k of stack already used, pretty much any
> > filesystem is capable of blowing an 8k stack.
> >
> > So, this is not an XFS issue, even though XFS is the first to
> > uncover it. Don't shoot the messenger....
>
> Thanks explanation. I haven't noticed direct reclaim consume
> 2k stack. I'll investigate it and try diet it.
> But XFS 3.5K stack consumption is too large too. please diet too.

It hasn't grown in the last 2 years after the last major diet where
all the fat was trimmed from it in the last round of the i386 4k
stack vs XFS saga. It seems that everything else around XFS has
grown in that time, and now we are blowing stacks again....

> > Hence I think that direct reclaim should be deferring to the
> > background flusher threads for cleaning memory and not trying to be
> > doing it itself.
>
> Well, you seems continue to discuss io workload. I don't disagree
> such point.
>
> example, If only order-0 reclaim skip pageout(), we will get the above
> benefit too.

But it won't prevent stack blowups...

> > > but we never kill pageout() completely because we can't
> > > assume users don't run high order allocation workload.
> >
> > I think that lumpy reclaim will still work just fine.
> >
> > Lumpy reclaim appears to be using IO as a method of slowing
> > down the reclaim cycle - the congestion_wait() call will still
> > function as it does now if the background flusher threads are active
> > and causing congestion. I don't see why lumpy reclaim specifically
> > needs to be issuing IO to make it work - if the congestion_wait() is
> > not waiting long enough then wait longer - don't issue IO to extend
> > the wait time.
>
> lumpy reclaim is for allocation high order page. then, it not only
> reclaim LRU head page, but also its PFN neighborhood. PFN neighborhood
> is often newly page and still dirty. then we enfoce pageout cleaning
> and discard it.

Ok, I see that now - I missed the second call to __isolate_lru_page()
in isolate_lru_pages().

> When high order allocation occur, we don't only need free enough amount
> memory, but also need free enough contenious memory block.

Agreed, that was why I was kind of surprised not to find it was
doing that. But, as you have pointed out, that was my mistake.

> If we need to consider _only_ io throughput, waiting flusher thread
> might faster perhaps, but actually we also need to consider reclaim
> latency. I'm worry about such point too.

True, but without knowing how to test and measure such things I can't
really comment...

> > Of course, the code is a maze of twisty passages, so I probably
> > missed something important. Hopefully someone can tell me what. ;)
> >
> > FWIW, the biggest problem here is that I have absolutely no clue on
> > how to test what the impact on lumpy reclaim really is. Does anyone
> > have a relatively simple test that can be run to determine what the
> > impact is?
>
> So, can you please run two workloads concurrently?
> - Normal IO workload (fio, iozone, etc..)
> - echo $NUM > /proc/sys/vm/nr_hugepages

What do I measure/observe/record that is meaningful?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-13 19:34:51

by Mel Gorman

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 09:19:02PM +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 10:58:15AM +0100, Mel Gorman wrote:
> > On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <[email protected]>
> > >
> > > When we enter direct reclaim we may have used an arbitrary amount of stack
> > > space, and hence enterring the filesystem to do writeback can then lead to
> > > stack overruns. This problem was recently encountered x86_64 systems with
> > > 8k stacks running XFS with simple storage configurations.
> > >
> > > Writeback from direct reclaim also adversely affects background writeback. The
> > > background flusher threads should already be taking care of cleaning dirty
> > > pages, and direct reclaim will kick them if they aren't already doing work. If
> > > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > > the background flusher threads to be upset by LRU-order writeback from
> > > pageout() which can be effectively random IO. Having competing sources of IO
> > > trying to clean pages on the same backing device reduces throughput by
> > > increasing the amount of seeks that the backing device has to do to write back
> > > the pages.
> > >
> >
> > It's already known that the VM requesting specific pages be cleaned and
> > reclaimed is a bad IO pattern but unfortunately it is still required by
> > lumpy reclaim. This change would appear to break that although I haven't
> > tested it to be 100% sure.
>
> How do you test it? I'd really like to be able to test this myself....
>

Depends. For raw effectiveness, I run a series of performance-related
benchmarks with a final test that

o Starts a number of parallel compiles that in combination are 1.25 times
the size of physical memory
o Sleep three minutes
o Start allocating huge pages recording the latency required for each one
o Record overall success rate and graph latency over time

Lumpy reclaim both increases the success rate and reduces the latency.

> > Even without high-order considerations, this patch would appear to make
> > fairly large changes to how direct reclaim behaves. It would no longer
> > wait on page writeback for example so direct reclaim will return sooner
>
> AFAICT it still waits for pages under writeback in exactly the same manner
> it does now. shrink_page_list() does the following completely
> separately to the sc->may_writepage flag:
>
> 666 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> 667 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> 668
> 669 if (PageWriteback(page)) {
> 670 /*
> 671 * Synchronous reclaim is performed in two passes,
> 672 * first an asynchronous pass over the list to
> 673 * start parallel writeback, and a second synchronous
> 674 * pass to wait for the IO to complete. Wait here
> 675 * for any page for which writeback has already
> 676 * started.
> 677 */
> 678 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
> 679 wait_on_page_writeback(page);
> 680 else
> 681 goto keep_locked;
> 682 }
>

Right, so it'll still wait on writeback but won't kick it off. That
would still be a fairly significant change in behaviour though. Think of
synchronous lumpy reclaim for example, where it queues up a contiguous
batch of pages and then waits for them to write back.

> So if the page is under writeback, PAGEOUT_IO_SYNC is set and
> we can enter the fs, it will still wait for writeback to complete
> just like it does now.
>

But it would no longer be queueing them for writeback so it'd be
depending heavily on kswapd or a background cleaning daemon to clean
them.

> However, the current code only uses PAGEOUT_IO_SYNC in lumpy
> reclaim, so for most typical workloads direct reclaim does not wait
> on page writeback, either.

No, but it does queue them back on the LRU where they might be clean the
next time they are found on the list. How significant a problem this is
I couldn't tell you but it could show a corner case where a large number
of direct reclaimers are encountering dirty pages frequently and
recycling them around the LRU list instead of cleaning them.

> Hence, this patch doesn't appear to
> change the actions taken on a page under writeback in direct
> reclaim....
>

It does, but indirectly. The impact is very direct for lumpy reclaim
obviously. For other direct reclaim, pages that were at the end of the
LRU list are no longer getting cleaned before doing another lap through
the LRU list.

The consequences of the latter are harder to predict.

> > than it did potentially going OOM if there were a lot of dirty pages and
> > it made no progress during direct reclaim.
>
> I did a fair bit of low/small memory testing. This is a subjective
> observation, but I definitely seemed to get less severe OOM
> situations and better overall responisveness with this patch than
> compared to when direct reclaim was doing writeback.
>

And it is possible that it is best overall if only kswapd and the
background cleaner are queueing pages for IO. All I can say for sure is
that this does appear to hurt lumpy reclaim and does affect normal
direct reclaim where I have no predictions.

> > > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > > Set up the relevant scan_control structures to enforce this, and prevent
> > > sc->may_writepage from being set in other places in the direct reclaim path in
> > > response to other events.
> > >
> >
> > If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> > instead of GFP_KERNEL.
>
> This problem is not a filesystem recursion problem which is, as I
> understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> code that uses signficant stack before trying to allocate memory
> that is the problem. e.g a select() system call:
>
> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 7568 16 mempool_alloc_slab+0x16/0x20
> 1) 7552 144 mempool_alloc+0x65/0x140
> 2) 7408 96 get_request+0x124/0x370
> 3) 7312 144 get_request_wait+0x29/0x1b0
> 4) 7168 96 __make_request+0x9b/0x490
> 5) 7072 208 generic_make_request+0x3df/0x4d0
> 6) 6864 80 submit_bio+0x7c/0x100
> 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> ....
> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10
> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700
> 44) 656 416 core_sys_select+0x18c/0x2c0
> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b
>
> There's 1.6k of stack used before memory allocation is called, 3.1k
> used there before ->writepage is entered, XFS used 3.5k, and
> if the mempool needed to allocate a page it would have blown the
> stack. If there was any significant storage subsystem (add dm, md
> and/or scsi of some kind), it would have blown the stack.
>
> Basically, there is not enough stack space available to allow direct
> reclaim to enter ->writepage _anywhere_ according to the stack usage
> profiles we are seeing here....
>

I'm not denying the evidence but how has it been gotten away with for years
then? Prevention of writeback isn't the answer without figuring out how
direct reclaimers can queue pages for IO and in the case of lumpy reclaim
doing sync IO, then waiting on those pages.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-13 20:22:49

by Chris Mason

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > This problem is not a filesystem recursion problem which is, as I
> > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > code that uses signficant stack before trying to allocate memory
> > that is the problem. e.g a select() system call:
> >
> > Depth Size Location (47 entries)
> > ----- ---- --------
> > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > 1) 7552 144 mempool_alloc+0x65/0x140
> > 2) 7408 96 get_request+0x124/0x370
> > 3) 7312 144 get_request_wait+0x29/0x1b0
> > 4) 7168 96 __make_request+0x9b/0x490
> > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > 6) 6864 80 submit_bio+0x7c/0x100
> > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > ....
> > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > 33) 3120 384 shrink_page_list+0x65e/0x840
> > 34) 2736 528 shrink_zone+0x63f/0xe10
> > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > 36) 2096 128 try_to_free_pages+0x77/0x80
> > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > 39) 1680 16 __get_free_pages+0xe/0x50
> > 40) 1664 48 __pollwait+0xca/0x110
> > 41) 1616 32 unix_poll+0x28/0xc0
> > 42) 1584 16 sock_poll+0x1d/0x20
> > 43) 1568 912 do_select+0x3d6/0x700
> > 44) 656 416 core_sys_select+0x18c/0x2c0
> > 45) 240 112 sys_select+0x4f/0x110
> > 46) 128 128 system_call_fastpath+0x16/0x1b
> >
> > There's 1.6k of stack used before memory allocation is called, 3.1k
> > used there before ->writepage is entered, XFS used 3.5k, and
> > if the mempool needed to allocate a page it would have blown the
> > stack. If there was any significant storage subsystem (add dm, md
> > and/or scsi of some kind), it would have blown the stack.
> >
> > Basically, there is not enough stack space available to allow direct
> > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > profiles we are seeing here....
> >
>
> I'm not denying the evidence but how has it been gotten away with for years
> then? Prevention of writeback isn't the answer without figuring out how
> direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> doing sync IO, then waiting on those pages.

So, I've been reading along, nodding my head to Dave's side of things
because seeks are evil and direct reclaim makes seeks. I'd really love
for direct reclaim to somehow trigger writepages on large chunks instead
of doing page by page spatters of IO to the drive.

But, somewhere along the line I overlooked the part of Dave's stack trace
that said:

43) 1568 912 do_select+0x3d6/0x700

Huh, 912 bytes...for select, really? From poll.h:

/* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
additional memory. */
#define MAX_STACK_ALLOC 832
#define FRONTEND_STACK_ALLOC 256
#define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
#define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
#define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))

So, select is intentionally trying to use that much stack. It should be using
GFP_NOFS if it really wants to suck down that much stack...if only the
kernel had some sort of way to dynamically allocate ram, it could try
that too.

-chris

2010-04-14 00:24:36

by Minchan Kim

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Hi, Dave.

On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <[email protected]> wrote:
> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.

I think your solution is a rather aggressive change, as Mel and Kosaki
already pointed out.
Are the flusher threads aware of the system-level LRU recency of pages,
rather than just the recency of dirty pages?
Of course the flusher threads can clean dirty pages faster than a direct
reclaimer.
But if they aren't aware of LRU ordering, hot page thrashing can happen in
corner cases.
It could also lose write merging.

And on non-rotating storage the seek cost might not be that big.
I think we have to consider that case if we decide to change direct reclaim I/O.

How about separating the problem?

1. the stack hogging problem
2. direct reclaim doing random writes

And trying to solve them one by one instead of all at once.

--
Kind regards,
Minchan Kim

2010-04-14 01:40:52

by Dave Chinner

Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses signficant stack before trying to allocate memory
> > > that is the problem. e.g a select() system call:
> > >
> > > Depth Size Location (47 entries)
> > > ----- ---- --------
> > > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > > 1) 7552 144 mempool_alloc+0x65/0x140
> > > 2) 7408 96 get_request+0x124/0x370
> > > 3) 7312 144 get_request_wait+0x29/0x1b0
> > > 4) 7168 96 __make_request+0x9b/0x490
> > > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > > 6) 6864 80 submit_bio+0x7c/0x100
> > > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33) 3120 384 shrink_page_list+0x65e/0x840
> > > 34) 2736 528 shrink_zone+0x63f/0xe10
> > > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > > 36) 2096 128 try_to_free_pages+0x77/0x80
> > > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > > 39) 1680 16 __get_free_pages+0xe/0x50
> > > 40) 1664 48 __pollwait+0xca/0x110
> > > 41) 1616 32 unix_poll+0x28/0xc0
> > > 42) 1584 16 sock_poll+0x1d/0x20
> > > 43) 1568 912 do_select+0x3d6/0x700
> > > 44) 656 416 core_sys_select+0x18c/0x2c0
> > > 45) 240 112 sys_select+0x4f/0x110
> > > 46) 128 128 system_call_fastpath+0x16/0x1b
> > >
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > >
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > >
> >
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
>
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks. I'd really loev
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.

Perhaps drop the lock on the page if it is held and call one of the
helpers that filesystems use to do this, like:

filemap_write_and_wait(page->mapping);
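
A hypothetical sketch of that idea (untested, and the helper name is made
up) - the page lock has to be dropped before writing and waiting on the
whole mapping:

	static int reclaim_writeback_whole_mapping(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (!mapping)
			return -EINVAL;

		unlock_page(page);
		/* push the whole mapping through ->writepages and wait */
		return filemap_write_and_wait(mapping);
	}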

> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
>
> 43) 1568 912 do_select+0x3d6/0x700
>
> Huh, 912 bytes...for select, really? From poll.h:

Sure, it's bad, but focussing on the specific case misses the
point that even code that is using minimal stack can enter direct
reclaim after consuming 1.5k of stack. e.g.:

50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
51) 3104 384 shrink_page_list+0x65e/0x840
52) 2720 528 shrink_zone+0x63f/0xe10
53) 2192 112 do_try_to_free_pages+0xc2/0x3c0
54) 2080 128 try_to_free_pages+0x77/0x80
55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710
56) 1712 48 alloc_pages_current+0x8c/0xe0
57) 1664 32 __page_cache_alloc+0x67/0x70
58) 1632 144 __do_page_cache_readahead+0xd3/0x220
59) 1488 16 ra_submit+0x21/0x30
60) 1472 80 ondemand_readahead+0x11d/0x250
61) 1392 64 page_cache_async_readahead+0xa9/0xe0
62) 1328 592 __generic_file_splice_read+0x48a/0x530
63) 736 48 generic_file_splice_read+0x4f/0x90
64) 688 96 xfs_splice_read+0xf2/0x130 [xfs]
65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs]
66) 560 64 do_splice_to+0x77/0xb0
67) 496 112 splice_direct_to_actor+0xcc/0x1c0
68) 384 80 do_splice_direct+0x57/0x80
69) 304 96 do_sendfile+0x16c/0x1e0
70) 208 80 sys_sendfile64+0x8d/0xb0
71) 128 128 system_call_fastpath+0x16/0x1b

Yes, __generic_file_splice_read() is a hog, but they seem to be
_everywhere_ today...

> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...

The code that did the allocation is called from multiple different
contexts - how is it supposed to know that in some of those contexts
it is supposed to treat memory allocation differently?

This is my point - if you introduce a new semantic to memory allocation
that is "use GFP_NOFS when you are using too much stack" and too much
stack is more than 15% of the stack, then pretty much every code path
will need to set that flag...

> if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

Sure, but to play the devil's advocate: if memory allocation blows
the stack, then surely avoiding allocation by using stack variables
is safer? ;)

FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
of stack; stuff like the radix tree code appears to be a significant
user of stack now:

Depth Size Location (56 entries)
----- ---- --------
0) 7904 48 __call_rcu+0x67/0x190
1) 7856 16 call_rcu_sched+0x15/0x20
2) 7840 16 call_rcu+0xe/0x10
3) 7824 272 radix_tree_delete+0x159/0x2e0
4) 7552 32 __remove_from_page_cache+0x21/0x110
5) 7520 64 __remove_mapping+0xe8/0x130
6) 7456 384 shrink_page_list+0x400/0x860
7) 7072 528 shrink_zone+0x636/0xdc0
8) 6544 112 do_try_to_free_pages+0xc2/0x3c0
9) 6432 112 try_to_free_pages+0x64/0x70
10) 6320 256 __alloc_pages_nodemask+0x3d2/0x710
11) 6064 48 alloc_pages_current+0x8c/0xe0
12) 6016 32 __page_cache_alloc+0x67/0x70
13) 5984 80 find_or_create_page+0x50/0xb0
14) 5904 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]

or even just calling ->releasepage and freeing bufferheads:

Depth Size Location (55 entries)
----- ---- --------
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
6) 7024 16 try_to_release_page+0x33/0x60
7) 7008 384 shrink_page_list+0x585/0x860
8) 6624 528 shrink_zone+0x636/0xdc0
9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
10) 5984 112 try_to_free_pages+0x64/0x70
11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
12) 5616 48 alloc_pages_current+0x8c/0xe0
13) 5568 32 __page_cache_alloc+0x67/0x70
14) 5536 80 find_or_create_page+0x50/0xb0
15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]

And another eye-opening example, this time deep in the sata driver
layer:

Depth Size Location (72 entries)
----- ---- --------
0) 8336 304 select_task_rq_fair+0x235/0xad0
1) 8032 96 try_to_wake_up+0x189/0x3f0
2) 7936 16 default_wake_function+0x12/0x20
3) 7920 32 autoremove_wake_function+0x16/0x40
4) 7888 64 __wake_up_common+0x5a/0x90
5) 7824 64 __wake_up+0x48/0x70
6) 7760 64 insert_work+0x9f/0xb0
7) 7696 48 __queue_work+0x36/0x50
8) 7648 16 queue_work_on+0x4d/0x60
9) 7632 16 queue_work+0x1f/0x30
10) 7616 16 queue_delayed_work+0x2d/0x40
11) 7600 32 ata_pio_queue_task+0x35/0x40
12) 7568 48 ata_sff_qc_issue+0x146/0x2f0
13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv]
14) 7424 96 ata_qc_issue+0x1fe/0x320
15) 7328 64 ata_scsi_translate+0xae/0x1a0
16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0
17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0
18) 7152 96 scsi_request_fn+0x419/0x590
19) 7056 32 __blk_run_queue+0x82/0x150
20) 7024 48 elv_insert+0x1aa/0x2d0
21) 6976 48 __elv_add_request+0x83/0xd0
22) 6928 96 __make_request+0x139/0x490
23) 6832 208 generic_make_request+0x3df/0x4d0
24) 6624 80 submit_bio+0x7c/0x100
25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]

We need at least _700_ bytes of stack free just to call queue_work(),
and that now happens deep in the guts of the driver subsystem below XFS.
This trace shows 1.8k of stack usage on a simple, single sata disk
storage subsystem, so my estimate of 2k of stack for the storage system
below XFS is too small - a worst case of 2.5-3k of stack space is probably
closer to the mark.

This is the sort of thing I'm pointing at when I say that stack
usage outside XFS has grown significantly over the
past couple of years. Given XFS has remained pretty much the same or
even reduced slightly over the same time period, blaming XFS or
saying "callers should use GFP_NOFS" seems like a cop-out to me.
Regardless of the IO pattern performance issues, writeback via
direct reclaim just uses too much stack to be safe these days...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 03:12:16

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> >
> > So, can you please run two workloads concurrently?
> > - Normal IO workload (fio, iozone, etc..)
> > - echo $NUM > /proc/sys/vm/nr_hugepages
>
> What do I measure/observe/record that is meaningful?

So, a rough-as-guts first pass - just run a large dd (8 times the
size of memory - an 8GB file vs 1GB RAM) and repeatedly try to allocate
the whole of memory in huge pages (500 of them) every 5 seconds. The IO
rate is roughly 100MB/s, so it takes 75-85s to complete the dd.

The script:

$ cat t.sh
#!/bin/bash

echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches

dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &

(
for i in `seq 1 1 20`; do
sleep 5
/usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
grep HugePages_Total /proc/meminfo
done
) | awk '
/wall/ { wall += $2; cnt += 1 }
/Pages/ { pages[cnt] = $2 }
END { printf "average wall time %f\nPages step: ", wall / cnt ;
for (i = 1; i <= cnt; i++) {
printf "%d ", pages[i];
}
}'
----

And the output looks like:

$ sudo ./t.sh
average wall time 0.954500
Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
$

Run 50 times in a loop with the outputs averaged, the existing lumpy
reclaim resulted in:

dave@test-1:~$ cat current.txt | awk -f av.awk
av. wall = 0.519385 secs
av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420

And with my patch that disables ->writepage:

dave@test-1:~$ cat no-direct.txt | awk -f av.awk
av. wall = 0.554163 secs
av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439

Basically, with my patch lumpy reclaim was *substantially* more
effective with only a slight increase in average allocation latency
with this test case.

I need to add a marker to the output that records when the dd
completes, but from monitoring the writeback rates via PCP, they
were in the ballpark of 85-100MB/s for the existing code, and
95-110MB/s with my patch. Hence it improved both IO throughput and
the effectiveness of lumpy reclaim.

On the down side, I did have an OOM killer invocation with my patch
after about 150 iterations - dd failed an order zero allocation
because there were 455 huge pages allocated and there were only
_320_ available pages for IO, all of which were under IO. i.e. lumpy
reclaim worked so well that the machine got into order-0 page
starvation.
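
(For scale: assuming the x86_64 default huge page size of 2MB, 455 huge pages
is roughly 910MB of the 1GB of RAM pinned, which is why only a few hundred
order-0 pages were left over for IO.)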

I know this is a simple test case, but it shows much better results
than I think anyone (even me) is expecting...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 04:45:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 09:24:33AM +0900, Minchan Kim wrote:
> Hi, Dave.
>
> On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <[email protected]> wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
>
> I think your solution is rather aggressive change as Mel and Kosaki
> already pointed out.

It may be aggressive, but writeback from direct reclaim is, IMO, one
of the worst aspects of the current VM design because of its
adverse effect on the IO subsystem.

I'd prefer to remove it completely rather than continue to try and patch
around it, especially given that everyone seems to agree that it
does have an adverse effect on IO...

> Do flush thread aware LRU of dirty pages in system level recency not
> dirty pages recency?

It writes back in the order inodes were dirtied. i.e. the LRU is a
coarser measure, but it is still definitely there. It also takes
into account fairness of IO between dirty inodes, so no one dirty
inode prevents IO being issued on the other dirty inodes on the
LRU...
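
(As an aside, a toy model of that ordering - not the real fs/fs-writeback.c
code - might look like the following, where write_chunk() and still_dirty()
are hypothetical stand-ins for the real writeback machinery:

#include <linux/list.h>

/* Inodes sit on a list ordered oldest-dirtied first; the flusher writes a
 * bounded chunk from each before moving on, so writeback is LRU-ish at
 * inode granularity and no single large dirty inode can starve the rest. */
struct toy_inode {
	struct list_head dirty_list;
	unsigned long	 dirtied_when;	/* jiffies when first dirtied */
};

void write_chunk(struct toy_inode *inode, long nr_pages);	/* hypothetical */
int still_dirty(struct toy_inode *inode);			/* hypothetical */

static void toy_flusher_pass(struct list_head *dirty, long chunk_pages)
{
	struct toy_inode *inode, *next;
	LIST_HEAD(requeue);

	list_for_each_entry_safe(inode, next, dirty, dirty_list) {
		write_chunk(inode, chunk_pages);
		if (still_dirty(inode))
			/* not finished: park it for a later pass */
			list_move_tail(&inode->dirty_list, &requeue);
	}
	/* still-dirty inodes go back on the tail, preserving fairness */
	list_splice_tail(&requeue, dirty);
}
)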

> Of course flush thread can clean dirty pages faster than direct reclaimer.
> But if it don't aware LRUness, hot page thrashing can be happened by
> corner case.
> It could lost write merge.
>
> And non-rotation storage might be not big of seek cost.

Non-rotational storage still goes faster when it is fed large, well
formed IOs.

> I think we have to consider that case if we decide to change direct reclaim I/O.
>
> How do we separate the problem?
>
> 1. stack hogging problem.
> 2. direct reclaim random write.

AFAICT, the only way to _reliably_ avoid the stack usage problem is
to avoid writeback in direct reclaim. That has the side effect of
fixing #2 as well, so do they really need separating?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 05:03:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, 14 Apr 2010 11:40:41 +1000
Dave Chinner <[email protected]> wrote:

> 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 51) 3104 384 shrink_page_list+0x65e/0x840
> 52) 2720 528 shrink_zone+0x63f/0xe10

A bit OFF TOPIC.

Could you share disassemble of shrink_zone() ?

In my environ.
00000000000115a0 <shrink_zone>:
115a0: 55 push %rbp
115a1: 48 89 e5 mov %rsp,%rbp
115a4: 41 57 push %r15
115a6: 41 56 push %r14
115a8: 41 55 push %r13
115aa: 41 54 push %r12
115ac: 53 push %rbx
115ad: 48 83 ec 78 sub $0x78,%rsp
115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)

The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
until return.

I may misunderstand something...

Thanks,
-Kame

2010-04-14 05:41:50

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 14 Apr 2010 11:40:41 +1000
> Dave Chinner <[email protected]> wrote:
>
> > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > 51) 3104 384 shrink_page_list+0x65e/0x840
> > 52) 2720 528 shrink_zone+0x63f/0xe10
>
> A bit OFF TOPIC.
>
> Could you share disassemble of shrink_zone() ?
>
> In my environ.
> 00000000000115a0 <shrink_zone>:
> 115a0: 55 push %rbp
> 115a1: 48 89 e5 mov %rsp,%rbp
> 115a4: 41 57 push %r15
> 115a6: 41 56 push %r14
> 115a8: 41 55 push %r13
> 115aa: 41 54 push %r12
> 115ac: 53 push %rbx
> 115ad: 48 83 ec 78 sub $0x78,%rsp
> 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
> 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
>
> disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> until retrun.

I see the same. I didn't compile those kernels, though. IIUC,
they were built through the Ubuntu build infrastructure, so there is
something different in terms of compiler, compiler options or config
to what we are both using. Most likely it is the compiler inlining,
though Chris's patches to prevent that didn't seem to change the
stack usage.

I'm trying to get a stack trace from the kernel that has shrink_zone
in it, but I haven't succeeded yet....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 05:54:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 14 Apr 2010 11:40:41 +1000
> > Dave Chinner <[email protected]> wrote:
> >
> > > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > 51) 3104 384 shrink_page_list+0x65e/0x840
> > > 52) 2720 528 shrink_zone+0x63f/0xe10
> >
> > A bit OFF TOPIC.
> >
> > Could you share disassemble of shrink_zone() ?
> >
> > In my environ.
> > 00000000000115a0 <shrink_zone>:
> > 115a0: 55 push %rbp
> > 115a1: 48 89 e5 mov %rsp,%rbp
> > 115a4: 41 57 push %r15
> > 115a6: 41 56 push %r14
> > 115a8: 41 55 push %r13
> > 115aa: 41 54 push %r12
> > 115ac: 53 push %rbx
> > 115ad: 48 83 ec 78 sub $0x78,%rsp
> > 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
> > 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
> >
> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> > until retrun.
>
> I see the same. I didn't compile those kernels, though. IIUC,
> they were built through the Ubuntu build infrastructure, so there is
> something different in terms of compiler, compiler options or config
> to what we are both using. Most likely it is the compiler inlining,
> though Chris's patches to prevent that didn't seem to change the
> stack usage.
>
> I'm trying to get a stack trace from the kernel that has shrink_zone
> in it, but I haven't succeeded yet....

I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?



2010-04-14 06:13:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>> > On Wed, 14 Apr 2010 11:40:41 +1000
>> > Dave Chinner <[email protected]> wrote:
>> >
>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >
>> > A bit OFF TOPIC.
>> >
>> > Could you share disassemble of shrink_zone() ?
>> >
>> > In my environ.
>> > 00000000000115a0 <shrink_zone>:
>> >    115a0:       55                      push   %rbp
>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>> >    115a4:       41 57                   push   %r15
>> >    115a6:       41 56                   push   %r14
>> >    115a8:       41 55                   push   %r13
>> >    115aa:       41 54                   push   %r12
>> >    115ac:       53                      push   %rbx
>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>> >
>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>> > until retrun.
>>
>> I see the same. I didn't compile those kernels, though. IIUC,
>> they were built through the Ubuntu build infrastructure, so there is
>> something different in terms of compiler, compiler options or config
>> to what we are both using. Most likely it is the compiler inlining,
>> though Chris's patches to prevent that didn't seem to change the
>> stack usage.
>>
>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> in it, but I haven't succeeded yet....
>
> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>

In my case, it's 0x110 bytes on a 32-bit machine.
I think the same is possible on a 64-bit machine.

00001830 <shrink_zone>:
1830: 55 push %ebp
1831: 89 e5 mov %esp,%ebp
1833: 57 push %edi
1834: 56 push %esi
1835: 53 push %ebx
1836: 81 ec 10 01 00 00 sub $0x110,%esp
183c: 89 85 24 ff ff ff mov %eax,-0xdc(%ebp)
1842: 89 95 20 ff ff ff mov %edx,-0xe0(%ebp)
1848: 89 8d 1c ff ff ff mov %ecx,-0xe4(%ebp)
184e: 8b 41 04 mov 0x4(%ecx)

my gcc is following as.

barrios@barriostarget:~/mmotm$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.3.3-5ubuntu4'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)


Does it depend on the config?
I attach my config.




--
Kind regards,
Minchan Kim


Attachments:
barrios_config (79.42 kB)

2010-04-14 06:52:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > > Pros:
> > > > 1) prevent XFS stack overflow
> > > > 2) improve io workload performance
> > > >
> > > > Cons:
> > > > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> > > >
> > > > So, If we only need to consider io workload this is no downside. but
> > > > it can't.
> > > >
> > > > I think (1) is XFS issue. XFS should care it itself.
> > >
> > > The filesystem is irrelevant, IMO.
> > >
> > > The traces from the reporter showed that we've got close to a 2k
> > > stack footprint for memory allocation to direct reclaim and then we
> > > can put the entire writeback path on top of that. This is roughly
> > > 3.5k for XFS, and then depending on the storage subsystem
> > > configuration and transport can be another 2k of stack needed below
> > > XFS.
> > >
> > > IOWs, if we completely ignore the filesystem stack usage, there's
> > > still up to 4k of stack needed in the direct reclaim path. Given
> > > that one of the stack traces supplied show direct reclaim being
> > > entered with over 3k of stack already used, pretty much any
> > > filesystem is capable of blowing an 8k stack.
> > >
> > > So, this is not an XFS issue, even though XFS is the first to
> > > uncover it. Don't shoot the messenger....
> >
> > Thanks explanation. I haven't noticed direct reclaim consume
> > 2k stack. I'll investigate it and try diet it.
> > But XFS 3.5K stack consumption is too large too. please diet too.
>
> It hasn't grown in the last 2 years after the last major diet where
> all the fat was trimmed from it in the last round of the i386 4k
> stack vs XFS saga. it seems that everything else around XFS has
> grown in that time, and now we are blowing stacks again....

I have a dumb question: if XFS hasn't bloated its stack usage, why does 3.5k
of stack usage work fine on a 4k stack kernel? It seems impossible.

Please don't think I'm blaming you. I don't know what the "4k stack vs XFS saga" is;
I merely want to understand what you said.


> > > Hence I think that direct reclaim should be deferring to the
> > > background flusher threads for cleaning memory and not trying to be
> > > doing it itself.
> >
> > Well, you seems continue to discuss io workload. I don't disagree
> > such point.
> >
> > example, If only order-0 reclaim skip pageout(), we will get the above
> > benefit too.
>
> But it won't prevent start blowups...
>
> > > > but we never kill pageout() completely because we can't
> > > > assume users don't run high order allocation workload.
> > >
> > > I think that lumpy reclaim will still work just fine.
> > >
> > > Lumpy reclaim appears to be using IO as a method of slowing
> > > down the reclaim cycle - the congestion_wait() call will still
> > > function as it does now if the background flusher threads are active
> > > and causing congestion. I don't see why lumpy reclaim specifically
> > > needs to be issuing IO to make it work - if the congestion_wait() is
> > > not waiting long enough then wait longer - don't issue IO to extend
> > > the wait time.
> >
> > lumpy reclaim is for allocation high order page. then, it not only
> > reclaim LRU head page, but also its PFN neighborhood. PFN neighborhood
> > is often newly page and still dirty. then we enfoce pageout cleaning
> > and discard it.
>
> Ok, I see that now - I missed the second call to __isolate_lru_pages()
> in isolate_lru_pages().

No problem. It's one of the VM's messes. Most developers don't know about it :-)



> > When high order allocation occur, we don't only need free enough amount
> > memory, but also need free enough contenious memory block.
>
> Agreed, that was why I was kind of surprised not to find it was
> doing that. But, as you have pointed out, that was my mistake.
>
> > If we need to consider _only_ io throughput, waiting flusher thread
> > might faster perhaps, but actually we also need to consider reclaim
> > latency. I'm worry about such point too.
>
> True, but without know how to test and measure such things I can't
> really comment...

Agreed. I know making a VM measurement benchmark is very difficult, but
it is probably necessary....
I'm sorry, I can't give you a good, convenient benchmark right now.

>
> > > Of course, the code is a maze of twisty passages, so I probably
> > > missed something important. Hopefully someone can tell me what. ;)
> > >
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> >
> > So, can you please run two workloads concurrently?
> > - Normal IO workload (fio, iozone, etc..)
> > - echo $NUM > /proc/sys/vm/nr_hugepages
>
> What do I measure/observe/record that is meaningful?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>


2010-04-14 06:52:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses signficant stack before trying to allocate memory
> > > that is the problem. e.g a select() system call:
> > >
> > > Depth Size Location (47 entries)
> > > ----- ---- --------
> > > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > > 1) 7552 144 mempool_alloc+0x65/0x140
> > > 2) 7408 96 get_request+0x124/0x370
> > > 3) 7312 144 get_request_wait+0x29/0x1b0
> > > 4) 7168 96 __make_request+0x9b/0x490
> > > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > > 6) 6864 80 submit_bio+0x7c/0x100
> > > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33) 3120 384 shrink_page_list+0x65e/0x840
> > > 34) 2736 528 shrink_zone+0x63f/0xe10
> > > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > > 36) 2096 128 try_to_free_pages+0x77/0x80
> > > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > > 39) 1680 16 __get_free_pages+0xe/0x50
> > > 40) 1664 48 __pollwait+0xca/0x110
> > > 41) 1616 32 unix_poll+0x28/0xc0
> > > 42) 1584 16 sock_poll+0x1d/0x20
> > > 43) 1568 912 do_select+0x3d6/0x700
> > > 44) 656 416 core_sys_select+0x18c/0x2c0
> > > 45) 240 112 sys_select+0x4f/0x110
> > > 46) 128 128 system_call_fastpath+0x16/0x1b
> > >
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > >
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > >
> >
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
>
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks. I'd really loev
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.
>
> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
>
> 43) 1568 912 do_select+0x3d6/0x700
>
> Huh, 912 bytes...for select, really? From poll.h:
>
> /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> additional memory. */
> #define MAX_STACK_ALLOC 832
> #define FRONTEND_STACK_ALLOC 256
> #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
>
> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

Yeah, of course. I would propose to revert 70674f95c0.
But I doubt GFP_NOFS solves our issue.


2010-04-14 06:52:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > > have a relatively simple test that can be run to determine what the
> > > > impact is?
> > >
> > > So, can you please run two workloads concurrently?
> > > - Normal IO workload (fio, iozone, etc..)
> > > - echo $NUM > /proc/sys/vm/nr_hugepages
> >
> > What do I measure/observe/record that is meaningful?
>
> So, a rough as guts first pass - just run a large dd (8 times the
> size of memory - 8GB file vs 1GB RAM) and repeated try to allocate
> the entire of memory in huge pages (500) every 5 seconds. The IO
> rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
>
> The script:
>
> $ cat t.sh
> #!/bin/bash
>
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
>
> dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &
>
> (
> for i in `seq 1 1 20`; do
> sleep 5
> /usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
> grep HugePages_Total /proc/meminfo
> done
> ) | awk '
> /wall/ { wall += $2; cnt += 1 }
> /Pages/ { pages[cnt] = $2 }
> END { printf "average wall time %f\nPages step: ", wall / cnt ;
> for (i = 1; i <= cnt; i++) {
> printf "%d ", pages[i];
> }
> }'
> ----
>
> And the output looks like:
>
> $ sudo ./t.sh
> average wall time 0.954500
> Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
> $
>
> Run 50 times in a loop, and the outputs averaged, the existing lumpy
> reclaim resulted in:
>
> dave@test-1:~$ cat current.txt | awk -f av.awk
> av. wall = 0.519385 secs
> av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420
>
> And with my patch that disables ->writepage:
>
> dave@test-1:~$ cat no-direct.txt | awk -f av.awk
> av. wall = 0.554163 secs
> av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439
>
> Basically, with my patch lumpy reclaim was *substantially* more
> effective with only a slight increase in average allocation latency
> with this test case.
>
> I need to add a marker to the output that records when the dd
> completes, but from monitoring the writeback rates via PCP, they
> were in the balllpark of 85-100MB/s for the existing code, and
> 95-110MB/s with my patch. Hence it improved both IO throughput and
> the effectiveness of lumpy reclaim.
>
> On the down side, I did have an OOM killer invocation with my patch
> after about 150 iterations - dd failed an order zero allocation
> because there were 455 huge pages allocated and there were only
> _320_ available pages for IO, all of which were under IO. i.e. lumpy
> reclaim worked so well that the machine got into order-0 page
> starvation.
>
> I know this is a simple test case, but it shows much better results
> than I think anyone (even me) is expecting...

Ummm...

Probably I have to say I'm sorry. I guess my last mail gave you
a misunderstanding.
To be honest, I'm not interested in this artificial non-fragmentation case.
The above test case 1) discards all cache and 2) fills pages by streaming
IO. That makes an artificial "file offset neighbor == block neighbor == PFN neighbor"
situation, so file-offset-order writeout by the flusher thread can produce
PFN-contiguous pages effectively.

Why am I not interested in it? Because lumpy reclaim is a technique for
avoiding the external fragmentation mess. IOW, it is for avoiding the worst
case, but your test case seems to measure the best one.


2010-04-14 06:53:07

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > This problem is not a filesystem recursion problem which is, as I
> > > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > > code that uses signficant stack before trying to allocate memory
> > > > that is the problem. e.g a select() system call:
> > > >
> > > > Depth Size Location (47 entries)
> > > > ----- ---- --------
> > > > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > > > 1) 7552 144 mempool_alloc+0x65/0x140
> > > > 2) 7408 96 get_request+0x124/0x370
> > > > 3) 7312 144 get_request_wait+0x29/0x1b0
> > > > 4) 7168 96 __make_request+0x9b/0x490
> > > > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > > > 6) 6864 80 submit_bio+0x7c/0x100
> > > > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > > ....
> > > > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > > 33) 3120 384 shrink_page_list+0x65e/0x840
> > > > 34) 2736 528 shrink_zone+0x63f/0xe10
> > > > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > > > 36) 2096 128 try_to_free_pages+0x77/0x80
> > > > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > > > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > > > 39) 1680 16 __get_free_pages+0xe/0x50
> > > > 40) 1664 48 __pollwait+0xca/0x110
> > > > 41) 1616 32 unix_poll+0x28/0xc0
> > > > 42) 1584 16 sock_poll+0x1d/0x20
> > > > 43) 1568 912 do_select+0x3d6/0x700
> > > > 44) 656 416 core_sys_select+0x18c/0x2c0
> > > > 45) 240 112 sys_select+0x4f/0x110
> > > > 46) 128 128 system_call_fastpath+0x16/0x1b
> > > >
> > > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > > used there before ->writepage is entered, XFS used 3.5k, and
> > > > if the mempool needed to allocate a page it would have blown the
> > > > stack. If there was any significant storage subsystem (add dm, md
> > > > and/or scsi of some kind), it would have blown the stack.
> > > >
> > > > Basically, there is not enough stack space available to allow direct
> > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > profiles we are seeing here....
> > > >
> > >
> > > I'm not denying the evidence but how has it been gotten away with for years
> > > then? Prevention of writeback isn't the answer without figuring out how
> > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > doing sync IO, then waiting on those pages.
> >
> > So, I've been reading along, nodding my head to Dave's side of things
> > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > for direct reclaim to somehow trigger writepages on large chunks instead
> > of doing page by page spatters of IO to the drive.

I agree that "seeks are evil and direct reclaim makes seeks". Actually,
doing 4k IO is not a must for pageout, so we can probably improve it.


> Perhaps drop the lock on the page if it is held and call one of the
> helpers that filesystems use to do this, like:
>
> filemap_write_and_wait(page->mapping);

Sorry, I'm lost about what you're talking about. Why do we need per-file waiting?
If the file is a 1GB file, do we need to wait for 1GB of writeout?


>
> > But, somewhere along the line I overlooked the part of Dave's stack trace
> > that said:
> >
> > 43) 1568 912 do_select+0x3d6/0x700
> >
> > Huh, 912 bytes...for select, really? From poll.h:
>
> Sure, it's bad, but we focussing on the specific case misses the
> point that even code that is using minimal stack can enter direct
> reclaim after consuming 1.5k of stack. e.g.:

checkstack.pl says do_select() and __generic_file_splice_read() are among
the worst stack consumers. Both should be fixed.

Also, checkstack.pl says there aren't that many such stack eaters.


>
> 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 51) 3104 384 shrink_page_list+0x65e/0x840
> 52) 2720 528 shrink_zone+0x63f/0xe10
> 53) 2192 112 do_try_to_free_pages+0xc2/0x3c0
> 54) 2080 128 try_to_free_pages+0x77/0x80
> 55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710
> 56) 1712 48 alloc_pages_current+0x8c/0xe0
> 57) 1664 32 __page_cache_alloc+0x67/0x70
> 58) 1632 144 __do_page_cache_readahead+0xd3/0x220
> 59) 1488 16 ra_submit+0x21/0x30
> 60) 1472 80 ondemand_readahead+0x11d/0x250
> 61) 1392 64 page_cache_async_readahead+0xa9/0xe0
> 62) 1328 592 __generic_file_splice_read+0x48a/0x530
> 63) 736 48 generic_file_splice_read+0x4f/0x90
> 64) 688 96 xfs_splice_read+0xf2/0x130 [xfs]
> 65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs]
> 66) 560 64 do_splice_to+0x77/0xb0
> 67) 496 112 splice_direct_to_actor+0xcc/0x1c0
> 68) 384 80 do_splice_direct+0x57/0x80
> 69) 304 96 do_sendfile+0x16c/0x1e0
> 70) 208 80 sys_sendfile64+0x8d/0xb0
> 71) 128 128 system_call_fastpath+0x16/0x1b
>
> Yes, __generic_file_splice_read() is a hog, but they seem to be
> _everywhere_ today...
>
> > So, select is intentionally trying to use that much stack. It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
>
> The code that did the allocation is called from multiple different
> contexts - how is it supposed to know that in some of those contexts
> it is supposed to treat memory allocation differently?
>
> This is my point - if you introduce a new semantic to memory allocation
> that is "use GFP_NOFS when you are using too much stack" and too much
> stack is more than 15% of the stack, then pretty much every code path
> will need to set that flag...

Nodding my head to Dave's side. Changing the caller's allocation flags doesn't
seem like a good solution. I mean:
- do_select() should use a GFP_KERNEL allocation instead of the stack (i.e. revert 70674f95c0)
- reclaim and xfs (and whatever else) need to diet.

Also, I believe functions that eat stack should generate a build warning. Patch attached.


> > if only the
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
>
> Sure, but to play the devil's advocate: if memory allocation blows
> the stack, then surely avoiding allocation by using stack variables
> is safer? ;)
>
> FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
> of stack; stuff like the radix tree code appears to be a significant
> user of stack now:
>
> Depth Size Location (56 entries)
> ----- ---- --------
> 0) 7904 48 __call_rcu+0x67/0x190
> 1) 7856 16 call_rcu_sched+0x15/0x20
> 2) 7840 16 call_rcu+0xe/0x10
> 3) 7824 272 radix_tree_delete+0x159/0x2e0
> 4) 7552 32 __remove_from_page_cache+0x21/0x110
> 5) 7520 64 __remove_mapping+0xe8/0x130
> 6) 7456 384 shrink_page_list+0x400/0x860
> 7) 7072 528 shrink_zone+0x636/0xdc0
> 8) 6544 112 do_try_to_free_pages+0xc2/0x3c0
> 9) 6432 112 try_to_free_pages+0x64/0x70
> 10) 6320 256 __alloc_pages_nodemask+0x3d2/0x710
> 11) 6064 48 alloc_pages_current+0x8c/0xe0
> 12) 6016 32 __page_cache_alloc+0x67/0x70
> 13) 5984 80 find_or_create_page+0x50/0xb0
> 14) 5904 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
>
> or even just calling ->releasepage and freeing bufferheads:
>
> Depth Size Location (55 entries)
> ----- ---- --------
> 0) 7440 48 add_partial+0x26/0x90
> 1) 7392 64 __slab_free+0x1a9/0x380
> 2) 7328 64 kmem_cache_free+0xb9/0x160
> 3) 7264 16 free_buffer_head+0x25/0x50
> 4) 7248 64 try_to_free_buffers+0x79/0xc0
> 5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
> 6) 7024 16 try_to_release_page+0x33/0x60
> 7) 7008 384 shrink_page_list+0x585/0x860
> 8) 6624 528 shrink_zone+0x636/0xdc0
> 9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
> 10) 5984 112 try_to_free_pages+0x64/0x70
> 11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
> 12) 5616 48 alloc_pages_current+0x8c/0xe0
> 13) 5568 32 __page_cache_alloc+0x67/0x70
> 14) 5536 80 find_or_create_page+0x50/0xb0
> 15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
>
> And another eye-opening example, this time deep in the sata driver
> layer:
>
> Depth Size Location (72 entries)
> ----- ---- --------
> 0) 8336 304 select_task_rq_fair+0x235/0xad0
> 1) 8032 96 try_to_wake_up+0x189/0x3f0
> 2) 7936 16 default_wake_function+0x12/0x20
> 3) 7920 32 autoremove_wake_function+0x16/0x40
> 4) 7888 64 __wake_up_common+0x5a/0x90
> 5) 7824 64 __wake_up+0x48/0x70
> 6) 7760 64 insert_work+0x9f/0xb0
> 7) 7696 48 __queue_work+0x36/0x50
> 8) 7648 16 queue_work_on+0x4d/0x60
> 9) 7632 16 queue_work+0x1f/0x30
> 10) 7616 16 queue_delayed_work+0x2d/0x40
> 11) 7600 32 ata_pio_queue_task+0x35/0x40
> 12) 7568 48 ata_sff_qc_issue+0x146/0x2f0
> 13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv]
> 14) 7424 96 ata_qc_issue+0x1fe/0x320
> 15) 7328 64 ata_scsi_translate+0xae/0x1a0
> 16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0
> 17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0
> 18) 7152 96 scsi_request_fn+0x419/0x590
> 19) 7056 32 __blk_run_queue+0x82/0x150
> 20) 7024 48 elv_insert+0x1aa/0x2d0
> 21) 6976 48 __elv_add_request+0x83/0xd0
> 22) 6928 96 __make_request+0x139/0x490
> 23) 6832 208 generic_make_request+0x3df/0x4d0
> 24) 6624 80 submit_bio+0x7c/0x100
> 25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>
> We need at least _700_ bytes of stack free just to call queue_work(),
> and that now happens deep in the guts of the driver subsystem below XFS.
> This trace shows 1.8k of stack usage on a simple, single sata disk
> storage subsystem, so my estimate of 2k of stack for the storage system
> below XFS is too small - a worst case of 2.5-3k of stack space is probably
> closer to the mark.

Your explanation is very interesting. I have a (probably dumb) question:
why has nobody faced a stack overflow issue in the past? Now I think every user
could easily get a stack overflow if your explanation is correct.


>
> This is the sort of thing I'm pointing at when I say that stack
> usage outside XFS has grown significantly significantly over the
> past couple of years. Given XFS has remained pretty much the same or
> even reduced slightly over the same time period, blaming XFS or
> saying "callers should use GFP_NOFS" seems like a cop-out to me.
> Regardless of the IO pattern performance issues, writeback via
> direct reclaim just uses too much stack to be safe these days...

Yeah, my answer is simple: all the stack eaters should be fixed.
But XFS doesn't seem innocent either. 3.5K is pretty big, although
xfs has used that much for a very long time.


===========================================================
Subject: [PATCH] kconfig: reduce FRAME_WARN default value to 512

Surprisingly, several odd functions now use a great deal of stack.

% objdump -d vmlinux | ./scripts/checkstack.pl

0xffffffff81e3db07 get_next_block [vmlinux]: 1976
0xffffffff8130b9bd node_read_meminfo [vmlinux]: 1240
0xffffffff811553fd do_sys_poll [vmlinux]: 1000
0xffffffff8122b49d test_aead [vmlinux]: 904
0xffffffff81154c9d do_select [vmlinux]: 888
0xffffffff81168d9d default_file_splice_read [vmlinux]: 760

Oh well, every developer has to pay attention to stack usage!
Thus, this patch reduces the FRAME_WARN default value to 512.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
lib/Kconfig.debug | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ff01710..44ebba6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -28,8 +28,7 @@ config ENABLE_MUST_CHECK
config FRAME_WARN
int "Warn for stack frames larger than (needs gcc 4.4)"
range 0 8192
- default 1024 if !64BIT
- default 2048 if 64BIT
+ default 512
help
Tell gcc to warn at build time for stack frames larger than this.
Setting this too low will cause a lot of warnings.
--
1.6.5.2
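
(For reference, CONFIG_FRAME_WARN is what gets passed to gcc as
-Wframe-larger-than=<N>, so with a 512 byte default a made-up function like
the one below would be flagged at build time:

#include <linux/kernel.h>	/* snprintf */

static int stack_hog(const char *name)
{
	char buf[600];	/* 600 bytes of locals, larger than the 512 byte limit */

	return snprintf(buf, sizeof(buf), "hog: %s", name);
}
)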



2010-04-14 07:06:31

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 02:54:14PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 14 Apr 2010 11:40:41 +1000
> > > Dave Chinner <[email protected]> wrote:
> > >
> > > > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > > 51) 3104 384 shrink_page_list+0x65e/0x840
> > > > 52) 2720 528 shrink_zone+0x63f/0xe10
> > >
> > > A bit OFF TOPIC.
> > >
> > > Could you share disassemble of shrink_zone() ?
> > >
> > > In my environ.
> > > 00000000000115a0 <shrink_zone>:
> > > 115a0: 55 push %rbp
> > > 115a1: 48 89 e5 mov %rsp,%rbp
> > > 115a4: 41 57 push %r15
> > > 115a6: 41 56 push %r14
> > > 115a8: 41 55 push %r13
> > > 115aa: 41 54 push %r12
> > > 115ac: 53 push %rbx
> > > 115ad: 48 83 ec 78 sub $0x78,%rsp
> > > 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
> > > 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
> > >
> > > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> > > until retrun.
> >
> > I see the same. I didn't compile those kernels, though. IIUC,
> > they were built through the Ubuntu build infrastructure, so there is
> > something different in terms of compiler, compiler options or config
> > to what we are both using. Most likely it is the compiler inlining,
> > though Chris's patches to prevent that didn't seem to change the
> > stack usage.
> >
> > I'm trying to get a stack trace from the kernel that has shrink_zone
> > in it, but I haven't succeeded yet....
>
> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?

Ok, so here's a trace at the top of the stack from a kernel with
the above shrink_zone disassembly:

$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (49 entries)
----- ---- --------
0) 6152 112 force_qs_rnp+0x58/0x150
1) 6040 48 force_quiescent_state+0x1a7/0x1f0
2) 5992 48 __call_rcu+0x13d/0x190
3) 5944 16 call_rcu_sched+0x15/0x20
4) 5928 16 call_rcu+0xe/0x10
5) 5912 240 radix_tree_delete+0x14a/0x2d0
6) 5672 32 __remove_from_page_cache+0x21/0x110
7) 5640 64 __remove_mapping+0x86/0x100
8) 5576 272 shrink_page_list+0x2fd/0x5a0
9) 5304 400 shrink_inactive_list+0x313/0x730
10) 4904 176 shrink_zone+0x3d1/0x490
11) 4728 128 do_try_to_free_pages+0x2b6/0x380
12) 4600 112 try_to_free_pages+0x5e/0x60
13) 4488 272 __alloc_pages_nodemask+0x3fb/0x730
14) 4216 48 alloc_pages_current+0x87/0xd0
15) 4168 32 __page_cache_alloc+0x67/0x70
16) 4136 80 find_or_create_page+0x4f/0xb0
17) 4056 160 _xfs_buf_lookup_pages+0x150/0x390
.....

So the differences are most likely from the compiler doing
automatic inlining of static functions...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 07:19:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>>> > On Wed, 14 Apr 2010 11:40:41 +1000
>>> > Dave Chinner <[email protected]> wrote:
>>> >
>>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>>> >
>>> > A bit OFF TOPIC.
>>> >
>>> > Could you share disassemble of shrink_zone() ?
>>> >
>>> > In my environ.
>>> > 00000000000115a0 <shrink_zone>:
>>> >    115a0:       55                      push   %rbp
>>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>>> >    115a4:       41 57                   push   %r15
>>> >    115a6:       41 56                   push   %r14
>>> >    115a8:       41 55                   push   %r13
>>> >    115aa:       41 54                   push   %r12
>>> >    115ac:       53                      push   %rbx
>>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>>> >
>>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>>> > until retrun.
>>>
>>> I see the same. I didn't compile those kernels, though. IIUC,
>>> they were built through the Ubuntu build infrastructure, so there is
>>> something different in terms of compiler, compiler options or config
>>> to what we are both using. Most likely it is the compiler inlining,
>>> though Chris's patches to prevent that didn't seem to change the
>>> stack usage.
>>>
>>> I'm trying to get a stack trace from the kernel that has shrink_zone
>>> in it, but I haven't succeeded yet....
>>
>> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>>
>
> In my case, 0x110 byte in 32 bit machine.
> I think it's possible in 64 bit machine.
>
> 00001830 <shrink_zone>:
>    1830:       55                      push   %ebp
>    1831:       89 e5                   mov    %esp,%ebp
>    1833:       57                      push   %edi
>    1834:       56                      push   %esi
>    1835:       53                      push   %ebx
>    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
>    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
>    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
>    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
>    184e:       8b 41 04                mov    0x4(%ecx)
>
> my gcc is following as.
>
> barrios@barriostarget:~/mmotm$ gcc -v
> Using built-in specs.
> Target: i486-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> 4.3.3-5ubuntu4'
> --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> --enable-shared --with-system-zlib --libexecdir=/usr/lib
> --without-included-gettext --enable-threads=posix --enable-nls
> --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> --enable-mpfr --enable-targets=all --with-tune=generic
> --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> --target=i486-linux-gnu
> Thread model: posix
> gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>
>
> Is it depends on config?
> I attach my config.

I changed shrink_list to be noinline_for_stack.
The result is as follows.


00001fe0 <shrink_zone>:
1fe0: 55 push %ebp
1fe1: 89 e5 mov %esp,%ebp
1fe3: 57 push %edi
1fe4: 56 push %esi
1fe5: 53 push %ebx
1fe6: 83 ec 4c sub $0x4c,%esp
1fe9: 89 45 c0 mov %eax,-0x40(%ebp)
1fec: 89 55 bc mov %edx,-0x44(%ebp)
1fef: 89 4d b8 mov %ecx,-0x48(%ebp)

0x110 -> 0x4c.

Should we add noinline_for_stack to shrink_list?
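
(For illustration only - this is not the actual mm/vmscan.c change, just a
self-contained toy showing what the attribute buys: without it gcc may inline
the helper, and its locals then land in the caller's frame.

#include <linux/compiler.h>	/* noinline_for_stack */

static noinline_for_stack int scan_batch(int nr)
{
	int batch[64];	/* 256 bytes that should not live in the caller's frame */
	int i, sum = 0;

	for (i = 0; i < 64 && i < nr; i++) {
		batch[i] = i;
		sum += batch[i];
	}
	return sum;
}
)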


--
Kind regards,
Minchan Kim

2010-04-14 07:28:36

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > Basically, there is not enough stack space available to allow direct
> > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > profiles we are seeing here....
> > > > >
> > > >
> > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > doing sync IO, then waiting on those pages.
> > >
> > > So, I've been reading along, nodding my head to Dave's side of things
> > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > of doing page by page spatters of IO to the drive.
>
> I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> making 4k io is not must for pageout. So, probably we can improve it.
>
>
> > Perhaps drop the lock on the page if it is held and call one of the
> > helpers that filesystems use to do this, like:
> >
> > filemap_write_and_wait(page->mapping);
>
> Sorry, I'm lost what you talk about. Why do we need per-file
> waiting? If file is 1GB file, do we need to wait 1GB writeout?

So use filemap_fdatawrite(page->mapping), or if it's better only
to start IO on a segment of the file, use
filemap_fdatawrite_range(page->mapping, start, end)....
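
(A rough sketch of that idea - the 1MB window below is an arbitrary
illustration, not a number from this thread:

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Start async writeback on an aligned window of the file around the page
 * that reclaim wants cleaned, instead of a single 4k ->writepage. */
static void writeback_window_around(struct page *page)
{
	struct address_space *mapping = page->mapping;
	const loff_t window = 1024 * 1024;	/* 1MB, arbitrary */
	loff_t pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
	loff_t start = pos & ~(window - 1);

	filemap_fdatawrite_range(mapping, start, start + window - 1);
}
)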

> > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > that said:
> > >
> > > 43) 1568 912 do_select+0x3d6/0x700
> > >
> > > Huh, 912 bytes...for select, really? From poll.h:
> >
> > Sure, it's bad, but we focussing on the specific case misses the
> > point that even code that is using minimal stack can enter direct
> > reclaim after consuming 1.5k of stack. e.g.:
>
> checkstack.pl says do_select() and __generic_file_splice_read() are one
> of worstest stack consumer. both sould be fixed.

the deepest call chain in queue_work() needs 700 bytes of stack
to complete, wait_for_completion() requires almost 2k of stack space
at its deepest, the scheduler has some heavy stack users, etc,
and these are all functions that appear at the top of the stack.

> also, checkstack.pl says such stack eater aren't so much.

Yeah, but when we have a callchain 70 or more functions deep,
even 100 bytes of stack is a lot....

> > > So, select is intentionally trying to use that much stack. It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> >
> > The code that did the allocation is called from multiple different
> > contexts - how is it supposed to know that in some of those contexts
> > it is supposed to treat memory allocation differently?
> >
> > This is my point - if you introduce a new semantic to memory allocation
> > that is "use GFP_NOFS when you are using too much stack" and too much
> > stack is more than 15% of the stack, then pretty much every code path
> > will need to set that flag...
>
> Nodding my head to Dave's side. changing caller argument seems not good
> solution. I mean
> - do_select() should use GFP_KERNEL instead stack (as revert 70674f95c0)
> - reclaim and xfs (and other something else) need to diet.

The list I'm seeing so far includes:
- scheduler
- completion interfaces
- radix tree
- memory allocation, memory reclaim
- anything that implements ->writepage
- select
- splice read

> Also, I believe stack eater function should be created waring. patch attached.

Good start, but 512 bytes will only catch select and splice read,
and there are 300-400 byte functions in the above list that sit near
the top of the stack....

> > We need at least _700_ bytes of stack free just to call queue_work(),
> > and that now happens deep in the guts of the driver subsystem below XFS.
> > This trace shows 1.8k of stack usage on a simple, single sata disk
> > storage subsystem, so my estimate of 2k of stack for the storage system
> > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > closer to the mark.
>
> your explanation is very interesting. I have a (probably dumb) question.
> Why nobody faced stack overflow issue in past? now I think every users
> easily get stack overflow if your explanation is correct.

It's always a problem, but the focus on minimising stack usage has
gone away since i386 has mostly disappeared from server rooms.

XFS has always been the thing that triggered stack usage problems
first - the first reports of problems on x86_64 with 8k stacks in low
memory situations have only just come in, and this is the first time
in a couple of years I've paid close attention to stack usage
outside XFS. What I'm seeing is not pretty....

> > This is the sort of thing I'm pointing at when I say that stack
> > usage outside XFS has grown significantly significantly over the
> > past couple of years. Given XFS has remained pretty much the same or
> > even reduced slightly over the same time period, blaming XFS or
> > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > Regardless of the IO pattern performance issues, writeback via
> > direct reclaim just uses too much stack to be safe these days...
>
> Yeah, My answer is simple, All stack eater should be fixed.
> but XFS seems not innocence too. 3.5K is enough big although
> xfs have use such amount since very ago.

XFS used to use much more than that - significant effort has been
put into reducing the stack footprint over many years. There's not
much left to trim without rewriting half the filesystem...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 07:36:21

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 03:52:10PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > Thanks explanation. I haven't noticed direct reclaim consume
> > > 2k stack. I'll investigate it and try diet it.
> > > But XFS 3.5K stack consumption is too large too. please diet too.
> >
> > It hasn't grown in the last 2 years after the last major diet where
> > all the fat was trimmed from it in the last round of the i386 4k
> > stack vs XFS saga. it seems that everything else around XFS has
> > grown in that time, and now we are blowing stacks again....
>
> I have dumb question, If xfs haven't bloat stack usage, why 3.5
> stack usage works fine on 4k stack kernel? It seems impossible.

Because on a 32 bit kernel it's somewhere between 2-2.5k of stack
space. That being said, XFS _will_ blow a 4k stack on anything other
than the most basic storage configurations, and if you run out of
memory it is almost guaranteed to do so.

> Please don't think I blame you. I don't know what is "4k stack vs XFS saga".
> I merely want to understand what you said.

Over a period of years there were repeated attempts to make the
default stack size on i386 4k, despite it being known to cause
problems on relatively common configurations. Every time it was
brought up it was rejected, but every few months somebody else made
an attempt to make it the default. There was a lot of flamage
directed at XFS because it was seen as the reason that 4k stacks
were not made the default....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-14 07:54:19

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 1:44 PM, Dave Chinner <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 09:24:33AM +0900, Minchan Kim wrote:
>> Hi, Dave.
>>
>> On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <[email protected]> wrote:
>> > From: Dave Chinner <[email protected]>
>> >
>> > When we enter direct reclaim we may have used an arbitrary amount of stack
>> > space, and hence enterring the filesystem to do writeback can then lead to
>> > stack overruns. This problem was recently encountered x86_64 systems with
>> > 8k stacks running XFS with simple storage configurations.
>> >
>> > Writeback from direct reclaim also adversely affects background writeback. The
>> > background flusher threads should already be taking care of cleaning dirty
>> > pages, and direct reclaim will kick them if they aren't already doing work. If
>> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
>> > the background flusher threads to be upset by LRU-order writeback from
>> > pageout() which can be effectively random IO. Having competing sources of IO
>> > trying to clean pages on the same backing device reduces throughput by
>> > increasing the amount of seeks that the backing device has to do to write back
>> > the pages.
>> >
>> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
>> > Set up the relevant scan_control structures to enforce this, and prevent
>> > sc->may_writepage from being set in other places in the direct reclaim path in
>> > response to other events.
>>
>> I think your solution is rather aggressive change as Mel and Kosaki
>> already pointed out.
>
> It may be agressive, but writeback from direct reclaim is, IMO, one
> of the worst aspects of the current VM design because of it's
> adverse effect on the IO subsystem.

Tend to agree. But do we need it as a last resort if the flusher thread
can't keep up with the write stream?
Or, in my opinion, could the I/O layer have better throttling logic than
it has now?

>
> I'd prefer to remove it completely that continue to try and patch
> around it, especially given that everyone seems to agree that it
> does have an adverse affect on IO...

Of course, if everybody agrees, we can do it.
For that, we need many benchmark results, which is very hard.
Maybe I will help with it on embedded systems.

>
>> Do flush thread aware LRU of dirty pages in system level recency not
>> dirty pages recency?
>
> It writes back in the order inodes were dirtied. i.e. the LRU is a
> coarser measure, but it it still definitely there. It also takes
> into account fairness of IO between dirty inodes, so no one dirty
> inode prevents IO beining issued on a other dirty inodes on the
> LRU...

Thanks.
It seems some recency is lost.
I am not sure how much it affects system performance.

>
>> Of course flush thread can clean dirty pages faster than direct reclaimer.
>> But if it don't aware LRUness, hot page thrashing can be happened by
>> corner case.
>> It could lost write merge.
>>
>> And non-rotation storage might be not big of seek cost.
>
> Non-rotational storage still goes faster when it is fed large, well
> formed IOs.

Agreed, I missed that. NAND devices are stronger than HDDs at random reads.
But random writes are very weak in both performance and wear-leveling.

>
>> I think we have to consider that case if we decide to change direct reclaim I/O.
>>
>> How do we separate the problem?
>>
>> 1. stack hogging problem.
>> 2. direct reclaim random write.
>
> AFAICT, the only way to _reliably_ avoid the stack usage problem is
> to avoid writeback in direct reclaim. That has the side effect of
> fixing #2 as well, so do they really need separating?

If we can do it, that's good,
but problem 2 is not easy to fix, I think.
Compared to 2, 1 is rather easy,
so I thought we could solve 1 first and then focus on 2.
If your suggestion is right, then we can apply your idea.
Then we don't need to revert the fix for 1, since smaller stack usage is
always good if we don't lose much performance.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>



--
Kind regards,
Minchan Kim

2010-04-14 08:51:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > profiles we are seeing here....
> > > > > >
> > > > >
> > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > doing sync IO, then waiting on those pages.
> > > >
> > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > of doing page by page spatters of IO to the drive.
> >
> > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > making 4k io is not must for pageout. So, probably we can improve it.
> >
> >
> > > Perhaps drop the lock on the page if it is held and call one of the
> > > helpers that filesystems use to do this, like:
> > >
> > > filemap_write_and_wait(page->mapping);
> >
> > Sorry, I'm lost what you talk about. Why do we need per-file
> > waiting? If file is 1GB file, do we need to wait 1GB writeout?
>
> So use filemap_fdatawrite(page->mapping), or if it's better only
> to start IO on a segment of the file, use
> filemap_fdatawrite_range(page->mapping, start, end)....
>

That does not help the stack usage issue; the caller still ends up in
->writepages. From an IO perspective, it'll be better from a seek point of
view, but from a VM perspective, it may or may not be cleaning the right pages.
So I think this is a red herring.

> > > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > > that said:
> > > >
> > > > 43) 1568 912 do_select+0x3d6/0x700
> > > >
> > > > Huh, 912 bytes...for select, really? From poll.h:
> > >
> > > Sure, it's bad, but we focussing on the specific case misses the
> > > point that even code that is using minimal stack can enter direct
> > > reclaim after consuming 1.5k of stack. e.g.:
> >
> > checkstack.pl says do_select() and __generic_file_splice_read() are one
> > of worstest stack consumer. both sould be fixed.
>
> the deepest call chain in queue_work() needs 700 bytes of stack
> to complete, wait_for_completion() requires almost 2k of stack space
> at it's deepest, the scheduler has some heavy stack users, etc,
> and these are all functions that appear at the top of the stack.
>

The real issue here then is that stack usage has gone out of control.
Disabling ->writepage in direct reclaim does not guarantee that stack
usage will not be a problem again. From your traces, page reclaim itself
seems to be a big dirty hog.

Differences in what people see on their machines may be down to architecture
or compiler, but most likely inlining. Changing inlining will not fix the
problem; it'll just move the stack usage around.

> > also, checkstack.pl says such stack eater aren't so much.
>
> Yeah, but when we have ia callchain 70 or more functions deep,
> even 100 bytes of stack is a lot....
>
> > > > So, select is intentionally trying to use that much stack. It should be using
> > > > GFP_NOFS if it really wants to suck down that much stack...
> > >
> > > The code that did the allocation is called from multiple different
> > > contexts - how is it supposed to know that in some of those contexts
> > > it is supposed to treat memory allocation differently?
> > >
> > > This is my point - if you introduce a new semantic to memory allocation
> > > that is "use GFP_NOFS when you are using too much stack" and too much
> > > stack is more than 15% of the stack, then pretty much every code path
> > > will need to set that flag...
> >
> > Nodding my head to Dave's side. changing caller argument seems not good
> > solution. I mean
> > - do_select() should use GFP_KERNEL instead stack (as revert 70674f95c0)
> > - reclaim and xfs (and other something else) need to diet.
>
> The list I'm seeing so far includes:
> - scheduler
> - completion interfaces
> - radix tree
> - memory allocation, memory reclaim
> - anything that implements ->writepage
> - select
> - splice read
>
> > Also, I believe stack eater function should be created waring. patch attached.
>
> Good start, but 512 bytes will only catch select and splice read,
> and there are 300-400 byte functions in the above list that sit near
> the top of the stack....
>

They will need to be tackled in turn then, but obviously there should be
a focus on the common paths. The reclaim paths do seem particularly
heavy, and it's down to a lot of temporary variables. I might not get the
time today, but what I'm going to try to do some time this week is

o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed
for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control

The last one is the trickiest. Basically, the idea would be to move as much
into scan_control as possible. Then, instead of allocating it on the stack,
allocate a fixed number of them at boot-time (NR_CPU probably) protected by
a semaphore. Limit the number of direct reclaimers that can be active at a
time to the number of scan_control variables. kswapd could still allocate
its own on the stack or with kmalloc.

If it works out, it would have two main benefits. It limits the number of
processes in direct reclaim - if there is NR_CPU-worth of processes in direct
reclaim, there is too much going on. It would also shrink the stack usage,
particularly if some of the stack variables are moved into scan_control.

Maybe someone will beat me to looking at the feasibility of this.
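
To make that concrete, here is a minimal sketch of the boot-time pool, assuming
a counting semaphore plus a one-word bitmap (which in turn assumes
NR_CPUS <= BITS_PER_LONG); the names sc_pool, sc_pool_get() and sc_pool_put()
are made up for illustration, not existing code:

#include <linux/init.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/bitops.h>
#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/string.h>

static struct scan_control *sc_pool;	/* one entry per possible CPU */
static struct semaphore sc_pool_sem;	/* limits concurrent direct reclaimers */
static DEFINE_SPINLOCK(sc_pool_lock);
static unsigned long sc_pool_map;	/* bit set == slot in use */

static int __init sc_pool_init(void)
{
	sc_pool = kcalloc(num_possible_cpus(), sizeof(*sc_pool), GFP_KERNEL);
	if (!sc_pool)
		return -ENOMEM;
	sema_init(&sc_pool_sem, num_possible_cpus());
	return 0;
}

static struct scan_control *sc_pool_get(void)
{
	int slot;

	down(&sc_pool_sem);	/* blocks once NR_CPU reclaimers are active */
	spin_lock(&sc_pool_lock);
	/* the semaphore guarantees a free slot exists at this point */
	slot = find_first_zero_bit(&sc_pool_map, num_possible_cpus());
	__set_bit(slot, &sc_pool_map);
	spin_unlock(&sc_pool_lock);

	memset(&sc_pool[slot], 0, sizeof(struct scan_control));
	return &sc_pool[slot];
}

static void sc_pool_put(struct scan_control *sc)
{
	spin_lock(&sc_pool_lock);
	__clear_bit(sc - sc_pool, &sc_pool_map);
	spin_unlock(&sc_pool_lock);
	up(&sc_pool_sem);
}

kswapd would bypass the pool entirely and keep allocating its own scan_control
on its stack or with kmalloc, as above.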

> > > We need at least _700_ bytes of stack free just to call queue_work(),
> > > and that now happens deep in the guts of the driver subsystem below XFS.
> > > This trace shows 1.8k of stack usage on a simple, single sata disk
> > > storage subsystem, so my estimate of 2k of stack for the storage system
> > > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > > closer to the mark.
> >
> > your explanation is very interesting. I have a (probably dumb) question.
> > Why nobody faced stack overflow issue in past? now I think every users
> > easily get stack overflow if your explanation is correct.
>
> It's always a problem, but the focus on minimising stack usage has
> gone away since i386 has mostly disappeared from server rooms.
>
> XFS has always been the thing that triggered stack usage problems
> first - the first reports of problems on x86_64 with 8k stacks in low
> memory situations have only just come in, and this is the first time
> in a couple of years I've paid close attention to stack usage
> outside XFS. What I'm seeing is not pretty....
>
> > > This is the sort of thing I'm pointing at when I say that stack
> > > usage outside XFS has grown significantly significantly over the
> > > past couple of years. Given XFS has remained pretty much the same or
> > > even reduced slightly over the same time period, blaming XFS or
> > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > Regardless of the IO pattern performance issues, writeback via
> > > direct reclaim just uses too much stack to be safe these days...
> >
> > Yeah, My answer is simple, All stack eater should be fixed.
> > but XFS seems not innocence too. 3.5K is enough big although
> > xfs have use such amount since very ago.
>
> XFS used to use much more than that - significant effort has been
> put into reduce the stack footprint over many years. There's not
> much left to trim without rewriting half the filesystem...
>

I don't think he is levelling a complaint at XFS in particular - just pointing
out that it's heavy too. Still, we should be grateful that XFS is sort of
a "Stack Canary". If it dies, everyone else could be in trouble soon :)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-14 09:46:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, 14 Apr 2010 16:19:02 +0900
Minchan Kim <[email protected]> wrote:

> On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> > On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
> > <[email protected]> wrote:
> >>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> >>> > On Wed, 14 Apr 2010 11:40:41 +1000
> >>> > Dave Chinner <[email protected]> wrote:
> >>> >
> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
> >>> >
> >>> > A bit OFF TOPIC.
> >>> >
> >>> > Could you share disassemble of shrink_zone() ?
> >>> >
> >>> > In my environ.
> >>> > 00000000000115a0 <shrink_zone>:
> >>> >    115a0:       55                      push   %rbp
> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
> >>> >    115a4:       41 57                   push   %r15
> >>> >    115a6:       41 56                   push   %r14
> >>> >    115a8:       41 55                   push   %r13
> >>> >    115aa:       41 54                   push   %r12
> >>> >    115ac:       53                      push   %rbx
> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> >>> >
> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> >>> > until retrun.
> >>>
> >>> I see the same. I didn't compile those kernels, though. IIUC,
> >>> they were built through the Ubuntu build infrastructure, so there is
> >>> something different in terms of compiler, compiler options or config
> >>> to what we are both using. Most likely it is the compiler inlining,
> >>> though Chris's patches to prevent that didn't seem to change the
> >>> stack usage.
> >>>
> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
> >>> in it, but I haven't succeeded yet....
> >>
> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
> >>
> >
> > In my case, 0x110 byte in 32 bit machine.
> > I think it's possible in 64 bit machine.
> >
> > 00001830 <shrink_zone>:
> >    1830:       55                      push   %ebp
> >    1831:       89 e5                   mov    %esp,%ebp
> >    1833:       57                      push   %edi
> >    1834:       56                      push   %esi
> >    1835:       53                      push   %ebx
> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
> >    184e:       8b 41 04                mov    0x4(%ecx)
> >
> > my gcc is following as.
> >
> > barrios@barriostarget:~/mmotm$ gcc -v
> > Using built-in specs.
> > Target: i486-linux-gnu
> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> > 4.3.3-5ubuntu4'
> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
> > --without-included-gettext --enable-threads=posix --enable-nls
> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> > --enable-mpfr --enable-targets=all --with-tune=generic
> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> > --target=i486-linux-gnu
> > Thread model: posix
> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
> >
> >
> > Is it depends on config?
> > I attach my config.
>
> I changed shrink list by noinline_for_stack.
> The result is following as.
>
>
> 00001fe0 <shrink_zone>:
> 1fe0: 55 push %ebp
> 1fe1: 89 e5 mov %esp,%ebp
> 1fe3: 57 push %edi
> 1fe4: 56 push %esi
> 1fe5: 53 push %ebx
> 1fe6: 83 ec 4c sub $0x4c,%esp
> 1fe9: 89 45 c0 mov %eax,-0x40(%ebp)
> 1fec: 89 55 bc mov %edx,-0x44(%ebp)
> 1fef: 89 4d b8 mov %ecx,-0x48(%ebp)
>
> 0x110 -> 0x4c.
>
> Should we have to add noinline_for_stack for shrink_list?
>

Hmm, about shrink_zone(): I don't think uninlining the functions directly called
by shrink_zone() can be much help.
The total stack size of the call chain will still be big.

Thanks,
-Kame

2010-04-14 10:01:50

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 6:42 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 14 Apr 2010 16:19:02 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <[email protected]> wrote:
>> > On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
>> > <[email protected]> wrote:
>> >>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>> >>> > On Wed, 14 Apr 2010 11:40:41 +1000
>> >>> > Dave Chinner <[email protected]> wrote:
>> >>> >
>> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >>> >
>> >>> > A bit OFF TOPIC.
>> >>> >
>> >>> > Could you share disassemble of shrink_zone() ?
>> >>> >
>> >>> > In my environ.
>> >>> > 00000000000115a0 <shrink_zone>:
>> >>> >    115a0:       55                      push   %rbp
>> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>> >>> >    115a4:       41 57                   push   %r15
>> >>> >    115a6:       41 56                   push   %r14
>> >>> >    115a8:       41 55                   push   %r13
>> >>> >    115aa:       41 54                   push   %r12
>> >>> >    115ac:       53                      push   %rbx
>> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>> >>> >
>> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>> >>> > until retrun.
>> >>>
>> >>> I see the same. I didn't compile those kernels, though. IIUC,
>> >>> they were built through the Ubuntu build infrastructure, so there is
>> >>> something different in terms of compiler, compiler options or config
>> >>> to what we are both using. Most likely it is the compiler inlining,
>> >>> though Chris's patches to prevent that didn't seem to change the
>> >>> stack usage.
>> >>>
>> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> >>> in it, but I haven't succeeded yet....
>> >>
>> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>> >>
>> >
>> > In my case, 0x110 byte in 32 bit machine.
>> > I think it's possible in 64 bit machine.
>> >
>> > 00001830 <shrink_zone>:
>> >    1830:       55                      push   %ebp
>> >    1831:       89 e5                   mov    %esp,%ebp
>> >    1833:       57                      push   %edi
>> >    1834:       56                      push   %esi
>> >    1835:       53                      push   %ebx
>> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
>> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
>> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
>> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
>> >    184e:       8b 41 04                mov    0x4(%ecx)
>> >
>> > my gcc is following as.
>> >
>> > barrios@barriostarget:~/mmotm$ gcc -v
>> > Using built-in specs.
>> > Target: i486-linux-gnu
>> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
>> > 4.3.3-5ubuntu4'
>> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
>> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
>> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
>> > --without-included-gettext --enable-threads=posix --enable-nls
>> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
>> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
>> > --enable-mpfr --enable-targets=all --with-tune=generic
>> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
>> > --target=i486-linux-gnu
>> > Thread model: posix
>> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>> >
>> >
>> > Is it depends on config?
>> > I attach my config.
>>
>> I changed shrink list by noinline_for_stack.
>> The result is following as.
>>
>>
>> 00001fe0 <shrink_zone>:
>>     1fe0:       55                      push   %ebp
>>     1fe1:       89 e5                   mov    %esp,%ebp
>>     1fe3:       57                      push   %edi
>>     1fe4:       56                      push   %esi
>>     1fe5:       53                      push   %ebx
>>     1fe6:       83 ec 4c                sub    $0x4c,%esp
>>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
>>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
>>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
>>
>> 0x110 -> 0x4c.
>>
>> Should we have to add noinline_for_stack for shrink_list?
>>
>
> Hmm. about shirnk_zone(), I don't think uninlining functions directly called
> by shrink_zone() can be a help.
> Total stack size of call-chain will be still big.

Absolutely.
But the above 500-byte usage is one of the hogs, and uninlining is not
critical to reclaim performance, so I think we lose nothing compared to
what we gain.

But I'm not in a hurry; an ad-hoc approach is not good.
I hope that when Mel tackles the stack consumption in the reclaim path, he
modifies this part, too.

Thanks.

> Thanks,
> -Kame
>
>
>



--
Kind regards,
Minchan Kim

2010-04-14 10:06:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Chris Mason <[email protected]> writes:
>
> Huh, 912 bytes...for select, really? From poll.h:
>
> /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> additional memory. */
> #define MAX_STACK_ALLOC 832
> #define FRONTEND_STACK_ALLOC 256
> #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
>
> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...

There are lots of other call chains which use multiple KB of stack by
themselves, so why not give select() those measly 832 bytes?

You think only file systems are allowed to use stack? :)

Basically if you cannot tolerate 1K (or more likely more) of stack
used before your fs is called you're toast in lots of other situations
anyways.

> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

It does this for large inputs, but the whole point of the stack fast
path is to avoid it for the common cases when only a small number of fds
is needed.

It's significantly slower to go to any external allocator.
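
For readers following along, the pattern being described is roughly the one
below - a sketch only, with made-up names (MAX_ON_STACK, poll_slot,
poll_fd_set), not the actual fs/select.c code:

#include <linux/slab.h>
#include <linux/errno.h>

#define MAX_ON_STACK	32	/* fixed, known stack cost for the fast path */

struct poll_slot {
	int	fd;
	short	events;
	short	revents;
};

static int poll_fd_set(const int *fds, unsigned int nfds)
{
	struct poll_slot stack_slots[MAX_ON_STACK];
	struct poll_slot *slots = stack_slots;
	unsigned int i;
	int ret = 0;

	/* Rare large input: pay for an external allocation rather than
	 * burning more stack. */
	if (nfds > MAX_ON_STACK) {
		slots = kcalloc(nfds, sizeof(*slots), GFP_KERNEL);
		if (!slots)
			return -ENOMEM;
	}

	for (i = 0; i < nfds; i++) {
		slots[i].fd = fds[i];
		slots[i].events = 0;
		slots[i].revents = 0;
	}

	/* ... poll each slot and accumulate results into ret ... */

	if (slots != stack_slots)
		kfree(slots);
	return ret;
}

The common small-nfds case never touches the allocator; the price is that the
fast-path buffer is always carved out of the caller's stack frame, which is
exactly the cost being debated here.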

-Andi
--
[email protected] -- Speaking for myself only.

2010-04-14 10:07:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 07:01:47PM +0900, Minchan Kim wrote:
> >> [...]
> >>
> >> 0x110 -> 0x4c.
> >>
> >> Should we have to add noinline_for_stack for shrink_list?
> >>
> >
> > Hmm. about shirnk_zone(), I don't think uninlining functions directly called
> > by shrink_zone() can be a help.
> > Total stack size of call-chain will be still big.
>
> Absolutely.
> But above 500 byte usage is one of hogger and uninlining is not
> critical about reclaim performance. So I think we don't get any lost
> than gain.
>

Bear in mind that uninlining can slightly increase the stack usage in some
cases because arguments, return addresses and the like have to be pushed
onto the stack. Inlining or uninlining is only the answer when it reduces the
number of stack variables that exist at any given time.

> But I don't get in a hurry. adhoc approach is not good.
> I hope when Mel tackles down consumption of stack in reclaim path, he
> modifies this part, too.
>

It'll be at least two days before I get the chance to try. A lot of the
temporary variables used in the reclaim path have existed for some time so
it will take a while.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-14 10:16:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 7:07 PM, Mel Gorman <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 07:01:47PM +0900, Minchan Kim wrote:
>> >> >>> > Dave Chinner <[email protected]> wrote:
>> >> >>> >
>> >> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> >> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> >> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >> >>> >
>> >> >>> > A bit OFF TOPIC.
>> >> >>> >
>> >> >>> > Could you share disassemble of shrink_zone() ?
>> >> >>> >
>> >> >>> > In my environ.
>> >> >>> > 00000000000115a0 <shrink_zone>:
>> >> >>> >    115a0:       55                      push   %rbp
>> >> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>> >> >>> >    115a4:       41 57                   push   %r15
>> >> >>> >    115a6:       41 56                   push   %r14
>> >> >>> >    115a8:       41 55                   push   %r13
>> >> >>> >    115aa:       41 54                   push   %r12
>> >> >>> >    115ac:       53                      push   %rbx
>> >> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>> >> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>> >> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>> >> >>> >
>> >> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>> >> >>> > until retrun.
>> >> >>>
>> >> >>> I see the same. I didn't compile those kernels, though. IIUC,
>> >> >>> they were built through the Ubuntu build infrastructure, so there is
>> >> >>> something different in terms of compiler, compiler options or config
>> >> >>> to what we are both using. Most likely it is the compiler inlining,
>> >> >>> though Chris's patches to prevent that didn't seem to change the
>> >> >>> stack usage.
>> >> >>>
>> >> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> >> >>> in it, but I haven't succeeded yet....
>> >> >>
>> >> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>> >> >>
>> >> >
>> >> > In my case, 0x110 byte in 32 bit machine.
>> >> > I think it's possible in 64 bit machine.
>> >> >
>> >> > 00001830 <shrink_zone>:
>> >> >    1830:       55                      push   %ebp
>> >> >    1831:       89 e5                   mov    %esp,%ebp
>> >> >    1833:       57                      push   %edi
>> >> >    1834:       56                      push   %esi
>> >> >    1835:       53                      push   %ebx
>> >> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
>> >> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
>> >> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
>> >> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
>> >> >    184e:       8b 41 04                mov    0x4(%ecx)
>> >> >
>> >> > my gcc is following as.
>> >> >
>> >> > barrios@barriostarget:~/mmotm$ gcc -v
>> >> > Using built-in specs.
>> >> > Target: i486-linux-gnu
>> >> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
>> >> > 4.3.3-5ubuntu4'
>> >> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
>> >> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
>> >> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
>> >> > --without-included-gettext --enable-threads=posix --enable-nls
>> >> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
>> >> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
>> >> > --enable-mpfr --enable-targets=all --with-tune=generic
>> >> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
>> >> > --target=i486-linux-gnu
>> >> > Thread model: posix
>> >> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>> >> >
>> >> >
>> >> > Is it depends on config?
>> >> > I attach my config.
>> >>
>> >> I changed shrink list by noinline_for_stack.
>> >> The result is following as.
>> >>
>> >>
>> >> 00001fe0 <shrink_zone>:
>> >>     1fe0:       55                      push   %ebp
>> >>     1fe1:       89 e5                   mov    %esp,%ebp
>> >>     1fe3:       57                      push   %edi
>> >>     1fe4:       56                      push   %esi
>> >>     1fe5:       53                      push   %ebx
>> >>     1fe6:       83 ec 4c                sub    $0x4c,%esp
>> >>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
>> >>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
>> >>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
>> >>
>> >> 0x110 -> 0x4c.
>> >>
>> >> Should we have to add noinline_for_stack for shrink_list?
>> >>
>> >
>> > Hmm. about shirnk_zone(), I don't think uninlining functions directly called
>> > by shrink_zone() can be a help.
>> > Total stack size of call-chain will be still big.
>>
>> Absolutely.
>> But above 500 byte usage is one of hogger and uninlining is not
>> critical about reclaim performance. So I think we don't get any lost
>> than gain.
>>
>
> Beat in mind that uninlining can slightly increase the stack usage in some
> cases because arguments, return addresses and the like have to be pushed
> onto the stack. Inlining or unlining is only the answer when it reduces the
> number of stack variables that exist at any given time.

Yes, I had totally missed that.
Thanks, Mel.

--
Kind regards,
Minchan Kim

2010-04-14 11:22:05

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> Chris Mason <[email protected]> writes:
> >
> > Huh, 912 bytes...for select, really? From poll.h:
> >
> > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> > additional memory. */
> > #define MAX_STACK_ALLOC 832
> > #define FRONTEND_STACK_ALLOC 256
> > #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> > #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> > #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> >
> > So, select is intentionally trying to use that much stack. It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
>
> There are lots of other call chains which use multiple KB bytes by itself,
> so why not give select() that measly 832 bytes?
>
> You think only file systems are allowed to use stack? :)

Grin, most definitely.

>
> Basically if you cannot tolerate 1K (or more likely more) of stack
> used before your fs is called you're toast in lots of other situations
> anyways.

Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.

Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together. The people who write code in select see
832 bytes and say "that's teeny, I should have taken 3832 bytes".

But they don't realize their function can dive down into ecryptfs then
the filesystem then maybe loop and then perhaps raid6 on top of a
network block device.

>
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
>
> It does this for large inputs, but the whole point of the stack fast
> path is to avoid it for common cases when a small number of fds is
> only needed.
>
> It's significantly slower to go to any external allocator.

Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.

I do agree that we can't really solve this with noinline_for_stack pixie
dust; the long call chains are going to be a problem no matter what.

Reading through all the comments so far, I think the short summary is:

Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages. This isn't a fast
operation; it has to wait for IO (infinitely slow compared to the CPU).

Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file? The filesystem will get
writepages(), the VM will get the IO it needs started.

I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without it.

-chris

2010-04-14 12:15:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Chris Mason <[email protected]> writes:
>>
>> Basically if you cannot tolerate 1K (or more likely more) of stack
>> used before your fs is called you're toast in lots of other situations
>> anyways.
>
> Well, on a 4K stack kernel, 832 bytes is a very large percentage for
> just one function.

To be honest I think 4K stack simply has to go. I tend to call
it "russian roulette" mode.

It was just an old workaround for a very old, buggy VM that couldn't free
contiguous 8K pages, and the VM is a lot better at that now. And the general trend is
to more complex code everywhere, so 4K stacks become more and more hazardous.

It was a bad idea back then and is still a bad idea, getting
worse and worse with each MLOC being added to the kernel each year.

We don't have any good ways to verify that obscure paths through
the ever-growing set of subsystems won't exceed it (in fact I'm pretty
sure there are plenty of problems in exotic configurations).

And even if you can make a specific load work there's basically
no safety net.

The only part of the 4K stack code that's good is the separate
interrupt stack, but that one should be just combined with a sane 8K
process stack.

But yes, on a 4K-stack kernel you probably don't want to do any direct reclaim.
Maybe for GFP_NOFS everywhere except user allocations when it's set?
Or simply drop it?

> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.

Those stackings need to use separate threads anyways. A lot of them
do in fact. Block avoided this problem by iterating instead of
recursing. Those that still recurse on the same stack simply
need to be fixed.
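
For reference, the shape of that trick is sketched below with invented names
(submit_ctx, submit_req, submit_one) rather than the actual block-layer code:
a nested submission from a stacked driver is queued on a per-task list and
drained by the outermost caller in a loop, so the stack does not grow with the
stacking depth.

#include <linux/list.h>
#include <linux/types.h>

struct stacked_req {
	struct list_head list;
	/* driver-specific fields would go here */
};

struct submit_ctx {
	struct list_head pending;	/* requests deferred by nested submissions */
	bool active;			/* a submission is already running on this task */
};

static void submit_one(struct submit_ctx *ctx, struct stacked_req *req);

/* ctx is assumed to be per-task and already initialised. */
static void submit_req(struct submit_ctx *ctx, struct stacked_req *req)
{
	/* Re-entry from a stacked driver: defer instead of recursing. */
	if (ctx->active) {
		list_add_tail(&req->list, &ctx->pending);
		return;
	}

	ctx->active = true;
	submit_one(ctx, req);			/* may call submit_req() again */

	while (!list_empty(&ctx->pending)) {	/* drain deferred work iteratively */
		req = list_first_entry(&ctx->pending, struct stacked_req, list);
		list_del(&req->list);
		submit_one(ctx, req);
	}
	ctx->active = false;
}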

> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.

For common fast paths it doesn't go into the allocator.

-Andi

--
[email protected] -- Speaking for myself only.

2010-04-14 12:29:22

by Alan

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> The only part of the 4K stack code that's good is the separate
> interrupt stack, but that one should be just combined with a sane 8K
> process stack.

The reality is that if you are blowing a 4K process stack you are
probably playing russian roulette on the current 8K x86-32 stack as well,
because there is no IRQ stack split there. So it needs fixing either way.

2010-04-14 12:34:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 01:32:29PM +0100, Alan Cox wrote:
> > The only part of the 4K stack code that's good is the separate
> > interrupt stack, but that one should be just combined with a sane 8K
> > process stack.
>
> The reality is that if you are blowing a 4K process stack you are
> probably playing russian roulette on the current 8K x86-32 stack as well
> because of the non IRQ split. So it needs fixing either way

Yes, I think the 8K stack on 32-bit should be combined with an interrupt
stack too. There's no reason not to have an interrupt stack, ever.

Again the problem with fixing it is that you won't have any safety net
for a slightly different stacking etc. path that you didn't cover.

That said extreme examples (like some of those Chris listed) definitely
need fixing by moving them to different threads. But even after that
you still want a safety net. 4K is just too near the edge.

Maybe it would work if we never used any indirect calls, but that's
clearly not the case.

-Andi

--
[email protected] -- Speaking for myself only.

2010-04-14 13:24:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 07:20:15AM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> > Chris Mason <[email protected]> writes:
> > >
> > > Huh, 912 bytes...for select, really? From poll.h:
> > >
> > > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> > > additional memory. */
> > > #define MAX_STACK_ALLOC 832
> > > #define FRONTEND_STACK_ALLOC 256
> > > #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> > > #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> > > #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > > #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> > >
> > > So, select is intentionally trying to use that much stack. It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> >
> > There are lots of other call chains which use multiple KB bytes by itself,
> > so why not give select() that measly 832 bytes?
> >
> > You think only file systems are allowed to use stack? :)
>
> Grin, most definitely.
>
> >
> > Basically if you cannot tolerate 1K (or more likely more) of stack
> > used before your fs is called you're toast in lots of other situations
> > anyways.
>
> Well, on a 4K stack kernel, 832 bytes is a very large percentage for
> just one function.
>
> Direct reclaim is a problem because it splices parts of the kernel that
> normally aren't connected together. The people that code in select see
> 832 bytes and say that's teeny, I should have taken 3832 bytes.
>

Even without direct reclaim, I doubt stack usage is often at the top of
people's minds except for truly criminally large usages of it. Direct
reclaim splicing is somewhat of a problem, but it's separate from stack
consumption overall.

> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.
>
> >
> > > kernel had some sort of way to dynamically allocate ram, it could try
> > > that too.
> >
> > It does this for large inputs, but the whole point of the stack fast
> > path is to avoid it for common cases when a small number of fds is
> > only needed.
> >
> > It's significantly slower to go to any external allocator.
>
> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.
>
> I do agree that we can't really solve this with noinline_for_stack pixie
> dust, the long call chains are going to be a problem no matter what.
>
> Reading through all the comments so far, I think the short summary is:
>
> Cleaning pages in direct reclaim helps the VM because it is able to make
> sure that lumpy reclaim finds adjacent pages. This isn't a fast
> operation, it has to wait for IO (infinitely slow compared to the CPU).
>
> Will it be good enough for the VM if we add a hint to the bdi writeback
> threads to work on a general area of the file? The filesystem will get
> writepages(), the VM will get the IO it needs started.
>

Bear in mind that in the context of lumpy reclaim, the VM doesn't care
where the data is in the file or filesystem. It's only concerned
about where the data is located in memory. There *may* be a correlation
between location-of-data-in-file and location-of-data-in-memory but only
if readahead was a factor and readahead happened to hit at a time the page
allocator broke up a contiguous block of memory.

> I know Mel mentioned before he wasn't interested in waiting for helper
> threads, but I don't see how we can work without it.
>

I'm not against the idea as such. It would have advantages in that the
thread could reorder the IO for better seek patterns, for example, and lumpy
reclaim is already potentially waiting a long time, so another delay
won't hurt. I would worry that it's just hiding the stack usage by
moving it to another thread, and that there would be a communication cost
between a direct reclaimer and this writeback thread. The main gain
would be in hiding the "splicing" effect between subsystems that direct
reclaim can have.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-14 14:09:07

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 02:23:50PM +0100, Mel Gorman wrote:
> On Wed, Apr 14, 2010 at 07:20:15AM -0400, Chris Mason wrote:

[ nods ]

>
> Bear in mind that the context of lumpy reclaim that the VM doesn't care
> about where the data is on the file or filesystem. It's only concerned
> about where the data is located in memory. There *may* be a correlation
> between location-of-data-in-file and location-of-data-in-memory but only
> if readahead was a factor and readahead happened to hit at a time the page
> allocator broke up a contiguous block of memory.
>
> > I know Mel mentioned before he wasn't interested in waiting for helper
> > threads, but I don't see how we can work without it.
> >
>
> I'm not against the idea as such. It would have advantages in that the
> thread could reorder the IO for better seeks for example and lumpy
> reclaim is already potentially waiting a long time so another delay
> won't hurt. I would worry that it's just hiding the stack usage by
> moving it to another thread and that there would be communication cost
> between a direct reclaimer and this writeback thread. The main gain
> would be in hiding the "splicing" effect between subsystems that direct
> reclaim can have.

The big gain from the helper threads is that storage operates at a
roughly fixed IOP rate. This is true for SSDs as well, it's just a much
higher rate. So the threads can send down 4K IOs and recover clean pages at
exactly the same IOP rate as they would sending down 64KB IOs.
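
As a back-of-the-envelope illustration (assumed numbers, not measurements): a
disk doing roughly 200 write IOPS cleans about 0.8MB/s of page cache with 4K
IOs but about 12.8MB/s with 64KB IOs - sixteen times as many pages for the
same number of seeks.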

I know that for lumpy purposes it might not be the best 64KB, but the
other side of it is that we have to write those pages eventually anyway.
We might as well write them when it is more or less free.

The per-bdi writeback threads are a pretty good base for changing the
ordering of writeback; they seem like a good place to integrate requests
from the VM about which files (and which offsets in those files) to
write back first.
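
A rough sketch of what such a request could look like follows - every name in
it (wb_reclaim_hint, wb_queue_reclaim_hint, the hint list) is invented for
illustration; this is not an existing kernel interface:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/fs.h>
#include <linux/slab.h>

/* One writeback hint from the VM: "please clean this range of this file". */
struct wb_reclaim_hint {
	struct list_head	list;
	struct inode		*inode;	/* pinned with igrab() below */
	pgoff_t			start;	/* first page index the VM cares about */
	pgoff_t			end;	/* last page index, inclusive */
};

static LIST_HEAD(wb_hint_list);		/* would live in the bdi in practice */
static DEFINE_SPINLOCK(wb_hint_lock);

/* Called from reclaim instead of pageout(): queue the hint and return;
 * the per-bdi flusher turns it into a ->writepages() call over the range. */
static void wb_queue_reclaim_hint(struct inode *inode, pgoff_t start, pgoff_t end)
{
	struct wb_reclaim_hint *hint;

	hint = kzalloc(sizeof(*hint), GFP_NOWAIT);
	if (!hint)
		return;			/* best effort only */

	hint->inode = igrab(inode);
	if (!hint->inode) {
		kfree(hint);
		return;
	}
	hint->start = start;
	hint->end = end;

	spin_lock(&wb_hint_lock);
	list_add_tail(&hint->list, &wb_hint_list);
	spin_unlock(&wb_hint_lock);
	/* ...then wake the flusher thread for the inode's backing device. */
}

The flusher side would pop hints, run ->writepages() over each range (e.g. via
filemap_fdatawrite_range()), and drop the inode reference when it is done.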

-chris

2010-04-15 01:34:44

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > profiles we are seeing here....
> > > > > > >
> > > > > >
> > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > doing sync IO, then waiting on those pages.
> > > > >
> > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > of doing page by page spatters of IO to the drive.
> > >
> > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > making 4k io is not must for pageout. So, probably we can improve it.
> > >
> > >
> > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > helpers that filesystems use to do this, like:
> > > >
> > > > filemap_write_and_wait(page->mapping);
> > >
> > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> >
> > So use filemap_fdatawrite(page->mapping), or if it's better only
> > to start IO on a segment of the file, use
> > filemap_fdatawrite_range(page->mapping, start, end)....
>
> That does not help the stack usage issue, the caller ends up in
> ->writepages. From an IO perspective, it'll be better from a seek point of
> view but from a VM perspective, it may or may not be cleaning the right pages.
> So I think this is a red herring.

If you ask it to clean a bunch of pages around the one you want to
reclaim on the LRU, there is a good chance it will also be cleaning
pages that are near the end of the LRU or physically close by as
well. It's not a guarantee, but for the additional IO cost of about
10% wall time on that IO to clean the page you need, you also get
1-2 orders of magnitude more pages cleaned. That sounds like a
win any way you look at it...

I agree that it doesn't solve the stack problem (Chris' suggestion
that we enable the bdi flusher interface would fix this); what I'm
pointing out is that the arguments that it is too hard or there are
no interfaces available to issue larger IO from reclaim are not at
all valid.
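
For example, something along these lines - a sketch only, where the helper
name and the 1MB window are arbitrary choices; filemap_fdatawrite_range()
itself is the existing interface:

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

#define RECLAIM_WRITE_WINDOW	(1024 * 1024)	/* arbitrary 1MB window */

static void reclaim_write_around(struct page *page)
{
	struct address_space *mapping = page->mapping;
	loff_t pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
	loff_t start = pos & ~(loff_t)(RECLAIM_WRITE_WINDOW - 1);
	loff_t end = start + RECLAIM_WRITE_WINDOW - 1;

	/* Kicks off ->writepages() over the whole byte range; waiting for
	 * the IO (filemap_fdatawait_range) is a separate step if needed. */
	filemap_fdatawrite_range(mapping, start, end);
}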

> > the deepest call chain in queue_work() needs 700 bytes of stack
> > to complete, wait_for_completion() requires almost 2k of stack space
> > at it's deepest, the scheduler has some heavy stack users, etc,
> > and these are all functions that appear at the top of the stack.
> >
>
> The real issue here then is that stack usage has gone out of control.

That's definitely true, but it shouldn't cloud the fact that most
people want to kill writeback from direct reclaim, too, so killing two
birds with one stone seems like a good idea.

How about this? For now, we stop direct reclaim from doing writeback
only on order zero allocations, but allow it for higher order
allocations. That will prevent the majority of situations where
direct reclaim blows the stack and interferes with background
writeout, but won't cause lumpy reclaim to change behaviour.
This reduces the scope of impact and hence the testing and validation
that needs to be done.
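
If it were implemented as a single check in the pageout path, the shape of it
might be something like this (may_pageout() is a made-up helper; sc->order and
current_is_kswapd() already exist):

static bool may_pageout(struct scan_control *sc)
{
	if (current_is_kswapd())
		return true;	/* kswapd keeps its current behaviour */
	if (sc->order > 0)
		return true;	/* lumpy reclaim still gets ->writepage */
	return false;		/* order-0 direct reclaim: leave the IO to the flusher */
}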

Then we can work towards allowing lumpy reclaim to use background
threads as Chris suggested for doing specific writeback operations
to solve the remaining problems being seen. Does this seem like a
reasonable compromise and approach to dealing with the problem?

> Disabling ->writepage in direct reclaim does not guarantee that stack
> usage will not be a problem again. From your traces, page reclaim itself
> seems to be a big dirty hog.

I couldn't agree more - the kernel still needs to be put on a stack
usage diet, but the above would give us some breathing space to attack the
problem before more people start to hit these problems.

> > Good start, but 512 bytes will only catch select and splice read,
> > and there are 300-400 byte functions in the above list that sit near
> > the top of the stack....
> >
>
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
>
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
> for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control

Welcome to my world ;)

> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its on the stack or with kmalloc.
>
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
>
> Maybe someone will beat me to looking at the feasibility of this.

I like the idea - it really sounds like you want a fixed size,
preallocated mempool that can't be enlarged. In fact, I can probably
use something like this in XFS to save a couple of hundred bytes of
stack space in the worst hogs....

> > > > This is the sort of thing I'm pointing at when I say that stack
> > > > usage outside XFS has grown significantly significantly over the
> > > > past couple of years. Given XFS has remained pretty much the same or
> > > > even reduced slightly over the same time period, blaming XFS or
> > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > Regardless of the IO pattern performance issues, writeback via
> > > > direct reclaim just uses too much stack to be safe these days...
> > >
> > > Yeah, My answer is simple, All stack eater should be fixed.
> > > but XFS seems not innocence too. 3.5K is enough big although
> > > xfs have use such amount since very ago.
> >
> > XFS used to use much more than that - significant effort has been
> > put into reduce the stack footprint over many years. There's not
> > much left to trim without rewriting half the filesystem...
>
> I don't think he is levelling a complain at XFS in particular - just pointing
> out that it's heavy too. Still, we should be gratful that XFS is sort of
> a "Stack Canary". If it dies, everyone else could be in trouble soon :)

Yeah, true. Sorry if I'm being a bit too defensive here - the scars
from previous discussions like this are showing through....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 01:56:39

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 03:52:32PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > > > have a relatively simple test that can be run to determine what the
> > > > > impact is?
> > > >
> > > > So, can you please run two workloads concurrently?
> > > > - Normal IO workload (fio, iozone, etc..)
> > > > - echo $NUM > /proc/sys/vm/nr_hugepages
> > >
> > > What do I measure/observe/record that is meaningful?
> >
> > So, a rough as guts first pass - just run a large dd (8 times the
> > size of memory - 8GB file vs 1GB RAM) and repeated try to allocate
> > the entire of memory in huge pages (500) every 5 seconds. The IO
> > rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
.....
> > Basically, with my patch lumpy reclaim was *substantially* more
> > effective with only a slight increase in average allocation latency
> > with this test case.
....
> > I know this is a simple test case, but it shows much better results
> > than I think anyone (even me) is expecting...
>
> Ummm...
>
> Probably, I have to say I'm sorry. I guess my last mail give you
> a misunderstand.
> To be honest, I'm not interest this artificial non fragmentation case.

And to be brutally honest, I'm not interested in wasting my time
trying to come up with a test case that you are interested in.

Instead, can you please provide me with your test cases
(scripts, preferably) that you use to measure the effectiveness of
reclaim changes and I'll run them.

> The above test-case does 1) discard all cache 2) fill pages by streaming
> io. then, it makes artificial "file offset neighbor == block neighbor == PFN neighbor"
> situation. then, file offset order writeout by flusher thread can make
> PFN contenious pages effectively.

Yes, that's true, but it does indicate that in that situation, it is
more effective than the current code. FWIW, in the case of HPC
applications (which often use huge pages and clear the cache before
starting a new job), large streaming IO is a pretty common IO
pattern, so I don't think this situation is as artificial as you are
indicating.

> Why I dont interest it? because lumpy reclaim is a technique for
> avoiding external fragmentation mess. IOW, it is for avoiding
> worst case. but your test case seems to mesure best one.

Then please provide test cases that you consider valid.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 02:37:35

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
>
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
> for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control
>
> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its on the stack or with kmalloc.
>
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
>
> Maybe someone will beat me to looking at the feasibility of this.

I already have some patches to remove trivial parts of struct scan_control,
namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
needs a deeper look.

A rather big offender in there is the combination of shrink_active_list (360
bytes here) and shrink_page_list (200 bytes). I am currently looking at
breaking out all the accounting stuff from shrink_active_list into a separate
leaf function so that the stack footprint does not add up.
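
For example, the split might look something like the sketch below - the helper
name is invented, but noinline_for_stack already exists for exactly this
purpose, and the code would live in mm/vmscan.c:

static noinline_for_stack void
shrink_active_update_stats(struct zone *zone, struct zone_reclaim_stat *stat,
			   int file, unsigned long nr_taken,
			   unsigned long nr_rotated)
{
	/* These updates (and any temporaries they need) now live in this
	 * short-lived frame instead of widening shrink_active_list()'s
	 * frame for its entire lifetime. */
	spin_lock_irq(&zone->lru_lock);
	stat->recent_scanned[file] += nr_taken;
	stat->recent_rotated[file] += nr_rotated;
	spin_unlock_irq(&zone->lru_lock);
}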

Your idea of per-cpu allocated scan controls reminds me of an idea I have
had for some time now: moving reclaim into its own threads (per cpu?).

Not only would it separate the allocator's stack from the writeback stack,
but we could also get rid of that too_many_isolated() workaround and coordinate
reclaim work better to prevent overreclaim.

But that is not a quick fix either...

2010-04-15 02:43:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Hi

> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> >
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> > for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
> >
> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> >
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> >
> > Maybe someone will beat me to looking at the feasibility of this.
>
> I already have some patches to remove trivial parts of struct scan_control,
> namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
> needs a deeper look.

Seems interesting, but a scan_control diet is not so effective. How many
bytes can we save with it?


> A rather big offender in there is the combination of shrink_active_list (360
> bytes here) and shrink_page_list (200 bytes). I am currently looking at
> breaking out all the accounting stuff from shrink_active_list into a separate
> leaf function so that the stack footprint does not add up.

The pagevec: it consumes 128 bytes per struct. I have a patch removing it.


> Your idea of per-cpu allocated scan controls reminds me of an idea I have
> had for some time now: moving reclaim into its own threads (per cpu?).
>
> Not only would it separate the allocator's stack from the writeback stack,
> we could also get rid of that too_many_isolated() workaround and coordinate
> reclaim work better to prevent overreclaim.
>
> But that is not a quick fix either...

So, I hadn't thought of it this way. It probably seems good, but I'd like
to do the simple diet first.


2010-04-15 04:09:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Hi

> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.

Tend to agree, but I would propose a slightly different algorithm for
avoiding incorrect OOM.

for high order allocations
  allow lumpy reclaim and pageout() for both kswapd and direct reclaim

for low order allocations
  - kswapd: always delegate IO to the flusher thread
  - direct reclaim: delegate IO to the flusher thread only if VM pressure is low

This seems safer. I mean, who wants to see an incorrect-OOM regression?
I've made some patches for this; I'll post them in separate mails.

> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?

Tend to agree. We are probably now discussing the right approach, but
this definitely needs deep thinking, so I can't give an exact answer yet.



2010-04-15 04:11:42

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

Currently, vmscan's pageout() is one source of IO throughput degradation.
Some IO workloads generate a great deal of order-0 allocation and reclaim,
and pageout's 4K IOs cause an annoying number of seeks.

At least kswapd can avoid such pageout() calls, because kswapd doesn't
need to consider the OOM-killer situation; there is no risk there.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..d392a50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
if (referenced_page)
return PAGEREF_RECLAIM_CLEAN;

+ /*
+ * Delegate pageout IO to the flusher threads; they can generate a
+ * more effective IO pattern.
+ */
+ if (current_is_kswapd())
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}

--
1.6.5.2


2010-04-15 04:13:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 2/4] vmscan: kill prev_priority completely

This patch is not directly related to this patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it here.

=============================================
Since 2.6.28 zone->prev_priority has been unused, so it can be removed
safely. This reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
haven't produced good performance numbers, so I'm giving up on that approach.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/mmzone.h | 15 -------------
mm/page_alloc.c | 2 -
mm/vmscan.c | 54 ++---------------------------------------------
mm/vmstat.c | 2 -
4 files changed, 3 insertions(+), 70 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..ad76962 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,21 +339,6 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

/*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invocation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
-
- /*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..88513c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,8 +3862,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;

- zone->prev_priority = DEF_PRIORITY;
-
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d392a50..dadb461 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1284,20 +1284,6 @@ done:
}

/*
- * We are about to scan this zone at a certain priority level. If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone. This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
- if (priority < zone->prev_priority)
- zone->prev_priority = priority;
-}
-
-/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -1733,20 +1719,15 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- note_zone_scanning_priority(zone, priority);
-
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
sc->all_unreclaimable = 0;
- } else {
+ } else
/*
* Ignore cpuset limitation here. We just want to reduce
* # of used pages by us regardless of memory shortage.
*/
sc->all_unreclaimable = 0;
- mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
- priority);
- }

shrink_zone(priority, zone, sc);
}
@@ -1852,17 +1833,11 @@ out:
if (priority < 0)
priority = 0;

- if (scanning_global_lru(sc)) {
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
+ if (scanning_global_lru(sc))
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

- zone->prev_priority = priority;
- }
- } else
- mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
delayacct_freepages_end();

return ret;
@@ -2015,22 +1990,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
};
- /*
- * temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to
- * free_pages == high_wmark_pages(zone).
- */
- int temp_priority[MAX_NR_ZONES];
-
loop_again:
total_scanned = 0;
sc.nr_reclaimed = 0;
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);

- for (i = 0; i < pgdat->nr_zones; i++)
- temp_priority[i] = DEF_PRIORITY;
-
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
@@ -2098,9 +2063,7 @@ loop_again:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

- temp_priority[i] = priority;
sc.nr_scanned = 0;
- note_zone_scanning_priority(zone, priority);

nid = pgdat->node_id;
zid = zone_idx(zone);
@@ -2173,16 +2136,6 @@ loop_again:
break;
}
out:
- /*
- * Note within each zone the priority level at which this zone was
- * brought into a happy state. So that the next thread which scans this
- * zone will start out at that priority level.
- */
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- zone->prev_priority = temp_priority[i];
- }
if (!all_zones_ok) {
cond_resched();

@@ -2600,7 +2553,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
priority = ZONE_RECLAIM_PRIORITY;
do {
- note_zone_scanning_priority(zone, priority);
shrink_zone(priority, zone, &sc);
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..2db0a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -761,11 +761,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
}
seq_printf(m,
"\n all_unreclaimable: %u"
- "\n prev_priority: %i"
"\n start_pfn: %lu"
"\n inactive_ratio: %u",
zone->all_unreclaimable,
- zone->prev_priority,
zone->zone_start_pfn,
zone->inactive_ratio);
seq_putc(m, '\n');
--
1.6.5.2


2010-04-15 04:14:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 3/4] vmscan: move priority variable into scan_control

ditto

This patch is not directly related to this patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it here.
=========================================

Right now a lot of functions in vmscan take a `priority' argument, which
consumes some stack. Moving it into struct scan_control reduces stack usage.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 83 ++++++++++++++++++++++++++--------------------------------
1 files changed, 37 insertions(+), 46 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dadb461..8b78b49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {

int order;

+ int priority;
+
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;

@@ -1130,7 +1132,7 @@ static int too_many_isolated(struct zone *zone, int file,
*/
static unsigned long shrink_inactive_list(unsigned long max_scan,
struct zone *zone, struct scan_control *sc,
- int priority, int file)
+ int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -1156,7 +1158,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
lumpy_reclaim = 1;
- else if (sc->order && priority < DEF_PRIORITY - 2)
+ else if (sc->order && sc->priority < DEF_PRIORITY - 2)
lumpy_reclaim = 1;

pagevec_init(&pvec, 1);
@@ -1335,7 +1337,7 @@ static void move_active_pages_to_lru(struct zone *zone,
}

static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
- struct scan_control *sc, int priority, int file)
+ struct scan_control *sc, int file)
{
unsigned long nr_taken;
unsigned long pgscanned;
@@ -1498,17 +1500,17 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
}

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
- struct zone *zone, struct scan_control *sc, int priority)
+ struct zone *zone, struct scan_control *sc)
{
int file = is_file_lru(lru);

if (is_active_lru(lru)) {
if (inactive_list_is_low(zone, sc, file))
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ shrink_active_list(nr_to_scan, zone, sc, file);
return 0;
}

- return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+ return shrink_inactive_list(nr_to_scan, zone, sc, file);
}

/*
@@ -1615,8 +1617,7 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -1640,8 +1641,8 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long scan;

scan = zone_nr_lru_pages(zone, sc, l);
- if (priority || noswap) {
- scan >>= priority;
+ if (sc->priority || noswap) {
+ scan >>= sc->priority;
scan = (scan * percent[file]) / 100;
}
nr[l] = nr_scan_try_batch(scan,
@@ -1657,7 +1658,7 @@ static void shrink_zone(int priority, struct zone *zone,
nr[l] -= nr_to_scan;

nr_reclaimed += shrink_list(l, nr_to_scan,
- zone, sc, priority);
+ zone, sc);
}
}
/*
@@ -1668,7 +1669,8 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
+ if (nr_reclaimed >= nr_to_reclaim &&
+ sc->priority < DEF_PRIORITY)
break;
}

@@ -1679,7 +1681,7 @@ static void shrink_zone(int priority, struct zone *zone,
* rebalance the anon lru active/inactive ratio.
*/
if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
- shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
+ shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, 0);

throttle_vm_writeout(sc->gfp_mask);
}
@@ -1700,8 +1702,7 @@ static void shrink_zone(int priority, struct zone *zone,
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
-static void shrink_zones(int priority, struct zonelist *zonelist,
- struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
@@ -1719,7 +1720,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable &&
+ sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
sc->all_unreclaimable = 0;
} else
@@ -1729,7 +1731,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
*/
sc->all_unreclaimable = 0;

- shrink_zone(priority, zone, sc);
+ shrink_zone(zone, sc);
}
}

@@ -1752,7 +1754,6 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct scan_control *sc)
{
- int priority;
unsigned long ret = 0;
unsigned long total_scanned = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1779,11 +1780,11 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
}
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (sc->priority = DEF_PRIORITY; sc->priority >= 0; sc->priority--) {
sc->nr_scanned = 0;
- if (!priority)
+ if (!sc->priority)
disable_swap_token();
- shrink_zones(priority, zonelist, sc);
+ shrink_zones(zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
@@ -1816,23 +1817,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,

/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
- priority < DEF_PRIORITY - 2)
+ sc->priority < DEF_PRIORITY - 2)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
ret = sc->nr_reclaimed;
-out:
- /*
- * Now that we've scanned all the zones at this priority level, note
- * that level within the zone so that the next thread which performs
- * scanning of this zone will immediately start out at this priority
- * level. This affects only the decision whether or not to bring
- * mapped pages onto the inactive list.
- */
- if (priority < 0)
- priority = 0;

+out:
if (scanning_global_lru(sc))
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
@@ -1892,7 +1884,8 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_zone(0, zone, &sc);
+ sc.priority = 0;
+ shrink_zone(zone, &sc);
return sc.nr_reclaimed;
}

@@ -1972,7 +1965,6 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
int all_zones_ok;
- int priority;
int i;
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1996,13 +1988,13 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;

/* The swap token gets in the way of swapout... */
- if (!priority)
+ if (!sc.priority)
disable_swap_token();

all_zones_ok = 1;
@@ -2017,7 +2009,7 @@ loop_again:
if (!populated_zone(zone))
continue;

- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
continue;

/*
@@ -2026,7 +2018,7 @@ loop_again:
*/
if (inactive_anon_is_low(zone, &sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone,
- &sc, priority, 0);
+ &sc, 0);

if (!zone_watermark_ok(zone, order,
high_wmark_pages(zone), 0, 0)) {
@@ -2060,7 +2052,7 @@ loop_again:
if (!populated_zone(zone))
continue;

- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
continue;

sc.nr_scanned = 0;
@@ -2079,7 +2071,7 @@ loop_again:
*/
if (!zone_watermark_ok(zone, order,
8*high_wmark_pages(zone), end_zone, 0))
- shrink_zone(priority, zone, &sc);
+ shrink_zone(zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -2119,7 +2111,7 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+ if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
else
@@ -2520,7 +2512,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2551,11 +2542,11 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* Free memory by calling shrink zone with increasing
* priorities until we have enough memory freed.
*/
- priority = ZONE_RECLAIM_PRIORITY;
+ sc.priority = ZONE_RECLAIM_PRIORITY;
do {
- shrink_zone(priority, zone, &sc);
- priority--;
- } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
+ shrink_zone(zone, &sc);
+ sc.priority--;
+ } while (sc.priority >= 0 && sc.nr_reclaimed < nr_pages);
}

slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
--
1.6.5.2




2010-04-15 04:15:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 4/4] vmscan: delegate page cleaning io to flusher thread if VM pressure is low

Even if pageout() is called from direct reclaim, we can delegate IO to
the flusher thread if VM pressure is low.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b78b49..eab6028 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,6 +623,13 @@ static enum page_references page_check_references(struct page *page,
if (current_is_kswapd())
return PAGEREF_RECLAIM_CLEAN;

+ /*
+ * VM pressure is not so high here, so we can safely delegate
+ * page cleaning to the flusher thread.
+ */
+ if (!sc->order && sc->priority > DEF_PRIORITY/2)
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}

--
1.6.5.2


2010-04-15 04:35:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> Hi
>
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
>
> Tend to agree. but I would proposed slightly different algorithm for
> avoind incorrect oom.
>
> for high order allocation
> allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
>
> for low order allocation
> - kswapd: always delegate io to flusher thread
> - direct reclaim: delegate io to flusher thread only if vm pressure is low
>
> This seems more safely. I mean Who want see incorrect oom regression?
> I've made some pathes for this. I'll post it as another mail.

With these patches, a kernel compile and/or backup operation seems to keep
nr_vmscan_write == 0. Dave, can you please try running your
pageout-annoying workload?
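For reference, the counter can be watched while the workload runs with a
trivial loop such as:

# while true; do grep nr_vmscan_write /proc/vmstat; sleep 10; done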


2010-04-15 06:21:02

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
>
> Tend to agree. but I would proposed slightly different algorithm for
> avoind incorrect oom.
>
> for high order allocation
> allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim

So, the same as current behaviour.

> for low order allocation
> - kswapd: always delegate io to flusher thread
> - direct reclaim: delegate io to flusher thread only if vm pressure is low

IMO, this really doesn't fix either of the problems - the bad IO
patterns nor the stack usage. All it will take is a bit more memory
pressure to trigger stack and IO problems, and the user reporting the
problems is generating an awful lot of memory pressure...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 06:32:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 01:35:17PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> >
> > Tend to agree. but I would proposed slightly different algorithm for
> > avoind incorrect oom.
> >
> > for high order allocation
> > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> >
> > for low order allocation
> > - kswapd: always delegate io to flusher thread
> > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> >
> > This seems more safely. I mean Who want see incorrect oom regression?
> > I've made some pathes for this. I'll post it as another mail.
>
> Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> Dave, can you please try to run your pageout annoying workload?

It's just as easy for you to run and observe the effects. Start with a VM
with 1GB RAM and a 10GB scratch block device:

# mkfs.xfs -f /dev/<blah>
# mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch

in one shell:

# while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done

in another shell, if you have fs_mark installed, run:

# ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &

otherwise run a couple of these in parallel on different directories:

# for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 06:35:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> >
> > Tend to agree. but I would proposed slightly different algorithm for
> > avoind incorrect oom.
> >
> > for high order allocation
> > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
>
> SO same as current.

Yes, the same as you proposed.

>
> > for low order allocation
> > - kswapd: always delegate io to flusher thread
> > - direct reclaim: delegate io to flusher thread only if vm pressure is low
>
> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...

This patch doesn't address stack usage, because
- again, I think all the stack eaters should go on a diet.
- in a world that allows lumpy reclaim, denying writeback only for
low order reclaim doesn't solve anything.

Please don't forget that a priority=0 reclaim failure invokes the OOM-killer.
I don't imagine anyone wants that.

And, which IO workload triggers vmscan at priority < 6?


2010-04-15 06:44:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Thu, Apr 15, 2010 at 01:35:17PM +0900, KOSAKI Motohiro wrote:
> > > Hi
> > >
> > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > only on order zero allocations, but allow it for higher order
> > > > allocations. That will prevent the majority of situations where
> > > > direct reclaim blows the stack and interferes with background
> > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > This reduces the scope of impact and hence testing and validation
> > > > the needs to be done.
> > >
> > > Tend to agree. but I would proposed slightly different algorithm for
> > > avoind incorrect oom.
> > >
> > > for high order allocation
> > > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > >
> > > for low order allocation
> > > - kswapd: always delegate io to flusher thread
> > > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> > >
> > > This seems more safely. I mean Who want see incorrect oom regression?
> > > I've made some pathes for this. I'll post it as another mail.
> >
> > Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> > Dave, can you please try to run your pageout annoying workload?
>
> It's just as easy for you to run and observe the effects. Start with a VM
> with 1GB RAM and a 10GB scratch block device:
>
> # mkfs.xfs -f /dev/<blah>
> # mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
>
> in one shell:
>
> # while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
>
> in another shell, if you have fs_mark installed, run:
>
> # ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
>
> otherwise run a couple of these in parallel on different directories:
>
> # for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done

Thanks.

Unfortunately, I don't have any unused disks, so I'll try it (probably)
next week.



2010-04-15 06:59:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 03:44:50PM +0900, KOSAKI Motohiro wrote:
> > > Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> > > Dave, can you please try to run your pageout annoying workload?
> >
> > It's just as easy for you to run and observe the effects. Start with a VM
> > with 1GB RAM and a 10GB scratch block device:
> >
> > # mkfs.xfs -f /dev/<blah>
> > # mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
> >
> > in one shell:
> >
> > # while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
> >
> > in another shell, if you have fs_mark installed, run:
> >
> > # ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
> >
> > otherwise run a couple of these in parallel on different directories:
> >
> > # for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
>
> Thanks.
>
> Unfortunately, I don't have unused disks. So, I'll try it at (probably)
> next week.

A filesystem on a loopback device will work just as well ;)
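e.g. something along these lines (file name and size are only an example):

# dd if=/dev/zero of=/tmp/scratch.img bs=1M count=1 seek=10239
# losetup /dev/loop0 /tmp/scratch.img
# mkfs.xfs -f /dev/loop0
# mount -o logbsize=262144,nobarrier /dev/loop0 /mnt/scratch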

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 08:14:26

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd


On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:

> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

What's your opinion on trying to cluster the writes done by pageout,
instead of not doing any paging out in kswapd?
Something along these lines:

Cluster writes to disk due to memory pressure.

Write out pages logically adjacent to the one we're paging out,
so that we may get better IOs in these situations:
these pages are likely to be contiguous on disk with the one we're
writing out, so they should get merged into a single disk IO.

Signed-off-by: Suleiman Souhlal <[email protected]>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4e5a613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,8 @@

#include "internal.h"

+#define PAGEOUT_CLUSTER_PAGES 16
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -350,6 +352,8 @@ typedef enum {
static pageout_t pageout(struct page *page, struct address_space *mapping,
 enum pageout_io sync_writeback)
{
+ int i;
+
/*
* If the page is dirty, only perform writeback if that write
* will be non-blocking. To prevent this allocation from being
@@ -408,6 +412,37 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}

/*
+ * Try to write out logically adjacent dirty pages too, if
+ * possible, to get better IOs, as the IO scheduler should
+ * merge them with the original one, if the file is not too
+ * fragmented.
+ */
+ for (i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
+ struct page *p2;
+ int err;
+
+ p2 = find_get_page(mapping, page->index + i);
+ if (p2) {
+ if (trylock_page(p2) == 0) {
+ page_cache_release(p2);
+ break;
+ }
+ if (page_mapped(p2))
+ try_to_unmap(p2, 0);
+ if (PageDirty(p2)) {
+ err = write_one_page(p2, 0);
+ page_cache_release(p2);
+ if (err)
+ break;
+ } else {
+ unlock_page(p2);
+ page_cache_release(p2);
+ break;
+ }
+ }
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.

2010-04-15 08:17:42

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

>
> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>
> > Now, vmscan pageout() is one of IO throuput degression source.
> > Some IO workload makes very much order-0 allocation and reclaim
> > and pageout's 4K IOs are making annoying lots seeks.
> >
> > At least, kswapd can avoid such pageout() because kswapd don't
> > need to consider OOM-Killer situation. that's no risk.
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> What's your opinion on trying to cluster the writes done by pageout,
> instead of not doing any paging out in kswapd?
> Something along these lines:

Interesting.
I'd like to review your patch carefully. Can you please give me one
day? :)


>
> Cluster writes to disk due to memory pressure.
>
> Write out logically adjacent pages to the one we're paging out
> so that we may get better IOs in these situations:
> These pages are likely to be contiguous on disk to the one we're
> writing out, so they should get merged into a single disk IO.
>
> Signed-off-by: Suleiman Souhlal <[email protected]>



2010-04-15 08:19:01

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.

I've found one bug in this patch myself: flusher threads don't
page out anon pages, so we need a PageAnon() check ;)
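i.e. something like this on top of 1/4 (illustrative only, not a respin):

	/*
	 * Delegate pageout IO to the flusher threads; they can generate a
	 * more effective IO pattern. But the flushers only clean file
	 * backed pages, so keep writing anon pages from kswapd.
	 */
	if (!PageAnon(page) && current_is_kswapd())
		return PAGEREF_RECLAIM_CLEAN;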



>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
> if (referenced_page)
> return PAGEREF_RECLAIM_CLEAN;
>
> + /*
> + * Delegate pageout IO to flusher thread. They can make more
> + * effective IO pattern.
> + */
> + if (current_is_kswapd())
> + return PAGEREF_RECLAIM_CLEAN;
> +
> return PAGEREF_RECLAIM;
> }
>
> --
> 1.6.5.2
>
>
>


2010-04-15 08:26:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

Cc to Johannes

> >
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >
> > > Now, vmscan pageout() is one of IO throuput degression source.
> > > Some IO workload makes very much order-0 allocation and reclaim
> > > and pageout's 4K IOs are making annoying lots seeks.
> > >
> > > At least, kswapd can avoid such pageout() because kswapd don't
> > > need to consider OOM-Killer situation. that's no risk.
> > >
> > > Signed-off-by: KOSAKI Motohiro <[email protected]>
> >
> > What's your opinion on trying to cluster the writes done by pageout,
> > instead of not doing any paging out in kswapd?
> > Something along these lines:
>
> Interesting.
> So, I'd like to review your patch carefully. can you please give me one
> day? :)

Hannes, if I remember correctly, you tried similar swap-cluster IO a
long time ago. I can't remember now why we didn't merge such a patch.
Do you remember anything?


>
>
> >
> > Cluster writes to disk due to memory pressure.
> >
> > Write out logically adjacent pages to the one we're paging out
> > so that we may get better IOs in these situations:
> > These pages are likely to be contiguous on disk to the one we're
> > writing out, so they should get merged into a single disk IO.
> >
> > Signed-off-by: Suleiman Souhlal <[email protected]>
>
>
>
>


2010-04-15 08:54:27

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 03:35:14PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > > Hi
> > >
> > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > only on order zero allocations, but allow it for higher order
> > > > allocations. That will prevent the majority of situations where
> > > > direct reclaim blows the stack and interferes with background
> > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > This reduces the scope of impact and hence testing and validation
> > > > the needs to be done.
> > >
> > > Tend to agree. but I would proposed slightly different algorithm for
> > > avoind incorrect oom.
> > >
> > > for high order allocation
> > > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> >
> > SO same as current.
>
> Yes. as same as you propsed.
>
> >
> > > for low order allocation
> > > - kswapd: always delegate io to flusher thread
> > > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> >
> > IMO, this really doesn't fix either of the problems - the bad IO
> > patterns nor the stack usage. All it will take is a bit more memory
> > pressure to trigger stack and IO problems, and the user reporting the
> > problems is generating an awful lot of memory pressure...
>
> This patch doesn't care stack usage. because
> - again, I think all stack eater shold be diet.

Agreed (again), but we've already come to the conclusion that a
stack diet is not enough.

> - under allowing lumpy reclaim world, only deny low order reclaim
> doesn't solve anything.

Yes, I suggested it *as a first step*, not as the end goal. Your
patches don't reach the first step which is fixing the reported
stack problem for order-0 allocations...

> Please don't forget priority=0 recliam failure incvoke OOM-killer.
> I don't imagine anyone want it.

Given that I haven't been able to trigger OOM without writeback from
direct reclaim so far (*) I'm not finding any evidence that it is a
problem or that there are regressions. I want to be able to say
that this change has no known regressions. I want to find the
regression and work to fix them, but without test cases there's no
way I can do this.

This is what I'm getting frustrated about - I want to fix this
problem once and for all, but I can't find out what I need to do to
robustly test such a change so we can have a high degree of
confidence that it doesn't introduce major regressions. Can anyone
help here?

(*) except in one case I've already described where it managed to
allocate enough huge pages to starve the system of order zero pages,
which is what I asked it to do.

> And, Which IO workload trigger <6 priority vmscan?

You're asking me? I've been asking you for workloads that wind up
reclaim priority.... :/

All I can say is that the most common trigger I see for OOM is
copying a large file on a busy system that is running off a single
spindle. When that happens on my laptop I walk away and get a cup
of coffee, and when I come back I pick up all the
broken bits the OOM killer left behind.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 09:32:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>
> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>
> >Now, vmscan pageout() is one of IO throuput degression source.
> >Some IO workload makes very much order-0 allocation and reclaim
> >and pageout's 4K IOs are making annoying lots seeks.
> >
> >At least, kswapd can avoid such pageout() because kswapd don't
> >need to consider OOM-Killer situation. that's no risk.
> >
> >Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> What's your opinion on trying to cluster the writes done by pageout,
> instead of not doing any paging out in kswapd?

XFS already does this in ->writepage to try to minimise the impact
of the way pageout issues IO. It helps, but it is still not as good
as having all the writeback come from the flusher threads because
it's still pretty much random IO.

And, FWIW, it doesn't solve the stack usage problems, either. In
fact, it will make them worse as write_one_page() puts another
struct writeback_control on the stack...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 09:41:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> >
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >
> > >Now, vmscan pageout() is one of IO throuput degression source.
> > >Some IO workload makes very much order-0 allocation and reclaim
> > >and pageout's 4K IOs are making annoying lots seeks.
> > >
> > >At least, kswapd can avoid such pageout() because kswapd don't
> > >need to consider OOM-Killer situation. that's no risk.
> > >
> > >Signed-off-by: KOSAKI Motohiro <[email protected]>
> >
> > What's your opinion on trying to cluster the writes done by pageout,
> > instead of not doing any paging out in kswapd?
>
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.

I haven't reviewed such a patch yet, so I'm talking about the generic case.
pageout() doesn't only write out file-backed pages, it also writes
swap-backed pages, so neither a filesystem optimization nor the flusher
thread removes the value of pageout clustering.


> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...

Correct, we need to avoid a double writeback_control on the stack.
Probably we need to split pageout() into pieces.
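A rough sketch of one possible split (hypothetical, error handling
simplified): let the clustering loop hand its neighbouring pages to a
helper that reuses pageout()'s existing writeback_control instead of
letting write_one_page() build a second one:

static pageout_t pageout_one(struct page *page,
			     struct address_space *mapping,
			     struct writeback_control *wbc)
{
	int res;

	if (!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		return PAGE_CLEAN;
	}

	SetPageReclaim(page);
	res = mapping->a_ops->writepage(page, wbc);
	if (res < 0)
		handle_write_error(mapping, page, res);
	if (res == AOP_WRITEPAGE_ACTIVATE) {
		ClearPageReclaim(page);
		return PAGE_ACTIVATE;
	}
	/* ->writepage() unlocks the page on success. */
	return PAGE_SUCCESS;
}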


2010-04-15 10:21:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

> On Thu, Apr 15, 2010 at 03:35:14PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > > > Hi
> > > >
> > > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > > only on order zero allocations, but allow it for higher order
> > > > > allocations. That will prevent the majority of situations where
> > > > > direct reclaim blows the stack and interferes with background
> > > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > > This reduces the scope of impact and hence testing and validation
> > > > > the needs to be done.
> > > >
> > > > Tend to agree. but I would proposed slightly different algorithm for
> > > > avoind incorrect oom.
> > > >
> > > > for high order allocation
> > > > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > >
> > > SO same as current.
> >
> > Yes. as same as you propsed.
> >
> > >
> > > > for low order allocation
> > > > - kswapd: always delegate io to flusher thread
> > > > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> > >
> > > IMO, this really doesn't fix either of the problems - the bad IO
> > > patterns nor the stack usage. All it will take is a bit more memory
> > > pressure to trigger stack and IO problems, and the user reporting the
> > > problems is generating an awful lot of memory pressure...
> >
> > This patch doesn't care stack usage. because
> > - again, I think all stack eater shold be diet.
>
> Agreed (again), but we've already come to the conclusion that a
> stack diet is not enough.

ok.


> > - under allowing lumpy reclaim world, only deny low order reclaim
> > doesn't solve anything.
>
> Yes, I suggested it *as a first step*, not as the end goal. Your
> patches don't reach the first step which is fixing the reported
> stack problem for order-0 allocations...

I have some diet patches as a separate series. I'll post today's diet
patches in another mail; I didn't want to mix completely unrelated patches.


> > Please don't forget priority=0 recliam failure incvoke OOM-killer.
> > I don't imagine anyone want it.
>
> Given that I haven't been able to trigger OOM without writeback from
> direct reclaim so far (*) I'm not finding any evidence that it is a
> problem or that there are regressions. I want to be able to say
> that this change has no known regressions. I want to find the
> regression and work to fix them, but without test cases there's no
> way I can do this.
>
> This is what I'm getting frustrated about - I want to fix this
> problem once and for all, but I can't find out what I need to do to
> robustly test such a change so we can have a high degree of
> confidence that it doesn't introduce major regressions. Can anyone
> help here?
>
> (*) except in one case I've already described where it mananged to
> allocate enough huge pages to starve the system of order zero pages,
> which is what I asked it to do.

Agreed, and I'm sorry about that. Probably nobody in the world has
enough VM test cases, Linux people included. Modern general-purpose
OSes are used for a really wide variety of purposes on a wide variety
of machines, so I have never seen a VM change with perfectly zero
regressions. I feel the same frustration all the time.

Much of the VM mess is there to avoid extreme starvation cases, but if
such a case can be reproduced easily, it's a VM bug ;)



> > And, Which IO workload trigger <6 priority vmscan?
>
> You're asking me? I've been asking you for workloads that wind up
> reclaim priority.... :/

??? Did I misunderstand your last mail?
You wrote

> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...

and I asked which "bad IO patterns" you meant. If that's not what you
intended, what IO pattern were you talking about?

If my understanding is correct, you asked me about cases where vmscan
hurts, and I asked you about your bad IO pattern.

Now I'm guessing your intention was "bad IO patterns" in general, not
"the IO patterns"??



> All I can say is that the most common trigger I see for OOM is
> copying a large file on a busy system that is running off a single
> spindle. When that happens on my laptop I walk away and get a cup
> of coffee when that happens and when I come back I pick up all the
> broken bits the OOM killer left behind.....

As far as I understand, you are talking about the general case rather than
a specific one, so I'll also talk about the general case. In general, I
think a slowdown is better than the OOM-killer. So, even though we need
more and more improvement, we must always take care to avoid incorrect OOM.
IOW, I'd prefer step-by-step development.



2010-04-15 10:23:12

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

Now, max_scan of shrink_inactive_list() is always passed a value less than
SWAP_CLUSTER_MAX, so we can remove the page-scanning loop in it.
This patch also helps the stack diet.

detail
- remove "while (nr_scanned < max_scan)" loop
- remove nr_freed (now, we use nr_reclaimed directly)
- remove nr_scan (now, we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead of
 using SWAP_CLUSTER_MAX

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 190 ++++++++++++++++++++++++++++-------------------------------
1 files changed, 89 insertions(+), 101 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eab6028..4de4029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
- unsigned long nr_scanned = 0;
+ unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int lumpy_reclaim = 0;
+ struct page *page;
+ unsigned long nr_taken;
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned long nr_anon;
+ unsigned long nr_file;

while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- do {
- struct page *page;
- unsigned long nr_taken;
- unsigned long nr_scan;
- unsigned long nr_freed;
- unsigned long nr_active;
- unsigned int count[NR_LRU_LISTS] = { 0, };
- int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
- unsigned long nr_anon;
- unsigned long nr_file;
-
- nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
- &page_list, &nr_scan, sc->order, mode,
- zone, sc->mem_cgroup, 0, file);
+ nr_taken = sc->isolate_pages(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+ zone, sc->mem_cgroup, 0, file);

- if (scanning_global_lru(sc)) {
- zone->pages_scanned += nr_scan;
- if (current_is_kswapd())
- __count_zone_vm_events(PGSCAN_KSWAPD, zone,
- nr_scan);
- else
- __count_zone_vm_events(PGSCAN_DIRECT, zone,
- nr_scan);
- }
+ if (scanning_global_lru(sc)) {
+ zone->pages_scanned += nr_scanned;
+ if (current_is_kswapd())
+ __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+ else
+ __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+ }

- if (nr_taken == 0)
- goto done;
+ if (nr_taken == 0)
+ goto done;

- nr_active = clear_active_flags(&page_list, count);
- __count_vm_events(PGDEACTIVATE, nr_active);
+ nr_active = clear_active_flags(&page_list, count);
+ __count_vm_events(PGDEACTIVATE, nr_active);

- __mod_zone_page_state(zone, NR_ACTIVE_FILE,
- -count[LRU_ACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_INACTIVE_FILE,
- -count[LRU_INACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_ACTIVE_ANON,
- -count[LRU_ACTIVE_ANON]);
- __mod_zone_page_state(zone, NR_INACTIVE_ANON,
- -count[LRU_INACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE,
+ -count[LRU_ACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_INACTIVE_FILE,
+ -count[LRU_INACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON,
+ -count[LRU_ACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON,
+ -count[LRU_INACTIVE_ANON]);

- nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
- nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+ nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);

- reclaim_stat->recent_scanned[0] += nr_anon;
- reclaim_stat->recent_scanned[1] += nr_file;
+ reclaim_stat->recent_scanned[0] += nr_anon;
+ reclaim_stat->recent_scanned[1] += nr_file;

- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(&zone->lru_lock);

- nr_scanned += nr_scan;
- nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+ /*
+ * If we are direct reclaiming for contiguous pages and we do
+ * not reclaim everything in the list, try again and wait
+ * for IO to complete. This will stall high-order allocations
+ * but that should be acceptable to the caller
+ */
+ if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
+ * The attempt at page out may have made some
+ * of the pages active, mark them inactive again.
*/
- if (nr_freed < nr_taken && !current_is_kswapd() &&
- lumpy_reclaim) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
- /*
- * The attempt at page out may have made some
- * of the pages active, mark them inactive again.
- */
- nr_active = clear_active_flags(&page_list, count);
- count_vm_events(PGDEACTIVATE, nr_active);
-
- nr_freed += shrink_page_list(&page_list, sc,
- PAGEOUT_IO_SYNC);
- }
+ nr_active = clear_active_flags(&page_list, count);
+ count_vm_events(PGDEACTIVATE, nr_active);

- nr_reclaimed += nr_freed;
+ nr_reclaimed += shrink_page_list(&page_list, sc,
+ PAGEOUT_IO_SYNC);
+ }

- local_irq_disable();
- if (current_is_kswapd())
- __count_vm_events(KSWAPD_STEAL, nr_freed);
- __count_zone_vm_events(PGSTEAL, zone, nr_freed);
+ local_irq_disable();
+ if (current_is_kswapd())
+ __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+ __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

- spin_lock(&zone->lru_lock);
- /*
- * Put back any unfreeable pages.
- */
- while (!list_empty(&page_list)) {
- int lru;
- page = lru_to_page(&page_list);
- VM_BUG_ON(PageLRU(page));
- list_del(&page->lru);
- if (unlikely(!page_evictable(page, NULL))) {
- spin_unlock_irq(&zone->lru_lock);
- putback_lru_page(page);
- spin_lock_irq(&zone->lru_lock);
- continue;
- }
- SetPageLRU(page);
- lru = page_lru(page);
- add_page_to_lru_list(zone, page, lru);
- if (is_active_lru(lru)) {
- int file = is_file_lru(lru);
- reclaim_stat->recent_rotated[file]++;
- }
- if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
+ spin_lock(&zone->lru_lock);
+ /*
+ * Put back any unfreeable pages.
+ */
+ while (!list_empty(&page_list)) {
+ int lru;
+ page = lru_to_page(&page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
}
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
- } while (nr_scanned < max_scan);
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (is_active_lru(lru)) {
+ int file = is_file_lru(lru);
+ reclaim_stat->recent_rotated[file]++;
+ }
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);

done:
spin_unlock_irq(&zone->lru_lock);
--
1.6.5.2


2010-04-15 10:24:14

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare

This patch is used by [3/4].

===================================
free_hot_cold_page() and __free_pages_ok() have very similar
freeing preparation. This patch consolidates it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
1 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88513c0..ba9aea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}

-static void __free_pages_ok(struct page *page, unsigned int order)
+static int free_pages_prepare(struct page *page, unsigned int order)
{
- unsigned long flags;
int i;
int bad = 0;
- int wasMlocked = __TestClearPageMlocked(page);

trace_mm_page_free_direct(page, order);
kmemcheck_free_shadow(page, order);

- for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ for (i = 0 ; i < (1 << order) ; ++i) {
+ struct page *pg = page + i;
+
+ if (PageAnon(pg))
+ pg->mapping = NULL;
+ bad += free_pages_check(pg);
+ }
if (bad)
- return;
+ return -EINVAL;

if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);

+ return 0;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, order))
+ return;
+
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
int migratetype;
int wasMlocked = __TestClearPageMlocked(page);

- trace_mm_page_free_direct(page, 0);
- kmemcheck_free_shadow(page, 0);
-
- if (PageAnon(page))
- page->mapping = NULL;
- if (free_pages_check(page))
+ if (free_pages_prepare(page, 0))
return;

- if (!PageHighMem(page)) {
- debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
- debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
- }
- arch_free_page(page, 0);
- kernel_map_pages(page, 1, 0);
-
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
local_irq_save(flags);
--
1.6.5.2


2010-04-15 10:24:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 3/4] mm: introduce free_pages_bulk

Currently, vmscan uses __pagevec_free() for batch freeing, but a
pagevec consumes a fair amount of stack (sizeof(long)*8) and the
x86_64 stack is very strictly limited.

We are therefore planning to use a page->lru list instead of a pagevec
to reduce stack usage, so introduce a new helper function.

It is similar to __pagevec_free(), but it receives a list instead of a
pagevec and does not use the pcp cache, which is a good characteristic
for vmscan.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/gfp.h | 1 +
mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..dbcac56 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr),0)

+void free_pages_bulk(struct zone *zone, struct list_head *list);
void page_alloc_init(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba9aea7..1f68832 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)

EXPORT_SYMBOL(free_pages);

+/*
+ * Frees a number of pages from the list
+ * Assumes all pages on list are in same zone and order==0.
+ *
+ * This is similar to __pagevec_free(), but receive list instead pagevec.
+ * and this don't use pcp cache. it is good characteristics for vmscan.
+ */
+void free_pages_bulk(struct zone *zone, struct list_head *list)
+{
+ unsigned long flags;
+ struct page *page;
+ struct page *page2;
+ int nr_pages = 0;
+
+ list_for_each_entry_safe(page, page2, list, lru) {
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, 0)) {
+ /* Make orphan the corrupted page. */
+ list_del(&page->lru);
+ continue;
+ }
+ if (unlikely(wasMlocked)) {
+ local_irq_save(flags);
+ free_page_mlock(page);
+ local_irq_restore(flags);
+ }
+ nr_pages++;
+ }
+
+ spin_lock_irqsave(&zone->lock, flags);
+ __count_vm_events(PGFREE, nr_pages);
+ zone->all_unreclaimable = 0;
+ zone->pages_scanned = 0;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+ list_for_each_entry_safe(page, page2, list, lru) {
+ /* have to delete it as __free_one_page list manipulates */
+ list_del(&page->lru);
+ __free_one_page(page, zone, 0, page_private(page));
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
/**
* alloc_pages_exact - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
--
1.6.5.2


2010-04-15 10:26:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 4/4] vmscan: replace the pagevec in shrink_inactive_list() with list

On x86_64, sizeof(struct pagevec) is 8*16=128 bytes, but
sizeof(struct list_head) is 8*2=16 bytes. So replacing the pagevec
with a list reduces stack usage by 112 bytes.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 22 ++++++++++++++--------
1 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4de4029..fbc26d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -93,6 +93,8 @@ struct scan_control {
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
+
+ struct list_head free_batch_list;
};

#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -641,13 +643,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
- struct pagevec freed_pvec;
int pgactivate = 0;
unsigned long nr_reclaimed = 0;

cond_resched();

- pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -822,10 +822,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
- pagevec_reinit(&freed_pvec);
- }
+ list_add(&page->lru, &sc->free_batch_list);
continue;

cull_mlocked:
@@ -849,8 +846,6 @@ keep:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
list_splice(&ret_pages, page_list);
- if (pagevec_count(&freed_pvec))
- __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1238,6 +1233,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
PAGEOUT_IO_SYNC);
}

+ /*
+ * Free unused pages.
+ */
+ free_pages_bulk(zone, &sc->free_batch_list);
+
local_irq_disable();
if (current_is_kswapd())
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1844,6 +1844,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
.nodemask = nodemask,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};

return do_try_to_free_pages(zonelist, &sc);
@@ -1864,6 +1865,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
.order = 0,
.mem_cgroup = mem,
.isolate_pages = mem_cgroup_isolate_pages,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
nodemask_t nm = nodemask_of_node(nid);

@@ -1900,6 +1902,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
.nodemask = NULL, /* we don't care the placement */
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -1976,6 +1979,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
loop_again:
total_scanned = 0;
@@ -2333,6 +2337,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.swappiness = vm_swappiness,
.order = 0,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
struct zonelist * zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
struct task_struct *p = current;
@@ -2517,6 +2522,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.swappiness = vm_swappiness,
.order = order,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
unsigned long slab_reclaimable;

--
1.6.5.2


2010-04-15 10:28:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > profiles we are seeing here....
> > > > > > > >
> > > > > > >
> > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > doing sync IO, then waiting on those pages.
> > > > > >
> > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > of doing page by page spatters of IO to the drive.
> > > >
> > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > >
> > > >
> > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > helpers that filesystems use to do this, like:
> > > > >
> > > > > filemap_write_and_wait(page->mapping);
> > > >
> > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > >
> > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > to start IO on a segment of the file, use
> > > filemap_fdatawrite_range(page->mapping, start, end)....
> >
> > That does not help the stack usage issue, the caller ends up in
> > ->writepages. From an IO perspective, it'll be better from a seek point of
> > view but from a VM perspective, it may or may not be cleaning the right pages.
> > So I think this is a red herring.
>
> If you ask it to clean a bunch of pages around the one you want to
> reclaim on the LRU, there is a good chance it will also be cleaning
> pages that are near the end of the LRU or physically close by as
> well. It's not a guarantee, but for the additional IO cost of about
> 10% wall time on that IO to clean the page you need, you also get
> 1-2 orders of magnitude other pages cleaned. That sounds like a
> win any way you look at it...
>

At worst, it'll distort the LRU ordering slightly. Let's say the
file-adjacent page you clean was near the end of the LRU. Before such a
patch, it may have gotten cleaned and done another lap of the LRU.
After, it would be reclaimed sooner. I don't know if we depend on such
behaviour (very doubtful) but it's a subtle enough change. I can't
predict what it'll do for IO congestion. Simplistically, there is more
IO so it's bad, but if the write pattern is less seeky and we needed to
write the pages anyway, it might be improved.
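
(For concreteness only, and not something any patch in this thread does:
the kind of write-around Dave describes could be sketched with the
existing helper he mentions. The function name and the 1MB window below
are made up, and as noted below this still ends up in ->writepages, so
it does nothing for the stack problem.)

/*
 * Illustrative sketch only, not from any posted patch: kick off
 * writeback on a window of the file around the page reclaim wants,
 * so nearby dirty pages go to disk in one larger, less seeky pass.
 * Assumes the page is locked so page->mapping is stable.
 */
static void reclaim_write_around(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	loff_t pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
	loff_t window = 1024 * 1024;	/* 1MB either side, arbitrary */

	if (!mapping || !mapping_cap_writeback_dirty(mapping))
		return;

	/* starts IO on the byte range but does not wait for it */
	filemap_fdatawrite_range(mapping,
				 pos > window ? pos - window : 0,
				 pos + window);
}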

> I agree that it doesn't solve the stack problem (Chris' suggestion
> that we enable the bdi flusher interface would fix this);

I'm afraid I'm not familiar with this interface. Can you point me at
some previous discussion so that I am sure I am looking at the right
thing?

> what I'm
> pointing out is that the arguments that it is too hard or there are
> no interfaces available to issue larger IO from reclaim are not at
> all valid.
>

Sure, I'm not resisting fixing this, just your first patch :) There are four
goals here

1. Reduce stack usage
2. Avoid the splicing of subsystem stack usage with direct reclaim
3. Preserve lumpy reclaim's cleaning of contiguous pages
4. Try and not drastically alter LRU aging

1 and 2 are important for you, 3 is important for me and 4 will have to
be dealt with on a case-by-case basis.

Your patch fixes 2, avoids 1, breaks 3, and I haven't thought about 4,
but I guess dirty pages can cycle around more so it'd need to be cared
for.

> > > the deepest call chain in queue_work() needs 700 bytes of stack
> > > to complete, wait_for_completion() requires almost 2k of stack space
> > > at it's deepest, the scheduler has some heavy stack users, etc,
> > > and these are all functions that appear at the top of the stack.
> > >
> >
> > The real issue here then is that stack usage has gone out of control.
>
> That's definitely true, but it shouldn't cloud the fact that most
> ppl want to kill writeback from direct reclaim, too, so killing two
> birds with one stone seems like a good idea.
>

Ah yes, but I at least will resist killing of writeback from direct
reclaim because of lumpy reclaim. Again, I recognise the seek pattern
sucks but sometimes there are specific pages we need cleaned.

> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
>
> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?
>

I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
enough or come up with an alternative fix. From the goals above it mitigates
1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
the LRU with 4 until the background cleaner or kswapd comes along.
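
(As a sketch of what that compromise would boil down to, with a
hypothetical helper name and not taken from any posted patch:)

/*
 * Hypothetical sketch of the "order-0 only" compromise: direct reclaim
 * never enters ->writepage for order-0 allocations, but lumpy reclaim
 * for higher orders may still write the specific pages it needs.
 */
static int reclaim_may_writepage(struct scan_control *sc)
{
	if (current_is_kswapd())
		return 1;	/* kswapd enters reclaim with an almost empty stack */
	return sc->order > 0;	/* lumpy reclaim needs these exact pages clean */
}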

One reason why I am edgy about this is that lumpy reclaim can kick in
for low-enough orders too like order-1 pages for stacks in some cases or
order-2 pages for network cards using jumbo frames or some wireless
cards. The network cards in particular could still cause the stack
overflow but be much harder to reproduce and detect.

> > Disabling ->writepage in direct reclaim does not guarantee that stack
> > usage will not be a problem again. From your traces, page reclaim itself
> > seems to be a big dirty hog.
>
> I couldn't agree more - the kernel still needs to be put on a stack
> usage diet, but the above would give use some breathing space to attack the
> problem before more people start to hit these problems.
>

I'd like stack reduction to be plan a because it buys time without
making the problem exclusively lumpy reclaims where it can still hit,
but is harder to reproduce.

> > > Good start, but 512 bytes will only catch select and splice read,
> > > and there are 300-400 byte functions in the above list that sit near
> > > the top of the stack....
> > >
> >
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> >
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> > for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
>
> Welcome to my world ;)
>

It's not like the brochure at all :)

> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> >
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> >
> > Maybe someone will beat me to looking at the feasibility of this.
>
> I like the idea - it really sounds like you want a fixed size,
> preallocated mempool that can't be enlarged.

Yep. It would cut down around 1K of stack usage when direct reclaim gets
involved. The "downside" would be a limitation of the number of direct
reclaimers that exist at any given time but that could be a positive in
some cases.
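
(A very rough sketch of that idea, with invented names, an NR_CPUS-sized
pool and a semaphore initialised to NR_CPUS at boot, just to make the
shape concrete:)

/*
 * Illustrative sketch only: a boot-time pool of scan_control structures
 * so direct reclaimers neither carry one on their own stack nor exceed
 * NR_CPUS concurrent reclaimers.  sema_init(&reclaim_sc_sem, NR_CPUS)
 * would be done at boot.
 */
static struct scan_control reclaim_sc_pool[NR_CPUS];
static unsigned long reclaim_sc_used[BITS_TO_LONGS(NR_CPUS)];
static struct semaphore reclaim_sc_sem;
static DEFINE_SPINLOCK(reclaim_sc_lock);

static struct scan_control *get_reclaim_sc(void)
{
	int i;

	down(&reclaim_sc_sem);		/* limits concurrent direct reclaimers */
	spin_lock(&reclaim_sc_lock);
	i = find_first_zero_bit(reclaim_sc_used, NR_CPUS);
	__set_bit(i, reclaim_sc_used);
	spin_unlock(&reclaim_sc_lock);
	memset(&reclaim_sc_pool[i], 0, sizeof(struct scan_control));
	return &reclaim_sc_pool[i];
}

static void put_reclaim_sc(struct scan_control *sc)
{
	int i = sc - reclaim_sc_pool;

	spin_lock(&reclaim_sc_lock);
	__clear_bit(i, reclaim_sc_used);
	spin_unlock(&reclaim_sc_lock);
	up(&reclaim_sc_sem);
}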

> In fact, I can probably
> use something like this in XFS to save a couple of hundred bytes of
> stack space in the worst hogs....
>
> > > > > This is the sort of thing I'm pointing at when I say that stack
> > > > > usage outside XFS has grown significantly significantly over the
> > > > > past couple of years. Given XFS has remained pretty much the same or
> > > > > even reduced slightly over the same time period, blaming XFS or
> > > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > > Regardless of the IO pattern performance issues, writeback via
> > > > > direct reclaim just uses too much stack to be safe these days...
> > > >
> > > > Yeah, My answer is simple, All stack eater should be fixed.
> > > > but XFS seems not innocence too. 3.5K is enough big although
> > > > xfs have use such amount since very ago.
> > >
> > > XFS used to use much more than that - significant effort has been
> > > put into reduce the stack footprint over many years. There's not
> > > much left to trim without rewriting half the filesystem...
> >
> > I don't think he is levelling a complain at XFS in particular - just pointing
> > out that it's heavy too. Still, we should be gratful that XFS is sort of
> > a "Stack Canary". If it dies, everyone else could be in trouble soon :)
>
> Yeah, true. Sorry if I'm being a bit too defensive here - the scars
> from previous discussions like this are showing through....
>

I guessed :)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-15 10:31:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
> Cc to Johannes
>
> > >
> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> > >
> > > > Now, vmscan pageout() is one of IO throuput degression source.
> > > > Some IO workload makes very much order-0 allocation and reclaim
> > > > and pageout's 4K IOs are making annoying lots seeks.
> > > >
> > > > At least, kswapd can avoid such pageout() because kswapd don't
> > > > need to consider OOM-Killer situation. that's no risk.
> > > >
> > > > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > >
> > > What's your opinion on trying to cluster the writes done by pageout,
> > > instead of not doing any paging out in kswapd?
> > > Something along these lines:
> >
> > Interesting.
> > So, I'd like to review your patch carefully. can you please give me one
> > day? :)
>
> Hannes, if my remember is correct, you tried similar swap-cluster IO
> long time ago. now I can't remember why we didn't merged such patch.
> Do you remember anything?

Oh, quite vividly in fact :) For a lot of swap loads the LRU order
diverged heavily from swap slot order and readaround was a waste of
time.

Of course, the patch looked good, too, but it did not match reality
that well.

I guess 'how about this patch?' won't get us as far as 'how about
those numbers/graphs of several real-life workloads? oh and here
is the patch...'.

> > > Cluster writes to disk due to memory pressure.
> > >
> > > Write out logically adjacent pages to the one we're paging out
> > > so that we may get better IOs in these situations:
> > > These pages are likely to be contiguous on disk to the one we're
> > > writing out, so they should get merged into a single disk IO.
> > >
> > > Signed-off-by: Suleiman Souhlal <[email protected]>

For random IO, LRU order will have nothing to do with mapping/disk order.

2010-04-15 10:31:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Thu, Apr 15, 2010 at 01:11:37PM +0900, KOSAKI Motohiro wrote:
> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
>

Well, there is some risk here. Direct reclaimers may now be cleaning
more pages than they had to previously, and that splices subsystems
together, increasing stack usage and causing further problems.

It might not cause OOM-killer issues but it could increase the time
dirty pages spend on the LRU.

Am I missing something?

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
> if (referenced_page)
> return PAGEREF_RECLAIM_CLEAN;
>
> + /*
> + * Delegate pageout IO to flusher thread. They can make more
> + * effective IO pattern.
> + */
> + if (current_is_kswapd())
> + return PAGEREF_RECLAIM_CLEAN;
> +
> return PAGEREF_RECLAIM;
> }
>
> --
> 1.6.5.2
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-15 11:26:14

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

> On Thu, Apr 15, 2010 at 01:11:37PM +0900, KOSAKI Motohiro wrote:
> > Now, vmscan pageout() is one of IO throuput degression source.
> > Some IO workload makes very much order-0 allocation and reclaim
> > and pageout's 4K IOs are making annoying lots seeks.
> >
> > At least, kswapd can avoid such pageout() because kswapd don't
> > need to consider OOM-Killer situation. that's no risk.
> >
>
> Well, there is some risk here. Direct reclaimers may now be cleaning
> more pages than they had to previously, and that splices subsystems
> together, increasing stack usage and causing further problems.
>
> It might not cause OOM-killer issues but it could increase the time
> dirty pages spend on the LRU.
>
> Am I missing something?

No, you are right. I fully agree with your previous mail, so I need to cool down a bit ;)






2010-04-15 13:15:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Thu, Apr 15, 2010 at 07:23:04PM +0900, KOSAKI Motohiro wrote:
> Now, max_scan of shrink_inactive_list() is always passed less than
> SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
> This patch also help stack diet.
>

Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
stack-o-meter) and got the following. The prereq patches are from
earlier in the thread with the subjects

vmscan: kill prev_priority completely
vmscan: move priority variable into scan_control

It gets

$ stack-o-meter vmlinux-vanilla vmlinux-1-2patchprereq
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-72 (-72)
function old new delta
kswapd 748 676 -72

and with this patch on top

$ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
function old new delta
shrink_zone 1232 1160 -72
kswapd 748 676 -72

X86-32 based config.

> detail
> - remove "while (nr_scanned < max_scan)" loop
> - remove nr_freed (now, we use nr_reclaimed directly)
> - remove nr_scan (now, we use nr_scanned directly)
> - rename max_scan to nr_to_scan
> - pass nr_to_scan into isolate_pages() directly instead
> using SWAP_CLUSTER_MAX
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

I couldn't spot any problems. I'd consider throwing a

WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX) in case some future change breaks
the assumptions but otherwise.

Acked-by: Mel Gorman <[email protected]>

> ---
> mm/vmscan.c | 190 ++++++++++++++++++++++++++++-------------------------------
> 1 files changed, 89 insertions(+), 101 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eab6028..4de4029 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
> * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> * of reclaimed pages
> */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone, struct scan_control *sc,
> int file)
> {
> LIST_HEAD(page_list);
> struct pagevec pvec;
> - unsigned long nr_scanned = 0;
> + unsigned long nr_scanned;
> unsigned long nr_reclaimed = 0;
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int lumpy_reclaim = 0;
> + struct page *page;
> + unsigned long nr_taken;
> + unsigned long nr_active;
> + unsigned int count[NR_LRU_LISTS] = { 0, };
> + unsigned long nr_anon;
> + unsigned long nr_file;
>
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
>
> lru_add_drain();
> spin_lock_irq(&zone->lru_lock);
> - do {
> - struct page *page;
> - unsigned long nr_taken;
> - unsigned long nr_scan;
> - unsigned long nr_freed;
> - unsigned long nr_active;
> - unsigned int count[NR_LRU_LISTS] = { 0, };
> - int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
> - unsigned long nr_anon;
> - unsigned long nr_file;
> -
> - nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
> - &page_list, &nr_scan, sc->order, mode,
> - zone, sc->mem_cgroup, 0, file);
> + nr_taken = sc->isolate_pages(nr_to_scan,
> + &page_list, &nr_scanned, sc->order,
> + lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
> + zone, sc->mem_cgroup, 0, file);
>
> - if (scanning_global_lru(sc)) {
> - zone->pages_scanned += nr_scan;
> - if (current_is_kswapd())
> - __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> - nr_scan);
> - else
> - __count_zone_vm_events(PGSCAN_DIRECT, zone,
> - nr_scan);
> - }
> + if (scanning_global_lru(sc)) {
> + zone->pages_scanned += nr_scanned;
> + if (current_is_kswapd())
> + __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> + else
> + __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> + }
>
> - if (nr_taken == 0)
> - goto done;
> + if (nr_taken == 0)
> + goto done;
>
> - nr_active = clear_active_flags(&page_list, count);
> - __count_vm_events(PGDEACTIVATE, nr_active);
> + nr_active = clear_active_flags(&page_list, count);
> + __count_vm_events(PGDEACTIVATE, nr_active);
>
> - __mod_zone_page_state(zone, NR_ACTIVE_FILE,
> - -count[LRU_ACTIVE_FILE]);
> - __mod_zone_page_state(zone, NR_INACTIVE_FILE,
> - -count[LRU_INACTIVE_FILE]);
> - __mod_zone_page_state(zone, NR_ACTIVE_ANON,
> - -count[LRU_ACTIVE_ANON]);
> - __mod_zone_page_state(zone, NR_INACTIVE_ANON,
> - -count[LRU_INACTIVE_ANON]);
> + __mod_zone_page_state(zone, NR_ACTIVE_FILE,
> + -count[LRU_ACTIVE_FILE]);
> + __mod_zone_page_state(zone, NR_INACTIVE_FILE,
> + -count[LRU_INACTIVE_FILE]);
> + __mod_zone_page_state(zone, NR_ACTIVE_ANON,
> + -count[LRU_ACTIVE_ANON]);
> + __mod_zone_page_state(zone, NR_INACTIVE_ANON,
> + -count[LRU_INACTIVE_ANON]);
>
> - nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> - nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> - __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> - __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> + nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> + nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
>
> - reclaim_stat->recent_scanned[0] += nr_anon;
> - reclaim_stat->recent_scanned[1] += nr_file;
> + reclaim_stat->recent_scanned[0] += nr_anon;
> + reclaim_stat->recent_scanned[1] += nr_file;
>
> - spin_unlock_irq(&zone->lru_lock);
> + spin_unlock_irq(&zone->lru_lock);
>
> - nr_scanned += nr_scan;
> - nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +
> + /*
> + * If we are direct reclaiming for contiguous pages and we do
> + * not reclaim everything in the list, try again and wait
> + * for IO to complete. This will stall high-order allocations
> + * but that should be acceptable to the caller
> + */
> + if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> - * If we are direct reclaiming for contiguous pages and we do
> - * not reclaim everything in the list, try again and wait
> - * for IO to complete. This will stall high-order allocations
> - * but that should be acceptable to the caller
> + * The attempt at page out may have made some
> + * of the pages active, mark them inactive again.
> */
> - if (nr_freed < nr_taken && !current_is_kswapd() &&
> - lumpy_reclaim) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> - /*
> - * The attempt at page out may have made some
> - * of the pages active, mark them inactive again.
> - */
> - nr_active = clear_active_flags(&page_list, count);
> - count_vm_events(PGDEACTIVATE, nr_active);
> -
> - nr_freed += shrink_page_list(&page_list, sc,
> - PAGEOUT_IO_SYNC);
> - }
> + nr_active = clear_active_flags(&page_list, count);
> + count_vm_events(PGDEACTIVATE, nr_active);
>
> - nr_reclaimed += nr_freed;
> + nr_reclaimed += shrink_page_list(&page_list, sc,
> + PAGEOUT_IO_SYNC);
> + }
>
> - local_irq_disable();
> - if (current_is_kswapd())
> - __count_vm_events(KSWAPD_STEAL, nr_freed);
> - __count_zone_vm_events(PGSTEAL, zone, nr_freed);
> + local_irq_disable();
> + if (current_is_kswapd())
> + __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> + __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>
> - spin_lock(&zone->lru_lock);
> - /*
> - * Put back any unfreeable pages.
> - */
> - while (!list_empty(&page_list)) {
> - int lru;
> - page = lru_to_page(&page_list);
> - VM_BUG_ON(PageLRU(page));
> - list_del(&page->lru);
> - if (unlikely(!page_evictable(page, NULL))) {
> - spin_unlock_irq(&zone->lru_lock);
> - putback_lru_page(page);
> - spin_lock_irq(&zone->lru_lock);
> - continue;
> - }
> - SetPageLRU(page);
> - lru = page_lru(page);
> - add_page_to_lru_list(zone, page, lru);
> - if (is_active_lru(lru)) {
> - int file = is_file_lru(lru);
> - reclaim_stat->recent_rotated[file]++;
> - }
> - if (!pagevec_add(&pvec, page)) {
> - spin_unlock_irq(&zone->lru_lock);
> - __pagevec_release(&pvec);
> - spin_lock_irq(&zone->lru_lock);
> - }
> + spin_lock(&zone->lru_lock);
> + /*
> + * Put back any unfreeable pages.
> + */
> + while (!list_empty(&page_list)) {
> + int lru;
> + page = lru_to_page(&page_list);
> + VM_BUG_ON(PageLRU(page));
> + list_del(&page->lru);
> + if (unlikely(!page_evictable(page, NULL))) {
> + spin_unlock_irq(&zone->lru_lock);
> + putback_lru_page(page);
> + spin_lock_irq(&zone->lru_lock);
> + continue;
> }
> - __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> - __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> -
> - } while (nr_scanned < max_scan);
> + SetPageLRU(page);
> + lru = page_lru(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (is_active_lru(lru)) {
> + int file = is_file_lru(lru);
> + reclaim_stat->recent_rotated[file]++;
> + }
> + if (!pagevec_add(&pvec, page)) {
> + spin_unlock_irq(&zone->lru_lock);
> + __pagevec_release(&pvec);
> + spin_lock_irq(&zone->lru_lock);
> + }
> + }
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
>
> done:
> spin_unlock_irq(&zone->lru_lock);
> --
> 1.6.5.2
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-15 13:34:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare

On Thu, Apr 15, 2010 at 07:24:05PM +0900, KOSAKI Motohiro wrote:
> This patch is used from [3/4]
>
> ===================================
> free_hot_cold_page() and __free_pages_ok() have very similar
> freeing preparation. This patch consolidates it.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
> 1 files changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 88513c0..ba9aea7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> spin_unlock(&zone->lock);
> }
>
> -static void __free_pages_ok(struct page *page, unsigned int order)
> +static int free_pages_prepare(struct page *page, unsigned int order)
> {

You don't appear to do anything with the return value. bool? Otherwise I
see no problems

Acked-by: Mel Gorman <[email protected]>

> - unsigned long flags;
> int i;
> int bad = 0;
> - int wasMlocked = __TestClearPageMlocked(page);
>
> trace_mm_page_free_direct(page, order);
> kmemcheck_free_shadow(page, order);
>
> - for (i = 0 ; i < (1 << order) ; ++i)
> - bad += free_pages_check(page + i);
> + for (i = 0 ; i < (1 << order) ; ++i) {
> + struct page *pg = page + i;
> +
> + if (PageAnon(pg))
> + pg->mapping = NULL;
> + bad += free_pages_check(pg);
> + }
> if (bad)
> - return;
> + return -EINVAL;
>
> if (!PageHighMem(page)) {
> debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
> @@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> arch_free_page(page, order);
> kernel_map_pages(page, 1 << order, 0);
>
> + return 0;
> +}
> +
> +static void __free_pages_ok(struct page *page, unsigned int order)
> +{
> + unsigned long flags;
> + int wasMlocked = __TestClearPageMlocked(page);
> +
> + if (free_pages_prepare(page, order))
> + return;
> +
> local_irq_save(flags);
> if (unlikely(wasMlocked))
> free_page_mlock(page);
> @@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
> int migratetype;
> int wasMlocked = __TestClearPageMlocked(page);
>
> - trace_mm_page_free_direct(page, 0);
> - kmemcheck_free_shadow(page, 0);
> -
> - if (PageAnon(page))
> - page->mapping = NULL;
> - if (free_pages_check(page))
> + if (free_pages_prepare(page, 0))
> return;
>
> - if (!PageHighMem(page)) {
> - debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
> - debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
> - }
> - arch_free_page(page, 0);
> - kernel_map_pages(page, 1, 0);
> -
> migratetype = get_pageblock_migratetype(page);
> set_page_private(page, migratetype);
> local_irq_save(flags);
> --
> 1.6.5.2
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-15 13:43:21

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > profiles we are seeing here....
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > > doing sync IO, then waiting on those pages.
> > > > > > >
> > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > of doing page by page spatters of IO to the drive.
> > > > >
> > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > > >
> > > > >
> > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > helpers that filesystems use to do this, like:
> > > > > >
> > > > > > filemap_write_and_wait(page->mapping);
> > > > >
> > > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > > >
> > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > to start IO on a segment of the file, use
> > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > >
> > > That does not help the stack usage issue, the caller ends up in
> > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > So I think this is a red herring.
> >
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
> >
>
> At worst, it'll distort the LRU ordering slightly. Lets say the the
> file-adjacent-page you clean was near the end of the LRU. Before such a
> patch, it may have gotten cleaned and done another lap of the LRU.
> After, it would be reclaimed sooner. I don't know if we depend on such
> behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.
>
> > I agree that it doesn't solve the stack problem (Chris' suggestion
> > that we enable the bdi flusher interface would fix this);
>
> I'm afraid I'm not familiar with this interface. Can you point me at
> some previous discussion so that I am sure I am looking at the right
> thing?

vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
real code needs to go....just look for the ~ marks.

I mostly meant that the bdi helper threads were the best place to add
knowledge about which pages we want to write for reclaim. We might need
to add a thread dedicated to just doing the VM's dirty work, but that's
where I would start discussing fancy new interfaces.
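
(Purely a strawman of what such an interface might look like; nothing
like it exists and every name below is invented. The point is only that
the request the VM queues is tiny, while the ->writepages work happens
later on the flusher-side thread's own stack.)

/*
 * Strawman, all names invented: reclaim hands a "clean this range of
 * this inode" request to a flusher-side thread instead of calling
 * ->writepage itself.  Assumes the page is locked so page->mapping is
 * stable.
 */
struct reclaim_wb_work {
	struct list_head	list;
	struct inode		*inode;		/* pinned with igrab() */
	pgoff_t			start;		/* first page to clean */
	long			nr_pages;	/* how much to write around it */
};

static LIST_HEAD(reclaim_wb_list);
static DEFINE_SPINLOCK(reclaim_wb_lock);
static DECLARE_WAIT_QUEUE_HEAD(reclaim_wb_wait);

/* vmscan side: queue the request and move on to other pages */
static void queue_reclaim_writeback(struct page *page)
{
	struct reclaim_wb_work *work = kzalloc(sizeof(*work), GFP_ATOMIC);

	if (!work)
		return;				/* best effort only */
	work->inode = igrab(page->mapping->host);
	if (!work->inode) {
		kfree(work);
		return;
	}
	work->start = page->index;
	work->nr_pages = 1024;			/* arbitrary cluster size */

	spin_lock(&reclaim_wb_lock);
	list_add_tail(&work->list, &reclaim_wb_list);
	spin_unlock(&reclaim_wb_lock);
	wake_up(&reclaim_wb_wait);	/* the other side does ->writepages */
}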

>
> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.
>
> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.

I'd like to add one more:

5. Don't dive into filesystem locks during reclaim.

This is different from splicing code paths together, but
the filesystem writepage code has become the center of our attempts at
doing big fat contiguous writes on disk. We push off work as late as we
can until just before the pages go down to disk.

I'll pick on ext4 and btrfs for a minute, just to broaden the scope
outside of XFS. Writepage comes along and the filesystem needs to
actually find blocks on disk for all the dirty pages it has promised to
write.

So, we start a transaction, we take various allocator locks, modify
different metadata, log changed blocks, take a break (logging is hard
work you know, need_resched() triggered by now), stuff it
all into the file's metadata, log that, and finally return.

Each of the steps above can block for a long time. Ext4 solves
this by not doing them. ext4_writepage only writes pages that
are already fully allocated on disk.

Btrfs is much more efficient at not doing them, it just returns right
away for PF_MEMALLOC.
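
(The pattern being referred to is roughly the following; a sketch of
the idea rather than btrfs's actual code:)

/*
 * Sketch of the "bail out under memory pressure" pattern: if ->writepage
 * is entered from reclaim (PF_MEMALLOC is set), put the page back on the
 * dirty lists and return, leaving the real work to the flusher threads.
 */
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* ... normal block allocation and IO submission would go here ... */
	return 0;
}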

This is a long way of saying the filesystem writepage code is the
opposite of what direct reclaim wants. Direct reclaim wants to
find free ram now, and if it does end up in the mess describe above,
it'll just get stuck for a long time on work entirely unrelated to
finding free pages.

-chris

2010-04-15 13:47:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: introduce free_pages_bulk

On Thu, Apr 15, 2010 at 07:24:53PM +0900, KOSAKI Motohiro wrote:
> Currently, vmscan uses __pagevec_free() for batch freeing, but a
> pagevec consumes a fair amount of stack (sizeof(long)*8) and the
> x86_64 stack is very strictly limited.
>
> We are therefore planning to use a page->lru list instead of a pagevec
> to reduce stack usage, so introduce a new helper function.
>
> It is similar to __pagevec_free(), but it receives a list instead of a
> pagevec and does not use the pcp cache, which is a good characteristic
> for vmscan.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> include/linux/gfp.h | 1 +
> mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 45 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4c6d413..dbcac56 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
> #define __free_page(page) __free_pages((page), 0)
> #define free_page(addr) free_pages((addr),0)
>
> +void free_pages_bulk(struct zone *zone, struct list_head *list);
> void page_alloc_init(void);
> void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
> void drain_all_pages(void);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ba9aea7..1f68832 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)
>
> EXPORT_SYMBOL(free_pages);
>
> +/*
> + * Frees a number of pages from the list
> + * Assumes all pages on list are in same zone and order==0.
> + *
> + * This is similar to __pagevec_free(), but receive list instead pagevec.
> + * and this don't use pcp cache. it is good characteristics for vmscan.
> + */
> +void free_pages_bulk(struct zone *zone, struct list_head *list)
> +{
> + unsigned long flags;
> + struct page *page;
> + struct page *page2;
> + int nr_pages = 0;
> +
> + list_for_each_entry_safe(page, page2, list, lru) {
> + int wasMlocked = __TestClearPageMlocked(page);
> +
> + if (free_pages_prepare(page, 0)) {
> + /* Make orphan the corrupted page. */
> + list_del(&page->lru);
> + continue;
> + }
> + if (unlikely(wasMlocked)) {
> + local_irq_save(flags);
> + free_page_mlock(page);
> + local_irq_restore(flags);
> + }

You could clear this under the zone->lock below before calling
__free_one_page. It'd avoid a large number of IRQ enables and disables which
are a problem on some CPUs (P4 and Itanium both blow in this regard according
to PeterZ).
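
(Something along these lines, as an untested sketch of that suggestion:
keep the flag clearing where it is, but batch the mlock accounting under
zone->lock, where IRQs are already off, instead of toggling IRQs once
per mlocked page.)

void free_pages_bulk(struct zone *zone, struct list_head *list)
{
	unsigned long flags;
	struct page *page, *page2;
	int nr_pages = 0;
	int nr_mlocked = 0;

	list_for_each_entry_safe(page, page2, list, lru) {
		/* still clear the flag before the bad-page checks... */
		int was_mlocked = __TestClearPageMlocked(page);

		if (free_pages_prepare(page, 0)) {
			/* Make orphan the corrupted page. */
			list_del(&page->lru);
			continue;
		}
		if (was_mlocked)
			nr_mlocked++;
		nr_pages++;
	}

	spin_lock_irqsave(&zone->lock, flags);
	/* ...but account for it here, with interrupts already disabled */
	__mod_zone_page_state(zone, NR_MLOCK, -nr_mlocked);
	__count_vm_events(UNEVICTABLE_MLOCKFREED, nr_mlocked);

	__count_vm_events(PGFREE, nr_pages);
	zone->all_unreclaimable = 0;
	zone->pages_scanned = 0;
	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

	list_for_each_entry_safe(page, page2, list, lru) {
		/* have to delete it as __free_one_page list manipulates */
		list_del(&page->lru);
		__free_one_page(page, zone, 0, page_private(page));
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}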

> + nr_pages++;
> + }
> +
> + spin_lock_irqsave(&zone->lock, flags);
> + __count_vm_events(PGFREE, nr_pages);
> + zone->all_unreclaimable = 0;
> + zone->pages_scanned = 0;
> + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
> +
> + list_for_each_entry_safe(page, page2, list, lru) {
> + /* have to delete it as __free_one_page list manipulates */
> + list_del(&page->lru);
> + __free_one_page(page, zone, 0, page_private(page));
> + }

This has the effect of bypassing the per-cpu lists as well as making the
zone lock hotter. The cache hotness of the data within the page is
probably not a factor but the cache hotness of the stuct page is.

The zone lock getting hotter is a greater problem. Large amounts of page
reclaim or dumping of page cache will now contend on the zone lock where
as previously it would have dumped into the per-cpu lists (potentially
but not necessarily avoiding the zone lock).

While there might be a stack saving in the next patch, there would appear
to be definite performance implications in taking this patch.

Functionally, I see no problem but I'd put this sort of patch on the
very long finger until the performance aspects of it could be examined.

> + spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> /**
> * alloc_pages_exact - allocate an exact number physically-contiguous pages.
> * @size: the number of bytes to allocate
> --
> 1.6.5.2
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-15 14:57:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

Dave Chinner <[email protected]> writes:
>
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where

And also stop it always with 4K stacks.

> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.

-Andi
--
[email protected] -- Speaking for myself only.

2010-04-15 15:01:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

Mel Gorman <[email protected]> writes:
>
> $ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
> function old new delta
> shrink_zone 1232 1160 -72
> kswapd 748 676 -72

And the next time someone adds a new feature to these code paths or
the compiler inlines differently these 72 bytes are easily there
again. It's not really a long term solution. Code is tending to get
more complicated all the time. I consider it unlikely this trend will
stop any time soon.

So just doing some stack micro optimizations doesn't really help
all that much.

-Andi

--
[email protected] -- Speaking for myself only.

2010-04-15 15:45:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Thu, Apr 15, 2010 at 05:01:36PM +0200, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
> >
> > $ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
> > add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
> > function old new delta
> > shrink_zone 1232 1160 -72
> > kswapd 748 676 -72
>
> And the next time someone adds a new feature to these code paths or
> the compiler inlines differently these 72 bytes are easily there
> again. It's not really a long term solution. Code is tending to get
> more complicated all the time. I consider it unlikely this trend will
> stop any time soon.
>

The same logic applies when/if page writeback is split so that it is
handled by a separate thread.

> So just doing some stack micro optimizations doesn't really help
> all that much.
>

It's a buying-time venture, I'll agree, but as both approaches are only
about reducing stack usage they wouldn't be long-term solutions by your
criteria. What do you suggest?


--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-15 16:54:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

> It's a buying-time venture, I'll agree but as both approaches are only
> about reducing stack stack they wouldn't be long-term solutions by your
> criteria. What do you suggest?

(from easy to more complicated):

- Disable direct reclaim with 4K stacks
- Do direct reclaim only on separate stacks
- Add interrupt stacks to any 8K stack architectures.
- Get rid of 4K stacks completely
- Think about any other stackings that could give large scale recursion
and find ways to run them on separate stacks too.
- Long term: maybe we need 16K stacks at some point, depending on how
good the VM gets. Alternative would be to stop making Linux more complicated,
but that's unlikely to happen.


-Andi
--
[email protected] -- Speaking for myself only.

2010-04-15 17:24:34

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd


On Apr 15, 2010, at 3:30 AM, Johannes Weiner wrote:

> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>>
>> Hannes, if my remember is correct, you tried similar swap-cluster IO
>> long time ago. now I can't remember why we didn't merged such patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :) For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads? oh and here
> is the patch...'.
>
>>>> Cluster writes to disk due to memory pressure.
>>>>
>>>> Write out logically adjacent pages to the one we're paging out
>>>> so that we may get better IOs in these situations:
>>>> These pages are likely to be contiguous on disk to the one
>>>> we're
>>>> writing out, so they should get merged into a single disk IO.
>>>>
>>>> Signed-off-by: Suleiman Souhlal <[email protected]>
>
> For random IO, LRU order will have nothing to do with mapping/disk
> order.

Right, that's why the patch writes out contiguous pages in mapping
order.

If they are contiguous on disk with the original page, then writing them
out as well should be essentially free (when it comes to disk time).
There is almost no waste of memory regardless of the access patterns, as
far as I can tell.

This patch is just a proof of concept and could be improved by getting
help from the filesystem/swap code to ensure that the additional pages
we're writing out really are contiguous with the original one.

-- Suleiman

2010-04-15 17:27:12

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd


On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:

> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>>
>>> Now, vmscan pageout() is one of IO throuput degression source.
>>> Some IO workload makes very much order-0 allocation and reclaim
>>> and pageout's 4K IOs are making annoying lots seeks.
>>>
>>> At least, kswapd can avoid such pageout() because kswapd don't
>>> need to consider OOM-Killer situation. that's no risk.
>>>
>>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>>
>> What's your opinion on trying to cluster the writes done by pageout,
>> instead of not doing any paging out in kswapd?
>
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.

Doesn't the randomness become irrelevant if you can cluster enough
pages?

> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...

Sorry, this patch was not meant to solve the stack usage problems.

-- Suleiman

2010-04-15 17:51:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>
> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered a by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.

This is a real problem, BTW. One of the problems we've been fighting
inside Google is that because ext4_writepage() refuses to write pages
that are subject to delayed allocation, it can cause the OOM killer to
get invoked.

I had thought this was because of some evil games we're playing for
container support that make zones small, but just last night at the
LF Collaboration Summit reception, I ran into a technologist from a
major financial industry customer who reported to me that when they
tried using ext4, they ran into the exact same problem because they
were running Oracle, which was pinning down 3 gigs of memory, and then
when they tried writing a very big file using ext4, they had the same
problem of writepage() not being able to reclaim enough pages, so the
kernel fell back to invoking the OOM killer, and things got ugly in a
hurry...

One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris Mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint. Maybe the VM guys
don't care about this, but this sort of thing tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.

- Ted

2010-04-15 18:23:11

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Thu, 15 Apr 2010 14:15:33 BST, Mel Gorman said:

> Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> stack-o-meter) and got the following. The prereq patches are from
> earlier in the thread with the subjects

Think that's a script worth having in-tree?



2010-04-15 23:33:57

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
>
> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
>
> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> >>
> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >>
> >>>Now, vmscan pageout() is one of IO throuput degression source.
> >>>Some IO workload makes very much order-0 allocation and reclaim
> >>>and pageout's 4K IOs are making annoying lots seeks.
> >>>
> >>>At least, kswapd can avoid such pageout() because kswapd don't
> >>>need to consider OOM-Killer situation. that's no risk.
> >>>
> >>>Signed-off-by: KOSAKI Motohiro <[email protected]>
> >>
> >>What's your opinion on trying to cluster the writes done by pageout,
> >>instead of not doing any paging out in kswapd?
> >
> >XFS already does this in ->writepage to try to minimise the impact
> >of the way pageout issues IO. It helps, but it is still not as good
> >as having all the writeback come from the flusher threads because
> >it's still pretty much random IO.
>
> Doesn't the randomness become irrelevant if you can cluster enough
> pages?

No. If you are doing full disk seeks between random chunks, then you
still lose a large amount of throughput. e.g. if the seek time is
10ms and your IO time is 10ms for each 4k page, then increasing the
size to 64k makes it 10ms seek and 12ms for the IO. We might increase
throughput but we are still limited to 100 IOs per second. We've
gone from 400kB/s to 6MB/s, but that's still an order of magnitude
short of the 100MB/s that full size IOs with little in the way of seeks
between them will achieve on the same spindle...
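
To make that arithmetic easy to poke at, here is a tiny stand-alone
calculation using the numbers above (10ms average seek, and a media transfer
rate of roughly 100MB/s implied by the "full size IO" figure). The exact
rates are assumptions for illustration, not measurements.

#include <stdio.h>

int main(void)
{
	const double seek_ms = 10.0;      /* assumed average seek time */
	const double media_mb_s = 100.0;  /* assumed sequential transfer rate */
	const double sizes_kb[] = { 4, 64, 1024, 4096 };
	int i;

	for (i = 0; i < 4; i++) {
		double xfer_ms = (sizes_kb[i] / 1024.0) / media_mb_s * 1000.0;
		double io_per_s = 1000.0 / (seek_ms + xfer_ms);
		double mb_per_s = io_per_s * sizes_kb[i] / 1024.0;

		printf("%6.0fkB chunks: %5.1f IO/s, %6.2f MB/s\n",
		       sizes_kb[i], io_per_s, mb_per_s);
	}
	return 0;
}

With those assumptions it prints roughly 0.4MB/s for 4k chunks and 6MB/s for
64k chunks, matching the figures above, and shows that the chunks have to
reach multi-megabyte sizes before the seek cost stops dominating.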

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 23:40:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree but as both approaches are only
> > about reducing stack stack they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
>
> (from easy to more complicated):
>
> - Disable direct reclaim with 4K stacks

Just to re-iterate: we're blowing the stack with direct reclaim on
x86_64 w/ 8k stacks. The old i386/4k stack problem is a red
herring.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-15 23:42:00

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Thu, Apr 15, 2010 at 4:33 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
>>
>> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>> >>
>> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> >>
>> >>>Now, vmscan pageout() is one of IO throuput degression source.
>> >>>Some IO workload makes very much order-0 allocation and reclaim
>> >>>and pageout's 4K IOs are making annoying lots seeks.
>> >>>
>> >>>At least, kswapd can avoid such pageout() because kswapd don't
>> >>>need to consider OOM-Killer situation. that's no risk.
>> >>>
>> >>>Signed-off-by: KOSAKI Motohiro <[email protected]>
>> >>
>> >>What's your opinion on trying to cluster the writes done by pageout,
>> >>instead of not doing any paging out in kswapd?
>> >
>> >XFS already does this in ->writepage to try to minimise the impact
>> >of the way pageout issues IO. It helps, but it is still not as good
>> >as having all the writeback come from the flusher threads because
>> >it's still pretty much random IO.
>>
>> Doesn't the randomness become irrelevant if you can cluster enough
>> pages?
>
> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size ito 64k makes it 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s full size IOs with little in way of seeks
> between them will acheive on the same spindle...

What I meant was that, theoretically speaking, you could increase the
maximum number of pages that get clustered so that you could get
100MB/s, although it most likely wouldn't be a good idea with the
current patch.

-- Suleiman

2010-04-16 01:17:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Tue, 13 Apr 2010 10:17:58 +1000
Dave Chinner <[email protected]> wrote:

> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
>
> Reported-by: John Berthels <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>

Hmm. Then, if a memory cgroup is filled with dirty pages, it can't kick
writeback and has to wait for someone else's writeback?

How long will this take?
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/A
# echo 20M > /cgroup/A/memory.limit_in_bytes
# echo $$ > /cgroup/A/tasks
# dd if=/dev/zero of=./tmpfile bs=4096 count=1000000

Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?
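
For what it's worth, the only lever currently available from a memcg reclaim
path is the same global flusher wakeup this patch relies on. A minimal sketch
of such a kick is below; mem_cgroup_kick_writeback() is a purely hypothetical
hook, and the point, as per the question above, is that
wakeup_flusher_threads() cleans pages system-wide, not the dirty pages
belonging to the cgroup that is over its limit.

#include <linux/writeback.h>

/* Hypothetical: called when memcg reclaim keeps finding only dirty pages. */
static void mem_cgroup_kick_writeback(unsigned long nr_dirty_scanned)
{
	/*
	 * Untargeted: this asks the per-bdi flusher threads to clean some
	 * number of pages globally; the caller then has to wait (e.g. via
	 * congestion_wait()) and hope some of them belonged to this cgroup.
	 */
	wakeup_flusher_threads(nr_dirty_scanned);
}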

Thanks,
-Kame

> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
>

2010-04-16 04:14:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
>
> At worst, it'll distort the LRU ordering slightly. Lets say the the
> file-adjacent-page you clean was near the end of the LRU. Before such a
> patch, it may have gotten cleaned and done another lap of the LRU.
> After, it would be reclaimed sooner. I don't know if we depend on such
> behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.

Fundamentally, we have so many pages on the LRU that getting a few out
of order at the back end of it is going to be in the noise. If we
trade off "perfect" LRU behaviour for cleaning pages an order of
magnitude faster, reclaim will find candidate pages for a whole lot
faster. And if we have more clean pages available, faster, overall
system throughput is going to improve and be much less likely to
fall into deep, dark holes where the OOM-killer is the light at the
end.....

[ snip questions Chris answered ]

> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.

#4 is important to me, too, because that has direct impact on large
file IO workloads. However, it is gross changes in behaviour that
concern me, not subtle, probably-in-the-noise changes that you're
concerned about. :)

> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.

Well, you keep saying that they break #3, but I haven't seen any
test cases or results showing that. I've been unable to confirm that
lumpy reclaim is broken by disallowing writeback in my testing, so
I'm interested to know what tests you are running that show it is
broken...

> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
> >
> > Then we can work towards allowing lumpy reclaim to use background
> > threads as Chris suggested for doing specific writeback operations
> > to solve the remaining problems being seen. Does this seem like a
> > reasonable compromise and approach to dealing with the problem?
> >
>
> I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> enough or come up with an alternative fix. From the goals above it mitigates
> 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> the LRU with 4 until the background cleaner or kswapd comes along.

We've been through this already, but I'll repeat it again in the
hope it sinks in: reducing stack usage is not sufficient to stay
within an 8k stack if we can enter writeback with an arbitrary
amount of stack already consumed.

We've already got a report of 9k of stack usage (7200 bytes left on
an order-2 stack) and this is without a complex storage stack - it's
just a partition on a SATA drive. We can easily add another 1k,
possibly 2k to that stack depth with a complex storage subsystem.
Trimming this much (3-4k) is simply not feasible in a callchain that
is 50-70 functions deep...

> One reason why I am edgy about this is that lumpy reclaim can kick in
> for low-enough orders too like order-1 pages for stacks in some cases or
> order-2 pages for network cards using jumbo frames or some wireless
> cards. The network cards in particular could still cause the stack
> overflow but be much harder to reproduce and detect.

So push lumpy reclaim into a separate thread. It already blocks, so
waiting for some other thread to do the work won't change anything.
Separating high-order reclaim from LRU reclaim is probably a good
idea, anyway - they use different algorithms and while the two are
intertwined it's hard to optimise/improve either....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-16 04:23:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Fri, 16 Apr 2010 10:13:39 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Hmm. Then, if memoy cgroup is filled by dirty pages, it can't kick writeback
> and has to wait for someone else's writeback ?
>
> How long this will take ?
> # mount -t cgroup none /cgroup -o memory
> # mkdir /cgroup/A
> # echo 20M > /cgroup/A/memory.limit_in_bytes
> # echo $$ > /cgroup/A/tasks
> # dd if=/dev/zero of=./tmpfile bs=4096 count=1000000
>
> Can memcg ask writeback thread to "Wake Up Now! and Write this out!" effectively ?
>

Hmm.. I saw an oom-kill while testing several cases, but performance itself
does not seem far different with or without the patch.
I'm unhappy with the oom-kill, though, so some tweak for memcg will be
necessary if we go with this.

Thanks,
-Kame

2010-04-16 07:13:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree but as both approaches are only
> > > about reducing stack stack they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> >
> > (from easy to more complicated):
> >
> > - Disable direct reclaim with 4K stacks
>
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64 w/ 8k stacks. The old i386/4k stack problem is a red
> herring.

Yes, that's known, but with 4K stacks it definitely will not work at all.

-Andi
--
[email protected] -- Speaking for myself only.

2010-04-16 09:40:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Thu, Apr 15, 2010 at 02:22:01PM -0400, [email protected] wrote:
> On Thu, 15 Apr 2010 14:15:33 BST, Mel Gorman said:
>
> > Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> > stack-o-meter) and got the following. The prereq patches are from
> > earlier in the thread with the subjects
>
> Think that's a script worth having in-tree?

Ahh, it's a hatchet-job at the moment. I copied bloat-o-meter and
altered one function. I made a TODO note to extend bloat-o-meter
properly and that would be worth merging.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-04-16 09:46:42

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size ito 64k makes it 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s full size IOs with little in way of seeks
> between them will acheive on the same spindle...

The usual arm-waving numbers for ops/sec for an ATA disk are in the 200
ops/sec range, so that seems horribly credible.

But then I've never quite understood why our anonymous paging isn't
sorting stuff as best it can and then using the drive as a log structure
with in-memory metadata so it can stream the pages onto disk. Read
performance is going to be similar (maybe better if you have a log tidy
when idle); writes ought to be far better.
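
The idea Alan sketches - treat swap as an append-only log with the placement
metadata kept in memory - can be modelled in a few lines. The toy below is
purely illustrative (user-space arrays standing in for a real device and a
real index structure); the point is only that write-out becomes sequential
regardless of which pages the LRU picks, at the cost of a map lookup on read
and an eventual log-tidy pass.

#include <stdio.h>

#define NR_PAGES	8	/* pretend address space, in pages */
#define LOG_SLOTS	16	/* pretend swap device, in page slots */

static long page_to_slot[NR_PAGES];	/* in-memory metadata */
static long log_head;			/* next free slot in the log */

/* Page-out: always append at the head, remember where the page went. */
static long log_write(int page)
{
	long slot = log_head++ % LOG_SLOTS;	/* toy: no tidy/GC pass */

	page_to_slot[page] = slot;
	return slot;
}

int main(void)
{
	int victims[] = { 5, 1, 7, 2 };		/* LRU picks "random" pages */
	int i;

	for (i = 0; i < NR_PAGES; i++)
		page_to_slot[i] = -1;
	for (i = 0; i < 4; i++)
		printf("page %d -> slot %ld (sequential on disk)\n",
		       victims[i], log_write(victims[i]));
	return 0;
}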

Alan

2010-04-16 14:55:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree but as both approaches are only
> > about reducing stack stack they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
>
> (from easy to more complicated):
>
> - Disable direct reclaim with 4K stacks

Do not like. While I can see why 4K stacks are a serious problem, I'd
sooner see 4K stacks disabled than have the kernel behave so differently
for direct reclaim. It'd be tricky to spot regressions in reclaim that
were due to this .config option.

> - Do direct reclaim only on separate stacks

This is looking more and more attractive.

> - Add interrupt stacks to any 8K stack architectures.

This is a similar but separate problem. It's similar in that interrupt
stacks can splice subsystems together in terms of stack usage.

> - Get rid of 4K stacks completely

Why would we *not* do this? I can't remember the original reasoning
behind 4K stacks but am guessing it helped startup times for
fork-oriented workloads in the days before lumpy reclaim and better
fragmentation control.

Who typically enables this option?

> - Think about any other stackings that could give large scale recursion
> and find ways to run them on separate stacks too.

The patch series I threw up about reducing stack was a cut-down
approach. Instead of using separate stacks, keep the stack usage out of
the main caller path where possible.

> - Long term: maybe we need 16K stacks at some point, depending on how
> good the VM gets. Alternative would be to stop making Linux more complicated,
> but that's unlikely to happen.
>

Make this Plan D if nothing else works out and we still hit a wall?

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-04-16 14:57:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree but as both approaches are only
> > > about reducing stack stack they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> >
> > (from easy to more complicated):
> >
> > - Disable direct reclaim with 4K stacks
>
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64 w/ 8k stacks.

Yep, that is not being disputed. By the way, what did you use to
generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
I used a modified bloat-o-meter to gather my data but it'd be nice to
be sure I'm seeing the same things as you (minus XFS unless I
specifically set it up).

> The old i386/4k stack problem is a red
> herring.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-04-16 15:05:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > > profiles we are seeing here....
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > > > doing sync IO, then waiting on those pages.
> > > > > > > >
> > > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > > of doing page by page spatters of IO to the drive.
> > > > > >
> > > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > > > >
> > > > > >
> > > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > > helpers that filesystems use to do this, like:
> > > > > > >
> > > > > > > filemap_write_and_wait(page->mapping);
> > > > > >
> > > > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > > > >
> > > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > > to start IO on a segment of the file, use
> > > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > > >
> > > > That does not help the stack usage issue, the caller ends up in
> > > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > > So I think this is a red herring.
> > >
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> > >
> >
> > At worst, it'll distort the LRU ordering slightly. Lets say the the
> > file-adjacent-page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
> >
> > > I agree that it doesn't solve the stack problem (Chris' suggestion
> > > that we enable the bdi flusher interface would fix this);
> >
> > I'm afraid I'm not familiar with this interface. Can you point me at
> > some previous discussion so that I am sure I am looking at the right
> > thing?
>
> vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> real code needs to go....just look for the ~ marks.
>

I must be blind. What tree is this in? I can't see it in v2.6.34-rc4,
mmotm or google.

> I mostly meant that the bdi helper threads were the best place to add
> knowledge about which pages we want to write for reclaim. We might need
> to add a thread dedicated to just doing the VM's dirty work, but that's
> where I would start discussing fancy new interfaces.
>
> >
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
> >
> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>

Good add. It's not a new problem either. This came up at least two years
ago at around the first VM/FS summit and the response was along the lines
of shuffling uncomfortably :/

> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered a by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.
>
> This is a long way of saying the filesystem writepage code is the
> opposite of what direct reclaim wants. Direct reclaim wants to
> find free ram now, and if it does end up in the mess describe above,
> it'll just get stuck for a long time on work entirely unrelated to
> finding free pages.
>

Ok, good summary, thanks. I was only partially aware of some of these,
i.e. I knew it was a problem but was not sensitive to how bad it was.
Your last point is interesting because lumpy reclaim for large orders under
heavy pressure can make the system stutter badly (e.g. during a huge
page pool resize). I had blamed just plain IO, but messing around with
locks and transactions could have been a large factor and I didn't go
looking for it.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-04-16 15:14:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Fri, Apr 16, 2010 at 02:14:12PM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> >
> > At worst, it'll distort the LRU ordering slightly. Lets say the the
> > file-adjacent-page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
>
> Fundamentally, we have so many pages on the LRU, getting a few out
> of order at the back end of it is going to be in the noise. If we
> trade off "perfect" LRU behaviour for cleaning pages an order of

haha, I don't think anyone pretends the LRU behaviour is perfect.
Altering its existing behaviour tends to be done with great care but
from what I gather that is often a case of "better the devil you know".

> magnitude faster, reclaim will find candidate pages for a whole lot
> faster. And if we have more clean pages available, faster, overall
> system throughput is going to improve and be much less likely to
> fall into deep, dark holes where the OOM-killer is the light at the
> end.....
>
> [ snip questions Chris answered ]
>
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
>
> #4 is important to me, too, because that has direct impact on large
> file IO workloads. however, it is gross changes in behaviour that
> concern me, not subtle, probably-in-the-noise changes that you're
> concerned about. :)
>

I'm also less concerned with this aspect. I brought it up because it was
a factor. I don't think it'll cause us problems but if problems do
arise, it's nice to have a few potential candidates to examine in
advance.

> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> Well, you keep saying that they break #3, but I haven't seen any
> test cases or results showing that. I've been unable to confirm that
> lumpy reclaim is broken by disallowing writeback in my testing, so
> I'm interested to know what tests you are running that show it is
> broken...
>

Ok, I haven't actually tested this. The machines I use are tied up
retesting the compaction patches at the moment. The reason why I reckon
it'll be a problem is that when these sync-writeback changes were
introduced, they significantly helped lumpy reclaim for huge pages. I am
making an assumption that backing out those changes will hurt it.

I'll test for real on Monday and see what falls out.

> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> > >
> > > Then we can work towards allowing lumpy reclaim to use background
> > > threads as Chris suggested for doing specific writeback operations
> > > to solve the remaining problems being seen. Does this seem like a
> > > reasonable compromise and approach to dealing with the problem?
> > >
> >
> > I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> > enough or come up with an alternative fix. From the goals above it mitigates
> > 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> > the LRU with 4 until the background cleaner or kswapd comes along.
>
> We've been through this already, but I'll repeat it again in the
> hope it sinks in: reducing stack usage is not sufficient to stay
> within an 8k stack if we can enter writeback with an arbitrary
> amount of stack already consumed.
>
> We've already got a report of 9k of stack usage (7200 bytes left on
> a order-2 stack) and this is without a complex storage stack - it's
> just a partition on a SATA drive. We can easily add another 1k,
> possibly 2k to that stack depth with a complex storage subsystem.
> Trimming this much (3-4k) is simply not feasible in a callchain that
> is 50-70 functions deep...
>

Ok, based on this, I'll stop working on the stack-reduction patches.
I'll test what I have and push it but I won't bring it further for the
moment and instead look at putting writeback into its own thread. If
someone else works on it in the meantime, I'll review and test from the
perspective of lumpy reclaim.

> > One reason why I am edgy about this is that lumpy reclaim can kick in
> > for low-enough orders too like order-1 pages for stacks in some cases or
> > order-2 pages for network cards using jumbo frames or some wireless
> > cards. The network cards in particular could still cause the stack
> > overflow but be much harder to reproduce and detect.
>
> So push lumpy reclaim into a separate thread. It already blocks, so
> waiting for some other thread to do the work won't change anything.

No, it wouldn't. As long as it can wait on the right pages, it doesn't
really matter who does the work.

> Separating high-order reclaim from LRU reclaim is probably a good
> idea, anyway - they use different algorithms and while the two are
> intertwined it's hard to optimise/improve either....
>

They are not a million miles apart either. Lumpy reclaim uses the LRU to
select a cursor page and then reclaims around it. Improvements to the LRU
tend to help lumpy reclaim as well. It's why, during the tests I run, I can
often allocate 80-95% of memory as huge pages on x86-64, as opposed to when
anti-frag was first being developed, when getting 30% was a cause for
celebration :)
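
For readers not familiar with the mechanism Mel refers to, the neighbourhood
selection is essentially "take the cursor page's naturally aligned,
order-sized block of pfns and try to isolate everything in it". A simplified
sketch of that selection (not the exact isolate_lru_pages() code) is below;
it just counts the dirty neighbours, which are exactly the pages that force
the pageout()-from-reclaim behaviour being debated in this thread.

#include <linux/mm.h>
#include <linux/mmzone.h>

/* Count dirty pages in the cursor page's aligned order-sized pfn block. */
static unsigned long lumpy_block_dirty_pages(struct page *cursor,
					     unsigned int order)
{
	unsigned long pfn = page_to_pfn(cursor);
	unsigned long start = pfn & ~((1UL << order) - 1);
	unsigned long end = start + (1UL << order);
	unsigned long nr_dirty = 0;

	for (pfn = start; pfn < end; pfn++) {
		if (!pfn_valid(pfn))
			break;		/* hole in the memory map */
		/*
		 * The real isolation code also checks the zone and
		 * PageLRU() here before taking the page off the list.
		 */
		if (PageDirty(pfn_to_page(pfn)))
			nr_dirty++;
	}
	return nr_dirty;
}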

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-04-16 23:56:35

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Thu, Apr 15, 2010 at 11:43:48AM +0900, KOSAKI Motohiro wrote:
> > I already have some patches to remove trivial parts of struct scan_control,
> > namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
> > needs a deeper look.
>
> Seems interesting. but scan_control diet is not so effective. How much
> bytes can we diet by it?

Not much, it cuts 16 bytes on x86 32 bit. The bigger gain is the code
clarification it comes with. There is too much state to keep track of
in reclaim.

2010-04-17 02:38:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()

On Fri, Apr 16, 2010 at 03:57:07PM +0100, Mel Gorman wrote:
> On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> > On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > > It's a buying-time venture, I'll agree but as both approaches are only
> > > > about reducing stack stack they wouldn't be long-term solutions by your
> > > > criteria. What do you suggest?
> > >
> > > (from easy to more complicated):
> > >
> > > - Disable direct reclaim with 4K stacks
> >
> > Just to re-iterate: we're blowing the stack with direct reclaim on
> > x86_64 w/ 8k stacks.
>
> Yep, that is not being disputed. By the way, what did you use to
> generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
> I used a modified bloat-o-meter to gather my data but it'd be nice to
> be sure I'm seeing the same things as you (minus XFS unless I
> specifically set it up).

I'm using the tracing subsystem to get them. Doesn't everyone use
that now? ;)

$ grep STACK .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_STACK_TRACER=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

Then:

# echo 1 > /proc/sys/kernel/stack_tracer_enabled

<run workloads>

Monitor the worst recorded stack usage as it changes via:

# cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (44 entries)
----- ---- --------
0) 5584 288 get_page_from_freelist+0x5c0/0x830
1) 5296 272 __alloc_pages_nodemask+0x102/0x730
2) 5024 48 kmem_getpages+0x62/0x160
3) 4976 96 cache_grow+0x308/0x330
4) 4880 96 cache_alloc_refill+0x27f/0x2c0
5) 4784 96 __kmalloc+0x241/0x250
6) 4688 112 vring_add_buf+0x233/0x420
......


Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-17 03:07:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Fri, Apr 16, 2010 at 10:50:02AM +0100, Alan Cox wrote:
> > No. If you are doing full disk seeks between random chunks, then you
> > still lose a large amount of throughput. e.g. if the seek time is
> > 10ms and your IO time is 10ms for each 4k page, then increasing the
> > size ito 64k makes it 10ms seek and 12ms for the IO. We might increase
> > throughput but we are still limited to 100 IOs per second. We've
> > gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> > short of the 100MB/s full size IOs with little in way of seeks
> > between them will acheive on the same spindle...
>
> The usual armwaving numbers for ops/sec for an ATA disk are in the 200
> ops/sec range so that seems horribly credible.

Yeah, in my experience 7200rpm SATA will get you 200 ops/s when you
are doing really small seeks, as the typical minimum seek time is
around 4-5ms. Average seek time, however, is usually in the range of
10ms, because full head sweep + spindle rotation seeks take on the
order of 15ms.

Hence small random IO tends to result in seek times nearer the
average seek time than the minimum, so that's what I tend to use for
determining the number of ops/s a disk will sustain.

> But then I've never quite understood why our anonymous paging isn't
> sorting stuff as best it can and then using the drive as a log structure
> with in memory metadata so it can stream the pages onto disk. Read
> performance is goig to be similar (maybe better if you have a log tidy
> when idle), write ought to be far better.

Sounds like a worthy project for someone to sink their teeth into.
Lots of people would like to have a system that can page out at
hundreds of megabytes a second....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-18 03:35:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback


There are two issues here: stack utilisation and poor IO patterns in
direct reclaim. They are different.

The poor IO patterns thing is a regression. Some time several years
ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
dirty-page writeback than it used to. AFAIK nobody attempted to work
out why, nor attempted to try to fix it.


Doing writearound in pageout() might help. The kernel was in fact
doing that around 2.5.10, but I took it out again because it wasn't
obviously beneficial.

Writearound is hard to do, because direct-reclaim doesn't have an easy
way of pinning the address_space: it can disappear and get freed under
your feet. I was able to make this happen under intense MM loads. The
current page-at-a-time pageout code pins the address_space by taking a
lock on one of its pages. Once that lock is released, we cannot touch
*mapping.

And lo, the pageout() code is presently buggy:

	res = mapping->a_ops->writepage(page, &wbc);
	if (res < 0)
		handle_write_error(mapping, page, res);

The ->writepage can/will unlock the page, and we're passing a hand
grenade into handle_write_error().

Any attempt to implement writearound in pageout will need to find a way
to safely pin that address_space. One way is to take a temporary ref
on mapping->host, but IIRC that introduced nasties with inode_lock.
Certainly it'll put more load on that worrisomely-singleton lock.


Regarding simply not doing any writeout in direct reclaim (Dave's
initial proposal): the problem is that pageout() will clean a page in
the target zone. Normal writeout won't do that, so we could get into a
situation where vast amounts of writeout is happening, but none of it
is cleaning pages in the zone which we're trying to allocate from.
It's quite possibly livelockable, too.

Doing writearound (if we can get it going) will solve that adequately
(assuming that the target page gets reliably written), but it won't
help the stack usage problem.


To solve the IO-pattern thing I really do think we should first work
out ytf we started doing much more IO off the LRU. What caused it? Is
it really unavoidable?


To solve the stack-usage thing: dunno, really. One could envisage code
which skips pageout() if we're using more than X amount of stack, but
that sucks. Another possibility might be to hand the target page over
to another thread (I suppose kswapd will do) and then synchronise with
that thread - get_page()+wait_on_page_locked() is one way. The helper
thread could of course do writearound.
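
A minimal sketch of that hand-off idea is below. queue_kswapd_pageout() is a
hypothetical queueing function standing in for whatever mechanism would be
used; nothing with that name exists, and this is only meant to make the
synchronisation Andrew describes concrete.

#include <linux/types.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical hand-off; returns true if the helper accepted the page. */
extern bool queue_kswapd_pageout(struct page *page);

/*
 * Called from direct reclaim with the page already locked, exactly where
 * pageout() would otherwise call ->writepage on an already-deep stack.
 */
static int pageout_via_helper(struct page *page)
{
	get_page(page);			/* keep the page pinned across the wait */
	if (!queue_kswapd_pageout(page)) {	/* hypothetical hand-off */
		put_page(page);
		return -EBUSY;		/* fall back to current behaviour */
	}
	/* The helper's ->writepage call unlocks the page once IO is issued. */
	wait_on_page_locked(page);
	put_page(page);
	return 0;
}

The attraction is that the deep filesystem callchain then runs on the
helper's fresh stack, while direct reclaim only sleeps.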

2010-04-18 19:05:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.

I just know that we XFS guys have been complaining about it a lot..

But that was mostly a tuning issue - before writeout mostly happened
from pdflush. If we got into kswapd or direct reclaim we already
did get horrible I/O patterns - it just happened far less often.

> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.

As Chris mentioned, currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremely limited. And unless
we get things fixed we will have to do the same for XFS. I'd be much
happier if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems. It's rather interesting
that all people on the modern fs side completely agree here on what the
problem is, but it seems rather hard to convince the VM side to do
anything about it.

> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks.

And it doesn't solve other issues, like the whole lock taking problem.

> Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.

Allowing the flusher threads to do targeted writeout would be the
best from the FS POV. We'll still have one source of the I/O, just
with another knob for selecting the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.
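
Something along these lines is presumably what "targeted writeout by the
flusher threads" would look like from the VM side. Every name below is
hypothetical - nothing like this exists in the tree under discussion - it is
only meant to make the shape of the interface concrete.

#include <linux/fs.h>
#include <linux/backing-dev.h>
#include <linux/completion.h>

/* Hypothetical work item handed from reclaim to a bdi flusher thread. */
struct reclaim_writeback_work {
	struct address_space	*mapping;	/* file the dirty page sits in */
	pgoff_t			start, end;	/* range around the target page */
	struct completion	*done;		/* non-NULL for lumpy's sync case */
};

/* Hypothetical: queue the work on the flusher thread for this bdi. */
int bdi_queue_reclaim_writeback(struct backing_dev_info *bdi,
				struct reclaim_writeback_work *work);

For lumpy reclaim the caller would pass a completion and wait on it, which
keeps the synchronous behaviour without the stack depth.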

2010-04-18 19:10:47

by Sorin Faibish

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
<[email protected]> wrote:

>
> There are two issues here: stack utilisation and poor IO patterns in
> direct reclaim. They are different.
>
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.
I for one am looking very seriously at this problem together with Bruce.
We plan to have a discussion on this topic at the next LSF meeting
in Boston.


>
>
> Doing writearound in pageout() might help. The kernel was in fact was
> doing that around 2.5.10, but I took it out again because it wasn't
> obviously beneficial.
>
> Writearound is hard to do, because direct-reclaim doesn't have an easy
> way of pinning the address_space: it can disappear and get freed under
> your feet. I was able to make this happen under intense MM loads. The
> current page-at-a-time pageout code pins the address_space by taking a
> lock on one of its pages. Once that lock is released, we cannot touch
> *mapping.
>
> And lo, the pageout() code is presently buggy:
>
> res = mapping->a_ops->writepage(page, &wbc);
> if (res < 0)
> handle_write_error(mapping, page, res);
>
> The ->writepage can/will unlock the page, and we're passing a hand
> grenade into handle_write_error().
>
> Any attempt to implement writearound in pageout will need to find a way
> to safely pin that address_space. One way is to take a temporary ref
> on mapping->host, but IIRC that introduced nasties with inode_lock.
> Certainly it'll put more load on that worrisomely-singleton lock.
>
>
> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.
>
> Doing writearound (if we can get it going) will solve that adequately
> (assuming that the target page gets reliably written), but it won't
> help the stack usage problem.
>
>
> To solve the IO-pattern thing I really do think we should first work
> out ytf we started doing much more IO off the LRU. What caused it? Is
> it really unavoidable?
>
>
> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks. Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.
>



--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]

2010-04-18 19:12:10

by Sorin Faibish

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, 18 Apr 2010 15:05:26 -0400, Christoph Hellwig <[email protected]>
wrote:

> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
>> The poor IO patterns thing is a regression. Some time several years
>> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> dirty-page writeback than it used to. AFAIK nobody attempted to work
>> out why, nor attempted to try to fix it.
>
> I just know that we XFS guys have been complaining about it a lot..
I also know that the ext3 and reiserfs guys complained about this issue
as well.

>
> But that was mostly a tuning issue - before writeout mostly happened
> from pdflush. If we got into kswapd or direct reclaim we already
> did get horrible I/O patterns - it just happened far less often.
>
>> Regarding simply not doing any writeout in direct reclaim (Dave's
>> initial proposal): the problem is that pageout() will clean a page in
>> the target zone. Normal writeout won't do that, so we could get into a
>> situation where vast amounts of writeout is happening, but none of it
>> is cleaning pages in the zone which we're trying to allocate from.
>> It's quite possibly livelockable, too.
>
> As Chris mentioned currently btrfs and ext4 do not actually do delalloc
> conversions from this path, so for typical workloads the amount of
> writeout that can happen from this path is extremly limited. And unless
> we get things fixed we will have to do the same for XFS. I'd be much
> more happy if we could just sort it out at the VM level, because this
> means we have one sane place for this kind of policy instead of three
> or more hacks down inside the filesystems. It's rather interesting
> that all people on the modern fs side completely agree here what the
> problem is, but it seems rather hard to convince the VM side to do
> anything about it.
>
>> To solve the stack-usage thing: dunno, really. One could envisage code
>> which skips pageout() if we're using more than X amount of stack, but
>> that sucks.
>
> And it doesn't solve other issues, like the whole lock taking problem.
>
>> Another possibility might be to hand the target page over
>> to another thread (I suppose kswapd will do) and then synchronise with
>> that thread - get_page()+wait_on_page_locked() is one way. The helper
>> thread could of course do writearound.
>
> Allowing the flusher threads to do targeted writeout would be the
> best from the FS POV. We'll still have one source of the I/O, just
> with another know on how to select the exact region to write out.
> We can still synchronously wait for the I/O for lumpy reclaim if really
> nessecary.
>



--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]

2010-04-18 19:33:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, 18 Apr 2010 15:05:26 -0400 Christoph Hellwig <[email protected]> wrote:

> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > The poor IO patterns thing is a regression. Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to. AFAIK nobody attempted to work
> > out why, nor attempted to try to fix it.
>
> I just know that we XFS guys have been complaining about it a lot..
>
> But that was mostly a tuning issue - before writeout mostly happened
> from pdflush. If we got into kswapd or direct reclaim we already
> did get horrible I/O patterns - it just happened far less often.

Right. It's intended that the great majority of writeout be performed
by the fs flusher threads and by the write()r in balance_dirty_pages().
Writeout off the LRU is supposed to be a rare emergency case.

This got broken.

> > Regarding simply not doing any writeout in direct reclaim (Dave's
> > initial proposal): the problem is that pageout() will clean a page in
> > the target zone. Normal writeout won't do that, so we could get into a
> > situation where vast amounts of writeout is happening, but none of it
> > is cleaning pages in the zone which we're trying to allocate from.
> > It's quite possibly livelockable, too.
>
> As Chris mentioned currently btrfs and ext4 do not actually do delalloc
> conversions from this path, so for typical workloads the amount of
> writeout that can happen from this path is extremly limited. And unless
> we get things fixed we will have to do the same for XFS. I'd be much
> more happy if we could just sort it out at the VM level, because this
> means we have one sane place for this kind of policy instead of three
> or more hacks down inside the filesystems. It's rather interesting
> that all people on the modern fs side completely agree here what the
> problem is, but it seems rather hard to convince the VM side to do
> anything about it.
>
> > To solve the stack-usage thing: dunno, really. One could envisage code
> > which skips pageout() if we're using more than X amount of stack, but
> > that sucks.
>
> And it doesn't solve other issues, like the whole lock taking problem.
>
> > Another possibility might be to hand the target page over
> > to another thread (I suppose kswapd will do) and then synchronise with
> > that thread - get_page()+wait_on_page_locked() is one way. The helper
> > thread could of course do writearound.
>
> Allowing the flusher threads to do targeted writeout would be the
> best from the FS POV. We'll still have one source of the I/O, just
> with another know on how to select the exact region to write out.
> We can still synchronously wait for the I/O for lumpy reclaim if really
> nessecary.

Yeah, but it's all bandaids. The first thing we should do is work out
why writeout-off-the-LRU increased so much and fix that.

Handing writeout off to separate threads might be used to solve the
stack consumption problem but we shouldn't use it to "solve" the
excess-writeout-from-page-reclaim problem.

2010-04-18 19:35:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, Apr 18, 2010 at 12:31:09PM -0400, Andrew Morton wrote:
> Yeah, but it's all bandaids. The first thing we should do is work out
> why writeout-off-the-LRU increased so much and fix that.
>
> Handing writeout off to separate threads might be used to solve the
> stack consumption problem but we shouldn't use it to "solve" the
> excess-writeout-from-page-reclaim problem.

I think both of them are really serious issues. Exposing the whole
stack and lock problems with direct reclaim is a bit of a positive
side-effect of the writeout tuning mess-up. Without it the problems
would still be just as harmful, just happening even less often and
thus getting even less attention.

2010-04-18 21:30:52

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, 2010-04-18 at 15:10 -0400, Sorin Faibish wrote:
> On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
> <[email protected]> wrote:
>
> >
> > There are two issues here: stack utilisation and poor IO patterns in
> > direct reclaim. They are different.
> >
> > The poor IO patterns thing is a regression. Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to. AFAIK nobody attempted to work
> > out why, nor attempted to try to fix it.

> I for one am looking very seriously at this problem together with Bruce.
> We plan to have a discussion on this topic at the next LSF meeting
> in Boston.

As luck would have it, the Memory Management summit is co-located with
the Storage and Filesystem workshop ... how about just planning to lock
all the protagonists in a room if it's not solved by August. The less
extreme might even like to propose topics for the plenary sessions ...

James

2010-04-18 23:33:56

by Sorin Faibish

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, 18 Apr 2010 17:30:36 -0400, James Bottomley
<[email protected]> wrote:

> On Sun, 2010-04-18 at 15:10 -0400, Sorin Faibish wrote:
>> On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
>> <[email protected]> wrote:
>>
>> >
>> > There are two issues here: stack utilisation and poor IO patterns in
>> > direct reclaim. They are different.
>> >
>> > The poor IO patterns thing is a regression. Some time several years
>> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> > dirty-page writeback than it used to. AFAIK nobody attempted to work
>> > out why, nor attempted to try to fix it.
>
>> I for one am looking very seriously at this problem together with Bruce.
>> We plan to have a discussion on this topic at the next LSF meeting
>> in Boston.
>
> As luck would have it, the Memory Management summit is co-located with
> the Storage and Filesystem workshop ... how about just planning to lock
> all the protagonists in a room if it's not solved by August. The less
> extreme might even like to propose topics for the plenary sessions ...
Let's work together to get this done. This is a very good idea. I will try
to bring some facts about the current state by instrumenting the kernel
to sample the dirty page dynamics with higher time granularity. This will
allow us to better expose the problem, or the lack of one. :)

/Sorin


>
> James
>
>



--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]

2010-04-19 00:36:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
>
> There are two issues here: stack utilisation and poor IO patterns in
> direct reclaim. They are different.
>
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.

I think that part of the problem is that at roughly the same time
writeback started on a long downhill slide as well, and we've
really only fixed that in the last couple of kernel releases. Also,
it tends to take more than just writing a few large files to invoke
the LRU-based writeback code, as it is generally not invoked in
filesystem "performance" testing. Hence my bet is on the fact that
the effects of LRU-based writeback are rarely noticed in common
testing.

IOWs, low memory testing is not something a lot of people do. Add to
that the fact that most fs people, including me, have been treating
the VM as a black box that a bunch of other people have been taking
care of and hence really just been hoping it does the right thing,
and we've got a recipe for an unnoticed descent into a Bad Place.

[snip]

> Any attempt to implement writearound in pageout will need to find a way
> to safely pin that address_space. One way is to take a temporary ref
> on mapping->host, but IIRC that introduced nasties with inode_lock.
> Certainly it'll put more load on that worrisomely-singleton lock.

A problem already solved in the background flusher threads....

> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.

That's true, but seeing as we can't safely do writeback from
reclaim, we need some method of telling the background threads to
write a certain region of an inode. Perhaps some extension of a
struct writeback_control?
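
For illustration only, a minimal sketch of what handing a region of an inode
to the flusher threads could look like; reclaim_wb_work and the queueing path
are made up for this sketch, while the range_start/range_end fields already
exist in struct writeback_control:

struct reclaim_wb_work {
	struct inode	*inode;		/* pinned (e.g. with igrab()) when queued */
	loff_t		range_start;
	loff_t		range_end;
};

/* runs in a flusher thread, never in reclaim context */
static void reclaim_wb_fn(struct reclaim_wb_work *work)
{
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= 1024,
		.range_start	= work->range_start,	/* only this region */
		.range_end	= work->range_end,	/* gets written back */
	};

	do_writepages(work->inode->i_mapping, &wbc);
	iput(work->inode);	/* drop the reference taken at queue time */
}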

> Doing writearound (if we can get it going) will solve that adequately
> (assuming that the target page gets reliably written), but it won't
> help the stack usage problem.
>
>
> To solve the IO-pattern thing I really do think we should first work
> out ytf we started doing much more IO off the LRU. What caused it? Is
> it really unavoidable?

/me wonders who has the time and expertise to do that archeology

> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but

Which, if we have to set the threshold as low as 1.5k of stack used, means
we may as well just skip pageout()....
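
As a rough, purely illustrative sketch of the check being discussed
(stack_too_deep_for_pageout() is not an existing function; the 1536-byte
threshold is just the figure mentioned above):

static bool stack_too_deep_for_pageout(void)
{
	/* the address of a local is a good-enough stack pointer approximation */
	unsigned long sp = (unsigned long)&sp;
	unsigned long top = (unsigned long)task_stack_page(current) + THREAD_SIZE;

	/* kernel stacks grow down, so top - sp is the amount already used */
	return top - sp > 1536;
}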

> that sucks. Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.

I'm fundamentally opposed to pushing IO to another place in the VM
when it could be just as easily handed to the flusher threads.
Also, consider that there's only one kswapd thread in a given
context (e.g. per CPU), but we can scale the number of flusher
threads as need be....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-19 00:48:32

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Mon, 19 Apr 2010 10:35:56 +1000
Dave Chinner <[email protected]> wrote:

> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> >
> > There are two issues here: stack utilisation and poor IO patterns in
> > direct reclaim. They are different.
> >
> > The poor IO patterns thing is a regression. Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to. AFAIK nobody attempted to
> > work out why, nor attempted to try to fix it.
>
> I think that part of the problem is that at roughly the same time
> writeback started on a long down hill slide as well, and we've
> really only fixed that in the last couple of kernel releases. Also,
> it tends to take more that just writing a few large files to invoke
> the LRU-based writeback code is it is generally not invoked in
> filesystem "performance" testing. Hence my bet is on the fact that
> the effects of LRU-based writeback are rarely noticed in common
> testing.
>


Would this also be the time where we started real dirty accounting, and
started playing with the dirty page thresholds?

Background writeback is that interesting tradeoff between writing out
to make the VM easier (and the data safe) and the chance of someone
either rewriting the same data (as benchmarks do regularly... not sure
about real workloads) or deleting the temporary file.


Maybe we need to do the background dirty writes a bit more aggressively...
or play with heuristics where we get an adaptive timeout (say, if the
file got closed by the last opener, then do a shorter timeout)
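
A very rough sketch of that heuristic, with everything here invented for
illustration (there is no I_LAST_OPENER_CLOSED flag and no per-inode expire
helper in the kernel):

/* hypothetical: how long this inode may stay dirty before background writeback */
static unsigned int inode_dirty_expire(struct inode *inode)
{
	unsigned int expire = dirty_expire_interval;	/* centisecs, 30s by default */

	/* I_LAST_OPENER_CLOSED is invented for this sketch */
	if (inode->i_state & I_LAST_OPENER_CLOSED)
		expire /= 4;	/* flush files nobody holds open sooner */

	return expire;
}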


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-04-19 01:08:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, Apr 18, 2010 at 05:49:44PM -0700, Arjan van de Ven wrote:
> On Mon, 19 Apr 2010 10:35:56 +1000
> Dave Chinner <[email protected]> wrote:
>
> > On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > >
> > > There are two issues here: stack utilisation and poor IO patterns in
> > > direct reclaim. They are different.
> > >
> > > The poor IO patterns thing is a regression. Some time several years
> > > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > > dirty-page writeback than it used to. AFAIK nobody attempted to
> > > work out why, nor attempted to try to fix it.
> >
> > I think that part of the problem is that at roughly the same time
> > writeback started on a long down hill slide as well, and we've
> > really only fixed that in the last couple of kernel releases. Also,
> > it tends to take more that just writing a few large files to invoke
> > the LRU-based writeback code is it is generally not invoked in
> > filesystem "performance" testing. Hence my bet is on the fact that
> > the effects of LRU-based writeback are rarely noticed in common
> > testing.
>
> Would this also be the time where we started real dirty accounting, and
> started playing with the dirty page thresholds?

Yes, I think that was introduced in 2.6.16/17, so it's definitely in
the ballpark.

> Background writeback is that interesting tradeoff between writing out
> to make the VM easier (and the data safe) and the chance of someone
> either rewriting the same data (as benchmarks do regularly... not sure
> about real workloads) or deleting the temporary file.
>
> Maybe we need to do the background dirty writes a bit more aggressive...
> or play with heuristics where we get an adaptive timeout (say, if the
> file got closed by the last opener, then do a shorter timeout)

Realistically, I'm concerned about preventing the worst case
behaviour from occurring - making the background writes more
aggressive without preventing writeback in LRU order simply means it
will be harder to test the VM corner case that triggers these
writeout patterns...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-19 03:09:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Sun, Apr 18, 2010 at 04:30:36PM -0500, James Bottomley wrote:
> > I for one am looking very seriously at this problem together with Bruce.
> > We plan to have a discussion on this topic at the next LSF meeting
> > in Boston.
>
> As luck would have it, the Memory Management summit is co-located with
> the Storage and Filesystem workshop ... how about just planning to lock
> all the protagonists in a room if it's not solved by August. The less
> extreme might even like to propose topics for the plenary sessions ...

I'd personally hope that this is solved long before the LSF/VM
workshops.... but if not, yes, we should definitely tackle it then.

- Ted

2010-04-19 04:31:23

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Mon, 19 Apr 2010 11:08:05 +1000
Dave Chinner <[email protected]> wrote:

> > Maybe we need to do the background dirty writes a bit more
> > aggressive... or play with heuristics where we get an adaptive
> > timeout (say, if the file got closed by the last opener, then do a
> > shorter timeout)
>
> Realistically, I'm concerned about preventing the worst case
> behaviour from occurring - making the background writes more
> agressive without preventing writeback in LRU order simply means it
> will be harder to test the VM corner case that triggers these
> writeout patterns...


while I appreciate that the worst case should not be uber horrific...
I care a LOT about getting the normal case right... and am willing to
sacrifice the worst case for that.. (obviously not to infinity, it
needs to be bounded)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-04-19 15:15:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Fri, Apr 16, 2010 at 04:05:10PM +0100, Mel Gorman wrote:
> > vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> > real code needs to go....just look for the ~ marks.
> >
>
> I must be blind. What tree is this in? I can't see it in v2.6.34-rc4,
> mmotm or google.
>

Bah, Johannes corrected my literal mind. har de har har :)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-19 15:20:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > guess dirty pages can cycle around more so it'd need to be cared for.
> >
> > Well, you keep saying that they break #3, but I haven't seen any
> > test cases or results showing that. I've been unable to confirm that
> > lumpy reclaim is broken by disallowing writeback in my testing, so
> > I'm interested to know what tests you are running that show it is
> > broken...
> >
>
> Ok, I haven't actually tested this. The machines I use are tied up
> retesting the compaction patches at the moment. The reason why I reckon
> it'll be a problem is that when these sync-writeback changes were
> introduced, it significantly helped lumpy reclaim for huge pages. I am
> making an assumption that backing out those changes will hurt it.
>
> I'll test for real on Monday and see what falls out.
>

One machine has completed the test and the results are as expected. When
allocating huge pages under stress, your patch drops the success rates
significantly. On X86-64, it showed

STRESS-HIGHALLOC
                  stress-highalloc      stress-highalloc
                  enable-directreclaim  disable-directreclaim
Under Load 1         89.00 ( 0.00)         73.00 (-16.00)
Under Load 2         90.00 ( 0.00)         85.00 ( -5.00)
At Rest              90.00 ( 0.00)         90.00 (  0.00)

So with direct reclaim, it gets 89% of memory as huge pages at the first
attempt but 73% with your patch applied. The "Under Load 2" test happens
immediately after. With the start kernel, the first and second attempts
are usually the same or very close together. With your patch applied,
there are big differences as it was no longer trying to clean pages.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-20 02:56:33

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

On Thu, Apr 15, 2010 at 3:30 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>> Cc to Johannes
>>
>> > >
>> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> > >
>> > > > Now, vmscan pageout() is one source of IO throughput degradation.
>> > > > Some IO workloads make very many order-0 allocations and reclaims,
>> > > > and pageout's 4K IOs cause lots of annoying seeks.
>> > > >
>> > > > At least, kswapd can avoid such pageout() because kswapd doesn't
>> > > > need to consider the OOM-killer situation, so there's no risk.
>> > > >
>> > > > Signed-off-by: KOSAKI Motohiro <[email protected]>
>> > >
>> > > What's your opinion on trying to cluster the writes done by pageout,
>> > > instead of not doing any paging out in kswapd?
>> > > Something along these lines:
>> >
>> > Interesting.
>> > So, I'd like to review your patch carefully. can you please give me one
>> > day? :)
>>
>> Hannes, if my memory is correct, you tried similar swap-cluster IO
>> a long time ago. Now I can't remember why we didn't merge such a patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :) For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads? oh and here
> is the patch...'.

Hannes,

We recently ran into this problem while running some experiments on
the ext4 filesystem. We hit the scenario where we write a large file,
or just open a large file, with a limited memory allocation (using
containers), and the process gets OOMed. The memory assigned to the
container is reasonably large, and the OOM cannot be reproduced on
ext2 with the same configuration.

Later we figured this might be due to the delayed block allocation
in ext4. Vmscan sends a single page to ext4->writepage(); ext4
punts if the block is delayed-allocated and re-dirties the page. The
flusher threads, on the other hand, use ext4->writepages(), which does
include the block allocation.
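
To illustrate the pattern being described, here is a simplified sketch (not
the actual ext4 code; page_has_da_blocks() and example_get_block() are
placeholders):

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * Blocks under this page are delayed-allocated and we cannot do
	 * the allocation from reclaim context, so punt: leave the page
	 * dirty for the flusher's ->writepages() pass.
	 */
	if (page_has_da_blocks(page)) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	return block_write_full_page(page, example_get_block, wbc);
}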

We looked at the OOM log under ext4: all pages within the container
were on the inactive list and either Dirty or Writeback. Also, the zones
were all marked "all_unreclaimable", which indicates the reclaim path
had scanned the LRU a lot of times without making progress. If the
delayed block allocation is the reason pageout() is unable to
flush dirty pages, and that is what triggers the OOMs, should we signal
the fs to force write out dirty pages under memory pressure?

--Ying

>
>> > >      Cluster writes to disk due to memory pressure.
>> > >
>> > >      Write out logically adjacent pages to the one we're paging out
>> > >      so that we may get better IOs in these situations:
>> > >      These pages are likely to be contiguous on disk to the one we're
>> > >      writing out, so they should get merged into a single disk IO.
>> > >
>> > >      Signed-off-by: Suleiman Souhlal <[email protected]>
>
> For random IO, LRU order will have nothing to do with mapping/disk order.
>

2010-04-23 01:06:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Mon, Apr 19, 2010 at 04:20:34PM +0100, Mel Gorman wrote:
> On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > > guess dirty pages can cycle around more so it'd need to be cared for.
> > >
> > > Well, you keep saying that they break #3, but I haven't seen any
> > > test cases or results showing that. I've been unable to confirm that
> > > lumpy reclaim is broken by disallowing writeback in my testing, so
> > > I'm interested to know what tests you are running that show it is
> > > broken...
> > >
> >
> > Ok, I haven't actually tested this. The machines I use are tied up
> > retesting the compaction patches at the moment. The reason why I reckon
> > it'll be a problem is that when these sync-writeback changes were
> > introduced, it significantly helped lumpy reclaim for huge pages. I am
> > making an assumption that backing out those changes will hurt it.
> >
> > I'll test for real on Monday and see what falls out.
> >
>
> One machine has completed the test and the results are as expected. When
> allocating huge pages under stress, your patch drops the success rates
> significantly. On X86-64, it showed
>
> STRESS-HIGHALLOC
> stress-highalloc stress-highalloc
> enable-directreclaim disable-directreclaim
> Under Load 1 89.00 ( 0.00) 73.00 (-16.00)
> Under Load 2 90.00 ( 0.00) 85.00 (-5.00)
> At Rest 90.00 ( 0.00) 90.00 ( 0.00)
>
> So with direct reclaim, it gets 89% of memory as huge pages at the first
> attempt but 73% with your patch applied. The "Under Load 2" test happens
> immediately after. With the start kernel, the first and second attempts
> are usually the same or very close together. With your patch applied,
> there are big differences as it was no longer trying to clean pages.

What was the machine config you were testing on (RAM, CPUs, etc)?
And what are these loads? Do you have a script that generates
them? If so, can you share them, please?

OOC, what was the effect on the background load - did it go faster
or slower when writeback was disabled? i.e. did we trade off more
large pages for better overall throughput?

Also, I'm curious as to the repeatability of the tests you are
doing. I found that from run to run I could see a *massive*
variance in the results. e.g. one run might only get ~80 huge
pages at the first attempt, the test run from the same initial
conditions next might get 440 huge pages at the first attempt. I saw
the same variance with or without writeback from direct reclaim
enabled. Hence only after averaging over tens of runs could I see
any sort of trend emerge, and it makes me wonder if your testing is
also seeing this sort of variance....

FWIW, if we look at the results of the test I did, it showed a 20%
improvement in large page allocation with a 15% increase in load
throughput, while you're showing a 16% degradation in large page
allocation. Effectively we've got two workloads that show results
at either end of the spectrum (perhaps they are best case vs worst
case) but there's no real in-between. What other tests can we run to
get a better picture of the effect?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-23 10:51:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback

On Fri, Apr 23, 2010 at 11:06:32AM +1000, Dave Chinner wrote:
> On Mon, Apr 19, 2010 at 04:20:34PM +0100, Mel Gorman wrote:
> > On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > > > guess dirty pages can cycle around more so it'd need to be cared for.
> > > >
> > > > Well, you keep saying that they break #3, but I haven't seen any
> > > > test cases or results showing that. I've been unable to confirm that
> > > > lumpy reclaim is broken by disallowing writeback in my testing, so
> > > > I'm interested to know what tests you are running that show it is
> > > > broken...
> > > >
> > >
> > > Ok, I haven't actually tested this. The machines I use are tied up
> > > retesting the compaction patches at the moment. The reason why I reckon
> > > it'll be a problem is that when these sync-writeback changes were
> > > introduced, it significantly helped lumpy reclaim for huge pages. I am
> > > making an assumption that backing out those changes will hurt it.
> > >
> > > I'll test for real on Monday and see what falls out.
> > >
> >
> > One machine has completed the test and the results are as expected. When
> > allocating huge pages under stress, your patch drops the success rates
> > significantly. On X86-64, it showed
> >
> > STRESS-HIGHALLOC
> > stress-highalloc stress-highalloc
> > enable-directreclaim disable-directreclaim
> > Under Load 1 89.00 ( 0.00) 73.00 (-16.00)
> > Under Load 2 90.00 ( 0.00) 85.00 (-5.00)
> > At Rest 90.00 ( 0.00) 90.00 ( 0.00)
> >
> > So with direct reclaim, it gets 89% of memory as huge pages at the first
> > attempt but 73% with your patch applied. The "Under Load 2" test happens
> > immediately after. With the start kernel, the first and second attempts
> > are usually the same or very close together. With your patch applied,
> > there are big differences as it was no longer trying to clean pages.
>
> What was the machine config you were testing on (RAM, CPUs, etc)?

2G RAM, AMD Phenom with 4 cores.

> And what are these loads?

Compile-based loads that fill up memory and put it under heavy memory
pressure while also dirtying memory. While they are running, a kernel module
is loaded that starts allocating huge pages one at a time so that accurate
timing and the state of the system can be gathered at allocation time. The
number of allocation attempts is 90% of the number of huge pages that exist
in the system.
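
As a hedged sketch of what one timed attempt in such a module might look like
(this is not the actual test harness; order 9 is assumed to match 2MB huge
pages on x86-64):

/* one timed order-9 allocation attempt; the caller keeps and later frees the page */
static struct page *timed_hugepage_alloc(unsigned long *ms)
{
	unsigned long start = jiffies;
	struct page *page;

	page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP | __GFP_REPEAT, 9);
	*ms = jiffies_to_msecs(jiffies - start);

	return page;	/* NULL on failure */
}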

> Do you have a script that generates
> them? If so, can you share them, please?
>

Yes, but unfortunately they are not in a publishable state. Parts of
them depend on an automation harness that I don't hold the copyright to.

> OOC, what was the effect on the background load - did it go faster
> or slower when writeback was disabled?

Unfortunately, I don't know what the effect on the underlying load is,
as it takes longer to complete than the huge page allocation attempts do. The
test's objective is to check how well lumpy reclaim works under memory pressure.

However, the time it takes to allocate a huge page increases with direct
reclaim writeback disabled (i.e. your patch) early in the test, up until about
40% of memory has been allocated as huge pages. After that, the latencies with
disable-directreclaim are lower until it gives up, while the latencies with
enable-directreclaim increase.

In other words, with direct reclaim writing back pages, lumpy reclaim is a
lot more determined to get the pages cleaned, and will wait on them if
necessary. A compromise patch might be to wait for the page's dirty bit to be
cleared (a wait_on_page_dirty, so to speak) instead of queueing the IO
ourselves and calling wait_on_page_writeback(). How long it stalls would
depend heavily on the rate at which pages are getting cleaned in the background.
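
A minimal sketch of that compromise, purely for illustration (there is no
wait_on_page_dirty() in the kernel; this version just kicks the flusher
threads and polls, and assumes the caller holds a reference on the page):

static void wait_on_page_dirty(struct page *page)
{
	/* ask the flusher threads to make progress rather than issuing IO ourselves */
	wakeup_flusher_threads(0);

	while (PageDirty(page) || PageWriteback(page))
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
}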

> i.e. did we trade of more
> large pages for better overall throughput?
>
> Also, I'm curious as to the repeatability of the tests you are
> doing. I found that from run to run I could see a *massive*
> variance in the results. e.g. one run might only get ~80 huge
> pages at the first attempt, the test run from the same initial
> conditions next might get 440 huge pages at the first attempt.

You are using the nr_hugepages interface and writing a large number to it
so you are also triggering the hugetlbfs retry-logic and have little control
over how many times the allocator gets called on each attempt. How many huge
pages it allocates depends on how much progress it is able to make during
lumpy reclaim.

It's why the tests I run allocate huge pages one at a time and measure
the latencies as it goes. The results tend to be quite reproducible.
Success figures would be the same between runs and the rate of
allocation success would generally be comparable as well.

Your test could do something similar by only ever requesting one additional
page. It would be good enough to measure allocation latency. The gathering
of other system state at the time of failure is not very important here
(whereas it was important during anti-frag development, hence the use of a
kernel module).

> I saw
> the same variance with or without writeback from direct reclaim
> enabled. Hence only after averaging over tens of runs could I see
> any sort of trend emerge, and it makes me wonder if your testing is
> also seeing this sort of variance....
>

Typically, there is not much variance between tests. Maybe 1-2% in allocation
success rates.

> FWIW, if we look results of the test I did, it showed a 20%
> improvement in large page allocation with a 15% increase in load
> throughput, while you're showing a 16% degradation in large page
> allocation.

With writeback, lumpy reclaim takes a range of pages, cleans them and waits
for the IO before moving on. This causes a seeky IO pattern and takes time.
It also causes a fair amount of thrashing.

With your patch, lumpy reclaim would just skip over ranges with dirty pages
until it found clean pages in a suitable range. When there is plenty of
usable memory early in the test, it probably scans more but causes less
IO so would appear faster. Later in the test, it scans more but eventually
encounters too many dirty pages and gives up. Hence, its success rates will
be more random because it depends on where exactly the dirty pages were.

If this is accurate, it will always be the case that your patch causes less
disruption in the system and will appear faster due to the lack of IO, but
will be less predictable and give up more easily, so it will have lower
success rates when there are dirty pages in the system.

> Effectively we've got two workloads that show results
> at either end of the spectrum (perhaps they are best case vs worst
> case) but there's no real in-between. What other tests can we run to
> get a better picture of the effect?
>

The underlying workload is only important in how many pages it is
dirtying at any given time. Heck, at one point my test workload was a
single process that created a mapping the size of physical memory and in
test a) would constantly read it and in test b) would constantly write
it. Lumpy reclaim with dirty-page-writeback was always more predictable
and had higher success rates.
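
For reference, a minimal userspace sketch of that kind of workload, assuming
a file-backed mapping roughly the size of RAM (the 2GB size matches the test
box above; the file name is arbitrary):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (2UL << 30)	/* roughly physical memory on the 2GB machine */

int main(void)
{
	int fd = open("dirty.dat", O_RDWR | O_CREAT, 0644);
	char *p;

	if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0)
		return 1;

	p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	for (;;) {
		size_t off;

		/* test (b): dirty every page; swap in a read loop for test (a) */
		for (off = 0; off < MAP_SIZE; off += 4096)
			p[off] = 1;
	}
}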

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab