2001-02-27 10:43:55

by Mike Galbraith

Subject: [patch][rfc][rft] vm throughput 2.4.2-ac4

Hi,

Attempting to avoid doing I/O has been harmful to throughput here
ever since the queueing/elevator woes were fixed. Ever since then,
tossing attempts at avoidance has improved throughput markedly.

IMHO, any patch which claims to improve throughput via code deletion
should be worth a little eyeball time.. and maybe even a test run ;-)

Comments welcome.

-Mike

--- linux-2.4.2-ac4/mm/page_alloc.c.org Mon Feb 26 11:19:27 2001
+++ linux-2.4.2-ac4/mm/page_alloc.c Tue Feb 27 10:31:10 2001
@@ -274,7 +274,7 @@
struct page * __alloc_pages(zonelist_t *zonelist, unsigned long order)
{
zone_t **zone;
- int direct_reclaim = 0;
+ int direct_reclaim = 0, loop = 0;
unsigned int gfp_mask = zonelist->gfp_mask;
struct page * page;

@@ -366,7 +366,7 @@
* able to free some memory we can't free ourselves
*/
wakeup_kswapd();
- if (gfp_mask & __GFP_WAIT) {
+ if (gfp_mask & __GFP_WAIT && loop) {
__set_current_state(TASK_RUNNING);
current->policy |= SCHED_YIELD;
schedule();
@@ -440,7 +440,7 @@
memory_pressure++;
try_to_free_pages(gfp_mask);
wakeup_bdflush(0);
- if (!order)
+ if (!order || loop++ < (1 << order))
goto try_again;
}
}
--- linux-2.4.2-ac4/mm/vmscan.c.org Mon Feb 26 09:31:46 2001
+++ linux-2.4.2-ac4/mm/vmscan.c Tue Feb 27 09:04:50 2001
@@ -278,6 +278,8 @@
/* Always start by trying to penalize the process that is allocating memory */
if (mm)
retval = swap_out_mm(mm, swap_amount(mm));
+ if (retval)
+ return retval;

/* Then, look at the other mm's */
counter = (mmlist_nr << SWAP_SHIFT) >> priority;
@@ -418,8 +420,8 @@
#define MAX_LAUNDER (1 << page_cluster)
int page_launder(int gfp_mask, int user)
{
- int launder_loop, maxscan, flushed_pages, freed_pages, maxlaunder;
- int can_get_io_locks, sync, target, shortage;
+ int maxscan, flushed_pages, freed_pages, maxlaunder;
+ int can_get_io_locks;
struct list_head * page_lru;
struct page * page;
struct zone_struct * zone;
@@ -430,15 +432,10 @@
*/
can_get_io_locks = gfp_mask & __GFP_IO;

- target = free_shortage();
-
- sync = 0;
- launder_loop = 0;
maxlaunder = 0;
flushed_pages = 0;
freed_pages = 0;

-dirty_page_rescan:
spin_lock(&pagemap_lru_lock);
maxscan = nr_inactive_dirty_pages;
while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
@@ -446,6 +443,9 @@
page = list_entry(page_lru, struct page, lru);
zone = page->zone;

+ if ((user && freed_pages + flushed_pages > MAX_LAUNDER)
+ || !free_shortage())
+ break;
/* Wrong page on list?! (list corruption, should not happen) */
if (!PageInactiveDirty(page)) {
printk("VM: page_launder, wrong page on list.\n");
@@ -464,18 +464,7 @@
continue;
}

- /*
- * Disk IO is really expensive, so we make sure we
- * don't do more work than needed.
- * Note that clean pages from zones with enough free
- * pages still get recycled and dirty pages from these
- * zones can get flushed due to IO clustering.
- */
- if (freed_pages + flushed_pages > target && !free_shortage())
- break;
- if (launder_loop && !maxlaunder)
- break;
- if (launder_loop && zone->inactive_clean_pages +
+ if (zone->inactive_clean_pages +
zone->free_pages > zone->pages_high)
goto skip_page;

@@ -500,14 +489,6 @@
if (!writepage)
goto page_active;

- /* First time through? Move it to the back of the list */
- if (!launder_loop) {
- list_del(page_lru);
- list_add(page_lru, &inactive_dirty_list);
- UnlockPage(page);
- continue;
- }
-
/* OK, do a physical asynchronous write to swap. */
ClearPageDirty(page);
page_cache_get(page);
@@ -517,7 +498,6 @@
/* XXX: all ->writepage()s should use nr_async_pages */
if (!PageSwapCache(page))
flushed_pages++;
- maxlaunder--;
page_cache_release(page);

/* And re-start the thing.. */
@@ -535,7 +515,7 @@
* buffer pages
*/
if (page->buffers) {
- int wait, clearedbuf;
+ int clearedbuf;
/*
* Since we might be doing disk IO, we have to
* drop the spinlock and take an extra reference
@@ -545,16 +525,8 @@
page_cache_get(page);
spin_unlock(&pagemap_lru_lock);

- /* Will we do (asynchronous) IO? */
- if (launder_loop && maxlaunder == 0 && sync)
- wait = 2; /* Synchrounous IO */
- else if (launder_loop && maxlaunder-- > 0)
- wait = 1; /* Async IO */
- else
- wait = 0; /* No IO */
-
/* Try to free the page buffers. */
- clearedbuf = try_to_free_buffers(page, wait);
+ clearedbuf = try_to_free_buffers(page, can_get_io_locks);

/*
* Re-take the spinlock. Note that we cannot
@@ -566,7 +538,7 @@
/* The buffers were not freed. */
if (!clearedbuf) {
add_page_to_inactive_dirty_list(page);
- if (wait)
+ if (can_get_io_locks)
flushed_pages++;

/* The page was only in the buffer cache. */
@@ -619,61 +591,8 @@
spin_unlock(&pagemap_lru_lock);

/*
- * If we don't have enough free pages, we loop back once
- * to queue the dirty pages for writeout. When we were called
- * by a user process (that /needs/ a free page) and we didn't
- * free anything yet, we wait synchronously on the writeout of
- * MAX_SYNC_LAUNDER pages.
- *
- * We also wake up bdflush, since bdflush should, under most
- * loads, flush out the dirty pages before we have to wait on
- * IO.
- */
- shortage = free_shortage();
- if (can_get_io_locks && !launder_loop && shortage) {
- launder_loop = 1;
-
- /*
- * User programs can run page_launder() in parallel so
- * we only flush a few pages at a time to avoid big IO
- * storms. Kswapd, OTOH, is expected usually keep up
- * with the paging load in the system and doesn't have
- * the IO storm problem, so it just flushes all pages
- * needed to fix the free shortage.
- */
- maxlaunder = shortage;
- maxlaunder -= flushed_pages;
- maxlaunder -= atomic_read(&nr_async_pages);
-
- if (maxlaunder <= 0)
- goto out;
-
- if (user && maxlaunder > MAX_LAUNDER)
- maxlaunder = MAX_LAUNDER;
-
- /*
- * If we are called by a user program, we need to free
- * some pages. If we couldn't, we'll do the last page IO
- * synchronously to be sure
- */
- if (user && !freed_pages)
- sync = 1;
-
- goto dirty_page_rescan;
- }
-
- /*
- * We have to make sure the data is actually written to
- * the disk now, otherwise we'll never get enough clean
- * pages and the system will keep queueing dirty pages
- * for flushing.
- */
- run_task_queue(&tq_disk);
-
- /*
* Return the amount of pages we freed or made freeable.
*/
-out:
return freed_pages + flushed_pages;
}

@@ -846,7 +765,7 @@
* continue with its real work sooner. It also helps balancing when we
* have multiple processes in try_to_free_pages simultaneously.
*/
-#define DEF_PRIORITY (6)
+#define DEF_PRIORITY (2)
static int refill_inactive(unsigned int gfp_mask, int user)
{
int count, start_count, maxtry;
@@ -981,14 +900,6 @@
/* If needed, try to free some memory. */
if (inactive_shortage() || free_shortage())
do_try_to_free_pages(GFP_KSWAPD, 0);
-
- /*
- * Do some (very minimal) background scanning. This
- * will scan all pages on the active list once
- * every minute. This clears old referenced bits
- * and moves unused pages to the inactive list.
- */
- refill_inactive_scan(DEF_PRIORITY, 0);

/* Once a second, recalculate some VM stats. */
if (time_after(jiffies, recalc + HZ)) {


2001-02-27 19:46:01

by Rik van Riel

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Tue, 27 Feb 2001, Mike Galbraith wrote:

> Attempting to avoid doing I/O has been harmful to throughput here
> ever since the queueing/elevator woes were fixed. Ever since then,
> tossing attempts at avoidance has improved throughput markedly.
>
> IMHO, any patch which claims to improve throughput via code deletion
> should be worth a little eyeball time.. and maybe even a test run ;-)
>
> Comments welcome.

Before even thinking about testing this thing, I'd like to
see some (detailed?) explanation from you why exactly you
think the changes in this patch are good and how + why they
work.

IMHO it would be good to not apply ANY code to the stable
kernel tree unless we understand what it does and what the
author meant the code to do...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-02-27 21:29:08

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Tue, 27 Feb 2001, Rik van Riel wrote:

> On Tue, 27 Feb 2001, Mike Galbraith wrote:
>
> > Attempting to avoid doing I/O has been harmful to throughput here
> > ever since the queueing/elevator woes were fixed. Ever since then,
> > tossing attempts at avoidance has improved throughput markedly.
> >
> > IMHO, any patch which claims to improve throughput via code deletion
> > should be worth a little eyeball time.. and maybe even a test run ;-)
> >
> > Comments welcome.
>
> Before even thinking about testing this thing, I'd like to
> see some (detailed?) explanation from you why exactly you
> think the changes in this patch are good and how + why they
> work.

Ok.. quite reasonable ;-)

First and foremost: What does refill_inactive_scan do? It places
work to do on a list.. and nothing more. It frees no memory in and
of itself.. none (but we count it as freed.. that's important). It
is the amount of memory we want desperately to free in the immediate
future. We count on it getting freed. The only way to free I/O bound
memory is to do the I/O.. as fast as the I/O subsystem can sync it.

This is the nut.. scan/deactivate percentages are fairly meaningless
unless we do something about these pages.

What the patch does is simply to push I/O as fast as we can.. we're
by definition I/O bound and _can't_ defer it under any circumstance,
for in this direction lies constipation. The only thing in the world
which will make it better is pushing I/O.

If you test the patch, you'll notice one very important thing. The
system no longer over-reacts.. as badly. That's a diagnostic point.
(On my system under my favorite page turnover rate load, I see my box
drowning in a pool of dirty pages.. which it's not allowed to drain)

What we do right now (as kswapd) is scan a tiny portion of the active
page list, and then push an arbitrary amount of swap because we can't
possibly deactivate enough pages if our shortage is larger than the
search area (nr_active_pages >> 6).. repeat until give-up time. In
practice here (test load, but still..), that leads to pushing soon
to be unneeded [supposition!] pages into swap a full 3/4 of the time.
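
To put toy numbers on that (all figures invented, nothing measured;
the only quantity taken from the description above is the
nr_active_pages >> 6 scan window):

#include <stdio.h>

int main(void)
{
	long nr_active_pages = 32768;			/* made-up active list size */
	long shortage = 2048;				/* made-up free shortage */
	long scan_window = nr_active_pages >> 6;	/* 512 pages per pass */

	printf("scan window per pass: %ld pages\n", scan_window);
	if (shortage > scan_window)
		printf("shortage %ld > window %ld: the remaining %ld pages "
		       "get pushed out via swap instead\n",
		       shortage, scan_window, shortage - scan_window);
	return 0;
}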

> IMHO it would be good to not apply ANY code to the stable
> kernel tree unless we understand what it does and what the
> author meant the code to do...

Yes.. I agree 100%. I was not suggesting that this be blindly
integrated. (I know me.. can get all cornfoosed and fsck up;)

-Mike

2001-02-27 23:02:25

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4


On Tue, 27 Feb 2001, Mike Galbraith wrote:

> On Tue, 27 Feb 2001, Rik van Riel wrote:
>
> > On Tue, 27 Feb 2001, Mike Galbraith wrote:
> >
> > > Attempting to avoid doing I/O has been harmful to throughput here
> > > ever since the queueing/elevator woes were fixed. Ever since then,
> > > tossing attempts at avoidance has improved throughput markedly.
> > >
> > > IMHO, any patch which claims to improve throughput via code deletion
> > > should be worth a little eyeball time.. and maybe even a test run ;-)
> > >
> > > Comments welcome.
> >
> > Before even thinking about testing this thing, I'd like to
> > see some (detailed?) explanation from you why exactly you
> > think the changes in this patch are good and how + why they
> > work.
>
> Ok.. quite reasonable ;-)
>
> First and foremost: What does refill_inactive_scan do? It places
> work to do on a list.. and nothing more. It frees no memory in and
> of itself.. none (but we count it as freed.. that's important). It
> is the amount of memory we want desperately to free in the immediate
> future. We count on it getting freed. The only way to free I/O bound
> memory is to do the I/O.. as fast as the I/O subsystem can sync it.
>
> This is the nut.. scan/deactivate percentages are fairly meaningless
> unless we do something about these pages.
>
> What the patch does is simply to push I/O as fast as we can.. we're
> by definition I/O bound and _can't_ defer it under any circumstance,
> for in this direction lies constipation. The only thing in the world
> which will make it better is pushing I/O.

In your I/O bound case, yes. But not in all cases.

> If you test the patch, you'll notice one very important thing. The
> system no longer over-reacts.. as badly. That's a diagnostic point.
> (On my system under my favorite page turnover rate load, I see my box
> drowning in a pool of dirty pages.. which it's not allowed to drain)
>
> What we do right now (as kswapd) is scan a tiny portion of the active
> page list, and then push an arbitrary amount of swap because we can't
> possibly deactivate enough pages if our shortage is larger than the
> search area (nr_active_pages >> 6).. repeat until give-up time. In
> practice here (test load, but still..), that leads to pushing soon
> to be unneeded [supposition!] pages into swap a full 3/4 of the time.

Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
see if the system still swaps out too much?

2001-02-28 05:04:56

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Tue, 27 Feb 2001, Marcelo Tosatti wrote:

> On Tue, 27 Feb 2001, Mike Galbraith wrote:
>
> > What the patch does is simply to push I/O as fast as we can.. we're
> > by definition I/O bound and _can't_ defer it under any circumstance,
> > for in this direction lies constipation. The only thing in the world
> > which will make it better is pushing I/O.
>
> In your I/O bound case, yes. But not in all cases.

That's one reason I tossed it out. I don't _think_ it should have any
negative effect on other loads, but a test run might find otherwise.

> > What we do right now (as kswapd) is scan a tiny portion of the active
> > page list, and then push an arbitrary amount of swap because we can't
> > possibly deactivate enough pages if our shortage is larger than the
> > search area (nr_active_pages >> 6).. repeat until give-up time. In
> > practice here (test load, but still..), that leads to pushing soon
> > to be unneeded [supposition!] pages into swap a full 3/4 of the time.

(correction: it's 2/3 of the time not 3/4.. off by one bug in fingers;)

> Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> see if the system still swaps out too much?

Not yet, but will do.

-Mike

2001-02-28 08:11:37

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

> > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > see if the system still swaps out too much?
>
> Not yet, but will do.

Didn't help. (It actually reduced throughput a little)

-Mike

2001-02-28 08:45:10

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4



On Wed, 28 Feb 2001, Mike Galbraith wrote:

> On Tue, 27 Feb 2001, Marcelo Tosatti wrote:
>
> > On Tue, 27 Feb 2001, Mike Galbraith wrote:
> >
> > > What the patch does is simply to push I/O as fast as we can.. we're
> > > by definition I/O bound and _can't_ defer it under any circumstance,
> > > for in this direction lies constipation. The only thing in the world
> > > which will make it better is pushing I/O.
> >
> > In your I/O bound case, yes. But not in all cases.
>
> That's one reason I tossed it out. I don't _think_ it should have any
> negative effect on other loads, but a test run might find otherwise.

Writes are more expensive than reads. Apart from the aggressive read
caching on the disk, writes have limited caching or no caching at all if
you need security (journalling, for example). (I'm not sure about write
caching details, any harddisk expert?)

On read intensive loads, doing IO to free memory (writing pages out) will
be horribly harmful for these reads (which you can free easily), so its
better to avoid the writes as much as possible.

I remember Matthew Dillon (FreeBSD VM guy) had a read intensive case where
using a 20:1 clean/flush ratio to free pages in FreeBSD's launder routine
(at that time, IIRC, their launder routine was looping twice over the inactive
dirty list looking for clean pages to throw away, and only on the third
loop would it do IO) was still a problem for disk performance
because of the writes. Yes, it sounds weird.

I suppose you're running dbench.

2001-02-28 09:11:56

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4



On Wed, 28 Feb 2001, Mike Galbraith wrote:

> > > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > > see if the system still swaps out too much?
> >
> > Not yet, but will do.

But what about swapping behaviour?

It still swaps too much?

2001-02-28 11:48:58

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Wed, 28 Feb 2001, Marcelo Tosatti wrote:

> On Wed, 28 Feb 2001, Mike Galbraith wrote:
>
> > > > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > > > see if the system still swaps out too much?
> > >
> > > Not yet, but will do.
>
> But what about swapping behaviour?
>
> It still swaps too much?

Yes.

(returning to study mode)

-Mike

2001-02-28 15:35:55

by Rik van Riel

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> On Wed, 28 Feb 2001, Mike Galbraith wrote:

> > That's one reason I tossed it out. I don't _think_ it should have any
> > negative effect on other loads, but a test run might find otherwise.
>
> Writes are more expensive than reads. Apart from the aggressive read
> caching on the disk, writes have limited caching or no caching at all if
> you need security (journalling, for example). (I'm not sure about write
> caching details, any harddisk expert?)

I suspect Mike needs to change his benchmark load a little
so that it dirties only 10% of the pages (might be realistic
for web and/or database loads).

At that point, you should be able to see that doing writes
all the time can really mess up read performance due to extra
introduced seeks.

We probably want some in-between solution (like FreeBSD has today).
The first time they see a dirty page, they mark it as seen, the
second time they come across it in the inactive list, they flush it.
This way IO is still delayed a bit and not done if there are enough
clean pages around.

Another solution would be to do some more explicit IO clustering and
only flush _large_ clusters ... no need to invoke extra disk seeks
just to free a single page, unless you only have single pages left.
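
A rough userspace sketch of the first idea (toy structures and made-up
names, not a patch; the "enough clean pages" cutoff is left out to keep
it short):

#include <stdbool.h>
#include <stdio.h>

struct toy_page {
	bool dirty;
	bool seen;	/* set the first time the launder pass meets it */
};

/* one pass over a toy "inactive list"; returns how many writes were queued */
static int launder_pass(struct toy_page *pages, int n)
{
	int flushed = 0;

	for (int i = 0; i < n; i++) {
		struct toy_page *p = &pages[i];

		if (!p->dirty)
			continue;	/* clean: reclaimable without any IO */
		if (!p->seen) {
			p->seen = true;	/* first encounter: only mark it */
			continue;
		}
		p->dirty = false;	/* second encounter: queue the writeout */
		flushed++;
	}
	return flushed;
}

int main(void)
{
	struct toy_page pages[6] = {
		{ .dirty = true }, { 0 }, { .dirty = true },
		{ 0 }, { .dirty = true }, { 0 },
	};

	printf("pass 1 queued %d writes\n", launder_pass(pages, 6));	/* 0 */
	printf("pass 2 queued %d writes\n", launder_pass(pages, 6));	/* 3 */
	return 0;
}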

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-01 04:15:16

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Wed, 28 Feb 2001, Rik van Riel wrote:

> On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > On Wed, 28 Feb 2001, Mike Galbraith wrote:
>
> > > That's one reason I tossed it out. I don't _think_ it should have any
> > > negative effect on other loads, but a test run might find otherwise.
> >
> > Writes are more expensive than reads. Apart from the aggressive read
> > caching on the disk, writes have limited caching or no caching at all if
> > you need security (journalling, for example). (I'm not sure about write
> > caching details, any harddisk expert?)
>
> I suspect Mike needs to change his benchmark load a little
> so that it dirties only 10% of the pages (might be realistic
> for web and/or database loads).

Asking the user to not dirty so many pages is wrong. My benchmark
load is many compute intensive tasks which each dirty a few pages
while doing real work. It would be unrealistic if it just dirtied
pages as fast as possible to intentionally jam up the vm, but it
doesn't do that.

> At that point, you should be able to see that doing writes
> all the time can really mess up read performance due to extra
> introduced seeks.

The fact that writes are painful doesn't change the fact that data
must be written in order to free memory and proceed. Besides, the
elevator is supposed to solve that not the allocator.. or?

> We probably want some in-between solution (like FreeBSD has today).
> The first time they see a dirty page, they mark it as seen, the
> second time they come across it in the inactive list, they flush it.
> This way IO is still delayed a bit and not done if there are enough
> clean pages around.

(delayed write is fine, but I'll be upset if vmlinux doesn't show up
after I buy more ram;)

> Another solution would be to do some more explicit IO clustering and
> only flush _large_ clusters ... no need to invoke extra disk seeks
> just to free a single page, unless you only have single pages left.

This sounds good.. except I keep thinking about the elevator. Clusters
disappear as soon as they hit the queues so clustering at the vm level
doesn't make any sense to me. Where pages actually land is a function
of the fs, and that gets torn down even further by the elevator. If
you submit pages one at a time, the plug will build clusters for you.

I don't think that the vm has the information needed to make decisions
like this nor the responsibility to do so. It's a customer of the I/O
layers beneath it.

-Mike

2001-03-01 15:37:40

by Rik van Riel

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Thu, 1 Mar 2001, Mike Galbraith wrote:
> On Wed, 28 Feb 2001, Rik van Riel wrote:
> > On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > > On Wed, 28 Feb 2001, Mike Galbraith wrote:
> >
> > > > That's one reason I tossed it out. I don't _think_ it should have any
> > > > negative effect on other loads, but a test run might find otherwise.
> > >
> > > Writes are more expensive than reads. Apart from the aggressive read
> > > caching on the disk, writes have limited caching or no caching at all if
> > > you need security (journalling, for example). (I'm not sure about write
> > > caching details, any harddisk expert?)
> >
> > I suspect Mike needs to change his benchmark load a little
> > so that it dirties only 10% of the pages (might be realistic
> > for web and/or database loads).
>
> Asking the user to not dirty so many pages is wrong. My benchmark
> load is many compute intensive tasks which each dirty a few pages
> while doing real work. It would be unrealistic if it just dirtied
> pages as fast as possible to intentionally jam up the vm, but it
> doesn't do that.

Asking you to test a different kind of workload is wrong ??

The kind of load I described _is_ realistic, think for example
about ftp/www/MySQL servers...

> > At that point, you should be able to see that doing writes
> > all the time can really mess up read performance due to extra
> > introduced seeks.
>
> The fact that writes are painful doesn't change the fact that data
> must be written in order to free memory and proceed. Besides, the
> elevator is supposed to solve that not the allocator.. or?

But if the amount of dirtied pages is _small_, it means that we can
allow the reads to continue uninterrupted for a while before we
flush all dirty pages in one go...

Also, the elevator can only try to optimise whatever you throw at
it. If you throw random requests at the elevator, you cannot expect
it to do ANY GOOD ...

The merging at the elevator level only works if the requests sent to
it are right next to each other on disk. This means that randomly
sending stuff to disk really DOES DESTROY PERFORMANCE and there's
nothing the elevator could ever hope to do about that.
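
As a toy illustration of that point (nothing like the real elevator
code): only back-to-back sectors can collapse into one request, while
scattered sectors stay as many small requests, i.e. many seeks.

#include <stdio.h>

#define NREQ 16

struct toy_req { long start, len; };

static int submit(struct toy_req *q, int nr, long sector)
{
	/* back-merge with the previous request if the new sector is adjacent */
	if (nr && q[nr - 1].start + q[nr - 1].len == sector) {
		q[nr - 1].len++;
		return nr;
	}
	q[nr].start = sector;
	q[nr].len = 1;
	return nr + 1;
}

int main(void)
{
	struct toy_req q[NREQ];
	long scattered[] = { 7, 9000, 42, 512, 3, 777, 60000, 128 };
	int nr = 0;

	for (long s = 100; s < 108; s++)	/* 8 adjacent sectors */
		nr = submit(q, nr, s);
	printf("sequential: %d request(s)\n", nr);	/* 1 */

	nr = 0;
	for (int i = 0; i < 8; i++)		/* 8 scattered sectors */
		nr = submit(q, nr, scattered[i]);
	printf("scattered: %d request(s)\n", nr);	/* 8 */
	return 0;
}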

> > We probably want some in-between solution (like FreeBSD has today).
> > The first time they see a dirty page, they mark it as seen, the
> > second time they come across it in the inactive list, they flush it.
> > This way IO is still delayed a bit and not done if there are enough
> > clean pages around.
>
> (delayed write is fine, but I'll be upset if vmlinux doesn't show up
> after I buy more ram;)

Writing out old data is a task independent of the VM. This is a
job done by kupdate. The only thing the VM does is write pages out
earlier when it's under memory pressure.

> > Another solution would be to do some more explicit IO clustering and
> > only flush _large_ clusters ... no need to invoke extra disk seeks
> > just to free a single page, unless you only have single pages left.
>
> This sounds good.. except I keep thinking about the elevator.
> Clusters disappear as soon as they hit the queues so clustering
> at the vm level doesn't make any sense to me.

You should think about the elevator a bit more. Feel for the poor
thing and try to send it requests it can actually do something
useful with ;)

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-01 20:35:58

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Thu, 1 Mar 2001, Rik van Riel wrote:

> On Thu, 1 Mar 2001, Mike Galbraith wrote:
> > On Wed, 28 Feb 2001, Rik van Riel wrote:
> > > On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > > > On Wed, 28 Feb 2001, Mike Galbraith wrote:
> > >
> > > > > That's one reason I tossed it out. I don't _think_ it should have any
> > > > > negative effect on other loads, but a test run might find otherwise.
> > > >
> > > > Writes are more expensive than reads. Apart from the aggressive read
> > > > caching on the disk, writes have limited caching or no caching at all if
> > > > you need security (journalling, for example). (I'm not sure about write
> > > > caching details, any harddisk expert?)
> > >
> > > I suspect Mike needs to change his benchmark load a little
> > > so that it dirties only 10% of the pages (might be realistic
> > > for web and/or database loads).
> >
> > Asking the user to not dirty so many pages is wrong. My benchmark
> > load is many compute intensive tasks which each dirty a few pages
> > while doing real work. It would be unrealistic if it just dirtied
> > pages as fast as possible to intentionally jam up the vm, but it
> > doesn't do that.
>
> Asking you to test a different kind of workload is wrong ??

No no no and again no (perhaps I misread that bit). But otoh, you
haven't tested the patch I sent in good faith. I sent it because I
have thought about it. I may be wrong in my interpretation of the
results, but those results were thought about.. and they exist.

> The kind of load I described _is_ realistic, think for example
> about ftp/www/MySQL servers...

Yes. My favorite test load is also realistic.

> > > At that point, you should be able to see that doing writes
> > > all the time can really mess up read performance due to extra
> > > introduced seeks.
> >
> > The fact that writes are painful doesn't change the fact that data
> > must be written in order to free memory and proceed. Besides, the
> > elevator is supposed to solve that not the allocator.. or?
>
> But if the amount of dirtied pages is _small_, it means that we can
> allow the reads to continue uninterrupted for a while before we
> flush all dirty pages in one go...

"If wishes were horses, beggers would ride."

There is no mechanism in place that ensures that dirty pages can't
get out of control, and they do in fact get out of control, and it
is exacerbated (mho) by attempting to define 'too much I/O' without
any information to base this definition upon.

> Also, the elevator can only try to optimise whatever you throw at
> it. If you throw random requests at the elevator, you cannot expect
> it to do ANY GOOD ...

This is a very good point (which I will think upon). I ask you this
in return. Why do you think that the random junk you throw at the
elevator is different than the random junk I throw at it? ;-) I see
no difference at all.. it's the same exact junk. (it's junk because
neither of us knows that it will be optimizable.. it really is a
random bunch of pages because we have ZERO information concerning
the origins, destinations nor informational content of the pages we're
pushing. We have no interest [only because we aren't clever enough
to be interested] in these things.)

> The merging at the elevator level only works if the requests sent to
> it are right next to each other on disk. This means that randomly
> sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> nothing the elevator could ever hope to do about that.

True to some (very real) extent because of the limited buffering of
requests. However, I can not find any useful information that the
vm is using to guarantee the IT does not destroy performance by your
own definition. If it's there and I'm just missing it, I'd thank
you heartily if you'd hit me up side the head with a clue-x-4 ;-)

> > > We probably want some in-between solution (like FreeBSD has today).
> > > The first time they see a dirty page, they mark it as seen, the
> > > second time they come across it in the inactive list, they flush it.
> > > This way IO is still delayed a bit and not done if there are enough
> > > clean pages around.
> >
> > (delayed write is fine, but I'll be upset if vmlinux doesn't show up
> > after I buy more ram;)
>
> Writing out old data is a task independent of the VM. This is a
> job done by kupdate. The only thing the VM does is write pages out
> earlier when it's under memory pressure.

I was joking.

> > > Another solution would be to do some more explicit IO clustering and
> > > only flush _large_ clusters ... no need to invoke extra disk seeks
> > > just to free a single page, unless you only have single pages left.
> >
> > This sounds good.. except I keep thinking about the elevator.
> > Clusters disappear as soon as they hit the queues so clustering
> > at the vm level doesn't make any sense to me.
>
> You should think about the elevator a bit more. Feel for the poor
> thing and try to send it requests it can actually do something
> useful with ;)

I will, and I hope you can help me out with a little more food for
thought.

-Mike

2001-03-01 21:08:02

by Rik van Riel

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Thu, 1 Mar 2001, Mike Galbraith wrote:
> On Thu, 1 Mar 2001, Rik van Riel wrote:
> > On Thu, 1 Mar 2001, Mike Galbraith wrote:

> No no no and again no (perhaps I misread that bit). But otoh,
> you haven't tested the patch I sent in good faith. I sent it
> because I have thought about it. I may be wrong in my
> interpretation of the results, but those results were thought
> about.. and they exist.

I haven't tested it yet for a number of reasons. The most
important one is that the FreeBSD people have been playing
with this thing for a few years now and Matt Dillon has
told me the result of their tests ;)

> > But if the amount of dirtied pages is _small_, it means that we can
> > allow the reads to continue uninterrupted for a while before we
> > flush all dirty pages in one go...
>
> "If wishes were horses, beggers would ride."
>
> There is no mechanism in place that ensures that dirty pages
> can't get out of control, and they do in fact get out of
> control, and it is exacerbated (mho) by attempting to define
> 'too much I/O' without any information to base this definition
> upon.

True. I think we want something in-between our ideas...

> > Also, the elevator can only try to optimise whatever you throw at
> > it. If you throw random requests at the elevator, you cannot expect
> > it to do ANY GOOD ...
>
> This is a very good point (which I will think upon). I ask you this
> in return. Why do you think that the random junk you throw at the
> elevator is different than the random junk I throw at it? ;-) I see
> no difference at all.. it's the same exact junk.

Except that your code throws the random junk at the elevator all
the time, while my code only bothers the elevator every once in
a while. This should make it possible for the disk reads to
continue with less interruptions.

> > The merging at the elevator level only works if the requests sent to
> > it are right next to each other on disk. This means that randomly
> > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > nothing the elevator could ever hope to do about that.
>
> True to some (very real) extent because of the limited buffering
> of requests. However, I can not find any useful information
> that the vm is using to guarantee the IT does not destroy
> performance by your own definition.

Indeed. IMHO we should fix this by putting explicit IO
clustering in the ->writepage() functions.

Doing this, in combination with *WAITING* for dirty pages
to accumulate on the inactive list will give us the
possibility to do more writeout of dirty data with less
disk seeks (and less slowdown of the reads).
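
In toy form (userspace sketch with made-up structures, not the 2.4
->writepage() interface), the clustering idea looks something like:
when asked to write one dirty page, flush the whole contiguous dirty
run around it, so one seek covers many pages.

#include <stdbool.h>
#include <stdio.h>

#define FILE_PAGES 16

static bool dirty[FILE_PAGES];	/* dirty bits for one toy file */

/* flush the contiguous dirty run containing 'index'; return pages written */
static int clustered_writepage(int index)
{
	int lo = index, hi = index, written = 0;

	if (!dirty[index])
		return 0;
	while (lo > 0 && dirty[lo - 1])
		lo--;
	while (hi < FILE_PAGES - 1 && dirty[hi + 1])
		hi++;
	for (int i = lo; i <= hi; i++) {
		dirty[i] = false;	/* pretend the write was queued */
		written++;
	}
	printf("writepage(%d): flushed pages %d..%d in one go\n", index, lo, hi);
	return written;
}

int main(void)
{
	dirty[4] = dirty[5] = dirty[6] = dirty[7] = true;
	clustered_writepage(5);		/* flushes 4..7, not just page 5 */
	return 0;
}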

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

Rik van Riel wrote:

[ ... ]

> Except that your code throws the random junk at the elevator all
> the time, while my code only bothers the elevator every once in
> a while. This should make it possible for the disk reads to
> continue with less interruptions.
>

Couldn't agree with you more. The elevator does a decent job
these days, but higher level clustering could do more ...

[ ...]

> Indeed. IMHO we should fix this by putting explicit IO
> clustering in the ->writepage() functions.

Enhancing writepage() to perform clustering is the first step.
In addition you want entities (kupdated, kswapd, et al.)
that currently work only with buffers to invoke writepage()
at appropriate points. Just today I sent a patch that does this
(and also combines delayed allocation) out to Al Viro for comments.
If anyone else is interested I can send it out to the list.

ananth.

--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------

2001-03-01 22:25:31

by Alan

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

> There is no mechanism in place that ensures that dirty pages can't
> get out of control, and they do in fact get out of control, and it
> is exacerbated (mho) by attempting to define 'too much I/O' without
> any information to base this definition upon.

I think this is a good point. If you do 'too much I/O' then the I/O gets
throttled by submit_bh(). The block I/O layer knows about 'too much I/O'.

2001-03-01 22:33:42

by Alan

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

> Except that your code throws the random junk at the elevator all
> the time, while my code only bothers the elevator every once in
> a while. This should make it possible for the disk reads to
> continue with less interruptions.

Think about it this way, throwing the stuff at the I/O layer is saying
'please make this go away'. That's the VM decision. Scheduling the I/O is an
I/O and driver layer decision.




2001-03-01 22:35:51

by Chris Evans

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4


On Thu, 1 Mar 2001, Rik van Riel wrote:

> True. I think we want something in-between our ideas...
^^^^^^^
> a while. This should make it possible for the disk reads to
^^^^^^

Oh dear.. not more "vm design by waving hands in the air". Come on people,
improve the vm by careful profiling, tweaking and benching, not by
throwing random patches in that seem cool in theory.

Cheers
Chris

2001-03-01 22:38:21

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4


On Thu, 1 Mar 2001, Chris Evans wrote:

>
> On Thu, 1 Mar 2001, Rik van Riel wrote:
>
> > True. I think we want something in-between our ideas...
> ^^^^^^^
> > a while. This should make it possible for the disk reads to
> ^^^^^^
>
> Oh dear.. not more "vm design by waving hands in the air". Come on people,
> improve the vm by careful profiling, tweaking and benching, not by
> throwing random patches in that seem cool in theory.

OTOH, "careful profiling, tweaking and benching" are always limited to a
number workloads.



2001-03-01 22:39:21

by Rik van Riel

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Thu, 1 Mar 2001, Chris Evans wrote:
> On Thu, 1 Mar 2001, Rik van Riel wrote:
>
> > True. I think we want something in-between our ideas...
> ^^^^^^^
> > a while. This should make it possible for the disk reads to
> ^^^^^^
>
> Oh dear.. not more "vm design by waving hands in the air". Come
> on people, improve the vm by careful profiling, tweaking and
> benching, not by throwing random patches in that seem cool in
> theory.

Actually, this was more of "vm design by looking at what
the FreeBSD folks did, why it didn't work and how they
fixed it after 2 years of testing various things".

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-02 03:00:44

by Linus Torvalds

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

In article <[email protected]>,
Rik van Riel <[email protected]> wrote:
>
>I haven't tested it yet for a number of reasons. The most
>important one is that the FreeBSD people have been playing
>with this thing for a few years now and Matt Dillon has
>told me the result of their tests ;)

Note that the Linux VM is certainly different enough that I doubt the
comparisons are all that valid. Especially actual virtual memory mapping
is basically from another planet altogether, and heuristics that are
appropriate for *BSD may not really translate all that well.

I'll take numbers over talk any day. At least Mike had numbers, and
possible explanations for them. He also removed more code than he added,
which is always a good sign.

In short, please don't argue against numbers.

Linus

2001-03-02 04:41:17

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Thu, 1 Mar 2001, Chris Evans wrote:

> Oh dear.. not more "vm design by waving hands in the air". Come on people,
> improve the vm by careful profiling, tweaking and benching, not by
> throwing random patches in that seem cool in theory.

Excuse me.. we're trying to have a _constructive_ conversation here.

-Mike

2001-03-02 05:19:20

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Thu, 1 Mar 2001, Rik van Riel wrote:

> > > The merging at the elevator level only works if the requests sent to
> > > it are right next to each other on disk. This means that randomly
> > > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > > nothing the elevator could ever hope to do about that.
> >
> > True to some (very real) extent because of the limited buffering
> > of requests. However, I can not find any useful information
> > that the vm is using to guarantee the IT does not destroy
> > performance by your own definition.
>
> Indeed. IMHO we should fix this by putting explicit IO
> clustering in the ->writepage() functions.

I notice there's a patch sitting in my mailbox.. think I'll go read
it and think (grunt grunt;) about this issue some more.

Thanks for the input Rik. I appreciate it.

-Mike

2001-03-02 16:12:31

by Rik van Riel

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On 1 Mar 2001, Linus Torvalds wrote:
> In article <[email protected]>,
> Rik van Riel <[email protected]> wrote:
> >
> >I haven't tested it yet for a number of reasons. The most
> >important one is that the FreeBSD people have been playing
> >with this thing for a few years now and Matt Dillon has
> >told me the result of their tests ;)
>
> Note that the Linux VM is certainly different enough that I
> doubt the comparisons are all that valid. Especially actual
> virtual memory mapping is basically from another planet
> altogether, and heuristics that are appropriate for *BSD may not
> really translate all that well.

The main difference is that under Linux the size of the
inactive list is dynamic, while under FreeBSD the system
always tries to keep a (very) large inactive list around.

I'm not sure if, or how, this would influence the percentage
of dirty pages on the inactive list or how often we'd need to
flush something to disk as opposed to reclaiming clean pages.

> I'll take numbers over talk any day. At least Mike had numbers,

The only number I saw when reading over this thread was that
Mike found that under one workload he tested the Linux kernel
ended up doing IO anyway about 2/3rds of the time.

This would also mean we'd be able to _avoid_ IO 1/3rd of the
time ;)

> In short, please don't argue against numbers.

I'm not arguing against his numbers, all I want to know is
if the patch has the same positive effect on other workloads
as well...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-07 00:08:09

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4



On Fri, 2 Mar 2001, Mike Galbraith wrote:

> On Thu, 1 Mar 2001, Rik van Riel wrote:
>
> > > > The merging at the elevator level only works if the requests sent to
> > > > it are right next to each other on disk. This means that randomly
> > > > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > > > nothing the elevator could ever hope to do about that.
> > >
> > > True to some (very real) extent because of the limited buffering
> > > of requests. However, I can not find any useful information
> > > that the vm is using to guarantee the IT does not destroy
> > > performance by your own definition.
> >
> > Indeed. IMHO we should fix this by putting explicit IO
> > clustering in the ->writepage() functions.
>
> I notice there's a patch sitting in my mailbox.. think I'll go read
> it and think (grunt grunt;) about this issue some more.

Mike,

One important piece of information which is not being considered by
page_launder() right now is the dirty buffers watermark.

In general, it should not try to avoid writing dirty pages if we're above
the dirty buffers watermark.
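
In sketch form (invented names and numbers, not the fs/buffer.c
interface), the suggestion amounts to something like:

#include <stdbool.h>
#include <stdio.h>

static long nr_dirty_buffers = 900;		/* toy numbers */
static long dirty_buffer_watermark = 600;

static bool over_dirty_watermark(void)
{
	return nr_dirty_buffers > dirty_buffer_watermark;
}

/* decide whether a dirty page may be skipped on this launder pass */
static bool may_defer_write(bool first_pass)
{
	if (over_dirty_watermark())
		return false;		/* already too much dirty data: write now */
	return first_pass;		/* otherwise keep deferring the IO */
}

int main(void)
{
	printf("defer on first pass? %s\n", may_defer_write(true) ? "yes" : "no");
	return 0;
}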

2001-03-07 07:58:55

by Mike Galbraith

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

On Tue, 6 Mar 2001, Marcelo Tosatti wrote:

> On Fri, 2 Mar 2001, Mike Galbraith wrote:
>
> > On Thu, 1 Mar 2001, Rik van Riel wrote:
> >
> > > > > The merging at the elevator level only works if the requests sent to
> > > > > it are right next to each other on disk. This means that randomly
> > > > > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > > > > nothing the elevator could ever hope to do about that.
> > > >
> > > > True to some (very real) extent because of the limited buffering
> > > > of requests. However, I can not find any useful information
> > > > that the vm is using to guarantee the IT does not destroy
> > > > performance by your own definition.
> > >
> > > Indeed. IMHO we should fix this by putting explicit IO
> > > clustering in the ->writepage() functions.
> >
> > I notice there's a patch sitting in my mailbox.. think I'll go read
> > it and think (grunt grunt;) about this issue some more.
>
> Mike,
>
> One important piece of information which is not being considered by
> page_launder() right now is the dirty buffers watermark.
>
> In general, it should not try to avoid writing dirty pages if we're above
> the dirty buffers watermark.

Agreed in theory.. I'll go try to measure.

-Mike

2002-01-16 21:45:20

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4



On Wed, 28 Feb 2001, Mike Galbraith wrote:

> On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
>
> > On Wed, 28 Feb 2001, Mike Galbraith wrote:
> >
> > > > > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > > > > see if the system still swaps out too much?
> > > >
> > > > Not yet, but will do.
> >
> > But what about swapping behaviour?
> >
> > It still swaps too much?
>
> Yes.
>
> (returning to study mode)

Ok, I'm stupid. Changing SWAP_SHIFT will just balance the nr of tasks and
the per-task nr of scanned pte's, but not (roughly) the total nr of
ptes scanned. I thought it would decrease the number of scanned ptes from
3% (which the current -ac code does) to 0.3% (which Linus' tree does) of
the total ptes.

The problem seems to be multiple users calling swap_out() with 3% of ptes
being scanned. This will unmap ptes way too heavily for common workloads.

It's magic number tuning.. but nothing better can be done for 2.4, I think.
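
Rough arithmetic behind that, as a toy program (the counter formula is
the one visible in the vmscan.c hunk at the top of the thread; the
per-mm amount scaling as rss >> SWAP_SHIFT is an assumption made here
only to show why the product barely moves):

#include <stdio.h>

int main(void)
{
	long mmlist_nr = 64, priority = 6, rss = 4096;	/* made-up values */

	for (int swap_shift = 5; swap_shift >= 4; swap_shift--) {
		long nr_mms = (mmlist_nr << swap_shift) >> priority;	/* mm's visited */
		long per_mm = rss >> swap_shift;	/* assumed per-mm scan */
		printf("SWAP_SHIFT=%d: %ld mm's x %ld ptes = %ld ptes total\n",
		       swap_shift, nr_mms, per_mm, nr_mms * per_mm);
	}
	return 0;
}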



2002-01-16 23:28:28

by Marcelo Tosatti

Subject: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4


Duh, ignore that message.

While looking into my "postponed messages" folder I found this and
accidentally sent it.

Sorry.

On Wed, 16 Jan 2002, Marcelo Tosatti wrote:

>
>
> On Wed, 28 Feb 2001, Mike Galbraith wrote:
>
> > On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> >
> > > On Wed, 28 Feb 2001, Mike Galbraith wrote:
> > >
> > > > > > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > > > > > see if the system still swaps out too much?
> > > > >
> > > > > Not yet, but will do.
> > >
> > > But what about swapping behaviour?
> > >
> > > It still swaps too much?
> >
> > Yes.
> >
> > (returning to study mode)
>
> Ok, I'm stupid. Changing SWAP_SHIFT will just balance the nr of tasks and
> the per-task nr of scanned pte's, but not (roughly) the total nr of
> ptes scanned. I thought it would decrease the number of scanned ptes from
> 3% (which the current -ac code does) to 0.3% (which Linus' tree does) of
> the total ptes.
>
> The problem seems to be multiple users calling swap_out() with 3% of ptes
> being scanned. This will unmap ptes way too heavily for common workloads.
>
> It's magic number tuning.. but nothing better can be done for 2.4, I think.