Andrew could you please consider this for -mm
Small changes to the style after suggestions from Pekka Enberg (thanks), and
changed the default prefetch size to increase gently with the size of ram.
Functionally this is the same code as vm-swap_prefetch-15 and I believe it is
ready for a wider audience.
Cheers,
Con
---
On 10/10/05, Con Kolivas <[email protected]> wrote:
> Andrew could you please consider this for -mm
>
> Small changes to the style after suggestions from Pekka Enberg (thanks), and
> changed the default prefetch size to increase gently with the size of ram.
> Functionally this is the same code as vm-swap_prefetch-15 and I believe it is
> ready for a wider audience.
>
> + What this will do on workstations is slowly bring applications
> + that have been swapped out after memory intensive workloads back
> + into physical ram, if you have free ram at a later stage and the
> + machine is relatively idle. This means that when you come back to
> + your computer after leaving it idle for a while, applications will
> + come to life faster. Note that your swap usage will appear to
> + increase but these are cached pages that can be dropped freely by
> + the vm, and it should stabilise at around 50% swap usage.
> +
> + Desktop users will most likely want to say Y.
How about a little note about the impact for server users as well?
You recommend that desktop users enable this, but you don't give any
recommendation for servers.
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
On Tue, 11 Oct 2005 00:35, Jesper Juhl wrote:
> On 10/10/05, Con Kolivas <[email protected]> wrote:
> > Andrew could you please consider this for -mm
> >
> > Small changes to the style after suggestions from Pekka Enberg (thanks),
> > and changed the default prefetch size to increase gently with the size of
> > ram. Functionally this is the same code as vm-swap_prefetch-15 and I
> > believe it is ready for a wider audience.
>
> > + What this will do on workstations is slowly bring applications
> > + that have been swapped out after memory intensive workloads back
> > + into physical ram, if you have free ram at a later stage and the
> > + machine is relatively idle. This means that when you come back to
> > + your computer after leaving it idle for a while, applications will
> > + come to life faster. Note that your swap usage will appear to
> > + increase but these are cached pages that can be dropped freely by
> > + the vm, and it should stabilise at around 50% swap usage.
> > +
> > + Desktop users will most likely want to say Y.
>
> How about a little note about the impact for server users as well?
> You recommend that desktop users enable this, but you don't give any
> recommendation for servers.
Your guess is as good as mine. I can easily demonstrate a benefit when using
it with desktop workloads, but on a server? It's not expensive to run, but I
don't really know whether it's advantageous there either. If I had to take a
guess, a server that had multiple user logins running applications would
benefit, but I doubt databases, web servers and the like would.
Cheers,
Con
On Tue, 11 Oct 2005 00:23, Con Kolivas wrote:
> Andrew could you please consider this for -mm
>
> Small changes to the style after suggestions from Pekka Enberg (thanks),
> and changed the default size of prefetch to gently increase with size of
> ram. Functionally this is the same code as vm-swap_prefetch-15 and I
> believe ready for a wider audience.
Here's a respin for 2.6.14-rc4
Con
---
Con Kolivas <[email protected]> wrote:
>
> This patch implements swap prefetching when the vm is relatively idle and
> there is free ram available.
> ...
> +/*
> + * Find the zone with the most free pages, recheck the watermarks and
> + * then directly allocate the ram. We don't want prefetch to use
> + * __alloc_pages and go calling on reclaim.
> + */
> +static struct page *prefetch_get_page(void)
> +{
> + struct zone *zone = NULL, *z;
> + struct page *page = NULL;
> + long most_free = 0;
> +
> + for_each_zone(z) {
> + long free;
> +
> + if (z->present_pages == 0)
> + continue;
> +
> + free = z->free_pages;
> +
> + /* We don't prefetch into DMA */
> + if (zone_idx(z) == ZONE_DMA)
> + continue;
> +
> + /* Select the zone with the most free ram */
> + if (free > most_free) {
> + most_free = free;
> + zone = z;
> + }
> + }
> +
> + if (zone == NULL)
> + goto out;
> +
> + page = buffered_rmqueue(zone, 0, GFP_HIGHUSER);
> + if (likely(page)) {
> + struct zonelist *zonelist;
> +
> + zonelist = NODE_DATA(numa_node_id())->node_zonelists +
> + (GFP_HIGHUSER & GFP_ZONEMASK);
> +
> + zone_statistics(zonelist, zone);
> + }
> +out:
> + return page;
> +}
Why use the "zone with most free pages"? Generally it would be better to
use up ZONE_HIGHMEM first: ZONE_NORMAL is valuable.
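Something like this (an untested sketch against the 2.6.14 zone API) would
prefer the most expendable zone instead of the fullest one:

	struct zone *zone = NULL, *z;

	for_each_zone(z) {
		if (z->present_pages == 0)
			continue;
		/* We don't prefetch into DMA */
		if (zone_idx(z) == ZONE_DMA)
			continue;
		/* Skip zones without a comfortable margin of free pages */
		if (z->free_pages <= z->pages_high)
			continue;
		/* Prefer the highest (most expendable) zone index */
		if (zone == NULL || zone_idx(z) > zone_idx(zone))
			zone = z;
	}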
> +/*
> + * We want to be absolutely certain it's ok to start prefetching.
> + */
> +static int prefetch_suitable(void)
> +{
> + struct page_state ps;
> + unsigned long pending_writes, limit;
> + struct zone *z;
> + int ret = 0;
> +
> + /* Purposefully racy and might return false positive which is ok */
> + if (__test_and_clear_bit(0, &swapped.busy))
> + goto out;
> +
> + temp_free = 0;
> + /*
> + * Have some hysteresis between where page reclaiming and prefetching
> + * will occur to prevent ping-ponging between them.
> + */
> + for_each_zone(z) {
> + unsigned long free;
> +
> + if (z->present_pages == 0)
> + continue;
> + free = z->free_pages;
> + if (z->pages_high * 3 > free)
> + goto out;
> + temp_free += free;
> + }
> +
> + /*
> + * We check to see that pages are not being allocated elsewhere
> + * at any significant rate implying any degree of memory pressure
> + * (eg during file reads)
> + */
> + if (last_free) {
> + if (temp_free + SWAP_CLUSTER_MAX + prefetch_pages() <
> + last_free) {
> + last_free = temp_free;
> + goto out;
> + }
> + } else
> + last_free = temp_free;
> +
> + get_page_state(&ps);
> +
> + /* We shouldn't prefetch when we are doing writeback */
> + if (ps.nr_writeback)
> + goto out;
Yeah, this really needs to become per-disk-queue-aware.
> + /* Delay prefetching if we have significant amounts of dirty data */
> + pending_writes = ps.nr_dirty + ps.nr_unstable;
> + if (pending_writes > SWAP_CLUSTER_MAX)
> + goto out;
Surely this is too aggressive. There are almost always a few tens of dirty
pages floating about, especially when atime updates are enabled. I'd
suggest that you stick a printk in here - I expect you'll find that this
test triggers a lot - too much.
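Something like the following (illustrative only) would show how often this
bail-out fires:

	if (pending_writes > SWAP_CLUSTER_MAX) {
		printk(KERN_DEBUG "prefetch_suitable: %lu pending writes\n",
			pending_writes);
		goto out;
	}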
> + /* >2/3 of the ram is mapped, we need some free for pagecache */
> + limit = ps.nr_mapped + ps.nr_slab + pending_writes;
> + if (limit > mapped_limit)
> + goto out;
> +
> + /*
> + * Add swapcache to limit as well, but check this last since it needs
> + * locking
> + */
> + if (unlikely(!read_trylock(&swapper_space.tree_lock)))
> + goto out;
> + limit += total_swapcache_pages;
> + read_unlock(&swapper_space.tree_lock);
I'd just not bother with the locking at all here.
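i.e. just do the racy read:

	/* Unlocked read is fine here - the total is only a heuristic */
	limit += total_swapcache_pages;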
> +static int kprefetchd(void *data)
> +{
> + DEFINE_WAIT(wait);
> +
> + daemonize("kprefetchd");
kthread(), please.
> + set_user_nice(current, 19);
> + /* Set ioprio to lowest if supported by i/o scheduler */
> + sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
> +
> + for ( ; ; ) {
> + enum trickle_return prefetched;
> +
> + try_to_freeze();
> + prepare_to_wait(&kprefetchd_wait, &wait, TASK_INTERRUPTIBLE);
> + schedule();
> + finish_wait(&kprefetchd_wait, &wait);
> +
> + /* FAILED implies no entries left - the timer is not reset */
> + prefetched = trickle_swap();
> + switch (prefetched) {
> + case SUCCESS:
> + last_free = temp_free;
> + reset_prefetch_timer();
> + break;
> + case DELAY:
> + last_free = 0;
> + delay_prefetch_timer();
> + break;
> + case FAILED:
> + last_free = 0;
> + break;
> + }
> + }
> + return 0;
> +}
> +
> +/*
> + * Wake up kprefetchd. It will reset the timer itself appropriately so no
> + * need to do it here
> + */
> +static void prefetch_wakeup(unsigned long data)
> +{
> + if (waitqueue_active(&kprefetchd_wait))
> + wake_up_interruptible(&kprefetchd_wait);
> +}
> +
> +static int __init kprefetchd_init(void)
> +{
> + /*
> + * Prepare the prefetch timer. It is inactive until entries are placed
> + * on the swapped_list
> + */
> + init_timer(&prefetch_timer);
> + prefetch_timer.data = 0;
> + prefetch_timer.function = prefetch_wakeup;
> +
> + kernel_thread(kprefetchd, NULL, CLONE_KERNEL);
> +
> + return 0;
> +}
Might be able to use a boring old wake_up_process() here rather than a
waitqueue.
Is the timer actually needed? Could just do schedule_timeout() in
kprefetchd()?
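For illustration, an untested sketch of that shape - kthread_run() instead of
daemonize(), schedule_timeout() instead of the timer (PREFETCH_DELAY is a
hypothetical interval, and the DELAY/FAILED timer handling is elided):

	static int kprefetchd(void *data)
	{
		set_user_nice(current, 19);

		while (!kthread_should_stop()) {
			try_to_freeze();
			set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(PREFETCH_DELAY);

			/*
			 * Producers adding entries to the swapped_list
			 * can wake_up_process() this task instead of
			 * using a waitqueue.
			 */
			if (trickle_swap() == SUCCESS)
				last_free = temp_free;
			else
				last_free = 0;
		}
		return 0;
	}

and in kprefetchd_init():

	kthread_run(kprefetchd, NULL, "kprefetchd");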
On Wed, 12 Oct 2005 15:34, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:
> > + /* Select the zone with the most free ram */
> > + if (free > most_free) {
> > + most_free = free;
> Why use the "zone with most free pages"? Generally it would be better to
> use up ZONE_HIGHMEM first: ZONE_NORMAL is valuable.
Ok. Sounds fair.
> > + /* We shouldn't prefetch when we are doing writeback */
> > + if (ps.nr_writeback)
> > + goto out;
>
> Yeah, this really needs to become per-disk-queue-aware.
I looked, but it started looking like I was going to over-engineer it.
> > + /* Delay prefetching if we have significant amounts of dirty data */
> > + pending_writes = ps.nr_dirty + ps.nr_unstable;
> > + if (pending_writes > SWAP_CLUSTER_MAX)
> > + goto out;
>
> Surely this is too aggressive. There are almost always a few tens of dirty
> pages floating about, especially when atime updates are enabled. I'd
> suggest that you stick a printk in here - I expect you'll find that this
> test triggers a lot - too much.
Actually I was quite aware of how frequently this hits. What I found in
practice was that the amount of dirty ram was an extraordinarily good marker
of whether the system was globally idle / under low stress or not. It did not
seem to stop prefetching from occurring in the real world on the machines I
tried it on.
> > + if (unlikely(!read_trylock(&swapper_space.tree_lock)))
> > + goto out;
> > + limit += total_swapcache_pages;
> > + read_unlock(&swapper_space.tree_lock);
>
> I'd just not bother with the locking at all here.
Ok.
> > + daemonize("kprefetchd");
>
> kthread(), please.
Check.
> > + init_timer(&prefetch_timer);
> > + prefetch_timer.data = 0;
> > + prefetch_timer.function = prefetch_wakeup;
> > +
> > + kernel_thread(kprefetchd, NULL, CLONE_KERNEL);
> > +
> > + return 0;
> > +}
>
> Might be able to use a boring old wake_up_process() here rather than a
> waitqueue.
>
> Is the timer actually needed? Could just do schedule_timeout() in
> kprefetchd()?
I guess. The timer just made it easy to start and stop it completely before I
turned prefetch into a daemon and it kinda stayed that way. It's not run that
frequently and only does minuscule things in that context; is it of a
significant advantage?
Thanks very much!
Cheers,
Con
[snip snip]
On Wed, 12 Oct 2005 15:34, Andrew Morton wrote:
> Why use the "zone with most free pages"? Generally it would be better to
> use up ZONE_HIGHMEM first: ZONE_NORMAL is valuable.
> I'd just not bother with the locking at all here.
> kthread(), please.
> Might be able to use a boring old wake_up_process() here rather than a
> waitqueue.
> Is the timer actually needed? Could just do schedule_timeout() in
> kprefetchd()?
Ok how's this look? On top of your patches.
Cheers,
Con