2001-10-09 22:01:14

by Xuan Baldauf

Subject: dynamic swap prioritizing

Hello,

I have a Linux box with 3 hard disks of different
characteristics (size, seek time, throughput), each capable
of holding a swap partition. Sometimes one hard disk is
driven heavily (e.g. by a database application), sometimes
another hard disk is busy.

I imagine the following optimization:
- all swap partitions have the same priority from the start
- runtime statistics are gathered covering response time
(time from page request to availability)
- the fastest drive is used first (or maybe in striping mode,
in parallel with the second-fastest drive)
- because the fastest drive will be busier, its response
times will rise until they equal those of the other drives
- at that point, other drives are also considered for
swapout
- the system regularly adapts its decisions based on
recent statistics ("recent" is a tuning parameter)
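
The list above can be sketched in user-space Python, purely as an
illustration and not as kernel code. The names ResponseStats,
pick_swapout_device and window_seconds are hypothetical; window_seconds
stands in for the "recent" tuning parameter, and a device without recent
samples is assumed to count as idle.

```python
from collections import deque


class ResponseStats:
    """Response times for one swap device, limited to a recent time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds  # the "recent" tuning parameter
        self.samples = deque()                # (timestamp, response_time) pairs

    def record(self, now, response_time):
        self.samples.append((now, response_time))

    def average(self, now):
        # Drop samples that are older than the window before averaging.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()
        if not self.samples:
            return 0.0  # assumption: no recent data means idle and fast
        return sum(rt for _, rt in self.samples) / len(self.samples)


def pick_swapout_device(stats_by_device, now):
    """Return the device name with the lowest recent average response time."""
    return min(stats_by_device, key=lambda name: stats_by_device[name].average(now))
```

As the preferred device gets busier and its recorded response times
rise, the minimum moves to another device, which is the balancing
effect described in the list.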

Such an algorithm also would properly prioritize
network-swap and video-memory-swap, reducing time and cost
of a manual priority configuration (and statistics
gathering).

Does the linux kernel already implement such an
optimization? Is it planned?

Xuân.



2001-10-10 01:43:32

by Rik van Riel

Subject: Re: dynamic swap prioritizing

On Wed, 10 Oct 2001, Xuan Baldauf wrote:

> Does the linux kernel already implement such an
> optimization? Is it planned?

No and no. But feel free to try to implement it.

I'm not sure if it would be a win to have the
system do this dynamic swap priority readjustment,
but I wouldn't be surprised if it was, either.

regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)

http://www.surriel.com/ http://distro.conectiva.com/

2001-10-10 03:35:55

by Andreas Dilger

Subject: Re: dynamic swap prioritizing

On Oct 10, 2001 00:01 +0200, Xuan Baldauf wrote:
> I have a linux box with 3 harddisks of different
> characteristics (size, seek time, throughput), each capable
> of holding a swap partition. Sometimes, one harddisk is
> driven heavily (e.g. database application), sometimes, the
> other harddisk is busy.

Daniel Phillips was working on something which may be useful in
this regard. Basically, he was trying to determine how "busy" a
disk was, so that if there are dirty pages to be written on an
"idle" disk they would be written immediately. The theory is
that if you wait longer, the disk may be busy with other I/O and
you have "wasted" the resource of disk bandwidth doing nothing.

Similarly, if you knew how busy each disk with a swap partition
was, you could swap to the most idle disk (assuming equal speed)
or at least take this into account if the speeds are different.

If this is to be generally useful, it would be good to find things
like max sequential read speed, max sequential write speed, and max
seek time (at least). Estimates for max sequential read speed and
seek time could be found at boot time for each disk relatively
easily, but write speed may have to be found only at runtime (or
it could all be fed in to the kernel from user space from benchmarks
run previously).

Once we had data like that, it would be relatively easy to keep
track of the queue depth for each device to determine "busyness"
and estimated time to an empty queue and make intelligent disk
I/O scheduling decisions (e.g. which MD RAID 1 disk to read from,
which disk to swap to, guaranteed I/O rate for XFS, etc).
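
A rough sketch of such an estimate, assuming one seek per queued request
plus the transfer time for the queued bytes. DiskModel and its fields are
made-up names for illustration; a real block layer would have to track
these values itself.

```python
from dataclasses import dataclass


@dataclass
class DiskModel:
    name: str
    seek_time_s: float         # benchmarked seek time, seconds
    throughput_bytes_s: float  # benchmarked sequential throughput, bytes/second
    queued_requests: int = 0   # current queue depth
    queued_bytes: int = 0      # bytes waiting in the queue


def time_to_empty_queue(disk):
    """Rough estimate: one seek per queued request plus transfer time."""
    return (disk.queued_requests * disk.seek_time_s
            + disk.queued_bytes / disk.throughput_bytes_s)


def least_busy(disks):
    """Return the disk expected to drain its queue first."""
    return min(disks, key=time_to_empty_queue)
```

The same comparison could serve for choosing a RAID 1 mirror to read
from or a swap device to write to.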

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-10 08:39:27

by Helge Hafting

Subject: Re: dynamic swap prioritizing

Xuan Baldauf wrote:
>
> Hello,
>
> I have a linux box with 3 harddisks of different
> characteristics (size, seek time, throughput), each capable
[algorithm for changing swap priorities]
>
> Does the linux kernel already implement such an
> optimization? Is it planned?
>
It doesn't already. I am not sure this is a kernel thing
either. Changing swap priorities could be done by userspace
programs if there were a syscall or /proc interface
for it.

You could then have a script changing priorities when you start your
database for example. Or a daemon making changes based on
statistics.
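
The policy half of such a daemon might look like the sketch below. It
only computes priorities from measured delays; applying them is left out,
because the interface for changing priorities is exactly what is assumed
not to exist yet. assign_priorities and top_priority are hypothetical
names.

```python
def assign_priorities(delay_by_device, top_priority=100):
    """Map measured delays to swap priorities: lowest delay, highest priority.

    Devices with equal delay get equal priority, so they can be used
    side by side.
    """
    distinct = sorted(set(delay_by_device.values()))
    return {name: top_priority - distinct.index(delay)
            for name, delay in delay_by_device.items()}
```

For example, assign_priorities({"hda2": 0.008, "hdb2": 0.020}) ranks
hda2 above hdb2.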

Helge Hafting

2001-10-10 15:29:28

by Venkatesh Ramamurthy

Subject: Re: dynamic swap prioritizing

> If this is to be generally useful, it would be good to find things
> like max sequential read speed, max sequential write speed, and max
> seek time (at least). Estimates for max sequential read speed and
> seek time could be found at boot time for each disk relatively
> easily, but write speed may have to be found only at runtime (or
> it could all be fed in to the kernel from user space from benchmarks
> run previously).

Maybe we can find out the statistics for the first time (or when swap is
created) and store this information in the swap partition itself. This would
allow us to compute time consuming statistics only once. Also we need to
create new fields in the swap structure for this purpose.


2001-10-10 15:56:09

by Andreas Dilger

Subject: Re: dynamic swap prioritizing

On Oct 10, 2001 11:23 -0400, Venkatesh Ramamurthy wrote:
> > If this is to be generally useful, it would be good to find things
> > like max sequential read speed, max sequential write speed, and max
> > seek time (at least). Estimates for max sequential read speed and
> > seek time could be found at boot time for each disk relatively
> > easily, but write speed may have to be found only at runtime (or
> > it could all be fed in to the kernel from user space from benchmarks
> > run previously).
>
> Maybe we can find out the statistics for the first time (or when swap is
> created) and store this information in the swap partition itself. This would
> allow us to compute time consuming statistics only once. Also we need to
> create new fields in the swap structure for this purpose.

I'd rather just have the statistic data in a regular file for ALL disks,
and then send it to the kernel via ioctl or write to a special file that
the kernel will read from. I don't think it is critical to have this
data right at boot time, since it would only be used for optimizing I/O
access and would not be required for a disk to actually work.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-10 16:53:44

by Venkatesh Ramamurthy

Subject: RE: dynamic swap prioritizing

> I'd rather just have the statistic data in a regular file for ALL disks,
> and then send it to the kernel via ioctl or write to a special file that
> the kernel will read from. I don't think it is critical to have this
> data right at boot time, since it would only be used for optimizing I/O
> access and would not be required for a disk to actually work.

My idea of putting this info on swap was that when the disk is moved
from one system to another, the statistics stay with the swap itself.
If the swap disk (partition) is put in a system with a different
configuration (one that would affect the performance info on the disk),
then we can recompute the statistics; otherwise there is no need to
rerun the utility every time the swap disk is moved.
Also, the kernel would be smart enough to know about the swap
performance without the need for a utility to be invoked to set the
parameters through an ioctl.


2001-10-10 17:14:18

by Richard B. Johnson

Subject: Re: dynamic swap prioritizing


I think that when the kernel needs a page of memory, it needs it
NOW! Swapping a dirty page to other in-memory pages wastes the
very page(s) that you need.

The contents of a least-recently used page should be written to
the swap device to free it for immediate use, regardless of the
disk-write speed, and regardless of how close the kernel thinks
the heads are to some track. Other stuff, like "prioritizing",
wastes the resources you are trying to obtain. Further, all attempts
so far to use "elevator" algorithms to speed up disk access have
failed to provide any measurable improvement in anything. In fact,
buffering until the data will fit on a nearby track wastes
memory pages and the CPU resources necessary to manage them.

In the days when CPUs were slow, memory was scarce, and I/O
was at a crawl, Digital made a VMS system that worked. Using
the same kind of memory handling should be superb nowadays.

(1) You keep a page of zeroed data. This is used by
all processes for new buffers. A single page handles all.
Reads are always allowed. Writes cause a page-fault. This
is called demand-zero paging.

(2) Pages used for shared file mapping are kept in real
memory as long as possible (run-time libraries).

(3) All other pages are available for swapping. The page-
stealer grabs the least-recently used pages from sleeping
processes first. Tasks that are waiting for I/O are the
next to have their least-recently used pages stolen. Tasks
that are waiting for kernel services are the last to have
their least-recently used pages swiped.
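
The stealing order in (3) can be sketched as a two-level sort: first by
the state of the owning process, then by least-recent use. This is only
an illustration of the ordering described above, not VMS or Linux code;
Page, STEAL_ORDER and the state names are invented for the example.

```python
from dataclasses import dataclass

# Lower number = pages are stolen earlier, following the order in (3).
STEAL_ORDER = {"sleeping": 0, "waiting_for_io": 1, "waiting_for_kernel": 2}


@dataclass
class Page:
    owner_state: str   # one of the STEAL_ORDER keys
    last_used: float   # timestamp of the most recent access


def steal_candidates(pages):
    """Sort pages so that the first element is the next one to be stolen."""
    return sorted(pages, key=lambda p: (STEAL_ORDER[p.owner_state], p.last_used))
```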

The Linux kernel is not a task, so "waiting for kernel services"
is not valid here. Everything else is.

Not every machine runs with gigahertz processors where CPU
overhead of keeping track of pages is in the noise.

Additionally, prioritizing based upon some "goodness" puts policy in the
kernel.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

I was going to compile a list of innovations that could be
attributed to Microsoft. Once I realized that Ctrl-Alt-Del
was handled in the BIOS, I found that there aren't any.


2001-10-11 11:30:43

by David Nicol

Subject: OO swap interface


Here's an idea that has been mulling under my mullet for
the last few weeks:

The nbd can be used as a swap device, but since swap has no reason
to inform the drive about what parts of it are free, it is not possible
to have a central nbd server overcommit for multiple client swapping
nodes.

Therefore I wonder how tricky it would be to create a swap interface
that is ignorant of disk geometries. The swap interface language
would accept requests for space, with unique handles, and would
return the swapped-out data on presentation of the handle. Like
a virtual memory hat check.
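
A toy version of the hat-check idea, kept in memory and with invented
names (HatCheckSwap, swap_out, swap_in), might look like this. It is a
sketch of the interface only and assumes that fetching a page also frees
its space.

```python
import itertools


class HatCheckSwap:
    """Stores swapped-out pages by opaque handle, with no notion of disk geometry."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self._pages = {}

    def swap_out(self, data):
        """Store a page and return the unique handle for it."""
        handle = next(self._next_handle)
        self._pages[handle] = bytes(data)
        return handle

    def swap_in(self, handle):
        # Handing the data back also releases the space, so a server knows
        # exactly which handles are still in use.
        return self._pages.pop(handle)

    def in_use(self):
        return len(self._pages)
```

Because the store always knows how many handles are outstanding, a
central server could in principle overcommit across clients.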






--
David Nicol 816.235.1187
1,3,7-trimethylxanthine

2001-10-12 00:47:03

by Xuan Baldauf

Subject: Re: dynamic swap prioritizing



"'[email protected]'" wrote:

> On Oct 10, 2001 11:23 -0400, Venkatesh Ramamurthy wrote:
> > > If this is to be generally useful, it would be good to find things
> > > like max sequential read speed, max sequential write speed, and max
> > > seek time (at least). Estimates for max sequential read speed and
> > > seek time could be found at boot time for each disk relatively
> > > easily, but write speed may have to be found only at runtime (or
> > > it could all be fed in to the kernel from user space from benchmarks
> > > run previously).
> >
> > Maybe we can find out the statistics for the first time (or when swap is
> > created) and store this information in the swap partition itself. This would
> > allow us to compute time consuming statistics only once. Also we need to
> > create new fields in the swap structure for this purpose.
>
> I'd rather just have the statistic data in a regular file for ALL disks,
> and then send it to the kernel via ioctl or write to a special file that
> the kernel will read from. I don't think it is critical to have this
> data right at boot time, since it would only be used for optimizing I/O
> access and would not be required for a disk to actually work.
>
> Cheers, Andreas

Hey people,

why do you want to separate the statistics data out? The statistics are not about
disk throughput, head seek times, etc. They are just about the time between
"needing a page" and "getting that page", which is very abstract. Let's call it
the swapin-delay. It depends not only on disk throughput and head seek times, but
also on "device busyness".

For every swap device, there is a "swap_business" data structure, which covers
- average_swapin_delay
- average_swapin_delay_last_write_timestamp /* timestamp when swapin_delay was last written */

There is a "swap_business_memory_timeout" kernel parameter (accessible via /proc)
which represents the length of a time interval reaching from now into the past.
Disk activity data gathered within this interval is what should be used when
reasoning about future swap decisions.

For every page fault which requires a page to be swapped in, a timestamp is
written to a data structure covering the swapin process. When the page is
available in memory, a function is called which does the following:
- compute the current_swapin_delay for the current swapin
- my_swap_device->average_swapin_delay =
    (current_swapin_delay * (now - average_swapin_delay_last_write_timestamp)
     + my_swap_device->average_swapin_delay
       * (average_swapin_delay_last_write_timestamp
          - (now - swap_business_memory_timeout)))
    / swap_business_memory_timeout;
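
The update can be written out as a small function to make the weighting
visible. This is a user-space sketch with an invented function name. The
weight of the new sample is the share of the memory window that has
passed since the last write, and the old average gets the rest. Clamping
the weight to [0, 1] is an added assumption: without it, the old average
would receive a negative weight whenever the last write is older than
swap_business_memory_timeout.

```python
def update_average_swapin_delay(average, last_write, current_delay, now, memory_timeout):
    """Return the new (average_swapin_delay, last_write_timestamp) pair."""
    # Weight of the new sample: the share of the memory window that has
    # passed since the average was last written, clamped to [0, 1].
    elapsed = now - last_write
    weight_new = min(max(elapsed / memory_timeout, 0.0), 1.0)
    new_average = current_delay * weight_new + average * (1.0 - weight_new)
    return new_average, now
```

With an old average of 10 ms, a new sample of 30 ms, 2 seconds since the
last write and a timeout of 8 seconds, the new sample gets weight 0.25
and the average becomes 15 ms.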

There are some special cases like "no disk activity". In this case, swap_business
is not updated for that device. But maybe the reason for no disk activity is that
the disk is a swap disk and the values of "swap_business" were once so bad that
this device is no longer considered. That would be a "soft deadlock"...

On swapout, the "average_swapin_delay" field of the "swap_business" data
structure of every swap device is compared against the same field of the other
available swap devices. Based on these comparisons, a decision is made where to
do the next swapout.
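
One possible shape for this comparison is sketched below. The way it
handles the "soft deadlock" is an assumption, not part of the proposal:
statistics older than swap_business_memory_timeout are treated as stale,
so a device that was avoided for a long time gets tried again. The names
SwapBusiness and choose_swapout_device are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class SwapBusiness:
    average_swapin_delay: float
    last_write_timestamp: float


def choose_swapout_device(devices, now, memory_timeout):
    """Pick the device name with the lowest usable average_swapin_delay."""

    def effective_delay(name):
        stats = devices[name]
        # Assumption: statistics older than the memory window are stale, so
        # the device counts as having no known delay and gets tried again.
        if now - stats.last_write_timestamp > memory_timeout:
            return 0.0
        return stats.average_swapin_delay

    return min(devices, key=effective_delay)
```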

Because that framework only can bring advantages if there are at least two swap
devices, it can be skipped for the one-swap-device-case (most setups do not have
more than one swap device, but maybe just because the 32MB or 64MB graphics card
(with plenty of mostly unused RAM) needs to be manually configured for swap...)

I hope this makes the concept clearer. I cannot see a reason to create such
statistics in advance and feed them to the kernel somehow. For dynamic systems,
you need dynamic statistics, I think. And "the statistics", in fact, only
consist of two variables per swap device. That is nothing the kernel could not
manage in reasonable time.

Of course, such a feature should be tested for real advantages.

Xuân.


2001-10-12 03:32:43

by Andreas Dilger

Subject: Re: dynamic swap prioritizing

On Oct 12, 2001 02:45 +0200, Xuan Baldauf wrote:
> > I'd rather just have the statistic data in a regular file for ALL disks,
> > and then send it to the kernel via ioctl or write to a special file that
> > the kernel will read from. I don't think it is critical to have this
> > data right at boot time, since it would only be used for optimizing I/O
> > access and would not be required for a disk to actually work.
>
> why do you want to separate statistics data out? The statistics are not
> about disk throughput, head seek times, etc. They are just about the time
> between "needing a page" and "getting that page", which is very abstract.
> Let's call it the swapin-delay. It does not only depend on disk-throughput
> and head seek times, but also on "device business".

What I am saying is that such information is useful for ALL devices, and
not just swap devices. There was a long thread from Daniel Phillips
where he was working on (1) a few months ago. Why is this data useful?

1) You have dirty pages in RAM, when should you write them? The current
system is to delay the write as long as possible in case the dirty
pages are discarded (e.g. temp file) before they need to be written.
However, if the disk is idle during this time, then doing the write
immediately will not impose extra overhead, and will mean that the
dirty page could be freed quickly if there was a need for memory.
2) Swap or MD RAID 1 load balancing. Which device should you write to
(swap) or read from (RAID 1)? If you know how fast/busy each device
is, you can make a better decision on this instead of round-robin.
3) Guaranteed rate I/O. For XFS/XLV on SGI IRIX, you can request a
guaranteed I/O rate for a specific time period (e.g. to record video
or capture data from an experiment) and the system will tell you if
it is possible or not. In the IRIX case, they had data on each
drive to tell them what the performance is in advance, while Linux
would need to do a drive-by-drive benchmark instead.
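
For point 3, the yes/no answer could come from a simple admission check
against a benchmarked device rate, as in the sketch below. This is not
how IRIX does it; RateReservations and its methods are hypothetical, and
the benchmarked rate is assumed to be known already.

```python
class RateReservations:
    """Tracks guaranteed-rate reservations against a benchmarked device rate."""

    def __init__(self, benchmarked_bytes_per_s):
        self.capacity = benchmarked_bytes_per_s
        self.reserved = 0.0

    def request(self, bytes_per_s):
        """Reserve the rate and return True, or return False if it cannot be met."""
        if self.reserved + bytes_per_s > self.capacity:
            return False
        self.reserved += bytes_per_s
        return True

    def release(self, bytes_per_s):
        """Give a reservation back once the guaranteed period is over."""
        self.reserved = max(0.0, self.reserved - bytes_per_s)
```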

A lot of the data needed for this is already part of "sard", but that
is only reporting the data to user space, while some of the above
decisions need to be done inside the kernel on a continuous basis.

Note that I'm NOT saying that having all of this data will improve
system performance (it may slow it down from overhead), but I was just
advocating a broader view of what could be done (and what has already
been done in related areas).

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-12 15:22:33

by Xuan Baldauf

Subject: Re: dynamic swap prioritizing



"'[email protected]'" wrote:

> [...]
> I was just
> advocating a broader view of what could be done (and what has already
> been done in related areas).

Okay, understood. :-)

>
>
> Cheers, Andreas

Xuân.