On September 29, 2001 03:20 am, Rik van Riel wrote:
> > Is it normal to have Inact_target 1/4 of main memory (64MB of 256MB RAM)?
> > In previous versions, this value would fluctuate with the load of the
> > system.
> >
> > Is this expected?
>
> Yes, this is a 'compensation' for the fact that page aging changed
> from exponential to linear. The combination of linear page aging
> with a large inactive_target results in a good combination of
> frequency- and recency-based page eviction.
>
> Doing just linear page aging with a small inactive target resulted
> in worse throughput than exponential page aging for some workloads,
> better for other workloads. Linear page aging with a large inactive
> target results in good throughput and latency under all workloads I've
> found up to now. As usual, thanks go out to Matt Dillon for finding
> this balancing point.
Nice. With this under control, another feature of his memory manager you
could look at is the variable deactivation threshold, which makes a whole lot
more sense now that the aging is linear. To implement it efficiently
PAGE_AGE_DECL just needs to be a variable, since in effect the deactivation
threshold already is exactly PAGE_AGE_DECL.
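A minimal sketch of why the decrement acts as the deactivation threshold, written in Python rather than kernel C. The constants AGE_ADV and AGE_MAX and all function names are illustrative assumptions, not the kernel's actual definitions; `decl` stands in for a PAGE_AGE_DECL that has been made variable.

```python
AGE_ADV = 3    # assumed increment when a page is found referenced
AGE_MAX = 64   # assumed cap on page age


def age_up(age):
    """Referenced page: advance its age, saturating at AGE_MAX."""
    return min(age + AGE_ADV, AGE_MAX)


def age_down_exponential(age):
    """Old scheme: halve the age on each scan without a reference."""
    return age // 2


def age_down_linear(age, decl):
    """New scheme: subtract a fixed amount, never going below zero."""
    return max(age - decl, 0)


def scan_page(age, referenced, decl):
    """One scan of an active page under linear aging.

    Returns (new_age, deactivate). A page is deactivated when its age
    reaches zero, so any unreferenced page with age <= decl goes in a
    single scan: decl is the threshold.
    """
    if referenced:
        return age_up(age), False
    new_age = age_down_linear(age, decl)
    return new_age, new_age == 0
```

With decl = 1 an unreferenced page of age 5 needs five scans before it is deactivated; with decl = 5 it goes on the first scan.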
How to set this variable is a deep and interesting question. Matt had his
ideas on that as you know, and in fact it's a key feature of the BSD
mm, but it's far from clear that the BSD arrangement could be used
directly in Linux. There are a number of obvious difficulties: no reverse
map, highmem, more caches to balance, and so on. However, it's intuitively
clear that the mm sweet spot can be made bigger by controlling the DECL
variable, i.e., we can push the thrash point further out for a wider variety
of loads.
Obligatory disclaimer: there is no burning issue here; this is a
*developmental* idea.
--
Daniel
On Mon, 1 Oct 2001, Daniel Phillips wrote:
> Nice. With this under control, another feature of his memory manager
> you could look at is the variable deactivation threshold, which makes
> a whole lot more sense now that the aging is linear.
Actually, when we get to the point where deactivating enough
pages is hard, we know the working set is large and we should
be _more careful_ in choosing what to page out...
When we go one step further, where the working set approaches
the size of physical memory, we should probably start doing
load control FreeBSD-style ... pick a process and deactivate
as many of its pages as possible. By introducing unfairness
like this we'll be sure that only one or two processes will
slow down on the next VM load spike, instead of all processes.
Once we reach permanent heavy overload, we should start doing
process scheduling, restricting the active processes to a
subset of all processes in such a way that the active processes
are able to make progress. After a while, give other processes
their chance to run.
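The last step could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the queue of (name, working set size) pairs, the function name pick_runnable, and the rule of always admitting at least one process are all hypothetical, not anything taken from FreeBSD or Linux.

```python
from collections import deque


def pick_runnable(queue, memory_pages):
    """Choose the subset of processes allowed to run in this period.

    queue is a deque of (name, working_set_pages). Processes are admitted
    from the front while their working sets fit in memory_pages; each
    admitted process is rotated to the back so that the others reach the
    front in a later period. At least one process is always admitted so
    that something makes progress.
    """
    admitted = []
    used = 0
    for _ in range(len(queue)):
        name, ws = queue[0]
        if admitted and used + ws > memory_pages:
            break
        queue.rotate(-1)  # move the admitted process to the back
        admitted.append(name)
        used += ws
    return admitted
```

Calling it repeatedly on the same queue cycles through the processes, so each period runs only a set that fits in memory.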
regards,
Rik
--
IA64: a worthy successor to i860.
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to [email protected] (spam digging piggy)
Rik van Riel <[email protected]>:
> On Mon, 1 Oct 2001, Daniel Phillips wrote:
>
> > Nice. With this under control, another feature of his memory manager
> > you could look at is the variable deactivation threshold, which makes
> > a whole lot more sense now that the aging is linear.
>
> Actually, when we get to the point where deactivating enough
> pages is hard, we know the working set is large and we should
> be _more careful_ in choosing what to page out...
>
> When we go one step further, where the working set approaches
> the size of physical memory, we should probably start doing
> load control FreeBSD-style ... pick a process and deactivate
> as many of its pages as possible. By introducing unfairness
> like this we'll be sure that only one or two processes will
> slow down on the next VM load spike, instead of all processes.
>
> Once we reach permanent heavy overload, we should start doing
> process scheduling, restricting the active processes to a
> subset of all processes in such a way that the active processes
> are able to make progress. After a while, give other processes
> their chance to run.
Just a comment:
This begins to sound like the old VMS handling:
1. When not loaded down, all processes allocate freely.
2. When getting tight, trim all processes down some amount, until enough is
free (balanced by page fault rate measure - process with the lowest fault
rate gets trimmed first).
3. Continue trimming until required space available or all processes are at
their working set minimum.
4. If still tight, swap a process completely (determined by length of time
since last IO wait - larger CPU bound jobs/processes got swapped first),
reclaim memory. Note, at this point OOM may occur.
5. If swap full, do not start new processes (ENOMEM)
6. When a process exits, reclaim memory - if working set minimum available
then swap in a process.
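Steps 2 and 3 above can be sketched like this, in Python. It is a rough model from the description only, not actual VMS behaviour: the Proc fields, the trim function and the step size are assumptions made for illustration, and a real system would re-measure fault rates between rounds rather than treat them as fixed.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    resident: int      # pages currently in the working set
    ws_min: int        # working set minimum, in pages
    fault_rate: float  # recent page faults per second


def trim(procs, needed, step=8):
    """Trim working sets, lowest fault rate first.

    Stops when `needed` pages have been reclaimed or when every process
    is down to its working set minimum. Returns the pages reclaimed.
    """
    reclaimed = 0
    while reclaimed < needed:
        candidates = [p for p in procs if p.resident > p.ws_min]
        if not candidates:
            break  # step 4 (swapping out a whole process) would start here
        victim = min(candidates, key=lambda p: p.fault_rate)
        take = min(step, victim.resident - victim.ws_min, needed - reclaimed)
        victim.resident -= take
        reclaimed += take
    return reclaimed
```

If the return value is smaller than `needed`, trimming alone was not enough and the caller would have to move on to whole-process swapping.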
I also vaguely remember something about processes spawning new processes -
if memory wasn't immediately available (working set minimum for the new
process) then the process attempting the spawn is put to sleep (or swapped,
or both - this may have only occurred if there was room in swap for the
process, if not - ENOMEM on the fork, in case that causes the parent to
exit and free more memory).
The trimming action did not immediately cause a pageout - all that was
needed was to reduce the working set size. The process that needed memory
would then cause the system to scan memory for pages that could be freed.
The first process examined (may have been the process asking for memory)
would have the excess pages paged out. (I believe they were chosen by a
LRU mechanism)
There was also a scheduling fairness rule about swapped processes getting
a schedule increment of 1, in memory processes got incremented 4, IO wait
processes got +6. When they were selected for run: if previous state was IO,
then decrement by 2, if state run, decrement by 2. If a swapped process
schedule value > in memory process, swap the memory resident process out,
swap in the swapped process. (Obviously this isn't quite right :-)
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
On October 1, 2001 04:49 pm, Jesse Pollard wrote:
> 5. If swap full, do not start new processes (ENOMEM)
I was going to pounce on this one, but then I read the rest of your post...
> I also vaguely remember something about processes spawning new processes -
> if memory wasn't immediately available (working set minimum for the new
> process) then the process attempting the spawn is put to sleep (or swapped,
> or both - this may have only occurred if there was room in swap for the
> process, if not - ENOMEM on the fork, in case that causes the parent to
> exit and free more memory).
Yes, here it should degrade gracefully as well. Child-spawning tasks should
be made to wait an increasingly long time as pressure increases before
they start seeing a lot of ENOMEMs. Also, such penalties must be carefully
targeted so as not to prevent, for example, a new root login for
administrative purposes. Under tight memory conditions we would want to
target any task that spawns children rapidly, which would constitute a sane
form of fork bomb control: it's OK to spawn many tasks rapidly as long as
memory is lightly loaded.
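One possible shape for such a rule, sketched in Python. Every number and name here is an assumption chosen for illustration (the two pressure limits, the base delay, the doubling rule, the `privileged` exemption standing in for the root login case); nothing like this exists in the kernel as far as this thread says.

```python
import errno

LIGHT_LOAD_PCT = 50   # assumed: below this memory pressure, never delay
HARD_LIMIT_PCT = 98   # assumed: at or above this, rapid spawners fail
BASE_DELAY_MS = 10    # assumed starting delay


def fork_policy(pressure_pct, recent_forks, privileged=False):
    """Decide what a fork attempt should see.

    pressure_pct is memory pressure as an integer percentage and
    recent_forks is how many children the task spawned recently.
    Returns ("ok", 0), ("wait", delay_ms) or ("fail", errno.ENOMEM).
    """
    if privileged or pressure_pct < LIGHT_LOAD_PCT:
        return ("ok", 0)
    if pressure_pct >= HARD_LIMIT_PCT and recent_forks > 0:
        return ("fail", errno.ENOMEM)
    # Delay doubles for every ten points of pressure above light load
    # and scales with how fast the task has been spawning children.
    steps = (pressure_pct - LIGHT_LOAD_PCT) // 10
    delay_ms = BASE_DELAY_MS * (2 ** steps) * max(recent_forks, 1)
    return ("wait", delay_ms)
```

A task that forks rarely sees only a short wait even under heavy pressure, while a rapid spawner waits longer and longer and gets ENOMEM only near the limit.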
Another weapon we can add to our arsenal is the possibility of suspending
tasks to non-swap storage, which would effectively add a second level of swap
space as large as all the free space on your disk. Equivalently but perhaps
more usefully, we could allow swap files to grow dynamically.
Implementing such complex policy seems a distant goal considering that we are
still far from even being able to make an accurate OOM determination.
However, I have a suggestion. Such policy is exactly that, policy, and as
such should be implemented outside the kernel. We just need to expose the
relevant statistics and vm/scheduler control hooks, taking care that the task
responsible for scheduling policy never becomes its own victim. This is a
much smaller and more clearly defined task than actually implementing the
task control policy.
> The trimming action did not immediately cause a pageout - all that was
> needed was to reduce the working set size. The process that needed memory
> would then cause the system to scan memory for pages that could be freed.
> The first process examined (may have been the process asking for memory)
> would have the excess pages paged out. (I believe they were chosen by a
> LRU mechanism)
>
> There was also a scheduling fairness rule about swapped processes getting
> a schedule increment of 1, in memory processes got incremented 4, IO wait
> processes got +6. When they were selected for run: if previous state was IO,
> then decrement by 2, if state run, decrement by 2. If a swapped process
> schedule value > in memory process, swap the memory resident process out,
> swap in the swapped process. (Obviously this isn't quite right :-)
Wouldn't you love to be able to tweak this policy from user space, in a
language of your choice, on a running system? ;-)
--
Daniel
On October 1, 2001 03:57 pm, Rik van Riel wrote:
> On Mon, 1 Oct 2001, Daniel Phillips wrote:
>
> > Nice. With this under control, another feature of his memory manager
> > you could look at is the variable deactivation threshold, which makes
> > a whole lot more sense now that the aging is linear.
>
> Actually, when we get to the point where deactivating enough
> pages is hard, we know the working set is large and we should
> be _more careful_ in choosing what to page out...
Naturally. However, this is orthogonal. Consider the case where you've hit
the wall and the inactive list has suffered sudden depletion. At this point
you have to deactivate a large number of pages and you will have few or no
intervening age-up events (because you hit the wall and nobody's moving).
It's a useless waste of CPU and real time to cycle through the active list 5
times to deactivate enough pages. You should cycle through at most twice,
once to age up any pages with Ref set and the second time to deactivate the
required number of pages according to a threshold you estimated on the first
pass.
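A rough sketch of that two-pass scheme, in Python. The Page class, the constants and the age histogram used to estimate the threshold are assumptions made for illustration; page ages are assumed to stay within 0..AGE_MAX.

```python
from dataclasses import dataclass

AGE_ADV = 3    # assumed increment for a referenced page
AGE_MAX = 64   # assumed cap on page age


@dataclass
class Page:
    age: int
    referenced: bool = False
    active: bool = True


def two_pass_deactivate(pages, wanted):
    """Deactivate up to `wanted` pages in two walks of the active list.

    Returns (threshold, number_deactivated).
    """
    # Pass 1: age up referenced pages and count active pages per age.
    hist = [0] * (AGE_MAX + 1)
    for p in pages:
        if not p.active:
            continue
        if p.referenced:
            p.age = min(p.age + AGE_ADV, AGE_MAX)
            p.referenced = False
        hist[p.age] += 1

    # Estimate the lowest age threshold that covers `wanted` pages.
    threshold, covered = AGE_MAX, 0
    for age, count in enumerate(hist):
        covered += count
        if covered >= wanted:
            threshold = age
            break

    # Pass 2: deactivate pages at or below the threshold.
    deactivated = 0
    for p in pages:
        if deactivated >= wanted:
            break
        if p.active and p.age <= threshold:
            p.active = False
            deactivated += 1
    return threshold, deactivated
```

The cost is two walks of the list no matter how high the ages are, instead of one walk per age decrement.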
This is just the first common example that came to mind where a variable
deactivation threshold is obviously desirable; I'm sure there are others.
> When we go one step further, where the working set approaches
> the size of physical memory, we should probably start doing
> load control FreeBSD-style ... pick a process and deactivate
> as many of its pages as possible. By introducing unfairness
> like this we'll be sure that only one or two processes will
> slow down on the next VM load spike, instead of all processes.
>
> Once we reach permanent heavy overload, we should start doing
> process scheduling, restricting the active processes to a
> subset of all processes in such a way that the active processes
> are able to make progress. After a while, give other processes
> their chance to run.
No question about the need for higher level process control, but the low
level machinery could still be improved, don't you think?
--
Daniel