Today's patch tackles the use-once problem, that is, the problem of
how to identify and discard pages containing data likely to be used
only once in a long time, while retaining pages that are used more
often.
I'll try to explain not only what I did, but the process I went
through to arrive at this particular approach. This requires some
background.
Recent History of Linux Memory Management
-----------------------------------------
Rik van Riel developed our current memory management regime early last
fall, more or less on a dare by Linus.[1] At that time the big
question was, could the poorly-performing 2.3 memory manager be fixed
up in some way or should it be rewritten from scratch? Rik favored a
complete rewrite in the 2.5 timeframe whereas Linus wanted to see
incremental improvements to the existing code base. After a little
flam^H^H^H^H encouragement from Linus, Rik did manage to produce a
patch that improved the performance of 2.3 considerably.
Rik's patch was loosely inspired by Matt Dillon's FreeBSD work.[2] At
that time, FreeBSD enjoyed a clear lead in memory management
performance over Linux, so it made sense to use his work as a model.
What Rik came up with though is really his own creation[3] and differs
in a number of fundamental respects from Matt's work. After a few
months of working out minor problems this is essentially what we have
now. It works well under typical loads but has tended to suffer from
variable or even horrible performance under certain rarer loads and
machine configurations. Over time some of the problems have been
tracked down and fixed, the most recent being Marcelo's identification
and eradication of the zone-imbalance performance bug.[4]
During this whole period, I've been studying Rik's memory manager with
a view to understanding how it does what it does, and thus, how to go
about improving it. I'll try to describe the essential details now, as I
see them.
What Rik's Memory Manager Does
------------------------------
Every physical page of memory in the system can be on at most one of
three "page_lru" lists. They are:
1) active
2) inactive_dirty (misnomer - has both clean and dirty pages)
3) inactive_clean (misnomer - really is 'inactive_bufferless')
List (1) is an unordered ring while list (3) is a fifo queue. List
(2) is somewhere in between, a fifo for clean pages and a ring for
dirty pages. The transitions are:
active -> inactive_dirty, if a page is an eviction candidate
inactive_dirty -> inactive_clean, if a page is not dirty
inactive_clean -> active, if a page is referenced
inactive_dirty -> active, if a page is referenced
The memory manager's job is to scan each of these lists from time to
time and effect the appropriate transitions. To decide which pages
should be considered for eviction, the memory manager walks around the
active ring "aging" pages, "up" if they have been referenced since the
last scan and "down" otherwise. A page whose age has reached zero is
considered to be a good candidate for eviction (a "victim") and is
moved to the inactive_dirty queue (note: regardless of whether it is
actually dirty). The memory manager also scans the "old" end of the
inactive_dirty queue. Any page that has been used since being put on
the queue is moved to the active ring. Unreferenced clean pages
have their buffers (if any) removed and are moved to the inactive_clean
queue. Unreferenced dirty pages are scheduled for writeout and moved
back to the "young" end of the queue, to be rescanned later.
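In rough code, the two scans look something like the toy model below. This
is only my illustration of the description above - the names, numbers and
details are simplified and are not the actual 2.4 functions.

/* Toy model of the two scans; NOT the real 2.4 code. */
#include <stdbool.h>

enum lru_list { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN };

struct page {
	int age;		/* 0 == eviction candidate */
	bool referenced;	/* "used since last scan" bit */
	bool dirty;
	enum lru_list list;
};

/* One step of the active-ring scan: age the page up if it was
 * referenced, down otherwise, and demote it when its age hits zero
 * (whether or not it is actually dirty). */
static void scan_active(struct page *page)
{
	if (page->referenced) {
		page->referenced = false;
		page->age += 3;			/* age up */
	} else if (page->age) {
		page->age /= 2;			/* age down */
	}
	if (!page->age)
		page->list = INACTIVE_DIRTY;	/* eviction candidate */
}

/* One step of the scan at the "old" end of the inactive_dirty queue. */
static void scan_inactive_dirty(struct page *page)
{
	if (page->referenced) {
		page->referenced = false;
		page->list = ACTIVE;		/* rescued */
	} else if (!page->dirty) {
		/* strip buffers, if any, then demote */
		page->list = INACTIVE_CLEAN;
	} else {
		/* dirty: schedule writeout, keep it on the queue,
		 * rescan later from the "young" end */
	}
}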
All of this scanning is driven by demand for pages. To anticipate
future demand, the memory manager tries to keep the inactive_clean list
longer than a certain minimum, and this in turn determines the rate at
which the other two lists are scanned.
In this short tour of Linux's memory manager I've glossed over many
subtle details and complexities. Sorry. This is just to provide some
context for discussing the problem at hand, and my approach to solving
it. Now on to the theoretical part.
Why it works
------------
The two fifo queues can be thought of as a single, two-stage queue. So
what we have is an unordered ring and a queue. As described above, the
ring is used to find eviction candidates via the aging process, which
gives us information about both how frequently and how recently a given
page was used, mixed together inseparably. Ultimately, being recently
used is more important than being frequently used, so to find out
whether a page really isn't being used we put it on the inactive queue.
If it goes all the way from one end of the inactive queue to the other
without being referenced we then know quite a lot more about it, and
can evict it with a reasonable expectation that it will not need to be
re-read immediately. (Notice how this is all kind of approximate -
memory management is like that.)
Why it doesn't always work
--------------------------
This is the tough question. If we knew the whole answer, it would
already be working perfectly under all conditions, or we would have
moved on to a different design. Recently though, some of the more
obscure problems have begun to yield to analysis. In particular,
Marcelo demonstrated how important it is to know what the memory manager
is actually doing, and developed a statistical reporting interface
(soon to be included in the standard kernel) which allowed him to
isolate and fix the obscure zone-balancing problem. There are also
more of us coming up to speed on understanding the memory manager, both
the part inherited from 2.2 and the new part designed by Rik.
Use-once Pages
--------------
One of the things we want the memory manager to do is to assign very
low priority to data that tends to be used once and then not again for
a long time. Pages containing such data should be identified and
evicted as soon as possible. Typically, these "use-once" pages are
generated by file IO, either reads or writes, and reside in the page
cache. The question is, how do we distinguish between "use-once" file
data and file data that we should keep in memory, expecting it to be
used more than once in a short period? Up till now we've used a simple
strategy: always expect that file data is "use-once" and queue it for
eviction immediately after being used. If it really is going to be
used more than once, then hopefully it will be used again before it
reaches the other end of the inactive queue and thus be "rescued" from
eviction.
Rik added code to the memory manager to implement this idea, which he
called "drop-behind", and system performance did improve. (I verified
this by disabling the code, and sure enough, it got worse again.) But
there are several problems with the drop-behind strategy:
- Deactivates even popular file pages - every time a popular
file page is touched by generic_file_read or generic_file_write
it gets deactivated, losing historical information and putting
the page at risk of being evicted, should its next access come
a little too late.
- Busy work - readahead pages circulate on the active list before
being used and deactivated, which does nothing other than
consume CPU cycles.
- Not general - drop-behind is a heuristic specific to file I/O and
shows a bias towards one particular access pattern, perhaps at
the expense of, say, a database-like access pattern. If, for
some reason, less than an entire file is read, the readahead
pages will not be deactivated.
- Doesn't do anything for memory-mapped files.
LRU/2 and 2Q
------------
Enter the LRU/2 Algorithm[5] which provides a specific prescription
aimed at optimizing for database loads with a mix of use-once pages and
use-often pages. The idea is to choose for eviction the page whose
second-most-recent access was furthest in the past, and not pay any
attention at all to the time of the most recent access. This algorithm
has been shown to be more efficient than classic LRU for real-world
loads but is difficult to implement efficiently. The 2Q algorithm,
discussed on lkml recently, approximates the behavior of LRU/2,
essentially by ignoring the first use of a page then managing all pages
referenced at least twice using a normal LRU. Both LRU/2 and 2Q show
significant improvements over LRU, on the order of 5-15%, in testing
against captured real-world database access patterns.
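To make the 2Q idea a bit more concrete, here is a toy userspace sketch of
the simplified variant just described - my own illustration, not the
algorithm as published (the A1out "ghost" queue and the tuning parameters
are omitted) and nothing like the kernel's actual data structures. The
first reference to a page parks it on a small FIFO, a second reference
promotes it to a normal LRU, and victims come from the FIFO first.

/* Toy 2Q sketch, illustrative only. */
#include <stdio.h>

#define A1IN_MAX 2			/* FIFO of pages seen exactly once */
#define AM_MAX   3			/* LRU of pages seen more than once */

static int a1in[A1IN_MAX], a1in_len;	/* index 0 = youngest */
static int am[AM_MAX], am_len;		/* index 0 = most recently used */

static int find(const int *list, int len, int page)
{
	for (int i = 0; i < len; i++)
		if (list[i] == page)
			return i;
	return -1;
}

static void remove_at(int *list, int *len, int i)
{
	for (; i < *len - 1; i++)
		list[i] = list[i + 1];
	(*len)--;
}

static void push_front(int *list, int *len, int max, int page, const char *name)
{
	if (*len == max) {		/* full: evict from the tail */
		printf("  evict page %d from %s\n", list[*len - 1], name);
		(*len)--;
	}
	for (int i = *len; i > 0; i--)
		list[i] = list[i - 1];
	list[0] = page;
	(*len)++;
}

static void reference(int page)
{
	int i;

	printf("reference page %d\n", page);
	if ((i = find(am, am_len, page)) >= 0) {
		remove_at(am, &am_len, i);		/* normal LRU touch */
		push_front(am, &am_len, AM_MAX, page, "am");
	} else if ((i = find(a1in, a1in_len, page)) >= 0) {
		remove_at(a1in, &a1in_len, i);		/* second use: promote */
		push_front(am, &am_len, AM_MAX, page, "am");
	} else {
		push_front(a1in, &a1in_len, A1IN_MAX, page, "a1in");	/* first use */
	}
}

int main(void)
{
	/* one hot page (0) interleaved with a use-once scan (1 2 3 4):
	 * the use-once pages get pushed out of a1in while page 0 stays
	 * comfortably in am. */
	int trace[] = { 0, 1, 0, 2, 3, 0, 4, 0 };

	for (unsigned i = 0; i < sizeof(trace) / sizeof(*trace); i++)
		reference(trace[i]);
	return 0;
}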
My Approach
-----------
The 2Q algorithm has a major disadvantage from my point of view: it
uses a LRU list.[6] We don't, we use the aging strategy described
above. There are several reasons to keep doing it this way:
- It's more efficient to update an age than to relink a LRU item
- Aging captures both frequency and temporal information
- Aging offers more scope for fine tuning
- It's what we're doing now - changing would be a lot of work
So I decided to look for a new way of approaching the use-once problem
that would integrate easily with our current approach. What I came
up with is pretty simple: instead of starting a newly allocated page on
the active ring, I start it on the inactive queue with an age of zero.
Then, any time generic_file_read or write references a page, if its
age is zero, set its age to START_AGE and mark it as unreferenced.
This has the effect of scheduling all file pages read or written the
first time for early eviction, but introduces a delay before the
eviction actually occurs. If the file page is referenced relatively
soon thereafter it will be "rescued" from eviction as described above,
and thereafter aged normally on the active ring. Subsequent reads and
writes of the same cached page age it normally and do not cause it to
be "unreferenced" again, because its age is not zero. (Just noticed
while composing this: the attached patch does not provide properly for
the case where a page deactivated by aging is subsequently
read/written, and it should.)
Implementing this idea was not as easy as it sounds. At first, I hoped
for a one-line implementation:
- add_page_to_active_list(page);
+ add_page_to_inactive_dirty_list(page);
but alas this was not to be. The actual implementation required three
days of chasing after odd effects caused by redundant tests in the page
scanning code, and learning a lot more about the life cycles of page
cache pages. During this process I managed to disable aging of
read/write pages completely, resulting in a "fifo" eviction policy.
This degraded performance by about 60%. (Truly remarkable not only
because page io is far from my test case's only activity, but because
it shows just how well we are already doing versus a truly naive
eviction policy.) This was especially discouraging because I had
imagined I was quite close to actually achieving some optimization.
However, changing just one line took me from that disaster to a version
that magically turned in a performance 12% better than I'd seen with
the existing code.
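For reference, the mechanism that finally worked boils down to two small
hooks. This is a condensed paraphrase of the patch at the end of this
mail, with the locking, debug checks and surrounding context trimmed:

/* Newly allocated pages start on the inactive_dirty queue with age
 * zero instead of on the active ring (mm/swap.c hunk): */
	add_page_to_inactive_dirty_list(page);
	page->age = 0;

/* ...and generic_file_read/write call this after touching a page.
 * The first touch gives the page a nonzero age but clears the
 * referenced bit, so the page still has to earn its rescue; later
 * touches age it normally because the age is no longer zero. */
inline void check_used_once (struct page *page)
{
	if (!page->age)
	{
		page->age = PAGE_AGE_START;
		ClearPageReferenced(page);
	}
}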
In retrospect, if I'd been smarter about this I would have applied
Marcelo's statistical patch at the beginning of the process. That
would have saved me a lot of time spent hunting around in the dark,
using only the test case running time for feedback.[7]
Test Load
---------
For a test load, I used the following:
time -o /tmp/time1 make clean bzImage &
time -o /tmp/time2 grep -r foo /usr/src >/dev/null 2>/dev/null
In other words, a clean make of the kernel concurrent with a grep of my
(rather large) source directory. By themselves, the make requires a
little more than 12 minutes and the grep about 15 minutes. The stock
2.4.5 kernel turned in a performance of slightly better than 20 minutes
for the two simultaneously: very respectable. All of my "improved"
kernels did worse than this, up until the last one, which did
considerably better. (I think the following principle applies: you
always find what you are looking for in the last place you look.)
I chose to combine the make and grep in order to create an I/O-bound
situation, but not without some complexity and with real work being
done. With optimal performance, I imagined the disk light would be on
all, or nearly all the time. This turned out to be the case.
Unfortunately, it also turned out to be the case with my disastrous
"fifo" kernel. In the latter case, the light was always on because
pages frequently needed to be read back in immediately after being
evicted.
This is probably true though: any time the disk light goes out during
this test it's trying to tell us we're not finished optimizing yet.
Timings
-------
The grep always takes longer than the make, so the numbers for the grep
are the total elapsed time for both processes to complete and are thus
more important. It's nice to see that the make always runs faster with
the patch though - this shows that the system is treating the two
processes fairly and the grep is not starving the make too badly for
disk bandwidth. (However, I did note with interest that the
disk-intensive make step involving 'tmppiggy' takes 8 times longer when
running in parallel with the grep than by itself - there is some poor
IO scheduling behavior to investigate there.)
Here is the executive summary:
          2.4.5, vanilla      2.4.5+use.once.patch
  make    16:29.87            15:46.67
  grep    20:09.15            17:37.33
So, with the patch, the two processes run to completion 2 1/2 minutes
faster, a savings of about 13%. These results are quite repeatable and
considerably better than what I had hoped for when I began this work.
Detailed timings are given below. Interestingly, the timings for the
two processes running separately do not vary nearly as much: the make
runs at virtually the same speed with or without the patch, and the
grep runs 54 seconds faster, much less than the 2 1/2 minute difference
seen when the processes run in parallel.
vanilla 2.2.14
make
475.38user 59.71system 16:27.28elapsed 54%CPU
0inputs+0outputs (461932major+513611minor)pagefaults 268swaps
grep
22.14user 71.81system 20:09.15elapsed 7%CPU
0inputs+0outputs (366064major+7236minor)pagefaults 110swaps
vanilla 2.4.5
make
305.70user 229.08system 16:29.87elapsed 54%CPU
0inputs+0outputs (460034major+533055minor)pagefaults 0swaps
grep
13.48user 85.18system 19:49.90elapsed 8%CPU
0inputs+0outputs (1839major+20261minor)pagefaults 0swaps
vanilla 2.4.5 (again)
make
306.05user 229.74system 16:34.83elapsed 53%CPU
0inputs+0outputs (457841major+531251minor)pagefaults 0swaps
grep
13.64user 84.12system 19:52.66elapsed 8%CPU
0inputs+0outputs (2104major+16537minor)pagefaults 0swaps
use.once.patch
make
306.34user 253.36system 15:46.67elapsed 59%CPU
0inputs+0outputs (451569major+522397minor)pagefaults 0swaps
grep
13.20user 81.60system 17:37.33elapsed 8%CPU
0inputs+0outputs (369major+11632minor)pagefaults 0swaps
use.once.patch (again)
make
306.30user 244.88system 15:31.69elapsed 59%CPU
0inputs+0outputs (451352major+522484minor)pagefaults 0swaps
grep
13.50user 82.88system 17:36.84elapsed 9%CPU
0inputs+0outputs (379major+14643minor)pagefaults 0swaps
vanilla 2.4.5, just the make
638.42user 27.96system 12:03.33elapsed 92%CPU
0inputs+0outputs (450134major+520691minor)pagefaults 0swaps
use.once.patch, just the make
632.39user 28.37system 11:58.59elapsed 91%CPU
0inputs+0outputs (450135major+520725minor)pagefaults 0swaps
vanilla 2.4.5, just the grep
13.08user 62.54system 13:43.16elapsed 9%CPU
0inputs+0outputs (1316major+21906minor)pagefaults 0swaps
use.once.patch, just the grep
13.00user 61.05system 12:49.37elapsed 9%CPU
0inputs+0outputs (259major+7586minor)pagefaults 0swaps
[1] http://mail.nl.linux.org/linux-mm/2000-08/msg00023.html
(Linus goads Rik on to great things)
[2] http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
(Rik's discussion with Matt Dillon)
[3] http://mail.nl.linux.org/linux-mm/2000-05/msg00418.html
(Rik's RFC for his 2.3/4 memory manager design)
[4] http://marc.theaimsgroup.com/?l=linux-kernel&m=99542192302579&w=2
(Marcelo's fix to the zone balancing problem)
[5] http://www.informatik.uni-trier.de/~ley/db/conf/vldb/vldb94-439.html
(Original paper on the 2Q algorithm)
[6] A further disadvantage of the 2Q Algorithm is that it is oriented
towards database files with a relatively limited number of blocks in
them, as opposed to files of effectively unlimited size as are common
in a general-purpose operating system.
[7] While doing this work I like to imagine how it must feel to work
on a tokamak prototype fusion reactor or z-pinch machine. In these
projects, scientists and engineers try to put theory into practice by
squeezing ever more power out of the same, weird and wonderful piece of
hardware. Sometimes new ideas work as expected, and sometimes the
plasma just goes unstable.
The Patch
---------
This is an experimental patch, for comment. There is a small amount
of raciness in the check_used_once function, on some architectures,
but this would at worst lead to suboptimal performance. The patch has
been completely stable for me, and it may just be my imagination, but
interactive performance seems to be somewhat improved.
To apply:
cd /usr/src/your-2.4.5-tree
patch <this.patch -p0
--- ../2.4.5.clean/mm/filemap.c Thu May 31 17:30:45 2001
+++ ./mm/filemap.c Mon Jul 23 17:42:40 2001
@@ -770,51 +770,6 @@
#endif
/*
- * We combine this with read-ahead to deactivate pages when we
- * think there's sequential IO going on. Note that this is
- * harmless since we don't actually evict the pages from memory
- * but just move them to the inactive list.
- *
- * TODO:
- * - make the readahead code smarter
- * - move readahead to the VMA level so we can do the same
- * trick with mmap()
- *
- * Rik van Riel, 2000
- */
-static void drop_behind(struct file * file, unsigned long index)
-{
- struct inode *inode = file->f_dentry->d_inode;
- struct address_space *mapping = inode->i_mapping;
- struct page *page;
- unsigned long start;
-
- /* Nothing to drop-behind if we're on the first page. */
- if (!index)
- return;
-
- if (index > file->f_rawin)
- start = index - file->f_rawin;
- else
- start = 0;
-
- /*
- * Go backwards from index-1 and drop all pages in the
- * readahead window. Since the readahead window may have
- * been increased since the last time we were called, we
- * stop when the page isn't there.
- */
- spin_lock(&pagecache_lock);
- while (--index >= start) {
- page = __find_page_simple(mapping, index);
- if (!page)
- break;
- deactivate_page(page);
- }
- spin_unlock(&pagecache_lock);
-}
-
-/*
* Read-ahead profiling information
* --------------------------------
* Every PROFILE_MAXREADCOUNT, the following information is written
@@ -1033,12 +988,6 @@
if (filp->f_ramax > max_readahead)
filp->f_ramax = max_readahead;
- /*
- * Move the pages that have already been passed
- * to the inactive list.
- */
- drop_behind(filp, index);
-
#ifdef PROFILE_READAHEAD
profile_readahead((reada_ok == 2), filp);
#endif
@@ -1048,6 +997,15 @@
}
+inline void check_used_once (struct page *page)
+{
+ if (!page->age)
+ {
+ page->age = PAGE_AGE_START;
+ ClearPageReferenced(page);
+ }
+}
+
/*
* This is a generic file read routine, and uses the
* inode->i_op->readpage() function for the actual low-level
@@ -1163,7 +1121,8 @@
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
-
+
+ check_used_once (page);
page_cache_release(page);
if (ret == nr && desc->count)
continue;
@@ -2597,7 +2556,6 @@
while (count) {
unsigned long index, offset;
char *kaddr;
- int deactivate = 1;
/*
* Try to find the page in the cache. If it isn't there,
@@ -2606,10 +2564,8 @@
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count) {
+ if (bytes > count)
bytes = count;
- deactivate = 0;
- }
/*
* Bring in the user page that we will copy from _first_.
@@ -2653,8 +2609,7 @@
unlock:
/* Mark it unlocked again and drop the page.. */
UnlockPage(page);
- if (deactivate)
- deactivate_page(page);
+ check_used_once(page);
page_cache_release(page);
if (status < 0)
--- ../2.4.5.clean/mm/swap.c Mon Jan 22 22:30:21 2001
+++ ./mm/swap.c Mon Jul 23 17:38:25 2001
@@ -231,11 +231,8 @@
spin_lock(&pagemap_lru_lock);
if (!PageLocked(page))
BUG();
- DEBUG_ADD_PAGE
- add_page_to_active_list(page);
- /* This should be relatively rare */
- if (!page->age)
- deactivate_page_nolock(page);
+ add_page_to_inactive_dirty_list(page);
+ page->age = 0;
spin_unlock(&pagemap_lru_lock);
}
--- ../2.4.5.clean/mm/vmscan.c Sat May 26 02:00:18 2001
+++ ./mm/vmscan.c Mon Jul 23 17:25:27 2001
@@ -359,10 +359,10 @@
}
/* Page is or was in use? Move it to the active list. */
- if (PageReferenced(page) || page->age > 0 ||
- (!page->buffers && page_count(page) > 1)) {
+ if (PageReferenced(page) || (!page->buffers && page_count(page) > 1)) {
del_page_from_inactive_clean_list(page);
add_page_to_active_list(page);
+ page->age = PAGE_AGE_START;
continue;
}
@@ -462,11 +462,11 @@
}
/* Page is or was in use? Move it to the active list. */
- if (PageReferenced(page) || page->age > 0 ||
- (!page->buffers && page_count(page) > 1) ||
+ if (PageReferenced(page) || (!page->buffers && page_count(page) > 1) ||
page_ramdisk(page)) {
del_page_from_inactive_dirty_list(page);
add_page_to_active_list(page);
+ page->age = PAGE_AGE_START;
continue;
}
On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
> So I decided to look for a new way of approaching the use-once problem
> that would easily integrated with our current approach. What I came
> up with is pretty simple: instead of starting a newly allocated page on
> the active ring, I start it on the inactive queue with an age of zero.
> Then, any time generic_file_read or write references a page, if its
> age is zero, set its age to START_AGE and mark it as unreferenced.
This is wonderfully simple and elegant.
Jim
Hey, this looks _really_ nice. I never liked the special-cases that you
removed (drop_behind in particular), and I have to say that the new code
looks a lot saner, even without your extensive description and timing
analysis.
Please people, test this out extensively - I'd love to integrate it, but
while it looks very sane I'd really like to hear of different peoples
reactions to it under different loads.
Linus
On Tue, 24 Jul 2001 [email protected] wrote:
> On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
>
> > So I decided to look for a new way of approaching the use-once problem
> > that would easily integrated with our current approach. What I came
> > up with is pretty simple: instead of starting a newly allocated page on
> > the active ring, I start it on the inactive queue with an age of zero.
> > Then, any time generic_file_read or write references a page, if its
> > age is zero, set its age to START_AGE and mark it as unreferenced.
>
> This is wonderfully simple and elegant.
It's a tad incorrect, though.
If a page gets 2 1-byte reads in a microsecond, with this
patch it would get promoted to the active list, even though
it's really only used once.
Preferably we'd want to have a "new" list of, say, 5% maximum
of RAM size, where all references to a page are ignored. Any
references made after the page falls off that list would be
counted for page replacement purposes.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
[ CC: to linux-kernel added back since the topic isn't flammable ;) ]
On Tue, 24 Jul 2001, Stuart MacDonald wrote:
> From: "Rik van Riel" <[email protected]>
> > If a page gets 2 1-byte reads in a microsecond, with this
> > patch it would get promoted to the active list, even though
> > it's really only used once.
>
> I hate to bother you directly, but I don't wish to start a flame
> war on lkml. How exactly would you explain two accesses as
> being "used once"?
Because they occur in a very short interval, an interval MUCH
shorter than the time scale in which the VM subsystem looks at
referenced bits, etc...
Generally a CPU doesn't read more than one cache line at a time,
so I guess all "single references" are in reality multiple
accesses very shortly following each other.
This seems to be generally accepted theory and practice in the
field of page replacement.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
From: "Rik van Riel" <[email protected]>
> Because they occur in a very short interval, an interval MUCH
> shorter than the time scale in which the VM subsystem looks at
> referenced bits, etc...
So what you're saying is that any number of accesses, as long
as they all occur within the interval < VM subsystem scanning
interval, are all counted as one?
> This seems to be generally accepted theory and practice in the
> field of page replacement.
Okay, just seems odd to me, but IANAVMGuru.
..Stu
On Tue, 24 Jul 2001, Linus Torvalds wrote:
> Hey, this looks _really_ nice. I never liked the special-cases
> that you removed (drop_behind in particular), and I have to say
> that the new code looks a lot saner, even without your extensive
> description and timing analysis.
Fully agreed, drop_behind is an ugly hack. The sooner
it dies the happier I am ;)
> Please people, test this out extensively - I'd love to integrate
> it, but while it looks very sane I'd really like to hear of
> different peoples reactions to it under different loads.
The one thing which has always come up in LRU/k and 2Q
papers is that the "first reference" can really be a
series of references in a very short time.
Counting only the very first reference will fail if we
do eg. sequential IO with non-page aligned read() calls,
which doesn't look like it's too uncommon.
In order to prevent this from happening, either the system
counts all first references in a short timespan (complex to
implement) or it has the new pages on a special - small fixed
size - page list and all references to the page while on that
list are ignored.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tue, Jul 24, 2001 at 02:07:32PM -0300, Rik van Riel wrote:
> > I hate to bother you directly, but I don't wish to start a flame
> > war on lkml. How exactly would you explain two accesses as
> > being "used once"?
>
> Because they occur in a very short interval, an interval MUCH
> shorter than the time scale in which the VM subsystem looks at
> referenced bits, etc...
Would mmap() then a for(;;) loop over the range be an example of such a
use?
mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc
On Tue, 24 Jul 2001, Stuart MacDonald wrote:
> From: "Rik van Riel" <[email protected]>
> > Because they occur in a very short interval, an interval MUCH
> > shorter than the time scale in which the VM subsystem looks at
> > referenced bits, etc...
>
> So what you're saying is that any number of accesses, as long
> as they all occur within the interval < VM subsystem scanning
> interval, are all counted as one?
Actually, the length of this interval could be even smaller
and is often a point of furious debating.
The 2Q algorithm seems to have solved this problem by not
using an interval, but a FIFO queue of small, fixed length.
> > This seems to be generally accepted theory and practice in the
> > field of page replacement.
>
> Okay, just seems odd to me, but IANAVMGuru.
Seems odd at first glance, true.
Let me give you an example:
- sequential access of a file
- script reads the file in 80-byte segments
(parsing some arcane data structure)
- these segments are accessed in rapid succession
- each 80-byte segment is accessed ONCE
In this case, even though the data is accessed only
once, each page is touched PAGE_SIZE/80 times, with
one 80-byte read() each time.
I hope this example gives you something to hold on to
and makes it easier to grasp this - somewhat
counter-intuitive - concept.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tue, 24 Jul 2001, Mike Castle wrote:
> On Tue, Jul 24, 2001 at 02:07:32PM -0300, Rik van Riel wrote:
> > > I hate to bother you directly, but I don't wish to start a flame
> > > war on lkml. How exactly would you explain two accesses as
> > > being "used once"?
> >
> > Because they occur in a very short interval, an interval MUCH
> > shorter than the time scale in which the VM subsystem looks at
> > referenced bits, etc...
>
> Would mmap() then a for(;;) loop over the range be an example of
> such a use?
Yes. The problem here, however, would be that we'll only find
the referenced bit in the page table some time later.
This is another reason for having a "new" queue like 2Q's A1in
queue. It means we can both count all accesses in the first short
period as one access AND we can avoid actually doing anything
special as these accesses happen....
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
From: "Rik van Riel" <[email protected]>
> Actually, the length of this interval could be even smaller
> and is often a point of furious debating.
Which is why I was avoiding flames earlier; I sorta
figured this might be a hot issue.
> Let me give you an example:
>
> - sequential access of a file
> - script reads the file in 80-byte segments
> (parsing some arcane data structure)
> - these segments are accessed in rapid succession
> - each 80-byte segment is accessed ONCE
>
> In this case, even though the data is accessed only
> once, each page is touched PAGE_SIZE/80 times, with
> one 80-byte read() each time.
I'd figured that sort of scenario was the basis for counting
them all as one. What about
- random access of same file
- script reads one arcane 80 byte struct
- script updates that struct say PAGE_SIZE/80
times, with one 80-byte write() each time
If they're all counted as one, would the page age
correctly if the script happens to take a few seconds'
break before another flurry of all-as-one updating?
..Stu
On Tuesday 24 July 2001 19:04, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Linus Torvalds wrote:
> > Hey, this looks _really_ nice. I never liked the special-cases
> > that you removed (drop_behind in particular), and I have to say
> > that the new code looks a lot saner, even without your extensive
> > description and timing analysis.
>
> Fully agreed, drop_behind is an ugly hack. The sooner
> it dies the happier I am ;)
>
> > Please people, test this out extensively - I'd love to integrate
> > it, but while it looks very sane I'd really like to hear of
> > different peoples reactions to it under different loads.
>
> The one thing which has always come up in LRU/k and 2Q
> papers is that the "first reference" can really be a
> series of references in a very short time.
Yes, I thought about that but decided to try to demonstrate the
concept in its simplest form, and if things worked out, go ahead
and try to refine it.
Memory-mapped files have to be handled too. One possible way to
go at it is to do the test not against the current page being
handled by generic_* but against the page already on the head of
the inactive_dirty list, at the time the *next* page is queued.
This introduces a slight delay, time enough for several programmed
IO operations to complete. It will also work for mmap. As a
bonus, the code might even get cleaner because all the use_once
tests are gathered together into a single place.
> Counting only the very first reference will fail if we
> do eg. sequential IO with non-page aligned read() calls,
> which doesn't look like it's too uncommon.
Yep. We should also look at some statistics. So far I've just
used a single number: execution time.
> In order to prevent this from happening, either the system
> counts all first references in a short timespan (complex to
> implement) or it has the new pages on a special - small fixed
> size - page list and all references to the page while on that
> list are ignored.
Yes, those are both possibilities.
--
Daniel
On Tue, 24 Jul 2001, Daniel Phillips wrote:
> On Tuesday 24 July 2001 19:04, Rik van Riel wrote:
> Memory-mapped files have to be handled too.
> > In order to prevent this from happening, either the system
> > counts all first references in a short timespan (complex to
> > implement) or it has the new pages on a special - small fixed
> > size - page list and all references to the page while on that
> > list are ignored.
>
> Yes, those are both possibilities.
The special "new pages" list handles the mmap() situation
automatically.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tue, Jul 24, 2001 at 02:51:03PM -0300, Rik van Riel wrote:
> Let me give you an example:
>
> - sequential access of a file
> - script reads the file in 80-byte segments
> (parsing some arcane data structure)
> - these segments are accessed in rapid succession
> - each 80-byte segment is accessed ONCE
>
> In this case, even though the data is accessed only
> once, each page is touched PAGE_SIZE/80 times, with
> one 80-byte read() each time.
Of course, such a program is poorly written anyway. That many individual
system calls! At least rewrite to use fread(). ;->
Change the example to using mmap(), and one feels less likely to throttle the
programmer.
mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc
On Tue, 24 Jul 2001, Mike Castle wrote:
> On Tue, Jul 24, 2001 at 02:51:03PM -0300, Rik van Riel wrote:
> > Let me give you an example:
>
> Of course, such a program is poorly written anyway. That many
> individual system calls!
That's not the point I was trying to make ;))
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
I compiled aha152x as a module and usually have it load at boot. However,
with kernel 2.4.7, the machine hangs in some in-between state when the
module is loaded. No oops or panic is printed -- the machine simply hangs
and does nothing. The keyboard is still responsive, but it appears as
though nothing gets through to the kernel.
Furthermore, nothing is printed in the logs about any such error. I'm
wondering if this is somehow related to the recent SCSI layer cleanups?
The machine is an SMP Pentium III with 640 megabytes of RAM. If you need
any more details, e.g. .config, lspci, etc. let me know and I'll provide
them. For the moment, I have removed the controller card from the machine
simply because it was pissing me off to no extent and I didn't want to
deal with it any longer. Fortunately, the only device connected to it is
an optical disk drive that gets very infrequent use, so I can live without
it, for a while.
LMK what you come up with. TIA,
Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------
I'm sorry that this is for the old 2.2.x series, but I have a 2.2.19 router
acting as a MASQ box for our microwave line. I set up automatic backup
via ISDN, but I find that masq connections stop working when the
ISDN gets dialed.
It is because the routing is changed and the maddr field in the
ip_masq struct is wrong.
There is an existing solution: sysctl_ip_dynaddr. But it replaces
maddr only on connections where no reply was received from outside.
That works for a machine connected via diald, for example, but when
changing a LIVE route it simply doesn't work.
I patched 3 files with a very short patch. It allows you to add
4 to the sysctl_ip_dynaddr flag, which skips the no-reply test.
It works well now.
The patched version is backward compatible and there is no speed
penalty (well, almost).
What about inclusion in 2.2.20? Is this the correct place to send the patch to?
devik
--------
diff -ru linux/net/ipv4/ip_masq.c gate2/net/ipv4/ip_masq.c
--- linux/net/ipv4/ip_masq.c Sun Mar 25 18:37:41 2001
+++ gate2/net/ipv4/ip_masq.c Tue Jul 24 20:05:21 2001
@@ -50,6 +50,7 @@
* Kai Bankett : do not toss other IP protos in proto_doff()
* Dan Kegel : pointed correct NAT behavior for UDP streams
* Julian Anastasov : use daddr and dport as hash keys
+ * Martin Devera : extended sysctl_ip_dynaddr functionality
*
*/
@@ -1236,9 +1237,9 @@
* in this tunnel and routing iface address has changed...
* "You are welcome, diald".
*/
- if ( sysctl_ip_dynaddr && ms->flags & IP_MASQ_F_NO_REPLY && maddr != ms->maddr) {
-
- if (sysctl_ip_dynaddr > 1) {
+ if ( sysctl_ip_dynaddr && (sysctl_ip_dynaddr & 4 ||
+ ms->flags & IP_MASQ_F_NO_REPLY) && maddr != ms->maddr) {
+ if ((sysctl_ip_dynaddr & 3) > 1) {
IP_MASQ_INFO( "ip_fw_masquerade(): change masq.addr from %d.%d.%d.%d to %d.%d.%d.%d\n",
NIPQUAD(ms->maddr),NIPQUAD(maddr));
}
@@ -1527,9 +1528,9 @@
* in this tunnel and routing iface address has changed...
* "You are welcome, diald".
*/
- if ( sysctl_ip_dynaddr && ms->flags & IP_MASQ_F_NO_REPLY && maddr != ms->maddr) {
-
- if (sysctl_ip_dynaddr > 1) {
+ if ( sysctl_ip_dynaddr && (sysctl_ip_dynaddr & 4 ||
+ ms->flags & IP_MASQ_F_NO_REPLY) && maddr != ms->maddr) {
+ if ((sysctl_ip_dynaddr & 3) > 1) {
IP_MASQ_INFO( "ip_fw_masq_icmp(): change masq.addr %d.%d.%d.%d to %d.%d.%d.%d",
NIPQUAD(ms->maddr), NIPQUAD(maddr));
}
diff -ru linux/net/ipv4/tcp_ipv4.c gate2/net/ipv4/tcp_ipv4.c
--- linux/net/ipv4/tcp_ipv4.c Sun Mar 25 18:37:41 2001
+++ gate2/net/ipv4/tcp_ipv4.c Tue Jul 24 20:12:06 2001
@@ -1889,7 +1889,7 @@
}
if (new_saddr != sk->saddr) {
- if (sysctl_ip_dynaddr > 1) {
+ if ((sysctl_ip_dynaddr & 3) > 1) {
printk(KERN_INFO "tcp_v4_rebuild_header(): shifting sk->saddr "
"from %d.%d.%d.%d to %d.%d.%d.%d\n",
NIPQUAD(sk->saddr),
--- linux/Documentation/networking/ip_dynaddr.txt Sun Mar 25 18:31:57 2001
+++ gate2/Documentation/networking/ip_dynaddr.txt Tue Jul 24 20:44:42 2001
@@ -23,6 +23,12 @@
# echo 2 > /proc/sys/net/ipv4/ip_dynaddr
To disable (default)
# echo 0 > /proc/sys/net/ipv4/ip_dynaddr
+ To always rewrite MASQ connections instead only when no reply
+ packet is recieved (needed to be able to change routing, for
+ example when doing backup-line) add 4 to flag, like:
+ # echo 5 > /proc/sys/net/ipv4/ip_dynaddr
+ or
+ # echo 6 > /proc/sys/net/ipv4/ip_dynaddr
Enjoy!
---------------
This may not be the best place to ask this, but I've done the research,
can't come up with an answer, and don't know a better group of people to
ask.
I have an embedded Linux device (2.2.12 kernel, BlueCat distro) that uses a
serial port for the console. When the box comes up, rc.sysinit starts the
application as a detached process (my_program&). When the program spits out
periodic status reports, the \n is not being mapped to CR-LF (i.e., I'm
getting only linefeeds).
Once you log on to the box (via login, into bash), the output becomes
correctly cooked.
I've tried twiddling termios parameters for OPOST and ONLCR, but it has no
effect. Trying to have the application run "stty -a" via a system() call
reports an error saying it can't get the parameters for stdin.
What parameters are required to be set for a detached process started via
init to correctly have its output mapped from \n to CR-NL?
--John
Hello all.
I've been playing around with Davide Libenzi's "/dev/epoll" patch and have
been very impressed by the performance figures (see his paper at
http://www.xmailserver.org/linux-patches/nio-improve.html for some
numbers) - it seems to exhibit really excellent scaling.
Given the sheer utility of using /dev/epoll in a largescale server, are
there any plans to roll it into the mainline kernel at any point? If not,
why / is there an equivalent scheduled for inclusion?
Clearly, IMHO, Linux needs something that can scale as well as BSD's kqueue
and Solaris's /dev/poll. /dev/epoll seems to be an excellent answer.
Sincerely,
David E. Weekly
PS: Cheers to Anton Altaparmakov's work on NTFS: In 2.4.7, we now have
rather robust read/write support of NT filesystems!
On Tue, 24 Jul 2001, David E. Weekly wrote:
> I've been playing around with Davide Libenzi's "/dev/epoll"
> Given the sheer utility of using /dev/epoll in a largescale
> server, are there any plans to roll it into the mainline kernel
> at any point?
So why don't you ask Davide if he has any plans to submit
it for inclusion into the kernel? ;)
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
Hei,
I am trying to install seawolf (Linux IA64) on an IA64 machine. For
partitioning I went through several trials without success (total disk space
is 10G). After all the initial settings, I restarted the machine to EFI (the
boot manager). I can't boot to Linux, and EFI can't recognize the files the
installation CD initially saved on the hard disk. Are there special ways or
tricks to partition the disk and to make it work?
thanks in advance
hiu fung
Hello...
On Tuesday 24 July 2001 18:48, Linus Torvalds wrote:
> Please people, test this out extensively - I'd love to integrate it, but
> while it looks very sane I'd really like to hear of different peoples
> reactions to it under different loads.
I just decided to give this patch a try, as I have written a little
application which does some statistics over traces dumped by another program
by mmap()ing a large file and reading it sequentially. The trace file to be
analyzed is about 240 megs in size and consists of records each 249 bytes
long. The analysis program opens and then mmap()s the trace file, doing some
small calculations on the data (basically it adds up fields from the records
to get overall values).
I have tested this on my Athlon 600 with 128 Megs of RAM, and it does not
make any difference whether I use plain 2.4.7 or 2.4.5-use-once. The program
always takes about 1 minute 6 seconds (+- 2 seconds) to complete, and the
machine starts swapping out stuff (thus I have omitted further stats like
vmstat output). I have just taken another look into my program to verify it
does not do something silly, like keeping old data around, but the program
cycle is always the same: copy a record from the mmap into a struct, perform
analysis, and copy the next record. The struct is always reused for the next
record (so there is only one struct at any time).
I can do further tests, if someone asks me to. I could even modify the
analysis program to check changes in behaviour...
> Linus
Patrick
--
On Tue, 24 Jul 2001, Patrick Dreker wrote:
[snip program with mmap()]
> I have tested this on my Athlon 600 with 128 Megs of RAM, and it
> does not make any difference whether I use plain 2.4.7 or
> 2.4.5-use-once.
As expected. Only programs using generic_file_{read,write}()
will be impacted at the moment.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tue, 24 Jul 2001, Patrick Dreker wrote:
>
> I just decided to give this patch a try, as I have written a little
> application which does some statistics over traces dumped by another program
> by mmap()ing a large file and reading it sequentially. The trace file to be
> analyzed is about 240 megs in size and consists of records each 249 bytes
> long. The analysis program opens and the mmap()s the trace file doing some
> small calculations on the data (basically it adds up fields from the records
> to get overall values).
Note that the patch won't do much anything for the mmap() case - the VM
doesn't consider that "use once" anyway, and trying to make it do so would
be hard, I suspect. It's just very nasty to try to figure out whether the
user has touched the page a single time or millions of times...
We do have the "accessed" bit, but in order to get any access patterns
we'd have to scan the page tables a _lot_ more than we do now. Right now
we do it only under memory pressure, and only about once per memory cycle,
which is not really even remotely enough to get anything more than "yes,
the user did touch the page"..
In order for mmap() to make a difference with the new code, we could do
something like look at the adjacent VM pages on page-in. That, together
with VM hints, might well be able to do similar things for mmap. But at a
noticeably higher cost than what the current code has.
Linus
On Tue, 24 Jul 2001, Daniel Phillips wrote:
> Today's patch tackles the use-once problem, that is, the problem of
> how to identify and discard pages containing data likely to be used
> only once in a long time, while retaining pages that are used more
> often.
>
> I'll try to explain not only what I did, but the process I went
> through to arrive at this particular approach. This requires some
> background.
Well, as I see it, the patch should remove the problem where drop_behind()
deactivates pages of a readahead window even if some of those pages are
not "used-once" pages, right ?
I just want to make sure the performance improvements you're seeing are caused
by the fix of this _particular_ problem.
If we knew the amount of non-used-once pages which drop_behind() is
deactivating under _your_ tests, we could be absolutely sure about the
problem.
On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Daniel Phillips wrote:
>
> > Today's patch tackles the use-once problem, that is, the problem of
>
> Well, as I see the patch should remove the problem where
> drop_behind() deactivates pages of a readahead window even if
> some of those pages are not "used-once" pages, right ?
>
> I just want to make sure the performance improvements you're
> seeing caused by the fix of this _particular_ problem.
Fully agreed.
Especially since it was a one-liner change from worse
performance to better performance (IIRC) it would be
nice to see exactly WHY the system behaves the way it
does. ;)
Reading a bunch of 2Q, LRU/k, ... papers and thinking
about the problem very carefully should help us a bit
in this. Lots of researchers have already looked into
this particular problem in quite a lot of detail.
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tuesday 24 July 2001 22:32, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Patrick Dreker wrote:
>
> [snip program with mmap()]
>
> > I have tested this on my Athlon 600 with 128 Megs of RAM, and it
> > does not make any difference whether I use plain 2.4.7 or
> > 2.4.5-use-once.
>
> As expected. Only programs using generic_file_{read,write}()
> will be impacted at the moment.
D'oh... got something wrong there, it seems :-(
Anyways, I still have some old routines in the program doing the same via
read(). I have tried that (although the program is pretty silly, as it reads
the 240megs in 4 byte chunks.... the call overhead probably is the bigger
problem there...) and it improved the runtime by approximately 20% using the
use_once patch.
linux-2.4.7:
22.300u 135.310s 2:41.38 97.6% 0+0k 0+0io 110pf+0w
linux-2.4.5-use_once:
14.980u 108.870s 2:09.79 95.4% 0+0k 0+0io 200pf+0w
Both measurements taken after reboot and while running KDE2.2
As stated: I am still willing to do further experiments... (read()ing larger
chunks at once?)
--
Patrick Dreker
On Wed, 25 Jul 2001, Patrick Dreker wrote:
> Anyways, I still have some old routines in the program doing the
> same via read(). I have tried that (although the program is
> pretty silly, as it reads the 240megs in 4 byte chunks.... the
> call overhead probably is the bigger problem there...) and it
> improved the runtime by aproximately 20% using the use_once
> patch.
>
> linux-2.4.7:
> 22.300u 135.310s 2:41.38 97.6% 0+0k 0+0io 110pf+0w
>
> linux-2.4.5-use_once:
> 14.980u 108.870s 2:09.79 95.4% 0+0k 0+0io 200pf+0w
COOL ....
> Both measurements taken after reboot and while running KDE2.2 As
> stated: I am still willing to do further experiments...
> (read()ing larger chunks at once?)
Could you try reading 4kB chunks at once ?
That way you should truly only touch each page once.
(using smaller chunks, or chunks which aren't a
multiple of 4kB should break the current code)
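Something along these lines would do it (illustrative only, assuming
the file name is passed on the command line):

#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	char buf[4096];		/* exactly one page per read() */
	ssize_t n;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	while ((n = read(fd, buf, sizeof buf)) > 0)
		;		/* touch each page-cache page once */
	close(fd);
	return 0;
}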
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tue, 24 Jul 2001, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> >
> > > Today's patch tackles the use-once problem, that is, the problem of
> >
> > Well, as I see the patch should remove the problem where
> > drop_behind() deactivates pages of a readahead window even if
> > some of those pages are not "used-once" pages, right ?
> >
> > I just want to make sure the performance improvements you're
> > seeing caused by the fix of this _particular_ problem.
>
> Fully agreed.
>
> Especially since it was a one-liner change from worse
> performance to better performance (IIRC) it would be
> nice to see exactly WHY the system behaves the way it
> does. ;)
Yes.
Daniel's patch adds "drop behind" behaviour (that is, adding pages
straight to the inactive_dirty list) to swapcache pages.
This is a _new_ thing, and I would like to know how that is changing the
whole VM behaviour..
On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.
>
> This is a _new_ thing, and I would like to know how that is
> changing the whole VM behaviour..
It means that swap cache pages get about 1 second
to be used (in theory, under load) and after that
they can get evicted.
Should show up nicely with your VM stats patch...
Let's try to get the VM statistics patch merged;
after that we can talk about this stuff without
having to use our imagination as much as we use it
now ;))
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Tuesday 24 July 2001 22:27, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > Today's patch tackles the use-once problem, that is, the problem of
> > how to identify and discard pages containing data likely to be used
> > only once in a long time, while retaining pages that are used more
> > often.
> >
> > I'll try to explain not only what I did, but the process I went
> > through to arrive at this particular approach. This requires some
> > background.
>
> Well, as I see the patch should remove the problem where
> drop_behind() deactivates pages of a readahead window even if some of
> those pages are not "used-once" pages, right ?
Yes, that was a specific goal.
> I just want to make sure the performance improvements you're seeing
> caused by the fix of this _particular_ problem.
Of course I chose the test load to show the benefit clearly, but I
also tried to avoid disturbing any of the existing cases that are
working well. I did the verification and audit in a hurry, so I
wouldn't be surprised at all if I missed something, but I feel
equally that picking these up is more a matter of adjustment than
change in viewpoint.
> If we knew the amount of non-used-once pages which drop_behind() is
> deactivating under _your_ tests, we could make absolute sure about
> the problem.
Yes, for sure, it's time to use the statistics weapon. I guess we
need to add in a couple more sensor points. An obvious place is in
check_used_once.
--
Daniel
On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Rik van Riel wrote:
> > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > Today's patch tackles the use-once problem, that is, the
> > > > problem of
> > >
> > > Well, as I see the patch should remove the problem where
> > > drop_behind() deactivates pages of a readahead window even if
> > > some of those pages are not "used-once" pages, right ?
> > >
> > > I just want to make sure the performance improvements you're
> > > seeing caused by the fix of this _particular_ problem.
> >
> > Fully agreed.
> >
> > Especially since it was a one-liner change from worse
> > performance to better performance (IIRC) it would be
> > nice to see exactly WHY the system behaves the way it
> > does. ;)
>
> Yes.
>
> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.
>
> This is a _new_ thing, and I would like to know how that is changing
> the whole VM behaviour..
Yes, absolutely. I knew I was doing that but I also thought it wouldn't
hurt. Rather it's part of a transition towards a full unification of
the file and swap paths.
Basically, I just left that part of it hanging. If you check my
detailed timings you'll see all my test runs have swaps=0, basically
because I didn't really want to hear about it just then ;-)
I was pretty sure it could be fixed if it broke.
--
Daniel
On Wednesday 25 July 2001 00:05, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > Today's patch tackles the use-once problem, that is, the problem
> > > of
> >
> > Well, as I see the patch should remove the problem where
> > drop_behind() deactivates pages of a readahead window even if
> > some of those pages are not "used-once" pages, right ?
> >
> > I just want to make sure the performance improvements you're
> > seeing caused by the fix of this _particular_ problem.
>
> Fully agreed.
>
> Especially since it was a one-liner change from worse
> performance to better performance (IIRC) it would be
> nice to see exactly WHY the system behaves the way it
> does. ;)
Oh, it wasn't an accident, I knew what I was trying to achieve.
It's just that I didn't immediately understand all the curlicues in
the page life cycles just from staring at the code. I had to see
the machine moving first. It was very instructive to see what
happened on reverting to a fifo strategy. It might even be a good
idea to put a proc hook in there to allow on-the-fly dumbing down
of the lru machinery. That way we can measure system behaviour
against a baseline without having to go through the
compile-reboot-bench cycle every time, which eats a major amount
of time.
> Reading a bunch of 2Q, LRU/k, ... papers and thinking
> about the problem very carefully should help us a bit
> in this. Lots of researchers have already looked into
> this particular problem in quite a lot of detail.
Yep, unfortunately the subject of memory management in operating
systems seems to have received a lot less attention than memory
management in databases.
--
Daniel
On Tuesday 24 July 2001 18:56, Rik van Riel wrote:
> On Tue, 24 Jul 2001 [email protected] wrote:
> > On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
> > > So I decided to look for a new way of approaching the use-once
> > > problem that would easily integrated with our current approach.
> > > What I came up with is pretty simple: instead of starting a newly
> > > allocated page on the active ring, I start it on the inactive
> > > queue with an age of zero. Then, any time generic_file_read or
> > > write references a page, if its age is zero, set its age to
> > > START_AGE and mark it as unreferenced.
> >
> > This is wonderfully simple and elegant.
>
> It's a tad incorrect, though.
>
> If a page gets 2 1-byte reads in a microsecond, with this
> patch it would get promoted to the active list, even though
> it's really only used once.
Yes, I expected that to be a relatively rare case however - block
aligned transfers are much more common, and when we have multiple
blocks per page we may well want the page to go onto the active list
because there may be quite a delay between successive block accesses.
From there it will be aged normally, not such a horrible thing.
To fix this properly I suspect that just a few microseconds is enough
delay to pick up the temporal groupings you're talking about. That's
not hard to achieve.
> Preferably we'd want to have a "new" list of, say, 5% maximum
> of RAM size, where all references to a page are ignored. Any
> references made after the page falls off that list would be
> counted for page replacement purposes.
At that size you'd run a real risk of missing the tell-tale multiple
references that mark a page as frequently used. Think about metadata
here (right now, that usually just means directory pages, but later...
who knows). Once somebody has looked at two files in a directory -
while the page sits on the "ignore" list - chances are very good
they'll look at a third, but perhaps not before it drops off the south
end of the inactive queue.
Well, theorizing can only get us so far before we need to take actual
measurements.
--
Daniel
At 01:04 25/07/2001, Daniel Phillips wrote:
>At that size you'd run a real risk of missing the tell-tale multiple
>references that mark a page as frequently used. Think about metadata
>here (right now, that usually just means directory pages, but later...
>who knows).
This is not actually implemented yet, but NTFS TNG will use the page cache
to hold both the LCN (physical clusters) and MFT (on disk inodes)
allocation bitmaps in addition to file and directory pages. (Admittedly the
LCN case folds into the file pages in page cache one as the LCN bitmap is
just stored inside the usual data of a file called $Bitmap, but the MFT
case is more complex as it is in an additional attribute inside the file
$MFT so the normal file read/write functions definitely cannot be used. The
usual data here is the actual on disk inodes...)
Just FYI.
Anton
--
"Nothing succeeds like success." - Alexandre Dumas
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
On Tue, 24 Jul 2001, Rik van Riel wrote:
>
> (using smaller chunks, or chunks which aren't a
> multiple of 4kB should break the current code)
Maybe Patrick is using stdio? In that case, the small chunks will be
coalesced in the library layer anyway, which might explain the lack of
breakage.
Of course, if it improved performance even when "broken", that would be
even better. I like those kinds of algorithms.
Linus
On Tuesday 24 July 2001 22:33, Linus Torvalds wrote:
> On Tue, 24 Jul 2001, Patrick Dreker wrote:
> > I just decided to give this patch a try, as I have written a little
> > application which does some statistics over traces dumped by
> > another program by mmap()ing a large file and reading it
> > sequentially. The trace file to be analyzed is about 240 megs in
> > size and consists of records each 249 bytes long. The analysis
> > program opens and the mmap()s the trace file doing some small
> > calculations on the data (basically it adds up fields from the
> > records to get overall values).
>
> Note that the patch won't do much anything for the mmap() case - the
> VM doesn't consider that "use once" anyway, and trying to make it do
> so would be hard, I suspect. It's just very nasty to try to figure
> out whether the user has touched the page a single time or millions
> of times...
>
> We do have the "accessed" bit, but in order to get any access
> patterns we'd have to scan the page tables a _lot_ more than we do
> now. Right now we do it only under memory pressure, and only about
> once per memory cycle, which is not really even remotely enough to
> get anything more than "yes, the user did touch the page"..
>
> In order for mmap() to make a difference with the new code, we could
> do something like look at the adjacent VM pages on page-in. That,
> together with VM hints, might well be able to do similar things for
> mmap. But at a noticeably higher cost than what the current code has.
OK, the truth is, I decided to punt that problem on the assumption I'd
be able to come up with something workable when the time came. Now
having thought about it a little longer I see it's not so easy. It's
time for some free-form thinking.
Maybe there's a relatively simple way. For either a memory read or a
write we'll see the fault. At that point the page starts its
death march down the inactive queue and shortly after that the actual
memory operation takes place. Another short time later we clear the
hardware referenced bit by some as-yet-undetermined means. At the
south end of the queue we check the hardware referenced bit and
reactivate or continue with the eviction process as with the current
patch.
Now I sense there are gotchas here that I'm not entirely on top of -
the swap side of the memory manager being a little less than crystal
clear to me at this point. Let me start by guessing that it's not so
easy to know which pte to look in for the referenced bit. However, we
could cheat and store the pte address in the struct page, or perhaps in
the buffers, which happen to be still attached at the time.
Hmm, just a first idea, and admittedly ugly. I think what I really
have to do is some more code study.
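To make that guess concrete - and it is only a guess - here is a sketch
of checking a stored pte hint at the south end of the queue. The
pte_hint field and the accessed-bit constant are invented for
illustration, and an x86-style pte layout is assumed:

/* Hypothetical extra state: remember where the mapping pte lives when
   the page starts down the inactive queue (NULL if unmapped). */
struct toy_page {
	unsigned long *pte_hint;
	int age;
};

#define TOY_PTE_ACCESSED 0x20UL	/* accessed bit in an x86-style pte */

/* At the south end of the inactive queue: has the hardware seen a
   reference to the mapping since we queued the page? */
static int touched_since_queued(const struct toy_page *page)
{
	return page->pte_hint && (*page->pte_hint & TOY_PTE_ACCESSED);
}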
--
Daniel
On Wednesday 25 July 2001 02:43, Anton Altaparmakov wrote:
> At 01:04 25/07/2001, Daniel Phillips wrote:
> >At that size you'd run a real risk of missing the tell-tale multiple
> >references that mark a page as frequently used. Think about
> > metadata here (right now, that usually just means directory pages,
> > but later... who knows).
>
> This is not actually implemented yet, but NTFS TNG will use the page
> cache to hold both the LCN (physical clusters) and MFT (on disk
> inodes) allocation bitmaps in addition to file and directory pages.
> (Admittedly the LCN case folds into the file pages in page cache one
> as the LCN bitmap is just stored inside the usual data of a file
> called $Bitmap, but the MFT case is more complex as it is in an
> additional attribute inside the file $MFT so the normal file
> read/write functions definitely cannot be used. The usual data here
> is the actual on disk inodes...)
>
> Just FYI.
Yes, not a surprise. Plus, I was working on an experimental patch to
put Ext2 index blocks into the page cache just before I got off on this
use-once tangent. Time to go back to that...
--
Daniel
On Tuesday 24 July 2001 22:24, Patrick Dreker wrote:
> I just decided to give this patch a try, as I have written a little
> application which does some statistics over traces dumped by another
> program by mmap()ing a large file and reading it sequentially. The
> trace file to be analyzed is about 240 megs in size and consists of
> records each 249 bytes long. The analysis program opens and the
> mmap()s the trace file doing some small calculations on the data
> (basically it adds up fields from the records to get overall values).
>
> I have tested this on my Athlon 600 with 128 Megs of RAM, and it does
> not make any difference whether I use plain 2.4.7 or 2.4.5-use-once.
> The program always takes about 1 minute 6 seconds (+- 2 seconds) to
> complete, and the machine starts swapping out stuff
In this case that's an excellent result:
- The optimization doesn't include mmap's (yet)
- It doesn't break swap. (Good, I didn't check that myself)
This is a case of "no news is good news".
> (thus I have
> omitted further stats like vmstat output). I have just taken another
> look into my program to verify it does not do something silly, like
> keeping old data around, but the program cycle is always the same:
> copy a record from the mmap into a struct, perform analysis, and copy
> next record. The struct is always reused for the next struct (so
> there is only one struct at any time).
>
> I can do further tests, if someone asks me to. I could even modify
> the analysis program to check changes in behaviour...
(Already read your mail where you picked up the 20% improvement,
thanks, it warms my heart:-)
--
Daniel
On Tue, 24 Jul 2001, Daniel Phillips wrote:
> --- ../2.4.5.clean/mm/vmscan.c Sat May 26 02:00:18 2001
> +++ ./mm/vmscan.c Mon Jul 23 17:25:27 2001
> @@ -359,10 +359,10 @@
> }
>
> /* Page is or was in use? Move it to the active list. */
> - if (PageReferenced(page) || page->age > 0 ||
> - (!page->buffers && page_count(page) > 1)) {
> + if (PageReferenced(page) || (!page->buffers && page_count(page) > 1)) {
> del_page_from_inactive_clean_list(page);
> add_page_to_active_list(page);
> + page->age = PAGE_AGE_START;
> continue;
> }
>
> @@ -462,11 +462,11 @@
> }
>
> /* Page is or was in use? Move it to the active list. */
> - if (PageReferenced(page) || page->age > 0 ||
> - (!page->buffers && page_count(page) > 1) ||
> + if (PageReferenced(page) || (!page->buffers && page_count(page) > 1) ||
> page_ramdisk(page)) {
> del_page_from_inactive_dirty_list(page);
> add_page_to_active_list(page);
> + page->age = PAGE_AGE_START;
> continue;
> }
That change will cause problems: we indicate that a page is young due to a
young mapped pte by increasing its age (at swap_out()).
With the current way of doing things, if you don't move higher aged pages
from the inactive lists to the active list you'll have severe problems
because of that (inactive list full of unfreeable pages due to mapped
pte's).
I suggest you change
if (!page->buffers && page_count(page) > 1)
to
if ((page_count(page) - !!page->buffers) > 1)
at page_launder/reclaim_page to fix that problem.
This way we end up moving all pages with additional references (ignoring
the buffers) to the active list.
We have to unmap all references to a page before being able to make it
reclaimable clean cache anyway, so there is no real point in keeping
those pages on the inactive list.
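To spell the counting out, here are the two tests side by side as plain
functions (toy code, with the count and buffer state passed in directly):

/* Old test: only fires when the page has no buffers at all, so a page
   with buffers attached is never seen as "in use" here, no matter how
   many other references it holds. */
static int old_test(int page_count, int has_buffers)
{
	return !has_buffers && page_count > 1;
}

/* Suggested test: discount the one reference the buffers account for,
   then ask whether anybody else still holds the page. */
static int new_test(int page_count, int has_buffers)
{
	return (page_count - !!has_buffers) > 1;
}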
On Tuesday 24 July 2001 19:09, Daniel Phillips wrote:
> On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Rik van Riel wrote:
> > > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > > Today's patch tackles the use-once problem, that is, the
> > > > > problem of
> > > >
> > > > Well, as I see the patch should remove the problem where
> > > > drop_behind() deactivates pages of a readahead window even if
> > > > some of those pages are not "used-once" pages, right ?
> > > >
> > > > I just want to make sure the performance improvements you're
> > > > seeing are caused by the fix of this _particular_ problem.
> > >
> > > Fully agreed.
> > >
> > > Especially since it was a one-liner change from worse
> > > performance to better performance (IIRC) it would be
> > > nice to see exactly WHY the system behaves the way it
> > > does. ;)
> >
> > Yes.
> >
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
> >
> > This is a _new_ thing, and I would like to know how that is changing
> > the whole VM behaviour..
>
> Yes, absolutely. I knew I was doing that but I also thought it wouldn't
> hurt. Rather it's part of a transition towards a full unification of
> the file and swap paths.
Stupid question time:
So basically (lemme get this straight):
All VM allocation in 2.4 has become some variant of mmap. Either we're
mmaping in the executable and libraries when we exec, we're mmaping in a
file, or we're mmaping in the swap file/block device which is how we do
anonymous page allocations.
And this is why 2.4 keeps wanting to allocate swap space for pages that are
still in core. And why the 2.4 VM keeps going ballistic on boxes that have
more physical DRAM than they have swap space. (I.E. the 2X swap actually
being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram
box actually hurting matters by confusing the VM if swap is enabled AT ALL,
since it clashes conceptually.)
So getting swap to exclude in-core pages (so they don't take up space in the
swap file) would therefore be conceptually similar to implementing "files
with holes in them" support like EXT2 has for these swap file mmaps.
And the argument that doing so might "fragment the swap file", while true, is
actually a bit of a red herring because the real objection is that it
unnecessarily complicates an otherwise clean and straightforward concept.
Am I even close here? (That was the stupid question I mentioned earlier,
knew I'd get around to it...) Reading the code's a bit harder when you don't
have any sort of conceptual understanding of what the heck it's actually
trying to do.
> Basically, I just left that part of it hanging. If you check my
> detailed timings you'll see all my test runs have swaps=0, basically
> because I didn't really want to hear about it just then ;-)
>
> I was pretty sure it could be fixed if it broke.
Just thought I'd say that personally, I think your greatest contribution in
this whole thread is that somebody finally clearly explained how the new VM
works, using small words. The actual improvements are just gravy. Yes I'm
biased. :)
I don't suppose we could get some variant of your initial post into
/Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest "detail"
you glossed over was the separation of memory into zones?)
> --
> Daniel
Rob
Marcelo Tosatti wrote:
>
> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.
In some *brief* testing here, it appears that the use-once changes
make an improvement for light-medium workloads. With swap-intensive
workloads, the (possibly accidental) changes to swapcache aging
in fact improve things a lot, and use-once makes things a little worse.
This is a modified Galbraith test: 64 megs of RAM, `make -j12
bzImage', dual CPU:
2.4.7: 6:54
2.4.7+Daniel's patch 6:06
2.4.7+the below patch 5:56
--- mm/swap.c 2001/01/23 08:37:48 1.3
+++ mm/swap.c 2001/07/25 04:08:59
@@ -234,8 +234,8 @@
DEBUG_ADD_PAGE
add_page_to_active_list(page);
/* This should be relatively rare */
- if (!page->age)
- deactivate_page_nolock(page);
+ deactivate_page_nolock(page);
+ page->age = 0;
spin_unlock(&pagemap_lru_lock);
}
This change to lru_cache_add() is the only change made to 2.4.7,
and it provides the 17% speedup for this swap-intensive load.
With the same setup, running a `grep -r /usr/src' in parallel
with a `make -j3 bzImage', the `make -j3' takes:
2.4.7: 5:13
2.4.7+Daniel: 5:03
2.4.7+the above patch: 5:16
With the same setup, running a `grep -r /usr/src' in parallel
with a `make -j1 bzImage', the `make -j1' takes:
2.4.7: 9:25
2.4.7+Daniel: 8:55
2.4.7+the above patch: 9:35
So with lighter loads, use-once is starting to provide benefit, and the
deactivation is too aggressive.
> This is a _new_ thing, and I would like to know how that is changing the
> whole VM behaviour..
Sure. Daniel's patch radically changes the aging of swapcache
and other pages, and with some workloads it appears that it is
this change which brings about the performance increase, rather
than the intended use-once stuff.
I suspect the right balance here is to take use-once, but *not*
take its changes to lru_cache_add(). That's a separate thing.
Seems that lru_cache_add() is making decisions at a too-low level, and
they are sometimes wrong. The decision as to what age to give the page
and whether it should be activated needs to be made at a higher level.
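One purely hypothetical way to push that decision up a level -
lru_cache_add_aged() is an invented name, not an existing interface -
would be to let the caller, which knows what kind of page it is adding,
say so explicitly:

/* Hypothetical: the caller picks the starting age and whether the page
   should start out active, instead of lru_cache_add() guessing. */
void lru_cache_add_aged(struct page *page, int age, int activate)
{
	spin_lock(&pagemap_lru_lock);
	page->age = age;
	add_page_to_active_list(page);
	if (!activate)
		deactivate_page_nolock(page);
	spin_unlock(&pagemap_lru_lock);
}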
-
On Tue, 24 Jul 2001, Rob Landley wrote:
> On Tuesday 24 July 2001 19:09, Daniel Phillips wrote:
> > On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> > > On Tue, 24 Jul 2001, Rik van Riel wrote:
> > > > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > > > Today's patch tackles the use-once problem, that is, the
> > > > > > problem of
> > > > >
> > > > > Well, as I see the patch should remove the problem where
> > > > > drop_behind() deactivates pages of a readahead window even if
> > > > > some of those pages are not "used-once" pages, right ?
> > > > >
> > > > > I just want to make sure the performance improvements you're
> > > > > seeing are caused by the fix of this _particular_ problem.
> > > >
> > > > Fully agreed.
> > > >
> > > > Especially since it was a one-liner change from worse
> > > > performance to better performance (IIRC) it would be
> > > > nice to see exactly WHY the system behaves the way it
> > > > does. ;)
> > >
> > > Yes.
> > >
> > > Daniel's patch adds "drop behind" (that is, adding swapcache
> > > pages to the inactive dirty) behaviour to swapcache pages.
> > >
> > > This is a _new_ thing, and I would like to know how that is changing
> > > the whole VM behaviour..
> >
> > Yes, absolutely. I knew I was doing that but I also thought it wouldn't
> > hurt. Rather it's part of a transition towards a full unification of
> > the file and swap paths.
>
> Stupid question time:
>
> So basically (lemme get this straight):
>
> All VM allocation in 2.4 has become some variant of mmap.
>
> Either we're mmaping in the executable and libraries when we exec,
> we're mmaping in a file, or we're mmaping in the swap file/block
> device which is how we do anonymous page allocations.
>
>
> And this is why 2.4 keeps wanting to allocate swap space for pages that are
> still in core. And why the 2.4 VM keeps going ballistic on boxes that have
> more physical DRAM than they have swap space. (I.E. the 2X swap actually
> being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram
> box actually hurting matters by confusing the VM if swap is enabled AT ALL,
> since it clashes conceptually.)
2.4 keeps wanting to allocate swap space for pages that are in core to be
able to add them to the LRU lists. That way we can get more detailed
information about the VM usage on the system.
When we run out of memory, we start looking linearly in all processes ptes
to find ptes which have not been used recently. When we find one pte
which has not been used recently, we unmap it from the physical page.
For anonymous data, we have to unmap the pte and point it to some
"identifier" which can be used later (at page fault time) to find the
data (whether it's still in memory or already swapped out to disk).
Currently this "identifier" is a value in the swap map. The swap map is a
_direct_ mapping to backing storage (swap space).
That means we cannot unmap a pte which points to anon data without
allocating swap space for the page first.
We have added an aging scheme which gets detailed information about the
page usage, great! The problem is that we cannot get detailed anon page
usage info without having to allocate swap space. Result: 2xRAM swap rule
created.
Have you got the idea here?
That is not going to change on 2.4, I think. Sorry.
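To illustrate what that "identifier" looks like, here is a simplified
userspace sketch of packing a swap entry into a non-present pte slot.
The bit layout here is made up for illustration and is not the exact
2.4 encoding:

#include <assert.h>

/* A swap entry names a (swap area, slot) pair.  When an anonymous pte
   is unmapped, a value like this is stored in the pte slot instead of a
   physical address, with the present bit clear, so the fault handler
   can find the data again later. */
typedef unsigned long swp_entry_t;

static swp_entry_t mk_swap_entry(unsigned long type, unsigned long offset)
{
	return (offset << 8) | (type << 1);	/* bit 0 (present) stays 0 */
}

static unsigned long swap_type(swp_entry_t e)   { return (e >> 1) & 0x7f; }
static unsigned long swap_offset(swp_entry_t e) { return e >> 8; }

int main(void)
{
	swp_entry_t e = mk_swap_entry(1, 12345);
	assert(swap_type(e) == 1 && swap_offset(e) == 12345);
	return 0;
}

The point is that the slot in the swap area has to exist before the pte
can be made to point at it - hence the preallocation.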
> So getting swap to exclude in-core pages (so they don't take up space in the
> swap file) would therefore be conceptually similar to implementing "files
> with holes in them" support like EXT2 has for these swap file mmaps.
Not really. To exclude in-core pages (in-core meaning all pte's which
point to the in-mem page are mapped and valid) from swap we just need to
remove the page from swapcache. It's as simple as that, really.
However that will cause swap fragmentation. And lots of it.
> And the argument that doing so might "fragment the swap file", while true, is
> actually a bit of a red herriing because the real objection is that it
> unnecessarily complicates an otherwise clean and straightforward concept.
Nope. The "objections" for not deallocating in-core pages from swap (on
2.4) are:
- The code would need to avoid fragmentation. It would have to "unmap"
contiguous chunks of swap mappings as soon as swap started to get full.
Take into account that the swap clustering code we are using now was written
at the time we used to allocate swap space only when actually writing out
swap data: there was no need for the code to be fast.
- 2.4 is out. Fixing the problem in a way with decent performance would
require changing quite a lot of code.
> Am I even close here? (That was the stupid question I mentioned earlier,
> knew I'd get around to it...) Reading the code's a bit harder when you don't
> have any sort of conceptual understanding of what the heck it's actually
> trying to do.
Well, I think so.
>
> > Basically, I just left that part of it hanging. If you check my
> > detailed timings you'll see all my test runs have swaps=0, basically
> > because I didn't really want to hear about it just then ;-)
> >
> > I was pretty sure it could be fixed if it broke.
>
> Just thought I'd say that personally, I think your greatest contribution in
> this whole thread is that somebody finally clearly explained how the new VM
> works, using small words. The actual improvements are just gravy. Yes I'm
> biased. :)
>
> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest "detail"
> you glossed over was the seperation of memory into zones?)
Last comment:
Who the heck cares about good swapping performance when tasks are
getting killed because there is no more swap space available due to this
swap preallocation? No one.
It makes sense to fuck up swap clustering to avoid the OOM killer.
In case someone is having this problem with the OOM killer in practice
(real workloads), please tell us.
On Wed, 25 Jul 2001, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> >
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
>
> In some *brief* testing here, it appears that the use-once changes
> make an improvement for light-medium workloads. With swap-intensive
> workloads, the (possibly accidental) changes to swapcache aging
> in fact improve things a lot, and use-once makes things a little worse.
>
> This is a modified Galbraith test: 64 megs of RAM, `make -j12
> bzImage', dual CPU:
>
> 2.4.7: 6:54
> 2.4.7+Daniel's patch 6:06
> 2.4.7+the below patch 5:56
>
> --- mm/swap.c 2001/01/23 08:37:48 1.3
> +++ mm/swap.c 2001/07/25 04:08:59
> @@ -234,8 +234,8 @@
> DEBUG_ADD_PAGE
> add_page_to_active_list(page);
> /* This should be relatively rare */
> - if (!page->age)
> - deactivate_page_nolock(page);
> + deactivate_page_nolock(page);
> + page->age = 0;
> spin_unlock(&pagemap_lru_lock);
> }
>
> This change to lru_cache_add() is the only change made to 2.4.7,
> and it provides the 17% speedup for this swap-intensive load.
>
> With the same setup, running a `grep -r /usr/src' in parallel
> with a `make -j3 bzImage', the `make -j3' takes:
>
> 2.4.7: 5:13
> 2.4.7+Daniel: 5:03
> 2.4.7+the above patch: 5:16
>
> With the same setup, running a `grep -r /usr/src' in parallel
> with a `make -j1 bzImage', the `make -j1' takes:
>
> 2.4.7: 9:25
> 2.4.7+Daniel: 8:55
> 2.4.7+the above patch: 9:35
Ok, now we know what improves the swapping performance.
I wonder _why_ it improves swapping performance.
> So with lighter loads, use-once is starting to provide benefit, and the
> deactivation is too aggressive.
>
> > This is a _new_ thing, and I would like to know how that is changing the
> > whole VM behaviour..
>
> Sure. Daniel's patch radically changes the aging of swapcache
> and other pages, and with some workloads it appears that it is
> this change which brings about the performance increase, rather
> than the intended use-once stuff.
>
> I suspect the right balance here is to take use-once, but *not*
> take its changes to lru_cache_add(). That's a separate thing.
Yep.
I perfectly agree with Daniel's use-once idea for _file cache_ pages. It's
really nice.
Now I'm not sure why directly adding swapcache pages to the inactive dirty
list with zero age improves things.
It _sounds_ like we avoid moving those swapcache pages to the active list,
only to later move them to the inactive list and then write them out.
That can result in less effective writeout clustering (it's just a
guess) and also CPU wastage, as Daniel mentioned previously.
I'm hopefully going to profile his code tomorrow with the VM stats stuff,
if I have time.
> Seems that lru_cache_add() is making decisions at a too-low level, and
> they are sometimes wrong. The decision as to what age to give the page
> and whether it should be activated needs to be made at a higher level.
Agreed.
Am Mittwoch, 25. Juli 2001 02:27 schrieb Linus Torvalds:
> On Tue, 24 Jul 2001, Rik van Riel wrote:
> > (using smaller chunks, or chunks which aren't a
> > multiple of 4kB should break the current code)
>
> Maybe Patrick is using stdio? In that case, the small chunks will be
> coalesced in the library layer anyway, which might explain the lack of
> breakage.
I am using normal read() calls to read the file after open()ing it on program
start and each call reads only 4 bytes. So if I am reading this right I am
seeing an improvement where I should not see one, or at least not such a big
one, right?
I did a few more test runs on each of the kernels to check if the results are
reproducible:
2.4.7-plain:
17.320u 115.100s 2:17.36 96.4% 0+0k 0+0io 110pf+0w
17.200u 94.170s 1:53.14 98.4% 0+0k 0+0io 110pf+0w
17.490u 113.730s 2:13.48 98.3% 0+0k 0+0io 110pf+0w
2.4.5-use_once:
14.730u 108.310s 2:09.57 94.9% 0+0k 0+0io 203pf+0w
13.880u 79.410s 1:38.64 94.5% 0+0k 0+0io 226pf+0w
14.840u 78.680s 1:37.86 95.5% 0+0k 0+0io 238pf+0w
The time under 2.4.5-use_once stays constant from the second run on (I tried
3 more times), while 2.4.7 shows pretty varying performance but I have never
seen it getting better than the 1:53.14 from the second run above. I had
stopped all services which I knew to cause periodic activity (exim,
cron/anacron) which could disturb the tests.
I also noticed that under 2.4.5, after the 3 test runs, the KDE taskbar
got swapped out, while under 2.4.7 this was not the case.
> Of course, if it improved performance even when "broken", that would be
> even better. I like those kind sof algorithms.
Who doesn't? :)
>
> Linus
--
Patrick Dreker
---------------------------------------------------------------------
> Is there anything else I can contribute?
The latitude and longtitude of the bios writers current position, and
a ballistic missile.
Alan Cox on [email protected]
Rob Landley <[email protected]> writes:
> Stupid question time:
>
> So basically (lemme get this straight):
>
> All VM allocation in 2.4 has become some variant of mmap. Either we're
> mmaping in the executable and libraries when we exec, we're mmaping in a
> file, or we're mmaping in the swap file/block device which is how we do
> anonymous page allocations.
You are getting warm. In 2.4 most caching is moving exclusively to
the page cache. Page cache write support was added between 2.2 and
2.4 allowing things to stop using the block cache. However everything
you are mentioning used the page cache in 2.2. The page cache is
superior to the block cache, as it caches the data we want to use, not
some intermediate form of it.
Having one general purpose cache layer increases the reliability
of the system, as we only have to have that layer debugged to know
everything is working. It is by no means the entire VM, but it is
the majority, and it is where most of the pages that can be freed on
demand live, so the page cache receives most of the attention.
> And this is why 2.4 keeps wanting to allocate swap space for pages that are
> still in core. And why the 2.4 VM keeps going ballistic on boxes that have
> more physical DRAM than they have swap space. (I.E. the 2X swap actually
> being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram
> box actually hurting matters by confusing the VM if swap is enabled AT ALL,
> since it clashes conceptually.)
Nope. There are several candidates for this, but none of them is simply
using the page cache.
The description and characterization of the 2.4 VM with respect to
swapping is wrong. I have seen too much overgeneralization on this
topic already, and refuse to treat it in the abstract. If you have a
reproducible problem case, report it specifically.
The biggest 2.4 swapping bug was a failure to free pages in the swap cache
that were not connected to any process. This should now be fixed.
The biggest issue is that the 2.4 VM has been heavily modified since
2.2 and the bugs are still being worked out. If you have an
application that needs noticeably more swap than 2.2 did, report it as
a bug. Any other treatment at this point would be overconfidence in
the correctness of the VM in corner cases.
> So getting swap to exclude in-core pages (so they don't take up space in the
> swap file) would therefore be conceptually similar to implementing "files
> with holes in them" support like EXT2 has for these swap file mmaps.
Not really. There are some similarities.
> And the argument that doing so might "fragment the swap file", while true, is
> actually a bit of a red herriing because the real objection is that it
> unnecessarily complicates an otherwise clean and straightforward concept.
Fragmenting the swap file is a red herring.
I have seen no real objections, just no real code. The thing to watch
out for is that you need to make certain that you actually have the
resources to swap a file out to. Consider having a swapfile on tmpfs.
Having a deadlock in the swapping code would be bad.
> Am I even close here? (That was the stupid question I mentioned earlier,
> knew I'd get around to it...) Reading the code's a bit harder when you don't
> have any sort of conceptual understanding of what the heck it's actually
> trying to do.
See my earlier answers for closeness.
> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest "detail"
> you glossed over was the seperation of memory into zones?)
Be careful here: what is maintainable is the concepts behind how it
works - the page cache, multiple memory zones, etc.
You have the code for a detailed how-it-works document, and it is
continually changing. The driving forces, tools, and ideas behind
what is happening should be static enough to document, however.
Eric
On Tue, 24 Jul 2001 [email protected] wrote:
>Date: Tue, 24 Jul 2001 16:26:12 -0400 (EDT)
>From: [email protected]
>To: [email protected]
>Content-Type: text/plain; charset=US-ASCII
>Subject: linux partitioning in IA64
>
>Hei,
>
>I am trying to install seawolf(linux IA64) on an IA 64 machine.
>For partitioning I went through several trails without success.
>(total # disk space is 10G). After all the initial settings, I
>restarted the machine to EFI (boot manager). Can't boot to
>linux and the EFI can't recognize files the installation CD
>initially saved on the hard disk. Are there special ways or
>tricks to partition the disk and to make it work?
Much better to send this type of question to [email protected],
as it is mostly OT here...
Hope this helps.
----------------------------------------------------------------------
Mike A. Harris - Linux advocate - Open Source advocate
Opinions and viewpoints expressed are solely my own.
----------------------------------------------------------------------
Definition: MCSE - Must Consult Someone Experienced
On Tuesday 24 July 2001 21:35, Rob Landley wrote:
> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest
> "detail" you glossed over was the seperation of memory into zones?)
I glossed over a lot of big details:
- zones
- types of pages: anonymous, swap cache, file, high, buffer, ramdisk
- interaction with page cache
- various flavors of swap-in and swap-out paths
- shared memory and the swap cache
- locking strategy
- aging strategy
- scanning policy
- deadlock and livelock avoidance measures
- unloaded vs loaded behaviour
- effect of load changes
- out of memory handling
- clustering (or lack of it)
- IO throttling
Each of these is a topic all by itself. You'll find all of them
discussed extensively here on lkml.
--
Daniel
On Wednesday 25 July 2001 10:32, Eric W. Biederman wrote:
> Consider having a swapfile on tmpfs.
Ooh, a truly twisted thought.
We'd know we're making progress when we can prove it doesn't deadlock.
--
Daniel
Hello,
I read this whole thread and it started to interest me.
If I understood the code correctly, the used bit on ptes
is tested once per 64 seconds (for a particular page) from
kswapd, or more often under memory pressure, and only for
pages on the active list. Correct?
Would it be possible to do cost-based eviction of pages?
Like computing a cost for each page periodically (with a variable
period - resorting a page to a finer-period list when its used bit
has been set for too long) and resorting the page into a hash bucket
with a key equal to the cost (the cost would be in a moderate integer
range).
Then you can simply evict pages from the lowest hash bucket list
and move them to the free list (after cleaning them).
It would allow developers to modify and test various cost
schemes, including LRU, LRU/2, aging, ... with a simple change
of the cost function.
I tried to create a statistical model of page eviction, and after
some simplification it seems that a cost c = (Tw-t)^2 + (Tr-t)^2,
where Tr and Tw are the expected times of the next page reference and
dirtying based on EWMA averaging, could minimize disk activity during
page eviction.
I wanted to try it, but I'd like to know others' opinions.
Probably someone has already had a similar idea...
Maybe it would be too expensive...
fill in other "maybe"s here... :-)
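A rough sketch of the bucket scheme, just to make the shape concrete
(the names, the bucket count and the clamping are all made up for
illustration):

#define NBUCKETS 64

struct cpage {
	double t_ref_pred;	/* Tr: EWMA-predicted time of next reference */
	double t_dirty_pred;	/* Tw: EWMA-predicted time of next dirtying */
	struct cpage *next;	/* bucket chain */
};

static struct cpage *bucket[NBUCKETS];

/* cost c = (Tw - t)^2 + (Tr - t)^2, clamped to the bucket range */
static int page_cost(const struct cpage *p, double now)
{
	double dw = p->t_dirty_pred - now;
	double dr = p->t_ref_pred - now;
	double c = dw * dw + dr * dr;
	return c >= NBUCKETS ? NBUCKETS - 1 : (int)c;
}

/* Periodic resort: drop the page into the bucket matching its cost.
   Eviction then simply walks bucket[0], bucket[1], ... upward. */
static void resort_page(struct cpage *p, double now)
{
	int b = page_cost(p, now);
	p->next = bucket[b];
	bucket[b] = p;
}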
devik
On Wednesday 25 July 2001 10:20, Patrick Dreker wrote:
> I did a few more test runs on each of the kernels to check if the
> results are reproducible:
> 2.4.7-plain:
> 17.320u 115.100s 2:17.36 96.4% 0+0k 0+0io 110pf+0w
> 17.200u 94.170s 1:53.14 98.4% 0+0k 0+0io 110pf+0w
> 17.490u 113.730s 2:13.48 98.3% 0+0k 0+0io 110pf+0w
>
> 2.4.5-use_once:
> 14.730u 108.310s 2:09.57 94.9% 0+0k 0+0io 203pf+0w
> 13.880u 79.410s 1:38.64 94.5% 0+0k 0+0io 226pf+0w
> 14.840u 78.680s 1:37.86 95.5% 0+0k 0+0io 238pf+0w
Look at the CPU dropping along with the times. Usually it goes the
other way.
> The time under 2.4.5-use_once stays constant from the second run on
> (I tried 3 more times), while 2.4.7 shows pretty varying performance
> but I have never seen it getting better than the 1:53.14 from the
> second run above. I had stopped all services which I knew to cause
> periodic activity (exim, cron/anacron) which could disturb the tests.
>
> I also noticed, that under 2.4.5 after the 3 test runs the KDE
> Taskbasr got swapped out, while under 2.4.7 this was not the case.
Not swapping out the task bar is a different problem, only loosely
related. The use-once thing is a step in the right direction because
it makes relatively more file IO pages available for deactivation
versus swap pages, and the task bar has a better chance of surviving.
However, it's not a really firm connection to the problem.
An additional line of attack is to look at the aging policy. I have a
strong sense we can do it better. Right now we're aging everything
down to a uniform zero and that really obviously throws away a lot of
information.
In the 2.5 timeframe, better unification of the page cache and swap
paths will make it much easier to focus on the taskbar problem.
--
Daniel
Daniel Phillips <[email protected]> writes:
> On Wednesday 25 July 2001 10:32, Eric W. Biederman wrote:
> > Consider having a swapfile on tmpfs.
>
> Ooh, a truly twisted thought.
>
> We'd know we're making progress when we can prove it doesn't deadlock.
Given all of that I think I'm going to go try to hack up a version that
works.
Ensuring we don't deadlock sounds like it will require some mechanisms
shared with journaling filesystems - the whole pinned RAM problem. So
to some extent it looks like a 2.4 problem, though I'd rather solve it
in 2.5.x first.
Eric
On Wed, 25 Jul 2001, Daniel Phillips wrote:
> On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > Now I'm not sure why directly adding swapcache pages to the inactive
> > dirty lits with 0 zero age improves things.
>
> Because it moves the page rapidly down the inactive queue towards the
> ->writepage instead of leaving it floating around on the active ring
> waiting to be noticed. We already know we want to evict that page,
We don't.
The page gets unmapped and added to the swap cache the first
time it wasn't referenced by the process.
This is before any page aging is done.
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
On Wednesday 25 July 2001 18:41, Rik van Riel wrote:
> On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > Now I'm not sure why directly adding swapcache pages to the
> > > inactive dirty lits with 0 zero age improves things.
> >
> > Because it moves the page rapidly down the inactive queue towards
> > the ->writepage instead of leaving it floating around on the active
> > ring waiting to be noticed. We already know we want to evict that
> > page,
>
> We don't.
>
> The page gets unmapped and added to the swap cache the first
> time it wasn't referenced by the process.
>
> This is before any page aging is done.
True, it's more accurate to say that we already know we want to *try*
evicting that page. A wrong guess should not make it all the way down
the inactive queue.
--
Daniel
> On Wednesday 25 July 2001 02:43, Anton Altaparmakov wrote:
> > At 01:04 25/07/2001, Daniel Phillips wrote:
> > >At that size you'd run a real risk of missing the tell-tale multiple
> > >references that mark a page as frequently used. Think about
> > > metadata here (right now, that usually just means directory pages,
> > > but later... who knows).
> >
> > This is not actually implemented yet, but NTFS TNG will use the page
> > cache to hold both the LCN (physical clusters) and MFT (on disk
> > inodes) allocation bitmaps in addition to file and directory pages.
> > (Admittedly the LCN case folds into the file pages in page cache one
> > as the LCN bitmap is just stored inside the usual data of a file
> > called $Bitmap, but the MFT case is more complex as it is in an
> > additional attribute inside the file $MFT so the normal file
> > read/write functions definitely cannot be used. The usual data here
> > is the actual on disk inodes...)
> >
> > Just FYI.
>
> Yes, not a surprise. Plus, I was working on an experimental patch to
> put Ext2 index blocks into the page cache just before I got off on this
> use-once tangent. Time to go back to that...
>
> --
> Daniel
> -
So here is a datapoint from a filesystem which uses pages for all of its
data and metadata, XFS. These are for 2.4.7 with and without the patch.
XFS metadata is held in a special address space; when not in use by XFS
the vm is free to reclaim it via the usual page reclaim methods, so XFS
is probably fairly heavily influenced by a change like this.
All runs on a 2 cpu 450MHz PIII with 128M of memory, single scsi partition.
1. dbench runs at various loads back to back, starting with
the highest load (3 runs of each load):
Vanilla 2.4.7 xfs
64
=====
Throughput 6.77363 MB/sec (NB=8.46704 MB/sec 67.7363 MBit/sec)
Throughput 7.03715 MB/sec (NB=8.79644 MB/sec 70.3715 MBit/sec)
Throughput 7.11706 MB/sec (NB=8.89632 MB/sec 71.1706 MBit/sec)
48
=====
Throughput 8.50251 MB/sec (NB=10.6281 MB/sec 85.0251 MBit/sec)
Throughput 8.38426 MB/sec (NB=10.4803 MB/sec 83.8426 MBit/sec)
Throughput 8.58152 MB/sec (NB=10.7269 MB/sec 85.8152 MBit/sec)
32
=====
Throughput 11.6378 MB/sec (NB=14.5473 MB/sec 116.378 MBit/sec)
Throughput 11.7586 MB/sec (NB=14.6983 MB/sec 117.586 MBit/sec)
Throughput 12.0686 MB/sec (NB=15.0858 MB/sec 120.686 MBit/sec)
16
=====
Throughput 16.9877 MB/sec (NB=21.2346 MB/sec 169.877 MBit/sec)
Throughput 16.9502 MB/sec (NB=21.1877 MB/sec 169.502 MBit/sec)
Throughput 16.8703 MB/sec (NB=21.0878 MB/sec 168.703 MBit/sec)
8
=====
Throughput 36.085 MB/sec (NB=45.1062 MB/sec 360.85 MBit/sec)
Throughput 31.8114 MB/sec (NB=39.7642 MB/sec 318.114 MBit/sec)
Throughput 36.0801 MB/sec (NB=45.1001 MB/sec 360.801 MBit/sec)
4
=====
Throughput 42.7741 MB/sec (NB=53.4677 MB/sec 427.741 MBit/sec)
Throughput 46.2591 MB/sec (NB=57.8239 MB/sec 462.591 MBit/sec)
Throughput 42.8487 MB/sec (NB=53.5608 MB/sec 428.487 MBit/sec)
2
=====
Throughput 48.776 MB/sec (NB=60.97 MB/sec 487.76 MBit/sec)
Throughput 48.4417 MB/sec (NB=60.5522 MB/sec 484.417 MBit/sec)
Throughput 47.0938 MB/sec (NB=58.8673 MB/sec 470.938 MBit/sec)
1
=====
Throughput 32.0957 MB/sec (NB=40.1196 MB/sec 320.957 MBit/sec)
Throughput 32.1901 MB/sec (NB=40.2376 MB/sec 321.901 MBit/sec)
Throughput 31.3972 MB/sec (NB=39.2465 MB/sec 313.972 MBit/sec)
2.4.7 xfs + patch
64
=====
Throughput 8.99448 MB/sec (NB=11.2431 MB/sec 89.9448 MBit/sec)
Throughput 9.07137 MB/sec (NB=11.3392 MB/sec 90.7137 MBit/sec)
Throughput 9.09924 MB/sec (NB=11.3741 MB/sec 90.9924 MBit/sec)
48
=====
Throughput 10.7328 MB/sec (NB=13.416 MB/sec 107.328 MBit/sec)
Throughput 10.791 MB/sec (NB=13.4887 MB/sec 107.91 MBit/sec)
Throughput 10.9185 MB/sec (NB=13.6482 MB/sec 109.185 MBit/sec)
32
=====
Throughput 12.6625 MB/sec (NB=15.8282 MB/sec 126.625 MBit/sec)
Throughput 12.7601 MB/sec (NB=15.9501 MB/sec 127.601 MBit/sec)
Throughput 12.6732 MB/sec (NB=15.8415 MB/sec 126.732 MBit/sec)
16
=====
Throughput 19.5692 MB/sec (NB=24.4614 MB/sec 195.692 MBit/sec)
Throughput 19.6163 MB/sec (NB=24.5204 MB/sec 196.163 MBit/sec)
Throughput 19.6773 MB/sec (NB=24.5966 MB/sec 196.773 MBit/sec)
8
=====
Throughput 36.8323 MB/sec (NB=46.0404 MB/sec 368.323 MBit/sec)
Throughput 36.9346 MB/sec (NB=46.1683 MB/sec 369.346 MBit/sec)
Throughput 36.4622 MB/sec (NB=45.5778 MB/sec 364.622 MBit/sec)
4
=====
Throughput 42.883 MB/sec (NB=53.6037 MB/sec 428.83 MBit/sec)
Throughput 42.1596 MB/sec (NB=52.6995 MB/sec 421.596 MBit/sec)
Throughput 42.9602 MB/sec (NB=53.7002 MB/sec 429.602 MBit/sec)
2
=====
Throughput 44.2798 MB/sec (NB=55.3498 MB/sec 442.798 MBit/sec)
Throughput 39.5441 MB/sec (NB=49.4301 MB/sec 395.441 MBit/sec)
Throughput 46.8569 MB/sec (NB=58.5711 MB/sec 468.569 MBit/sec)
1
=====
Throughput 32.0682 MB/sec (NB=40.0852 MB/sec 320.682 MBit/sec)
Throughput 32.1923 MB/sec (NB=40.2403 MB/sec 321.923 MBit/sec)
Throughput 32.1677 MB/sec (NB=40.2097 MB/sec 321.677 MBit/sec)
A definite improvement under high loads, and the lower loads appear
about the same. I reran the dbench 2 numbers and got results in line
with the non patched kernel.
2. Kernel builds
Unmodified kernel:
time make -j3 bzImage
948.880u 70.660s 8:45.88 193.8% 0+0k 0+0io 721072pf+0w
948.360u 69.950s 8:43.68 194.4% 0+0k 0+0io 721072pf+0w
Patched kernel:
942.660u 54.320s 8:29.49 195.6% 0+0k 0+0io 721044pf+0w
942.150u 54.900s 8:29.35 195.7% 0+0k 0+0io 721041pf+0w
Looks good to me - 22% less system time for the kernel build
is pretty good. Also looks like my cpus are slower than most
people's nowadays :-(
Steve
Hi,
Here are some more figures.
2.4.7 + ide patch + ide dma timeout fix + lvm beta8 + reiserfs fs
Throughput 19.3959 MB/sec (NB=24.2449 MB/sec 193.959 MBit/sec)
dbench 20 24.88s user 70.89s system 69% cpu 2:17.23 total
Throughput 19.0447 MB/sec (NB=23.8059 MB/sec 190.447 MBit/sec)
dbench 20 24.37s user 69.24s system 67% cpu 2:19.63 total
tob -f - -full misc > misc-0725.tob 235.80s user 321.73s system 42% cpu 21:52.49 total
2.4.7 + ide patch + ide dma timeout fix + lvm beta8 + reiserfs fs
Throughput 19.3395 MB/sec (NB=24.1744 MB/sec 193.395 MBit/sec)
dbench 20 23.70s user 70.68s system 68% cpu 2:17.52 total
Throughput 19.1872 MB/sec (NB=23.9841 MB/sec 191.872 MBit/sec)
dbench 20 24.30s user 69.42s system 67% cpu 2:18.60 total
tob -f - -full misc > /back/misc-0725.tob 229.64s user 326.26s system 43% cpu 21:11.77 total
Basically no big changes here - which is good news.
Ed Tomlinson
Rik van Riel <[email protected]> writes:
> On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > Now I'm not sure why directly adding swapcache pages to the inactive
> > > dirty lits with 0 zero age improves things.
> >
> > Because it moves the page rapidly down the inactive queue towards the
> > ->writepage instead of leaving it floating around on the active ring
> > waiting to be noticed. We already know we want to evict that page,
>
> We don't.
Agreed. The kinds of ``aging'' don't match up so we can't tell if
it meets our usual criteria for aging.
> The page gets unmapped and added to the swap cache the first
> time it wasn't referenced by the process.
>
> This is before any page aging is done.
Actually there has been aging done. Unless you completely disable
testing for pte_young. It is a different kind of aging but it is
aging.
Eric
On Thursday 26 July 2001 10:36, Eric W. Biederman wrote:
> Rik van Riel <[email protected]> writes:
> > On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > > Now I'm not sure why directly adding swapcache pages to the
> > > > inactive dirty lits with 0 zero age improves things.
> > >
> > > Because it moves the page rapidly down the inactive queue towards
> > > the ->writepage instead of leaving it floating around on the
> > > active ring waiting to be noticed. We already know we want to
> > > evict that page,
> >
> > We don't.
>
> Agreed. The kinds of ``aging'' don't match up so we can't tell if
> it meets our usual criteria for aging.
Well, in the absence of benchmark evidence it's hard to tell how
valuable the usual criteria really are. That's another topic though,
because the question here is not how the aging matches up, but how the
inactive queue handling matches up, see below.
> > The page gets unmapped and added to the swap cache the first
> > time it wasn't referenced by the process.
> >
> > This is before any page aging is done.
>
> Actually there has been aging done. Unless you completely disable
> testing for pte_young. It is a different kind of aging but it is
> aging.
In the case of a process page and in the case of a file IO page the
situation is the same: we have decided to put the page on trial. Once
we have arrived at that decision its previous history doesn't matter,
so it makes sense to set its age to a known state. In this case it's
0, meaning "on trial".
Consider that the process page will only ever be down-aged after being
unmapped. So if we put it on the active queue, it just acts like a
big, slow, inefficient timer. Why not put it on a fifo queue instead?
It's the same thing, just more efficient. But we already have a fifo
queue, the inactive_dirty_queue, which we are using for everything
else, so why not use it for this too? Then at least we know it's fair.
In other words, there is no obvious need to treat a process page
differently from a file IO page once we decide to put it on trial.
There does seem to be a dangling thread here though - when a process
page is unmapped and added to swap cache in try_to_swap_out then later
faulted back in, I don't see where we "rescue" the page from the
inactive queue. Maybe I'm just not looking hard enough.
--
Daniel
On Thu, 26 Jul 2001, Daniel Phillips wrote:
> There does seem to be a dangling thread here though - when a process
> page is unmapped and added to swap cache in try_to_swap_out then later
> faulted back in, I don't see where we "rescue" the page from the
> inactive queue. Maybe I'm just not looking hard enough.
do_swap_page()->lookup_swap_cache()->__find_page_nolock()->SetPageReferenced().
The referenced bit will make page_launder/reclaim_page move the page to
the active list.
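So the rescue is essentially the check quoted earlier in the thread -
roughly (a paraphrase of the patched reclaim_page/page_launder test,
not the literal code):

/* At the tail of the inactive list: the fault handler has just set
   PG_referenced via __find_page_nolock(), so instead of being reclaimed
   the page goes back to the active ring with a fresh age. */
if (PageReferenced(page) || (!page->buffers && page_count(page) > 1)) {
	del_page_from_inactive_clean_list(page);
	add_page_to_active_list(page);
	page->age = PAGE_AGE_START;
	continue;
}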
On Thursday 26 July 2001 12:38, Marcelo Tosatti wrote:
> On Thu, 26 Jul 2001, Daniel Phillips wrote:
> > There does seem to be a dangling thread here though - when a
> > process page is unmapped and added to swap cache in try_to_swap_out
> > then later faulted back in, I don't see where we "rescue" the page
> > from the inactive queue. Maybe I'm just not looking hard enough.
>
> do_swap_page()->lookup_swap_cache()->__find_page_nolock()->
> SetPageReferenced().
>
> The referenced bit will make page_launder/reclaim_page move the page
> to the active list.
Yes. And there I set ->age to a known state (START_AGE), so everything
seems to be in order.
--
Daniel
On Wed, 25 Jul 2001, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> >
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
>
> In some *brief* testing here, it appears that the use-once changes
> make an improvement for light-medium workloads. With swap-intensive
> workloads, the (possibly accidental) changes to swapcache aging
> in fact improve things a lot, and use-once makes things a little worse.
>
> This is a modified Galbraith test: 64 megs of RAM, `make -j12
> bzImage', dual CPU:
>
> 2.4.7: 6:54
> 2.4.7+Daniel's patch 6:06
> 2.4.7+the below patch 5:56
>
> --- mm/swap.c 2001/01/23 08:37:48 1.3
> +++ mm/swap.c 2001/07/25 04:08:59
> @@ -234,8 +234,8 @@
> DEBUG_ADD_PAGE
> add_page_to_active_list(page);
> /* This should be relatively rare */
> - if (!page->age)
> - deactivate_page_nolock(page);
> + deactivate_page_nolock(page);
> + page->age = 0;
> spin_unlock(&pagemap_lru_lock);
> }
>
> This change to lru_cache_add() is the only change made to 2.4.7,
> and it provides the 17% speedup for this swap-intensive load.
After some thought I think this is due to swapin readahead.
We add "drop behind" behaviour to swap readahead, so we have less impact
on the system due to swap readahead "misses".
That is nice but dangerous under swap IO intensive loads:
We start swapin readahead, bring the first page of the "cluster" into
memory, block on the next swap page's IO (the ones we are doing
readahead on), and in the meantime the first page we read is reclaimed
by someone else.
Then we'll have to reread the first page at the direct
read_swap_cache_async() called by do_swap_page().
I really want to confirm if this is happening right now. Hope to do it
soon.