This implements anon_vma for the anonymous memory unmapping and objrmap
for the file mappings, effectively removing rmap completely and
replacing it with more efficient algorithms (in terms of memory
utilization and cpu cost for the fast paths).
I attached the results for a basic benchmark comparing 2.4, 2.6 and
2.6-aa.
The next things to fix are the nonlinear mappings (probably I will use the
sysctl for the short term; a sysctl may be needed anyway to allow
mlock for all users), and then the rbtree for the i_mmap{shared} lists (the
prio_tree isn't stable yet; over time we can replace the rbtree with the
prio_tree, of course).
I'm running this kernel while writing this and it's under a 500M swap load
without problems. Ingo complains that some workloads with zillions of vmas
in the same file will not work well, but 1) those workloads are supposed
to use remap_file_pages on 32bit archs, and 2) they want mlock anyway,
and this vm design is optimal on 64bit without requiring either
remap_file_pages or mlock there.
This is better than anonmm because it's fine-grained, so applications
using a local "malloc" in a process group will scale better during heavy
swapping (in the presence of tons of processes in the same group), and
secondly it gets mremap right (tons of programs use cow heavily, the
most obvious example is kde; I don't know if they ever run mremap, but
this sounds safer overall). It costs a bit more memory but I don't
think it matters: an anon_vma is only 12 bytes and it can cover
gigabytes of address space.
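To give an idea of the size, on a 32bit arch the anon_vma object boils
down to something like this (an illustrative sketch only; the field names
are not necessarily the ones in the patch):

    /*
     * Sketch only, not the exact declaration from the patch: one
     * word of spinlock plus a two-word list head linking all the
     * vmas that may map this anon_vma's pages = 12 bytes.
     */
    struct anon_vma {
            spinlock_t lock;        /* serializes the vma list below */
            struct list_head head;  /* anonymous vmas sharing this object */
    };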
I did some robustness improvements as well in the ->nopage handling,
and a fix in bttv, which didn't set VM_RESERVED, plus some cleanup in
the vma merging.
vma merging for mprotect and mremap is disabled for now, but it can be
enabled again, and when we re-enable it, file-backed vmas should be
supported in the merging too (so far they aren't).
In terms of performance, I hope with this effort we'll win back the
double-digit percentage loss compared to 2.4 based kernels; if this
doesn't bring all the performance back, the next suspect is the scheduler
lacking HT awareness or something like that (but I doubt it; see the
bench page and imagine a much bigger box with quite a bit more ram
than 1G).
With regard to the x86 32bit arch, this releases 4 bytes of page_t, plus
it saves tons of zone-normal memory, making it possible to use >=4G boxes
with the optimal 3:1 model like we do in 2.4 today (something not doable
in 2.6 mainline due to the huge rmap overhead, which in practice limits
the 2.6 kernel to non-PAE boxes [depending on the workload]).
Credit for the original objrmap idea goes to David Miller; the first
good implementation for 2.6 is from Dave McCracken, but only for the file
mappings; the first effort to cover the anonymous memory too is from
Hugh Dickins, and Wli also maintained the anonmm code later. I coalesced
the good stuff together, plus I designed and implemented the anon_vma
solution for anonymous memory.
Alternate solutions to anon_vma have been proposed and they may be
considered as alternatives to this. Ideally we should split the anon_vma
patch in two parts, one that could be re-used by the anonmm design,
though I had no time to split it so far. I'm not claiming anon_vma is
definitely superior to anonmm, but it's the solution I prefer. It is clearly
more efficient in some very high end workloads I have in mind, but on
small boxes it takes a bit more memory, so for the simpler workloads
anonmm is preferable; plus anonmm allows full vma merging (though it
requires a cow during mremap).
URL:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.6/2.6.5-rc1-aa1.gz
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.6/2.6.5-rc1-aa1/
Only in 2.6.5-rc1-aa1: 00000_extraversion-1
Add -aa1.
Only in 2.6.5-rc1-aa1: 00000_twofish-2.6.gz
Compatibility cryptoloop.
Only in 2.6.5-rc1-aa1: 00100_objrmap-core-1.gz
Objrmap core from Dave McCracken implementing the objrmap
methods for file mappings, reusing the truncate
infrastructure (as in DaveM's effort). This doesn't
obsolete rmap yet; anonymous memory is still tracked
via rmap.
Only in 2.6.5-rc1-aa1: 00101_anon_vma-1.gz
Implements innovative objrmap methods for anonymous
memory using the fine-grained anon_vma design, which handles
mremap gracefully and avoids scanning all the mms in a child
group to find the right vmas, as happens with anonmm. This
effectively obsoletes rmap, and in fact it releases 4 bytes per
page_t by dropping the pte_chains.
The combination of anon_vma and objrmap-core avoids a huge waste
of memory and cpu resources, and it boosts smp scalability too.
This makes it possible to use >=4G boxes without the 4:4 slowdown.
The nonlinear vmas aren't covered yet, so this is an insecure
kernel at this time (only in terms of a local security
DoS; no root compromise can happen or anything like that; it's
like allowing a bad user to use mlock, so he can take
down the machine locally). The nonlinear vmas will be covered too
with further patches; this is already more than good enough to start
beating on it with real loads like single user desktop usage.
The file based methods will have to be optimized further too with
an rbtree.
Both the nonlinear vmas and the rbtree for the file methods'
i_mmap{shared} will be covered in the next -aa.
If you find compile breakages, s/->mapping/->as.mapping/ will fix them.
Only in 2.6.5-rc1-aa1: 00200_kgdb-ga-1.gz
Only in 2.6.5-rc1-aa1: 00201_kgdb-ga-recent-gcc-fix-1.gz
Only in 2.6.5-rc1-aa1: 00201_kgdb-THREAD_SIZE-fixes-1.gz
Only in 2.6.5-rc1-aa1: 00201_kgdb-x86_64-support-1.gz
kgdb from Andrew's -mm tree.
On Thu, 18 Mar 2004, Andrea Arcangeli wrote:
> This implements anon_vma for the anonymous memory unmapping and objrmap
> for the file mappings, effectively removing rmap completely and
> replacing it with more efficient algorithms (in terms of memory
> utilization and cpu cost for the fast paths).
Cool. I'm glad we've figured out how to fix all the problems
with object based rmap. I'd be happy to get rid of the pte
based reverse mapping stuff...
> The next things to fix are the nonlinear mappings (probably I will use the
> sysctl for the short term; a sysctl may be needed anyway to allow
> mlock for all users), and then the rbtree for the i_mmap{shared} lists (the
> prio_tree isn't stable yet; over time we can replace the rbtree with the
> prio_tree, of course).
Yeah, we'll definitely need the prio_tree stuff before the
object based rmap can go into the mainline kernel...
As for nonlinear mappings, if the VMA is locked there's no need to
check anything ... the swappable VMAs could be a bit of a
problem, though I guess we could just unmap a large
number of ptes ;)
> I'm running this kernel while writing this and it's under a 500M swap load
> without problems. Ingo complains that some workloads with zillions of vmas
> in the same file will not work well, but 1) those workloads are supposed
> to use remap_file_pages on 32bit archs, and 2) they want mlock anyway,
> and this vm design is optimal on 64bit without requiring either
> remap_file_pages or mlock there.
Take a look at User Mode Linux ...
I don't think it wants to use mlock, and I suspect it doesn't
use remap_file_pages (yet?). Once it does use remap_file_pages,
we'll still need to find a more or less efficient way to swap
out those pages ...
> Alternate solutions to anon_vma have been proposed and they may be
> considered as alternatives to this. Ideally we should split the anon_vma
> patch in two parts, one that could be re-used by the anonmm design,
> though I had no time to split it so far. I'm not claiming anon_vma is
> definitely superior to anonmm, but it's the solution I prefer. It is clearly
> more efficient in some very high end workloads I have in mind, but on
> small boxes it takes a bit more memory, so for the simpler workloads
> anonmm is preferable; plus anonmm allows full vma merging (though it
> requires a cow during mremap).
They both have their advantages and disadvantages. We'll
have to find out in practice which one works better...
I hope to get some time soon to implement the mm-based
reverse mapping, so we can test both alternatives.
At that point we'll want to split the file-backed stuff off
into a separate patch, on which we could layer either your
vma-based or Linus's mm-based reverse mapping scheme.
I'm kind of curious which one will end up better under
which workloads ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, Mar 18, 2004 at 10:32:58AM -0500, Rik van Riel wrote:
> At that point we'll want to split the file-backed stuff off
the file-backed stuff is already separated; what can be separated further
is the preparation for the page->as.mapping support.
> I'm kind of curious which one will end up better under
> which workloads ;)
I know of big iron critical workloads where mine will work better,
though I agree that for a desktop not running kde, anonmm is cheaper in
terms of memory utilization (it saves a bit of memory; see the numbers below).
andrea@dualathlon:~> egrep 'vm_area|anon_vma' /proc/slabinfo
vm_area_struct 6613 8500 76 50 1 : tunables 120 60 8 : slabdata 170 170 0
anon_vma 2085 2250 12 250 1 : tunables 120 60 8 : slabdata 9 9 0
andrea@dualathlon:~> free
total used free shared buffers cached
Mem: 1031348 1014156 17192 0 18144 665052
-/+ buffers/cache: 330960 700388
Swap: 1028152 0 1028152
andrea@dualathlon:~>
anonmm would take 12*2085 + 6613*12 = 104k less on my 1G desktop loaded with
my usual stuff (not really: the difference is less than 100k, since anonmm
takes quite some bytes too, which is significant as well if we count the
kbytes like I'm doing). I believe those 100k may be worth it for the super
high end workload swapping 8G on a 16G box with hundreds of tasks, each task
with its own big chunk of anonymous memory: anon_vma will avoid checking all
the hundred mms for each anon page we swap, plus it handles mremap
efficiently, which sounds safer for the short term.
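Spelling the arithmetic out from the slabinfo above (12 bytes per
anon_vma object, plus what I count as 12 bytes of per-vma linkage):

    2085 anon_vma objects * 12 bytes = 25020 bytes
    6613 vmas * 12 bytes = 79356 bytes
    25020 + 79356 = 104376 bytes ~= 104k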
On Thu, Mar 18, 2004 at 10:32:58AM -0500, Rik van Riel wrote:
> I don't think it wants to use mlock, and I suspect it doesn't
for the short term it's ok. I agree that longer term we may want to make them
swappable, but we don't necessarily need to support efficient swapping
for huge vmas (i.e. 10G; if one wants efficient swapping,
remap_file_pages shouldn't be used).
Could you tell me what you called the mlock sysctl that Wli talked about?
(Wli mentioned the mlock sysctl in a patch from you.) Thanks.
On Thu, 18 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 18, 2004 at 10:32:58AM -0500, Rik van Riel wrote:
> > I don't think it wants to use mlock, and I suspect it doesn't
>
> for the short term it's ok. I agree that longer term we may want to make them
> swappable, but we don't necessarily need to support efficient swapping
> for huge vmas (i.e. 10G; if one wants efficient swapping,
> remap_file_pages shouldn't be used).
Agreed, it's definitely a medium term thing to implement.
To be done before merging mainstream, but after stabilising
the current code...
> Could you tell me what you called the mlock sysctl that Wli talked about?
> (Wli mentioned the mlock sysctl in a patch from you.) Thanks.
It's in the RHEL3 kernel, a patch named linux-2.4.21-mlock.patch.
Basically it allows any normal process to have up to its
current->rlim[RLIMIT_MEMLOCK].rlim_cur memory locked.
Root can configure, in /etc/security/limits.conf, how much
memory the processes of each user are allowed to mlock.
Processes with CAP_IPC_LOCK set can mlock as much as they
want, unrestricted by the limit.
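In outline, the check works something like this (a sketch from memory,
not the actual linux-2.4.21-mlock.patch code):

    /*
     * Sketch only, not the actual patch: a normal process may have
     * up to its RLIMIT_MEMLOCK soft limit locked; CAP_IPC_LOCK
     * bypasses the limit entirely.
     */
    static int can_mlock(struct mm_struct *mm, unsigned long npages)
    {
            unsigned long limit;

            if (capable(CAP_IPC_LOCK))
                    return 1;
            limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
            return mm->locked_vm + npages <= limit;
    }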
Last year at the kernel summit, Linus seemed pretty happy
with this approach. I think Wim Coekaerts at Oracle has
ported the patch to 2.6 already...
I suspect the security paranoid will like this patch too,
because it allows gnupg to mlock the memory it wants to
have locked.
Now it just needs to be submitted to Andrew and Linus ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, 18 Mar 2004, Rajesh Venkatasubramanian wrote:
> > Yeah, we'll definitely need the prio_tree stuff before the
> > object based rmap can go into the mainline kernel...
>
> The prio_tree stuff is progressing well. I managed to boot and do some
> preliminary testing both on UP and SMP. I want to test it a bit more.
> Moreover, I have yet to merge it with objrmap. I want to finish that
> and run some tests - rmap-test.c, test-mmap3.c, etc.
>
> If you are interested in seeing how it looks now:
>
> http://www-personal.engin.umich.edu/~vrajesh/linux/prio_tree/
Cool. I'll definitely take a look ...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, 18 Mar 2004 11:49:52 -0500 (EST), Rik van Riel <[email protected]> wrote:
> I suspect the security paranoid will like this patch too,
> because it allows gnupg to mlock the memory it wants to
> have locked.
I think it's good for cd-burning too. Currently most of the distros set
the suid bit on cdrecord (which implies some security bugs). You can
work around that by changing the device node's permissions and killing the suid bit:
brw-rw---- 1 root burning 22, 0 2003-05-23 16:41 /dev/cd-rw

but cdrecord will still cry:
cdrecord: Operation not permitted. WARNING: Cannot do mlockall(2).
cdrecord: WARNING: This causes a high risk for buffer underruns.

With that patch desktop users will be able to burn cds without falling into
buffer underruns and without using the suid hack, I guess? Nice work :)
Diego Calleja
On Thu, 18 Mar 2004, Andrea Arcangeli wrote:
> This implements anon_vma for the anonymous memory unmapping and objrmap
> for the file mappings, effectively removing rmap completely and
> replacing it with more efficient algorithms (in terms of memory
> utilization and cpu cost for the fast paths).
There's a lot about this that I like.
I think your use of anonymous vm_pgoff is brilliant. I spent the
afternoon vacillating between believing it does the (mremap move)
job so smoothly we don't notice, and believing it's all a con trick
and does nothing at all. Currently I believe it does the job!
And very satisfying how anon and obj converge in page_referenced
and try_to_unmap: although they converge in my version too, I still
have to use find_vma on anons at the inner level, yours much nicer.
I think the winning answer will actually lie between yours and mine:
the anon vmas linked into one tree belonging to the whole fork group.
Some of the messier aspects of your design come from dealing with the
vmas separately, with merging issues currently unresolved. A lot of
that goes away if you lump all the anon vmas of the same fork group
into the same list. Yes, of course the _list_ would be slower to
search; but once we have the right tree structure for the obj vmas,
I believe it should apply just as effectively to the anon vmas.
I am disappointed that you're persisting with that "as" union: many
source files remain to be converted to page->as.mapping, and I don't
think it's worth it at all. mm/objrmap.c is the _only_ file needing
a cast there for the anon case, please don't demand edits of so many
filesystems and others, not at this stage anyway. You've been very
diligent about inserting BUG_ONs, but filesystems deal with their own
pages, I don't think it's necessary to impose such changes all over.
And I still think you're making the wrong choice on page_mapping,
to supply &swapper_space if PageSwapCache. I know you disapprove
of the "if (PageSwapCache(page)) blk_run_queues();" that ends up
in my sync_page(), but mostly you're imposing an irrelevant test
in a common inline function: you'll find _very_ few places really
find the swapper_space mapping useful, I'd rather they be explicit.
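For anyone following along, the choice in question is roughly this (my
paraphrase of the inline, not the exact code from the patch):

    /*
     * Paraphrase of the page_mapping choice under discussion, not
     * the exact code: swapcache pages get &swapper_space, so every
     * caller pays the PageSwapCache test in the common inline.
     */
    static inline struct address_space *page_mapping(struct page *page)
    {
            if (PageSwapCache(page))
                    return &swapper_space;
            return page->as.mapping;
    }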
I've not reviewed thoroughly at all, was concentrating on bringing
my anonmm version up to date. Still got to add my comment headers,
will post later without mremap-move and non-linear solutions.
A few minor patches to yours below.
A couple of filesystems history has left in my standard .config but
apparently not in yours. Some over-enthusiastic BUG_ONs which will
trigger when shmem_writepage swizzles a tmpfs page to swap at the
wrong moment (I saw the filemap.c in practice; I think truncate.c
ones could hit too; didn't search very hard for others).
And impressive though your saving on page tables was ;)
PageTables: 0 kB
a patch to correct that is below. Yes, I've left the inc out of pte_offset_kernel:
should it have been there anyway? and it's not clear to me whether ppc
and ppc64 can manage an early per-cpu increment. You'll find you've
broken ppc and ppc64, which have grown to use the pgtable rmap stuff
for themselves, you'll probably want to grab some of my arch patches.
Hugh
--- 26051a1/fs/cramfs/inode.c 2004-03-17 22:02:53.000000000 +0000
+++ 26051A1/fs/cramfs/inode.c 2004-03-18 18:57:15.784636856 +0000
@@ -422,7 +422,7 @@ static struct dentry * cramfs_lookup(str
static int cramfs_readpage(struct file *file, struct page * page)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page->as.mapping->host;
u32 maxblock, bytes_filled;
void *pgdata;
--- 26051a1/fs/sysv/dir.c 2004-03-17 22:02:53.000000000 +0000
+++ 26051A1/fs/sysv/dir.c 2004-03-18 18:57:15.793635488 +0000
@@ -39,10 +39,10 @@ static inline unsigned long dir_pages(st
static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
{
- struct inode *dir = (struct inode *)page->mapping->host;
+ struct inode *dir = (struct inode *)page->as.mapping->host;
int err = 0;
- page->mapping->a_ops->commit_write(NULL, page, from, to);
+ page->as.mapping->a_ops->commit_write(NULL, page, from, to);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -224,7 +224,7 @@ got_it:
from = (char*)de - (char*)page_address(page);
to = from + SYSV_DIRSIZE;
lock_page(page);
- err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+ err = page->as.mapping->a_ops->prepare_write(NULL, page, from, to);
if (err)
goto out_unlock;
memcpy (de->name, name, namelen);
@@ -244,7 +244,7 @@ out_unlock:
int sysv_delete_entry(struct sysv_dir_entry *de, struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page->as.mapping;
struct inode *inode = (struct inode*)mapping->host;
char *kaddr = (char*)page_address(page);
unsigned from = (char*)de - kaddr;
@@ -346,13 +346,13 @@ not_empty:
void sysv_set_link(struct sysv_dir_entry *de, struct page *page,
struct inode *inode)
{
- struct inode *dir = (struct inode*)page->mapping->host;
+ struct inode *dir = (struct inode*)page->as.mapping->host;
unsigned from = (char *)de-(char*)page_address(page);
unsigned to = from + SYSV_DIRSIZE;
int err;
lock_page(page);
- err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+ err = page->as.mapping->a_ops->prepare_write(NULL, page, from, to);
if (err)
BUG();
de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
--- 26051a1/mm/filemap.c 2004-03-18 11:08:59.000000000 +0000
+++ 26051A1/mm/filemap.c 2004-03-18 18:57:15.793635488 +0000
@@ -460,7 +460,6 @@ repeat:
/* Has the page been truncated while we slept? */
BUG_ON(PageAnon(page));
- BUG_ON(PageSwapCache(page));
if (page->as.mapping != mapping || page->index != offset) {
unlock_page(page);
page_cache_release(page);
--- 26051a1/mm/memory.c 2004-03-18 11:08:59.000000000 +0000
+++ 26051A1/mm/memory.c 2004-03-18 18:57:15.793635488 +0000
@@ -104,6 +104,7 @@ static inline void free_one_pmd(struct m
}
page = pmd_page(*dir);
pmd_clear(dir);
+ dec_page_state(nr_page_table_pages);
pte_free_tlb(tlb, page);
}
@@ -162,6 +163,7 @@ pte_t fastcall * pte_alloc_map(struct mm
pte_free(new);
goto out;
}
+ inc_page_state(nr_page_table_pages);
pmd_populate(mm, pmd, new);
}
out:
--- 26051a1/mm/truncate.c 2004-03-18 11:08:59.000000000 +0000
+++ 26051A1/mm/truncate.c 2004-03-18 18:57:15.793635488 +0000
@@ -48,7 +48,6 @@ static void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
BUG_ON(PageAnon(page));
- BUG_ON(PageSwapCache(page));
if (page->as.mapping != mapping)
return;
@@ -73,7 +72,6 @@ static int
invalidate_complete_page(struct address_space *mapping, struct page *page)
{
BUG_ON(PageAnon(page));
- BUG_ON(PageSwapCache(page));
if (page->as.mapping != mapping)
return 0;
@@ -264,7 +262,6 @@ void invalidate_inode_pages2(struct addr
lock_page(page);
BUG_ON(PageAnon(page));
- BUG_ON(PageSwapCache(page));
if (page->as.mapping == mapping) { /* truncate race? */
wait_on_page_writeback(page);
next = page->index + 1;
After one day of stress testing I could reproduce this:
------------[ cut here ]------------
kernel BUG at mm/objrmap.c:271!
invalid operand: 0000 [#1]
SMP
CPU:    0
EIP:    0060:[<c014df45>]    Not tainted
EFLAGS: 00210246   (2.6.5-rc1-aa1)
EIP is at page_add_rmap+0x145/0x180
eax: 00000000   ebx: c18d8740   ecx: 00200246   edx: c1a0e6a0
esi: e2c26ea8   edi: 4040a8cc   ebp: 00ab3d00   esp: c4323ec8
ds: 007b   es: 007b   ss: 0068
Process python (pid: 16772, threadinfo=c4322000 task=e63285f0)
Stack: 389c8025 00000000 f43c9028 c0149803 00000000 f43ca338 f3e50400 c18fb628
       c197d970 00000001 c18d8740 f3e50404 4040a8cc e2c26ea8 e27d3040 e27d3040
       e27d3060 e2c26ea8 e63285f0 c0117b24 00000000 c1a14040 00000000 4040a8cc
Call Trace:
 [<c0149803>] handle_mm_fault+0x4c3/0x900
 [<c0117b24>] do_page_fault+0x164/0x534
 [<c011945a>] recalc_task_prio+0x8a/0x1c0
 [<c011b413>] schedule+0x1e3/0x680
 [<c014c04a>] do_munmap+0x2da/0x430
 [<c01179c0>] do_page_fault+0x0/0x534
 [<c0106d01>] error_code+0x2d/0x38
Code: 0f 0b 0f 01 13 00 39 c0 eb 90 0f 0b eb 00 13 00 39 c0 e9 2f
After some more debugging I realized what happened. It's a race condition
that is very hard to trigger.
There's one task with some anonymous memory swapped out; the pte points to
the swp_entry. This task forks() and the swp_entry is duplicated.
One of the two children generates a swapin with a _read_ (so it remains a
swapcache cow), so one of the two ptes is replaced with a pointer to the
swapcache instead of the swp_entry; the page has mapcount == 1 and count
== 2.
Then memory pressure causes try_to_unmap_one to unmap the swapcache,
setting the pte back to the swp_entry, and since there was only 1
mapping, it also clears the PG_anon bitflag; but right before
try_to_unmap_one clears PG_anon, the other process takes a minor fault
like this:
    process 1                     process 2
    ---------                     ---------
    swapout
                                  do_swap_page
                                  lookup_swap_cache
                                  SetPageAnon
    ClearPageAnon
                                  page_add_rmap
                                  BUG_ON(!page->as.mapping) <- crash
The window for the race is incredibly small; it takes hours of heavy
swap on a real life system doing forks, sleeping and touching ram readonly
to trigger it (my testbox never triggered it yet despite the load).
The fix is simple: always set and clear PG_anon under the page_map_lock.
This avoids the race, since all ClearPageAnon already runs under the
page_map_lock. I will implement and test it in a few hours.
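In sketch form the rule becomes (the real patch appears later in this
thread; this is just the shape of it):

    /*
     * Sketch only -- see the actual patch further down the thread.
     * Both the set and the clear of PG_anon happen with the
     * page_map_lock held, so the racing ClearPageAnon in
     * try_to_unmap_one can no longer interleave between them.
     */
    page_map_lock(page);
    if (anon && !PageAnon(page))
            SetPageAnon(page);      /* first anonymous mapping */
    if (!page->mapcount++)
            inc_page_state(nr_mapped);
    page_map_unlock(page);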
The other way to fix it is to go back to doing it like Dave, that is to
clear PG_anon implicitly in __free_pages_ok, but I don't like that since
it's not robust: if we lose a bitflag with my code the kernel will oops
immediately, so it's much easier to find the path that lost the bitflag.
I prefer the objrmap code to manage PG_anon fully explicitly
(atomically during the !mapcount++ and !--mapcount transitions) for both
the setting and the clearing of the bitflag, and if we lose it we crash
immediately in __free_pages_ok (instead of silently clearing the bitflag
like it would happen in the objrmap patch). I find this more robust.
On Thu, 18 Mar 2004, Andrea Arcangeli wrote:
>
> The fix is simple: always set and clear PG_anon under the page_map_lock.
> This avoids the race, since all ClearPageAnon already runs under the
> page_map_lock. I will implement and test it in a few hours.
>
> ... I find this more robust.
Absolutely, that's what I did too. My old page_add_rmap had an anon flag,
but the new patches have page_add_anon_rmap and page_add_obj_rmap.
Hugh
On Thu, Mar 18, 2004 at 08:41:45PM +0000, Hugh Dickins wrote:
> On Thu, 18 Mar 2004, Andrea Arcangeli wrote:
>
> > This implements anon_vma for the anonymous memory unmapping and objrmap
> > for the file mappings, effectively removing rmap completely and
> > replacing it with more efficient algorithms (in terms of memory
> > utilization and cpu cost for the fast paths).
>
> There's a lot about this that I like.
>
> I think your use of anonymous vm_pgoff is brilliant. I spent the
> afternoon vacillating between believing it does the (mremap move)
> job so smoothly we don't notice, and believing it's all a con trick
> and does nothing at all. Currently I believe it does the job!
>
> And very satisfying how anon and obj converge in page_referenced
> and try_to_unmap: although they converge in my version too, I still
> have to use find_vma on anons at the inner level, yours much nicer.
>
> I think the winning answer will actually lie between yours and mine:
> the anon vmas linked into one tree belonging to the whole fork group.
> Some of the messier aspects of your design come from dealing with the
> vmas separately, with merging issues currently unresolved. A lot of
> that goes away if you lump all the anon vmas of the same fork group
> into the same list. Yes, of course the _list_ would be slower to
> search; but once we have the right tree structure for the obj vmas,
> I believe it should apply just as effectively to the anon vmas.
the problem is that they will still not be mergeable if we obey the
vm_pgoff. We could merge some more though. The other main issue is the
search in this single object for all mms, which again has the downside of
searching all mms. I keep track of exactly which mm to search instead;
this is more fine-grained (after all I was competing with the rmap
pte_chains, which are overkill fine-grained).
But I certainly agree we could mix the two things and have 1 anon_vma
object per-mm instead of per-vma, by losing some granularity in the
tracking and making the search more expensive; but then we'd need a
prio_tree there too, and that doesn't come for free either, so I'd rather
spend the 12 bytes for the fine-grained tracking: the prio_tree can't get
right which mm to scan and which not for every page, the current
anon_vma can.
> I am disappointed that you're persisting with that "as" union: many
> source files remain to be converted to page->as.mapping, and I don't
> think it's worth it at all. mm/objrmap.c is the _only_ file needing
> a cast there for the anon case, please don't demand edits of so many
> filesystems and others, not at this stage anyway. You've been very
Ok, I don't think it's the right way to go, but you have quite convinced me on
this for the short term (now that it swaps heavily just fine and I'm not
scared anymore); the patch will become quite a lot smaller.
> A couple of filesystems history has left in my standard .config but
> apparently not in yours. Some over-enthusiastic BUG_ONs which will
yep, I didn't compile those fs.
> trigger when shmem_writepage swizzles a tmpfs page to swap at the
> wrong moment (I saw the filemap.c in practice; I think truncate.c
> ones could hit too; didn't search very hard for others).
>
> And impressive though your saving on page tables was ;)
> PageTables: 0 kB
> a patch to correct that is below. Yes, I've left the inc out of pte_offset_kernel:
Great spotting; obviously I don't truncate on shmfs often ;), nor do I
check the pagetable size often either. This is a greatly appreciated set
of fixes.
> should it have been there anyway? and it's not clear to me whether ppc
Dunno.
> and ppc64 can manage an early per-cpu increment. You'll find you've
> broken ppc and ppc64, which have grown to use the pgtable rmap stuff
> for themselves, you'll probably want to grab some of my arch patches.
Oh well, this is the next blocker now... Where can I find your arch
patches? PPC folks, comments?
BTW, the only scary thing that happened so far was the crash during
heavy swapping. I'm now running with this fix applied and I couldn't
reproduce it anymore (it usually always happens in 6 hours or less, so it's
not 100% certain it's bulletproof yet). But I'm quite optimistic I got
rid of one bad race condition now, since everything made sense, both
the oops and the bug. You can see my last email to l-k for details.
So now I will merge your fixes and I will remove the union (so the
breakage is contained).
Thank you very much for all your help and great feedback!
--- ./fs/exec.c.~1~ 2004-03-17 22:37:53.000000000 +0100
+++ ./fs/exec.c 2004-03-18 23:21:46.572083608 +0100
@@ -324,9 +324,8 @@ void put_dirty_page(struct task_struct *
}
lru_cache_add_active(page);
flush_dcache_page(page);
- SetPageAnon(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, prot))));
- page_add_rmap(page, vma, address);
+ page_add_rmap(page, vma, address, 1);
pte_unmap(pte);
tsk->mm->rss++;
spin_unlock(&tsk->mm->page_table_lock);
--- ./include/linux/objrmap.h.~1~ 2004-03-17 22:37:53.000000000 +0100
+++ ./include/linux/objrmap.h 2004-03-18 23:20:44.169570232 +0100
@@ -53,7 +53,7 @@ extern void FASTCALL(anon_vma_link(struc
extern void FASTCALL(__anon_vma_link(struct vm_area_struct * vma));
/* objrmap tracking functions */
-void FASTCALL(page_add_rmap(struct page *, struct vm_area_struct *, unsigned long));
+void FASTCALL(page_add_rmap(struct page *, struct vm_area_struct *, unsigned long, int));
void FASTCALL(page_remove_rmap(struct page *));
/*
--- ./mm/objrmap.c.~1~ 2004-03-17 22:37:53.000000000 +0100
+++ ./mm/objrmap.c 2004-03-18 23:34:21.986243200 +0100
@@ -23,6 +23,8 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/objrmap.h>
+#include <linux/init.h>
+#include <asm/tlbflush.h>
kmem_cache_t * anon_vma_cachep;
@@ -248,13 +250,32 @@ static inline void anon_vma_page_link(st
* Add a new pte reverse mapping to a page.
*/
void fastcall page_add_rmap(struct page *page, struct vm_area_struct * vma,
- unsigned long address)
+ unsigned long address, int anon)
{
+ int last_anon;
+
if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
return;
+ BUG_ON(vma->vm_flags & VM_RESERVED);
page_map_lock(page);
+ /*
+ * Setting and clearing PG_anon must always happen inside
+ * page_map_lock to avoid races between mapping and
+ * unmapping on different processes of the same
+ * shared cow swapcache page. And while we take the
+ * page_map_lock PG_anon cannot change from under us.
+ * Actually PG_anon cannot change under fork either
+ * since fork holds a reference on the page so it cannot
+ * be unmapped under fork and in turn copy_page_range is
+ * allowed to read PG_anon outside the page_map_lock.
+ */
+ last_anon = PageAnon(page);
+ if (anon && !last_anon)
+ SetPageAnon(page);
+ BUG_ON(!anon && last_anon);
+
if (!page->mapcount++)
inc_page_state(nr_mapped);
@@ -266,16 +287,17 @@ void fastcall page_add_rmap(struct page
* We can find the mappings by walking the object
* vma chain for that object.
*/
- BUG_ON(!page->as.mapping);
BUG_ON(PageSwapCache(page));
+ BUG_ON(!page->as.mapping);
}
page_map_unlock(page);
}
/* this needs the page->flags PG_map_lock held */
-static void inline anon_vma_page_unlink(struct page * page)
+static inline void anon_vma_page_unlink(struct page * page)
{
+ BUG_ON(!page->as.mapping);
/*
* Cleanup if this anon page is gone
* as far as the vm is concerned.
@@ -304,6 +326,8 @@ void fastcall page_remove_rmap(struct pa
if (!page_mapped(page))
goto out_unlock;
+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
if (!--page->mapcount)
dec_page_state(nr_mapped);
@@ -387,8 +411,7 @@ try_to_unmap_one(struct vm_area_struct *
BUG_ON(!page->mapcount);
mm->rss--;
- page->mapcount--;
- if (PageAnon(page))
+ if (!--page->mapcount && PageAnon(page))
anon_vma_page_unlink(page);
page_cache_release(page);
--- ./mm/fremap.c.~1~ 2004-03-17 22:37:53.000000000 +0100
+++ ./mm/fremap.c 2004-03-18 23:22:06.747016552 +0100
@@ -77,7 +77,7 @@ int install_page(struct mm_struct *mm, s
mm->rss++;
flush_icache_page(vma, page);
set_pte(pte, mk_pte(page, prot));
- page_add_rmap(page, vma, addr);
+ page_add_rmap(page, vma, addr, 0);
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
--- ./mm/memory.c.~1~ 2004-03-18 00:55:19.000000000 +0100
+++ ./mm/memory.c 2004-03-18 23:25:29.042263000 +0100
@@ -324,7 +324,7 @@ skip_copy_pte_range:
*/
BUG_ON(!page_mapped(page));
BUG_ON(!page->as.mapping);
- page_add_rmap(page, vma, address);
+ page_add_rmap(page, vma, address, PageAnon(page));
} else {
BUG_ON(page_mapped(page));
BUG_ON(page->as.mapping);
@@ -1056,8 +1056,7 @@ static int do_wp_page(struct mm_struct *
++mm->rss;
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
- SetPageAnon(new_page);
- page_add_rmap(new_page, vma, address);
+ page_add_rmap(new_page, vma, address, 1);
lru_cache_add_active(new_page);
/* Free the old page.. */
@@ -1291,8 +1290,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte(page_table, pte);
- SetPageAnon(page);
- page_add_rmap(page, vma, address);
+ page_add_rmap(page, vma, address, 1);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
@@ -1314,7 +1312,7 @@ do_anonymous_page(struct mm_struct *mm,
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
- int ret;
+ int ret, anon = 0;
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
@@ -1348,7 +1346,7 @@ do_anonymous_page(struct mm_struct *mm,
vma);
lru_cache_add_active(page);
mark_page_accessed(page);
- SetPageAnon(page);
+ anon = 1;
}
set_pte(page_table, entry);
@@ -1360,7 +1358,7 @@ do_anonymous_page(struct mm_struct *mm,
ret = VM_FAULT_MINOR;
/* ignores ZERO_PAGE */
- page_add_rmap(page, vma, addr);
+ page_add_rmap(page, vma, addr, anon);
return ret;
}
@@ -1384,7 +1382,7 @@ do_no_page(struct mm_struct *mm, struct
struct page * new_page;
struct address_space *mapping = NULL;
pte_t entry;
- int sequence = 0, reserved, clear_anon_page;
+ int sequence = 0, reserved, anon;
int ret = VM_FAULT_MINOR;
if (!vma->vm_ops || !vma->vm_ops->nopage)
@@ -1435,7 +1433,7 @@ retry:
/*
* Should we do an early C-O-W break?
*/
- clear_anon_page = 0;
+ anon = 0;
if (write_access && !(vma->vm_flags & VM_SHARED)) {
struct page * page;
if (unlikely(anon_vma_prepare(vma)))
@@ -1446,9 +1444,8 @@ retry:
copy_user_highpage(page, new_page, address);
page_cache_release(new_page);
lru_cache_add_active(page);
- SetPageAnon(page);
- clear_anon_page = 1;
new_page = page;
+ anon = 1;
}
spin_lock(&mm->page_table_lock);
@@ -1461,8 +1458,6 @@ retry:
(unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
sequence = atomic_read(&mapping->truncate_count);
spin_unlock(&mm->page_table_lock);
- if (unlikely(clear_anon_page))
- ClearPageAnon(new_page);
page_cache_release(new_page);
goto retry;
}
@@ -1488,7 +1483,7 @@ retry:
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte(page_table, entry);
if (likely(!reserved))
- page_add_rmap(new_page, vma, address);
+ page_add_rmap(new_page, vma, address, anon);
pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
--- ./mm/swapfile.c.~1~ 2004-03-17 22:37:53.000000000 +0100
+++ ./mm/swapfile.c 2004-03-18 23:25:42.224259032 +0100
@@ -391,8 +391,7 @@ unuse_pte(struct vm_area_struct *vma, un
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
BUG_ON(!vma->anon_vma);
- SetPageAnon(page);
- page_add_rmap(page, vma, address);
+ page_add_rmap(page, vma, address, 1);
swap_free(entry);
}
On Thu, Mar 18, 2004 at 10:37:55PM +0000, Hugh Dickins wrote:
> On Thu, 18 Mar 2004, Andrea Arcangeli wrote:
> >
> > The fix is simple: always set and clear PG_anon under the page_map_lock.
> > This avoids the race, since all ClearPageAnon already runs under the
> > page_map_lock. I will implement and test it in a few hours.
> >
> > ... I find this more robust.
>
> Absolutely, that's what I did too. My old page_add_rmap had an anon flag,
> but the new patches have page_add_anon_rmap and page_add_obj_rmap.
I remembered you had the anon param too, though I didn't realize it was
for this reason :/
The reason I go with the parameter is that sometimes I have a single
call at the end of the function with the same new_page, but sometimes
it has to be considered anonymous and sometimes not. So I couldn't split
the interface into _anon/_file variants with some #define and remove the
0/1 numbering ugliness.
On Fri, Mar 19, 2004 at 12:06:28AM +0100, Andrea Arcangeli wrote:
> Ok, I don't think it's the right way to go, but you have quite convinced me on
> this for the short term (now that it swaps heavily just fine and I'm not
sorry, I changed my mind again: using the cast is too ugly, I don't like
it, so I'm going to apply your fs fixes instead ;)
Diego Calleja García wrote:
> On Thu, 18 Mar 2004 11:49:52 -0500 (EST), Rik van Riel <[email protected]> wrote:
>
>
>>I suspect the security paranoid will like this patch too,
>>because it allows gnupg to mlock the memory it wants to
>>have locked.
>
>
> I think it's good for cd-burning too. Currently most of the distros set
> the suid bit on cdrecord (which implies some security bugs). You can
> work around that by changing the device node's permissions and killing the suid bit:
> brw-rw---- 1 root burning 22, 0 2003-05-23 16:41 /dev/cd-rw
>
> but cdrecord will still cry:
> cdrecord: Operation not permitted. WARNING: Cannot do mlockall(2).
> cdrecord: WARNING: This causes a high risk for buffer underruns.
>
> With that patch desktop users will be able to burn cds without falling into
> buffer underruns and without using the suid hack, I guess? Nice work :)
Have a bit of caution there: cdrecord sets itself realtime priority,
locks pages in memory, and ensures that the process is likely to work
even under load. I don't think addressing just a part of the problem
will result in reliability under load. You would have to look at
capabilities to allow these things to be done; under load it may not
keep up, depending on what's going on. Good to get a start, but don't
assume all the issues are addressed.
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
Andrea Arcangeli writes:
> On Thu, Mar 18, 2004 at 08:41:45PM +0000, Hugh Dickins wrote:
> > and ppc64 can manage an early per-cpu increment. You'll find you've
> > broken ppc and ppc64, which have grown to use the pgtable rmap stuff
> > for themselves, you'll probably want to grab some of my arch patches.
>
> Oh well, this is the next blocker now... Where can I find your arch
> patches? PPC folks, comments?
We need page->mapping and page->index set for pages used for pte and
pmd pages, so that we can get from a pte_t * back to the mm_struct *
and the address within that mm. It's just a matter of setting
page->mapping and page->index in the pte_alloc and pmd_alloc
functions.
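Concretely, it amounts to something like this in the pte page allocation
path (a sketch only, not the actual ppc/ppc64 code):

    /*
     * Sketch only: remember which mm and which address range a pte
     * page serves, so the arch code can get from a pte_t * back to
     * the mm_struct * and the user address within that mm.
     */
    page->mapping = (struct address_space *) mm;
    page->index = address & PMD_MASK;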
I just looked at Hugh's anobjrmap 6/6 patch and it looks to me that it
does that correctly.
Regards,
Paul.
On Thu, 18 Mar 2004 19:34:29 -0500, Bill Davidsen <[email protected]> wrote:
> Have a bit of caution there: cdrecord sets itself realtime priority,
> locks pages in memory, and ensures that the process is likely to work
> even under load. I don't think addressing just a part of the problem
> will result in reliability under load. You would have to look at
> capabilities to allow these things to be done; under load it may not
> keep up, depending on what's going on. Good to get a start, but don't
> assume all the issues are addressed.
Yes, the messages that follow are:
cdrecord: Operation not permitted. WARNING: Cannot set RR-scheduler
cdrecord: Permission denied. WARNING: Cannot set priority using setpriority().
cdrecord: WARNING: This causes a high risk for buffer underruns.
But since 2.6 uses DMA for recording, the cpu usage is really low... I guess
people will still use suid for that :)
On Fri, 19 Mar 2004, Andrea Arcangeli wrote:
> the problem is that they will still not be mergeable if we obey the
> vm_pgoff. We could merge some more though. The other main issue is the
> search in this single object for all mms, which again has the downside of
> searching all mms. I keep track of exactly which mm to search instead;
If you read Hugh's latest code, you'll find that for all the
non-shared pages, his code only looks at 1 mm too ...
> But I certainly agree we could mix the two things and have 1 anon_vma
> object per-mm instead of per-vma, by losing some granularity in the
> tracking and making the search more expensive; but then we'd need a
> prio_tree there too, and that doesn't come for free either, so I'd rather
> spend the 12 bytes for the fine-grained tracking: the prio_tree can't get
> right which mm to scan and which not for every page, the current
> anon_vma can.
Note that this disadvantage only holds true for pages that
are shared between multiple processes, but not by all of the
processes in a fork group. The non-shared pages are found
immediately with Hugh's code, so I suspect this shouldn't
be a big disadvantage any more.
Also, we'll need the prio_tree anyway for the file-backed stuff.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Sat, Mar 20, 2004 at 08:35:52AM -0500, Rik van Riel wrote:
> On Fri, 19 Mar 2004, Andrea Arcangeli wrote:
>
> > the problem is that they will still not be mergeable if we obey the
> > vm_pgoff. We could merge some more though. The other main issue is the
> > search in this single object for all mms, which again has the downside of
> > searching all mms. I keep track of exactly which mm to search instead;
>
> If you read Hugh's latest code, you'll find that for all the
> non-shared pages, his code only looks at 1 mm too ...
I agree this optimization will cover the common case; that's a nice
improvement compared to the old patches.
Still, this doesn't solve the mremap of a shared cow region (not a direct
one): how is that solved in Hugh's current patch? Is he implementing
Linus's suggested unsharing? I don't see it implemented, so I wonder how
he can swap those regions. Is that like an mlock right now?
> > But I certainly agree we could mix the two things and have 1 anon_vma
> > object per-mm instead of per-vma, by losing some granularity in the
> > tracking and making the search more expensive; but then we'd need a
> > prio_tree there too, and that doesn't come for free either, so I'd rather
> > spend the 12 bytes for the fine-grained tracking: the prio_tree can't get
> > right which mm to scan and which not for every page, the current
> > anon_vma can.
>
> Note that this disadvantage only holds true for pages that
> are shared between multiple processes, but not by all of the
> processes in a fork group. The non-shared pages are found
> immediately with Hugh's code, so I suspect this shouldn't
> be a big disadvantage any more.
>
> Also, we'll need the prio_tree anyway for the file-backed stuff.
I don't need a tree to swap efficiently with anon_vma (not even
the rbtree lookup with find_vma). I believe in practice anon_vma is the
most efficient design for swapping anonymous memory. If Martin can show
anon_vma slower than anonmm (in the non-swap case of the SDET bench) then
I can change my mind about it; at the moment I believe it'll not be
slower.
Also, you're not going to share the same prio_tree code between the
anonmm and the inode: there's no meaningful vm_pgoff in the anonmm design.
As for the parts Hugh claims sharable, I much prefer my
implementation; he also still leaves some PG_anon handling outside the
PG_map_lock. I'm quite comfortable with my implementation being rock
solid, and I find it simpler too (i.e. absolutely nothing to do in mremap).
I also ran lots of regression tests already; the only pending bug is
Jens's device driver bug in ->nopage, which I'm asking about on l-k for
confirmation.
On Thu, Mar 18, 2004 at 11:49:52AM -0500, Rik van Riel wrote:
> It's in the RHEL3 kernel, a patch named linux-2.4.21-mlock.patch.
>
> Basically it allows any normal process to have up to its
> current->rlim[RLIMIT_MEMLOCK].rlim_cur memory locked.
> Root can configure, in /etc/security/limits.conf, how much
> memory the processes of each user are allowed to mlock.
>
> Processes with CAP_IPC_LOCK set can mlock as much as they
> want, unrestricted by the limit.
>
> Last year at the kernel summit, Linus seemed pretty happy
> with this approach. I think Wim Coekaerts at Oracle has
> ported the patch to 2.6 already...
>
> I suspect the security paranoid will like this patch too,
> because it allows gnupg to mlock the memory it wants to
> have locked.
I agree.
> Now it just needs to be submitted to Andrew and Linus ;)
could somebody submit it to me too? ;) otherwise I'll have to search for
some big rpm.
What I was thinking about was more basic, without checking the rlimits;
certainly I agree using the rlimits is a very nice bonus (though the
code is a bit more complicated, and it won't be a one liner in
remap_file_pages as I was hoping for ;).
On Saturday 20 March 2004 17:31, Andrea Arcangeli wrote:
Hi Andrea,
> > It's in the RHEL3 kernel, a patch named linux-2.4.21-mlock.patch.
> could somebody submit it to me too? ;) otherwise I'll have to search for
> some big rpm.
there you go.
ciao, Marc