2003-07-01 01:25:40

by Mel Gorman

Subject: What to expect with the 2.6 VM

Hi,

I'm writing a small paper on the 2.6 VM for a conference. It is based on
2.5.73 and is basically an introduction to some of the new features that are
up and coming. As it's only an introduction, it's a little light on heavy
detail. Below are the parts of the paper which I thought would be of general
interest (I left out stuff like the abstract, introduction, conclusion and
bibliography). Any comments, feedback or pointers to big features I've
missed are appreciated.




Linux Virtual Memory Management
Some Things to Expect In 2.6

Describing Memory
=================

A lot of effort has gone into making the 2.6 VM scalable to enterprise-level
architectures without compromising performance on low- to mid-range
computers. Consequently, a number of the changes are aimed at reducing
spinlock contention and keeping operations local to the CPU where possible.

Physical nodes for NUMA architectures are described in a very similar manner
to 2.4, with just two minor differences. The first is that the starting
location of the node is now identified by a Page Frame Number (PFN), an
offset in pages within the virtual mem_map array, instead of by a physical
address. This solves a problem on architectures using addressing extensions
such as Intel's Physical Address Extension (PAE), where physical memory
cannot be addressed by a 32-bit number. The second is that each node now has
its own wait queue for the page replacement daemon kswapd, which is
discussed further in a later section.
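
To make this concrete, a simplified sketch of the relevant fields of the
node descriptor follows. This is illustrative only, with most members
omitted; field names are as they appear in the 2.5/2.6 headers.

    typedef struct pglist_data {
        struct zone node_zones[MAX_NR_ZONES];
        unsigned long node_start_pfn;     /* was node_start_paddr, a physical
                                             address, in 2.4 */
        wait_queue_head_t kswapd_wait;    /* per-node wait queue for kswapd */
        /* ... */
    } pg_data_t;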

The principal changes to how zones are represented are related to the page
replacement policy. The LRU lists for pages and the associated locks are now
per-zone where they were global in 2.4. As LRU lists could grow quite large,
especially if IO-intensive applications were running, the old LRU list lock
could be heavily contended. This resulted in the LRU list being locked while
kswapd performed a linear search for pages in the zone of interest. As the
lists are now per-zone, and an instance of kswapd runs for each node, the
contention is greatly reduced.

The second noticeable change is aimed at improving CPU cache usage, for
which Linux now uses a few simple tricks. The first uses padding in the
structures to ensure that zone->lock and zone->lru_lock use separate lines
[Sea] in the CPU cache. To further alleviate lock contention, a list of
pages, called a pageset, is maintained for each CPU in the system and for
each zone in the per_cpu_pageset array, which is discussed later in
Section 7.
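
The pageset structure looks roughly like the following. This is a
simplified sketch of what appears in the 2.5/2.6 headers rather than the
exact definition.

    struct per_cpu_pages {
        int count;                /* number of pages on the list */
        int low;                  /* refill the list below this watermark */
        int high;                 /* drain the list above this watermark */
        int batch;                /* chunk size for buddy add/remove */
        struct list_head list;    /* the pages themselves */
    };

    struct per_cpu_pageset {
        struct per_cpu_pages pcp[2];    /* 0: hot pages, 1: cold pages */
    } ____cacheline_aligned_in_smp;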

Pages are represented in almost the same fashion as in 2.4 except for one
innocent-looking difference. In 2.4, process page tables were traversed to
locate the struct page that represented the page mapped by the process.
However, there was no mechanism that would give a list of Page Table Entries
(PTEs) that mapped a particular page. This is now addressed by what is
referred to as Reverse Mapping [1]. PTEs which map a page can now be found
by traversing a chain of PTEs which is headed by a list stored in the struct
page. This is discussed later in Section 3.

Reverse Page Table Mapping
==========================

One of the most important introductions in the 2.6 kernel is Reverse
Mapping, commonly referred to as rmap. In 2.4, there is a one-way mapping
from a PTE to a struct page which is sufficient for process addressing.
However, this presents a problem for the page replacement algorithm when
shared pages are used, as there is no way in 2.4 to acquire a list of PTEs
which map a particular struct page without traversing the page tables of
every running process in the system.

In 2.6, a new field called union pte is introduced to the struct page. When
a page is shared between two or more processes, a PTE chain is created and
linked to this field. PTE chain elements are managed by the slab allocator
and each node is able to hold up to NRPTE PTEs, where NRPTE is related to
the L1 cache size of the target architecture. Once NRPTE PTEs have been
mapped, a new node is added to the PTE chain.
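
A PTE chain element looks roughly like this. It is a simplified sketch
based on mm/rmap.c in the 2.5 series and should be treated as illustrative
rather than authoritative.

    struct pte_chain {
        unsigned long next_and_idx;   /* pointer to the next chain node, with
                                         the low bits reused as an index */
        pte_addr_t ptes[NRPTE];       /* addresses of PTEs mapping the page */
    } ____cacheline_aligned;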

Using these chains, it is possible for all PTEs mapping a struct page to be
located and the page swapped out to backing storage. This means that it is
no longer necessary to swap out whole processes, as was the case with 2.4,
when memory is tight and most pages in the LRU lists are shared. The 2.6 VM
is able to make much better page replacement decisions as the LRU list
ordering is obeyed.

PTEs in High Memory
===================

In 2.4, Page Table Entries (PTEs) must be allocated from ZONE_NORMAL as the
kernel needs to address them directly for page table traversal. In a system
with many tasks or with large mapped memory regions, this can place
significant pressure on ZONE_NORMAL, so 2.6 has the option of allocating
PTEs from high memory.

Allocating from high memory is a compile-time option as there is a
significant penalty associated with high memory PTEs. As the PTEs can no
longer be addressed directly by the kernel, kmap() must be used to map the
high memory page into low memory and it must later be unmapped with
kunmap(), which places a limit on the number of PTEs that can be addressed
by a CPU at a time.

To reflect the move of PTEs to high memory, the API related to PTEs has
changed. The allocation function pte_alloc() has been replaced by two
functions: pte_alloc_kernel() is for kernel PTEs, which are always allocated
from ZONE_NORMAL, and pte_alloc_map() is for userspace allocations.

As the PTEs need to be mapped before addressing, page table walking has
changed slightly. In 2.4, the PTE could be referenced directly, but in 2.6,
pte_offset_map() must be called before it may be used and pte_unmap() must
be called as quickly as possible afterwards to unmap it again.
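
A minimal sketch of the resulting walk is shown below, using the
three-level page table layout of this era with error handling largely
omitted.

    static void examine_pte(struct mm_struct *mm, unsigned long address)
    {
        pgd_t *pgd = pgd_offset(mm, address);
        pmd_t *pmd;
        pte_t *pte;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return;
        pmd = pmd_offset(pgd, address);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            return;

        pte = pte_offset_map(pmd, address);  /* may kmap() a highmem page */
        if (pte_present(*pte)) {
            /* examine or modify the entry here */
        }
        pte_unmap(pte);   /* release the temporary mapping immediately */
    }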

The overhead of mapping the PTEs from high memory should not be ignored.
Generally, only one PTE may be mapped at a time per CPU, with the rare
exception of when pte_offset_map_nested() is used. This can be a heavy
penalty when all PTEs need to be examined, such as when a memory region is
being unmapped. It has been proposed to have a Userspace Kernel Virtual Area
(UKVA), which is a region in the kernel address space private to a process.
This UKVA could be used to map high memory PTEs into low memory, but it is
currently only a proposal and unlikely to be merged by 2.6.

At the time of writing, a patch has been submitted which implements PMDs in
high memory. The implementation is essentially the same as for PTEs in high
memory.

Huge TLB Filesystem
===================

Most modern architectures support more than one page size. For example, the
IA-32 architecture supports 4KiB and 4MiB pages, but Linux previously only
used large pages for mapping the kernel image itself. As TLB slots are a
scarce resource, it is desirable to be able to take advantage of large
pages, especially on machines with large amounts of physical memory.

In 2.6, Linux allows processes to use large pages, referred to as huge
pages. The number of available huge pages is configured by the system
administrator via the /proc/sys/vm/nr_hugepages proc interface. As the
success of the allocation depends on the availability of physically
contiguous memory, the allocation should be made during system startup.

The root of the implementation is a Huge TLB Filesystem (hugetlbfs) which
is a pseudo-filesystem implemented in fs/hugetlbfs/inode.c and based on
ramfs. The basic idea is that any file that exists in the filesystem is
backed by huge pages. This filesystem is initialised and registered as an
internal filesystem at system start-up.

There are two ways that huge pages may be accessed by a process. The first
is to use shmget() to set up a shared region backed by huge pages and the
second is to call mmap() on a file opened in the huge page filesystem.

When a shared memory region should be backed by huge pages, the process
should call shmget() and pass SHM_HUGETLB as one of the flags. This
results in a file being created in the root of the internal filesystem.
The name of the file is determined by an atomic counter called
hugetlbfs_counter which is incremented every time a shared region is
set up.
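
For example, a process might attach a huge page backed shared segment along
these lines. The 16MiB size is an arbitrary choice for illustration, and
SHM_HUGETLB may need to be defined by hand with older library headers.

    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000     /* value from the 2.6 kernel headers */
    #endif

    static void *attach_huge_shm(void)
    {
        int shmid = shmget(IPC_PRIVATE, 16UL * 1024 * 1024,
                           SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);

        return shmat(shmid, NULL, 0);   /* region now backed by huge pages */
    }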

To create a file backed by huge pages, a filesystem of type hugetlbfs must
first be mounted by the system administrator. Once the filesystem is
mounted, files can be created as normal with the system call open(). When
mmap() is called on the open file, the hugetlbfs registered mmap()
function creates the appropriate VMA for the process.
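
A sketch of this second route follows. The /mnt/huge mount point, the file
name and the mapping size are arbitrary choices for illustration.

    #include <fcntl.h>
    #include <sys/mman.h>

    /* Assumes the administrator has already done:
     *     mount -t hugetlbfs none /mnt/huge
     */
    static void *map_huge(void)
    {
        int fd = open("/mnt/huge/mymap", O_CREAT | O_RDWR, 0755);

        return mmap(NULL, 16UL * 1024 * 1024, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);   /* hugetlbfs supplies the VMA */
    }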

Huge TLB pages have their own functions for the management of page tables,
address space operations and filesystem operations. The names of the page
table management functions can all be seen in <linux/hugetlb.h> and they are
named very similarly to their ``normal'' page equivalents.

Non-Linear Populating of Virtual Areas
======================================

In 2.4, a VMA backed by a file would be populated in a linear fashion.
This can be optionally changed in 2.6 with the introduction of the
MAP_POPULATE flag to mmap() and the new system call remap_file_pages().
This system call allows arbitrary pages in an existing VMA to be remapped
to an arbitrary location on the backing file. This is mainly of interest
to database servers which previously simulated this behavior by the
creation of many VMAs.

On page-out, the non-linear address for the file is encoded within the PTE
so that it can be installed again correctly on page fault. How it is
encoded is architecture specific so two macros are defined called
pgoff_to_pte() and pte_to_pgoff() for the task.
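
In userspace, usage looks roughly like the following sketch. The window
size and page offsets are made up for illustration, fd is assumed to be an
already-open file descriptor, and a glibc of this vintage may not provide a
wrapper for remap_file_pages(), in which case syscall() is needed.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    static void *map_window(int fd)
    {
        size_t pgsz = getpagesize();

        /* map a 4-page window of the file and fault it in up front */
        char *win = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, fd, 0);

        /* rewire the first page of the window to file page 3; prot is 0 */
        remap_file_pages(win, pgsz, 0, 3, 0);
        return win;
    }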

Physical Page Allocation
========================

Physical page allocation is still based on the buddy algorithm, but it has
been extended to resemble a lazy buddy algorithm which delays the splitting
and coalescing of buddies until it is necessary [BL89]. To implement this,
kernel 2.6 has per-CPU page lists of order-0 pages, the most common type of
allocation and free. These pagesets contain two lists, for hot and cold
pages, where hot pages have been recently used and can still be expected to
be present in the CPU cache. For an allocation, the pageset for the running
CPU is checked first and, if pages are available, one is allocated from it.
To determine when the pageset should be emptied or filled, two watermarks
are in place. When the low watermark is reached, a batch of pages is
allocated from the buddy lists and placed on the per-CPU list. When the high
watermark is reached, a batch of pages is freed back to the buddy allocator
in one go. Higher-order allocations are treated essentially the same as they
were in 2.4.
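
In pseudo-C, the order-0 fast path behaves roughly as follows. This is an
illustrative sketch, not the literal 2.5.73 code.

    static struct page *alloc_order0(struct zone *zone, int cold)
    {
        struct per_cpu_pages *pcp;
        struct page *page = NULL;

        pcp = &zone->pageset[smp_processor_id()].pcp[cold];
        if (pcp->count <= pcp->low)
            /* refill from the buddy lists in one batched operation */
            pcp->count += rmqueue_bulk(zone, 0, pcp->batch, &pcp->list);
        if (pcp->count) {
            page = list_entry(pcp->list.next, struct page, lru);
            list_del(&page->lru);
            pcp->count--;
        }
        return page;    /* NULL means fall back to the normal buddy path */
    }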

The implication of this change is straightforward: the number of times the
spinlock protecting the buddy lists must be acquired is drastically reduced.
Higher-order allocations are relatively rare in Linux, so the optimisation
is for the common case. This change will be noticeable on machines with a
large number of CPUs but will make little difference on single-CPU machines.

The second major change is largely cosmetic, with function declarations,
flags and defines being moved to new header files. One important part of the
shuffling is that the distinction between NUMA and UMA architectures no
longer exists for page allocations. In 2.4, there was an explicit function
which implemented a node-local allocation policy. In 2.6, architectures are
expected to provide a CPU-to-NUMA-node mapping. They then export a macro
called NODE_DATA() which takes the node ID (NID), returned by
numa_node_id(), as a parameter. On UMA architectures, this will always
return the static node descriptor contig_page_data, but on NUMA
architectures, the node ``closest'' to the running CPU will be used. This is
still a node-local allocation policy but without the two separate allocation
schemes. This change is largely a ``cleanliness'' issue, reducing the amount
of special case code that deals with UMA and NUMA architectures separately.

The last major change is the introduction of three new GFP flags called
__GFP_NOFAIL, __GFP_REPEAT and __GFP_NORETRY. The three flags determine how
hard the VM works to satisfy a given allocation and were introduced to
consolidate retry behaviour that had previously been implemented in many
different parts of the kernel. The NOFAIL flag requires the VM to constantly
retry an allocation until it succeeds. The REPEAT flag is intended to retry
a number of times before failing, although currently it is implemented like
NOFAIL. The last, NORETRY, means the allocation will fail and return
immediately if the request cannot be satisfied straight away.
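
As a hypothetical example of how a caller might use the new flags:

    /* Opportunistic allocation: fail fast under memory pressure and let the
     * caller fall back, instead of having the VM retry indefinitely. */
    static struct page *opportunistic_page(void)
    {
        struct page *page = alloc_pages(GFP_KERNEL | __GFP_NORETRY, 0);

        if (!page)
            return NULL;   /* e.g. shrink the request or use a reserve */
        return page;
    }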

Delayed Coalescing
==================

2.6 extends the buddy algorithm to resemble a lazy buddy algorithm [BL89]
which delays the splitting and coalescing of buddies until it is necessary.
The delay is implemented only for order-0 allocations with the per-CPU
pagesets. Higher-order allocations, which are rare anyway, will require the
interrupt-safe spinlock to be held and there will be no delay in the splits
or coalescing. With order-0 allocations, splits will be delayed until the
low watermark is reached in the per-CPU set and coalescing will be delayed
until the high watermark is reached.

It is interesting to note that there is a possibility that higher-order
allocations will fail if many of the pagesets are just below the high
watermark. There is also the problem that buddies could exist in different
pagesets, leading to fragmentation.

Emergency Memory Pools
======================

In 2.4, the high memory manager was the only subsystem that maintained
emergency pools of pages. In 2.6, memory pools are implemented as a
generic concept when a minimum amount of ``stuff'' needs to be reserved
for when memory is tight. ``Stuff'' in this case can be any type of object
such as pages in the case of the high memory manager or, more frequently,
some object managed by the slab allocator.

Pools are created with mempool_create() and a number of parameters are
provided. They are the minimum number of objects that should be reserved
(min_nr), an allocator function for the object type (alloc_fn()), a free
function (free_fn()) and optional private data that is passed to the
allocate and free functions.

In most cases, the allocate and free functions used are mempool_alloc_slab()
and mempool_free_slab(), for reserving objects managed by the slab
allocator. In this case, the private data passed is a pointer to the slab
cache descriptor.
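
A typical user therefore looks something like the following. The reserve
size and the cache itself are hypothetical names; the calls are the generic
mempool API.

    #define MY_MIN_RESERVE 16                 /* hypothetical reserve size */

    static kmem_cache_t *my_cachep;           /* from kmem_cache_create() */
    static mempool_t *my_pool;

    static int my_pool_init(void)
    {
        my_pool = mempool_create(MY_MIN_RESERVE, mempool_alloc_slab,
                                 mempool_free_slab, my_cachep);
        return my_pool ? 0 : -ENOMEM;
    }

    /* later, on the I/O path, an allocation that may dip into the reserve;
     * mempool_free(obj, my_pool) returns it, topping up the reserve first */
    static void *my_get_object(void)
    {
        return mempool_alloc(my_pool, GFP_NOIO);
    }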

In the case of the high memory manager, two pools of pages are created. One
page pool is for normal use and the second is for use with ISA devices that
must allocate from ZONE_DMA. The memory pools replace the emergency pool
code that exists in 2.4.

Aging Slab Allocator Objects
============================

Many of the changes in the slab allocator for 2.6 are either cosmetic or
related to the reduction of lock contention. However, one significant new
feature is the introduction of the slab shrinker callback.

In 2.4, the function kmem_cache_reap() was called in low memory situations.
It selected a cache, deleted all free slabs from it and freed the objects in
its per-CPU object cache. In 2.6, caches can register a shrinker callback
with set_shrinker(), and kmem_cache_reap() has been dispensed with.

set_shrinker() populates a struct with a pointer to the callback and a
``seeks'' weight which indicates how hard it is to recreate the object.
During page reclaim, each shrinker is called twice. The first call passes 0
as a parameter, which indicates that the callback should return how many
pages it expects it could free if it were called properly. A basic heuristic
is applied to determine if it is worth the cost of using the callback. If it
is, the callback is called a second time with a parameter indicating how
many objects to free.
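
A skeleton of a shrinker as it looks around 2.5.73/2.6.0 follows. The
callback body and object counts are hypothetical; only the set_shrinker()
and remove_shrinker() calls are the real interface.

    static int my_freeable;   /* freeable object count, maintained elsewhere */

    static int my_cache_shrink(int nr_to_scan, unsigned int gfp_mask)
    {
        if (nr_to_scan == 0)
            return my_freeable;   /* first call: just report */

        /* second call: free up to nr_to_scan objects here ... */
        return my_freeable;       /* what is left afterwards */
    }

    static struct shrinker *my_shrinker;

    static void my_cache_init(void)
    {
        my_shrinker = set_shrinker(DEFAULT_SEEKS, my_cache_shrink);
    }
    /* remove_shrinker(my_shrinker) when the cache is destroyed */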

Page Replacement
================

In the 2.4 kernel, a global daemon called kswapd acted as the page
replacement daemon. It was responsible for keeping a reserve of memory
free and only when memory was tight would processes have to free pages on
their own behalf. In 2.6, a kswapd daemon exists for each node in the
system. Each daemon is called kswapdN where N is the node ID. The
presumption is that there will be a one-to-few mapping between CPUs and
nodes and Linux wishes to avoid a penalty of a kswapd daemon trying to
free pages on a remote node.

Manipulating LRU Lists
======================

In 2.4, a single spinlock is acquired when adding or removing pages from the
LRU lists, which made the lock very heavily contended. In 2.6, operations
involving the LRU lists take place via struct pagevec structures. This
allows pages to be added to or removed from the LRU lists in batches of up
to PAGEVEC_SIZE pages.

When removing pages, the zone->lru_lock lock is acquired and the pages
placed on a temporary list. Once the list of pages to remove is assembled,
shrink_list() is called to perform the actual freeing of pages, which can
now perform most of its task without needing the zone->lru_lock spinlock.

When adding the pages back, a new page vector struct is initialised with
pagevec_init(). Pages are added to the vector with pagevec_add() and then
committed to being placed on the LRU list in bulk with pagevec_release().
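
The idiom, roughly, is shown below. The helper and its arguments are
hypothetical; the pagevec calls are the real interface.

    /* Drop references on a batch of pages with one lock acquisition per
     * PAGEVEC_SIZE pages rather than one per page. */
    static void release_batch(struct page **pages, int nr)
    {
        struct pagevec pvec;
        int i;

        pagevec_init(&pvec, 0);          /* 0: treat the pages as cache-hot */
        for (i = 0; i < nr; i++)
            if (!pagevec_add(&pvec, pages[i]))   /* vector now full? */
                __pagevec_release(&pvec);        /* flush the batch */
        pagevec_release(&pvec);          /* flush whatever remains */
    }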

Footnotes

[1] Reverse mapping is commonly referred to as rmap, although the phrase
rmap is sometimes a little overloaded.

--
Mel Gorman
MSc Student, University of Limerick
http://www.csn.ul.ie/~mel


2003-07-01 02:11:40

by Andrea Arcangeli

Subject: Re: What to expect with the 2.6 VM

On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> Reverse Page Table Mapping
> ==========================
>
> One of the most important introductions in the 2.6 kernel is Reverse
> Mapping commonly referred to as rmap. In 2.4, there is a one way mapping
> from a PTE to a struct page which is sufficient for process addressing.
> However, this presents a problem for the page replacement algorithm when
> dealing with shared pages are used as there is no way in 2.4 to acquire a
> list of PTEs which map a particular struct page without traversing the
> page tables for every running process in the system.
>
> In 2.6, a new field is introduced to the struct page called union pte.
> When a page is shared between two or more processes, a PTE chain is
> created and linked to this field. PTE chain elements are managed by the
> slab allocator and each node is able to locate up to NRPTE number of PTEs
> where NRPTE is related to the L1 cache size of the target architecture.
> Once NRPTE PTEs have been mapped, a new node is added to the PTE chain.
>
> Using these chains, it is possible for all PTEs mapping a struct page to
> be located and swapped out to backing storage. This means that it is now
> unnecessary for whole processes to be swapped out when memory is tight and
> most pages in the LRU lists are shared as was the case with 2.4. The 2.6
> VM is able to make much better page replacement decisions as the LRU list
> ordering is obeyed.

you mention only the positive things, and never the fact that it's the most
hurting piece of kernel code in terms of performance and SMP scalability
until you actually have to swapout or pageout.

> Non-Linear Populating of Virtual Areas
> ======================================
>
> In 2.4, a VMA backed by a file would be populated in a linear fashion.
> This can be optionally changed in 2.6 with the introduction of the
> MAP_POPULATE flag to mmap() and the new system call remap_file_pages().
> This system call allows arbitrary pages in an existing VMA to be remapped
> to an arbitrary location on the backing file. This is mainly of interest
> to database servers which previously simulated this behavior by the
> creation of many VMAs.
>
> On page-out, the non-linear address for the file is encoded within the PTE
> so that it can be installed again correctly on page fault. How it is
> encoded is architecture specific so two macros are defined called
> pgoff_to_pte() and pte_to_pgoff() for the task.

and it was used to break truncate; furthermore the API doesn't look good to
me, the vma should have a special VM_NONLINEAR created with a MAP_NONLINEAR
so the vm will skip it entirely, and it should be possible to put multiple
files in the same vma IMHO. If these areas are vmtruncated, the VM_NONLINEAR
will keep vmtruncate out of them and the pages will become anonymous, which
is perfectly fine since they're created with a special MAP_NONLINEAR that
can have different semantics.

Also this feature is mainly useful only in kernels with rmap while using the
VLM, to work around the otherwise overkill cpu and memory overhead generated
by rmap. It is discouraged to use it as a default API unless you fall into
those special corner cases. The object of this API is to bypass the VM, so
you would lose security and the VM advantages. If you want the same features
with nonlinear that you have without nonlinear, then you invalidate the
whole point of nonlinear.

The other very corner cases are emulators; they also could benefit from a
VM bypass like remap_file_pages provides.

Pinning ram from multiple files plus a sysctl or a sudo wrapper would be an
optimal API and it would avoid the VM having to even look at the pagetables
of those areas.

Last but not least, remap_file_pages is nearly useless (again modulo
emulators) w/o bigpages support backing it (not yet the case in 2.5 as far
as I can see, so it's unusable compared to 2.4-aa [but 2.5 -
remap_file_pages is even less usable than 2.5 + remap_file_pages due to
rmap, which wouldn't only run slow but wouldn't run at all]). objrmap fixes
that despite introducing some complex algorithm.

the only significant cost is the tlb flushing and pagetable walking at
32G of working set in cache with a 30-31bit window on the cache.

> the flags are implemented in many different parts of the kernel.
> The
> NOFAIL flag requires teh VM to constantly retry an allocation until it

described this way it sounds like NOFAIL implies a deadlock condition. We
have had this longstanding deadlock-by-design in the callers since the first
Linux I've seen; moving it down to the VM doesn't sound any good, if
anything it would encourage this deadlock-prone usage.

The idea of an allocation that can never fail is broken in the first place
and it should not get propagated; every allocation visible to the VM users
must be allowed to fail. If you can't avoid writing buggy code, then loop in
the caller, not in the VM.

> Delayed Coalescing
> ==================
>
> 2.6 extends the buddy algorithm to resemble a lazy buddy algorithm [BL89]
> which delays the splitting and coalescing of buddies until it is
> necessary. The delay is implemented only for 0 order allocations with the

described this way it sounds like it's not O(1) anymore for order > 0
allocations. Though it would be a nice feature for all the common cases.
And of course it's still O(1) if one assumes the number of orders is limited
(and it's fixed at compile time).

as for the per-zone lists, sure they increase scalability, but they lose
aging information; the worst case will be reproducible on a 1.6G box. At
some point even 2.3 had per-zone lists; I backed them out to avoid losing
information, which is an order of magnitude more interesting than the pure
smp scalability for normal users (i.e. 2-way systems with 1-2G of ram; of
course the scalability numbers that people care about on the 8-way with 32G
of ram will be much better with the per-zone lru).

Just trying not to forget the other side too ;) Not every improvement on
one side comes for free on all other sides.

Andrea

2003-07-01 02:48:00

by Andrew Morton

Subject: Re: What to expect with the 2.6 VM

Andrea Arcangeli <[email protected]> wrote:
>
> On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> > Reverse Page Table Mapping
> > ==========================
> ...
>
> you mention only the positive things, and never the fact that's the most
> hurting piece of kernel code in terms of performance and smp scalability
> until you actually have to swapout or pageout.

It has no SMP scalability cost, and not really much CPU cost (it's less
costly than HZ=1000, for example). Its main problem is space consumption.


> > Non-Linear Populating of Virtual Areas
> > ======================================
> ...
>
> and it was used to break truncate,

Spose so. Do we care though? Unix standards do not specify truncate
behaviour with nonlinear mappings anyway.

Our behaviour right now is "random crap". If there's a reason why we want
consistent semantics then yes, we'll need to do an rmap walk or something
in there. But is there a requirement? What is it?


One thing which clearly _should_ have sane semantics with nonlinear
mappings is mincore(). MAP_NONLINEAR broke that too.


> > the flags are implemented in many different parts of the kernel.
> > The
> > NOFAIL flag requires teh VM to constantly retry an allocation until it
>
> described this way it sounds like NOFAIL imply a deadlock condition.

NOFAIL is what 2.4 has always done, and has the deadlock opportunities
which you mention. The other modes allow the caller to say "don't try
forever".

It's mainly a cleanup - it allowed the removal of lots of "try forever"
loops by consolidating that behaviour in the page allocator. If all
callers are fixed up to not require NOFAIL then we don't need it any more.

> as for the per-zone lists, sure they increase scalability, but it loses
> aging information, the worst case will be reproducible on a 1.6G box,

Actually it improves aging information. Think of a GFP_KERNEL allocation
on an 8G 2.4.x box: an average of 10 or more highmem pages get bogusly
rotated to the wrong end of the LRU for each scanned lowmem page.

That's speculation btw. I don't have any numbers or tests which indicate
that it was a net win in this regard.


2003-07-01 03:08:52

by Andrea Arcangeli

Subject: Re: What to expect with the 2.6 VM

On Mon, Jun 30, 2003 at 08:02:37PM -0700, Andrew Morton wrote:
> NOFAIL is what 2.4 has always done, and has the deadlock opportunities

fair enough ;) but you're talking 2.4 mainline; 2.4-aa never did, but I
didn't convince Linus or Marcelo to merge it. As usual my argument is that
it's better to have the slight risk of failing an allocation than to be
deadlock prone.

> which you mention. The other modes allow the caller to say "don't try
> forever".
>
> It's mainly a cleanup - it allowed the removal of lots of "try forever"
> loops by consolidating that behaviour in the page allocator. If all

which is wrong since those loops should never exist, and this makes their
life easier, not harder.

> callers are fixed up to not require NOFAIL then we don't need it any more.

Agreed indeed.

> > as for the per-zone lists, sure they increase scalability, but it loses
> > aging information, the worst case will be reproducible on a 1.6G box,
>
> Actually it improves aging information. Think of a GFP_KERNEL allocation
> on an 8G 2.4.x box: an average of 10 or more highmem pages get bogusly
> rotated to the wrong end of the LRU for each scanned lowmem page.

if you assume GFP_KERNEL allocations are more frequent than or as frequent
as GFP_HIGHMEM ones, then yes, I would agree. If they were equally frequent,
there would be nearly no problem at all.

But GFP_KERNEL allocations are normally an order of magnitude less frequent
than highmem allocations (even rmap should try not to allocate a page for
each page fault in some common case ;). Think of allocating 1G of ram and
bzeroing it: it shouldn't generate a single GFP_KERNEL allocation. Of course
if you've got 1 page per vma things are different, but it's usually not the
common case.

IMHO there is a high risk that very old data lives in cache for very long
times. There can be as much as 800M of cache in zone normal, and before you
allocate 800M of GFP_KERNEL ram it can take a very long time. Also think if
you've got a 1G box: the highmem list would be very small and if you shrink
it first, you'll waste a huge amount of cache. Maybe you shrink the zone
normal list first in such a case of imbalance?

Overall I think rotating a global list too fast sounds much better in this
respect (with GFP_KERNEL allocations being much less frequent than the
highmem/pagecache/anonmemory allocation rate) as far as I can tell, but I
admit I didn't do any math (I didn't feel the need of a demonstration but
maybe we should?).

Andrea

2003-07-01 03:11:31

by William Lee Irwin III

Subject: Re: What to expect with the 2.6 VM

On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
>> Reverse Page Table Mapping
>> ==========================
>> One of the most important introductions in the 2.6 kernel is Reverse
>> Mapping commonly referred to as rmap. In 2.4, there is a one way mapping
>> from a PTE to a struct page which is sufficient for process addressing.
>> However, this presents a problem for the page replacement algorithm when
>> dealing with shared pages are used as there is no way in 2.4 to acquire a

On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> you mention only the positive things, and never the fact that's the most
> hurting piece of kernel code in terms of performance and smp scalability
> until you actually have to swapout or pageout.

Both of those are inaccurate.

(a) performance
This is not one-sided. Pagetable setup and teardown has higher
overhead with the pte-based methods. The largest gain is from
being able to atomically unmap pages. The overhead is generally
from the atomic operations involved with the pagewise locking.
This is generally not an issue for major workloads as process
creation and destruction is not heavily exercised by any but
the most naive applications. No lock contention arises as a
result of the pte_chains, but there are single-threaded cpu
costs arising from the atomic operations for pagewise locking.

Work has been done and is still being done to reduce the number
of atomic operations required to manipulate pages' mapping
metadata, by me and others, including Hugh Dickins, Andrew
Morton, Martin Bligh, and Dave McCracken. In some of my newer
patches I've been experimenting with RCU as a method for
reducing the number of atomic operations required to establish
a page's mappings.

The aspect of performance this benefits is pageout and minor
faults. When page replacement happens, it's no longer a
catastrophe. LRU scanning also does not unmap pages that aren't
specifically targeted for eviction, which reduces the number of
minor faults incurred. Both of these are benefits targeted
primarily at smaller machines, as, of course, larger machines
rarely ever do page replacement apart from some slow turnover
of unmapped cached file contents.

(b) SMP scalability
There are scalability issues, but they have absolutely nothing
to do with SMP or locking. The largest such issue is the
resource scalability issue on PAE. There are several long-
suffering patches with various amounts of work being put into
them that are trying to address this, by the same authors as
above. They are largely effective.


On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
>> Non-Linear Populating of Virtual Areas
>> ======================================
>> In 2.4, a VMA backed by a file would be populated in a linear fashion.
>> This can be optionally changed in 2.6 with the introduction of the
>> MAP_POPULATE flag to mmap() and the new system call remap_file_pages().
>> This system call allows arbitrary pages in an existing VMA to be remapped
>> to an arbitrary location on the backing file. This is mainly of interest
>> to database servers which previously simulated this behavior by the
>> creation of many VMAs.
>> On page-out, the non-linear address for the file is encoded within the PTE
>> so that it can be installed again correctly on page fault. How it is
>> encoded is architecture specific so two macros are defined called
>> pgoff_to_pte() and pte_to_pgoff() for the task.

On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> and it was used to break truncate, furthmore the API doesn't look good
> to me, the vma should have a special VM_NONLINEAR created with a
> MAP_NONLINEAR so the vm will skip it enterely and it should be possible
> to put multiple files in the same vma IMHO. If these areas are
> vmtruncated the VM_NONLINEAR will keep vmtruncate out of them and the
> pages will become anonymous, which is perfectly fine since they're
> created with a special MAP_NONLINEAR that can have different semantics.

Well, the truncate() breakage is just a bug that I guess we'll have to
clean up soon to get rid of criticisms like this. Tagging the vma for
exhaustive search during truncate() when it's touched by
remap_file_pages() sounds like a plausible fix.


On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> Also this feature is mainly useful only in kernels with rmap while using
> the VLM, to workaround the otherwise overkill cpu and memory overhead
> generated by rmap. It is discouraged to use it as a default API unless
> you fall into those special corner cases. the object of this API is to
> bypass the VM so you would lose security and the VM advantages. if you
> want the same features you have w/o nonlinaer w/ nonlinear, then you
> invalidate the whole point of nonlinear.
> The other very corner cases are emulators, they also could benefit from
> a VM bypass like remap_file_pags provides.
> Pinning ram from multiple files plus a sysctl or a sudo wrapper would be
> an optimal API and it would avoid the VM to even look at the pagetables
> of those areas.

IMHO mlock() should be the VM bypass, and refrain from creating
metadata like pte_chains or buffer_heads. Some additional but not very
complex mechanics appear to be required to do this. The emulator case
is precisely the one which would like to use remap_file_pages() with
all the VM metadata attached, as they often prefer to not run with
elevated privileges as required by mlock(), and may in fact be using
data sets larger than RAM despite the slowness thereof. The immediate
application I see is for bochs to emulate highmem on smaller machines.


On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> Last but not the least, remap_file_pages is nearly useless (again modulo
> emulators) w/o bigpages support backing it (not yet the case in 2.5 as
> far as I can see so it's unusable compared to 2.4-aa [but 2.5 -
> remap_file_pages is even less usable than 2.5 + remap_file_pages due
> rmap that wouldn't only run slow but it wouldn't run at all]). objrmap
> fixes that despites it introduces some complex algorithm.
> the only significant cost is the tlb flushing and pagetable walking at
> 32G of working set in cache with a 30-31bit window on the cache.

This is wildly inaccurate. Highly complex mappings rarely fall upon
boundaries suitable for large TLB entries apart from on architectures
(saner than x86) with things like 32KB TLB entry sizes. The only
overhead there is to be saved is metadata (i.e. space conservation)
like buffer_heads and pte_chains, which can just as easily be done with
proper mlock() bypasses.

Basically, this is for the buffer pool, not the shared segment, and
so it can't use large tlb entries anyway on account of the granularity.
The shared segment could very easily use hugetlb as-is without issue.

Mixed-mode hugetlb has been requested various times. I would much
prefer to address the creation, destruction, and space overheads of
normal mappings and implement fully implicit large TLB entry
utilization than implement such.


On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
>> Delayed Coalescing
>> ==================
>> 2.6 extends the buddy algorithm to resemble a lazy buddy algorithm [BL89]
>> which delays the splitting and coalescing of buddies until it is
>> necessary. The delay is implemented only for 0 order allocations with the

On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> desribed this way it sounds like it's not O(1) anymore for order > 0
> allocations. Though it would be a nice feature for all the common cases.
> And of course it's still O(1) if one assumes the number of orders
> limited (and it's fixed at compile time).
> as for the per-zone lists, sure they increase scalability, but it loses
> aging information, the worst case will be reproducible on a 1.6G box,
> some day even 2.3 had per-zone lists, I backed it out to avoid losing
> information which is an order of magnitude more interesting than
> the pure smp scalability for normal users (i.e. 2-way systems with 1-2G
> of ram, of course the scalability numbers that people cares about on the
> 8-way with 32G of ram will be much better with the per-zone lru).
> Just trying not to forget the other side too ;) Not every improvement on
> one side cames for free on all other sides.

The per-zone LRU lists are not related to SMP scalability either. They
are for:
(a) node locality
So a node's kswapd or processes doing try_to_free_pages() only
references struct pages on the local nodes.
(b) segregating the search space for page replacement
So the search for ZONE_NORMAL or ZONE_DMA pages doesn't scan
the known universe hunting for a needle in a haystack and spend
24+ hours doing so.

Lock contention was addressed differently, by amortizing the lock
acquisitions using the pagevec infrastructure. In general the
global aging and/or queue ordering information has not been found to be
particularly valuable, or at least no one's complaining that they miss it.


-- wli

2003-07-01 03:11:12

by Andrea Arcangeli

Subject: Re: What to expect with the 2.6 VM

On Tue, Jul 01, 2003 at 05:22:48AM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 30, 2003 at 08:02:37PM -0700, Andrew Morton wrote:
> > callers are fixed up to not require NOFAIL then we don't need it any more.
>
> Agreed indeed.

I also found one argument in favour of NOFAIL: now it'll be easier to
find all the deadlocking places ;)

Andrea

2003-07-01 03:15:49

by Rik van Riel

Subject: Re: What to expect with the 2.6 VM

On Tue, 1 Jul 2003, Andrea Arcangeli wrote:

> Also think if you've a 1G box, the highmem list would be very small and
> if you shrink it first, you'll waste an huge amount of cache. Maybe you
> go shrink the zone normal list first in such case of unbalance?

That's why you have low and high watermarks and try to balance
the shrinking and allocating in both zones. Not sure how
classzone would influence this balancing though; maybe it'd be
harder, maybe it'd be easier, but I guess it would be different.

> Overall I think rotating too fast a global list sounds much better in this
> respect (with less infrequent GFP_KERNELS compared to the
> highmem/pagecache/anonmemory allocation rate) as far as I can tell, but
> I admit I didn't do any math (I didn't feel the need of a demonstration
> but maybe we should?).

Remember that on large systems ZONE_NORMAL is often under much
more pressure than ZONE_HIGHMEM. Need any more arguments ? ;)

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-07-01 04:25:23

by Andrea Arcangeli

Subject: Re: What to expect with the 2.6 VM

On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> >> Reverse Page Table Mapping
> >> ==========================
> >> One of the most important introductions in the 2.6 kernel is Reverse
> >> Mapping commonly referred to as rmap. In 2.4, there is a one way mapping
> >> from a PTE to a struct page which is sufficient for process addressing.
> >> However, this presents a problem for the page replacement algorithm when
> >> dealing with shared pages are used as there is no way in 2.4 to acquire a
>
> On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> > you mention only the positive things, and never the fact that's the most
> > hurting piece of kernel code in terms of performance and smp scalability
> > until you actually have to swapout or pageout.
>
> Both of those are inaccurate.
>
> (a) performance
> This is not one-sided. Pagetable setup and teardown has higher
> overhead with the pte-based methods. The largest gain is from

I said "until you actually have to swapout or pageout". I very much that
you're right and that pagetable teardown has higher overhead with the
pte-based methods. If you were wrong about it rmap would be a total
waste on all sides, not just one.

I'm talking about the other side only (which is the one I want to run
fast on my machines).

> being able to atomically unmap pages. The overhead is generally
> from the atomic operations involved with the pagewise locking.

which is a scalability cost. Those so-called atomic operations are
spinlocks (actually they're slower than spinlocks); that's the
cpu/scalability cost I mentioned above.

> metadata, by me and others, including Hugh Dickins, Andrew
> Morton, Martin Bligh, and Dave McCracken. In some of my newer

I was mainly talking about mainline. I also have a design that would reduce
the overhead to zero while still having the same swapout behaviour, but
there are too many variables; if it has a chance of working, I'll talk about
it after I've checked that it actually works, but likely it won't work well
(I mean during swap, because of some exponential recursion issue; the other
costs would be zero). It's just an idea for now.

> (b) SMP scalability
> There are scalability issues, but they have absolutely nothing
> to do with SMP or locking. The largest such issue is the

readprofile shows it at the top constantly, so it's either cpu or
scalability. And the open-coded spinlock in the minor page faults is a
scalability showstopper IMHO. You know the kernel compile numbers on the
32-way w/ rmap and w/o rmap (with objrmap).

> resource scalability issue on PAE. There are several long-
> suffering patches with various amounts of work being put into
> them that are trying to address this, by the same authors as
> above. They are largely effective.

yes, this is why I wrote the email, to try not to forget that rmap in
mainline would better be replaced with one of those solutions under
development.

> On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> >> Non-Linear Populating of Virtual Areas
> >> ======================================
> >> In 2.4, a VMA backed by a file would be populated in a linear fashion.
> >> This can be optionally changed in 2.6 with the introduction of the
> >> MAP_POPULATE flag to mmap() and the new system call remap_file_pages().
> >> This system call allows arbitrary pages in an existing VMA to be remapped
> >> to an arbitrary location on the backing file. This is mainly of interest
> >> to database servers which previously simulated this behavior by the
> >> creation of many VMAs.
> >> On page-out, the non-linear address for the file is encoded within the PTE
> >> so that it can be installed again correctly on page fault. How it is
> >> encoded is architecture specific so two macros are defined called
> >> pgoff_to_pte() and pte_to_pgoff() for the task.
>
> On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> > and it was used to break truncate, furthmore the API doesn't look good
> > to me, the vma should have a special VM_NONLINEAR created with a
> > MAP_NONLINEAR so the vm will skip it enterely and it should be possible
> > to put multiple files in the same vma IMHO. If these areas are
> > vmtruncated the VM_NONLINEAR will keep vmtruncate out of them and the
> > pages will become anonymous, which is perfectly fine since they're
> > created with a special MAP_NONLINEAR that can have different semantics.
>
> Well, the truncate() breakage is just a bug that I guess we'll have to
> clean up soon to get rid of criticisms like this. Tagging the vma for
> exhaustive search during truncate() when it's touched by
> remap_file_pages() sounds like a plausible fix.
>
>
> On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> > Also this feature is mainly useful only in kernels with rmap while using
> > the VLM, to workaround the otherwise overkill cpu and memory overhead
> > generated by rmap. It is discouraged to use it as a default API unless
> > you fall into those special corner cases. the object of this API is to
> > bypass the VM so you would lose security and the VM advantages. if you
> > want the same features you have w/o nonlinaer w/ nonlinear, then you
> > invalidate the whole point of nonlinear.
> > The other very corner cases are emulators, they also could benefit from
> > a VM bypass like remap_file_pags provides.
> > Pinning ram from multiple files plus a sysctl or a sudo wrapper would be
> > an optimal API and it would avoid the VM to even look at the pagetables
> > of those areas.
>
> IMHO mlock() should be the VM bypass, and refrain from creating
> metadata like pte_chains or buffer_heads. Some additional but not very

mlock doesn't prevent rmap from being created on the mapping.

If mlock was enough as a vm bypass, you wouldn't need remap_file_pages in
the first place.

remap_file_pages is the vm bypass, mlock is not. VM bypass doesn't only mean
that you can't swap it, it means that the vm just pins the pages and forgets
about them as if it were a framebuffer with PG_reserved set. And as such it
doesn't make much sense to give vm knowledge to remap_file_pages when
remap_file_pages only makes sense, in 99% of the usages in production,
together with largepages (which is not possible in 2.5.73 yet); it would be
an order of magnitude more worthwhile to include the fd as a parameter in
the API than to be able to page it IMHO.

> complex mechanics appear to be required to do this. The emulator case
> is precisely the one which would like to use remap_file_pages() with
> all the VM metadata attached, as they often prefer to not run with
> elevated privileges as required by mlock(), and may in fact be using
> data sets larger than RAM despite the slowness thereof. The immediate
> application I see is for bochs to emulate highmem on smaller machines.

if it has to swap that much, the cost of walking the vma tree wouldn't be
that significant anymore (though you would save the ram for the vma).

Also normally the emulators are very strict about not allowing you to
allocate insane amounts of emulated ram because otherwise it slows down to a
crawl; they look at the ram in the box to limit it. Though it would be
theoretically possible to do what you mentioned, the benefit of
remap_file_pages would not be noticeable under such a workload IMHO. You
want remap_file_pages when you have zero background vm kernel activity in
the mapping while you use it, as far as I can tell. If you have vm
background activity, you want the most efficient way to tear down that pte,
as you said at the top of the email.

> On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> > Last but not the least, remap_file_pages is nearly useless (again modulo
> > emulators) w/o bigpages support backing it (not yet the case in 2.5 as
> > far as I can see so it's unusable compared to 2.4-aa [but 2.5 -
> > remap_file_pages is even less usable than 2.5 + remap_file_pages due
> > rmap that wouldn't only run slow but it wouldn't run at all]). objrmap
> > fixes that despites it introduces some complex algorithm.
> > the only significant cost is the tlb flushing and pagetable walking at
> > 32G of working set in cache with a 30-31bit window on the cache.
>
> This is wildly inaccurate. Highly complex mappings rarely fall upon
> boundaries suitable for large TLB entries apart from on architectures

If they don't fit it's a userspace issue, not a kernel issue.

> (saner than x86) with things like 32KB TLB entry sizes. The only
> overhead there is to be saved is metadata (i.e. space conservation)
> like buffer_heads and pte_chains, which can just as easily be done with
> proper mlock() bypasses.
>
> Basically, this is for the buffer pool, not the shared segment, and
> so it can't use large tlb entries anyway on account of the granularity.

the granularity is 2M, not 32k. The cost of 2M granularity is the same as
32k if you use bigpages; in fact it's much less because you only walk 2
levels, not 3. So such an application using 32k granularity isn't designed
for the high end and it'll have to be fixed to compete with better apps, so
no matter the kernel it won't run as fast as the hardware can deliver; it's
not a kernel issue.

> The shared segment could very easily use hugetlb as-is without issue.
>
> Mixed-mode hugetlb has been requested various times. I would much
> prefer to address the creation, destruction, and space overheads of
> normal mappings and implement fully implicit large TLB entry
> utilization than implement such.
>
>
> On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> >> Delayed Coalescing
> >> ==================
> >> 2.6 extends the buddy algorithm to resemble a lazy buddy algorithm [BL89]
> >> which delays the splitting and coalescing of buddies until it is
> >> necessary. The delay is implemented only for 0 order allocations with the
>
> On Tue, Jul 01, 2003 at 04:25:16AM +0200, Andrea Arcangeli wrote:
> > desribed this way it sounds like it's not O(1) anymore for order > 0
> > allocations. Though it would be a nice feature for all the common cases.
> > And of course it's still O(1) if one assumes the number of orders
> > limited (and it's fixed at compile time).
> > as for the per-zone lists, sure they increase scalability, but it loses
> > aging information, the worst case will be reproducible on a 1.6G box,
> > some day even 2.3 had per-zone lists, I backed it out to avoid losing
> > information which is an order of magnitude more interesting than
> > the pure smp scalability for normal users (i.e. 2-way systems with 1-2G
> > of ram, of course the scalability numbers that people cares about on the
^^^^^^^^^^^
> > 8-way with 32G of ram will be much better with the per-zone lru).
> > Just trying not to forget the other side too ;) Not every improvement on
> > one side cames for free on all other sides.
>
> The per-zone LRU lists are not related to SMP scalability either. They

I never said SMP scalability, I said just plain _scalability_.

> are for:
> (a) node locality
> So a node's kswapd or processes doing try_to_free_pages() only
> references struct pages on the local nodes.

I'm talking about per-zone, not per-node. Since they're per-zone they're
also per-node, but per-node is completely different from per-zone. In fact
I originally even had per-node lrus at some point around
2.4.0-testsomething for the reasons you said above. Then I gave up since
there were too few numa users and more basic numa issues to address first
at that time, and nobody cared much about numa; I even heard the argument
that classzone was bad for numa at some point ;)

> (b) segregating the search space for page replacement
> So the search for ZONE_NORMAL or ZONE_DMA pages doesn't scan
> the known universe hunting for a needle in a haystack and spend
> 24+ hours doing so.

yes, and that is good only for the very infrequent allocations, and
wastes aging for the common highmem ones. That was my whole point.

> Lock contention was addressed differently, by amortizing the lock
> acquisitions using the pagevec infrastructure. In general the

I said scalability feature, and it is definitely a scalability feature, not
an smp scalability one; I never spoke about lock contention. But you will
definitely appreciate it more on the 8G boxes due to the tiny zone normal
they have compared to the highmem one, which is definitely a scalability
issue (loads of wasted cpu) but pretty much worthless on a 1G box.

> global aging and/or queue ordering information has not been found to be
> particularly valuable, or at least no one's complaining that they miss it.

in -aa on a 64G box I don't allow a single byte of normal zone to be
assigned to highmem-capable users, so nobody could ever complain about that
in the high end on a good kernel since the whole lru should be populated
only by highmem users regardless - which would be equivalent to 2.5. On the
high end the normal zone is so tiny that you don't waste much whether it's
loaded with old worthless cache or completely unused; the aging difference
will be near zero anyway. But for the low end, if you provide a patch to
revert it to a global list and give that patch on the list to the 1G-1.5G
ram users, then maybe somebody will have a chance to notice the potential
regression.

Andrea

2003-07-01 06:19:39

by William Lee Irwin III

Subject: Re: What to expect with the 2.6 VM

On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> Both of those are inaccurate.
>> (a) performance
>> This is not one-sided. Pagetable setup and teardown has higher
>> overhead with the pte-based methods. The largest gain is from

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> I said "until you actually have to swapout or pageout". I very much that
> you're right and that pagetable teardown has higher overhead with the
> pte-based methods. If you were wrong about it rmap would be a total
> waste on all sides, not just one.
> I'm talking about the other side only (which is the one I want to run
> fast on my machines).

Okay, so let's speed it up. I've got various hacks related to this in
my -wli tree (which are even quite a bit beyond objrmap). It includes
an NIH of hugh's old anobjrmap patches with some tiny speedups like
inlining the rmap functions, which became small enough to do that with,
and RCU'ing ->i_shared_lock and anon->lock.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> being able to atomically unmap pages. The overhead is generally
>> from the atomic operations involved with the pagewise locking.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> which is a scalability cost. Those so called atomic operations are
> spinlocks (actually they're slower than spinlocks), that's the
> cpu/scalability cost I mentioned above.

Okay this is poorly stated. Those cpu costs do not affect scaling
factors.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> metadata, by me and others, including Hugh Dickins, Andrew
>> Morton, Martin Bligh, and Dave McCracken. In some of my newer

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> I was mainly talking about mainline, I also have a design that would
> reduce the overhead to zero still having the same swapout, but there are
> too many variables and if it has a chance to work, I'll talk about it
> after I checked it actually works but likely it won't work well (I mean
> during swap for some exponential recursion issue, because the other
> costs would be zero), it's just an idea for now.

The anobjrmap code from Hugh Dickins that I NIH'd for -wli essentially
does this, apart from eliminating the per-page locking, and it's
already out there.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> (b) SMP scalability
>> There are scalability issues, but they have absolutely nothing
>> to do with SMP or locking. The largest such issue is the

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> readprofile shows it at the top constantly, so it's either cpu or
> scalability. And the spinlock by hand in the minor page faults is a
> scalability showstopper IMHO. You know the kernel compile on the 32way
> w/ rmap and w/o rmap (with objrmap).

Kernel compiles aren't really very good benchmarks. I've moved on to
SDET and gotten various things done to cut down on process creation
and destruction, including but not limited to NIH'ing various object-
based ptov resolution patches and RCU'ing locks they pound on.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> resource scalability issue on PAE. There are several long-
>> suffering patches with various amounts of work being put into
>> them that are trying to address this, by the same authors as
>> above. They are largely effective.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> yes, this is why I wrote the email, to try not to forget that rmap in
> mainline would better be replaced with one of those solutions under
> development.

There are some steps that need to be taken before merging them, but
AFAICT they're progressing toward merging smoothly.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> Well, the truncate() breakage is just a bug that I guess we'll have to
>> clean up soon to get rid of criticisms like this. Tagging the vma for
>> exhaustive search during truncate() when it's touched by
>> remap_file_pages() sounds like a plausible fix.

On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> IMHO mlock() should be the VM bypass, and refrain from creating
>> metadata like pte_chains or buffer_heads. Some additional but not very

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> mlock doesn't prevent rmap to be created on the mapping.
> If mlock was enough as a vm bypass, you wouldn't need remap_file_pages
> in the first place.

I was proposing that it should be used to prevent pte_chains from being
created, not suggesting that it does so now.


On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> remap_file_pages is the vm bypass, mlock is not. VM bypass doesn't only
> mean that you can't swap it, it means that the vm just pins the pages
> and forget about them like if it was a framebuffer with PG_reserved set.
> And as such it doesn't make much sense to give vm knowledge to
> remap_file_pages when also remap_file_pages only makes sense in the 99%
> of the usages in production together with largepages (which is not
> possible in 2.5.73 yet), it would be an order of magnitude more
> worhtwhile to include the fd as parameter in the API than to be able to
> page it IMHO.

There is nothing particularly special about remap_file_pages() apart
from the fact that some additional checks are needed when walking
->i_mmap and ->i_mmap_shared to avoid range checks being fooled by it.

remap_file_pages() in combination with hugetlb just doesn't make sense.
The buffer pool does not use 2MB windows (though it's configurable, so
in principle you _could_). Furthermore 2MB windows are so large the vma
overhead of using mmap() could already be tolerated.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> complex mechanics appear to be required to do this. The emulator case
>> is precisely the one which would like to use remap_file_pages() with
>> all the VM metadata attached, as they often prefer to not run with
>> elevated privileges as required by mlock(), and may in fact be using
>> data sets larger than RAM despite the slowness thereof. The immediate
>> application I see is for bochs to emulate highmem on smaller machines.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> if it has to swap that much, the cost of walking the vma tree wouldn't
> be that significant anymore (though you would save the ram for the vma).
> Also normally the emulators are very strict at not allowing you to
> allocate insane amount of emulated ram because otherwise it slowsdown
> like a crawl, they look at the ram in the box to limit it. Thought it
> would be theroetically possible to do what you mentioned, but the
> benefit of remap_file_pages would be not noticeable under such a
> workload IMHO. You want remap_file_pages when you have zero background
> vm kernel activity in the mapping while you use it as far as I can tell.
> If you have vm background activity, you want the most efficent way to
> teardown that pte as you said at the top of the email.

In general the way to use it would be to treat virtualspace as a fully
associative cache. This also does not imply swapping, since the stuff may
just as easily be cached in highmem. The fact that one _could_ do it with
data sets larger than RAM, albeit slowly, is also of interest. It would
effectively want random page replacement, e.g. a FIFO queue, and some
method of mitigating its impact on other applications on the system.
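
To sketch what that usage pattern could look like from userspace (an
illustration only, not code from any patch in this thread; the slot
count, the FIFO policy and the helper names are invented, and error
handling is omitted):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    #define SLOTS 1024                      /* window slots, arbitrary */

    static void *window;                    /* one big nonlinear vma   */
    static long page_size;
    static unsigned int next_victim;        /* FIFO hand               */

    void window_init(int fd)
    {
            page_size = sysconf(_SC_PAGESIZE);
            /* A single linear mapping; remap_file_pages() below only
             * rewrites its ptes, so no further vmas are created. */
            window = mmap(NULL, SLOTS * page_size,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    /* Map file page 'pgoff' into the next slot, reusing slots FIFO. */
    void *window_map(unsigned long pgoff)
    {
            void *slot = (char *)window + next_victim * page_size;

            next_victim = (next_victim + 1) % SLOTS;
            remap_file_pages(slot, page_size, 0, pgoff, 0);
            return slot;
    }

The point being that the working set is bounded by the window, so
whatever replacement the kernel does inside that vma only has to cope
with the slots the application is actively rotating through.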


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> This is wildly inaccurate. Highly complex mappings rarely fall upon
>> boundaries suitable for large TLB entries apart from on architectures

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> If they don't fit it's an userspace issue, not a kernel issue.

You say that as if it's in the least bit feasible to get major apps to
change. They're almost as bad as hardware.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> (saner than x86) with things like 32KB TLB entry sizes. The only
>> overhead there is to be saved is metadata (i.e. space conservation)
>> like buffer_heads and pte_chains, which can just as easily be done with
>> proper mlock() bypasses.
>> Basically, this is for the buffer pool, not the shared segment, and
>> so it can't use large tlb entries anyway on account of the granularity.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> the granularity is 2M, not 32k. The cost of 2M granularity is the same
> as 32k if you use bigpages, infact it's much less because you only walk
> 2 level, not 3. so such an application using 32k granularity isn't
> designed for the high end and it'll have to be fixed to compete with
> better apps, so no matter the kernel it won't run as fast as the
> hardware can deliver, it's not a kernel issue.

What on earth? 32KB granularity still needs the 3rd level. I also fail
to see a resource scalability impact for the kernels bigpages are used
on, since although those pages are eliminated from consideration for
swapping, bigpages are not in use in combination with rmap-based VMs
that I know of.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> The per-zone LRU lists are not related to SMP scalability either. They

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> I never said SMP scalability, I said just plain _scalability_.

To measure scaling a performance metric (possibly workload feasibility)
and an input are required. For instance, throughput on SDET and number
of cpus, or amount of RAM and number of processes. "Scalability",
wherever it's unqualified by such things, is meaningless.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> are for:
>> (a) node locality
>> So a node's kswapd or processes doing try_to_free_pages() only
>> references struct pages on the local nodes.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> I'm talking about the per-zone, not per-node. Since they're per-zone
> they're also per-node, but per-node is completely different than
> per-zone. Infact originally I even had per-node lrus at some point
> around 2.4.0-testsomething for the reasons you said above. then I giveup
> since there were too few numa users and more basic numa issues to
> address first at that time and nobody cared much about numa, I even
> heard the argument that classzone was bad for numa at some point ;)

Well, the reason why it's per-zone and not just per-node is in point (b).
Per-node would suffice to reap the qualitative benefits in point (a).


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> (b) segregating the search space for page replacement
>> So the search for ZONE_NORMAL or ZONE_DMA pages doesn't scan
>> the known universe hunting for a needle in a haystack and spend
>> 24+ hours doing so.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> yes, and that is good only for the very infrequent allocations, and
> wastes aging for the common highmem ones. That was my whole point.

The question is largely of whether the algorithm explodes when it's
asked to recover lowmem.


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> Lock contention was addressed differently, by amortizing the lock
>> acquisitions using the pagevec infrastructure. In general the

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> I said scalability feature, it is definitely a scalability feature, not
> an smp scalabity one, I never spoke about lock contention, but you will
> definitely appreciate it more on the 8G boxes due the tiny zone normal
> they have compared to the highmem one, which is definitely a scalability
> issue (loads of wasted cpu) but very worthless on a 1G box.

Unfortunately I don't have such a machine at my disposal (the discontig
code barfs on mem= at the moment). But I'd like to see the results.
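
(As an aside, for readers who haven't seen it: the pagevec amortization
quoted above boils down to batching pages in a small array and taking
zone->lru_lock once per batch instead of once per page. A conceptual
sketch with simplified names, not the exact mainline code:)

    #define PAGEVEC_SIZE 16

    struct pagevec {
            unsigned nr;
            struct page *pages[PAGEVEC_SIZE];
    };

    /* Queue a page for the inactive list; flush a whole batch under a
     * single lock round trip. */
    void lru_cache_add_sketch(struct zone *zone, struct pagevec *pvec,
                              struct page *page)
    {
            pvec->pages[pvec->nr++] = page;
            if (pvec->nr == PAGEVEC_SIZE) {
                    int i;

                    spin_lock_irq(&zone->lru_lock);
                    for (i = 0; i < PAGEVEC_SIZE; i++)
                            list_add(&pvec->pages[i]->lru,
                                     &zone->inactive_list);
                    spin_unlock_irq(&zone->lru_lock);
                    pvec->nr = 0;
            }
    }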


On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
>> global aging and/or queue ordering information has not been found to be
>> particularly valuable, or at least no one's complaining that they miss it.

On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> in -aa on a 64G I don't allow a single byte of normal zone to be
> assigned to highmem capable users, so nobody could ever complain about
> that in the high end on a good kernel since the whole lru should populated
> only by highmem users regardless - which would be equivalent to 2.5. On
> the highend the normal zone is so tiny, that you don't waste if it's
> loaded in old worthless cache or completely unused, the aging difference
> with be near zero anyways. But for the low end if you will provide a
> patch to revert it to a global list and if you give that patch on the
> list to the 1G-1.5G ram users, then maybe somebody will have a chance to
> notice the potential regression.

I would like to hear of the results of testing on such systems. I
personally have doubts any significant difference will be found.


-- wli

2003-07-01 07:35:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> Both of those are inaccurate.
> >> (a) performance
> >> This is not one-sided. Pagetable setup and teardown has higher
> >> overhead with the pte-based methods. The largest gain is from
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > I said "until you actually have to swapout or pageout". I very much that
> > you're right and that pagetable teardown has higher overhead with the
> > pte-based methods. If you were wrong about it rmap would be a total
> > waste on all sides, not just one.
> > I'm talking about the other side only (which is the one I want to run
> > fast on my machines).
>
> Okay, so let's speed it up. I've got various hacks related to this in

Last time we got stuck on objrmap due to the complexity complaints (the
i_mmap list links all the vmas of the file, not only the ones covering
the range that maps the page we need to free, so we would end up
checking lots of vmas multiple times for each page).

In practice it didn't look bad enough to me, though; the main issue is
the security concern that a user could force the VM not to schedule for
the time it takes to reach that page (or we could make it schedule
without guaranteeing the unmap; the unmap isn't guaranteed in 2.4 either
if the user keeps looping, and that's also a feature).

> my -wli tree (which are even quite a bit beyond objrmap). It includes
> an NIH of hugh's old anobjrmap patches with some tiny speedups like
> inlining the rmap functions, which became small enough to do that with,
> and RCU'ing ->i_shared_lock and anon->lock.
>
>
> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> being able to atomically unmap pages. The overhead is generally
> >> from the atomic operations involved with the pagewise locking.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > which is a scalability cost. Those so called atomic operations are
> > spinlocks (actually they're slower than spinlocks), that's the
> > cpu/scalability cost I mentioned above.
>
> Okay this is poorly stated. Those cpu costs do not affect scaling
> factors.

Those atomic operations happen on the page, and the page is shared.
That's where the SMP scalability cost comes from: the cacheline bounces
across all the CPUs that execute gcc at the same time.
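
For concreteness, the "spinlock by hand" in question is roughly the
per-page chain lock bit that 2.5's rmap takes on every add or remove of
a reverse mapping; a simplified sketch of the idea (the real code lives
in mm/rmap.c, this is only meant to show where the cross-CPU traffic
comes from):

    static inline void pte_chain_lock(struct page *page)
    {
            /* An atomic RMW on page->flags: every minor fault on a page
             * shared across N CPUs (shared libraries, gcc text, ...)
             * dirties this cacheline, so it bounces between CPUs even
             * when nobody actually spins waiting for the bit. */
            while (test_and_set_bit(PG_chainlock, &page->flags))
                    cpu_relax();
    }

    static inline void pte_chain_unlock(struct page *page)
    {
            smp_mb__before_clear_bit();
            clear_bit(PG_chainlock, &page->flags);
    }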

>
>
> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> metadata, by me and others, including Hugh Dickins, Andrew
> >> Morton, Martin Bligh, and Dave McCracken. In some of my newer
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > I was mainly talking about mainline, I also have a design that would
> > reduce the overhead to zero still having the same swapout, but there are
> > too many variables and if it has a chance to work, I'll talk about it
> > after I checked it actually works but likely it won't work well (I mean
> > during swap for some exponential recursion issue, because the other
> > costs would be zero), it's just an idea for now.
>
> The anobjrmap code from Hugh Dickins that I NIH'd for -wli essentially
> does this, apart from eliminating the per-page locking, and it's
> already out there.

I said it would provide the same swapout features, but without the
complexity issue in the swapout. objrmap has the complexity issue;
that's its main problem as far as I know. But personally I prefer it.

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> (b) SMP scalability
> >> There are scalability issues, but they have absolutely nothing
> >> to do with SMP or locking. The largest such issue is the
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > readprofile shows it at the top constantly, so it's either cpu or
> > scalability. And the spinlock by hand in the minor page faults is a
> > scalability showstopper IMHO. You know the kernel compile on the 32way
> > w/ rmap and w/o rmap (with objrmap).
>
> Kernel compiles aren't really very good benchmarks. I've moved on to

True, many important workloads aren't remotely similar to kernel
compiles, but kernel compiles are still relevant.

> SDET and gotten various things done to cut down on process creation
> and destruction, including but not limited to NIH'ing various object-
> based ptov resolution patches and RCU'ing locks they pound on.

I'm unsure whether RCU is worthwhile for this: here the cost is on the
writer, not the reader. The reader is the swapout code (OK, also
mprotect/mremap and friends), but the real issue is the writer. The
reader has always been fine with the spinlock (or at the very least I
don't care about the scalability of the reader). RCU (unless you benefit
greatly from the coalescing of the free operations) could make the
writer even slower in order to benefit the reader. So overall RCU here
doesn't make much sense to me.
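
To make the reader/writer asymmetry explicit, the usual list-RCU shape
looks something like the sketch below (a generic sketch using today's
primitive names, not the actual -wli patch): the read side drops the
lock and the atomic ops, but the update side is still fully serialized
and additionally has to defer its frees.

    struct vma_node {
            struct list_head list;
            struct rcu_head rcu;
            /* ... */
    };

    /* Reader, e.g. the swapout/truncate side walking an i_mmap-style
     * list: no lock, no shared-cacheline writes. */
    void reader_walk(struct list_head *head)
    {
            struct vma_node *v;

            rcu_read_lock();
            list_for_each_entry_rcu(v, head, list)
                    ;       /* inspect v */
            rcu_read_unlock();
    }

    /* Writer, e.g. mmap/munmap: still takes the spinlock, and unlink
     * must defer the actual kfree() for a grace period (call_rcu() or
     * similar; the exact signature varies between kernel versions). */
    void writer_remove(spinlock_t *lock, struct vma_node *v)
    {
            spin_lock(lock);
            list_del_rcu(&v->list);
            spin_unlock(lock);
            /* kfree(v) deferred until after a grace period */
    }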

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> resource scalability issue on PAE. There are several long-
> >> suffering patches with various amounts of work being put into
> >> them that are trying to address this, by the same authors as
> >> above. They are largely effective.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > yes, this is why I wrote the email, to try not to forget that rmap in
> > mainline would better be replaced with one of those solutions under
> > development.
>
> There are some steps that need to be taken before merging them, but
> AFAICT they're progressing toward merging smoothly.

Which steps? And how did you solve the complexity issue? Who gave the
go-ahead for merging objrmap? AFAIK there was disagreement due to the
complexity issues (hence I thought of a different approach that probably
won't work, but that would solve all the complexity issues and the cost
in the page faults, though it wouldn't be simple at all even if it
works).

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> Well, the truncate() breakage is just a bug that I guess we'll have to
> >> clean up soon to get rid of criticisms like this. Tagging the vma for
> >> exhaustive search during truncate() when it's touched by
> >> remap_file_pages() sounds like a plausible fix.
>
> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> IMHO mlock() should be the VM bypass, and refrain from creating
> >> metadata like pte_chains or buffer_heads. Some additional but not very
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > mlock doesn't prevent rmap to be created on the mapping.
> > If mlock was enough as a vm bypass, you wouldn't need remap_file_pages
> > in the first place.
>
> I was proposing that it should be used to prevent pte_chains from being
> created, not suggesting that it does so now.

If you want to change mlock to drop the pte_chains, then it would
definitely make mlock a VM bypass, even if not as strong a one as
remap_file_pages, which bypasses the vma layer too (not only the
pte/pte_chain side).
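
A purely hypothetical sketch of that kind of mlock bypass (my
illustration of the idea being argued about, not an existing patch; in
particular the vma argument to page_add_rmap() is invented for the
example):

    /* Hypothetical: decline to build reverse-mapping metadata for
     * locked vmas.  munlock() would then have to walk the vma's page
     * tables and redo this bookkeeping before the pages can become
     * swappable again. */
    struct pte_chain *page_add_rmap(struct page *page, pte_t *ptep,
                                    struct vm_area_struct *vma,
                                    struct pte_chain *pte_chain)
    {
            if (vma->vm_flags & VM_LOCKED)
                    return pte_chain;   /* VM bypass: no chain, no lock */

            /* ... normal PG_direct / pte_chain bookkeeping ... */
            return pte_chain;
    }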

> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > remap_file_pages is the vm bypass, mlock is not. VM bypass doesn't only
> > mean that you can't swap it, it means that the vm just pins the pages
> > and forget about them like if it was a framebuffer with PG_reserved set.
> > And as such it doesn't make much sense to give vm knowledge to
> > remap_file_pages when also remap_file_pages only makes sense in the 99%
> > of the usages in production together with largepages (which is not
> > possible in 2.5.73 yet), it would be an order of magnitude more
> > worhtwhile to include the fd as parameter in the API than to be able to
> > page it IMHO.
>
> There is nothing particularly special about remap_file_pages() apart
> from the fact some additional checks are needed when walking ->i_mmap
> and ->i_mmap_shared to avoid range checks being fooled by it.

There is something very special about remap_file_pages: it is the only
way to avoid rmap right now. And if you claim that remap_file_pages
works better than mmap() for the simulator that allocates more emulated
highmem than there is RAM, you're just saying that rmap is completely
useless even during the pte teardown. So either you admit that
remap_file_pages would be a bad idea to use with heavy paging (i.e. as
much highmem as you want, booting 64G of emulated RAM on a 4G box)
because it misses rmap, or you admit that rmap is useless.

>
> remap_file_pages() in combination with hugetlb just doesn't make sense.
> The buffer pool does not use 2MB windows (though it's configurable, so
> in principle you _could_). Furthermore 2MB windows are so large the vma
> overhead of using mmap() could already be tolerated.

remap_file_pages theoretically provides the lowest possible overhead if
implemented as I've been advocating for some months: a full, true VM
bypass, even stronger than your mlock that drops the pte_chains and
reconstructs them when the mlock goes away.

I mean, nothing can be faster than remap_file_pages if implemented as I
describe (with support for largepages as well). Done that way it would
be an obvious improvement, for the emulators too (provided a sysctl that
addresses the security concerns w.r.t. the amount of pinned RAM).

But as soon as you refuse to support largepages with remap_file_pages
you automatically make it slower than mmap, and then remap_file_pages
becomes almost useless (except for the emulators).

Note: remap_file_pages is not needed at all with current kernels; the
only reason we may want to backport it to 2.4 is for the lower vma
overhead (a very minor issue), so maybe it's a 0.5-1% gain or similar
(I'm only guessing). But the lack of largepage support would offset
whatever benefit the vma overhead avoidance can provide IMHO (and that
applies to 2.5 too, of course).

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> complex mechanics appear to be required to do this. The emulator case
> >> is precisely the one which would like to use remap_file_pages() with
> >> all the VM metadata attached, as they often prefer to not run with
> >> elevated privileges as required by mlock(), and may in fact be using
> >> data sets larger than RAM despite the slowness thereof. The immediate
> >> application I see is for bochs to emulate highmem on smaller machines.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > if it has to swap that much, the cost of walking the vma tree wouldn't
> > be that significant anymore (though you would save the ram for the vma).
> > Also normally the emulators are very strict at not allowing you to
> > allocate insane amount of emulated ram because otherwise it slowsdown
> > like a crawl, they look at the ram in the box to limit it. Thought it
> > would be theroetically possible to do what you mentioned, but the
> > benefit of remap_file_pages would be not noticeable under such a
> > workload IMHO. You want remap_file_pages when you have zero background
> > vm kernel activity in the mapping while you use it as far as I can tell.
> > If you have vm background activity, you want the most efficent way to
> > teardown that pte as you said at the top of the email.
>
> In general the way to use it would be to treat virtualspace as a fully
> associative cache. This also does not imply swapping since the stuff may
> just as easily be cached in highmem. The fact one _could_ do it with
> data sets large than RAM, albeit slowly, is also of interest. It would
> effectively want random page replacement, e.g. a FIFO queue, and some
> method of mitigating its impact on other applications on the system.

That would imply that those methods would be almost as fast as the true
rmap algorithm for swapping? So why don't we use them for everything and
drop rmap? Or are you going to hack the kernel and reboot just before
running a special emulator, and hack this algorithm and reboot again
every time you change the application that uses remap_file_pages, just
to get sane heavy swapping out of it?

with your "swapping emulator" workload all the ram (either 1T or 10M)
will be mapped in this vma populated with remap_file_pages. It's like
running the emulator in a 2.4 kernel w/o rmap, with only the vma tree
cost removed. If it's still always faster it means rmap saves less than
the rbtree cost, and so basically rmap is useless since its main
advantage is in terms of computational complexity IMHO.

The fact is that, as long as it's true that rmap pays off (I mean more
than the rbtree cost), if you need to page the nonlinear vma you'd
better not use a nonlinear vma but use linear vmas instead, so you get
the really scalable and optimized swapping algorithm. In turn, IMHO,
remap_file_pages makes sense only when you don't need to swap.

Making remap_file_pages pageable only helps in terms of security
requirements as far as I can tell, but that can be addressed more simply
by a sysctl too (and then we could also pass the fd to remap_file_pages
as a last argument, which would be a very nice addition IMHO).

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> This is wildly inaccurate. Highly complex mappings rarely fall upon
> >> boundaries suitable for large TLB entries apart from on architectures
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > If they don't fit it's an userspace issue, not a kernel issue.
>
> You say that as if it's in the least bit feasible to get major apps to
> change. They're almost as bad as hardware.

Yes. But some DB allocates ~32G of shm in largepages and runs
significantly faster when largepages are enabled, so I assume it must
use 2M granularity somewhere for the difference to be measurable.

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> (saner than x86) with things like 32KB TLB entry sizes. The only
> >> overhead there is to be saved is metadata (i.e. space conservation)
> >> like buffer_heads and pte_chains, which can just as easily be done with
> >> proper mlock() bypasses.
> >> Basically, this is for the buffer pool, not the shared segment, and
> >> so it can't use large tlb entries anyway on account of the granularity.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > the granularity is 2M, not 32k. The cost of 2M granularity is the same
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > as 32k if you use bigpages, infact it's much less because you only walk
> > 2 level, not 3. so such an application using 32k granularity isn't
> > designed for the high end and it'll have to be fixed to compete with
> > better apps, so no matter the kernel it won't run as fast as the
> > hardware can deliver, it's not a kernel issue.
>
> What on earth? 32KB granularity still needs the 3rd-level. I also fail

When the granularity is 2M (not 32k) it only walks 2 levels; the 3rd
level doesn't exist with largepages. Of course the pagetables will still
be the PAE ones, but largepages generate only a 2-level walk: there's no
pte, there's the pmd only. So it's as if there were 2-level pagetables
for the whole shm. If the app is smart enough to have some locality, a
single TLB entry will cover what could otherwise generate multiple mmaps
(and in turn TLB flushes) and multiple TLB entries.
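
Spelling the walk out (standard x86 PAE layout, shown here only for
reference): a 32-bit virtual address splits as 2 + 9 + 9 + 12 bits for
4KB pages, and as 2 + 9 + 21 bits when the pmd carries the 2MB
page-size bit.

    /* x86 PAE virtual address decomposition (reference sketch). */
    #define PDPT_INDEX(va)  (((va) >> 30) & 0x3)    /*   4 PDPT entries */
    #define PMD_INDEX(va)   (((va) >> 21) & 0x1ff)  /* 512 pmd entries  */
    #define PTE_INDEX(va)   (((va) >> 12) & 0x1ff)  /* 512 pte entries  */

    /* 4KB page:  PDPT -> pmd -> pte -> page, offset = va & 0x000fff */
    /* 2MB page:  PDPT -> pmd(PS=1) -> page,  offset = va & 0x1fffff */
    /* With largepages the pte level simply isn't there for that
     * mapping, and one TLB entry covers the whole 2MB.             */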

> to see a resource scalability impact for the kernels bigpages are used
> on since although they will be eliminated from consideration for
> swapping, it's not in use in combination with rmap-based VM's that I
> know of.

I'm not sure I understand this sentence. There's no scalability issue
at all related to remap_file_pages and largepages. If anything, they can
help scalability by avoiding rmap (i.e. like mlock, or with "slow swap a
la 2.4").

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> The per-zone LRU lists are not related to SMP scalability either. They
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > I never said SMP scalability, I said just plain _scalability_.
>
> To measure scaling a performance metric (possibly workload feasibility)
> and an input are required. For instance, throughput on SDET and number
> of cpus, or amount of RAM and number of processes. "Scalability",
> wherever it's unqualified by such things, is meaningless.

This was about RAM scalability: the more you increase the amount of
RAM, the more you lose in those loops for GFP_KERNEL allocations.
Overall I simply meant that it works better the bigger the box is.

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> are for:
> >> (a) node locality
> >> So a node's kswapd or processes doing try_to_free_pages() only
> >> references struct pages on the local nodes.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > I'm talking about the per-zone, not per-node. Since they're per-zone
> > they're also per-node, but per-node is completely different than
> > per-zone. Infact originally I even had per-node lrus at some point
> > around 2.4.0-testsomething for the reasons you said above. then I giveup
> > since there were too few numa users and more basic numa issues to
> > address first at that time and nobody cared much about numa, I even
> > heard the argument that classzone was bad for numa at some point ;)
>
> Well, the reason why it's per-zone and not just per-node is in point (b).

I know, and it works perfectly for the high end, but I seriously doubt
it is the optimal design for normal 1-2G boxes.

> Per-node would suffice to reap the qualitative benefits in point (a).

yes.

>
>
> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> (b) segregating the search space for page replacement
> >> So the search for ZONE_NORMAL or ZONE_DMA pages doesn't scan
> >> the known universe hunting for a needle in a haystack and spend
> >> 24+ hours doing so.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > yes, and that is good only for the very infrequent allocations, and
> > wastes aging for the common highmem ones. That was my whole point.
>
> The question is largely of whether the algorithm explodes when it's
> asked to recover lowmem.

Yeah, the RAM scalability. So we penalize the boxes where it wouldn't
explode.

I agree that being able to scale on the ~32G boxes is much more
important than running optimally on the 1-2G ones. My whole email is
about pointing out, and not forgetting, that there are also downsides to
those scalability enhancements. That's not always the case, but in this
case it definitely is. In the 2.3.x days I definitely cared more about
the 1-2G boxes than about the 32G ones.

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> Lock contention was addressed differently, by amortizing the lock
> >> acquisitions using the pagevec infrastructure. In general the
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > I said scalability feature, it is definitely a scalability feature, not
> > an smp scalabity one, I never spoke about lock contention, but you will
> > definitely appreciate it more on the 8G boxes due the tiny zone normal
> > they have compared to the highmem one, which is definitely a scalability
> > issue (loads of wasted cpu) but very worthless on a 1G box.
>
> Unfortunately I don't have such a machine at my disposal (the discontig
> code barfs on mem= at the moment). But I'd like to see the results.

A plain bonnie run with -s 1800 on a 2G box would be very interesting
(it should in theory read it all from cache, and if my theory is right
it will do some worthless I/O instead during the read).

> On Mon, Jun 30, 2003 at 08:25:31PM -0700, William Lee Irwin III wrote:
> >> global aging and/or queue ordering information has not been found to be
> >> particularly valuable, or at least no one's complaining that they miss it.
>
> On Tue, Jul 01, 2003 at 06:39:02AM +0200, Andrea Arcangeli wrote:
> > in -aa on a 64G I don't allow a single byte of normal zone to be
> > assigned to highmem capable users, so nobody could ever complain about
> > that in the high end on a good kernel since the whole lru should populated
> > only by highmem users regardless - which would be equivalent to 2.5. On
> > the highend the normal zone is so tiny, that you don't waste if it's
> > loaded in old worthless cache or completely unused, the aging difference
> > with be near zero anyways. But for the low end if you will provide a
> > patch to revert it to a global list and if you give that patch on the
> > list to the 1G-1.5G ram users, then maybe somebody will have a chance to
> > notice the potential regression.
>
> I would like to hear of the results of testing on such systems. I
> personally have doubts any significant difference will be found.

that's why I mentioned those issues. I'd be glad to get numbers
confirming that I'm plain wrong on all of them ;)

Andrea

2003-07-01 08:46:35

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> Okay, so let's speed it up. I've got various hacks related to this in

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> last time we got stuck in objrmap due the complexty complains ( the
> i_mmap links all the vmas of the file, not only the ones relative to the
> range that maps the page that we need to free, so we would end checking
> lots of vmas multiple times for each page)
> In practice it didn't look bad enough to me though, there's mainly the
> security concerns where an user could force the vm not to schedule for
> the time it takes to reach that page (or we could make it schedule w/o
> guaranteeing the unmap, the unmap isn't guaranteed in 2.4 either if the
> user keeps looping, and that's also a feature)

What I found was that enough vma walking was done even on unpressured
systems (something like 4GB/32GB utilized) that RCU'ing the lock helped.
I went from semaphore -> spinlock -> rwlock -> RCU, each with some small
speed increase on the 16x/32GB system.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> my -wli tree (which are even quite a bit beyond objrmap). It includes
>> an NIH of hugh's old anobjrmap patches with some tiny speedups like
>> inlining the rmap functions, which became small enough to do that with,
>> and RCU'ing ->i_shared_lock and anon->lock.
>> Okay this is poorly stated. Those cpu costs do not affect scaling
>> factors.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> those atomic operations happens on the page and the page is shared.
> That's where the smp scalability costs cames from, the cacheline bounces
> off across all cpus that executes gcc at the same time.

Yes, that's why I'm trying to use RCU to eliminate the bitwise locking
overhead on the page except in the two rare situations where chaining is
required in hugh's scheme, i.e. in-flight mremap() and remap_file_pages().
I've been having a very tough time doing this. If you'd like to get
involved I'd more than welcome it.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> The anobjrmap code from Hugh Dickins that I NIH'd for -wli essentially
>> does this, apart from eliminating the per-page locking, and it's
>> already out there.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> I said it would provide the same swapout features, so without the
> complexity issue in the swapout. objrmap has the complexity issue,
> that's its main problem as far as I know. But personally I prefer it.

I actually outright dislike it, but it addresses the resource
scalability issue.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> Kernel compiles aren't really very good benchmarks. I've moved on to

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> true, many important workloads aren't nearly similar to kernel compiles,
> but still kernel compiles are relevant.

First I ask, "What is this exercising?" That answer is largely process
creation and destruction and SMP scheduling latency when there are very
rapidly fluctuating imbalances.

After observing that, the benchmark is flawed because
(a) it doesn't run long enough to produce stable numbers
(b) the results are apparently measured with gettimeofday(), which is
wildly inaccurate for such short-lived phenomena
(c) large differences in performance appear to come about as a result
of differing versions of common programs (i.e. gcc)

The retired #$%@.org benchmark whose name I actually shouldn't have
mentioned in my prior post without consulting a lawyer, while being
some of the shittiest code on earth, seems to have a-c covered.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> #$%@ and gotten various things done to cut down on process creation
>> and destruction, including but not limited to NIH'ing various object-
>> based ptov resolution patches and RCU'ing locks they pound on.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> I'm unsure if RCU is worthwhile for this, here the cost is the writer
> not the reader, the readers is the swapout code (ok, also
> mprotoect/mremap and friends) but the real issue is the writer. The
> reader is always been ok with the spinlock (or at the very least I don't
> mind about the scalability of the reader). RCU (unless you benefit
> greatly from the coalescing of the free operations) could make the
> writer even slower to benefit the reader. So overall RCU here doesn't
> make much sense to me.

I saw spintime in the vma modifiers, and RCU'd as a brute-force cookbook
attack on it without really considering what it was or who it came from.
And it worked. Open & shut case if you ask me. =)


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> There are some steps that need to be taken before merging them, but
>> AFAICT they're progressing toward merging smoothly.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> Which steps? And how did you solve the complexity issue? Who given the
> agreement for merging objrmap? AFIK there was disagreement due the
> complexity issues (hence I thought at a different way that probably
> won't work but that would solve all the complexity issues and the cost
> in the page faults, though it wouldn't be simple at all even if it works)

Dave McCracken is doing something unusual involving truncate() locking
to prevent the situation that causes objrmap trouble. I'm not convinced
I like the approach but I'll settle for results.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> I was proposing that it should be used to prevent pte_chains from being
>> created, not suggesting that it does so now.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> if you want to change mlock to drop the pte_chains then it would
> definitely make mlock a VM bypass, even if not as strong as the
> remap_file_pages that bypass the vma layer too (not only the
> pte/pte_chain side).

Well, the thing is it's closer to the primitive. You're suggesting
making remap_file_pages() both locked and unaligned with the vma, where
it seems to me the real underlying mechanism is using the semantics of
locked memory to avoid creating pte_chains. Bypassing vma's doesn't
seem to be that exciting. There are only a couple of places where an
assumption broken by remap_file_pages() matters, i.e. vmtruncate() and
try_to_unmap_one_obj(), and that can be dodged with exhaustive search
in the non-anobjrmap VM's and is naturally handled by chaining the
distinct virtual addresses where pages are mapped against the page by
the anobjrmap VM's.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> There is nothing particularly special about remap_file_pages() apart
>> from the fact some additional checks are needed when walking ->i_mmap
>> and ->i_mmap_shared to avoid range checks being fooled by it.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> there is something very special about remap_file_pages: it is the only
> way to avoid rmap right now. And if you claim that remap_file_pages
> works better than mmap() for the simulator that allocates more highmem
> than ram, you're just saying that rmap is completely useless even during
> the pte teardown. So either you admit remap_file_pages would be a bad
> idea to use with heavy paging (i.e. as much highmem as you want, booting
> 64G of ram emulated on a 4G box) because it misses rmap, or you admit
> rmap is useless.

What do you mean? pte_chains are still created for remap_file_pages().
The resource scalability issue it addresses is vma's, not pte_chains.
vma creation is largely useless for such complex mappings because it
would create close to one 64B+ structure per PTE.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> remap_file_pages() in combination with hugetlb just doesn't make sense.
>> The buffer pool does not use 2MB windows (though it's configurable, so
>> in principle you _could_). Furthermore 2MB windows are so large the vma
>> overhead of using mmap() could already be tolerated.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> remap_file_pages theoretically provides the lowest possible overhead if
> implemented as I'm advocating for some month. The full true VM bypass,
> even stronger than your mlock that drops the pte_chains and reconstruct
> them when the mlock goes away.

Well, that would basically be the "no pte_chains for mlock()'d memory"
trick plus forcing remap_file_pages() to be mlock()'d plus
remap_file_pages(). Which I don't find terribly offensive, but sounds
a bit less flexible than just using the mlock() trick.


On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> I mean, nothing can be faster than remap_file_pages if implemented as I
> describe (with support of largepages as well). Done that way it would be
> an obvious improvement, for the emulators too (provided a sysctl that
> voids all the security concerns w.r.t. to the amount of pinned ram).

Some environments would probably dislike the privilege requirements.

If I ever get the time to do some farting around in userspace (highly
unlikely), I'd like to write an MUA that uses an aio remap_file_pages()
(freshly implemented for that specific purpose) to window into indices
and mailspools.


On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> But as soon as you refuse to support largepages with remap_file_pages
> you automatically make it slower than mmap, and then remap_file_pages
> becomes almost useless (except for the emulators).

I'm really terribly unconvinced of this hugetlb + remap_file_pages()
stuff. I'm going to write it anyway, but there really isn't enough
overhead as it stands to save anything.


On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> Note: remap_file_pages is not needed at all with current kernels, the
> only reason we may want to backport it to 2.4 is for the lower vma
> overhead (very minor issue) so maybe it's a 0.5-1% or similar (I'm only
> guessing). But the lack of largepage support would offset whatever
> benefit the vma overhead avoidance can provide IMHO (that applies to 2.5
> too of course).

AIUI it was a large issue for some databases (obviously not just the
standalone engine) that had large numbers of processes all performing
highly complex mappings (e.g. 5K vma's per process, 1500 processes or
some such nonsense on the same order(s) of magnitude).
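
Back-of-the-envelope, using the ~64-byte-per-vma figure from earlier in
the thread (the real structure is somewhat bigger, and the workload
numbers are only order-of-magnitude):

    /* Rough lowmem cost of one linear vma per mapping window:
     *   5,000 vmas/process * 1,500 processes * 64 bytes/vma
     *     = 480,000,000 bytes, i.e. roughly 460 MB of ZONE_NORMAL
     * pinned in vma structures alone, before counting pte_chains.
     */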


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> In general the way to use it would be to treat virtualspace as a fully
>> associative cache. This also does not imply swapping since the stuff may
>> just as easily be cached in highmem. The fact one _could_ do it with
>> data sets large than RAM, albeit slowly, is also of interest. It would
>> effectively want random page replacement, e.g. a FIFO queue, and some
>> method of mitigating its impact on other applications on the system.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> that would imply that those methods would be almost as fast as the true
> rmap algorithm for swapping? So why we don't use them for everything and
> we drop rmap? Or are you going to hack the kernel and reboot just before
> running a special emulator and hack this algorithm and reboot every time
> you change the application that uses remap_file_pages to get sane heavy
> swapping out of it?

So we're considering this case:
(a) single-threaded emulator
(b) a large file representing the memory windowed with remap_file_pages()
(c) best case page replacement == FIFO/random

Now, to get FIFO you have to queue pages in order and follow that order
when dequeueing them. Which means you have to be able to unmap an
arbitrary page determined by it falling off the cold end of the not-LRU
queue.

rmap would actually shine here. No pte_chains: every page mapped by the
emulator is PG_direct. It falls off the end of the FIFO queue and the
PTE is pointed at directly by the struct page and everything can be
unmapped in the perfectly random order it wants.
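
(For readers following along: the PG_direct case is the singly-mapped
fast path in the 2.5 rmap code, roughly like the sketch below; the
chain-walking branch is simplified, and next_chain_block() stands in
for the real chain-block encoding.)

    if (PageDirect(page)) {
            /* exactly one mapping: page->pte.direct names the pte
             * itself, so teardown is a single lookup and no chain is
             * ever allocated */
            try_to_unmap_one(page, page->pte.direct);
    } else {
            /* shared page: walk the pte_chain blocks hanging off
             * page->pte.chain and unmap each recorded pte in turn */
            for (pc = page->pte.chain; pc; pc = next_chain_block(pc))
                    /* ... unmap the ptes stored in this block ... */;
    }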

But what I was actually trying to hint at was local page replacement
with adaptive policies. And _no_, that has absolutely nothing to do
with walking the mmlist, and _no_, "local" does not mean some global
thread pounding over the known universe holding locks until the end
of time while doing O(processes*avg_virtualspace) algorithms. Trolls,
I know who you are. Don't jump in on this thread and don't bother me.


On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> with your "swapping emulator" workload all the ram (either 1T or 10M)
> will be mapped in this vma populated with remap_file_pages. It's like
> running the emulator in a 2.4 kernel w/o rmap, with only the vma tree
> cost removed. If it's still always faster it means rmap saves less than
> the rbtree cost, and so basically rmap is useless since its main
> advantage is in terms of computational complexity IMHO.

No more RAM than fits into either the vma or virtualspace can be mapped
by a single process, even with remap_file_pages().

Could you restate this?


On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> The fact is that as far as it's true that rmap pays off (I mean more
> than the rbtree cost), if you need to page the nonlinear vma, you'd
> better not use a nonlinear vma but to use the linear vmas instead so
> you'll get the real scalabile and optimized swapping algorithm. And in
> turn IMHO remap_file_pages makes sense only when you don't need to swap.
> making remap_file_pages pageables only helps in terms of security
> requirements as far as I can tell but that can be addressed more simply
> by a sysctl too (and then we can pass the fd to remap_file_pages too as
> last argument, which would be a very nice addition IMHO).

I'm very unconvinced by these arguments. remap_file_pages() makes good
sense whenever objects are very sparsely used, and I'd really rather
not restrict it to privileged apps.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> You say that as if it's in the least bit feasible to get major apps to
>> change. They're almost as bad as hardware.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> yes. But some db is allocating ~32G of shm in largepages and it runs
> significantly faster when the largepags are enabled, so I assume it must
> use 2M granularity somewhere to measure this difference.

This doesn't sound like a resource scalability issue. I'd not mind
hearing of where the speedups came from since there doesn't seem to be
anything that should make a large difference.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> What on earth? 32KB granularity still needs the 3rd-level. I also fail

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> when the granularity is 2M (not 32k) it only walks 2 levels, the 3rd
> level doesn't exist with largepages. of course the pagetables will be
> the PAE ones still, but the largepages only generates 2 walks, there's
> no PTE, there's the pmd only. So it's like if it's a 2 level pagetables
> for the whole shm. If the app is smart enough to have some locality, the
> single tlb entry will cache what could otherwise generate multiple mmaps
> (and in turn tlb flushes) and multiple tlb entries.

The buffer pool is windowed with a 32KB granularity AFAIK. Mind if I
ask someone from Oracle? jlbec, could you clarify?


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> to see a resource scalability impact for the kernels bigpages are used
>> on since although they will be eliminated from consideration for
>> swapping, it's not in use in combination with rmap-based VM's that I
>> know of.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> not sure to understand this sentence. There's no scalability issue at
> all related to remap_file_pages and largepages. If something they can
> help scalability by avoiding rmap (i.e. like mlock or with "slow swap
> ala 2.4").

That appears to be what people are using hugetlb for at the moment.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> To measure scaling a performance metric (possibly workload feasibility)
>> and an input are required. For instance, throughput on SDET and number
>> of cpus, or amount of RAM and number of processes. "Scalability",
>> wherever it's unqualified by such things, is meaningless.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> This was a ram scalability, the more you increase the amount of ram, the
> more you lose in those loops for GFP_KERNEL allocations. Overall I
> simply meant it works much better the biggest the box is.

IMHO 2.5.x has brought about a modicum of sanity in all kinds of places
so that more aggressive methods for scaling can be attacked in 2.7.x.
There are several major works in progress here.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> Well, the reason why it's per-zone and not just per-node is in point (b).

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> I know and it works perfectly for the highend but I seriously doubt it
> is the optimal design for normal 1-2G boxes.

I'll try to get a hold of Scott Kaplan to get an idea of how to
instrument things and compare the effectiveness on several zone
proportionalities.


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> The question is largely of whether the algorithm explodes when it's
>> asked to recover lowmem.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> yeah the ram scalability. So we penalize the boxes where it wouldn't
> explode.

I'm not sure that argument really flies; things shouldn't really
explode ever. Then again, that doesn't really fly either, because all
you have to do to send any kernel into perpetually spinning with
interrupts off is boot it on a box with a sufficient number of cpus.


On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> I agree being able to scale on the ~32G is much more important to run
> optimally on the 1-2G. My whole email is about pointing out and not
> forgetting that there are also downsides in those scalability
> enhacements. It's not always the case, but in this case definitely yes.
> In the 2.3.x days I definitely cared more about the 1-2G boxes than
> about the 32G ones.

Did your results ever make it to the mailing list archives?


On Mon, Jun 30, 2003 at 11:33:17PM -0700, William Lee Irwin III wrote:
>> Unfortunately I don't have such a machine at my disposal (the discontig
>> code barfs on mem= at the moment). But I'd like to see the results.

On Tue, Jul 01, 2003 at 09:49:15AM +0200, Andrea Arcangeli wrote:
> A plain bonnie run with -s 1800 on a 2G box would be very interesting
> (it should in theory read it all from cache, and if my theory is right
> it will do some worthless I/O instead during the read).

I borrowed a specweb client from Dave Hansen overnight. If I finish
debugging whatever went wrong with highpmd I'll try booting with mem=2G
since that isn't discontig.


-- wli

2003-07-01 09:14:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, Jul 01, 2003 at 01:59:39AM -0700, William Lee Irwin III wrote:
> After observing that, the benchmark is flawed because
> (a) it doesn't run long enough to produce stable numbers
> (b) the results are apparently measured with gettimeofday(), which is
> wildly inaccurate for such short-lived phenomena
> (c) large differences in performance appear to come about as a result
> of differing versions of common programs (i.e. gcc)

Not enough time right now to answer the whole email, which is growing
and growing in size ;), but I wanted to add a quick comment on this.
Many shell loads happen to do something similar, and the speed of
compilation will be a very important factor until you rewrite make and
gcc not to exit and to compile multiple files from a single invocation.

The fact is that this is not a flawed benchmark; it is a real-life
workload that you can't avoid dealing with, and I want my kernel to run
the fastest on the most common apps I run. I don't mind if swapping is
slightly slower; I simply don't swap all the time, while I tend to keep
the CPU 100% busy always. Still, I want the best possible swapping that
is zero-cost for me on the other side. Giving me a
CONFIG_SLOWSWAP_FAST_GCC would be more than enough to make me happy. I
don't think I'll put up with the rmap slowdown while migrating to 2.6 if
it keeps showing up in the profiling. Especially since Martin's numbers
were not good.

Andrea

2003-07-01 10:45:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Mon, 30 Jun 2003, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > described this way it sounds like NOFAIL imply a deadlock condition.
>
> NOFAIL is what 2.4 has always done, and has the deadlock opportunities
> which you mention. The other modes allow the caller to say "don't try
> forever".

__GFP_NOFAIL is also very badly named: patently it can and does fail,
when PF_MEMALLOC or PF_MEMDIE is set, or __GFP_WAIT is not. Or is the
idea that its users might as well oops when it does fail? Should its
users be changed to use the less perniciously named __GFP_REPEAT, or
should __alloc_pages be changed to deadlock more thoroughly?

Hugh

2003-07-01 11:55:11

by tu guangxiu

[permalink] [raw]
Subject: severe problem of linux 2.4.21 usb


Lists,
Kernel: Linux 2.4.21
I met a severe problem when using usbnet. When I
unplugged the PL2301-based cable, Linux froze.
I tested it with Linux 2.4.20; there was no such
problem. Please help!

tu



2003-07-01 14:10:13

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

> First I ask, "What is this exercising?" That answer is largely process
> creation and destruction and SMP scheduling latency when there are very
> rapidly fluctuating imbalances.
>
> After observing that, the benchmark is flawed because
> (a) it doesn't run long enough to produce stable numbers
> (b) the results are apparently measured with gettimeofday(), which is
> wildly inaccurate for such short-lived phenomena

Bullshit. Use a maximal config file, and run it multiple times. I have
sub 0.5% variance.

> (c) large differences in performance appear to come about as a result
> of differing versions of common programs (i.e. gcc)

So use the same frigging gcc each time. Why you'd want to screw with
userspace at the same time as the kernel is a mystery to me. If you
change gcc, you're also changing what's compiling your kernel, so you'll
get different binaries - ergo the whole argument is fallacious anyway,
and *any* benchmarking you're doing is completely inaccurate.

>> if you want to change mlock to drop the pte_chains then it would
>> definitely make mlock a VM bypass, even if not as strong as the
>> remap_file_pages that bypass the vma layer too (not only the
>> pte/pte_chain side).
>
> Well, the thing is it's closer to the primitive. You're suggesting
> making remap_file_pages() both locked and unaligned with the vma, where
> it seems to me the real underlying mechanism is using the semantics of
> locked memory to avoid creating pte_chains. Bypassing vma's doesn't
> seem to be that exciting. There are only a couple of places where an
> assumption remap_file_pages() breaks matters, i.e. vmtruncate() and
> try_to_unmap_one_obj(), and that can be dodged with exhaustive search
> in the non-anobjrmap VM's and is naturally handled by chaining the
> distinct virtual addresses where pages are mapped against the page by
> the anobjrmap VM's.

If we just lock the damned things in memory, OR flag them at create
time (or at least before use), none of this is an issue anyway - we
get rid of all the conversion stuff. Seeing as this is mainly for big
DBs, I still don't see why mem locking it is a problem - no reason
for it to fuck up the rest of the VM.

M.

2003-07-01 16:08:07

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

At some point in the past, I wrote:
>> First I ask, "What is this exercising?" That answer is largely process
>> creation and destruction and SMP scheduling latency when there are very
>> rapidly fluctuating imbalances.
>> After observing that, the benchmark is flawed because
>> (a) it doesn't run long enough to produce stable numbers
>> (b) the results are apparently measured with gettimeofday(), which is
>> wildly inaccurate for such short-lived phenomena

On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
> Bullshit. Use a maximal config file, and run it multiple times. I have
> sub 0.5% variance.

My thought here had more to do with the measurements becoming dominated
by ramp-up and ramp-down than with the thing literally producing
unreliable timings. "Instability" was almost certainly the wrong word.

I'm also skeptical of its usefulness for scalability comparisons with a
single-threaded phase like the linking phase and the otherwise large
variations in concurrency. It seems much more like a binary test of
"does it get slower when I add more cpus?" than a measure of scalability.

For instance, if you were to devise some throughput measure say per
gigacycle based on this and compare efficiencies on various systems
with it so as to measure scaling factors, what would it be? Would you
feel comfortable using it for scalability comparisons given the
concurrency limitations for a single compile as a benchmark?

Well, at the very least I don't.


At some point in the past, I wrote:
>> (c) large differences in performance appear to come about as a result
>> of differing versions of common programs (i.e. gcc)

On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
> So use the same frigging gcc each time. Why you want to screw with
> userspace at the same time as the kernel is a mystery to me. If you
> change gcc, you're also changing what's compiling your kernel, so you'll
> get different binaries - ergo the whole argument is fallacious anyway,
> and *any* benchmarking you're doing is completely innacurrate.

This is all true. Yet how many of those who conduct the tests are so
careful?


At some point in the past, I wrote:
>> Well, the thing is it's closer to the primitive. You're suggesting
>> making remap_file_pages() both locked and unaligned with the vma, where
>> it seems to me the real underlying mechanism is using the semantics of
>> locked memory to avoid creating pte_chains. Bypassing vma's doesn't
>> seem to be that exciting. There are only a couple of places where an
>> assumption remap_file_pages() breaks matters, i.e. vmtruncate() and
>> try_to_unmap_one_obj(), and that can be dodged with exhaustive search
>> in the non-anobjrmap VM's and is naturally handled by chaining the
>> distinct virtual addresses where pages are mapped against the page by
>> the anobjrmap VM's.

On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
> If we just lock the damned things in memory, OR flag them at create
> time (or at least before use), none of this is an issue anyway - we
> get rid of all the conversion stuff. Seeing as this is mainly for big
> DBs, I still don't see why mem locking it is a problem - no reason
> for it to fuck up the rest of the VM.

Partially disabling it is a workable solution. I'm moderately skeptical
about page_convert_anon(), though. Not so much as to whether it works
or not, but basically whether it's the right approach. It would seem
incrementally chaining ptes mismatching the file offset would do better
than trying to perform numerous allocations at once, especially since
the cached pte_chains are already at hand with the various points that
were arranged to do pte_chain_alloc() safely around the VM. I suspect
part of the trouble there is that the mapcount needed to rapidly
determine whether a page is mapped shares the space with the chain,
but unsharing it can be avoided by shoving the counter into one of the
chain's blocks in the chained filebacked page case a la anobjrmap.

I suppose there's also an aesthetic question of whether to choose an
approach that integrates it smoothly as opposed to one that turns it
into a Frankensteinian bolt poking out of the VM's proverbial neck.
As it stands now it's somewhat more useful and prettier than DSHM etc.
equivalents elsewhere, at least API-wise. I actually thought the
resource scalability issues were getting addressed without disfiguring
it for some reason.

Also, have you checked with the big DB's about whether they can
tolerate privilege requirements for using it?


-- wli

2003-07-01 17:27:41

by Daniel Phillips

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tuesday 01 July 2003 03:39, Mel Gorman wrote:
> I'm writing a small paper on the 2.6 VM for a conference.

Nice stuff, and very timely.

> In 2.4, Page Table Entries (PTEs) must be allocated from ZONE_NORMAL as
> the kernel needs to address them directly for page table traversal. In a
> system with many tasks or with large mapped memory regions, this can
> place significant pressure on ZONE_NORMAL so 2.6 has the option of
> allocating PTEs from high memory.

You probably ought to mention that this is only needed by 32 bit architectures
with silly amounts of memory installed. On that topic, you might mention
that the VM subsystem generally gets simpler and in some cases faster (i.e.,
no more highmem mapping cost) in the move to 64 bits.

You also might want to mention pdflush.

Regards,

Daniel

2003-07-01 17:52:56

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

--On Tuesday, July 01, 2003 09:22:04 -0700 William Lee Irwin III <[email protected]> wrote:

> At some point in the past, I wrote:
>>> First I ask, "What is this exercising?" That answer is largely process
>>> creation and destruction and SMP scheduling latency when there are very
>>> rapidly fluctuating imbalances.
>>> After observing that, the benchmark is flawed because
>>> (a) it doesn't run long enough to produce stable numbers
>>> (b) the results are apparently measured with gettimeofday(), which is
>>> wildly inaccurate for such short-lived phenomena
>
> On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
>> Bullshit. Use a maximal config file, and run it multiple times. I have
>> sub 0.5% variance.
>
> My thought here had more to do with the measurements becoming dominated
> by ramp-up and ramp-down rather than the thing literally producing unreliable
> timings. "Instability" was almost certainly the wrong word.
>
> I'm also skeptical of its usefulness for scalability comparisons with a
> single-threaded phase like the linking phase and the otherwise large
> variations in concurrency. It seems much more like a binary test of
> "does it get slower when I add more cpus?" than a measure of scalability.
>
> For instance, if you were to devise some throughput measure say per
> gigacycle based on this and compare efficiencies on various systems
> with it so as to measure scaling factors, what would it be? Would you
> feel comfortable using it for scalability comparisons given the
> concurrency limitations for a single compile as a benchmark?

I'm not convinced it's that limited. I'm getting about 1460% cpu out
of 16 processors - that's pretty well parallelized in my mind.
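
For what it's worth, that works out as a crude efficiency figure:

    1460% CPU / 16 CPUs ~= 91% average utilisation across the run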

> On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
>> So use the same frigging gcc each time. Why you want to screw with
>> userspace at the same time as the kernel is a mystery to me. If you
>> change gcc, you're also changing what's compiling your kernel, so you'll
>> get different binaries - ergo the whole argument is fallacious anyway,
>> and *any* benchmarking you're doing is completely inaccurate.
>
> This is all true. Yet how many of those who conduct the tests are so
> careful?

Anyone who's not that careful should find something else to do in life
than run benchmarks. Crap data is worse than no data. At the very least,
it takes some pretty active stupidity to not have numbers that are
consistent with themselves.

Again, I don't see how this demonstrates anything wrong with the test ;-)
In fact, it's better, since the variances are more obvious if you
screw it up.

> On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
>> If we just lock the damned things in memory, OR flag them at create
>> time (or at least before use), none of this is an issue anyway - we
>> get rid of all the conversion stuff. Seeing as this is mainly for big
>> DBs, I still don't see why mem locking it is a problem - no reason
>> for it to fuck up the rest of the VM.
>
> Partially disabling it is a workable solution. I'm moderately skeptical
> about page_convert_anon(), though. Not so much as to whether it works
> or not, but basically whether it's the right approach. It would seem
> incrementally chaining ptes mismatching the file offset would do better
> than trying to perform numerous allocations at once, especially since
> the cached pte_chains are already at hand with the various points that
> were arranged to do pte_chain_alloc() safely around the VM. I suspect
> part of the trouble there is that the mapcount needed to rapidly
> determine whether a page is mapped shares the space with the chain,
> but unsharing it can be avoided by shoving the counter into one of the
> chain's blocks in the chained filebacked page case a la anobjrmap.

I don't think anyone is desperately fond of the conversion stuff, it
just seems to be the most pragmatic solution. However, I think the
main problem is more that people seem to think that nonlinear VMAs
are somehow sacred, and cannot be touched in this process - I'm not
convinced by that, as long as we continue to fulfill their original
goal.

For one, I'm not convinced that keeping the VMAs (or equiv data for
non-linear dereference) around is a worse problem than the PTE-chains
themselves. Call it a "per linear subsegment pte-chain equivalent"
if that makes you feel better about it. Better per linear area than
per page, IMHO.

Of course, if we can agree to just lock the damned things in memory,
all this goes away.

> I suppose there's also an aesthetic question of whether to choose an
> approach that integrates it smoothly as opposed to one that turns it
> into a Frankensteinian bolt poking out of the VM's proverbial neck.
> As it stands now it's somewhat more useful and prettier than DSHM etc.
> equivalents elsewhere, at least API-wise. I actually thought the
> resource scalability issues were getting addressed without disfiguring
> it for some reason.

I'd like it to be elegant. But I'd settle for it working ;-)

M.

2003-07-01 20:08:09

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

>> In 2.4, Page Table Entries (PTEs) must be allocated from ZONE_NORMAL as
>> the kernel needs to address them directly for page table traversal. In a
>> system with many tasks or with large mapped memory regions, this can
>> place significant pressure on ZONE_NORMAL so 2.6 has the option of
>> allocating PTEs from high memory.
>
> You probably ought to mention that this is only needed by 32 bit architectures
> with silly amounts of memory installed.

Actually, it has more to do with the number of processes sharing data,
than the amount of memory in the machine. And that's only because we
insist on making duplicates of identical pagetables all over the place ...

M.

2003-07-01 21:27:07

by Mel Gorman

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Mon, 30 Jun 2003, Daniel Phillips wrote:

> On Tuesday 01 July 2003 03:39, Mel Gorman wrote:
> > I'm writing a small paper on the 2.6 VM for a conference.
>
> Nice stuff, and very timely.
>

I was hoping someone else would write it so I could read it but that's what
I said about the 2.4 VM :-) . Yep, once again, my contributions are mainly
documentation related; believe it or not, I actually do code a bit from time
to time.

I was going to update the whole document based on this thread and repost
it but it's looking like it'll take me a few days to a week before I work
through it all (so I'm slow, sue me). This is especially true as there are
a lot of old email threads I need to read before I understand 100% of the
current discussion (which is also why I'm not replying to most posts in
this thread). Instead, I'm going to post up the bits that are changed and
hopefully get everything together.

This is the first change....

> You probably ought to mention that this is only needed by 32 bit architectures
> with silly amounts of memory installed.

Point... Taking into account what Martin said, the introduction to "PTEs
in high memory" now reads;

--Begin Extract--
PTEs in High Memory
===================

In 2.4, Page Table Entries (PTEs) must be allocated from ZONE_NORMAL as
the kernel needs to address them directly for page table traversal. In a
system with many tasks or with large mapped memory regions, this can place
significant pressure on ZONE_NORMAL so 2.6 has the option of allocating
PTEs from high memory.

Allocating PTEs from high memory is a compile-time option for two reasons.
First and foremost, this is only really needed by 32-bit architectures
with very large amounts of memory or when the workloads require many
processes to share pages. With lower memory machines or 64-bit
architectures, it is simply not required. Patches were submitted that
would allow page tables to be shared between processes in a Copy-On-Write
fashion, which would mitigate the need for high memory PTEs, but they were
never merged.
--End Extract--

> On that topic, you might mention
> that the VM subsystem generally gets simpler and in some cases faster (i.e.,
> no more highmem mapping cost) in the move to 64 bits.
>

I'm wary of making a statement like that. I'm not sure the code is actually
simpler with 64 bit but to me it looks about as complicated (or simple
depending on your perspective). On the faster point, I understand that it
is possible to have a net loss due to TLB and CPU cache misses. In this
case, I think I'll just keep quiet.

> You also might want to mention pdflush.
>

Added to the todo list as well as object based rmap. I know object based
rmap isn't merged but it is discussed enough that I'll put the time in to
write about it.

--
Mel Gorman

2003-07-01 21:31:10

by Mel Gorman

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Mon, 30 Jun 2003, William Lee Irwin III wrote:

> On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> >> Delayed Coalescing
> >> ==================
> >> 2.6 extends the buddy algorithm to resemble a lazy buddy algorithm [BL89]
> >> which delays the splitting and coalescing of buddies until it is
> >> necessary. The delay is implemented only for 0 order allocations with the
>

On delayed coalescing, I was seeing things that weren't there. I've removed
this section and changed it to:

--Begin Extract--
Per-CPU Page Lists
==================

The most frequent type of allocation or free is an order-0 (i.e. one page)
allocation or free. In 2.4, each page allocation or free requires the
acquisition of an interrupt safe spinlock to protect the lists of free
pages which is an expensive operation. To reduce lock contention, kernel
2.6 has per-cpu page lists of order-0 pages called pagesets.

These pagesets contain two lists for hot and cold pages where hot pages
have been recently used and can still be expected to be present in the CPU
cache. For an allocation, the pageset for the running CPU will be first
checked and if pages are available, they will be allocated. To determine
when the pageset should be emptied or filled, two watermarks are in place.
When the low watermark is reached, a batch of pages will be allocated and
placed on the list. When the high watermark is reached, a batch of pages
will be freed at the same time. Higher order allocations still require the
interrupt safe spinlock to be held and there is no delay in the splits or
coalescing.

While coalescing of order-0 pages is delayed, this is not a lazy buddy
algorithm [BL89]. While pagesets introduce a merging delay for order-0
allocations, it is a side-effect rather than an intended feature and there
is no method available to drain the pagesets and merge the buddies. In
other words, despite the per-cpu and new accounting code bulking up the
amount of code in mm/page_alloc.c, the core of the buddy algorithm remains
the same as it was in 2.4.

The implication of this change is straightforward; the number of times
the spinlock protecting the buddy lists must be acquired is reduced.
Higher order allocations are relatively rare in Linux so the optimisation
is for the common case. This change will be noticeable on machines with a
large number of CPUs but will make little difference to single CPUs. There
are some issues with the idea, although they are not considered a serious
problem. The first item of note is that high order allocations may fail if
many of the pagesets are just below the high watermark. The second is that
when memory is low and the current CPU pageset is empty, an allocation may
fail as there is no means of draining remote pagesets. The last problem is
that buddies of newly freed pages may exist in other pagesets leading to
possible fragmentation problems.
--End Extract--
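
To make the batching concrete, here is a rough user-space sketch of how a
pageset behaves (this is not the mm/page_alloc.c code; the structure, the
names and the exact watermark semantics are simplified for illustration,
and global_alloc()/global_free() merely stand in for the buddy allocator).
The point is that the buddy lists, and hence their lock, are touched once
per batch of pages rather than once per page:

#include <stdio.h>
#include <stdlib.h>

#define LOW_WATERMARK   4    /* refill the per-CPU list when it gets this small */
#define HIGH_WATERMARK 16    /* drain the per-CPU list when it grows this large */
#define BATCH           8    /* pages moved per trip to the global buddy lists  */

struct pageset {
        int count;                              /* pages on the per-CPU list */
        void *pages[HIGH_WATERMARK + BATCH];
};

/* Stand-ins for the buddy allocator; in the kernel each trip here is what
 * would take the interrupt safe spinlock. */
static void *global_alloc(void) { return malloc(4096); }
static void global_free(void *page) { free(page); }

static void *pageset_alloc(struct pageset *ps)
{
        if (ps->count <= LOW_WATERMARK) {
                for (int i = 0; i < BATCH; i++)         /* one batched refill */
                        ps->pages[ps->count++] = global_alloc();
        }
        return ps->pages[--ps->count];
}

static void pageset_free(struct pageset *ps, void *page)
{
        ps->pages[ps->count++] = page;
        if (ps->count >= HIGH_WATERMARK) {
                for (int i = 0; i < BATCH; i++)         /* one batched drain */
                        global_free(ps->pages[--ps->count]);
        }
}

int main(void)
{
        struct pageset ps = { .count = 0 };
        void *page = pageset_alloc(&ps);        /* triggers a batched refill */
        pageset_free(&ps, page);
        printf("pages left on the per-CPU list: %d\n", ps.count);
        return 0;
}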

2003-07-01 21:32:25

by Mel Gorman

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, 1 Jul 2003, Andrea Arcangeli wrote:

> On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> > Reverse Page Table Mapping
> > ==========================
> >
> > <rmap stuff snipped>
>
> you mention only the positive things, and never the fact that's the most
> hurting piece of kernel code in terms of performance and smp scalability
> until you actually have to swapout or pageout.
>

You're right, I was commenting only on the positive side of things. I
didn't pay close enough attention to the development of the 2.5 series so
right now I can only comment on what's there and only to a small extent on
what it means or why it might be a bad thing. Here goes a more balanced
view...

Based on what I can decipher from this thread and other rmap related
threads, I've added this to the end of the rmap section. I'm still working
on the non-linear issues. It'll probably take me another day or two to get
that together.

--Begin Extract--
There are two main benefits, both page-out related, with the introduction
of reverse mapping. The first is with the management of page tables. In
2.4, anonymous shared pages had to be placed in a swap cache until all
references had been found. This could result in a large number of minor
faults as pages adjacent in virtual space were moved to the swap cache,
resulting in unnecessary page table updates. The second benefit is with
the actual paging out mechanism. 2.6 is much better at selecting the
correct page and atomically swapping it out from each virtual address
space.

Reverse mapping is not free though. The first obvious penalty is the space
requirements for PTE chains. The space requirements will obviously depend
on the number of shared pages, especially anonymous pages, in the system.
Patches were submitted for the sharing of page tables but they were not
merged into the mainline kernel. The second penalty is the CPU time
required to maintain the mappings but no figures are available to measure
how severe this is. The last point to note is that reverse mapping is only
of benefit when page-outs are frequent, which is mainly the case with
lower-end machines. With large memory machines or with workloads that do
not cause much swapping, there are only costs to reverse mapping but no
benefits.
--End Extract--
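
To illustrate what that space buys, here is a trivial user-space model of
the idea (it is not the kernel's pte_chain code; the types and helpers are
invented for the sketch, with page_add_mapping() standing loosely for what
page_add_rmap() does). Each page keeps a chain of pointers back to the PTE
slots that map it, so paging it out means walking its mappers rather than
every page table in the system:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long pte_t;            /* toy page table entry */

/* One chain element per mapping of the page, pointing back at the PTE. */
struct pte_chain {
        pte_t *ptep;
        struct pte_chain *next;
};

struct page {
        struct pte_chain *chain;        /* head of the reverse mapping chain */
};

/* Record that *ptep now maps this page. */
static void page_add_mapping(struct page *page, pte_t *ptep)
{
        struct pte_chain *pc = malloc(sizeof(*pc));
        pc->ptep = ptep;
        pc->next = page->chain;
        page->chain = pc;
}

/* Clear every PTE that maps the page; with the chain in place this only
 * touches the sharers of this one page, which is the page-out benefit. */
static void page_unmap_all(struct page *page)
{
        struct pte_chain *pc = page->chain;
        while (pc) {
                struct pte_chain *next = pc->next;
                *pc->ptep = 0;
                free(pc);
                pc = next;
        }
        page->chain = NULL;
}

int main(void)
{
        struct page page = { .chain = NULL };
        pte_t pte_in_task_a = 0x1000, pte_in_task_b = 0x1000;

        page_add_mapping(&page, &pte_in_task_a);
        page_add_mapping(&page, &pte_in_task_b);
        page_unmap_all(&page);
        printf("ptes after unmap: %lx %lx\n", pte_in_task_a, pte_in_task_b);
        return 0;
}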

--
Mel Gorman

2003-07-01 21:44:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, 1 Jul 2003, Mel Gorman wrote:

> I was hoping someone else would write it so I could read it but thats what
> I said about the 2.4 VM :-) . Yep, once again, my contributions are mainly
> documenting related, believe it or not, I actually do code a bit from time
> to time

I happened to take a look at one of your docs and it was pretty good indeed.
You're underestimating the value of documentation, which many
developers do not love to do. A university professor of mine once told me
that developers do not want to do documentation, not because they're lazy,
but because they do not want to show that what they did, at the very end,
is not that cool ;)



- Davide

2003-07-01 21:44:09

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

> I was hoping someone else would write it so I could read it but thats what
> I said about the 2.4 VM :-)

Sigh. Sadly I have a lot of this written up (including for object-based
rmap you were thinking about doing, etc), but it's an OLS paper, so I
can't release it, I believe. If you need this after OLS, it's easy ;-)

M.

2003-07-01 21:52:54

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

> On delayed coalescing, I was seeing things that weren't there. I've removed
> this section and changed it to:

Heh. I was wondering about that ...

> --Begin Extract--
> Per-CPU Page Lists
> ==================
>
> The most frequent type of allocation or free is an order-0 (i.e. one page)
> allocation or free. In 2.4, each page allocation or free requires the
> acquisition of an interrupt safe spinlock to protect the lists of free
> pages which is an expensive operation. To reduce lock contention, kernel
> 2.6 has per-cpu page lists of order-0 pages called pagesets.
>
> These pagesets contain two lists for hot and cold pages where hot pages
> have been recently used and can still be expected to be present in the CPU
> cache. For an allocation, the pageset for the running CPU will be first
> checked and if pages are available, they will be allocated. To determine
> when the pageset should be emptied or filled, two watermarks are in place.
> When the low watermark is reached, a batch of pages will be allocated and
> placed on the list. When the high watermark is reached, a batch of pages
> will be freed at the same time. Higher order allocations still require the
> interrupt safe spinlock to be held and there is no delay in the splits or
> coalescing.
>
> While coalescing of order-0 pages is delayed, this is not a lazy buddy
> algorithm [BL89]. While pagesets introduce a merging delay for order-0
> allocations, it is a side-effect rather than an intended feature and there
> is no method available to drain the pagesets and merge the buddies. In
> other words, despite the per-cpu and new accounting code bulking up the
> amount of code in mm/page_alloc.c, the core of the buddy algorithm remains
> the same as it was in 2.4.
>
> The implication of this change is straightforward; the number of times
> the spinlock protecting the buddy lists must be acquired is reduced.
> Higher order allocations are relatively rare in Linux so the optimisation
> is for the common case. This change will be noticeable on machines with a
> large number of CPUs but will make little difference to single CPUs. There
> are some issues with the idea, although they are not considered a serious
> problem. The first item of note is that high order allocations may fail if
> many of the pagesets are just below the high watermark. The second is that
> when memory is low and the current CPU pageset is empty, an allocation may
> fail as there is no means of draining remote pagesets. The last problem is
> that buddies of newly freed pages may exist in other pagesets leading to
> possible fragmentation problems.


Looks good. Might be useful to distinguish more carefully between the hot
and cold lists - what you've described is basically just the cold list.

The hot list is similar, except it's also used as a LIFO stack, so the
most recently freed page is assumed to be cache-warm, and is reallocated
first. This reduces the overall number of cacheline misses in the system,
by reusing cachelines that are already present in that CPU's cache.

Moreover, the cold list tries to use pages that are NOT in another CPU's
cache. The main thing that allocates from the cold list is DMA operations,
and the main thing that populates it is page-reclaim. Other things are
generally assumed to be hot (this is one of the areas where more work
could probably be done ...)
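
In code terms the policy is roughly the fragment below (purely illustrative,
not the kernel's structures; all the names are made up). Frees of warm pages
push onto the hot list and allocations pop the same end, so the most recently
freed page goes out first, while reclaim and DMA-style users deal with the
cold list:

#include <stdio.h>

#define DEPTH 8

struct cpu_lists {
        int hot[DEPTH],  nr_hot;        /* likely still in this CPU's cache */
        int cold[DEPTH], nr_cold;       /* assumed cache-cold (reclaim, DMA) */
};

/* Hot path: LIFO, the last page freed is the first one handed back out. */
static void free_hot(struct cpu_lists *c, int page)  { c->hot[c->nr_hot++] = page; }
static int  alloc_hot(struct cpu_lists *c)           { return c->hot[--c->nr_hot]; }

/* Cold path: pages nobody expects to find in a CPU cache. */
static void free_cold(struct cpu_lists *c, int page) { c->cold[c->nr_cold++] = page; }
static int  alloc_cold(struct cpu_lists *c)          { return c->cold[--c->nr_cold]; }

int main(void)
{
        struct cpu_lists c = { .nr_hot = 0, .nr_cold = 0 };

        free_hot(&c, 1);
        free_hot(&c, 2);                /* page 2 was freed last...          */
        printf("hot alloc reuses page %d\n", alloc_hot(&c));  /* ...so reused first */

        free_cold(&c, 3);
        printf("cold alloc hands out page %d\n", alloc_cold(&c));
        return 0;
}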

M.

2003-07-02 02:50:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, Jul 01, 2003 at 10:54:56AM -0700, Martin J. Bligh wrote:
> --On Tuesday, July 01, 2003 09:22:04 -0700 William Lee Irwin III <[email protected]> wrote:
>
> > At some point in the past, I wrote:
> >>> First I ask, "What is this exercising?" That answer is largely process
> >>> creation and destruction and SMP scheduling latency when there are very
> >>> rapidly fluctuating imbalances.
> >>> After observing that, the benchmark is flawed because
> >>> (a) it doesn't run long enough to produce stable numbers
> >>> (b) the results are apparently measured with gettimeofday(), which is
> >>> wildly inaccurate for such short-lived phenomena
> >
> > On Tue, Jul 01, 2003 at 07:24:03AM -0700, Martin J. Bligh wrote:
> >> Bullshit. Use a maximal config file, and run it multiple times. I have
> >> sub 0.5% variance.
> >
> > My thought here had more to do with the measurements becoming dominated
> > by ramp-up and ramp-down rather than the thing literally producing unreliable
> > timings. "Instability" was almost certainly the wrong word.
> >
> > I'm also skeptical of its usefulness for scalability comparisons with a
> > single-threaded phase like the linking phase and the otherwise large
> > variations in concurrency. It seems much more like a binary test of
> > "does it get slower when I add more cpus?" than a measure of scalability.
> >
> > For instance, if you were to devise some throughput measure say per
> > gigacycle based on this and compare efficiencies on various systems
> > with it so as to measure scaling factors, what would it be? Would you
> > feel comfortable using it for scalability comparisons given the
> > concurrency limitations for a single compile as a benchmark?
>
> I'm not convinced it's that limited. I'm getting about 1460% cpu out
> of 16 processors - that's pretty well parallelized in my mind.

yes, especially considering what William said above.

But the point here is not to measure scalability in absolute terms; the
point IMHO is the scalability regression introduced by rmap (w/o
objrmap).

The fact that the kernel workload isn't the most scalable thing on earth
simply means other workloads doing the same thing that make+gcc does - but
never serializing in linking - will be hurt even more.

Andrea

2003-07-02 02:54:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, Jul 01, 2003 at 10:46:31PM +0100, Mel Gorman wrote:
> On Tue, 1 Jul 2003, Andrea Arcangeli wrote:
>
> > On Tue, Jul 01, 2003 at 02:39:47AM +0100, Mel Gorman wrote:
> > > Reverse Page Table Mapping
> > > ==========================
> > >
> > > <rmap stuff snipped>
> >
> > you mention only the positive things, and never the fact that's the most
> > hurting piece of kernel code in terms of performance and smp scalability
> > until you actually have to swapout or pageout.
> >
>
> You're right, I was commenting only on the positive side of things. I
> didn't pay close enough attention to the development of the 2.5 series so
> right now I can only comment on what's there and only to a small extent on
> what it means or why it might be a bad thing. Here goes a more balanced
> view...

never mind, I think for your talk that was just perfect ;) Though I
think your last paragraph addition on the rmap thing is fair enough.

I only abused your very nice and detailed list of features to comment
on some that IMHO had some drawbacks (and for some [not rmap] I don't
recall any discussion about their drawbacks on l-k ever, that's why I
answered).

thanks,

Andrea

2003-07-02 08:47:28

by Mel Gorman

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, 1 Jul 2003, Martin J. Bligh wrote:

>
> Sigh. Sadly I have a lot of this written up (including for object-based
> rmap you were thinking about doing, etc), but it's an OLS paper, so I

I have it on the TODO list all right.

As for written up, "Sadly" my eye. I'm delighted you have it written up,
it means I can check how close I am/was whenever I read it! I'm not going
to reach OLS unless I find a giant trebuchet facing west to slingshot me
over so I'll just dig it out of the proceedings :-)

--
Mel Gorman

2003-07-02 15:43:13

by Mel Gorman

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Tue, 1 Jul 2003, Andrea Arcangeli wrote:

> > Non-Linear Populating of Virtual Areas
> > ======================================
> >
> and it was used to break truncate, furthermore the API doesn't look good
> to me, the vma should have a special VM_NONLINEAR created with a
> MAP_NONLINEAR so the vm will skip it entirely and it should be possible
> to put multiple files in the same vma IMHO.

OK, I think I absorbed most of this and read through most of the old
threads on the subject that I could find. It is very difficult to express
all the viewpoints in any type of concise manner so I'm settling for
getting about 70% of it.

The discussions on what API to use instead are all over the place and I
got lost in a twisty maze of emails, all similar. Below is the summary of
what I found that could be made into something coherent.

--Begin Extract--
Whether this feature should remain is still being argued but it is likely
to remain until an acceptable alternative is implemented. The main
benefits only apply to applications such as database servers or
virtualising applications such as emulators. There are a small number of
reasons why it was introduced.

The first is space requirements for large numbers of mappings. For every
mapped region a process needs, a VMA has to be allocated to it, which
becomes a considerable space commitment when a process needs a large
number of mappings. In the worst case, there will be one VMA for every
file-backed page mapped by the process.

The second reason is avoiding the poor linear search algorithm used by
get_unmapped_area() when looking for a large free virtual area. With a
large number of mappings, this search is very expensive. It has been
proposed to alter the function to perform a tree based search. This could
be a tree of free areas ordered by size for example but none has yet been
implemented. In the meantime, non-linear mappings are being used to bypass
the VM.

The third reason is related to frequent page faults associated with linear
mappings. A non-linear mapping is able to prefault in all pages that are
required by the mapping as it is presumed they will be needed very soon.
To some extent, this can be addressed by specifying the MAP_POPULATE flag
when calling mmap() for a normal mapping.

This feature has a very serious drawback. The system calls truncate() and
mincore() are broken with respect to non-linear mappings. Both calls
depend on vm_area_struct->vm_pgoff, which is the offset within
the mapped file, but the field is meaningless within a non-linear mapping.
This means that truncated files will still have mapped pages that no
longer have a physical backing. A number of possible solutions, such as
allowing the pages to exist but be anonymous and private to the process,
have been suggested but none implemented.

--End Extract--
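
For completeness, this is roughly what the interface looks like from user
space (a sketch only, written against the API as I understand it from the
patches and discussion: prot is expected to be 0, the offset is given in
pages rather than bytes, the flags are currently ignored, the "datafile"
name is made up, and an older glibc may need a syscall() wrapper for
remap_file_pages):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t window = 16 * page;              /* small window onto a big file */
        int fd = open("datafile", O_RDWR);      /* assumed to be >1000 pages long */
        if (fd < 0) { perror("open"); return 1; }

        /* A single VMA covers the whole window; MAP_POPULATE prefaults it so
         * later accesses do not take minor faults one page at a time. */
        char *win = mmap(NULL, window, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, fd, 0);
        if (win == MAP_FAILED) { perror("mmap"); return 1; }

        /* Rebind the third page of the window to page 1000 of the file
         * without creating a second VMA. */
        if (remap_file_pages(win + 2 * page, page, 0, 1000, 0) < 0)
                perror("remap_file_pages");

        win[2 * page] = 'x';    /* now touches the first byte of file page 1000 */

        munmap(win, window);
        close(fd);
        return 0;
}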

--
Mel Gorman

2003-07-02 16:58:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 04:57:27PM +0100, Mel Gorman wrote:
> The second reason is avoiding the poor linear search algorithm used by
> get_unmapped_area() when looking for a large free virtual area. With a
> large number of mappings, this search is very expensive. It has been

this is true too. However get_unmapped_area never needs to find a new
place for the granular usages since mmap is always called with MAP_FIXED for
those.

> proposed to alter the function to perform a tree based search. This could
> be a tree of free areas ordered by size for example but none has yet been

it can't be trivially a tree of free areas or (if naturally indexed with
the size of the hole) it would return the smallest-fitting hole, not the
leftmost-smallest-fitting-hole ;). A better solution is possible. Then
everybody will benefit w/o need of userspace changes. It's still pretty
orthogonal with remap_file_pages though.

> implemented. In the meantime, non-linear mappings are being used to bypass
> the VM.
>
> The third reason is related to frequent page faults associated with linear
> mappings. A non-linear mapping is able to prefault in all pages that are
> required by the mapping as it is presumed they will be needed very soon.
> To some extent, this can be addressed by specifying the MAP_POPULATE flag
> when calling mmap() for a normal mapping.

mlock already does it too.

> This feature has a very serious drawback. The system calls truncate() and
> mincore() are broken with respect to non-linear mappings. Both calls
> depend on vm_area_struct->vm_pgoff, which is the offset within
> the mapped file, but the field is meaningless within a non-linear mapping.
> This means that truncated files will still have mapped pages that no
> longer have a physical backing. A number of possible solutions, such as
> allowing the pages to exist but be anonymous and private to the process,
> have been suggested but none implemented.

the major reason you didn't mention for remap_file_pages is the rmap
avoidance. There's no rmap backing the remap_file_pages regions, so the
overhead per task is reduced greatly and the box stops running oom
(actually deadlocking for mainline thanks to the oom killer and NOFAIL
default behaviour). Since there's no rmap, this in turn means either
this nonlinear vma will swap badly, or it means rmap is totally useless
to swap well. Which in short means either rmap has to go in its current
form (and the usefulness of remap_file_pages would be greatly reduced),
or nonlinear mappings had better stay pinned in ram since they'd
better not be used for the emulator with 63G of highmem into swap on a
1G host anyways (the sysctl would fix the security detail in pinning
into ram like we're doing today with the largepages in 2.4).

Andrea

2003-07-02 17:11:30

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

> the major reason you didn't mention for remap_file_pages is the rmap
> avoidance. There's no rmap backing the remap_file_pages regions, so the
> overhead per task is reduced greatly and the box stops running oom
> (actually deadlocking for mainline thanks to the oom killer and NOFAIL
> default behaviour).

Maybe I'm just taking this out of context, and it's twisting my brain,
but as far as I know, the nonlinear vma's *are* backed by pte_chains.
That was the whole problem with objrmap having to do conversions, etc.

Am I just confused for some reason? I was pretty sure that was right ...

M.

2003-07-02 17:33:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 10:10:09AM -0700, Martin J. Bligh wrote:
> Maybe I'm just taking this out of context, and it's twisting my brain,
> but as far as I know, the nonlinear vma's *are* backed by pte_chains.
> That was the whole problem with objrmap having to do conversions, etc.
>
> Am I just confused for some reason? I was pretty sure that was right ...

you're right:

int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, struct page *page, pgprot_t prot)
[..]
flush_icache_page(vma, page);
set_pte(pte, mk_pte(page, prot));
pte_chain = page_add_rmap(page, pte, pte_chain);
pte_unmap(pte);
[..]

(this makes me understand better some of the arguments in the previous
emails too ;)

So either we declare 32bit archs obsolete in production with 2.6, or we
drop rmap behind remap_file_pages.

actually other more invasive ways could be to move rmap into highmem.
Also the page clustering could also hide part of the mem overhead by
assuming the pagetables to be contiguous, but page clustering isn't part
of mainline yet either.

Something has to change since IMHO in the current 2.5.73 remap_file_pages
is nearly useless.

Andrea

2003-07-02 17:54:03

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

--On Wednesday, July 02, 2003 19:47:00 +0200 Andrea Arcangeli <[email protected]> wrote:

> On Wed, Jul 02, 2003 at 10:10:09AM -0700, Martin J. Bligh wrote:
>> Maybe I'm just taking this out of context, and it's twisting my brain,
>> but as far as I know, the nonlinear vma's *are* backed by pte_chains.
>> That was the whole problem with objrmap having to do conversions, etc.
>>
>> Am I just confused for some reason? I was pretty sure that was right ...
>
> you're right:
>
> int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, struct page *page, pgprot_t prot)
> [..]
> flush_icache_page(vma, page);
> set_pte(pte, mk_pte(page, prot));
> pte_chain = page_add_rmap(page, pte, pte_chain);
> pte_unmap(pte);
> [..]
>
> (this makes me understand better some of the arguments in the previous
> emails too ;)

OK, nice to know I haven't totally lost it ;-)

>> So either we declare 32bit archs obsolete in production with 2.6, or we
> drop rmap behind remap_file_pages.

Indeed - if we could memlock it, it'd be OK to drop that stuff. Would
make everything a lot simpler.

M.

2003-07-02 17:54:37

by Rik van Riel

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, 2 Jul 2003, Andrea Arcangeli wrote:

> So either we declare 32bit archs obsolete in production with 2.6, or we
> drop rmap behind remap_file_pages.

> Something has to change since IMHO in the current 2.5.73 remap_file_pages
> is nearly useless.

Agreed. What we did for a certain unspecified kernel tree
at Red Hat was the following:

1) limit sys_remap_file_pages functionality to shared memory
segments on ramfs (unswappable) and tmpfs (mostly unswappable;))

2) have the VMAs with remapped pages in them marked VM_LOCKED

3) do not set up pte chains for the pages that get mapped with
install_page

4) remove said pages from the LRU list; in the ramfs case, they're
unswappable anyway so we shouldn't have the VM scan them

The only known user of sys_remap_file_pages was more than happy
to have the functionality limited to just what they actually need,
in order to get simpler code with less overhead.

Let's face it, nobody is going to use sys_remap_file_pages for
anything but a database shared memory segment anyway. You don't
need to care about truncate or the other corner cases.

2003-07-02 17:55:48

by Rik van Riel

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, 2 Jul 2003, Martin J. Bligh wrote:

> Maybe I'm just taking this out of context, and it's twisting my brain,
> but as far as I know, the nonlinear vma's *are* backed by pte_chains.

They are, but IMHO they shouldn't be. The nonlinear vmas are used
only for database shared memory segments and other "bypass the VM"
applications, so I don't see any reason why we need to complicate
things hopelessly in order to deal with corner cases like truncate.

2003-07-02 18:00:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 10:52:42AM -0700, Martin J. Bligh wrote:
> Indeed - if we could memlock it, it'd be OK to drop that stuff. Would
> make everything a lot simpler.

yes.

Andrea

2003-07-02 20:06:12

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

>> So either we declare 32bit archs obsolete in production with 2.6, or we
>> drop rmap behind remap_file_pages.
>
>> Something has to change since IMHO in the current 2.5.73 remap_file_pages
>> is nearly useless.
>
> Agreed. What we did for a certain unspecified kernel tree
> at Red Hat was the following:
>
> 1) limit sys_remap_file_pages functionality to shared memory
> segments on ramfs (unswappable) and tmpfs (mostly unswappable;))
>
> 2) have the VMAs with remapped pages in them marked VM_LOCKED
>
> 3) do not set up pte chains for the pages that get mapped with
> install_page
>
> 4) remove said pages from the LRU list; in the ramfs case, they're
> unswappable anyway so we shouldn't have the VM scan them
>
> The only known user of sys_remap_file_pages was more than happy
> to have the functionality limited to just what they actually need,
> in order to get simpler code with less overhead.
>
> Let's face it, nobody is going to use sys_remap_file_pages for
> anything but a database shared memory segment anyway. You don't
> need to care about truncate or the other corner cases.

Well if RH have done this internally, and they invented the thing,
then I see no reason not to do that in 2.5 ...

>> Maybe I'm just taking this out of context, and it's twisting my brain,
>> but as far as I know, the nonlinear vma's *are* backed by pte_chains.
>
> Rik:
>
> They are, but IMHO they shouldn't be. The nonlinear vmas are used
> only for database shared memory segments and other "bypass the VM"
> applications, so I don't see any reason why we need to complicate
> things hopelessly in order to deal with corner cases like truncate.

Agreed. Oddly, most of us seem to agree on this ... ;-)

M.

2003-07-02 21:26:26

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 07:47:00PM +0200, Andrea Arcangeli wrote:
> actually other more invasive ways could be to move rmap into highmem.
> Also the page clustering could also hide part of the mem overhead by
> assuming the pagetables to be contiguous, but page clustering isn't part
> of mainline yet either.

BSD-style page clustering preserves virtual contiguity of a software
page, but the new things don't; for ABI preservation, virtually
discontiguous, partial, and misaligned mappings of pages are handled.

The desired behavior can in principle be partially recovered by
scanning within a software page size -sized "blast radius" for each
chain element and only chaining enough elements to find the relevant
ptes that way.

As for remap_file_pages(), either people are misunderstanding or
ignoring me. There is a lovely three-step method to handling it:

(a) fix the truncate() bug; it is just a literal bug. There are at
    least 3 different ways to fix it:
    (i)   tag vmas touched by remap_file_pages() for exhaustive search
    (ii)  do a cleanup pass after the current vmtruncate() doing
          try_to_unmap() on any still-mapped pages
    (iii) drop the current vmtruncate() entirely and do try_to_unmap()
          on each truncated page
    (ii) and (iii) do the locks in the wrong order, so some still-
    mapped but truncated page could be out there; this could be
    handled by Yet Another Cleanup Pass that does (i) or by tolerating
    the new state elsewhere in the VM. There are plenty of ways to
    code this and a couple of choices of semantics (i.e. make it
    failable or reliable).

(b) implement the bits omitting pte_chains for mlock()'d mappings
    This is obvious. Yank them off the LRU, set a bitflag, and
    reuse page->lru for a counter.

(c) redo the logic around page_convert_anon() and incrementally build
    pte_chains for remap_file_pages().
    The anobjrmap code did exactly this, but it was chaining
    distinct user virtual addresses instead.
    (i)   you always have the pte_chain in hand anyway; the core
          is always prepped to handle allocating them now
    (ii)  instead of just bailing for file-backed pages in
          page_add_rmap(), pass it enough information to know
          whether the address matches what it should from the
          vma, and start chaining if it doesn't
    (iii) but you say ->mapcount sharing space with the chain is a
          problem? no, it's not; again, take a cue from anobjrmap:
          if a file-backed page needs a pte_chain, shoehorn
          ->mapcount into the first pte_chain block dangling off it

After all 3 are done, remap_file_pages() integrates smoothly into the VM,
requires no magical privileges, nothing magical or brutally invasive
that would scare people just before 2.6.0 is required, and the big
apps can get their magical lowmem savings by just mlock()'ing _anything_
they do massive sharing with, regardless of remap_file_pages().

Does anyone get it _now_?


-- wli

2003-07-02 21:46:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

> (c) redo the logic around page_convert_anon() and incrementally build
> pte_chains for remap_file_pages().
> The anobjrmap code did exactly this, but it was chaining
> distinct user virtual addresses instead.
> (i) you always have the pte_chain in hand anyway; the core
> is always prepped to handle allocating them now
> (ii) instead of just bailing for file-backed pages in
> page_add_rmap(), pass it enough information to know
> whether the address matches what it should from the
> vma, and start chaining if it doesn't
> (iii) but you say ->mapcount sharing space with the chain is a
> problem? no, it's not; again, take a cue from anobjrmap:
> if a file-backed page needs a pte_chain, shoehorn
> ->mapcount into the first pte_chain block dangling off it
>
> After all 3 are done, remap_file_pages() integrates smoothly into the VM,
> requires no magical privileges, nothing magical or brutally invasive
> that would scare people just before 2.6.0 is required, and the big
> apps can get their magical lowmem savings by just mlock()'ing _anything_
> they do massive sharing with, regardless of remap_file_pages().
>
> Does anyone get it _now_?

If you have (anon) object based rmap, I don't see why you want to build
a pte_chain on a per-page basis - keeping this info on a per linear
area seems much more efficient. We still have a reverse mapping for
everything this way.

M.

2003-07-02 21:49:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 02:40:32PM -0700, William Lee Irwin III wrote:
> On Wed, Jul 02, 2003 at 07:47:00PM +0200, Andrea Arcangeli wrote:
> > actually other more invasive ways could be to move rmap into highmem.
> > Also the page clustering could also hide part of the mem overhead by
> > assuming the pagetables to be contiguous, but page clustering isn't part
> > of mainline yet either.
>
> BSD-style page clustering preserves virtual contiguity of a software
> page, but the new things don't; for ABI preservation, virtually
> discontiguous, partial, and misaligned mappings of pages are handled.
>
> The desired behavior can in principle be partially recovered by
> scanning within a software page size -sized "blast radius" for each
> chain element and only chaining enough elements to find the relevant
> ptes that way.
>
> As for remap_file_pages(), either people are misunderstanding or
> ignoring me. There is a lovely three-step method to handling it:
>
> (a) fix the truncate() bug; it is just a literal bug. There are at
> least 3 different ways to fix it:
> (i) tag vmas touched by remap_file_pages() for exhaustive search
> (ii) do a cleanup pass after the current vmtruncate() doing
> try_to_unmap() on any still-mapped pages
> (iii) drop the current vmtruncate() entirely and do try_to_unmap()
> on each truncated page
> (ii) and (iii) do the locks in the wrong order, so some still-
> mapped but truncated page could be out there; this could be
> handled by Yet Another Cleanup Pass that does (i) or by tolerating
> the new state elsewhere in the VM. There are plenty of ways to
> code this and a couple of choices of semantics (i.e. make it
> failable or reliable).
>
> (b) implement the bits omitting pte_chains for mlock()'d mappings
> This is obvious. Yank them off the LRU, set a bitflag, and
> reuse page->lru for a counter.
>
> (c) redo the logic around page_convert_anon() and incrementally build
> pte_chains for remap_file_pages().
> The anobjrmap code did exactly this, but it was chaining
> distinct user virtual addresses instead.
> (i) you always have the pte_chain in hand anyway; the core
> is always prepped to handle allocating them now
> (ii) instead of just bailing for file-backed pages in
> page_add_rmap(), pass it enough information to know
> whether the address matches what it should from the
> vma, and start chaining if it doesn't
> (iii) but you say ->mapcount sharing space with the chain is a
> problem? no, it's not; again, take a cue from anobjrmap:
> if a file-backed page needs a pte_chain, shoehorn
> ->mapcount into the first pte_chain block dangling off it
>
> After all 3 are done, remap_file_pages() integrates smoothly into the VM,
> requires no magical privileges, nothing magical or brutally invasive
> that would scare people just before 2.6.0 is required, and the big
> apps can get their magical lowmem savings by just mlock()'ing _anything_
> they do massive sharing with, regardless of remap_file_pages().
>
> Does anyone get it _now_?

the problem with the above is that it is an order of magnitude more
complicated than just providing the feature remap_file_pages has been
written for. Removing the pte_chains via mlock is trivial, but then go
ahead and rebuild them synchronously in O(N), scanning the whole 1T of
virtual address space when I munlock.

In turn I still prefer the simplest possible approach. I see no strong
reason why we should complicate the kernel like that to make
remap_file_pages generic.

IMHO remap_file_pages wouldn't exist today in the kernel if 32bit archs
were limited to 4G of ram. It's primarily a 32bit hack and as such we
should try to get away with it with the minimal damage to the rest of
the kernel (in a way that emulators can use too though, via a sysctl or
similar).

Now releasing the pte_chain during mlock would be a generic feature
orthogonal to the above, I know, but I doubt you really care about it
for all other usages (also given the nearly unfixable complexity it
would introduce in munlock).

Andrea

2003-07-02 22:00:16

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

At some point in the past, I wrote:
>> (c) redo the logic around page_convert_anon() and incrementally build
>> pte_chains for remap_file_pages().
>> The anobjrmap code did exactly this, but it was chaining
>> distinct user virtual addresses instead.
>> After all 3 are done, remap_file_pages() integrates smoothly into the VM,
>> requires no magical privileges, nothing magical or brutally invasive
>> that would scare people just before 2.6.0 is required, and the big
>> apps can get their magical lowmem savings by just mlock()'ing _anything_
>> they do massive sharing with, regardless of remap_file_pages().

On Wed, Jul 02, 2003 at 02:48:14PM -0700, Martin J. Bligh wrote:
> If you have (anon) object based rmap, I don't see why you want to build
> a pte_chain on a per-page basis - keeping this info on a per linear
> area seems much more efficient. We still have a reverse mapping for
> everything this way.

Eh? This is just suggesting using similar devices as were used in the
anobjrmap patch. I'm not terribly convinced about the remap_file_pages()
extents, since they're only going to be a factor of 8 or so space
reduction.

anobjrmap actually didn't use vma-like devices for anon pages; it merely
chained mm's that could share anon pages (fork()'s between exec()'s)
in a list that could be scanned, tagged anon pages with vaddrs, and
then walked that list of mm's when unmapping or checking referenced bits.


-- wli

2003-07-02 22:04:40

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 12:02:46AM +0200, Andrea Arcangeli wrote:
> Now releasing the pte_chain during mlock would be a generic feature
> orthogonal with the above I know, but I doubt you really care about it
> for all other usages (also given the nearly unfixable complexity it
> would introduce in munlock).

What complexity? Just unmap it if you can't allocate a pte_chain and
park it on the LRU.


-- wli

2003-07-02 22:13:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 03:15:51PM -0700, William Lee Irwin III wrote:
> What complexity? Just unmap it if you can't allocate a pte_chain and
> park it on the LRU.

the complexity in munlock to rebuild what you destroyed in mlock, that's
linear at best (and for anonymous mappings there's no objrmap, plus
objrmap isn't even linear but quadratic in its scan [hence the problem
with it], though in practice it would be normally faster than the linear
of the page scanning ;)

Andrea

2003-07-02 22:59:50

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 03:15:51PM -0700, William Lee Irwin III wrote:
>> What complexity? Just unmap it if you can't allocate a pte_chain and
>> park it on the LRU.

On Thu, Jul 03, 2003 at 12:26:41AM +0200, Andrea Arcangeli wrote:
> the complexity in munlock to rebuild what you destroyed in mlock, that's
> linear at best (and for anonymous mappings there's no objrmap, plus
> objrmap isn't even linear but quadratic in its scan [hence the problem
> with it], though in practice it would be normally faster than the linear
> of the page scanning ;)

Computational complexity; okay.

It's not quadratic; at each munlock(), it's not necessary to do
anything more than:

for each page this mlock()'er (not _all_ mlock()'ers) maps of this thing
    grab some pagewise lock
    if pte_chain allocation succeeded
        add pte_chain
    else
        /* you'll need to put anon pages in swapcache in mlock() */
        unmap the page
    decrement lockcount
    if lockcount vanished
        park it on the LRU
    drop the pagewise lock

Individual mappers whose mappings are not mlock()'d add pte_chains when
faulting the things in just like before.


-- wli

2003-07-02 23:16:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 04:11:22PM -0700, William Lee Irwin III wrote:
> On Wed, Jul 02, 2003 at 03:15:51PM -0700, William Lee Irwin III wrote:
> >> What complexity? Just unmap it if you can't allocate a pte_chain and
> >> park it on the LRU.
>
> On Thu, Jul 03, 2003 at 12:26:41AM +0200, Andrea Arcangeli wrote:
> > the complexity in munlock to rebuild what you destroyed in mlock, that's
> > linear at best (and for anonymous mappings there's no objrmap, plus
> > objrmap isn't even linear but quadratic in its scan [hence the problem
> > with it], though in practice it would be normally faster than the linear
> > of the page scanning ;)
>
> Computational complexity; okay.
>
> It's not quadratic; at each munlock(), it's not necessary to do

yes, as said above it's linear with the number of virtual pages mapped
unless you use the objrmap to rebuild rmap.

> anything more than:

is this munmap right?

>
> for each page this mlock()'er (not _all_ mlock()'ers) maps of this thing
> grab some pagewise lock
> if pte_chain allocation succeeded
> add pte_chain

allocated sure, but it has no information yet, you dropped the info in
mlock

> else
> /* you'll need to put anon pages in swapcache in mlock() */
> unmap the page

how to unmap? there's no rmap. also there may not be swap at all to
allocate swapcache from

> decrement lockcount
> if lockcount vanished
> park it on the LRU
> drop the pagewise lock
>
> Individual mappers whose mappings are not mlock()'d add pte_chains when
> faulting the things in just like before.

Tell me how you reach the pagetable from munlock to do the unmap. If you
can reach the pagetable, the unmap isn't necessary and you only need to
rebuild the rmap. and if you can reach the pagetable efficiently w/o
rmap, it means rmap is useless in the first place.

Andrea

2003-07-02 23:41:46

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 01:30:14AM +0200, Andrea Arcangeli wrote:
> yes, as said above it's linear with the number of virtual pages mapped
> unless you use the objrmap to rebuild rmap.
> is this munmap right?

I was describing munlock(); munmap() would do the same except not even
bother trying to allocate the pte_chains and always unmap it from the
processes whose mappings are being fiddled with.


On Wed, Jul 02, 2003 at 04:11:22PM -0700, William Lee Irwin III wrote:
>> for each page this mlock()'er (not _all_ mlock()'ers) maps of this thing
>> grab some pagewise lock
>> if pte_chain allocation succeeded
>> add pte_chain

On Thu, Jul 03, 2003 at 01:30:14AM +0200, Andrea Arcangeli wrote:
> allocated sure, but it has no information yet, you dropped the info in
> mlock

We have the information because I'm describing this as part of doing a
pagetable walk over the mlock()'d area we're munlock()'ing.


On Wed, Jul 02, 2003 at 04:11:22PM -0700, William Lee Irwin III wrote:
>> else
>> /* you'll need to put anon pages in swapcache in mlock() */
>> unmap the page

On Thu, Jul 03, 2003 at 01:30:14AM +0200, Andrea Arcangeli wrote:
> how to unmap? there's no rmap. also there may not be swap at all to
> allocate swapcache from

That doesn't matter; it only has to have an entry in swapper_space's
radix tree. But this actually could mean trouble since things currently
assume swap is preallocated for each entry in swapper_space's page_tree.

Which is fine; just revert to the old chaining semantics for mlock()'d
pages with PG_anon high.


On Wed, Jul 02, 2003 at 04:11:22PM -0700, William Lee Irwin III wrote:
>> decrement lockcount
>> if lockcount vanished
>> park it on the LRU
>> drop the pagewise lock
>> Individual mappers whose mappings are not mlock()'d add pte_chains when
>> faulting the things in just like before.

On Thu, Jul 03, 2003 at 01:30:14AM +0200, Andrea Arcangeli wrote:
> Tell me how you reach the pagetable from munlock to do the unmap. If you
> can reach the pagetable, the unmap isn't necessary and you only need to
> rebuild the rmap. and if you can reach the pagetable efficiently w/o
> rmap, it means rmap is useless in the first place.

This algorithm occurs during a pagetable walk of the process we'd unmap
it from; we don't unmap it from all processes, just the current one, and
allow it to take minor faults.


-- wli

2003-07-03 11:17:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 04:55:40PM -0700, William Lee Irwin III wrote:
> it from; we don't unmap it from all processes, just the current one, and

if this is the case, this also means mlock isn't enough to guarantee to
drop the pte_chains: you also will need everybody else mapping the file
to mlock after every mmap or the pte_chains will stay there.

Last but not least, mlock is a privileged operation and in turn
it *can't* be used. Those apps strictly run as normal users for all the
good reasons. So at the very least you need a sysctl to allow anybody to
run mlock.

Yet another issue is that mlock at most locks half of the physical
ram; this makes it unusable (google is patching it too for their
internal usage that skips the more costly page faults), so you would
need another sysctl to select the max amount of mlockable memory (or you
could share the same sysctl that makes mlock a non-privileged
operation, it's up to you).
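
To put rough numbers on this from userspace, here is a minimal sketch
(plain glibc calls, made-up sizes, most error handling omitted) that
probes the per-process mlock budget and shows where an unprivileged
mlock() gives up:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    size_t len = 64UL << 20;            /* try to pin 64M */

    getrlimit(RLIMIT_MEMLOCK, &rl);     /* per-process mlock budget */
    printf("RLIMIT_MEMLOCK: cur=%lu max=%lu\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* without CAP_IPC_LOCK (or a raised limit) this is where it fails */
    if (mlock(p, len) != 0)
        printf("mlock: %s\n", strerror(errno));
    return 0;
}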

Bottom line is that you will still need a sysctl for security reasons
(equivalent to my sysctl to make remap_file_pages runnable as normal
user with my proposal), and my proposal is an order of magnitude simpler
to implement and maintain, and it doesn't affect mlock and it doesn't
create any complication with the rest of VM, since the rest of the VM
will never see those populated-pages via remap_file_pages, they will be
treated like pages under I/O via kiobuf etc... (so anonymous ones)

Your only advantage for the VM complications is that the emulator won't
need to use the sysctl (and well, most emulators need root privileges
anyways for kernel modules for nat etc..) and that it will be allowed to
swap heavily without the vma overhead (that IMHO is negligible anyways
during heavy swapping with the box idle, especially after mmap will
always run in O(log(N)) where N is the number of vmas, currently it's
O(N) but it'll be improved).

Andrea

2003-07-03 11:32:28

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Wed, Jul 02, 2003 at 04:55:40PM -0700, William Lee Irwin III wrote:
>> it from; we don't unmap it from all processes, just the current one, and

On Thu, Jul 03, 2003 at 01:31:44PM +0200, Andrea Arcangeli wrote:
> if this is the case, this also means mlock isn't enough to guarantee to
> drop the pte_chains: you also will need everybody else mapping the file
> to mlock after every mmap or the pte_chains will stay there.

IMHO the unprivileged applications might as well just be subject to
restrictive resource limits on number of processes etc. to cope with
that. I see zero loss in creating the pte_chains for mappers of the
files that aren't privileged enough to mlock().


On Thu, Jul 03, 2003 at 01:31:44PM +0200, Andrea Arcangeli wrote:
> Last but not least, mlock is a privileged operation and in turn
> it *can't* be used. Those apps strictly run as normal user for all the
> good reasons, so at the very least you need a sysctl to allow anybody to
> run mlock.

This is obviously out of the question if the entire goal of the exercise
of devolving inhibition of pte_chains to mlock() is enabling things to
run with minimal privileges.

I suggest granting CAP_IPC_LOCK with libcap and/or its associated
utility programs in pam modules in preference to sysctl's that allow
arbitrary users to exercise such privileges, despite the administrative
overhead and/or obscurity of such interfaces and/or utilities.
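
As a rough sketch of the libcap side of that (cap_get_proc() and
cap_get_flag() are the stock libcap calls; how the capability actually
gets granted is left to pam or whatever wrapper):

#include <stdio.h>
#include <sys/capability.h>     /* libcap; link with -lcap */

int main(void)
{
    cap_t caps = cap_get_proc();
    cap_flag_value_t has_lock = CAP_CLEAR;

    if (caps) {
        cap_get_flag(caps, CAP_IPC_LOCK, CAP_EFFECTIVE, &has_lock);
        cap_free(caps);
    }
    printf("CAP_IPC_LOCK %s\n",
           has_lock == CAP_SET ? "present: mlock() is unrestricted"
                               : "absent: mlock() is restricted");
    return 0;
}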


On Thu, Jul 03, 2003 at 01:31:44PM +0200, Andrea Arcangeli wrote:
> Yet another issue is that mlock at max locks in half of the physical
> ram, this makes it unusable (google is patching it too for their
> internal usage that skips the more costly page faults), so you would
> need another sysctl to select the max amount of mlockable memory (or you
> could share the same sysctl that makes mlock a non-privileged
> operation, it's up to you).

Twiddling resource limits doesn't sound like a significant obstacle to me.


On Thu, Jul 03, 2003 at 01:31:44PM +0200, Andrea Arcangeli wrote:
> Bottom line is that you will still need a sysctl for security reasons
> (equivalent to my sysctl to make remap_file_pages runnable as normal
> user with my proposal), and my proposal is an order of magnitude simpler
> to implement and maintain, and it doesn't affect mlock and it doesn't
> create any complication with the rest of VM, since the rest of the VM
> will never see those populated-pages via remap_file_pages, they will be
> treated like pages under I/O via kiobuf etc... (so anonymous ones)

Well, what I'm _trying_ to do is cut down the privilege requirements to
where API's remain generally useful as opposed to totally castrating
them to the point where almost nothing can use them. I think by the
time it's chopped down to maximum mlock()'able RAM with all other
mechanisms usable by normal users we're home free.


On Thu, Jul 03, 2003 at 01:31:44PM +0200, Andrea Arcangeli wrote:
> Your only advantage for the VM complications is that the emulator won't
> need to use the sysctl (and well, most emulators need root privileges
> anyways for kernel modules for nat etc..) and that it will be allowed to
> swap heavily without the vma overhead (that IMHO is negligible anyways
> during heavy swapping with the box idle, especially after mmap will
> always run in O(log(N)) where N is the number of vmas, currently it's
> O(N) but it'll be improved).

Actually it's not entirely for vma overhead. Compacting the virtual
address space allows users to be "friendly" with respect to pagetable
space or other kernel metadata space consumption whether on 32-bit or
64-bit. For instance, the "vast and extremely sparsely accessed"
mapping on 64-bit machines can have its pagetable space mitigated by
userspace using the remap_file_pages() API, where otherwise it would
either OOM or incur pagetable reclamation overhead (where pagetable
reclamation is not yet implemented).
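
A rough sketch of the usage pattern in question -- a small, fixed virtual
window whose pages get rewired to arbitrary file offsets -- with a made-up
file name and sizes:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW  (16UL << 20)    /* 16M of virtual "arena" */
#define PAGE    4096UL

int main(void)
{
    int fd = open("bigfile.dat", O_RDWR);   /* hypothetical data file */
    if (fd < 0)
        return 1;

    /* one small, linearly mapped window... */
    char *win = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (win == MAP_FAILED)
        return 1;

    /* ...whose pages are then pointed at arbitrary file pages, without
     * creating new vmas or new pagetable pages */
    size_t slot = 3;            /* arbitrary slot in the window */
    size_t file_pgoff = 123456; /* arbitrary page of the file */
    if (remap_file_pages(win + slot * PAGE, PAGE, 0, file_pgoff, 0))
        perror("remap_file_pages");

    close(fd);
    return 0;
}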


-- wli

2003-07-03 12:44:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 04:46:26AM -0700, William Lee Irwin III wrote:
> Actually it's not entirely for vma overhead. Compacting the virtual
> address space allows users to be "friendly" with respect to pagetable
> space or other kernel metadata space consumption whether on 32-bit or
> 64-bit. For instance, the "vast and extremely sparsely accessed"
> mapping on 64-bit machines can have its pagetable space mitigated by
> userspace using the remap_file_pages() API, where otherwise it would
> either OOM or incur pagetable reclamation overhead (where pagetable
> reclamation is not yet implemented).

apps should never try to use remap_file_pages like that on 64bit:
mangling the ptes, flushing the tlb and entering the kernel is a huge
overhead compared to the static pte ram cost, which nobody can care
about on 64bit since at worst you can plug in some more gigs of ram. If
you really have a huge chunk that you have to release (and the problem
isn't the pte at all; the problem, if anything, is the page pointed to
by the pte that you may want to free if you know it'll never be useful
again), munmap will work fine too and it won't be slower than
remap_file_pages, and if it's a really huge chunk munmap will get rid of
the ptes too.

btw, for the really huge mappings largepages are always required anyways
which means the pte cost is zero because there aren't ptes at all.

even if you don't use largepages as you should, the ram cost of the pte
is nothing on 64bit archs, all you care about is to use all the mhz and
tlb entries of the cpu.

remap_file_pages is useful only for VLM in 32bit and theoretically
emulators (but I didn't hear any emulator developer ask for this feature
yet, and I doubt it would make a significant performance difference
anyways since the only thing it saves is the vma cost for the emulator
since you want to leave rmap behind it)

Andrea

2003-07-03 12:52:17

by Rik van Riel

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, 3 Jul 2003, Andrea Arcangeli wrote:

> even if you don't use largepages as you should, the ram cost of the pte
> is nothing on 64bit archs, all you care about is to use all the mhz and
> tlb entries of the cpu.

That depends on the number of Oracle processes you have.
Say that page tables need 0.1% of the space of the virtual
space they map. With 1000 Oracle users you'd end up needing
as much memory in page tables as your shm segment is large.
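
Back of the envelope (i386-style 4-byte ptes and 4 KiB pages; the
numbers are only illustrative):

    4 bytes of pte per 4096 bytes mapped  ~= 0.1% per process
    1000 processes x 0.1%                 ~= 100% of the segment,
                                             in page tables alone

(8-byte ptes, as with PAE or the usual 64-bit formats, double that.)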

Of course, in this situation either the application should
use large pages or the kernel should simply reclaim the
page tables (possible while holding the mmap_sem for write).

> remap_file_pages is useful only for VLM in 32bit

Agreed on that. Please let the monstrosity die together
with 32 bit machines ;)

2003-07-03 13:34:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 09:06:32AM -0400, Rik van Riel wrote:
> On Thu, 3 Jul 2003, Andrea Arcangeli wrote:
>
> > even if you don't use largepages as you should, the ram cost of the pte
> > is nothing on 64bit archs, all you care about is to use all the mhz and
> > tlb entries of the cpu.
>
> That depends on the number of Oracle processes you have.

well, that wasn't necessarily a database but ok.

> Say that page tables need 0.1% of the space of the virtual
> space they map. With 1000 Oracle users you'd end up needing
> as much memory in page tables as your shm segment is large.

so just add more ram, ram is cheaper than cpu power (I mean, on 64bit)

> Of course, in this situation either the application should
> use large pages or the kernel should simply reclaim the

as you say, it should definitely use largepages if it's such kind of
usage, so the whole point of saving pte space is void. it should use
largepages even if it's not "many tasks mapping the shm", but just a
single task mapping some huge ram.

> Agreed on that. Please let the monstrosity die together
> with 32 bit machines ;)

Indeed ;)

Andrea

2003-07-03 18:34:50

by Jamie Lokier

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

Andrea Arcangeli wrote:
> but I didn't hear any emulator developer ask for this feature yet

No, but there was a meek request to get writable/read-only protection
working with remap_file_pages, so that a garbage collector can change
protection on individual pages without requiring O(nr_pages) vmas.

Perhaps that should have nothing to do with remap_file_pages, though.
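
A tiny sketch of the pain being described (made-up sizes): toggling
protection page by page with mprotect() leaves the kernel with roughly
one vma per page:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t page = 4096, npages = 1024;
    char *p = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* write-protect every other page, GC write-barrier style; each call
     * splits a vma, so this ends up with O(npages) vmas */
    for (size_t i = 0; i < npages; i += 2)
        mprotect(p + i * page, page, PROT_READ);

    getchar();      /* look at /proc/self/maps at this point */
    return 0;
}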

-- Jamie

2003-07-03 18:40:38

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

Andrea Arcangeli wrote:
>> but I didn't hear any emulator developer ask for this feature yet

On Thu, Jul 03, 2003 at 07:48:25PM +0100, Jamie Lokier wrote:
> No, but there was a meek request to get writable/read-only protection
> working with remap_file_pages, so that a garbage collector can change
> protection on individual pages without requiring O(nr_pages) vmas.
> Perhaps that should have nothing to do with remap_file_pages, though.

I call that application #2.


-- wli

2003-07-03 18:39:36

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, 3 Jul 2003, Andrea Arcangeli wrote:
>> even if you don't use largepages as you should, the ram cost of the pte
>> is nothing on 64bit archs, all you care about is to use all the mhz and
>> tlb entries of the cpu.

On Thu, Jul 03, 2003 at 09:06:32AM -0400, Rik van Riel wrote:
> That depends on the number of Oracle processes you have.
> Say that page tables need 0.1% of the space of the virtual
> space they map. With 1000 Oracle users you'd end up needing
> as much memory in page tables as your shm segment is large.
> Of course, in this situation either the application should
> use large pages or the kernel should simply reclaim the
> page tables (possible while holding the mmap_sem for write).

No, it is not true that pagetable space can be wantonly wasted
on 64-bit.

Try mmap()'ing something sufficiently huge and accessing on average
every PAGE_SIZE'th virtual page, in a single-threaded single process.
e.g. various indexing schemes might do this. This is 1 pagetable page
per page of data (worse if shared), which blows major goats.

There's a reason why those things use inverted pagetables... at any
rate, compacting virtualspace with remap_file_pages() solves it too.

Large pages won't help, since the data isn't contiguous.
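
A sketch of the access pattern in question, for a 64-bit box (stride and
sizes picked only to make the 1:1 pagetable-to-data ratio visible):

#include <sys/mman.h>

int main(void)
{
    size_t page = 4096;
    size_t span = 1024 * page;  /* what one pagetable page covers here */
    size_t len  = 1024 * span;  /* a few GB of sparse virtual space */

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* touch one page per pagetable page: every resident data page drags
     * in a whole pagetable page of its own */
    for (size_t off = 0; off < len; off += span)
        p[off] = 1;

    return 0;
}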


-- wli

2003-07-03 18:58:17

by Andrew Morton

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

Andrea Arcangeli <[email protected]> wrote:
>
> Yet another issue is that mlock at max locks in half of the physical
> ram,

I deleted that bit.

2003-07-03 19:13:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 11:53:41AM -0700, William Lee Irwin III wrote:
> On Thu, 3 Jul 2003, Andrea Arcangeli wrote:
> >> even if you don't use largepages as you should, the ram cost of the pte
> >> is nothing on 64bit archs, all you care about is to use all the mhz and
> >> tlb entries of the cpu.
>
> On Thu, Jul 03, 2003 at 09:06:32AM -0400, Rik van Riel wrote:
> > That depends on the number of Oracle processes you have.
> > Say that page tables need 0.1% of the space of the virtual
> > space they map. With 1000 Oracle users you'd end up needing
> > as much memory in page tables as your shm segment is large.
> > Of course, in this situation either the application should
> > use large pages or the kernel should simply reclaim the
> > page tables (possible while holding the mmap_sem for write).
>
> No, it is not true that pagetable space can be wantonly wasted
> on 64-bit.
>
> Try mmap()'ing something sufficiently huge and accessing on average
> every PAGE_SIZE'th virtual page, in a single-threaded single process.
> e.g. various indexing schemes might do this. This is 1 pagetable page
> per page of data (worse if shared), which blows major goats.

that's the very old exploit that touches 1 page per pmd.

>
> There's a reason why those things use inverted pagetables... at any
> rate, compacting virtualspace with remap_file_pages() solves it too.
>
> Large pages won't help, since the data isn't contiguous.

if you can't use a sane design it's not a kernel issue. this is bad
userspace code seeking like crazy on disk too, working around it with a
kernel feature sounds worthless. If algorithms have no locality at all,
and they spread 1 page per pmd that's their problem.

the easiest way to waste ram with bad code is to add this in the first
line of the main of a program:

p = malloc(1UL << 30);  /* 1G */
bzero(p, 1UL << 30);

you don't need 1 page per pmd to waste ram. Should we also write a
kernel feature that checks if the page is zero and drops it so the above
won't swap etc..?

If you can come up with a real life example where the 1 page per pmd
scattered over 1T of address space (we're talking about the file here of
course, the on disk representation of the data) is the very best design
possible ever (without any concept of locality at all) and it speeds up
things of orders of magnitude not to have any locality at all,
especially for the huge seeking it will generate no matter what the
pagetable side is, you will then change my mind about it.

Andrea

2003-07-03 19:18:31

by Rik van Riel

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, 3 Jul 2003, Andrea Arcangeli wrote:

> that's the very old exploit that touches 1 page per pmd.

> if you can't use a sane design it's not a kernel issue.

Agreed, the kernel shouldn't have to go out of its way to
give these applications good performance. On the other
hand, I think the kernel should be able to _survive_
applications like this, reclaiming page tables when needed.

--
Great minds drink alike.

2003-07-03 20:02:02

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 11:53:41AM -0700, William Lee Irwin III wrote:
>> No, it is not true that pagetable space can be wantonly wasted
>> on 64-bit.
>> Try mmap()'ing something sufficiently huge and accessing on average
>> every PAGE_SIZE'th virtual page, in a single-threaded single process.
>> e.g. various indexing schemes might do this. This is 1 pagetable page
>> per page of data (worse if shared), which blows major goats.

On Thu, Jul 03, 2003 at 09:27:50PM +0200, Andrea Arcangeli wrote:
> that's the very old exploit that touches 1 page per pmd.

It's not an exploit. It's called "random access" or a "sparse mapping".


On Thu, Jul 03, 2003 at 11:53:41AM -0700, William Lee Irwin III wrote:
>> There's a reason why those things use inverted pagetables... at any
>> rate, compacting virtualspace with remap_file_pages() solves it too.
>> Large pages won't help, since the data isn't contiguous.

On Thu, Jul 03, 2003 at 09:27:50PM +0200, Andrea Arcangeli wrote:
> if you can't use a sane design it's not a kernel issue. this is bad
> userspace code seeking like crazy on disk too, working around it with a
> kernel feature sounds worthless. If algorithms have no locality at all,
> and they spread 1 page per pmd that's their problem.

Hashtables and B-trees are sane designs in my book.

The entire point of the exercise on 64-bit is to allow indexing
algorithms _some_ method of restoring virtual locality so as to
cooperate with the kernel and not throw away vast amounts of space
on pagetables.

Locality of reference in itself is typically already present in the
data access patterns.


On Thu, Jul 03, 2003 at 09:27:50PM +0200, Andrea Arcangeli wrote:
> the easiest way to waste ram with bad code is to add this in the first
> line of the main of a program:
> p = malloc(1UL << 30);  /* 1G */
> bzero(p, 1UL << 30);
> you don't need 1 page per pmd to waste ram. Should we also write a
> kernel feature that checks if the page is zero and drops it so the above
> won't swap etc..?

I don't know what kind of moron you take me for but I don't care to be
patronized like that.

There is a very, very large difference between utter idiocy like the
above and sparse mappings. And there is a _much_ larger difference
when you add in direct cooperation from userspace to conserve resources
otherwise consumed by sparse mappings by compacting virtualspace with
remap_file_pages() and bullshit like implicitly checking if user pages
are 0 before trying to swap them out.

Color me thoroughly disgusted.


On Thu, Jul 03, 2003 at 09:27:50PM +0200, Andrea Arcangeli wrote:
> If you can come up with a real life example where the 1 page per pmd
> scattered over 1T of address space (we're talking about the file here of
> course, the on disk representation of the data) is the very best design
> possible ever (without any concept of locality at all) and it speeds up
> things of orders of magnitude not to have any locality at all,
> especially for the huge seeking it will generate no matter what the
> pagetable side is, you will then change my mind about it.

It doesn't even need to be an artifact of a design like a hash table or
a B-tree. It can be totally linear and 1:1 but the inputs will mandate
random access.

This entire tangent is ridiculous; the entire counterargument centers
around some desire not to go through the straightforward steps to
support a preexisting feature on the pretext of some notion that
cooperating with the kernel so as not to throw away vast amounts of
space on pagetables and/or vma's is an invalid application behavior.


-- wli

2003-07-03 22:00:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 11:54:31AM -0700, William Lee Irwin III wrote:
> Andrea Arcangeli wrote:
> >> but I didn't hear any emulator developer ask for this feature yet
>
> On Thu, Jul 03, 2003 at 07:48:25PM +0100, Jamie Lokier wrote:
> > No, but there was a meek request to get writable/read-only protection
> > working with remap_file_pages, so that a garbage collector can change
> > protection on individual pages without requiring O(nr_pages) vmas.
> > Perhaps that should have nothing to do with remap_file_pages, though.
>
> I call that application #2.

maybe I'm missing something but protections have nothing to do with
remap_file_pages IMHO. That's all about teaching the swap code to
reserve more bits in the swap entry and to store the protections there
and possibly teaching the page fault not to get confused. It might
prefer to use the populate callback too to avoid specializing the
pte_none case, but I think the syscall should be different, and it
shouldn't have anything to do with the nonlinearity (nor with rmap).

Andrea

2003-07-03 22:07:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 11:54:31AM -0700, William Lee Irwin III wrote:
>> I call that application #2.

On Thu, Jul 03, 2003 at 09:33:28PM +0200, Andrea Arcangeli wrote:
> maybe I'm missing something but protections have nothing to do with
> remap_file_pages IMHO. That's all about teaching the swap code to
> reserve more bits in the swap entry and to store the protections there
> and possibly teaching the page fault not to get confused. It might
> prefer to use the populate callback too to avoid specializing the
> pte_none case, but I think the syscall should be different, and it
> shouldn't have anything to do with the nonlinearity (nor with rmap).

It's obvious what to do about protections.


-- wli

2003-07-03 22:50:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 12:06:58PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > Yet another issue is that mlock at max locks in half of the physical
> > ram,
>
> I deleted that bit.

that's ok with me, I'm not going to deadlock my machine with it anyways ;).

Andrea

2003-07-04 00:25:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 01:16:07PM -0700, William Lee Irwin III wrote:
> I don't know what kind of moron you take me for but I don't care to be
> patronized like that.

If it's such a strong feature, go ahead and show me a patch to a real
life application (there are plenty of things using hashes or btrees,
feel free to choose the one that according to you will behave closest to
the "exploit" [1]) using remap_file_pages to avoid the pte overhead and
show the huge improvement in the numbers compared to changing the design
of the code to have some sort of locality in the I/O access patterns
(NOTE: I don't care about ram, I care about speed). Given how hard you
advocate for this I assume you at least expect a 1% improvement, right?
Once you make the patch I can volunteer to benchmark it if you don't
have time or hardware for that. After you made the patch and you showed
a >=1% improvement by not keeping the file mapped linearly, but by
mapping it nonlinearly using remap_file_pages, I reserve myself to fix
the app to have some sort of locality of information so that the I/O
side will be able to get a boost too.

the fact is, no matter the VM side, your app has no way to nearly
perform in terms of I/O seeks if you're filling a page per pmd due the
huge seek it will generate with the major faults. And even the minor
faults if has no locality at all and it seeks all over the place in a
non predictable manner, the tlb flushes will kill performance compared
to keeping the file mapped statically, and it'll make it even slower
than accessing a new pte every time.

Until you produce practical results IMHO the usage you advocated to use
remap_file_pages to avoid doing big linear mappings that may allocate
more ptes, sounds completely vapourware overkill overdesign that won't
last past emails. All in my humble opinion of course. I've no problem
to be wrong, I just don't buy what you say since it is not obvious at
all given the huge cost of entering exiting kernel, reaching the
pagetable in software, mangling them, flushing the tlb (on all common
archs that I'm assuming this doesn't only mean to flush a range but to
flush it all but it'd be slower even with a range-flush operation),
compared to doing nothing with a static linear mapping (regardless the
fact there are more ptes with a big linear mapping, I don't care to save
ram).

If you really go to change the app to use remap_file_pages, rather than
just compact the vm side with remap_file_pages (which will waste lots of
cpu power and it'll run slower IMHO), you'd better introduce locality
knowledge so the I/O side will have a slight chance to perform too and
the VM side will be improved as well, potentially also sharing the same
page, not only the same pmd (and after you do that if you really need to
save ram [not cpu] you can munmap/mmap at the same cost but this is just
a side note, I said I don't care to save ram, I care to perform the
fastest). reiserfs and other huge btree users have to do this locality
stuff all the time with their trees, for example to avoid a directory to
be completely scattered everywhere in the tree and in turn triggering
an huge amount of I/O seeks that may not even fit in buffercache. w/o
the locality information there would be no way for reiserfs to perform
with big filesystems and many directories, this is just the simplest
example I can think of huge btrees that we use everyday.

Again, I don't care about saving ram, we're talking 64bit, I care about
speed, I hope I already made this clear enough in the previous email.

My arguments all sound pretty straightforward to me.

Andrea

[1] I called it an exploit because it was posted originally on bugtraq a
number of years ago, the pmd weren't reclaimed, and Linus fixed it (IIRC
in 2.3) by freeing the pmds when a PGD_SIZE range was completely
released. Of course yours isn't an exploit but it just behaves like
that by wasting lots of ram with pmds compared to the actual mapped
pages.

2003-07-04 00:32:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 03:21:13PM -0700, William Lee Irwin III wrote:
> On Thu, Jul 03, 2003 at 11:54:31AM -0700, William Lee Irwin III wrote:
> >> I call that application #2.
>
> On Thu, Jul 03, 2003 at 09:33:28PM +0200, Andrea Arcangeli wrote:
> > maybe I'm missing something but protections have nothing to do with
> > remap_file_pages IMHO. That's all about teaching the swap code to
> > reserve more bits in the swap entry and to store the protections there
> > and possibly teaching the page fault not to get confused. It might
> > prefer to use the populate callback too to avoid specializing the
> > pte_none case, but I think the syscall should be different, and it
> > shouldn't have anything to do with the nonlinearity (nor with rmap).
>
> It's obvious what to do about protections.

so you agree it'd better be a separate syscall, also given it seems
the current remap_file_pages api in 2.5 seems unfortunately already
frozen since I think it's wrong as it should only work on VM_NONLINEAR
vmas, it's very unclean to allow remap_file_pages to mangle whatever vma
out there despite it has to deal with truncate etc.. I think the minimum
required change to the API is to add a MAP_NONLINEAR that converts in
kernel space to a VM_NONLINEAR. You allocate the mapping with
mmap(MAP_NONLINEAR) and only then remap_file_pages will work. This
solves all the current breakages (and it'll be trivial to skip over
VM_NONLINEAR in the 2.4 vm too). (then there's the rmap/mlock/munlock
issue but that's an implementation issue non visible from userspace
[modulo security with the sysctl], this one instead is a API bug IMHO
and it'd better be fixed before people put the backport in production)
Again, all in my humble opinion.
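
In code, the proposed convention would look roughly like this from
userspace (MAP_NONLINEAR is the flag proposed above, not an existing
one, and the value below is made up):

#define _GNU_SOURCE
#include <sys/mman.h>

#ifndef MAP_NONLINEAR
#define MAP_NONLINEAR 0x10000   /* hypothetical value for the proposed flag */
#endif

/* only a vma created like this would accept remap_file_pages(); a plain
 * mmap() without the flag would refuse it */
static void *nonlinear_window(int fd, size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NONLINEAR, fd, 0);
}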

Andrea

2003-07-04 01:22:03

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 03:21:13PM -0700, William Lee Irwin III wrote:
>> It's obvious what to do about protections.

On Fri, Jul 04, 2003 at 02:46:41AM +0200, Andrea Arcangeli wrote:
> so you agree it'd better be a separate syscall, also given it seems
> the current remap_file_pages api in 2.5 seems unfortunately already
> frozen since I think it's wrong as it should only work on VM_NONLINEAR
> vmas, it's very unclean to allow remap_file_pages to mangle whatever vma
> out there despite it has to deal with truncate etc.. I think the minimum
> required change to the API is to add a MAP_NONLINEAR that converts in
> kernel space to a VM_NONLINEAR. You allocate the mapping with
> mmap(MAP_NONLINEAR) and only then remap_file_pages will work. This
> solves all the current breakages (and it'll be trivial to skip over
> VM_NONLINEAR in the 2.4 vm too). (then there's the rmap/mlock/munlock
> issue but that's an implementation issue non visible from userspace
> [modulo security with the sysctl], this one instead is a API bug IMHO
> and it'd better be fixed before people put the backport in production)

sys_chattr_file_pages() etc. sounds fine to me; GC's would love it.

I'll tackle the rest in another (somewhat more inflammatory) post.

-- wli

2003-07-04 01:20:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

Andrea Arcangeli wrote:
> so you agree it'd better be a separate syscall

Per-page protections might be workable just through mremap(). As you
say, it's just a matter of appropriate bits in the swap entry. To
userspace it is a transparent performance improvement.

Unfortunately without an appropriate bit in the pte too, that
restricts per-page protections to work only with shared mappings, or
anon mappings which have not been forked, due to the lack of COW. It
would still be a good optimisation, although it would be a shame if,
say, a GC implementation of malloc et al. (eg. Boehm's allocator)
would not be transparent over fork().

-- Jamie

2003-07-04 01:32:31

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> If it's such a strong feature, go ahead and show me a patch to a real
> life application (there are plenty of things using hashes or btrees,
> feel free to choose the one that according to you will behave closest to
> the "exploit" [1]) using remap_file_pages to avoid the pte overhead and
> show the huge improvement in the numbers compared to changing the design
> of the code to have some sort of locality in the I/O access patterns
> (NOTE: I don't care about ram, I care about speed). Given how hard you
> advocate for this I assume you at least expect a 1% improvement, right?
> Once you make the patch I can volunteer to benchmark it if you don't
> have time or hardware for that. After you made the patch and you showed
> a >=1% improvement by not keeping the file mapped linearly, but by
> mapping it nonlinearly using remap_file_pages, I reserve myself to fix
> the app to have some sort of locality of information so that the I/O
> side will be able to get a boost too.

You're obviously determined to write the issue the thing addresses
out of the cost/benefit analysis. This is tremendously obvious;
conserve RAM and you don't go into page (or pagetable) replacement.

I'm sick of hearing how it's supposedly legitimate to explode when
presented with random access patterns (this is not the only instance
of such an argument).

And I'd love to write an app that uses it for this; unfortunately for
both of us, I'm generally fully booked with kernelspace tasks.


On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> the fact is, no matter the VM side, your app has no way to nearly
> perform in terms of I/O seeks if you're filling a page per pmd due the
> huge seek it will generate with the major faults. And even the minor
> faults if has no locality at all and it seeks all over the place in a
> non predictable manner, the tlb flushes will kill performance compared
> to keeping the file mapped statically, and it'll make it even slower
> than accessing a new pte every time.

What minor faults? remap_file_pages() does pagecache lookups and
instantiates the ptes directly in the system call.

Also, the assumption of being seek-bound assumes a large enough cache
turnover rate to resort to io often. If such were the case it would be
unoptimizable apart from doubling the amount of pagecache it's feasible
to cache (which would provide only a slight advantage in such a case).

The obvious, but unstated assumption I made is that the virtual arena
(and hence physical pagecache backing it) is large enough to provide a
high cache hit rate to the mapped on-disk data structures. And the
virtualspace compaction is in order to prevent pagetables from
competing with pagecache.


On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> Until you produce practical results IMHO the usage you advocated to use
> remap_file_pages to avoid doing big linear mappings that may allocate
> more ptes, sounds completely vapourware overkill overdesign that won't
> last past emails. All in my humble opinion of course. I've no problem
> to be wrong, I just don't buy what you say since it is not obvious at
> all given the huge cost of entering exiting kernel, reaching the
> pagetable in software, mangling them, flushing the tlb (on all common
> archs that I'm assuming this doesn't only mean to flush a range but to
> flush it all but it'd be slower even with a range-flush operation),
> compared to doing nothing with a static linear mapping (regardless the
> fact there are more ptes with a big linear mapping, I don't care to save
> ram).

Space _does_ matter. Everywhere. All the time. And it's not just
virtualspace this time. Throw away space, and you go into page or
pagetable replacement.

I'd love to write an app that uses it properly to conserve space. And
saving space is saving time. Time otherwise spent in page replacement
and waiting on io.


On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> If you really go to change the app to use remap_file_pages, rather than
> just compact the vm side with remap_file_pages (which will waste lots of
> cpu power and it'll run slower IMHO), you'd better introduce locality
> knowledge so the I/O side will have a slight chance to perform too and
> the VM side will be improved as well, potentially also sharing the same
> page, not only the same pmd (and after you do that if you really need to
> save ram [not cpu] you can munmap/mmap at the same cost but this is just
> a side note, I said I don't care to save ram, I care to perform the
> fastest). reiserfs and other huge btree users have to do this locality
> stuff all the time with their trees, for example to avoid a directory to
> be completely scattered everywhere in the tree and in turn triggering
> an huge amount of I/O seeks that may not even fit in buffercache. w/o
> the locality information there would be no way for reiserfs to perform
> with big filesystems and many directories, this is just the simplest
> example I can think of huge btrees that we use everyday.

Filesystems don't use mmap() to simulate access to the B-trees, they
deal with virtual discontiguity with lookup structures. They essentially
are scattered on-disk, though not so greatly as general lookup
structures would be. The VM space cost of using mmap() on such on-disk
structures is what's addressed by this API.

Also, mmap()/munmap() do not have equivalent costs to remap_file_pages();
they do not instantiate ptes like remap_file_pages() and hence accesses
incur minor faults. So it also addresses minor fault costs. There is no
API not requiring privileges that speculatively populates ptes of a
mapping and would prevent minor faults like remap_file_pages().

Another advantage of the remap_file_pages() approach is that the
virtual arena can be made somewhat smaller than the actual pagecache,
which allows the VM to freely reclaim the cold bits of the cache and by
so doing not overcompete with other applications. i.e. I take the exact
opposite tack: apps should not hog every resource of the machine they
can for raw speed but rather be "good citizens".

In principle, there are other ways to do these things in some cases,
e.g. O_DIRECT. It's not truly an adequate substitute, of course, since
not only is the app then forced to deal with on-disk coherency, the
mappings aren't shared with pagecache (i.e. evictable via mere writeback)
and the cost of io is incurred for each miss of the process' cache of
the on-disk data.


On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> Again, I don't care about saving ram, we're talking 64bit, I care about
> speed, I hope I already made this clear enough in the previous email.

This is a very fundamentally mistaken set of priorities.


On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> My arguments all sound pretty straightforward to me.

Sorry, this line of argument is specious.

As for the security issue, I'm not terribly interested, as rlimits on
numbers of processes and VSZ suffice (to get an idea of how much you've
entitled a user to, check RLIMIT_NPROC*RLIMIT_AS*sizeof(pte_t)/PAGE_SIZE).
This is primarily resource scalability and functionality, not
benchmark-mania or security.
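
For a feel of the numbers (made-up limits): RLIMIT_NPROC=256,
RLIMIT_AS=1 GiB, sizeof(pte_t)=8 and PAGE_SIZE=4096 give

    256 * (1 GiB / 4096) * 8 bytes = 256 * 2 MiB = 512 MiB

of pte space a single user is implicitly entitled to.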


-- wli

2003-07-04 02:25:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
> On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> > the fact is, no matter the VM side, your app has no way to nearly
> > perform in terms of I/O seeks if you're filling a page per pmd due the
> > huge seek it will generate with the major faults. And even the minor
> > faults if has no locality at all and it seeks all over the place in a
> > non predictable manner, the tlb flushes will kill performance compared
> > to keeping the file mapped statically, and it'll make it even slower
> > than accessing a new pte every time.
>
> What minor faults? remap_file_pages() does pagecache lookups and

with minor faults I meant the data in cache, and you calling
remap_file_pages to make it visible to the app. with minor faults I
meant only the cpu cost of the remap_file_pages operation, without the
need of hitting on the disk with ->readpage.

> On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> > Until you produce practical results IMHO the usage you advocated to use
> > remap_file_pages to avoid doing big linear mappings that may allocate
> > more ptes, sounds completely vapourware overkill overdesign that won't
> > last past emails. All in my humble opinion of course. I've no problem
> > to be wrong, I just don't buy what you say since it is not obvious at
> > all given the huge cost of entering exiting kernel, reaching the
> > pagetable in software, mangling them, flushing the tlb (on all common
> > archs that I'm assuming this doesn't only mean to flush a range but to
> > flush it all but it'd be slower even with a range-flush operation),
> > compared to doing nothing with a static linear mapping (regardless the
> > fact there are more ptes with a big linear mapping, I don't care to save
> > ram).
>
> Space _does_ matter. Everywhere. All the time. And it's not just
> virtualspace this time. Throw away space, and you go into page or
> pagetable replacement.

So you admit you're wrong and that your usage is only meant to save
ram and that given enough ram your usage of remap_file_pages only wastes
cpu? Note that I gave as a strong assumption the non-interest in saving
ram. You're now saying that the benefit of your usage is to save ram.

Sure it will save ram.

But if your object is to save ram you're still wrong, it's an order of
magnitude more efficient to use shm and to load your scattered data with
aio + O_DIRECT/rawio to do the I/O, so you will also be able to
trivially use largepages and other things w/o messing the pagecache with
largepage knowledge, and furthermore without ever needing a tlb flush or a
pte mangling (pte mangling assuming you don't use largepages, pmd
mangling if you use largepages).
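
The shape of that approach, very roughly (a synchronous pread() stands
in for real aio to keep the sketch short; "bigfile.dat" and the sizes
are made up, the alignment is the usual O_DIRECT requirement):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t chunk = 1UL << 20;   /* 1M, a multiple of the block size */
    void *buf;

    /* the app manages its own cache, e.g. a big shm/largepage arena;
     * here just an aligned private buffer */
    if (posix_memalign(&buf, 4096, chunk))
        return 1;

    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* pull the scattered data in directly, bypassing the pagecache;
     * a real app would drive many of these with aio */
    if (pread(fd, buf, chunk, 0) < 0)
        return 1;

    close(fd);
    free(buf);
    return 0;
}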

So your usage of remap_file_pages is worthless always.

In short:

1) if you want to save space it's the wrong design since it's an order of
magnitude slower than using direct-io on shared memory (with all the advantage
that it provides in terms of largepages and tlb preservation)

2) if you don't need to save space then you'd better not waste time with
remap_file_pages and mmap the file linearly (my whole argument for all
my previous emails) And if something if you go to rewrite the app to
coalesce the VM side, you'd better coalesce the I/O side instead, then
the vm side will be free with the linear mapping with huge performance
advantages thanks to the locality knowledge (much more significant than
whatever remap_file_pages improvement, especially for the I/O side with
the major faults that will for sure happen at least once during startup,
often the disks are much bigger than ram). Of course the linear mapping
also involves trusting the OS paging mechanism to unmap pages
efficiently, while you do it by hand, and the OS can be much smarter
and efficient than duplicating VM management algorithms in every
userspace app using remap_file_pages. The OS will only unmap the
overflowing pages of data. the only requirement is to have ram >=
sizeofdisk >> 12 (this wouldn't even be a requirement with Rik's
pmd reclaim).

> On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> > If you really go to change the app to use remap_file_pages, rather than
> > just compact the vm side with remap_file_pages (which will waste lots of
> > cpu power and it'll run slower IMHO), you'd better introduce locality
> > knowledge so the I/O side will have a slight chance to perform too and
> > the VM side will be improved as well, potentially also sharing the same
> > page, not only the same pmd (and after you do that if you really need to
> > save ram [not cpu] you can munmap/mmap at the same cost but this is just
> > a side note, I said I don't care to save ram, I care to perform the
> > fastest). reiserfs and other huge btree users have to do this locality
> > stuff all the time with their trees, for example to avoid a directory to
> > be completely scattered everywhere in the tree and in turn triggering
> > an huge amount of I/O seeks that may not even fit in buffercache. w/o
> > the locality information there would be no way for reiserfs to perform
> > with big filesystems and many directories, this is just the simplest
> > example I can think of huge btrees that we use everyday.
>
> Filesystems don't use mmap() to simulate access to the B-trees, they

One of the reasons I did the blkdev in pagecache in 2.4 is that the
parted developer asked me to use a _linear_ mmap to make the VM
management an order of magnitude more efficient for parsing the fs.
Otherwise he never knows how much ram it can allocate and during boot
the ram may be limited. mmap solves this completely leaving the decision
to the VM and it basically gives no limit in how low memory a system can
be to run parted.

Your saying that the in-kernel fs don't use mmap is a red-herring, of
course I know about that but it's totally unrelated to my point. The
only reason they do it is for the major faults (they can't care less if
they get the data through a linear mapping or a bh or a bio or a
pagecache entry). But of course if you add some locality the vm side
will be better too with mmap. I mean, there are worse problems than the
vm side if you constantly seek all over the place.

And if you always seek reproducibly in the same places scattered
everywhere, then you can fix it on the design application side likely.

And anyways the linear mapping will always be faster than
remap_file_pages, at worst you need pagetable reclaim if you can't live
with ram >= btreesize >> 12. And if you really want to save pagetable
ram you must use shm + aio/o_direct, remap_file_pages is worthless.

The only reason remap_file_pages exists is to have a window on the
SHM!!!! On the 64bit this makes no sense at all. It would make no sense
at all to take a window on the shm.

> Also, mmap()/munmap() do not have equivalent costs to remap_file_pages();
> they do not instantiate ptes like remap_file_pages() and hence accesses

mlock instantiates ptes like remap_file_pages.

> incur minor faults. So it also addresses minor fault costs. There is no
> API not requiring privileges that speculatively populates ptes of a
> mapping and would prevent minor faults like remap_file_pages().

minor faults means nothing, the minor faults happens the first time only
and the first time you've the major fault too. It happens to you that
minor faults matter because you go mess the vm of the task with
remap_file_pages.

> Another advantage of the remap_file_pages() approach is that the
> virtual arena can be made somewhat smaller than the actual pagecache,

it's all senseless, you have to use shm if you don't want to incur
the huge penalty of the pte mangling and tlb flushing.

> In principle, there are other ways to do these things in some cases,
> e.g. O_DIRECT. It's not truly an adequate substitute, of course, since

the coherency has to be handled anyways, reads are never a problem for
coherency, writes are a problem regardless. if a write happens under you
in the file where the btree is stored you'll have fun w/ or w/o
O_DIRECT.

> not only is the app then forced to deal with on-disk coherency, the
> mappings aren't shared with pagecache (i.e. evictable via mere writeback)
> and the cost of io is incurred for each miss of the process' cache of
> the on-disk data.

the only difference is that if memory goes low you will need swap space
if you don't want to run out of ram. No other difference. The
remap_file_pages approach would be able to be paged out to disk rather
than to swap. This is absolutely the only difference and given the
advantage that shm provides (no need of tlb flushes and trivial
utilization of largepages) there's no way you can claim remap_file_pages
superior to shm + O_DIRECT/rawio also given you always need aio anyways
and there's no way to deal with aio with remap_file_pages.

In short:

1) remap_file_pages will never take anything but shm as argument and as
such it is only useful in two cases:
a) 32bit with shmfs files larger than virtual address space of the task
b) emulators (they need to map the stuff at fixed offsets
hardcoded into the asm, so they can't use a linear mapping)
2) if you've to access data efficiently you've two ways in 64bit archs:
a) trust the kernel and go with linear mapping, kernel
will do paging and everything for you with hopefully the
smartest algorithm efficiently in the background, it knows
what you don't need and with Rik's proposal it can even
discard null pmds

this way you let all the kernel heuristics to do their
work, they know nothing about the semantics of the data
so it's all best effort. This is the best design for
misc apps that doesn't need to be super tuned.
b) you choose not to trust the kernel and to build your own
view on the files, so you use shm + aio + rawio/O_DIRECT
so you get largepages w/o messing the fs, and you never
flush the tlb or mangle pagetables, this will always be
faster than remap_file_pages

disavantage is that if you need to swap the shm you will
fall into the swap space

however when you fall into 2b (and not 2a) you basically are
at a level of optimization where you want to take over the machine
entirely so usually no swap is needed and you can't use
remap_file_pages or mmap anyways since you need aio and
largepages and no tlb flushes etc..


normal apps should take 2a

big irons running on millions of installations should take 2b

> On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> > Again, I don't care about saving ram, we're talking 64bit, I care about
> > speed, I hope I already made this clear enough in the previous email.
>
> This is a very fundamentally mistaken set of priorities.

that was my basic assumption to make everything simpler, and to make
obvious that remap_file_pages can only hurt performance in that case.

Anyways it doesn't matter, it's still the fastest even when you start to
care about space, and remap_file_pages makes no sense if run on a
regular filesystem.

> On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> > My arguments all sound pretty straightforward to me.
>
> Sorry, this line of argument is specious.

It's so straightforward that it's incidentally how all applications are
coded today, and in no way can remap_file_pages give any performance or
space advantage (except for mapping the shm that can't be mapped all at
the same time in 32bit systems)



Andrea

2003-07-04 03:56:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> What minor faults? remap_file_pages() does pagecache lookups and

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> with minor faults I meant the data in cache, and you calling
> remap_file_pages to make it visible to the app. with minor faults I
> meant only the cpu cost of the remap_file_pages operation, without the
> need of hitting on the disk with ->readpage.

Well, those aren't minor faults. That's prefaulting.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> Space _does_ matter. Everywhere. All the time. And it's not just
>> virtualspace this time. Throw away space, and you go into page or
>> pagetable replacement.

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> So you admit you're wrong and that your usage is only meant to save
> ram and that given enough ram your usage of remap_file_pages only wastes
> cpu? Note that I gave as a strong assumption the non-interest in saving
> ram. You're now saying that the benefit of your usage is to save ram.
> Sure it will save ram.

That was its explicit purpose for 32-bit and it's usable for that
same purpose on 64-bit. You're citing the precise scenario where
remap_file_pages() is used on 32-bit!


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> But if your object is to save ram you're still wrong, it's an order of
> magnitude more efficient to use shm and to load your scattered data with
> aio + O_DIRECT/rawio to do the I/O, so you will also be able to
> trivially use largepages and other things w/o messing the pagecache with
> largepage knowledge, and furthermore without ever needing a tlb flush or a
> pte mangling (pte mangling assuming you don't use largepages, pmd
> mangling if you use largepages).
> So your usage of remap_file_pages is worthless always.

Incorrect. First, the above scenario is precisely where databases are
using remap_file_pages() on 32-bit. Second, the above circumvents the
pagecache and on machines with limited resources incurs swap overhead
as opposed to the data cache being reclaimable via simple VM writeback.
Third, it requires an application to perform its own writeback, which
is a transparency deficit in the above API's (though for database
engines it's fine as they would very much like to be aware of such
things and to more or less have the entire machine dedicated to them).


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> In short:
> 1) if you want to save space it's the wrong design since it's an order of
> magnitude slower than using direct-io on shared memory (with all the advantage
> that it provides in terms of largepages and tlb preservation)

I already countered this argument in the earlier post, even before
you made it.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> 2) if you don't need to save space then you'd better not waste time with
> remap_file_pages and mmap the file linearly (my whole argument for all
> my previous emails) And if something if you go to rewrite the app to
> coalesce the VM side, you'd better coalesce the I/O side instead, then
> the vm side will be free with the linear mapping with huge performance
> advantages thanks to the locality knowledge (much more significant than
> whatever remap_file_pages improvement, especially for the I/O side with
> the major faults that will for sure happen at least once during startup,
> often the disks are much bigger than ram). Of course the linear mapping
> also involves trusting the OS paging mechanism to unmap pages
> efficiently, while you do it by hand, and the OS can be much smarter
> and efficient than duplicating VM management algorithms in every
> userspace app using remap_file_pages. The OS will only unmap the
> overflowing pages of data. the only requirement is to have ram >=
> sizeofdisk >> 12 (this wouldn't even be a requirement with Rik's
> pmd reclaim).

That's quite some limitation!

At any rate, as unimpressed as I was with the other arguments, this
really fails to impress me as you're directly arguing for enforcing
one of the resource scalability limitation remap_file_pages() lifts.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> Filesystems don't use mmap() to simulate access to the B-trees, they

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> One of the reasons I did the blkdev in pagecache in 2.4 is that the
> parted developer asked me to use a _linear_ mmap to make the VM
> management an order of magnitude more efficient for parsing the fs.
> Otherwise he never knows how much ram it can allocate and during boot
> the ram may be limited. mmap solves this completely leaving the decision
> to the VM and it basically gives no limit in how low memory a system can
> be to run parted.

There's nothing wrong with linearly mapping something when it
doesn't present a problem (e.g. pagetable space out the wazoo).


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> Your saying that the in-kernel fs don't use mmap is a red-herring, of
> course I know about that but it's totally unrelated to my point. The
> only reason they do it is for the major faults (they can't care less if
> they get the data through a linear mapping or a bh or a bio or a
> pagecache entry). But of course if you add some locality the vm side
> will be better too with mmap. I mean, there are worse problems than the
> vm side if you constantly seek all over the place.

What on earth? Virtual locality is exactly what remap_file_pages()
restores to otherwise sparse access. Also c.f. the prior post as to
maintaining a large enough cache so as not to be seek-bound. If your
cache isn't getting a decent number of hits, you're dead with or
without remap_file_pages(); it can only save you from the contents of
it not being file offset contiguous. And it has to be that much
better of a cache for not being so.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> And if you always seek reproducibly in the same places scattered
> everywhere, then you can fix it on the design application side likely.

That's fine. If you really get that many seeks, you need to fix it.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> And anyways the linear mapping will always be faster than
> remap_file_pages, at worst you need pagetable reclaim if you can't live
> with ram >= btreesize >> 12. And if you really want to save pagetable
> ram you must use shm + aio/o_direct, remap_file_pages is worthless.

A linear mapping will be dog slow if an entire pagetable page must be
reconstituted for each minor fault and a minor fault is taken on a
large proportion of accesses (due to the pagetables needing to be GC'd
and page replacement evicting what could otherwise be used as cache),
which is effectively what you're proposing.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> The only reason remap_file_pages exists is to have a window on the
> SHM!!!! On the 64bit this makes no sense at all. It would make no sense
> at all to take a window on the shm.

It was what motivated its creation. It is more generally useful and
doesn't smack of a brutal database hack, which is why I don't find it
as distasteful as DSHM etc. from other kernels.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> Also, mmap()/munmap() do not have equivalent costs to remap_file_pages();
>> they do not instantiate ptes like remap_file_pages() and hence accesses

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> mlock instantiates ptes like remap_file_pages.

If you're going to do tit for tat at least read the rest of the
paragraph.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> incur minor faults. So it also addresses minor fault costs. There is no
>> API not requiring privileges that speculatively populates ptes of a
>> mapping and would prevent minor faults like remap_file_pages().

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> minor faults means nothing, the minor faults happens the first time only
> and the first time you've the major fault too. It happens to you that
> minor faults matter because you go mess the vm of the task with
> remap_file_pages.

This is incorrect. Minor faults are large overheads, especially when
they require pagetable pages to be instantiated. They are every bit as
expensive as interrupts (modulo drivers doing real device access there)
or int $0x80 -based system calls.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> Another advantage of the remap_file_pages() approach is that the
>> virtual arena can be made somewhat smaller than the actual pagecache,

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> it's all senseless, you have to use shm if you don't want to incur
> the huge penalty of the pte mangling and tlb flushing.

It is a time/space tradeoff. When the amount of virtualspace (and hence
pagetable space) required to map it linearly mandates mapping it
nonlinearly instead, it is called for. So have it there and just do it.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> In principle, there are other ways to do these things in some cases,
>> e.g. O_DIRECT. It's not truly an adequate substitute, of course, since

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> the coherency has to be handled anyways, reads are never a problem for
> coherency, writes are a problem regardless. if a write happens under you
> in the file where the btree is stored you'll have fun w/ or w/o
> O_DIRECT.

The coherency handling == the app has to do its own writeback. It is a
feature that since it's pagecache-coherent it automates that task for
the application. Other niceties are that it doesn't require tmpfs to
be mounted, doesn't bang against tmpfs resource limits with respect to
size, when paged out it's paged out to backing store, and so on.


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> not only is the app then forced to deal with on-disk coherency, the
>> mappings aren't shared with pagecache (i.e. evictable via mere writeback)
>> and the cost of io is incurred for each miss of the process' cache of
>> the on-disk data.

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> the only difference is that if memory goes low you will need swap space
> if you don't want to run out of ram. No other difference. The
> remap_file_pages approach would be able to be paged out to disk rather
> than to swap. This is absolutely the only difference and given the
> advantage that shm provides (no need of tlb flushes and trivial
> utilization of largepages) there's no way you can claim remap_file_pages
> superior to shm + O_DIRECT/rawio also given you always need aio anyways
> and there's no way to deal with aio with remap_file_pages.

It is a significant advantage. It's also trivial to write an aio
remap_file_pages() opcode. I've actually been muttering about doing so
for weeks and the specific application I've not been getting the time
to write wants it (along with pagecache coherency).


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> In short:
> 1) remap_file_pages will never take anything but shm as argument and as
> such it is only useful in two cases:
> a) 32bit with shmfs files larger than virtual address space of the task
> b) emulators (they need to map the stuff at fixed offsets
> hardcoded into the asm, so they can't use a linear mapping)

I'm with you on everything except shm; in fact, shm can never be used
directly with it (only through tmpfs) as it requires fd's to operate.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> 2) if you've to access data efficiently you've two ways in 64bit archs:
> a) trust the kernel and go with linear mapping, kernel
> will do paging and everything for you with hopefully the
> smartest algorithm efficiently in the background, it knows
> what you don't need and with Rik's proposal it can even
> discard null pmds
> this way you let all the kernel heuristics do their
> work, they know nothing about the semantics of the data
> so it's all best effort. This is the best design for
> misc apps that don't need to be super tuned.

Linear mappings have their uses. What kernel heuristics are these?
I know of none that would be affected or "fooled" by its use.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> b) you choose not to trust the kernel and to build your own
> view on the files, so you use shm + aio + rawio/O_DIRECT
> so you get largepages w/o messing the fs, and you never
> flush the tlb or mangle pagetables, this will always be
> faster than remap_file_pages
> disadvantage is that if you need to swap the shm you will
> fall into the swap space

This is a valid thing to do, especially for database engines. It's
not for everything, of course, especially things that don't want to
take over the entire machine.


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> however when you fall into 2b (and not 2a) you basically are
> at a level of optimization where you want to take over the machine
> entirely so usually no swap is needed and you can't use
> remap_file_pages or mmap anyways since you need aio and
> largepages and no tlb flushes etc..
> normal apps should take 2a
> big irons running on millions of installations should take 2b

Why is it inappropriate for normal apps that aren't going to take over
the entire machine to be space-efficient with respect to pagetables?


On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> that was my basic assumption to make everything simpler, and to make
> it obvious that remap_file_pages can only hurt performance in that case.
> Anyways it doesn't matter, it's still the fastest even when you start to
> care about space, and remap_file_pages makes no sense if run on a
> regular filesystem.

Well, first off, given its mechanics, the only way it functions at all
is on regular filesystems. Second, time is not the only metric of
performance. Third, the other suggestions essentially put the operations
required for performing the tasks forever out of the reach of ordinary
applications and so limit their functionality.

One of the big reasons I want it to be generally useful is because it's
going to require some effort to cope with in the VM internals no matter
what. And that effort is close to 99% wasted counting userbase-wise if
the only thing that can use it is Oracle (not to cast aspersions; it's
generally worthwhile to help Oracle out in various ways).


On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
>> Sorry, this line of argument is specious.

On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> It's so straightforward that it's incidentally how all applications are
> coded today and in no way remap_file_pages can give any performance or
> space advantage (except for mapping the shm that can't be mapped all at
> the same time in 32bit systems)

Well, the conclusion isn't effectively supported by the evidence.

It's clear it does what it does, has its various features, and has its
uses.

On 64-bit, if you need to mmap() something that you would bog the box
down with pagetables when doing it, it's how apps can cooperate and not
fill up RAM with redundant and useless garbage (pagetables with only
one pte apiece set in them). On 32-bit, it's how you can get at many
pieces of something simultaneously at all. We've already got it, let's
not castrate it before it ever ships. Heck, I could have written the
fixups for truncate() in the time it took to write all these messages.

How about this: I fix up truncate() and whatever else cares, and all
this hassle ends right then?

The opponents of the thing seem to be vastly overestimating its impact
on the VM AFAICT.


-- wli

2003-07-04 05:41:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
> On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
> >> What minor faults? remap_file_pages() does pagecache lookups and
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > with minor faults I meant the data in cache, and you calling
> > remap_file_pages to make it visible to the app. with minor faults I
> > meant only the cpu cost of the remap_file_pages operation, without the
> > need of hitting on the disk with ->readpage.
>
> Well, those aren't minor faults. That's prefaulting.

you still enter the kernel, just via the syscall as opposed to via an
exception as with the usual minor faults; the overhead should be similar.
Also note that the linear mapping would never need to enter the kernel after
the initial major fault (until you start doing paging because you run
out of resources, and that's managed by the vm at best w/o replication
of replacement algorithms in the user program).

> On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
> >> Space _does_ matter. Everywhere. All the time. And it's not just
> >> virtualspace this time. Throw away space, and you go into page or
> >> pagetable replacement.
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > So you admit you're wrong and that your usage is only meant to save
> > ram and that given enough ram your usage of remap_file_pages only wastes
> > cpu? Note that I given as strong assumption the non-interest in saving
> > ram. You're now saying that the benefit of your usage is to save ram.
> > Sure it will save ram.
>
> That was its explicit purpose for 32-bit and it's usable for that
> same purpose on 64-bit. You're citing the precise scenario where
> remap_file_pages() is used on 32-bit!

on 64bit you don't need remap_file_pages to keep all the shm mapped. I
said it a dozen times in the last emails.

> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > But if your object is to save ram you're still wrong, it's an order of
> > magnitude more efficient to use shm and to load your scattered data with
> > aio + O_DIRECT/rawio to do the I/O, so you will also be able to
> > trivially use largepages and other things w/o messing the pagecache with
> largepage knowledge, and furthermore without ever needing a tlb flush or a
> pte mangling (pte mangling assuming you don't use largepages, pmd
> mangling if you use largepages).
> > So your usage of remap_file_pages is worthless always.
>
> Incorrect. First, the above scenario is precisely where databases are
> using remap_file_pages() on 32-bit. Second, the above circumvents the

also in 64bit and they're right since they don't want to mangle
pagetables and flush tlb every time they have to look at a different
piece of the db. You have to if you use remap_file_pages on the mapped
"pool" of pages. The pool being the shm is the most efficient solution
to avoid flushing and replacing ptes all the time.
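
A rough sketch of this pattern (nothing here is from the thread; the file
name, sizes and slot choices are invented, and error handling is omitted):
map a fixed shared-memory pool once, then move B-tree pages in and out of it
with O_DIRECT pread()/pwrite(), so the pool's ptes are set up a single time
and no remapping or tlb flushing happens per lookup.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        size_t pool_size = 256 * psz;           /* fixed cache pool */

        /* Shared, page-aligned pool mapped once; its pagetables never change
         * afterwards.  A tmpfs file or SysV shm segment would do the same
         * job; anonymous shared memory keeps the sketch short. */
        char *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        int db = open("btree.db", O_RDWR | O_DIRECT);   /* hypothetical file */

        /* The application decides which file page lands in which pool slot
         * and drives the I/O itself: no pte mangling, no tlb flush per
         * lookup.  O_DIRECT wants block-aligned buffers and offsets; page
         * multiples satisfy that here. */
        pread(db, pool + 0 * psz, psz, (off_t)1000 * psz);
        pread(db, pool + 1 * psz, psz, (off_t)250000 * psz);

        /* ... lookups hit the pool; dirty slots go back out with pwrite() ... */

        close(db);
        munmap(pool, pool_size);
        return 0;
}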

> pagecache and on machines with limited resources incurs swap overhead
> as opposed to the data cache being reclaimable via simple VM writeback.

I addressed this in the last email.

> Third, it requires an application to perform its own writeback, which
> is a transparency deficit in the above API's (though for database
> engines it's fine as they would very much like to be aware of such
> things and to more or less have the entire machine dedicated to them).

This is my point. Either you go w/o the OS, or you trust the OS and you
take the linear map. The intermediate thing is not as efficient and
flexible as the shm and the linear mapping, and you need to take all the
difficult decisions about what has to be mapped and what not in
userspace anyways.

> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > In short:
> > 1) if you want to save space it's the wrong design since it's an order of
> > magnitude slower than using direct-io on shared memory (with all the advantage
> > that it provides in terms of largepages and tlb preservation)
>
> I already countered this argument in the earlier post, even before
> you made it.
>
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > 2) if you don't need to save space then you'd better not waste time with
> > remap_file_pages and mmap the file linearly (my whole argument for all
> > my previous emails). And if anything, if you go to rewrite the app to
> > coalesce the VM side, you'd better coalesce the I/O side instead, then
> > the vm side will be free with the linear mapping with huge performance
> > advantages thanks to the locality knowledge (much more significant than
> > whatever remap_file_pages improvement, especially for the I/O side with
> > the major faults that will for sure happen at least once during startup,
> > often the disks are much bigger than ram). Of course the linear mapping
> > also involves trusting the OS paging mechanism to unmap pages
> > efficiently, while you do it by hand, and the OS can be much smarter
> > and efficient than duplicating VM management algorithms in every
> > userspace app using remap_file_pages. The OS will only unmap the
> > overflowing pages of data. the only requirement is to have ram >=
> > sizeofdisk >> 12 (this wouldn't even be a requirement with Rik's
> > pmd reclaim).
>
> That's quite some limitation!
>
> At any rate, as unimpressed as I was with the other arguments, this
> really fails to impress me as you're directly arguing for enforcing
> one of the resource scalability limitation remap_file_pages() lifts.

I don't care about memory resources for the ptes. And just to make the
most obvious example you're not hitting any harddisk when you click "I'm
feeling lucky" in google.

And the limitation exists only because it's not a limitation and nobody
cares. The day somebody complains, zeroed pmds will be reclaimed and
it'll go away (as Rik suggested a dozen emails ago).

> On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
> >> Filesystems don't use mmap() to simulate access to the B-trees, they
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > One of the reasons I did the blkdev in pagecache in 2.4 is that the
> > parted developer asked me to use a _linear_ mmap to make the VM
> > management an order of magnitude more efficient for parsing the fs.
> > Otherwise he never knows how much ram it can allocate and during boot
> > the ram may be limited. mmap solves this completely leaving the decision
> > to the VM and it basically gives no limit in how low memory a system can
> > be to run parted.
>
> There's nothing wrong with linearly mapping something when it
> doesn't present a problem (e.g. pagetable space out the wazoo).

so then go ahead and fix it, if the exploit for you is still not running
well enough.

Linus already fixed the biggest issue with the exploit, what needed
to be fixed years ago; nobody ever complained again. That's the pmd
reclaim when the whole pgd is empty.

I would at least like to hear a real life request or bug report about
running out of ptes with the linear mapping or whatever real life usage,
before falling into what I consider worthless overengineering. At least
until somebody complains and asks for it. Especially given I can't see
any benefit.

> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > Your saying that the in-kernel fs don't use mmap is a red-herring, of
> > course I know about that but it's totally unrelated to my point. The
> > only reason they do it is for the major faults (they can't care less if
> > they get the data through a linear mapping or a bh or a bio or a
> > pagecache entry). But of course if you add some locality the vm side
> > will be better too with mmap. I mean, there are worse problems than the
> > vm side if you constantly seek all over the place.
>
> What on earth? Virtual locality is exactly what remap_file_pages()

I'm talking about I/O locality, on the file. This is why I said that
there are worse problems than the vm side, if you constantly seek with
the I/O.

> restores to otherwise sparse access. Also c.f. the prior post as to
> maintaining a large enough cache so as not to be seek-bound. If your

you also have to write the stuff, besides trying to avoid nearly unused
pagetables at the same time, locality always helps even with lots of
cache. They're not all reads.

> cache isn't getting a decent number of hits, you're dead with or
> without remap_file_pages(); it can only save you from the contents of
> it not being file offset contiguous. And it has to be that much
> better of a cache for not being so.
>
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > And if you always seek reproducibly in the same places scattered
> > everywhere, then you can fix it on the design application side likely.
>
> That's fine. If you really get that many seeks, you need to fix it.

thanks, at least we can agree on this.

>
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > And anyways the linear mapping will always be faster than
> > remap_file_pages, as worse you need pagetable reclaim if you can't live
> > with ram >= btreesize >> 12. And if you really want to save pagetable
> > ram you must use shm + aio/o_direct, remap_file_pages is worthless.
>
> A linear mapping will be dog slow if an entire pagetable page must be
> reconstituted for each minor fault and a minor fault is taken on a
> large proportion of accesses (due to the pagetables needing to be GC'd
> and page replacement evicting what could otherwise be used as cache),
> which is effectively what you're proposing.

minor faults should happen rarely with pure mmap I/O. Depends entirely
on the vm algorithms though. And the pageout will happen with
remap_file_pages with the difference that remap_file_pages forces the
user to code for a certain number of pagetables, which he can't know
about. So it's as hard as the shm + aio + largepages + rawio/O_DIRECT
from the end user's side and it needs more tuning as well. And it is slower
than the linear mapping due to the tlb flushing that may happen even if
you're not out of resources, if tuning isn't perfect for your hardware.

the only difference w.r.t. minor faults is the initial state post mmap
(vs post remap_file_pages), but nothing prevents me from adding a
sys_populate that, with a single kernel entry/exit overhead, will populate
all the needed ranges (you can pass an array of ranges to the kernel).
That would avoid the issue with mlock being privileged, and a single
call could load all the important regions of the data.
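
Since no such sys_populate exists (it is only proposed here), the hypothetical
helper below shows the per-page prefault loop userspace has to run instead,
which is exactly the work a batched call would fold into one kernel entry
(the struct and function names are invented for illustration):

#include <stddef.h>

struct prefault_range {                 /* hypothetical */
        volatile const char *start;
        size_t length;
};

/* Touch one byte per page in every range so the ptes are instantiated now
 * rather than on first real access.  A batched syscall would do the same
 * work with one kernel entry instead of one minor fault per page. */
static void populate_ranges(const struct prefault_range *r, size_t count,
                            size_t page_size)
{
        for (size_t i = 0; i < count; i++)
                for (size_t off = 0; off < r[i].length; off += page_size)
                        (void)r[i].start[off];  /* the read faults the page in */
}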

> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > The only reason remap_file_pages exists is to have a window on the
> > SHM!!!! On the 64bit this makes no sense at all. It would make no sense
> > at all to take a window on the shm.
>
> It was what motivated its creation. It is more generally useful and
> doesn't smack of a brutal database hack, which is why I don't find it
> as distasteful as DSHM etc. from other kernels.

database hack? Give me a faster database hack and I'm sure databases will
start using it immediately, no matter if you find it cleaner or not.

In this case you are trying to convince me of the superiority of a
solution that in your own opinion is "cleaner", but that's slower,
lacks aio and largepages, and it mangles pagetables and flushes tlbs for no
good reason at every new bit of the window watched from disk, and it can't
even control writeback in a fine-grained manner (again with aio). I'm
pretty sure nobody will care about it.

Oh and you need to control writeback anyway, with msync at a minimum, or
if the box crashes you'll find nothing on disk.

If you start dropping the disadvantages and you start adding at least one
advantage, there's a chance that your design solution will change from
inferior to superior to what you call the "database hack".

> This is incorrect. Minor faults are large overheads, especially when
> they require pagetable pages to be instantiated. They are every bit as
> expensive as interrupts (modulo drivers doing real device access there)
> or int $0x80 -based system calls.

and you suffer from them too, since remap_file_pages is pageable.

> I'm with you on everything except shm; in fact, shm can never be used
> directly with it (only through tmpfs) as it requires fd's to operate.

shm == shmfs == tmpfs, I only care about the internals in-kernel, I
don't mind which api you use, go with MAP_SHARED on /dev/zero if that's
what you like. Of course tmpfs is the cleanest to handle while coding.
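
A quick sketch of that equivalence (paths and sizes are illustrative, error
handling is omitted): three userspace routes that all end up as the same
shmem/tmpfs object inside the kernel.

#include <fcntl.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#define LEN (1 << 20)

int main(void)
{
        /* 1) SysV shm: shmget()/shmat(). */
        int id = shmget(IPC_PRIVATE, LEN, IPC_CREAT | 0600);
        void *a = shmat(id, NULL, 0);

        /* 2) A file on a tmpfs mount, mapped shared (mount point assumed). */
        int fd = open("/dev/shm/pool", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, LEN);
        void *b = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* 3) MAP_SHARED of /dev/zero: shared anonymous memory. */
        int z = open("/dev/zero", O_RDWR);
        void *c = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, z, 0);

        /* All three are backed by the same in-kernel shmem machinery. */
        (void)a; (void)b; (void)c;
        return 0;
}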

> This is a valid thing to do, especially for database engines. It's
> not for everything, of course, especially things that don't want to
> take over the entire machine.

the problem is that you have to take over the whole machine anyway with your
"pte saving" remap_file_pages: to know how many ptes you can allocate you
need all the hardware details, and your program won't be a generic, simple
and flexible libgdbm then.

Sure, you save ptes, but you waste tlb and forbid all the other
features that you're going to need if you do anything that can remotely
go near hitting a pte limitation issue at runtime.

> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > however when you fall into 2b (and not 2a) you basically are
> > at a level of optimization where you want to take over the machine
> > entirely so usually no swap is needed and you can't use
> > remap_file_pages or mmap anyways since you need aio and
> > largepages and no tlb flushes etc..
> > normal apps should take 2a
> > big irons running on millions of installations should take 2b
>
> Why is it inappropriate for normal apps that aren't going to take over
> the entire machine to be space-efficient with respect to pagetables?

because it's a kind of intermediate solution that isn't worth it; a linear
vma will be more efficient, and it is completely generic.

I ask you to find an app that runs out of ptes today, tell me the
bugzilla bug number on kernel.org, and show that remap_file_pages is what it
wants; IMHO, if anything, it likely wants the pte reclaim instead. Or it
wants to switch to using shm.

> what. And that effort is close to 99% wasted counting userbase-wise if

What do you think your pmdhighmem effort will be useful for, 10 years
from now? (I hope, otherwise we'd be screwed anyways moving everything
into highmem, not just the pte and pmd ;) It's the same logic, temporary
32bit hacks useful for the short term.

Are you going to implement highmem on 64bit archs so it won't be a
wasted effort? I think it's the same problem (modulo the emulator case
that IMHO should be ok with the sysctl too, the VM bypass as a concept
sounds clean enough to me). I also think I forgot why the emulator isn't
using the more efficient solution of shm + O_DIRECT.

> On Thu, Jul 03, 2003 at 06:46:24PM -0700, William Lee Irwin III wrote:
> >> Sorry, this line of argument is specious.
>
> On Fri, Jul 04, 2003 at 04:34:14AM +0200, Andrea Arcangeli wrote:
> > It's so straightforward that it's incidentally how all applications are
> > coded today and in no way remap_file_pages can give any performance or
> > space advantage (except for mapping the shm that can't be mapped all at
> > the same time in 32bit systems)
>
> Well, the conclusion isn't effectively supported by the evidence.
>
> It's clear it does what it does, has its various features, and has its
> uses.

also pmdhighmem can have uses on 64bit archs, it will work but it
doesn't mean it's useful.

>
> On 64-bit, if you need to mmap() something that you would bog the box
> down with pagetables when doing it, it's how apps can cooperate and not
> fill up RAM with redundant and useless garbage (pagetables with only
> one pte a piece set in them). On 32-bit, it's how you can get at many
> pieces of something simultaneously at all. We've already got it, let's
> not castrate it before it ever ships. Heck, I could have written the
> fixups for truncate() in the time it took to write all these messages.
>
> How about this: I fix up truncate() and whatever else cares, and all
> this hassle ends right then?

I would certainly like to see the truncate fixed.

>
> The opponents of the thing seem to be vastly overestimating its impact
> on the VM AFAICT.

Well, just the fact it has any impact at all is not a good thing IMHO.

Anyways, even if we've no time for benchmarks now, in a few years we'll
see who's right, I'm sure if it's useful as you claim we'll have to see
all the big apps switching to it in their 64bit ports. My bet is that
it will never happen unless something substantial changes in its
features, because in its current 2.5 form it provides many disavantages
and zero advantages as far as I can tell (especially if you want mlock
and not remap_file_pages to be the way to destroy the rmap pte_chains).
Even in 2.4 it is almost not worth the effort w/o rmap. It would be very
interesting to generate some numbers to see -aa w/ remap_file_pages and
w/o, in fact (2.5 is unusable in some high end setups due to rmap currently;
yeah, I know you want to drop rmap via mlock and to use PAM to raise
privileges so you can call mlock on the shm, instead of implementing
the trivial, clean and non VM-visible VM bypass + sysctl; what's the 2.6
planned release date btw? end of August, right? [I hope :) ])

Andrea

2003-07-04 06:40:30

by Zack Weinberg

[permalink] [raw]
Subject: Garbage collectors and VM (was Re: What to expect with the 2.6 VM)


> No, but there was a meek request to get writable/read-only protection
> working with remap_file_pages, so that a garbage collector can change
> protection on individual pages without requiring O(nr_pages) vmas.
> Perhaps that should have nothing to do with remap_file_pages, though.

I have an old design for this lying around from early 2.4 days. I
never got anywhere with it, but maybe it's of interest... My tests,
back then, indicated that fully half the overhead of write-barrier
handling was in signal delivery. So I wanted to avoid that, as well
as having to split vmas endlessly. I also didn't want to add new
syscalls if it could be avoided. Thus, a new pseudo-device, with the
semantics:

* mmapping it creates anonymous pages, just like /dev/zero.
* Data written to the file descriptor is interpreted as a list of
user-space pointers to pages. All the pages pointed to, that
are anonymous pages created by mmapping that same descriptor,
become read-only.
* But when the program takes a write fault to such a page, the
kernel simply records the user-space address of that page,
resets it to read-write, and restarts the faulting instruction.
The user space process doesn't get a signal.
* Reading from the descriptor produces a list of user-space pointers
to all the pages that have been reset to read-write since the last
read.
* I never decided what to do if the program forks. The application I
personally care about doesn't do that, but for a general GC like
Boehm it matters.
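
A purely hypothetical usage sketch following the semantics above (the device
node name and every detail are invented, since nothing like this exists yet;
error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_PAGES 1024

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        int gc = open("/dev/gcbarrier", O_RDWR);        /* invented node name */

        /* mmapping the device yields anonymous pages, like /dev/zero. */
        char *heap = mmap(NULL, HEAP_PAGES * psz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, gc, 0);

        /* After a collection, write the addresses of clean pages to the fd;
         * the kernel flips them read-only without splitting vmas. */
        void *protect[2] = { heap, heap + psz };
        write(gc, protect, sizeof(protect));

        /* The mutator dirties a page; the kernel records its address, resets
         * it to read-write and restarts the store, with no signal sent. */
        heap[123] = 1;

        /* At the next collection, read back the pages dirtied since the last
         * read. */
        void *dirty[HEAP_PAGES];
        ssize_t n = read(gc, dirty, sizeof(dirty));
        (void)n;                /* n / sizeof(void *) pages were dirtied */

        return 0;
}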

Thoughts?

Please cc: me, I'm not subscribed to l-k.

zw

2003-07-04 08:01:24

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> Well, those aren't minor faults. That's prefaulting.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> you still enter the kernel, just via the syscall as opposed to via an
> exception as with the usual minor faults; the overhead should be similar.
> Also note that the linear mapping would never need to enter the kernel after
> the initial major fault (until you start doing paging because you run
> out of resources, and that's managed by the vm at best w/o replication
> of replacement algorithms in the user program).

remap_file_pages() prefaults a range, not a single pte.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> on 64bit you don't need remap_file_pages to keep all the shm mapped. I
> said it a dozen times in the last emails.

That's not what it's used for on 64-bit. shm segments are far too small
anyway; they're normally smaller than RAM. Obviously the cost of
pagetables is insignificant until much larger areas (by a factor of
at least PAGE_SIZE/sizeof(pte_t) and usually much larger in the case of
sparse mappings) are mapped.
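
To put rough numbers on that factor, a back-of-the-envelope sketch using
illustrative 64-bit values (4 KiB pages, 8-byte ptes; nothing here is
measured):

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096, pte_size = 8;
        unsigned long ptes_per_page = page_size / pte_size;     /* 512 */

        /* A fully used pagetable page spans this much virtual space: */
        unsigned long span = ptes_per_page * page_size;         /* 2 MiB */

        printf("dense mapping: one 4 KiB pagetable page per %lu KiB mapped "
               "(%.2f%% overhead)\n", span / 1024, 100.0 * page_size / span);

        /* Worst-case sparse mapping: one resident page per pagetable page,
         * i.e. 4 KiB of pagetable per 4 KiB of data, 100% overhead. */
        printf("sparse worst case: %lu bytes of pagetable per %lu bytes of data\n",
               page_size, page_size);
        return 0;
}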


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> Incorrect. First, the above scenario is precisely where databases are
>> using remap_file_pages() on 32-bit. Second, the above circumvents the

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> also in 64bit and they're right since they don't want to mangle
> pagetables and flush tlb every time they have to look at a different
> piece of the db. You have to if you use remap_file_pages on the mapped
> "pool" of pages. The pool being the shm is the most efficient solution
> to avoid flushing and replacing ptes all the time.

Anything with intelligence would treat the virtualspace as a cache of
mappings and so won't incur that overhead unless wildly thrashing,
and that's no different on 32-bit or 64-bit.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> pagecache and on machines with limited resources incurs swap overhead
>> as opposed to the data cache being reclaimable via simple VM writeback.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> I addressed this in the last email.

Where I refuted it.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> Third, it requires an application to perform its own writeback, which
>> is a transparency deficit in the above API's (though for database
>> engines it's fine as they would very much like to be aware of such
>> things and to more or less have the entire machine dedicated to them).

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> This is my point. Either you go w/o the OS, or you trust the OS and you
> take the linear map. The intermediate thing is not as efficient and
> flexible as the shm and the linear mapping, and you need to take all the
> difficult decisions about what has to be mapped and what not in
> userspace anyways.

The app is trusting the kernel regardless unless it's running in
supervisor mode; this is merely another memory mapping API.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> That's quite some limitation!
>> At any rate, as unimpressed as I was with the other arguments, this
>> really fails to impress me as you're directly arguing for enforcing
>> one of the resource scalability limitation remap_file_pages() lifts.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> I don't care about memory resources for the ptes. And just to make the
> most obvious example you're not hitting any harddisk when you click "I'm
> feeling lucky" in google.

Well it seems you either don't understand pagetable space issues or
deliberately refuse to acknowledge they exist. Which seems highly
unusual for the originator of pte-highmem (which IIRC shoved pmd's
into highmem too =).

Denying the issue exists isn't going to make it go away.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> And the limitation exists only because it's not a limitation and nobody
> cares. The day somebody complains, zeroed pmds will be reclaimed and
> it'll go away (as Rik suggested a dozen emails ago).

PTE garbage collection's inefficiency is rather obvious. It's necessary
for correctness, not desirable (apart from correctness being desirable
and the PTE's being such worthless garbage you'd much rather dump them).


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> There's nothing wrong with linearly mapping something when it
>> doesn't present a problem (e.g. pagetable space out the wazoo).

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> so then go ahead and fix it, if the exploit for you is still not running
> well enough.

What the hell? This isn't an exploit. This is the kernel giving
userspace a way to carry out perfectly valid userspace farting around
without degenerating into butt slow uselessness.

Virtual mappings are not free. Virtual mappings cost physical memory.
remap_file_pages() is a more space-efficient virtual mapping API
(which is also the 32-bit case -- there it can actually squeeze into
the tiny virtualspace where linear mappings cannot).

There's nothing particularly hard to understand about this.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Linus already fixed the biggest issue with the exploit, what needed
> to be fixed years ago; nobody ever complained again. That's the pmd
> reclaim when the whole pgd is empty.

That's not the biggest issue, though it does count as a space leak.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> I would at least like to hear a real life request or bug report about
> running out of ptes with the linear mapping or whatever real life usage,
> before falling into what I consider worthless overengineering. At least
> until somebody complains and asks for it. Especially given I can't see
> any benefit.

Already exists and has since something like 2.4.0, and on a 64-bit
platform at that. I'll see what I can fish out of internal sources.

For fsck's sake, "worthless overengineering"? This is approaching
trolling. The thing is literally already in the kernel and I'm
literally arguing nothing more than preserving its current semantics.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> What on earth? Virtual locality is exactly what remap_file_pages()

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> I'm talking about I/O locality, on the file. This is why I said that
> there are worse problems than the vm side, if you constantly seek with
> the I/O.

You said something about VM heuristics. But anyway.

The memory footprint is actually irrelevant to this; it'll seek
regardless of whether the mapping is linear or not as it will access
the same data regardless. With a smaller pagetable overhead, it can
cache more data.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> restores to otherwise sparse access. Also c.f. the prior post as to
>> maintaining a large enough cache so as not to be seek-bound. If your

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> you also have to write the stuff, besides trying to avoid nearly unused
> pagetables at the same time, locality always helps even with lots of
> cache. They're not all reads.

But the linear mapping does nothing to address the file offset locality.
The access pattern with respect to file offset is identical in both
cases.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> A linear mapping will be dog slow if an entire pagetable page must be
>> reconstituted for each minor fault and a minor fault is taken on a
>> large proportion of accesses (due to the pagetables needing to be GC'd
>> and page replacement evicting what could otherwise be used as cache),
>> which is effectively what you're proposing.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> minor faults should happen rarely with pure mmap I/O. Depends entirely
> on the vm algorithms though. And the pageout will happen with
> remap_file_pages with the difference that remap_file_pages forces the
> user to code for a certain number of pagetables, which he can't know
> about. So it's as hard as the shm + aio + largepages + rawio/O_DIRECT
> from the end user's side and it needs more tuning as well. And it is slower
> than the linear mapping due to the tlb flushing that may happen even if
> you're not out of resources, if tuning isn't perfect for your hardware.

Minor faults rare on sparse linear mappings? NFI what you're smoking
but pass it here, I could use something to cheer me up.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> the only difference w.r.t. minor faults is the initial state post mmap
> (vs post remap_file_pages), but nothing prevents me from adding a
> sys_populate that, with a single kernel entry/exit overhead, will populate
> all the needed ranges (you can pass an array of ranges to the kernel).
> That would avoid the issue with mlock being privileged, and a single
> call could load all the important regions of the data.

On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> It was what motivated its creation. It is more generally useful and
>> doesn't smack of a brutal database hack, which is why I don't find it
>> as distasteful as DSHM etc. from other kernels.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> database hack? Give me a faster database hack and I'm sure databases will
> start using it immediately, no matter if you find it cleaner or not.

A moment ago remap_file_pages() was slow, now it's fast. Which is it?


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> In this case you are trying to convince me of the superiority of a
> solution that in your own opinion is "cleaner", but that's slower,
> lacks aio and largepages, and it mangles pagetables and flushes tlbs for no
> good reason at every new bit of the window watched from disk, and it can't
> even control writeback in a fine-grained manner (again with aio). I'm
> pretty sure nobody will care about it.

Where on earth are aio and large pages coming into the picture?

Well, anyway, I can burn a few minutes and bang out remap_file_pages()
for hugetlbfs and an aio remap_file_pages() opcode to silence this one
when I'm done with the truncate() bug.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Oh and you need to control writeback anyway, with msync at a minimum, or
> if the box crashes you'll find nothing on disk.

If the box crashes you're in deep shit regardless of API.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> If you start dropping the disadvantages and you start adding at least one
> advantage, there's a chance that your design solution will change from
> inferior to superior to what you call the "database hack".

Presumably you're going on about the unimplemented space conservation
features. I'll have to bang them out to silence this kind of crap.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> This is incorrect. Minor faults are large overheads, especially when
>> they require pagetable pages to be instantiated. They are every bit as
>> expensive as interrupts (modulo drivers doing real device access there)
>> or int $0x80 -based system calls.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> and you suffer from them too, since remap_file_pages is pageable.

But since it's conserved RAM, it won't need to be paged out.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> I'm with you on everything except shm; in fact, shm can never be used
>> directly with it (only through tmpfs) as it requires fd's to operate.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> shm == shmfs == tmpfs, I only care about the internals in-kernel, I
> don't mind which api you use, go with MAP_SHARED on /dev/zero if that's
> what you like. Of course tmpfs is the cleanest to handle while coding.

But those _are_ the internals, the userspace-visible API's are vastly
different (hence the distinction I made).


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> This is a valid thing to do, especially for database engines. It's
>> not for everything, of course, especially things that don't want to
>> take over the entire machine.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> the problem is that you have to take over the whole machine anyway with your
> "pte saving" remap_file_pages: to know how many ptes you can allocate you
> need all the hardware details, and your program won't be a generic, simple
> and flexible libgdbm then.

Not at all. All that information is unprivileged and available from
/proc/ and other sources. Being self-tuning and adapting to the
surroundings based on unprivileged information is nothing remotely
like taking over the machine at all.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Sure, you save ptes, but you waste tlb and forbid all the other
> features that you're going to need if you do anything that can remotely
> go near hitting a pte limitation issue at runtime.

You do need enough virtualspace to have an effective cache.

I'm unconvinced of the alternative pte mitigation strategy argument:

(a) x86 and/or that disgusting attempt to immortalize it are the only
ones that truncate the radix tree with large pages (worse yet
they actually interpret the FPOS's in hardware), and even worse
yet, the approach risks severe internal cache fragmentation.

(b) shared pagetables require something to share with;
the assumption all along here is a single process

(c) sure, GC'ing pagetables will make it feasible, but it just turns
from a kernel correctness issue to an API efficiency issue

It's a different situation, which wants a different technique used
to cope with and/or optimize it.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> Why is it inappropriate for normal apps that aren't going to take over
>> the entire machine to be space-efficient with respect to pagetables?

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> because it's a kind of intermediate solution that isn't worth it; a linear
> vma will be more efficient, and it is completely generic.
> I ask you to find an app that runs out of ptes today, tell me the
> bugzilla bug number on kernel.org, and show that remap_file_pages is what it
> wants; IMHO, if anything, it likely wants the pte reclaim instead. Or it
> wants to switch to using shm.

I'll see if I can get the internal sources to ship a writeup to
bugme.osdl.org. IIRC this is truly ancient (and on 64-bit) so I may
have to find different internal contacts from the original ones.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> what. And that effort is close to 99% wasted counting userbase-wise if

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> What do you think your pmdhighmem effort will be useful for, 10 years
> from now? (I hope, otherwise we'd be screwed anyways moving everything
> into highmem, not just the pte and pmd ;) It's the same logic, temporary
> 32bit hacks useful for the short term.

The 32-bit machines won't ever change. It will be every bit as useful
on 32-bit machines 10 years from now as it is now, not to say they'll
be much more than legacy systems at that point. But legacy systems
aren't unimportant.

This hardware switcharoo argument is utter bullshit; all the grossly
offensive attempts to immortalize x86 in the world won't fix a single
bug that prevents my SS1 from booting and so on and so forth.

Unfortunately there's nothing obvious I can do to shut up the
disseminators of this insidious tripe.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Are you going to implement highmem on 64bit archs so it won't be a
> wasted effort? I think it's the same problem (modulo the emulator case

No. This is a patently ridiculous notion. One architecture's needs for
core kernel support are generally unrelated to another's.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> that IMHO should be ok with the sysctl too, the VM bypass as a concept
> sounds clean enough to me). I also think I forgot why the emulator isn't
> using the more efficient solution of shm + O_DIRECT.

If you need a VM bypass, the VM isn't good enough.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> Well, the conclusion isn't effectively supported by the evidence.
>> It's clear it does what it does, has its various features, and has its
>> uses.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> also pmdhighmem can have uses on 64bit archs, it will work but it
> doesn't mean it's useful.

AFAICT highpmd is just source-level noise for 64-bit arches. The only
potential use would be establishing virtual contiguity for pmd's, but
IMHO that technique is not useful in Linux and appears to have a very
obvious poor interaction with the core VM that prevents it from ever
being an optimization.


On Thu, Jul 03, 2003 at 09:10:48PM -0700, William Lee Irwin III wrote:
>> The opponents of the thing seem to be vastly overestimating its impact
>> on the VM AFAICT.

On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Well, just the fact it has any impact at all is not a good thing IMHO.

It's a feature, it has to have _some_ impact.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Anyways, even if we've no time for benchmarks now, in a few years we'll
> see who's right, I'm sure if it's useful as you claim we'll have to see
> all the big apps switching to it in their 64bit ports. My bet is that
> it will never happen unless something substantial changes in its
> features, because in its current 2.5 form it provides many disadvantages
> and zero advantages as far as I can tell (especially if you want mlock
> and not remap_file_pages to be the way to destroy the rmap pte_chains).

This is something of an excessively high hurdle. It has uses, but it's
not a Swiss army knife.

As for the pte_chain savings, your entire argument was to bolt mlock()
semantics onto remap_file_pages() to accomplish (you guessed it!) the
exact same thing, in almost the same way, even.


On Fri, Jul 04, 2003 at 07:54:26AM +0200, Andrea Arcangeli wrote:
> Even in 2.4 it is almost not worth the effort w/o rmap. It would be very
> interesting to generate some numbers to see -aa w/ remap_file_pages and
> w/o, in fact (2.5 is unusable in some high end setups due to rmap currently;
> yeah, I know you want to drop rmap via mlock and to use PAM to raise
> privileges so you can call mlock on the shm, instead of implementing
> the trivial, clean and non VM-visible VM bypass + sysctl; what's the 2.6
> planned release date btw? end of August, right? [I hope :) ])

And the benchmarks have already been done, and they showed rather hefty
speedups over mmap(). And that was with pte_chains and all.

As for 2.5 being unusable in high-end setups, I run it on 32GB i386
daily and test on 64GB i386 every once in a while, albeit with some
patches, and run and optimize various kinds of benchmarks there.


-- wli

2003-07-04 11:53:10

by Jamie Lokier

[permalink] [raw]
Subject: Re: Garbage collectors and VM (was Re: What to expect with the 2.6 VM)

Zack Weinberg wrote:
> Thus, a new pseudo-device, with the semantics:

I like it! I'd use it, too.

> * mmapping it creates anonymous pages, just like /dev/zero.
> * Data written to the file descriptor is interpreted as a list of
> user-space pointers to pages. All the pages pointed to, that
> are anonymous pages created by mmapping that same descriptor,
> become read-only.
> * But when the program takes a write fault to such a page, the
> kernel simply records the user-space address of that page,
> resets it to read-write, and restarts the faulting instruction.
> The user space process doesn't get a signal.
> * Reading from the descriptor produces a list of user-space pointers
> to all the pages that have been reset to read-write since the last
> read.

Would it be appropriate to have a limit on the number of pages which
become writable before a signal is delivered instead of continuing to
make more pages writable? Just like the kernel, sometimes it's good to
limit the number of dirty pages in flight in userspace, too.

> * I never decided what to do if the program forks. The application I
> personally care about doesn't do that, but for a general GC like
> Boehm it matters.

It should clone the state, obviously, and COW should be invisible to
the application :)

Btw, are these pages swappable?

On a different but related topic, it would be most cool if there were
a way for the kernel to request memory to be released from a userspace
GC, prior to swapping the GC's memory. Currently the best strategy is
for each GC to guess how much of the machine's RAM it can use, however
this is not a good strategy if you wish to launch multiple programs
each of which has its own GC, nor is it a particularly good balance
between GC application pages and other page-cache pages.

-- Jamie

2003-07-04 15:59:09

by Zack Weinberg

[permalink] [raw]
Subject: Re: Garbage collectors and VM

Jamie Lokier <[email protected]> writes:

> Zack Weinberg wrote:
>> Thus, a new pseudo-device, with the semantics:
>
> I like it! I'd use it, too.

Thanks. Implementing it is way beyond me, but it's good to hear that
other people might find it useful.

>> * Reading from the descriptor produces a list of user-space pointers
>> to all the pages that have been reset to read-write since the last
>> read.
>
> Would it be appropriate to have a limit on the number of pages which
> become writable before a signal is delivered instead of continuing to
> make more pages writable? Just like the kernel, sometimes it's good to
> limit the number of dirty pages in flight in userspace, too.

Maybe. The GC I wanted to use this with is rather constrained in some
ways - it can't walk the stack, for instance - so it can only collect
when the mutator tells it it's okay. I suppose it could just copy the
list to its own buffer and continue, which saves the kernel from
having to store a potentially very large list.

>> * I never decided what to do if the program forks. The application I
>> personally care about doesn't do that, but for a general GC like
>> Boehm it matters.
>
> It should clone the state, obviously, and COW should be invisible to
> the application :)

Well, yeah, that would be the cleanest thing, but it does require two
dirty bits.

> Btw, are these pages swappable?

They certainly ought to be.

> On a different but related topic, it would be most cool if there were
> a way for the kernel to request memory to be released from a userspace
> GC, prior to swapping the GC's memory. Currently the best strategy is
> for each GC to guess how much of the machine's RAM it can use, however
> this is not a good strategy if you wish to launch multiple programs
> each of which has its own GC, nor is it a particularly good balance
> between GC application pages and other page-cache pages.

Again, this is not particularly useful to me since the collector can't
collect at arbitrary points - but it would be a good thing to have for
a general GC, and maybe I ought to be using a general GC anyway...

zw

2003-07-04 23:30:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

I can't disagree more but I don't have any more time to argue with you.
Especially your paragraph where you mention pte-highmem wasn't worth an
answer. If you for instance could raise one single thing that makes me
realize I'm missing something, you would have a chance of changing my
mind. In the meantime it simply seems I fail to communicate with you and
the more posts, the less I find those posts informative and useful.

So to avoid endless non-informative emails I'm only interested in
mentioning numbers or things where the English language is not involved, so
there's no risk of disagreeing anymore.

Since we seem not to have time to hack on a userspace app that you
consider to have "pagetable" problems, just tell me a date
(1/1/2005 or 1/1/2010? whatever you want) when you expect a number of
significant 64bit programs (1/5/10/200?) to use remap_file_pages with
"rather hefty speedups over mmap" (1%/5%/10%/200%?) on 64bit archs and
we'll check how many are using it and what kind of improvements they get
out of it. This is the only thing I care about. Excluding emulators of
course.

My forecast is the number of 64bit apps will be 0, without date limit
and the improvement will be of the order of -1/2%.

If it generates as you claim "rather hefty speedups over mmap" on 64bit
archs (with rather hefty I hope you mean something more than 1%), you
must expect at least a dozen (or at the very least 2/3) of those 64bit
apps to be converted to remap_file_pages before 2005 to save the
pagetable overhead, right? I would also expect many excited
userspace developers to post here agreeing with you, claiming that
they're looking forward to dropping mmap and to start using
remap_file_pages so they can save pagetables, despite the fact that they'll generate
tlb flushes and pte mangling and they'll enter the kernel all the time, and
they'll have to reinvent some part of the page replacement of the vm to
choose the parameters of remap_file_pages in their 64bit apps. I'm
talking about 64bit only of course. We all agree remap_file_pages makes
perfect sense in 32bit archs when used on top of shmfs (especially to
get rid of rmap in the 2.4 backports).

Also note that by 2005 I guarantee mmap will work in O(log(N)) always
(also w/o MAP_FIXED, at worst in 2.7, but in fact it might happen even in
2.4) and likely by 2005 the unused pagetables might be garbage collected
too, should anybody think remap_file_pages could help them to drop
pagetable overhead.

If you're not comfortable giving at least a very conservative forecast
for a "rather hefty speedups over mmap" feature like:

1/1/2005 - 2/3 apps - >1% improvement

then you also know you are wrong in my vocabulary but you fail to admit
it.

Also please tell me a deadline by which you will have mlock dropping rmap
merged in mainline and some user tool to raise privileges via PAM to
run mlock as a single user, so that 2.5 will be able to run as well as 2.4
for some critical app.

If it's me doing the work you already know what I would do (VM bypass
for remap_file_pages, no slowdown or additional complexity in munlock,
and a single sysctl to manage security if swap is non null), so I think
you prefer me to stay away from remap_file_pages in 2.5. I disagree with
the direction you're taking for 2.5, but since I'll never call into those
syscalls on my boxes, I don't mind as long as the rest of the system
isn't slowed down or unstable, so I'm fine with you doing what you want in
that area.

Andrea

2003-07-04 23:49:37

by William Lee Irwin III

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Sat, Jul 05, 2003 at 01:44:32AM +0200, Andrea Arcangeli wrote:
> I can't disagree more but I don't have any more time to argue with you.
> Especially your paragraph where you mention pte-highmem wasn't worth an
> answer. If you for instance could raise one single thing that makes me
> realize I'm missing something, you would have a chance of changing my
> mind. In the meantime it simply seems I fail to communicate with you and
> the more posts, the less I find those posts informative and useful.

The feeling on this thread is mutual. Best to cut it off. If I try to
answer other points in here it'll just get worse.

Sorry we couldn't come to a productive conclusion on this issue.


-- wli

2003-07-04 23:54:15

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: What to expect with the 2.6 VM

On Fri, Jul 04, 2003 at 05:05:18PM -0700, William Lee Irwin III wrote:
> The feeling on this thread is mutual. Best to cut it off. If I try to

well, at least we agree on this ;)

Andrea