2002-10-22 17:38:26

by Ingo Molnar

Subject: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


the attached patch (on top of 2.5.44-mm2) implements generic (swappable!)
nonlinear mappings and sys_remap_file_pages() support. I.e. no more
MAP_LOCKED restrictions and strange pagefault semantics.
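
a minimal usage sketch of what this enables (illustration only, not part
of the patch; it assumes a remap_file_pages() wrapper with the syscall's
argument order (start, size, prot, pgoff, flags), and error checking is
omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* assumed userspace wrapper for the new syscall: */
extern int remap_file_pages(unsigned long start, unsigned long size,
			    unsigned long prot, unsigned long pgoff,
			    unsigned long flags);

int main(void)
{
	unsigned long win;
	int fd = open("datafile", O_RDWR);

	/* one linear window, one single vma: */
	win = (unsigned long) mmap(NULL, 4 * 4096, PROT_READ | PROT_WRITE,
				   MAP_SHARED, fd, 0);

	/*
	 * rearrange pages inside the window purely via pagetables:
	 * window page 0 now shows file page 3, window page 1 shows
	 * file page 7. no new vmas are created, and with this patch
	 * the mappings survive swapout.
	 */
	remap_file_pages(win, 4096, 0, 3, 0);
	remap_file_pages(win + 4096, 4096, 0, 7, 0);

	close(fd);
	return 0;
}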

to implement this i added a new pte concept: "file ptes". This means that
upon swapout, the ptes of shared named mappings do not get cleared but
get converted into file ptes, which can then be decoded by the pagefault
path and looked up in the pagecache.
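
in a nutshell, the mechanism looks like this (simplified sketch; the real
code is in the mm/rmap.c and mm/memory.c hunks below):

	/* swapout, in try_to_unmap_one(): encode where the page
	 * lives instead of just clearing the pte: */
	if (PageSwapCache(page)) {
		/* swap-backed page: store the swap slot */
		swp_entry_t entry = { .val = page->index };
		swap_duplicate(entry);
		set_pte(ptep, swp_entry_to_pte(entry));
	} else {
		/* pagecache page: store the file offset */
		set_pte(ptep, pgoff_to_pte(page->index));
	}

	/* pagefault, in handle_pte_fault(): a non-present pte with
	 * _PAGE_FILE set gets decoded and looked up in the pagecache
	 * via ->populate(): */
	if (pte_file(entry))
		return do_file_page(mm, vma, address, write_access, pte, pmd);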

the normal linear pagefault path from now on does not assume linearity and
decodes the offset in the pte. This also tests pte encoding/decoding in
the pagecache case, and the ->populate functions.

the offset is stored in the raw pte the following way, for x86 32-bit
ptes:

#define pte_to_pgoff(pte) \
((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))

#define pgoff_to_pte(off) \
((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })

the encoding is much simpler in the 64-bit x86 pte (PAE-highmem) case:

#define pte_to_pgoff(pte) ((pte).pte_high)
#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) })
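
as a quick sanity check of the 2-level encoding, here is a userspace
sketch (illustration only) that models pte_low as a plain 32-bit word and
verifies the encode/decode roundtrip over all 29 offset bits:

#include <assert.h>
#include <stdint.h>

#define _PAGE_FILE 0x040	/* bit 6, as in the patch */

/* same bit layout as the 2-level macros above, on a bare uint32_t */
static uint32_t pgoff_to_pte_low(uint32_t off)
{
	return ((off & 0x1f) << 1) | ((off >> 5) << 8) | _PAGE_FILE;
}

static uint32_t pte_low_to_pgoff(uint32_t pte_low)
{
	return ((pte_low >> 1) & 0x1f) | ((pte_low >> 8) << 5);
}

int main(void)
{
	uint32_t off;

	/* bits 0 (present), 6 (_PAGE_FILE) and 7 stay out of the
	 * offset; 5 + 24 = 29 offset bits survive the roundtrip */
	for (off = 0; off < (1UL << 29); off++)
		assert(pte_low_to_pgoff(pgoff_to_pte_low(off)) == off);
	return 0;
}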

in the 32-bit pte (4K page) case the max offset (i.e. max file size)
supported is 2 TB: 29 offset bits plus the 12-bit page shift gives 2^41
bytes. In the PAE case it's the former 16 TB limit imposed by the 32-bit
page->index (can be much more if/when page->index is enlarged).

the swap pte needs one more bit (besides _PAGE_PROTNONE) to tell pagecache
and swapcache ptes apart, but fortunately _PAGE_FILE does not affect swap
device limits: the __swp_type() field shrinks from 6 bits to 5, but
MAX_SWAPFILES is 32 anyway.
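
for reference, the resulting non-present 32-bit pte layouts (summarizing
the pgtable.h hunks below):

	swap pte:   bit 0: 0 (not present)   bits 1-5:  swap type (32 types)
	            bit 6: 0 (_PAGE_FILE)    bit 7:     reserved (_PAGE_PROTNONE)
	            bits 8-31: swap offset

	file pte:   bit 0: 0 (not present)   bits 1-5:  pgoff bits 0-4
	            bit 6: 1 (_PAGE_FILE)    bit 7:     unused (0)
	            bits 8-31: pgoff bits 5-28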

NOTE: sys_remap_file_pages() currently ignores the 'prot' parameter and
uses the default one in the vma, but i'm planning, as a next step, to
extend these new pte encodings with protection bits as well, which also
enables an mprotect() implementation that does not cause vmas to be split
up (yummie!).

NOTE2: anonymous mappings are not random-remappable yet; that will be
another future step.

the patch also cleans up some aspects of nonlinear vmas.

i have tested the patch quite extensively, on both nohighmem and
highmem-64GB (PAE), under heavy swapping load, but with such an intrusive
patch there might be some details missing.

Ingo

--- linux/include/linux/mm.h.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/include/linux/mm.h 2002-10-22 17:08:28.000000000 +0200
@@ -130,7 +130,7 @@
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
- int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, unsigned long prot, unsigned long pgoff, int nonblock);
+ int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
};

/* forward declaration; pte_chain is meant to be internal to rmap.c */
@@ -373,7 +373,7 @@
extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
-extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, unsigned long prot);
+extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
--- linux/include/linux/swapops.h.orig 2002-09-20 17:20:13.000000000 +0200
+++ linux/include/linux/swapops.h 2002-10-22 19:20:00.000000000 +0200
@@ -51,6 +51,7 @@
{
swp_entry_t arch_entry;

+ BUG_ON(pte_file(pte));
arch_entry = __pte_to_swp_entry(pte);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
@@ -64,5 +65,6 @@
swp_entry_t arch_entry;

arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+ BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
return __swp_entry_to_pte(arch_entry);
}
--- linux/include/asm-i386/pgtable.h.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/include/asm-i386/pgtable.h 2002-10-22 19:12:58.000000000 +0200
@@ -121,6 +121,7 @@
#define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */

+#define _PAGE_FILE 0x040 /* pagecache or swap? */
#define _PAGE_PROTNONE 0x080 /* If not present */

#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
@@ -199,6 +200,7 @@
static inline int pte_dirty(pte_t pte) { return (pte).pte_low & _PAGE_DIRTY; }
static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; }
+static inline int pte_file(pte_t pte) { return (pte).pte_low & _PAGE_FILE; }

static inline pte_t pte_rdprotect(pte_t pte) { (pte).pte_low &= ~_PAGE_USER; return pte; }
static inline pte_t pte_exprotect(pte_t pte) { (pte).pte_low &= ~_PAGE_USER; return pte; }
@@ -306,7 +308,7 @@
#define update_mmu_cache(vma,address,pte) do { } while (0)

/* Encode and de-code a swap entry */
-#define __swp_type(x) (((x).val >> 1) & 0x3f)
+#define __swp_type(x) (((x).val >> 1) & 0x1f)
#define __swp_offset(x) ((x).val >> 8)
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 1) | ((offset) << 8) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
--- linux/include/asm-i386/pgtable-2level.h.orig 2002-10-22 19:12:15.000000000 +0200
+++ linux/include/asm-i386/pgtable-2level.h 2002-10-22 19:13:16.000000000 +0200
@@ -63,4 +63,14 @@
#define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
#define pfn_pmd(pfn, prot) __pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot))

+/*
+ * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * into this range:
+ */
+#define pte_to_pgoff(pte) \
+ ((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
+
+#define pgoff_to_pte(off) \
+ ((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+
#endif /* _I386_PGTABLE_2LEVEL_H */
--- linux/include/asm-i386/pgtable-3level.h.orig 2002-10-22 19:12:21.000000000 +0200
+++ linux/include/asm-i386/pgtable-3level.h 2002-10-22 19:17:08.000000000 +0200
@@ -108,4 +108,11 @@

extern struct kmem_cache_s *pae_pgd_cachep;

+/*
+ * Bits 0, 6 and 7 are taken in the low part of the pte,
+ * put the 32 bits of offset into the high part.
+ */
+#define pte_to_pgoff(pte) ((pte).pte_high)
+#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) })
+
#endif /* _I386_PGTABLE_3LEVEL_H */
--- linux/mm/fremap.c.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/mm/fremap.c 2002-10-22 19:07:22.000000000 +0200
@@ -17,12 +17,12 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-static inline void zap_pte(struct mm_struct *mm, pte_t *ptep)
+static inline int zap_pte(struct mm_struct *mm, pte_t *ptep)
{
pte_t pte = *ptep;

if (pte_none(pte))
- return;
+ return 0;
if (pte_present(pte)) {
unsigned long pfn = pte_pfn(pte);

@@ -37,9 +37,12 @@
mm->rss--;
}
}
+ return 1;
} else {
- free_swap_and_cache(pte_to_swp_entry(pte));
+ if (!pte_file(pte))
+ free_swap_and_cache(pte_to_swp_entry(pte));
pte_clear(ptep);
+ return 0;
}
}

@@ -48,16 +51,16 @@
* previously existing mapping.
*/
int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, struct page *page, unsigned long prot)
+ unsigned long addr, struct page *page, pgprot_t prot)
{
- int err = -ENOMEM;
+ int err = -ENOMEM, flush;
struct page *ptepage;
pte_t *pte, entry;
pgd_t *pgd;
pmd_t *pmd;

- spin_lock(&mm->page_table_lock);
pgd = pgd_offset(mm, addr);
+ spin_lock(&mm->page_table_lock);

pmd = pmd_alloc(mm, pgd, addr);
if (!pmd)
@@ -72,7 +75,6 @@
ptepage = pmd_page(*pmd);
} else
pte = pte_offset_map(pmd, addr);
- goto mapped;
} else {
#endif
pte = pte_alloc_map(mm, pmd, addr);
@@ -83,22 +85,19 @@
pte_page_lock(ptepage);
#ifdef CONFIG_SHAREPTE
}
-mapped:
#endif
- zap_pte(mm, pte);
+ flush = zap_pte(mm, pte);

mm->rss++;
flush_page_to_ram(page);
flush_icache_page(vma, page);
- entry = mk_pte(page, protection_map[prot]);
- if (prot & PROT_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = mk_pte(page, prot);
set_pte(pte, entry);
page_add_rmap(page, pte);
pte_unmap(pte);
- flush_tlb_page(vma, addr);
+ if (flush)
+ flush_tlb_page(vma, addr);

- pte_page_unlock(ptepage);
spin_unlock(&mm->page_table_lock);

return 0;
@@ -119,26 +118,28 @@
*
* this syscall works purely via pagetables, so it's the most efficient
* way to map the same (large) file into a given virtual window. Unlike
- * mremap()/mmap() it does not create any new vmas.
+ * mmap()/mremap() it does not create any new vmas. The new mappings are
+ * also safe across swapout.
*
- * The new mappings do not live across swapout, so either use MAP_LOCKED
- * or use PROT_NONE in the original linear mapping and add a special
- * SIGBUS pagefault handler to reinstall zapped mappings.
+ * NOTE: the 'prot' parameter right now is ignored, and the vma's default
+ * protection is used. Arbitrary protections might be implemented in the
+ * future.
*/
int sys_remap_file_pages(unsigned long start, unsigned long size,
- unsigned long prot, unsigned long pgoff, unsigned long flags)
+ unsigned long __prot, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
unsigned long end = start + size;
struct vm_area_struct *vma;
int err = -EINVAL;

+ if (__prot & ~0xf)
+ return err;
/*
* Sanitize the syscall parameters:
*/
- start = PAGE_ALIGN(start);
- size = PAGE_ALIGN(size);
- prot &= 0xf;
+ start = start & PAGE_MASK;
+ size = size & PAGE_MASK;

down_read(&mm->mmap_sem);

@@ -151,15 +152,9 @@
if (vma && (vma->vm_flags & VM_SHARED) &&
vma->vm_ops && vma->vm_ops->populate &&
end > start && start >= vma->vm_start &&
- end <= vma->vm_end) {
- /*
- * Change the default protection to PROT_NONE:
- */
- if (pgprot_val(vma->vm_page_prot) != pgprot_val(__S000))
- vma->vm_page_prot = __S000;
- err = vma->vm_ops->populate(vma, start, size, prot,
- pgoff, flags & MAP_NONBLOCK);
- }
+ end <= vma->vm_end)
+ err = vma->vm_ops->populate(vma, start, size, vma->vm_page_prot,
+ pgoff, flags & MAP_NONBLOCK);

up_read(&mm->mmap_sem);

--- linux/mm/shmem.c.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/mm/shmem.c 2002-10-22 17:09:41.000000000 +0200
@@ -954,7 +954,7 @@

static int shmem_populate(struct vm_area_struct *vma,
unsigned long addr, unsigned long len,
- unsigned long prot, unsigned long pgoff, int nonblock)
+ pgprot_t prot, unsigned long pgoff, int nonblock)
{
struct inode *inode = vma->vm_file->f_dentry->d_inode;
struct mm_struct *mm = vma->vm_mm;
--- linux/mm/filemap.c.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/mm/filemap.c 2002-10-22 17:10:23.000000000 +0200
@@ -1290,7 +1290,7 @@
static int filemap_populate(struct vm_area_struct *vma,
unsigned long addr,
unsigned long len,
- unsigned long prot,
+ pgprot_t prot,
unsigned long pgoff,
int nonblock)
{
--- linux/mm/swapfile.c.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/mm/swapfile.c 2002-10-22 18:59:53.000000000 +0200
@@ -377,6 +377,8 @@
{
pte_t pte = *dir;

+ if (pte_file(pte))
+ return;
if (likely(pte_to_swp_entry(pte).val != entry.val))
return;
if (unlikely(pte_none(pte) || pte_present(pte)))
--- linux/mm/rmap.c.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/mm/rmap.c 2002-10-22 19:17:09.000000000 +0200
@@ -666,11 +666,21 @@
}
pte_page_unlock(ptepage);

- /* Store the swap location in the pte. See handle_pte_fault() ... */
if (PageSwapCache(page)) {
+ /*
+ * Store the swap location in the pte.
+ * See handle_pte_fault() ...
+ */
swp_entry_t entry = { .val = page->index };
swap_duplicate(entry);
set_pte(ptep, swp_entry_to_pte(entry));
+ BUG_ON(pte_file(*ptep));
+ } else {
+ /*
+ * Store the file page offset in the pte.
+ */
+ set_pte(ptep, pgoff_to_pte(page->index));
+ BUG_ON(!pte_file(*ptep));
}

/* Move the dirty bit to the physical page now the pte is gone. */
--- linux/mm/memory.c.orig 2002-10-22 17:03:22.000000000 +0200
+++ linux/mm/memory.c 2002-10-22 19:11:42.000000000 +0200
@@ -311,7 +311,8 @@
goto unshare_skip_set;

if (!pte_present(pte)) {
- swap_duplicate(pte_to_swp_entry(pte));
+ if (!pte_file(pte))
+ swap_duplicate(pte_to_swp_entry(pte));
set_pte(dst_pte, pte);
goto unshare_skip_set;
}
@@ -716,7 +717,8 @@
goto cont_copy_pte_range_noset;
/* pte contains position in swap, so copy. */
if (!pte_present(pte)) {
- swap_duplicate(pte_to_swp_entry(pte));
+ if (!pte_file(pte))
+ swap_duplicate(pte_to_swp_entry(pte));
set_pte(dst_pte, pte);
goto cont_copy_pte_range_noset;
}
@@ -810,7 +812,8 @@
}
}
} else {
- free_swap_and_cache(pte_to_swp_entry(pte));
+ if (!pte_file(pte))
+ free_swap_and_cache(pte_to_swp_entry(pte));
pte_clear(ptep);
}
}
@@ -1805,6 +1808,41 @@
}

/*
+ * Fault of a previously existing named mapping. Repopulate the pte
+ * from the encoded file_pte if possible. This enables swappable
+ * nonlinear vmas.
+ */
+static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+{
+ unsigned long pgoff;
+ int err;
+
+ BUG_ON(!vma->vm_ops || !vma->vm_ops->nopage);
+ /*
+ * Fall back to the linear mapping if the fs does not support
+ * ->populate:
+ */
+ if (!vma->vm_ops || !vma->vm_ops->populate ||
+ (write_access && !(vma->vm_flags & VM_SHARED))) {
+ pte_clear(pte);
+ return do_no_page(mm, vma, address, write_access, pte, pmd);
+ }
+
+ pgoff = pte_to_pgoff(*pte);
+
+ pte_unmap(pte);
+ spin_unlock(&mm->page_table_lock);
+
+ err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ if (err)
+ return VM_FAULT_SIGBUS;
+ return VM_FAULT_MAJOR;
+}
+
+/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
* RISC architectures). The early dirtying is also good on the i386.
@@ -1833,13 +1871,10 @@

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
return do_no_page(mm, vma, address, write_access, pte, pmd);
+ if (pte_file(entry))
+ return do_file_page(mm, vma, address, write_access, pte, pmd);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}



2002-10-22 17:45:56

by Christoph Hellwig

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, Oct 22, 2002 at 07:57:00PM +0200, Ingo Molnar wrote:
> the attached patch (on top of 2.5.44-mm2) implements generic (swappable!)
> nonlinear mappings and sys_remap_file_pages() support. I.e. no more
> MAP_LOCKED restrictions and strange pagefault semantics.
>
> to implement this i added a new pte concept: "file ptes". This means that
> upon swapout, the ptes of shared named mappings do not get cleared but
> get converted into file ptes, which can then be decoded by the pagefault
> path and looked up in the pagecache.
>
> the normal linear pagefault path from now on does not assume linearity and
> decodes the offset in the pte. This also tests pte encoding/decoding in
> the pagecache case, and the ->populate functions.

Ingo,

what is the reason for that interface? It looks like a gross performance
hack for misdesigned applications to me, kind of windowsish...

Is this for whoracle or something like that?

2002-10-22 19:07:55

by Rik van Riel

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, 22 Oct 2002, Ingo Molnar wrote:

> the attached patch (on top of 2.5.44-mm2) implements generic (swappable!)
> nonlinear mappings and sys_remap_file_pages() support. I.e. no more
> MAP_LOCKED restrictions and strange pagefault semantics.

The code looks sane, but like Christoph, I'm curious why
we need it.

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-10-22 19:17:42

by Andrew Morton

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

Christoph Hellwig wrote:
>
> On Tue, Oct 22, 2002 at 07:57:00PM +0200, Ingo Molnar wrote:
> > the attached patch (on top of 2.5.44-mm2) implements generic (swappable!)
> > nonlinear mappings and sys_remap_file_pages() support. I.e. no more
> > MAP_LOCKED restrictions and strange pagefault semantics.
> >
> > to implement this i added a new pte concept: "file ptes". This means that
> > upon swapout, the ptes of shared named mappings do not get cleared but
> > get converted into file ptes, which can then be decoded by the pagefault
> > path and looked up in the pagecache.
> >
> > the normal linear pagefault path from now on does not assume linearity and
> > decodes the offset in the pte. This also tests pte encoding/decoding in
> > the pagecache case, and the ->populate functions.
>
> Ingo,
>
> what is the reason for that interface? It looks like a gross performance
> hack for misdesigned applications to me, kind of windowsish...
>

So that evicted pages in non-linear mappings can be reestablished
at fault time by the kernel, rather than by delegation to userspace
via SIGBUS.


We seem to have lost a pte_page_unlock() from fremap.c:zap_pte()?
I fixed up the ifdef tangle in there within the shpte-ng patch
and then put the pte_page_unlock() back.

I also added a page_cache_release() to the error path in filemap_populate(),
if install_page() failed.

The 2TB file size limit for mmap on non-PAE is a little worrisome.
I wonder if we can only instantiate the pte_file() bit if the
mapping is using MAP_POPULATE? Seems hard to do.

2002-10-22 20:00:22

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On Tue, 22 Oct 2002, Christoph Hellwig wrote:

> what is the reason for that interface? It looks like a gross
> performance hack for misdesigned applications to me, kind of windowsish...

there are a number of reasons why we very much want this extension to the
Linux VM. Please catch up with the full email discussion and check out the
first announcement of the interface on lkml; the subject of the email was:
"[patch, feature] nonlinear mappings, prefaulting support, 2.5.42-F8".

(and add one more application category to the list of beneficiaries,
NPTL-style threading libraries; see the "[patch] mmap-speedup-2.5.42-C3"
discussion on lkml.)

I think it is quite a big architectural step for the Linux VM to have more
generic vmas that 1) can be nonlinear and 2) can have fine-grained, non-uniform
protection bits. It has been clearly established in the past few years
empirically that the vma tree approach itself sucks performance-wise for
applications that have many different mappings.

And is it a big problem that RAM-windowing applications can make use of
the new capabilities as well, to overcome the limits of 32 bits? Your
response is a bit knee-jerk: what do you think the kernel itself does when
it piggybacks on the highmem range using kmap? There is no way to overcome
32-bitness limits on a box that has much more than 32 bits worth of RAM
other than to start mapping things in dynamically. So what's your point?

Ingo

2002-10-22 20:08:53

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On Tue, 22 Oct 2002, Andrew Morton wrote:

> We seem to have lost a pte_page_unlock() from fremap.c:zap_pte()? I
> fixed up the ifdef tangle in there within the shpte-ng patch and then
> put the pte_page_unlock() back.

ok. i fixed up the shpte #ifdef tangle in there as well; it was complex
for no good reason, so i suspected that it was missing a line or two.

> I also added a page_cache_release() to the error path in
> filemap_populate(), if install_page() failed.

hm, i somehow missed adding this; it was reported once already.

> The 2TB file size limit for mmap on non-PAE is a little worrisome. [...]

the limit is only there for 32-bit ptes on 32-bit platforms. 64-bit ptes
(both true 64-bit architectures and x86-PAE) have a ~64 zettabyte filesize
limit. I do not realistically believe that there is any 32-bit x86 box
connected to a larger than 2 TB disk array that cannot run a PAE kernel,
just like you need PAE for more than 4 GB of physical RAM. I find it a bit
worrisome that 32-bit x86 ptes can only support up to 4 GB of physical
RAM, but such is life :-)

Ingo

2002-10-22 20:25:09

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On 22 Oct 2002, Alan Cox wrote:

> Actually I know a few. 2 TB is cheap: it's one PCI controller and eight
> IDE disks.

2 TB should still work. And to get to the 16 TB limit you'd have to
recompile with PAE. It costs some (rather limited) RAM overhead and some
fork() overhead. I think ext2/ext3's current 2 TB/4 TB limit is a much
bigger problem; you cannot compile around that. Are there in fact any
patches that lift that limit? (well, one solution is to use another
filesystem.)

Ingo

2002-10-22 20:18:13

by Alan

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, 2002-10-22 at 21:27, Ingo Molnar wrote:
> limit. I do not realistically believe that there is any 32-bit x86 box
> connected to a larger than 2 TB disk array that cannot run a PAE kernel,
> just like you need PAE for more than 4 GB of physical RAM. I find it a bit
> worrisome that 32-bit x86 ptes can only support up to 4 GB of physical
> RAM, but such is life :-)

Actually I know a few. 2 TB is cheap: it's one PCI controller and eight
IDE disks.

2002-10-22 20:30:26

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On 22 Oct 2002, Alan Cox wrote:

> Actually I know a few. 2 TB is cheap: it's one PCI controller and eight
> IDE disks.

what we can do is to still use the linear mapping, i.e. to impose the limit
only on fremap() users. This is ugly but works. It needs quite some
hacking though, since at the point of pagecache-pte zapping we don't have a
vma handy, so we cannot tell from the pte alone whether it's mapped
linearly or not. We could perhaps use the free bit in the pte to signal
this condition, but i'm not sure whether this is possible on every
architecture. Are there architectures that have no freely OS-usable bit in
the pte?

the limit will become even more prominent once i've moved the protection
bits into the swap pte format as well - that reduces the fremap() limit to
0.5 TB for 32-bit ptes.

(there's no real reason to keep the offset in the pte in the linearly
mapped case anyway, besides some vague 'symmetry' arguments.)

Ingo

2002-10-22 20:56:13

by Alan

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, 2002-10-22 at 21:42, Ingo Molnar wrote:
> 2 TB should still work. And to get to the 16 TB limit you'd have to
> recompile with PAE. It costs some (rather limited) RAM overhead and some
> fork() overhead. I think ext2/ext3's current 2 TB/4 TB limit is a much
> bigger problem; you cannot compile around that. Are there in fact any
> patches that lift that limit? (well, one solution is to use another
> filesystem.)

At > 2 TB, XFS/JFS would probably make a lot more sense anyway. We have
320 GB disks available now; no doubt by the time 2.6 is out we'll be
looking at 640 GB disks.

2002-10-22 22:11:47

by David S. Miller

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, 2002-10-22 at 10:57, Ingo Molnar wrote:
> - flush_tlb_page(vma, addr);
> + if (flush)
> + flush_tlb_page(vma, addr);

You're still using page level flushes, even though we agreed that
a range flush one level up was more appropriate.

2002-10-22 22:37:39

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On 22 Oct 2002, David S. Miller wrote:

> > - flush_tlb_page(vma, addr);
> > + if (flush)
> > + flush_tlb_page(vma, addr);
>
> You're still using page level flushes, even though we agreed that a
> range flush one level up was more appropriate.

yes - i wanted to keep the ->populate() functions as simple as possible.
I hope to get there soon.

Ingo


2002-10-23 02:00:01

by Andrea Arcangeli

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, Oct 22, 2002 at 10:19:37PM +0200, Ingo Molnar wrote:
> protection bits. It has been clearly established in the past few years
> empirically that the vma tree approach itself sucks performance-wise for
> applications that have many different mappings.

if you're talking about get_unmapped_area showing up heavy in the
profiling then you're on the wrong track with this; if nobody beats me to
it I will fix that one soon. I discussed it some months ago with Claus
Fisher and it's going to be optimized away completely from all the
profiles out there (at least as much as mmap). The vma RAM overhead
will still be there though, just the cpu overhead will go away, but I have
never heard anybody complain about running out of RAM because of vmas yet
(while I know several cases where the lack of O(log(N)) in
get_unmapped_area is a showstopper, the GUI suffers badly as well with its
hundreds of libraries, but the GUIs are otherwise idle so it doesn't
matter much for them if the cpu is wasted, though they will get a bit
lower latency).

Andrea

2002-10-23 06:59:53

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On Wed, 23 Oct 2002, Andrea Arcangeli wrote:

> > protection bits. It has been clearly established in the past few years
> > empirically that the vma tree approach itself sucks performance-wise for
> > applications that have many different mappings.
>
> if you're talking about get_unmapped_area showing up heavy in the
> profiling then you're on the wrong track with this; [...]

algorithmic overhead is just one aspect of the 'thousands of vmas'
problem; the real killer issue for some applications is RAM overhead,
because a single vma takes up roughly 3 cachelines _per process_, which
means 1 MB of unswappable lowmem overhead per 10 thousand vmas. I know
about valid applications (and i'm not counting Electric Fence here) with
hundreds of thousands of vmas per process.

Even an O(log(N)) algorithm means ~15 tree-walking steps in a perfectly
balanced tree (which ours aren't), which means 0.5 KB of cache touched per
lookup with a 32-byte cacheline size, and 2 KB of cache touched per lookup
with a 128-byte cacheline size - which is far from cheap. While it's _much_
better than an O(N) algorithm in this case, the approach i'm working on
solves the more fundamental problem instead of working around it.

> (while I know several cases where the lack of O(log(N)) in
> get_unmapped_area is a showstopper, [...]

Exactly which applications do you have in mind? Most of the ones i can
think of are helped by the file-pte/swap-pte work.

do you still claim this to be the case once i've extended file ptes (and
anon-swap ptes) with protection bits? That will enable the following
important things: mprotect() does not break up vmas, and neighboring
anonymous areas can _always_ be merged.

> [...] the GUI suffers badly as well with its hundreds of libraries, but
> the GUIs are otherwise idle so it doesn't matter much for them if the
> cpu is wasted, though they will get a bit lower latency).

the GUI cases should be helped quite significantly by the simple approach
i added. The NPTL testcase that triggered the get_unmapped_area()
regression is fixed by the mmap-speedup patch as well; get_unmapped_area()
completely vanished from the kernel profile. Sure, its simplicity means
it is still an O(N) algorithm conceptually, but the typical statistical
behavior should be much better.

but the thing i'm working on is a much better than O(log(N)) improvement,
because i'm trying to fix the real and fundamental problem: the
artificially high number of vmas. That is not possible in every case
though (eg. in the KDE libraries case we do have roughly 100 vmas from
different libraries), so your O(log(N)) patch will definitely be worth a
review.

in any case, if you have followed the mmap-speedup discussion then you
must know all the arguments already. Sure, O(log(N)) would be nice in
theory (and i raised that possibility in the discussion), but i'd like to
see your patch first, because yet another vma tree is quite some
complexity and it further increases the size of the vma, which is not
quite a no-cost approach.

Ingo

2002-10-23 11:44:22

by Andrea Arcangeli

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Wed, Oct 23, 2002 at 09:19:23AM +0200, Ingo Molnar wrote:
> theory (and i raised that possibility in the discussion), but i'd like to
> see your patch first, because yet another vma tree is quite some
> complexity and it further increases the size of the vma, which is not
> quite a no-cost approach.

it's not another vma tree; furthermore, another vma tree indexed by the
hole size wouldn't be able to defragment, and it would find the best fit,
not the first fit on the left.

Andrea

2002-10-23 14:07:28

by Ingo Molnar

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0


On Wed, 23 Oct 2002, Andrea Arcangeli wrote:

> it's not another vma tree, furthmore another vma tree indexed by the
> hole size wouldn't be able to defragment and it would find the best fit
> not the first fit on the left.

what i was talking about was a hole-tree indexed by the hole start
address, not a vma tree indexed by the hole size. (the latter is pretty
pointless.) And even this solution still has to search the tree linearly
for a matching hole.

Ingo

2002-10-23 14:18:38

by Andrea Arcangeli

Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Wed, Oct 23, 2002 at 04:20:30PM +0200, Ingo Molnar wrote:
>
> On Wed, 23 Oct 2002, Andrea Arcangeli wrote:
>
> > it's not another vma tree; furthermore, another vma tree indexed by the
> > hole size wouldn't be able to defragment, and it would find the best fit,
> > not the first fit on the left.
>
> what i was talking about was a hole-tree indexed by the hole start

yes, a hole tree.

> address, not a vma tree indexed by the hole size. (the latter is pretty

indexed by the hole start address? Then it's still O(N), and your quick
cache for the first hole would not be much different. Above I meant hole
size; then it would be O(log(N)), but it wouldn't defragment anymore.

> pointless.) And even this solution still has to search the tree linearly
> for a matching hole.

exactly, it's still O(N).

The final solution needed isn't a plain tree; it needs modifications to
the rbtree code to make the data structure more powerful, so that it can
provide that info in O(log(N)) without an additional tree and starting
from the left (i.e. it won't alter the retval of get_unmapped_area, just
its speed). The design has been finished for some time; what's left is
to implement it, and it isn't trivial ;).
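
(a rough illustrative sketch of this kind of augmented tree - an
extrapolation, not the actual design, which was not posted: each node
caches the largest hole in its subtree, so a first-fit descent from the
left can reject whole subtrees; rebalancing and max_gap maintenance are
omitted:)

#include <stdio.h>

struct vma_node {
	unsigned long start, end;	/* this vma's [start, end) */
	unsigned long gap;		/* free space just below this vma */
	unsigned long max_gap;		/* largest gap in this subtree */
	struct vma_node *l, *r;
};

/* first (lowest-address) hole of at least len bytes; whole subtrees
 * are rejected via max_gap, so a balanced tree gives O(log(N)) */
static unsigned long first_fit(struct vma_node *n, unsigned long len)
{
	while (n) {
		if (n->l && n->l->max_gap >= len) {
			n = n->l;	/* a fit exists further left */
			continue;
		}
		if (n->gap >= len)
			return n->start - n->gap;
		n = n->r;		/* only the right side can fit */
	}
	return 0;			/* no hole big enough */
}

int main(void)
{
	/* three vmas in address order: a < b < c, tree rooted at b */
	struct vma_node a = { 0x1000, 0x3000, 0x1000, 0x1000, 0, 0 };
	struct vma_node c = { 0x9000, 0xa000, 0x3000, 0x3000, 0, 0 };
	struct vma_node b = { 0x5000, 0x6000, 0x2000, 0x3000, &a, &c };

	printf("%lx\n", first_fit(&b, 0x1800));	/* 3000: hole below b */
	printf("%lx\n", first_fit(&b, 0x2800));	/* 6000: hole below c */
	return 0;
}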

Anyway, this O(log(N)) complexity improvement is needed regardless; old
applications will still be around, in particular once they find they
don't need a special API to work fast.

Andrea

2002-10-23 19:03:50

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [patch] generic nonlinear mappings, 2.5.44-mm2-D0

On Tue, 2002-10-22 at 13:42, Ingo Molnar wrote:

> I think ext2/ext3's current 2 TB/4 TB limit is a much
> bigger problem; you cannot compile around that. Are there in fact any
> patches that lift that limit? (well, one solution is to use another
> filesystem.)

Peter Chubb's sector_t changes effectively raise this to an 8 TB limit in
2.5.x. The limit would be 16 TB, but ext3 and jbd are rather cavalier
about casting block offsets between int, long, and unsigned long.
Changes to fix that would be highly intrusive.

<b