2008-03-18 01:09:58

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


This patchkit is an experimental optimization I played around with
some time ago.

This is more a prototype still, but I wanted to push it out
so that other people can play with it.

The basic idea is that most programs have the same working set
over multiple runs. So instead of demand paging all the text pages
in the order the program runs save the working set to disk and prefetch
it at program start and then save it at program exit.

This allows some optimizations:
- it can avoid unnecessary disk seeks because the blocks will be fetched in
sorted offset order instead of program execution order.
- batch kernel entries (each demand page exception has some
overhead just for entering the kernel). This keeps the caches hot too.
- The prefetch could be in theory done in the background while the program
runs (although that is not implemented currently)

Some details on the implementation:

To do all this we need a bitmap space somewhere in the ELF executable. I originally
hoped to use a standard ELF PHDR for this, which are already parsed by the
Linux ELF loader. However the problem is that PHDRs are part of the
mapped program image and inserting any new ones requires relinking
the program. Since relinking programs just to get this would be
rather heavy-handed I used a hack by setting another bitflag
in the gnu_execstack header and when it is set let the kernel
look for ELF SHDRs at the end of the file. Disadvantage is that
this costs a seek, but it allows easily to update existing
executables with a simple too.

The seek overhead would be gone if the linkers are taught to
always generate a PBITMAP bitmap header.

I also considered external bitmap files, but just putting it into the ELF
files and keeping it all together seemed much nicer policywise.

Then there is some probability of thrashing the bitmap, e.g. when
a program runs in different modi with totally different working sets
(a good example of this would be busybox). I haven't found
a good heuristic to handle this yet (e.g. one possibility would
be to or the bitmap instead of rewriting it on exit) this is something
that could need further experimentation. Also one doesn't want
too many bitmap updates of course so there is a simple heuristic
to not update bitmaps more often than a sysctl configurable
interval.

User tools:
ftp://ftp.firstfloor.org/pub/ak/pbitmap/pbitmap.c
is a simple program to a pbitmap shdrs to an existing ELF executable.

Base kernel:
Again 2.6.25-rc6

Drawbacks:
- No support for dynamic libraries right now (except very clumpsily
through the mmap_slurp hack). This is the main reason it is not
very useful for speed up desktops currently.

- Executable files have to be writable by the user executing it
currently to get bitmap updates. It would be possible to let the
kernel bypass this, but I haven't thought too much about the security
implications of it.
However any user can use the bitmap data written by a user with
write rights.

That's currently one of the bigger usability issues (together
with the missing shared library support) and why it is more
a prototype than a fully usable solution.

Possible areas of improvements if anybody is interested:
- Background prefetch
- Tune all the sysctl defaults
- Implement shared library support (will require glibc support)
- Do something about the executable access problem
- Experiment with more fancy heuristics to update bitmaps (like OR
or do aging etc.)

-Andi


2008-03-18 01:10:21

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [2/8] Add support to override mmap exec write protection with O_FORCEWRITE


pbitmaps need to write to files mapped as executable, so add
a way to bypass the normal write protection using a new O_FORCEWRITE
flag.

Right now the flag can be set from user space too. If have not
made up my mind if that is a good or a bad thing (I don't think
it has any real security implications). Probably it's more bad
than good.

Signed-off-by: Andi Kleen <[email protected]>

---
fs/namei.c | 9 +++++++--
fs/open.c | 2 +-
include/asm-generic/fcntl.h | 4 ++++
include/linux/fs.h | 1 +
4 files changed, 13 insertions(+), 3 deletions(-)

Index: linux/fs/namei.c
===================================================================
--- linux.orig/fs/namei.c
+++ linux/fs/namei.c
@@ -334,10 +334,10 @@ int file_permission(struct file *file, i
* the inode->i_lock spinlock.
*/

-int get_write_access(struct inode * inode)
+int __get_write_access(struct inode * inode, int flags)
{
spin_lock(&inode->i_lock);
- if (atomic_read(&inode->i_writecount) < 0) {
+ if (atomic_read(&inode->i_writecount) < 0 && !(flags&O_FORCEWRITE)) {
spin_unlock(&inode->i_lock);
return -ETXTBSY;
}
@@ -347,6 +347,11 @@ int get_write_access(struct inode * inod
return 0;
}

+int get_write_access(struct inode * inode)
+{
+ return __get_write_access(inode, 0);
+}
+
int deny_write_access(struct file * file)
{
struct inode *inode = file->f_path.dentry->d_inode;
Index: linux/fs/open.c
===================================================================
--- linux.orig/fs/open.c
+++ linux/fs/open.c
@@ -742,7 +742,7 @@ static struct file *__dentry_open(struct
FMODE_PREAD | FMODE_PWRITE;
inode = dentry->d_inode;
if (f->f_mode & FMODE_WRITE) {
- error = get_write_access(inode);
+ error = __get_write_access(inode, f->f_flags & O_FORCEWRITE);
if (error)
goto cleanup_file;
}
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -1720,6 +1720,7 @@ extern int generic_permission(struct ino
int (*check_acl)(struct inode *, int));

extern int get_write_access(struct inode *);
+extern int __get_write_access(struct inode *i, int flags);
extern int deny_write_access(struct file *);
static inline void put_write_access(struct inode * inode)
{
Index: linux/include/asm-generic/fcntl.h
===================================================================
--- linux.orig/include/asm-generic/fcntl.h
+++ linux/include/asm-generic/fcntl.h
@@ -54,6 +54,10 @@
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
#endif
+/* ignore ETXTBSY -- should probably hide it from user space */
+#ifndef O_FORCEWRITE
+#define O_FORCEWRITE 02000000
+#endif

#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */

2008-03-18 01:10:56

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [3/8] Make readahead max pinned value a sysctl


Previously it was hard coded to 2MB, which seems dubious as systems
are getting more and more memory. With a sysctl it is easier to play
around with it and tune it.

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/mm.h | 2 ++
kernel/sysctl.c | 10 ++++++++++
mm/readahead.c | 14 ++++++++++++--
3 files changed, 24 insertions(+), 2 deletions(-)

Index: linux/mm/readahead.c
===================================================================
--- linux.orig/mm/readahead.c
+++ linux/mm/readahead.c
@@ -17,6 +17,14 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>

+/*
+ * Max memory pinned during forced readahead.
+ * Should be probably auto scaled with available memory.
+ */
+unsigned ra_max_pinned = 2*1024*1024;
+
+#define MAX_PINNED_CHUNK ra_max_pinned
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
@@ -176,7 +184,7 @@ out:
}

/*
- * Chunk the readahead into 2 megabyte units, so that we don't pin too much
+ * Chunk the readahead into MAX_PINNED_CHUNK units, so that we don't pin too much
* memory at once.
*/
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
@@ -190,7 +198,7 @@ int force_page_cache_readahead(struct ad
while (nr_to_read) {
int err;

- unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+ unsigned long this_chunk = MAX_PINNED_CHUNK / PAGE_CACHE_SIZE;

if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
@@ -229,6 +237,8 @@ int do_page_cache_readahead(struct addre
*/
unsigned long max_sane_readahead(unsigned long nr)
{
+ /* AK: RED-PEN for file cached pages there is no reason
+ to limit ourselves to the current node */
return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1113,6 +1113,8 @@ void page_cache_async_readahead(struct a

unsigned long max_sane_readahead(unsigned long nr);

+extern unsigned ra_max_pinned;
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -877,6 +877,16 @@ static struct ctl_table vm_table[] = {
.proc_handler = &proc_dointvec,
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "ra_max_pinned",
+ .data = &ra_max_pinned,
+ .maxlen = sizeof(ra_max_pinned),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
.ctl_name = VM_DIRTY_BACKGROUND,
.procname = "dirty_background_ratio",
.data = &dirty_background_ratio,

2008-03-18 01:11:20

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [4/8] Add readahead function to read-ahead based on a bitmap


Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/mm.h | 3 ++
mm/filemap.c | 2 -
mm/readahead.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 1 deletion(-)

Index: linux/mm/readahead.c
===================================================================
--- linux.orig/mm/readahead.c
+++ linux/mm/readahead.c
@@ -117,6 +117,26 @@ out:
return ret;
}

+static int preallocate_page(struct address_space *mapping,
+ pgoff_t page_offset, struct list_head *page_pool)
+{
+ struct page *page;
+
+ /* silently cries for a gang lookup */
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ rcu_read_unlock();
+ if (page)
+ return 0;
+
+ page = page_cache_alloc_cold(mapping);
+ if (!page)
+ return 0;
+ page->index = page_offset;
+ list_add(&page->lru, page_pool);
+ return 1;
+}
+
/*
* do_page_cache_readahead actually reads a chunk of disk. It allocates all
* the pages first, then submits them all for I/O. This avoids the very bad
@@ -140,6 +160,7 @@ __do_page_cache_readahead(struct address
int page_idx;
int ret = 0;
loff_t isize = i_size_read(inode);
+ int n;

if (isize == 0)
goto out;
@@ -216,6 +237,42 @@ int force_page_cache_readahead(struct ad
}

/*
+ * Read-ahead a page for each bit set in the bitmap.
+ */
+void readahead_bitmap(struct file *f, pgoff_t pgoffset, unsigned long *bitmap,
+ unsigned nbits)
+{
+ long bit, n;
+ LIST_HEAD(page_pool);
+ loff_t isize = i_size_read(f->f_dentry->d_inode);
+ unsigned long end_index;
+ struct address_space *mapping = f->f_mapping;
+
+ if (isize == 0)
+ return;
+
+ end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+
+ if (!mapping->a_ops || (!mapping->a_ops->readpages &&
+ !mapping->a_ops->readpage))
+ return;
+
+ bit = -1;
+ n = 0;
+ while ((bit = find_next_bit(bitmap, nbits, bit + 1)) < nbits) {
+ if (pgoffset + bit >= end_index)
+ break;
+ n += preallocate_page(f->f_mapping, pgoffset + bit,&page_pool);
+ if (n >= (MAX_PINNED_CHUNK / PAGE_CACHE_SIZE)) {
+ read_pages(f->f_mapping, f, &page_pool, n);
+ n = 0;
+ }
+ }
+ if (n > 0)
+ read_pages(f->f_mapping, f, &page_pool, n);
+}
+
+/*
* This version skips the IO if the queue is read-congested, and will tell the
* block layer to abandon the readahead if request allocation would block.
*
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -1240,7 +1240,7 @@ static ssize_t
do_readahead(struct address_space *mapping, struct file *filp,
pgoff_t index, unsigned long nr)
{
- if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
+ if (!mapping || !mapping->a_ops)
return -EINVAL;

force_page_cache_readahead(mapping, filp, index,
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1104,6 +1104,9 @@ void page_cache_sync_readahead(struct ad
pgoff_t offset,
unsigned long size);

+void readahead_bitmap(struct file *f, pgoff_t pgoffset, unsigned long *bitmap,
+ unsigned nbits);
+
void page_cache_async_readahead(struct address_space *mapping,
struct file_ra_state *ra,
struct file *filp,

2008-03-18 01:11:34

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [5/8] Add ELF constants for pbitmaps


I made up some numbers for the present bitmap phdrs and shdrs.

For serious use this would probably require an allocation somewhere.


Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/elf.h | 4 ++++
1 file changed, 4 insertions(+)

Index: linux/include/linux/elf.h
===================================================================
--- linux.orig/include/linux/elf.h
+++ linux/include/linux/elf.h
@@ -49,6 +49,7 @@ typedef __s64 Elf64_Sxword;
#define PT_GNU_EH_FRAME 0x6474e550

#define PT_GNU_STACK (PT_LOOS + 0x474e551)
+#define PT_PRESENT_BITMAP (PT_GNU_STACK + 1)

/* These constants define the different elf file types */
#define ET_NONE 0
@@ -230,6 +231,8 @@ typedef struct elf64_hdr {
#define PF_W 0x2
#define PF_X 0x1

+#define PF_PLEASE_LOAD_SHDRS 0x8 /* hack. checked on PT_GNU_STACK */
+
typedef struct elf32_phdr{
Elf32_Word p_type;
Elf32_Off p_offset;
@@ -270,6 +273,7 @@ typedef struct elf64_phdr {
#define SHT_HIPROC 0x7fffffff
#define SHT_LOUSER 0x80000000
#define SHT_HIUSER 0xffffffff
+#define SHT_PRESENT_BITMAP (SHT_LOPROC - 1000)

/* sh_flags */
#define SHF_WRITE 0x1

2008-03-18 01:10:41

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [1/8] Give ELF shdr types a name


The pbitmap code is the first code in Linux to know
about ELF shdrs. They were already in the include files, but
not given a suitable type name. Fix that.

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/elf.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/include/linux/elf.h
===================================================================
--- linux.orig/include/linux/elf.h
+++ linux/include/linux/elf.h
@@ -286,7 +286,7 @@ typedef struct elf64_phdr {
#define SHN_COMMON 0xfff2
#define SHN_HIRESERVE 0xffff

-typedef struct {
+typedef struct elf32_shdr {
Elf32_Word sh_name;
Elf32_Word sh_type;
Elf32_Word sh_flags;
@@ -382,6 +382,7 @@ extern Elf32_Dyn _DYNAMIC [];
#define elf_phdr elf32_phdr
#define elf_note elf32_note
#define elf_addr_t Elf32_Off
+#define elf_shdr elf32_shdr

#else

@@ -390,6 +391,7 @@ extern Elf64_Dyn _DYNAMIC [];
#define elf_phdr elf64_phdr
#define elf_note elf64_note
#define elf_addr_t Elf64_Off
+#define elf_shdr elf64_shdr

#endif

2008-03-18 01:11:49

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [6/8] Core predictive bitmap engine


This patchkit is an experimental optimization I played around with
some time ago.

This is more a prototype still, but I wanted to push it out
so that other people can play with it.

The basic idea is that most programs have the same working set
over multiple runs. So instead of demand paging all the text pages
in the order the program runs save the working set to disk and prefetch
it at program start and then save it at program exit.

This allows some optimizations:
- it can avoid unnecessary disk seeks because the blocks will be fetched in
sorted offset order instead of program execution order.
- batch kernel entries (each demand page exception has some
overhead just for entering the kernel). This keeps the caches hot too.
- The prefetch could be in theory done in the background while the program
runs (although that is not implemented currently)

Some details on the implementation:

To do all this we need a bitmap space somewhere in the ELF executable. I originally
hoped to use a standard ELF PHDR for this, which are already parsed by the
Linux ELF loader. However the problem is that PHDRs are part of the
mapped program image and inserting any new ones requires relinking
the program. Since relinking programs just to get this would be
rather heavy-handed I used a hack by setting another bitflag
in the gnu_execstack header and when it is set let the kernel
look for ELF SHDRs at the end of the file. Disadvantage is that
this costs a seek, but it allows easily to update existing
executables with a simple too.

The seek overhead would be gone if the linkers are taught to
always generate a PBITMAP bitmap header.

I also considered external bitmap files, but just putting it into the ELF
files and keeping it all together seemed much nicer policywise.

Then there is some probability of thrashing the bitmap, e.g. when
a program runs in different modi with totally different working sets
(a good example of this would be busybox). I haven't found
a good heuristic to handle this yet (e.g. one possibility would
be to or the bitmap instead of rewriting it on exit) this is something
that could need further experimentation. Also one doesn't want
too many bitmap updates of course so there is a simple heuristic
to not update bitmaps more often than a sysctl configurable
interval.

User tools:
ftp://ftp.firstfloor.org/pub/ak/pbitmap/pbitmap.c
is a simple program to a pbitmap shdrs to an existing ELF executable.

Base kernel:
Again 2.6.25-rc6

Drawbacks:
- No support for dynamic libraries right now (except very clumpsily
through the mmap_slurp hack). This is the main reason it is not
very useful for speed up desktops currently.

- Executable files have to be writable by the user executing it
currently to get bitmap updates. It would be possible to let the
kernel bypass this, but I haven't thought too much about the security
implications of it.
However any user can use the bitmap data written by a user with
write rights.

That's currently one of the bigger usability issues (together
with the missing shared library support) and why it is more
a prototype than a fully usable solution.

Possible areas of improvements if anybody is interested:
- Background prefetch
- Tune all the sysctl defaults
- Implement shared library support (will require glibc support)
- Do something about the executable access problem
- Experiment with more fancy heuristics to update bitmaps (like OR
or do aging etc.)

Signed-off-by: Andi Kleen <[email protected]>

---
fs/binfmt_elf.c | 101 +++++++++++
include/linux/mm.h | 7
include/linux/mm_types.h | 3
kernel/fork.c | 1
mm/Makefile | 2
mm/mmap.c | 1
mm/pbitmap.c | 404 +++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 517 insertions(+), 2 deletions(-)

Index: linux/fs/binfmt_elf.c
===================================================================
--- linux.orig/fs/binfmt_elf.c
+++ linux/fs/binfmt_elf.c
@@ -527,6 +527,89 @@ static unsigned long randomize_stack_top
#endif
}

+static void elf_read_pb_phdr(struct file *f, struct elf_phdr *phdr, int nhdr)
+{
+ int i, found, err;
+ unsigned long *buf;
+
+ if (!pbitmap_enabled)
+ return;
+ buf = (unsigned long *)__get_free_page(GFP_KERNEL);
+ if (!buf)
+ return;
+ found = 0;
+ for (i = 0; i < nhdr; i++) {
+ struct elf_phdr *ep = &phdr[i];
+ if (ep->p_type != PT_PRESENT_BITMAP)
+ continue;
+
+ err = pbitmap_load(f, buf, ep->p_vaddr, ep->p_offset,
+ ep->p_filesz);
+ if (err < 0)
+ printk("%s: pbitmap load failed: %d\n",
+ current->comm, err);
+ else
+ found += err;
+ }
+ if (found > 0)
+ printk("%s: %d pages prefetched\n", current->comm, found);
+ free_page((unsigned long)buf);
+}
+
+/* All errors are ignored because the pbitmap is optional */
+static void elf_read_pb_shdr(struct file *f, struct elfhdr *hdr)
+{
+ int err;
+ unsigned n;
+ void *buf;
+
+ if (!pbitmap_enabled)
+ return;
+
+ /* Need to rate limit them later */
+ if (hdr->e_shentsize != sizeof(struct elf_shdr)) {
+ printk(KERN_WARNING "%s: unexpected shdr size %d\n",
+ current->comm, hdr->e_shentsize);
+ return;
+ }
+ if (hdr->e_shnum >= PAGE_SIZE / sizeof(struct elf_shdr)) {
+ printk(KERN_WARNING "%s: too many shdrs (%u)\n",
+ current->comm, hdr->e_shnum);
+ return;
+ }
+ buf = (void *)__get_free_page(GFP_KERNEL);
+ if (!buf)
+ return;
+ n = hdr->e_shnum * sizeof(struct elf_shdr);
+ err = kernel_read(f, hdr->e_shoff, buf, n);
+ if (err != n) {
+ printk(KERN_WARNING "%s: shdr read failed: %d\n",
+ current->comm, err);
+ } else {
+ int found = 0;
+ struct elf_shdr *shdrs = (struct elf_shdr *)buf;
+
+ for (n = 0; n < hdr->e_shnum; n++) {
+ struct elf_shdr *sh = &shdrs[n];
+ if (sh->sh_type != SHT_PRESENT_BITMAP)
+ continue;
+ err = pbitmap_load(f, buf, sh->sh_addr,
+ sh->sh_offset,
+ sh->sh_size);
+ if (err < 0)
+ printk("%s: pbitmap load failed: %d\n",
+ current->comm, err);
+ else
+ found += err;
+ }
+ if (found > 0 && pbitmap_enabled > 1)
+ printk("%s: %d pages prefetched\n", current->comm,
+ found);
+
+ }
+ free_page((unsigned long)buf);
+}
+
static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
{
struct file *interpreter = NULL; /* to shut gcc up */
@@ -551,6 +634,8 @@ static int load_elf_binary(struct linux_
struct elfhdr interp_elf_ex;
struct exec interp_ex;
} *loc;
+ int pbitmap_seen = 0;
+ int please_load_shdrs = 0;

loc = kmalloc(sizeof(*loc), GFP_KERNEL);
if (!loc) {
@@ -706,6 +791,10 @@ static int load_elf_binary(struct linux_
executable_stack = EXSTACK_ENABLE_X;
else
executable_stack = EXSTACK_DISABLE_X;
+
+ /* Hack */
+ if (elf_ppnt->p_flags & PF_PLEASE_LOAD_SHDRS)
+ please_load_shdrs = 1;
break;
}

@@ -768,8 +857,13 @@ static int load_elf_binary(struct linux_
int elf_prot = 0, elf_flags;
unsigned long k, vaddr;

- if (elf_ppnt->p_type != PT_LOAD)
+ if (elf_ppnt->p_type != PT_LOAD) {
+ if (elf_ppnt->p_type == PT_PRESENT_BITMAP) {
+ /* convert to writable */
+ pbitmap_seen = 1;
+ }
continue;
+ }

if (unlikely (elf_brk > elf_bss)) {
unsigned long nbyte;
@@ -969,6 +1063,11 @@ static int load_elf_binary(struct linux_
arch_randomize_brk(current->mm);
#endif

+ if (pbitmap_seen)
+ elf_read_pb_phdr(bprm->file, elf_phdata, loc->elf_ex.e_phnum);
+ if (please_load_shdrs)
+ elf_read_pb_shdr(bprm->file, &loc->elf_ex);
+
if (current->personality & MMAP_PAGE_ZERO) {
/* Why this, you ask??? Well SVr4 maps page 0 as read-only,
and some applications "depend" upon this behavior.
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1215,6 +1215,13 @@ unsigned long shrink_slab(unsigned long
void drop_pagecache(void);
void drop_slab(void);

+void pbitmap_update(struct mm_struct *mm);
+int pbitmap_load(struct file *f, unsigned long *buf,
+ unsigned long vaddr, unsigned long file_offset, unsigned long filesz);
+extern int pbitmap_enabled;
+extern int pbitmap_early_fault;
+extern unsigned pbitmap_update_interval;
+
#ifndef CONFIG_MMU
#define randomize_va_space 0
#else
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
mm_init_cgroup(mm, p);
+ INIT_LIST_HEAD(&mm->pbitmap_list);

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
Index: linux/mm/pbitmap.c
===================================================================
--- /dev/null
+++ linux/mm/pbitmap.c
@@ -0,0 +1,404 @@
+/*
+ * Manage bitmaps of the working set of programs in executable files
+ * and use them later for quick prefetching.
+ *
+ * Subject to the GNU General Public License, version 2 only.
+ * Copyright 2007 Andi Kleen, SUSE Labs
+ *
+ * Locking: this file generally doesn't bother much with locking
+ * because during the update we're the last user of the mm just being
+ * destroyed and very little can interfere. And during the bitmap load
+ * the pbitmap object is just being constructed and also 100% private.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/mount.h>
+#include <asm/uaccess.h>
+
+#define Pprintk(x...)
+
+#define BITS_PER_PAGE (PAGE_SIZE * 8)
+#define BITS_TO_BYTES(x) (((x) + 8 - 1) / 8)
+
+struct pbitmap {
+ struct list_head node;
+ unsigned long start;
+ unsigned long npages;
+ loff_t file_offset; /* offset of bitmap in ELF file */
+ pgoff_t pgoff; /* page cache offset of VMA */
+ struct path backing;
+ /* Temporarily used in the update phase to pass down arguments: */
+ unsigned long *buffer; /* always a PAGE */
+ struct file *fh;
+};
+
+int pbitmap_enabled __read_mostly;
+int pbitmap_early_fault __read_mostly = 1;
+unsigned pbitmap_update_interval __read_mostly = 0; /* seconds */
+
+static struct pbitmap *
+alloc_pb(struct vm_area_struct *vma, unsigned long vaddr,
+ unsigned long file_offset, unsigned long filesz)
+{
+ struct pbitmap *pb;
+ pb = kzalloc(sizeof(struct pbitmap), GFP_KERNEL);
+ if (!pb)
+ return NULL;
+
+ Pprintk("alloc_pb vaddr %lx filesz %lu\n", vaddr, filesz);
+ pb->file_offset = file_offset;
+ pb->start = vaddr;
+ pb->pgoff = vma->vm_pgoff + ((vaddr - vma->vm_start) >> PAGE_SHIFT);
+ pb->npages = filesz*8;
+ pb->npages = min_t(unsigned long, pb->npages, vma_pages(vma));
+
+ /* Save away the file to make sure we access the same file later */
+ pb->backing.mnt = mntget(vma->vm_file->f_vfsmnt);
+ pb->backing.dentry = dget(vma->vm_file->f_dentry);
+ return pb;
+}
+
+void free_pb(struct pbitmap *pb)
+{
+ if (pb->fh)
+ filp_close(pb->fh, 0);
+ else {
+ dput(pb->backing.dentry);
+ mntput(pb->backing.mnt);
+ }
+ kfree(pb);
+}
+
+static int
+pbitmap_prefetch(struct file *f, unsigned long *buf, struct pbitmap *pb)
+{
+ int found = 0;
+ long left = BITS_TO_BYTES(pb->npages);
+ unsigned long offset = pb->file_offset;
+
+ Pprintk("%s prefetch %lx %lu\n", current->comm, offset, left);
+ while (left > 0) {
+ int n = left, k, bit;
+ if (n > PAGE_SIZE)
+ n = PAGE_SIZE;
+
+ /* Read the bitmap */
+
+ k = kernel_read(f, offset, (char *)buf, n);
+ if (n != k) {
+ printk("prefetch read failed: %d\n", n);
+ return -EIO;
+ }
+ Pprintk("bitmap [%lx]\n", *(unsigned long *)buf);
+
+ left -= n;
+ offset += n;
+
+ n *= 8;
+
+ /* First do IO on everything */
+ readahead_bitmap(f, pb->pgoff + pb->npages - left, buf, n);
+
+ if (!pbitmap_early_fault)
+ continue;
+
+ /* Now map it all in */
+ bit = 0;
+ while ((bit = find_next_bit(buf, n, bit)) < n) {
+ int err, i;
+ for (i = 1; i < n - bit; i++)
+ if (!test_bit(bit + i, buf))
+ break;
+ err = get_user_pages(current, current->mm,
+ pb->start + bit*PAGE_SIZE,
+ i*PAGE_SIZE,
+ 0,
+ 0,
+ NULL, NULL);
+ if (err < 0) {
+ printk("prefetch gup failed %d %lx\n", err,
+ pb->start+bit*PAGE_SIZE);
+ } else
+ found += err;
+ bit += i;
+ }
+ }
+ return found;
+}
+
+/* Prefetch the program's working set from last exit. */
+int pbitmap_load(struct file *f, unsigned long *buf,
+ unsigned long vaddr, unsigned long file_offset,
+ unsigned long filesz)
+{
+ int err;
+ struct pbitmap *pb;
+ struct vm_area_struct *vma;
+
+ vaddr &= PAGE_MASK;
+
+ vma = find_vma(current->mm, vaddr);
+ if (!vma)
+ return 0;
+
+ /* Likely BSS */
+ if (vma->vm_file == NULL)
+ return 0;
+
+ pb = alloc_pb(vma, vaddr, file_offset, filesz);
+ if (!pb)
+ return -ENOMEM;
+
+ /* For large programs it might be better to start a thread */
+ err = pbitmap_prefetch(f, buf, pb);
+ if (err < 0) {
+ free_pb(pb);
+ return err;
+ }
+ printk("%d prefetched\n", err);
+ list_add_tail(&pb->node, &current->mm->pbitmap_list);
+ return err;
+}
+
+/*
+ * Standard page table walker.
+ * Could later be converted to the .22 generic walker, but let's keep it like
+ * this for now.
+ */
+
+static void
+pbitmap_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct pbitmap *pb)
+{
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+ Pprintk("addr %lx end %lx start %lx\n", addr, end, pb->start);
+
+ /* Turn addresses into bitmap offsets */
+ addr -= pb->start;
+ end -= pb->start;
+ addr >>= PAGE_SHIFT;
+ end >>= PAGE_SHIFT;
+
+ do {
+ if (pte_present(*pte))
+ __set_bit(addr, pb->buffer);
+ } while (pte++, addr++, addr != end);
+ pte_unmap_unlock(pte - 1, ptl);
+}
+
+static inline void pbitmap_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long addr, unsigned long end,
+ struct pbitmap *pb)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(*pmd))
+ continue;
+ pbitmap_pte_range(vma, pmd, addr, next, pb);
+ } while (pmd++, addr = next, addr != end);
+}
+
+static void pbitmap_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ struct pbitmap *pb)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none(*pud))
+ continue;
+ pbitmap_pmd_range(vma, pud, addr, next, pb);
+ } while (pud++, addr = next, addr != end);
+}
+
+static void pbitmap_page_range(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct pbitmap *pb)
+{
+ pgd_t *pgd;
+ unsigned long next;
+
+ BUG_ON(addr >= end);
+ pgd = pgd_offset(vma->vm_mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(*pgd))
+ continue;
+ pbitmap_pud_range(vma, pgd, addr, next, pb);
+ } while (pgd++, addr = next, addr != end);
+}
+
+/*
+ * Do some paranoid checks on the VMA just in case the user unmaped it and
+ * mapped something else there.
+ * This is not strictly needed for security because the bitmap update will only
+ * write to the cached executable.
+ */
+static int vma_in_file(struct vm_area_struct *vma, struct pbitmap *pb)
+{
+ if (!vma->vm_file || vma->vm_file->f_dentry != pb->backing.dentry)
+ return 0;
+ if (vma->vm_start >= pb->start + pb->npages*PAGE_SIZE)
+ return 0;
+ return vma->vm_pgoff + vma_pages(vma) > pb->pgoff &&
+ vma->vm_pgoff < pb->pgoff + pb->npages;
+}
+
+static void pbitmap_walk_vmas(struct pbitmap *pb, struct vm_area_struct *vma)
+{
+ unsigned long addr = pb->start;
+
+ for (; vma && vma_in_file(vma, pb); vma = vma->vm_next) {
+ unsigned long end = pb->start + pb->npages*PAGE_SIZE;
+ if (addr < vma->vm_start)
+ addr = vma->vm_start;
+ if (end > vma->vm_end) // off by one?
+ end = vma->vm_end;
+ Pprintk("p_p_r %lx %lx vma %lx-%lx pb start %lx pages %lu\n",
+ addr, end,
+ vma->vm_start, vma->vm_end,
+ pb->start, pb->npages);
+ if (addr == end)
+ continue;
+ pbitmap_page_range(vma, addr, end, pb);
+ }
+}
+
+long bit_count(unsigned char *buffer, int n)
+{
+ int i;
+ long x = 0;
+ for (i = 0; i < n; i++) {
+ unsigned char v = buffer[i];
+ while (v) {
+ if (v & 1)
+ x++;
+ v >>= 1;
+ }
+ }
+ return x;
+}
+
+/* Avoid thrashing from too frequent updates */
+static int no_update(struct pbitmap *pb)
+{
+ struct inode *i;
+ if (!pbitmap_update_interval)
+ return 0;
+ i = pb->backing.dentry->d_inode;
+ return i->i_mtime.tv_sec + pbitmap_update_interval < xtime.tv_sec;
+}
+
+/*
+ * Update the present bitmaps on disk on program exit.
+ *
+ * Doesn't do any mmap_sem locking anywhere because there should be no
+ * other users of the mm at this point. The page table locks are
+ * unfortunately still needed because the mm could still being swapped
+ * out in parallel (TBD teach the swapper to not do that)
+ *
+ * Could be merged with unmap_vmas later; but it's easier to do it this
+ * way for now.
+ */
+void pbitmap_update(struct mm_struct *mm)
+{
+ struct pbitmap *pb, *prev;
+ struct vm_area_struct *vma;
+ mm_segment_t oldfs;
+ unsigned long *buffer;
+ int order;
+
+ if (list_empty(&mm->pbitmap_list))
+ return;
+
+ buffer = NULL;
+
+ oldfs = get_fs();
+ prev = NULL;
+ list_for_each_entry (pb, &mm->pbitmap_list, node) {
+ long bytes = BITS_TO_BYTES(pb->npages);
+
+ if (!bytes)
+ continue;
+
+ /* This is typically order 0; only extremly large VMAs
+ (> 128MB) would have order >0. Failing is also not fatal.
+ We don't want any swapping so use GFP_NOFS */
+ if (!buffer || bytes > (PAGE_SIZE << order)) {
+ if (buffer)
+ free_pages((unsigned long)buffer, order);
+ order = get_order(bytes);
+ buffer = (unsigned long *)
+ __get_free_pages(GFP_NOFS|__GFP_NOWARN|
+ __GFP_ZERO, order);
+ if (!buffer) {
+ printk("allocation %d failed\n", order);
+ break;
+ }
+ }
+
+ pb->buffer = buffer;
+
+ vma = find_vma(mm, pb->start);
+ /* Reuse the last file if possible */
+ if (prev && prev->fh &&
+ pb->backing.dentry == prev->backing.dentry &&
+ pb->backing.mnt == prev->backing.mnt) {
+ pb->fh = prev->fh;
+ prev->fh = NULL;
+ } else {
+ if (no_update(pb))
+ continue;
+ /* Reopen to get a writable file */
+ pb->fh = dentry_open(pb->backing.dentry,
+ pb->backing.mnt,
+ O_WRONLY|O_FORCEWRITE);
+ if (IS_ERR(pb->fh)) {
+ printk("dentry open of %s failed: %ld\n",
+ pb->backing.dentry->d_name.name,
+ PTR_ERR(pb->fh));
+ pb->fh = NULL;
+ }
+ }
+ if (pb->fh) {
+ int n;
+ pbitmap_walk_vmas(pb, vma);
+ Pprintk("%s: %ld bytes end %ld bits [%lx] at %Lx\n",
+ current->comm, bytes,
+ bit_count((unsigned char *)buffer, bytes),
+ buffer[0], pb->file_offset);
+ set_fs(KERNEL_DS);
+ n = vfs_write(pb->fh, (char *)buffer, bytes,
+ &pb->file_offset);
+ set_fs(oldfs);
+ if (n != bytes)
+ printk("%s: bitmap write %d of %ld\n",
+ current->comm, n, bytes);
+
+ /* fh contains the references now */
+ pb->backing.dentry = NULL;
+ pb->backing.mnt = NULL;
+ }
+ prev = pb;
+ }
+ list_for_each_entry_safe (pb, prev, &mm->pbitmap_list, node)
+ free_pb(pb);
+
+ if (buffer)
+ free_page((unsigned long)buffer);
+}
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c
+++ linux/mm/mmap.c
@@ -2039,6 +2039,7 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);

+ pbitmap_update(mm);
lru_add_drain();
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile
+++ linux/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- page_isolation.o $(mmu-y)
+ page_isolation.o $(mmu-y) pbitmap.o

obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
obj-$(CONFIG_BOUNCE) += bounce.o
Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h
+++ linux/include/linux/mm_types.h
@@ -222,6 +222,9 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct list_head pbitmap_list;
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
#endif

2008-03-18 01:12:12

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [7/8] Add the sysctls to control pbitmaps


- pbitmap_enabled: Master switch for pbitmap
- pbitmap_early_fault: Control whether pbitmap should do
early page faults or not. Default on.
- pbitmap_update_interval: How often the pbitmap should
be updated on disk.

Signed-off-by: Andi Kleen <[email protected]>

---
kernel/sysctl.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -1044,6 +1044,36 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "pbitmap_enabled",
+ .data = &pbitmap_enabled,
+ .maxlen = sizeof(pbitmap_enabled),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "pbitmap_early_fault",
+ .data = &pbitmap_early_fault,
+ .maxlen = sizeof(pbitmap_early_fault),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "pbitmap_update_interval",
+ .data = &pbitmap_update_interval,
+ .maxlen = sizeof(pbitmap_update_interval),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
.ctl_name = VM_VFS_CACHE_PRESSURE,
.procname = "vfs_cache_pressure",
.data = &sysctl_vfs_cache_pressure,

2008-03-18 01:12:30

by Andi Kleen

[permalink] [raw]
Subject: [PATCH prototype] [8/8] Add mmap_full_slurp support


This is the non-subtle brother of pbitmaps. Instead of only prefetching
the pages used last time always prefetch complete mmap areas
(subject to the max_ra_chunk limits)

Main advantage is that it works for shared libraries too (but
the prefetching is currently done for all file backed mmaps,
not just executable mappings)

Disadvantage is that uses far more memory and does a lot of
unnecessary work.

Still is faster in some circumstances.

Experimental. Mainly added for comparison. Off by default.
Probably not a good idea.

I'm more including it for easier experimentation.

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/mm.h | 2 ++
kernel/sysctl.c | 10 ++++++++++
mm/mmap.c | 10 ++++++++++
3 files changed, 22 insertions(+)

Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c
+++ linux/mm/mmap.c
@@ -40,6 +40,8 @@
#define arch_rebalance_pgtables(addr, len) (addr)
#endif

+int mmap_full_slurp __read_mostly;
+
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end);
@@ -2023,6 +2025,14 @@ out:
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
}
+ if (vma->vm_file && mmap_full_slurp) {
+ up_write(&mm->mmap_sem);
+ force_page_cache_readahead(vma->vm_file->f_mapping,
+ vma->vm_file,
+ vma->vm_pgoff,
+ max_sane_readahead(vma_pages(vma)));
+ down_write(&mm->mmap_sem);
+ }
return addr;
}

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1077,6 +1077,8 @@ extern int do_munmap(struct mm_struct *,

extern unsigned long do_brk(unsigned long, unsigned long);

+extern int mmap_full_slurp;
+
/* filemap.c */
extern unsigned long page_unuse(struct page *);
extern void truncate_inode_pages(struct address_space *, loff_t);
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -909,6 +909,16 @@ static struct ctl_table vm_table[] = {
.extra2 = &one_hundred,
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "mmap_full_slurp",
+ .data = &mmap_full_slurp,
+ .maxlen = sizeof(mmap_full_slurp),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
.procname = "dirty_writeback_centisecs",
.data = &dirty_writeback_interval,
.maxlen = sizeof(dirty_writeback_interval),

2008-03-18 07:37:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Tue, 18 Mar 2008 02:09:34 +0100 (CET) Andi Kleen <[email protected]> wrote:

> This patchkit is an experimental optimization I played around with
> some time ago.
>
> This is more a prototype still, but I wanted to push it out
> so that other people can play with it.
>
> The basic idea is that most programs have the same working set
> over multiple runs. So instead of demand paging all the text pages
> in the order the program runs save the working set to disk and prefetch
> it at program start and then save it at program exit.
>
> This allows some optimizations:
> - it can avoid unnecessary disk seeks because the blocks will be fetched in
> sorted offset order instead of program execution order.
> - batch kernel entries (each demand page exception has some
> overhead just for entering the kernel). This keeps the caches hot too.
> - The prefetch could be in theory done in the background while the program
> runs (although that is not implemented currently)

Should be worthwhile for some things.

> Some details on the implementation:

Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,
use /proc/self/maps and the new pagemap code to work out which pages of
which files were faulted in, write that info into the elf file (or a
separate per-executable shadow file), then use that info the next time the
app is executed, either with an LD_PRELOAD or just a wrapper.

> Drawbacks:
> - No support for dynamic libraries right now (except very clumpsily
> through the mmap_slurp hack). This is the main reason it is not
> very useful for speed up desktops currently.
>
> - Executable files have to be writable by the user executing it
> currently to get bitmap updates. It would be possible to let the
> kernel bypass this, but I haven't thought too much about the security
> implications of it.
> However any user can use the bitmap data written by a user with
> write rights.

Those all get fixed with the userspace version?

2008-03-19 17:52:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Tue, Mar 18, 2008 at 12:36:20AM -0700, Andrew Morton wrote:
> On Tue, 18 Mar 2008 02:09:34 +0100 (CET) Andi Kleen <[email protected]> wrote:
>
> > This patchkit is an experimental optimization I played around with
> > some time ago.
> >
> > This is more a prototype still, but I wanted to push it out
> > so that other people can play with it.
> >
> > The basic idea is that most programs have the same working set
> > over multiple runs. So instead of demand paging all the text pages
> > in the order the program runs save the working set to disk and prefetch
> > it at program start and then save it at program exit.
> >
> > This allows some optimizations:
> > - it can avoid unnecessary disk seeks because the blocks will be fetched in
> > sorted offset order instead of program execution order.
> > - batch kernel entries (each demand page exception has some
> > overhead just for entering the kernel). This keeps the caches hot too.
> > - The prefetch could be in theory done in the background while the program
> > runs (although that is not implemented currently)
>
> Should be worthwhile for some things.
>
> > Some details on the implementation:
>
> Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,

In theory yes, but it would have much more overhead. Also I think prefetching
algorithms like this really belong in the kernel.

> > - Executable files have to be writable by the user executing it
> > currently to get bitmap updates. It would be possible to let the
> > kernel bypass this, but I haven't thought too much about the security
> > implications of it.
> > However any user can use the bitmap data written by a user with
> > write rights.
>
> Those all get fixed with the userspace version?

No in fact the permission problem is much harder to fix in user space.

The shared libraries would need some user support. Basically just a way
how the ld.so can tell the kernel about the offsets/locations of new
bitmaps to add to the pbitmap list

-Andi

>

2008-03-19 19:29:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Tue, 18 Mar 2008 18:20:45 +0100 Andi Kleen <[email protected]> wrote:

> > What's the permission problem? executable-but-not-readable files? Could
>
> Not writable.

Oh.

I doubt if a userspace implementation would even try to alter the ELF
files, really - there seems to be no point in it. This is just complexity
which was added by trying to do it in the kernel.

> > be handled by passing your request to a suitable-privileged server process,
> > I guess.
>
> Yes it could, but i dont even want to thi nk about all the issues of
> doing such an interface. It is basically an microkernelish approach.
> I prefer monolithic simplicity.

It's not complex at all. Pass a null-terminated pathname to the server and
keep running. The server will asynchronously read your pages for you.

That's assuming executable+unreadable libraries and binaries actually need
to be handled. If not: no server needed.

> e.g. i am pretty sure your user space implementation would be far
> more complicated than a nicely streamlined kernel implementation.
> And I am really not a friend of unnecessary complexity. In the end
> complexity hurts you, no matter if it is in ring 3 or ring 0.

There is no complexity here.

2008-03-19 19:29:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Tue, 18 Mar 2008 15:18:28 +0100 Andi Kleen <[email protected]> wrote:

> On Tue, Mar 18, 2008 at 12:36:20AM -0700, Andrew Morton wrote:
> > On Tue, 18 Mar 2008 02:09:34 +0100 (CET) Andi Kleen <[email protected]> wrote:
> >
> > > This patchkit is an experimental optimization I played around with
> > > some time ago.
> > >
> > > This is more a prototype still, but I wanted to push it out
> > > so that other people can play with it.
> > >
> > > The basic idea is that most programs have the same working set
> > > over multiple runs. So instead of demand paging all the text pages
> > > in the order the program runs save the working set to disk and prefetch
> > > it at program start and then save it at program exit.
> > >
> > > This allows some optimizations:
> > > - it can avoid unnecessary disk seeks because the blocks will be fetched in
> > > sorted offset order instead of program execution order.
> > > - batch kernel entries (each demand page exception has some
> > > overhead just for entering the kernel). This keeps the caches hot too.
> > > - The prefetch could be in theory done in the background while the program
> > > runs (although that is not implemented currently)
> >
> > Should be worthwhile for some things.
> >
> > > Some details on the implementation:
> >
> > Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,
>
> In theory yes, but it would have much more overhead.

s/much/slightly/. The main win will be in optimising disk patterns. We
can fault in cached pages at, what? A few GB/sec?

> Also I think prefetching
> algorithms like this really belong in the kernel.

hrmph.

> > > - Executable files have to be writable by the user executing it
> > > currently to get bitmap updates. It would be possible to let the
> > > kernel bypass this, but I haven't thought too much about the security
> > > implications of it.
> > > However any user can use the bitmap data written by a user with
> > > write rights.
> >
> > Those all get fixed with the userspace version?
>
> No in fact the permission problem is much harder to fix in user space.

What's the permission problem? executable-but-not-readable files? Could
be handled by passing your request to a suitable-privileged server process,
I guess.

2008-03-19 19:34:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Wed, 19 Mar 2008 09:32:28 +0100 Andi Kleen <[email protected]> wrote:

> On Tue, Mar 18, 2008 at 10:44:37AM -0700, Andrew Morton wrote:
> > On Tue, 18 Mar 2008 18:20:45 +0100 Andi Kleen <[email protected]> wrote:
> >
> > > > What's the permission problem? executable-but-not-readable files? Could
> > >
> > > Not writable.
> >
> > Oh.
> >
> > I doubt if a userspace implementation would even try to alter the ELF
> > files, really - there seems to be no point in it. This is just complexity
>
> Well the information has to be somewhere and i think the ELF file
> is the best location for it. It makes it the most user transparent.

Adopt a standard, stick with it.

Assuming that all users have the same access pattern might be inefficient,
a little bit. There might be some advantage to making it per-user, dunno.

The requirement to write to an executable sounds like a bit of a
showstopper.

> > > Yes it could, but i dont even want to thi nk about all the issues of
> > > doing such an interface. It is basically an microkernelish approach.
> > > I prefer monolithic simplicity.
> >
> > It's not complex at all. Pass a null-terminated pathname to the server and
> > keep running. The server will asynchronously read your pages for you.
>
> But how do you update the bitmap in your scheme?

umm,

BITMAP_TRAINING_RUN=1 /usr/lib64/firefox-2.0.0.12/firefox-bin

will write the bitmap to ~/.bitmaps/usr/lib64/firefox-2.0.0.12/firefox-bin ?

if it proves useful, build it all into libc..


I'm assuming that the per-page minor fault cost is relatively low and that
the major benefit is in disk scheduling[*]. If that's false then we'd need
kernel support I guess - some sort of gang-fault syscall?


* solid-state disks are going to put a lot of code out of a job.

2008-03-19 20:38:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

> What's the permission problem? executable-but-not-readable files? Could

Not writable.

> be handled by passing your request to a suitable-privileged server process,
> I guess.

Yes it could, but i dont even want to thi nk about all the issues of
doing such an interface. It is basically an microkernelish approach.
I prefer monolithic simplicity.

e.g. i am pretty sure your user space implementation would be far
more complicated than a nicely streamlined kernel implementation.
And I am really not a friend of unnecessary complexity. In the end
complexity hurts you, no matter if it is in ring 3 or ring 0.

-Andi

2008-03-19 21:45:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Tue, Mar 18, 2008 at 10:44:37AM -0700, Andrew Morton wrote:
> On Tue, 18 Mar 2008 18:20:45 +0100 Andi Kleen <[email protected]> wrote:
>
> > > What's the permission problem? executable-but-not-readable files? Could
> >
> > Not writable.
>
> Oh.
>
> I doubt if a userspace implementation would even try to alter the ELF
> files, really - there seems to be no point in it. This is just complexity

Well the information has to be somewhere and i think the ELF file
is the best location for it. It makes it the most user transparent.

> > Yes it could, but i dont even want to thi nk about all the issues of
> > doing such an interface. It is basically an microkernelish approach.
> > I prefer monolithic simplicity.
>
> It's not complex at all. Pass a null-terminated pathname to the server and
> keep running. The server will asynchronously read your pages for you.

But how do you update the bitmap in your scheme?

> > And I am really not a friend of unnecessary complexity. In the end
> > complexity hurts you, no matter if it is in ring 3 or ring 0.
>
> There is no complexity here.

I have my doubts on that if you consider update too.

-Andi
>

2008-03-19 23:58:38

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Wed, Mar 19, 2008 at 2:04 AM, Andrew Morton
<[email protected]> wrote:
> The requirement to write to an executable sounds like a bit of a
> showstopper.

Agreed. In addition, it makes all kinds of tools misbehave. I can
see extensions to the ELF format and those would require relinking
binaries, but without this you create chaos. ELF is so nice because
it describes the entire file and we can handle everything according to
rules. If you just somehow magically patch some new data structure in
this will break things.

Furthermore, by adding all this data to the end of the file you'll
normally create unnecessary costs. The parts of the binaries which
are used at runtime start at the front. At the end there might be a
lot of data which isn't needed at runtime and is therefore normally
not read.


> if it proves useful, build it all into libc..

I could see that. But to handle it efficiently kernel support is
needed. Especially since I don't think the currently proposed
"learning mode" is adequate. Over many uses of a program all kinds of
pages will be needed. Far more than in most cases. The prefetching
should really only cover the commonly used code paths in the program.
If you pull in everything, this will have advantages if you have that
much page cache to spare. In that case just prefetching the entire
file is even easier. No, such an improved method has to be more
selective.

But if we're selective in the loading of the pages we'll
(unfortunately) end up with holes. Possibly many of them. This would
mean with today's interfaces a large number of madvise() calls. What
would be needed is kernel support which takes a bitmap, each bit
representing a page. This bitmap could be allocated as part of the
binary by the linker. With appropriate ELF data structures supporting
it so that strip et.al won't stumble. To fill in the bitmaps one can
have separate a separate tool which is explicitly asked to update the
bitmap data. To collect the page fault data one could use systemtap.
It's easy enough to write a script which monitors the minor page
faults for each binary and writes the data into a file. The binary
update tool and can use the information from that file to generate the
bitmap.

2008-03-20 00:06:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Wed, 19 Mar 2008 15:45:16 -0700
"Ulrich Drepper" <[email protected]> wrote:

> On Wed, Mar 19, 2008 at 2:04 AM, Andrew Morton
> <[email protected]> wrote:
> > The requirement to write to an executable sounds like a bit of a
> > showstopper.
>
> Agreed. In addition, it makes all kinds of tools misbehave. I can
> see extensions to the ELF format and those would require relinking
> binaries, but without this you create chaos. ELF is so nice because
> it describes the entire file and we can handle everything according to
> rules. If you just somehow magically patch some new data structure in
> this will break things.
>
> Furthermore, by adding all this data to the end of the file you'll
> normally create unnecessary costs. The parts of the binaries which
> are used at runtime start at the front. At the end there might be a
> lot of data which isn't needed at runtime and is therefore normally
> not read.
>
>
> > if it proves useful, build it all into libc..
>
> I could see that. But to handle it efficiently kernel support is
> needed. Especially since I don't think the currently proposed
> "learning mode" is adequate. Over many uses of a program all kinds of
> pages will be needed. Far more than in most cases. The prefetching
> should really only cover the commonly used code paths in the program.
> If you pull in everything, this will have advantages if you have that
> much page cache to spare. In that case just prefetching the entire
> file is even easier. No, such an improved method has to be more
> selective.

yes, ultimately we'd end up pulling in most of the executable that way, and
a 90%-good-enough solution is perhaps to just suck the whole thing into
pagecache.

I did some work on that many years ago and I do recall that it helped, but
I forget how much.

There's a very very easy way of testing this though. filemap_fault()
_already_ does readaround when it gets a major fault. And this is tunable
via /sys/block/sda/queue/read_ahead_kb. So set that to "infinity" to pull
the whole file into pagecache at the first major fault and run some
benchmarks.

If that proves useful we could look at separating read()'s readahead
tunable from filemap_fault()'s tunable.

Note! Due to interaction between the linker and common filesystems,
executables tend to be very poorly laid out on disk: the blocks are
out-of-order, often grossly. So one shouldn't do performance testing of
this form against executable files which were written directly by
/usr/bin/ld. First use `cp' to straighten the file out..

> But if we're selective in the loading of the pages we'll
> (unfortunately) end up with holes. Possibly many of them. This would
> mean with today's interfaces a large number of madvise() calls. What
> would be needed is kernel support which takes a bitmap, each bit
> representing a page. This bitmap could be allocated as part of the
> binary by the linker. With appropriate ELF data structures supporting
> it so that strip et.al won't stumble. To fill in the bitmaps one can
> have separate a separate tool which is explicitly asked to update the
> bitmap data. To collect the page fault data one could use systemtap.
> It's easy enough to write a script which monitors the minor page
> faults for each binary and writes the data into a file. The binary
> update tool and can use the information from that file to generate the
> bitmap.

bitmap-based madvise() or fadvise() sounds pretty easy to do.

2008-03-20 00:36:45

by David Miller

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

From: Andrew Morton <[email protected]>
Date: Wed, 19 Mar 2008 16:12:11 -0700

> I did some work on that many years ago and I do recall that it
> helped, but I forget how much.

I wrote such a patch ages ago as well.

Frankly, based upon my experiences then and what I know now, I think
it's a lose to do this.

Better to 1) have enough ram and 2) make the reclaim smarter about
important "executable" page cache pages.

2008-03-20 00:37:32

by Diego Calleja

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

El Wed, 19 Mar 2008 02:04:40 -0700, Andrew Morton <[email protected]> escribi?:

> Assuming that all users have the same access pattern might be inefficient,
> a little bit. There might be some advantage to making it per-user, dunno.

In the Dark Side of operating systems, the prefetching system they use
can log several access patterns for a single executable, because a single
executable can have different behaviours even for the same user, depending
on what parameters the executable is passed and what COM machinery it
uses. For example, wmplayer.exe can play a dvd, rip a CD, listen to a music
stream, etc...diferent usages, different access patterns. Linux probably faces
the same problem (bash, cat...)

A alternative design for a userspace solution that doesn't needs LD_PRELOAD
is to use CONFIG_PROC_EVENTS to get notifications of what processes are
started, which can be used to poll its /proc files or try to preload data
(asynchronously, and a bit hacky maybe).

But if a kernel patch is really needed to implement this properly, maybe
it'd be worth to take a look at the prefetch project that the Ubuntu guys
are apparently going to merge in the next ubuntu development release (8.10)...
https://wiki.ubuntu.com/DesktopTeam/Specs/Prefetch

There are even kernel patches:
http://code.google.com/p/prefetch/source/browse/tags/soc2007-end/trunk/kernel-patches/2.6.22/submitted/0001-prefetch-core.diff
http://code.google.com/p/prefetch/source/browse/tags/soc2007-end/trunk/kernel-patches/2.6.22/submitted/0002-prefetch-boot.diff

2008-03-20 08:57:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Wed, Mar 19, 2008 at 03:45:16PM -0700, Ulrich Drepper wrote:
> On Wed, Mar 19, 2008 at 2:04 AM, Andrew Morton
> <[email protected]> wrote:
> > The requirement to write to an executable sounds like a bit of a
> > showstopper.
>
> Agreed. In addition, it makes all kinds of tools misbehave. I can
> see extensions to the ELF format and those would require relinking
> binaries, but without this you create chaos. ELF is so nice because

What chaos exactly? For me it looks rather that a separatate database
would be a recipe for chaos. e.g. for example how would you make sure
the database keeps track of changing executables?

The only minor issue is rpm -Va and similar, but that can be handled
in the same way as prelinking is handled there by using filter scripts
that reset the bitmap before taking a checksum.

Also the current way does not require relinking by using the shdr hack.
I have a simple program that adds a bitmap shdr to any executable.
But if the binutils leanred about this and added a bitmap phdr (it
tends to be only a few hundred bytes even on very large executables)
one seek could be avoided.


> it describes the entire file and we can handle everything according to
> rules. If you just somehow magically patch some new data structure in
> this will break things.

Can you elaborate what you think will be broken?

>
> Furthermore, by adding all this data to the end of the file you'll

We are talking about 32bytes for each MB worth of executable.
You can hardly call that "all that data". Besides the prefetcher supports
in theory larger page sizes, so if you wanted you could even reduce
that even more by e.g. using 64k prefetch blocks. But the overhead
is so tiny that it doesn't make much difference.

> normally create unnecessary costs. The parts of the binaries which
> are used at runtime start at the front. At the end there might be a
> lot of data which isn't needed at runtime and is therefore normally
> not read.

Yes as I said using the SHDR currently requires an additional seek
(although in practice i expect it to be served out of the track
buffer of the hard disk if the executable is not too large and is
continuous on disk). If binutils were taught of generating a phdr
that minor problem would go away.

>
>
> > if it proves useful, build it all into libc..
>
> I could see that. But to handle it efficiently kernel support is
> needed. Especially since I don't think the currently proposed

I agree.

> "learning mode" is adequate. Over many uses of a program all kinds of

It is not too bad, but could be certainly better. I outlined
some possible ways to improve the algorithms in my original way.
It would be a nice research project for someone to investigate
the various ways (anyone interested?)

> pages will be needed. Far more than in most cases. The prefetching
> should really only cover the commonly used code paths in the program.

Sorry that doesnt make sense. Anything that is read at startup
has to be prefetched, even if that code is only executed once.
Otherwise the whole scheme is rather useless.
Because even a single access requires the IO to read it from
disk.

> If you pull in everything, this will have advantages if you have that
> much page cache to spare. In that case just prefetching the entire

Yes that is what the "mmap_flush" hack (last patch does) I actually
have some numbers on a older kernel and in some cases it really
does quite well. But it also has a few problems (e.g. interaction
with data mmaps and memory waste) that are unpleasant.

-Andi

2008-03-21 17:15:30

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Thu, Mar 20, 2008 at 2:00 AM, Andi Kleen <[email protected]> wrote:
> What chaos exactly? For me it looks rather that a separatate database
> would be a recipe for chaos. e.g. for example how would you make sure
> the database keeps track of changing executables?

I didn't say that a separate file with the data is better. In fact, I
agree, it's not much better. What I referred to as the problem is
that this is an extension which is not part of the ELF spec and
doesn't fit in. The ELF spec has rules how programs have to deal with
unknown parts of a binary. Only this way programs like strip can work
in the presence of extensions. There are ways to embed such a bitmap
but not ad-hoc as you proposed.


> But if the binutils leanred about this and added a bitmap phdr (it
> tends to be only a few hundred bytes even on very large executables)
> one seek could be avoided.

And that is only one advantage. Let's not go done the path of invalid
ELF files.


> > Furthermore, by adding all this data to the end of the file you'll
>
> We are talking about 32bytes for each MB worth of executable.
> You can hardly call that "all that data".

This wasn't a comment about the size of the data but the type of data.
The end of a binary contains data which is not used at runtime. Now
you're mixing in data which is used.


> > pages will be needed. Far more than in most cases. The prefetching
> > should really only cover the commonly used code paths in the program.
>
> Sorry that doesnt make sense. Anything that is read at startup
> has to be prefetched, even if that code is only executed once.
> Otherwise the whole scheme is rather useless.
> Because even a single access requires the IO to read it from
> disk.

Again, you misunderstand. I'm not proposing to exclude pages which
are only used at startup time. I mean the data collection should stop
some time after a process was started to account for possibly quite
different code paths which are taken by different runs of the program.
I.e., don't record page faults for the entire runtime of the program
which, hopefully in most cases, will result in all the pages of a
program to be marked (unless you have a lot of dead code in the binary
and it's all located together).

2008-03-21 17:24:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Fri, Mar 21, 2008 at 10:15:15AM -0700, Ulrich Drepper wrote:
> On Thu, Mar 20, 2008 at 2:00 AM, Andi Kleen <[email protected]> wrote:
> > What chaos exactly? For me it looks rather that a separatate database
> > would be a recipe for chaos. e.g. for example how would you make sure
> > the database keeps track of changing executables?
>
> I didn't say that a separate file with the data is better. In fact, I
> agree, it's not much better. What I referred to as the problem is
> that this is an extension which is not part of the ELF spec and

Linux executables already contain plenty of extensions outside
the ELF spec like GNU_EH_FRAME or debuglink etc. It is not surprising
because the ELF spec is kind of not maintained anymore afaik.

> doesn't fit in. The ELF spec has rules how programs have to deal with
> unknown parts of a binary. Only this way programs like strip can work

Can you expand how the bitmap headers or pbitmap.c violate these rules?

> in the presence of extensions. There are ways to embed such a bitmap
> but not ad-hoc as you proposed.

Concrete suggestions please.

>
>
> > But if the binutils leanred about this and added a bitmap phdr (it
> > tends to be only a few hundred bytes even on very large executables)
> > one seek could be avoided.
>
> And that is only one advantage. Let's not go done the path of invalid
> ELF files.

What is invalid?

>
>
> > > Furthermore, by adding all this data to the end of the file you'll
> >
> > We are talking about 32bytes for each MB worth of executable.
> > You can hardly call that "all that data".
>
> This wasn't a comment about the size of the data but the type of data.
> The end of a binary contains data which is not used at runtime. Now
> you're mixing in data which is used.

Well there was no other choice I know of short of relinking. Or do you
have a way to add a PHDR without relinking? I am aware the SHDR is a hack,
I called it that myself. I just don't know of a better way.

If the pbitmaps were universally adopted the use of the SHDRs would
be phased out quickly I expect because the bitmaps would be standard
parts of all PHDRs, but short term not requiring relinking
is a huge advantage.

> Again, you misunderstand. I'm not proposing to exclude pages which
> are only used at startup time. I mean the data collection should stop
> some time after a process was started to account for possibly quite
> different code paths which are taken by different runs of the program.
> I.e., don't record page faults for the entire runtime of the program
> which, hopefully in most cases, will result in all the pages of a
> program to be marked (unless you have a lot of dead code in the binary
> and it's all located together).

When would that time be? I cannot think of a single heuristic that would
work for both "/bin/true" and a OpenOffice start.

-Andi

2008-03-22 04:37:05

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Fri, Mar 21, 2008 at 10:26 AM, Andi Kleen <[email protected]> wrote:
> Concrete suggestions please.

I already spelled it out. Add a new program header entry, point it to
a bit array large enough to cover all loadable segments.

It is not worth creating problems with this invalid extension just for
old binaries. Just let those go. New binaries can automatically get
the array and then there are no extra seeks, the format is well
defined, etc.

2008-03-22 04:38:39

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Fri, Mar 21, 2008 at 10:26 AM, Andi Kleen <[email protected]> wrote:
> When would that time be? I cannot think of a single heuristic that would
> work for both "/bin/true" and a OpenOffice start.

In both cases the stable state is reached after, say, 4 seconds. It's
just that true terminates before the time is up. I think something
like "trace the first N seconds" is a reasonable heuristics.

2008-03-22 07:15:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Fri, Mar 21, 2008 at 09:36:51PM -0700, Ulrich Drepper wrote:
> On Fri, Mar 21, 2008 at 10:26 AM, Andi Kleen <[email protected]> wrote:
> > Concrete suggestions please.
>
> I already spelled it out. Add a new program header entry, point it to
> a bit array large enough to cover all loadable segments.

Ah that's easy, the program header is already supported in the kernel code
(PT_PRESENT_BITMAP)

The additional SHDR is just there for easier testing/migration.
>
> It is not worth creating problems with this invalid extension just for

You still didn't say why it was invalid.

Anyways I disagree on the value of supporting old binaries. I believe
it is important.

-Andi

2008-03-22 07:25:11

by Nicholas Miell

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


On Sat, 2008-03-22 at 08:17 +0100, Andi Kleen wrote:
> On Fri, Mar 21, 2008 at 09:36:51PM -0700, Ulrich Drepper wrote:
> > On Fri, Mar 21, 2008 at 10:26 AM, Andi Kleen <[email protected]> wrote:
> > > Concrete suggestions please.
> >
> > I already spelled it out. Add a new program header entry, point it to
> > a bit array large enough to cover all loadable segments.
>
> Ah that's easy, the program header is already supported in the kernel code
> (PT_PRESENT_BITMAP)
>
> The additional SHDR is just there for easier testing/migration.
> >
> > It is not worth creating problems with this invalid extension just for
>
> You still didn't say why it was invalid.
>
> Anyways I disagree on the value of supporting old binaries. I believe
> it is important.
>
> -Andi

Why not stick the bitmap in an xattr?

--
Nicholas Miell <[email protected]>

2008-03-22 09:07:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

> Why not stick the bitmap in an xattr?

xattrs are too small for potentially large binaries and a mess to manage
(a lot of tools don't know about them)

-Andi

2008-03-22 10:16:46

by Nicholas Miell

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


On Sat, 2008-03-22 at 10:10 +0100, Andi Kleen wrote:
> > Why not stick the bitmap in an xattr?
>
> xattrs are too small for potentially large binaries

*sigh* this is probably true

> and a mess to manage (a lot of tools don't know about them)

At this point in time, all tools that don't support xattrs are
defective, but this is still probably true.

I just have an instinctive aversion towards the kernel mucking around in
ELF objects -- for one thing, you're going to have to blacklist
cryptographically signed binaries.


--
Nicholas Miell <[email protected]>

2008-03-22 14:27:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Sat, Mar 22, 2008 at 03:16:31AM -0700, Nicholas Miell wrote:
>
> *sigh* this is probably true

Actually it is a relatively weak argument assuming the standard
4k xattrs, but still an issue.

The other stronger argument against it is that larger xattrs tend to be
outside the inode so you would have another seek again.

> > and a mess to manage (a lot of tools don't know about them)
>
> At this point in time, all tools that don't support xattrs are
> defective,

Good joke.

> I just have an instinctive aversion towards the kernel mucking around in
> ELF objects -- for one thing, you're going to have to blacklist
> cryptographically signed binaries.

What signed binaries?

Anyways there are two ways to deal with this:

- Run the executable through a little filter that zeroes the bitmap before
computing the checksum. That is how rpm -V deals with prelinked binaries which
have a similar issue. You can probably reuse the scripts from rpm.
- Disable the pbitmap header before you sign, either by never adding
one or disabling it by turning the phdr type into a nop (should be very simple)

-Andi

2008-03-23 13:38:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Sat 2008-03-22 15:29:49, Andi Kleen wrote:
> On Sat, Mar 22, 2008 at 03:16:31AM -0700, Nicholas Miell wrote:
> >
> > *sigh* this is probably true
>
> Actually it is a relatively weak argument assuming the standard
> 4k xattrs, but still an issue.
>
> The other stronger argument against it is that larger xattrs tend to be
> outside the inode so you would have another seek again.
>
> > > and a mess to manage (a lot of tools don't know about them)
> >
> > At this point in time, all tools that don't support xattrs are
> > defective,
>
> Good joke.
>
> > I just have an instinctive aversion towards the kernel mucking around in
> > ELF objects -- for one thing, you're going to have to blacklist
> > cryptographically signed binaries.
>
> What signed binaries?
>
> Anyways there are two ways to deal with this:
>
> - Run the executable through a little filter that zeroes the bitmap before
> computing the checksum. That is how rpm -V deals with prelinked binaries which
> have a similar issue. You can probably reuse the scripts from rpm.

Is this good idea? Attacker can send you binary with the bitmap
inverted, it is now slow on your system and signature matches.

...might be important for benchmarks... 'here, see, Oracle is slow.
Feel free to verify the signature'.

...ok, I guess it is not too serious, because it is similar to
fragmentation....
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-03-23 17:05:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

> Is this good idea? Attacker can send you binary with the bitmap
> inverted, it is now slow on your system and signature matches.

The first run will fix up any missing bits in the bitmap. Right
now it cannot get rid of unnecessary pages though unless you
disable early_fault.

> ...might be important for benchmarks... 'here, see, Oracle is slow.
> Feel free to verify the signature'.
>
> ...ok, I guess it is not too serious, because it is similar to
> fragmentation....

It is actually far better than fragmentation because the bitmap
loader does IO always in big chunks -- not much seeking will go on.
The only problem is some wasted mmeory and more IO bandwidth
usage (but typically binaries are not bigger than a few MB so
it's not too dramatic)

So in summary I don't think it's an issue.

-Andi

2008-03-24 04:20:18

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Sat, Mar 22, 2008 at 2:10 AM, Andi Kleen <[email protected]> wrote:
> > Why not stick the bitmap in an xattr?
>
> xattrs are too small for potentially large binaries and a mess to manage
> (a lot of tools don't know about them)

It does not matter a bit whether any other tool know about the xattrs.
Binaries will not change after they are created to require changing
the attributes. And it is no problem to not back the data up etc.
One really doesn't want to waste these resources.

And as far as the size limitation is concerned. It depends on the
limit, which I don't know off hand. But really, really big binaries
don't have to be treated like this anyway. They are not started
frequently enough to justify this.

2008-03-24 05:16:18

by Nicholas Miell

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


On Sun, 2008-03-23 at 21:20 -0700, Ulrich Drepper wrote:
> On Sat, Mar 22, 2008 at 2:10 AM, Andi Kleen <[email protected]> wrote:
> > > Why not stick the bitmap in an xattr?
> >
> > xattrs are too small for potentially large binaries and a mess to manage
> > (a lot of tools don't know about them)
>
> It does not matter a bit whether any other tool know about the xattrs.
> Binaries will not change after they are created to require changing
> the attributes. And it is no problem to not back the data up etc.
> One really doesn't want to waste these resources.

Actually, yeah, the tools issue is a bit of a red herring now that you
mention it -- the only thing that's going to create and use these
bitmaps is the kernel, and if bitmap gets lost due to the use of ancient
tools, the kernel will just recreate the bitmap the next time the binary
runs.

> And as far as the size limitation is concerned. It depends on the
> limit, which I don't know off hand. But really, really big binaries
> don't have to be treated like this anyway. They are not started
> frequently enough to justify this.

The limit is filesystem dependent -- I think ext2/3s is something like
4k total for attribute names and values per inode.

That's more than enough space for the largest executable on my system
(emacs at 36788160 bytes) which would have a 1123 byte predictive bitmap
(plus space for the name e.g. "system.predictive_bitmap"). The bitmap
also could be compressed.

--
Nicholas Miell <[email protected]>

2008-03-24 05:23:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


First emacs is still a rather small executable, there are much larger
ones around.

> The limit is filesystem dependent -- I think ext2/3s is something like
> 4k total for attribute names and values per inode.

That large xattrs tend to be out of line on a separate block and that would cost
an additional seek. It would be unlikely to be continuous to the rest of the file
data and thus even be worse than the SHDR which is at least likely to be
served from the same track buffer as the executable.

-Andi

2008-03-24 19:42:35

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Sun, Mar 23, 2008 at 10:16 PM, Nicholas Miell <[email protected]> wrote:
> The limit is filesystem dependent -- I think ext2/3s is something like
> 4k total for attribute names and values per inode.
>
> That's more than enough space for the largest executable on my system
> (emacs at 36788160 bytes) which would have a 1123 byte predictive bitmap
> (plus space for the name e.g. "system.predictive_bitmap"). The bitmap
> also could be compressed.

4k attribute means support for about 32768 pages. That's a total of
134MB. I think this qualifies as sufficient. Also, I assume the
attribute limit is just a "because nobody needed more so far" limit
and could in theory be extended.

2008-03-24 21:47:31

by Nicholas Miell

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


On Mon, 2008-03-24 at 12:42 -0700, Ulrich Drepper wrote:
> On Sun, Mar 23, 2008 at 10:16 PM, Nicholas Miell <[email protected]> wrote:
> > The limit is filesystem dependent -- I think ext2/3s is something like
> > 4k total for attribute names and values per inode.
> >
> > That's more than enough space for the largest executable on my system
> > (emacs at 36788160 bytes) which would have a 1123 byte predictive bitmap
> > (plus space for the name e.g. "system.predictive_bitmap"). The bitmap
> > also could be compressed.
>
> 4k attribute means support for about 32768 pages. That's a total of
> 134MB. I think this qualifies as sufficient. Also, I assume the
> attribute limit is just a "because nobody needed more so far" limit
> and could in theory be extended.

The on-disk format theoretically supports multi-block xattrs, but the
kernel driver is hardcoded to support only one.

Also, keep in mind that that 4k limit is for all attributes for an inode
and includes xattr names, values, and various bits of meta data. As
such, the limit is actually less than 4k total and the space is shared
among POSIX ACLs, SELinux contexts and whatever other attributes the
user would like to store on the file.

(Actually, it's 4k plus whatever xattr space there is in the inode,
which depends on how the filesystem was formatted.)

--
Nicholas Miell <[email protected]>

2008-03-25 07:51:07

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Mon, Mar 24, 2008 at 12:42:14PM -0700, Ulrich Drepper wrote:
> On Sun, Mar 23, 2008 at 10:16 PM, Nicholas Miell <[email protected]> wrote:
> > The limit is filesystem dependent -- I think ext2/3s is something like
> > 4k total for attribute names and values per inode.
> >
> > That's more than enough space for the largest executable on my system
> > (emacs at 36788160 bytes) which would have a 1123 byte predictive bitmap
> > (plus space for the name e.g. "system.predictive_bitmap"). The bitmap
> > also could be compressed.
>
> 4k attribute means support for about 32768 pages. That's a total of
> 134MB. I think this qualifies as sufficient. Also, I assume the
> attribute limit is just a "because nobody needed more so far" limit
> and could in theory be extended.

There is still the additional seek. Large xattrs tend to be out of line
from the inode.

-Andi

2008-03-25 14:43:22

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Sun 2008-03-23 18:08:27, Andi Kleen wrote:
> > Is this good idea? Attacker can send you binary with the bitmap
> > inverted, it is now slow on your system and signature matches.
>
> The first run will fix up any missing bits in the bitmap. Right
> now it cannot get rid of unnecessary pages though unless you
> disable early_fault.
>
> > ...might be important for benchmarks... 'here, see, Oracle is slow.
> > Feel free to verify the signature'.
> >
> > ...ok, I guess it is not too serious, because it is similar to
> > fragmentation....
>
> It is actually far better than fragmentation because the bitmap
> loader does IO always in big chunks -- not much seeking will go on.
> The only problem is some wasted mmeory and more IO bandwidth
> usage (but typically binaries are not bigger than a few MB so
> it's not too dramatic)
>
> So in summary I don't think it's an issue.

Agreed.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-03-26 18:16:08

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Tue, Mar 25, 2008 at 12:54 AM, Andi Kleen <[email protected]> wrote:
> There is still the additional seek.

You've been proposing and implementing a solution which needs an
additional seek. Don't use double standards.

2008-03-26 18:51:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Wed, Mar 26, 2008 at 11:15:49AM -0700, Ulrich Drepper wrote:
> On Tue, Mar 25, 2008 at 12:54 AM, Andi Kleen <[email protected]> wrote:
> > There is still the additional seek.
>
> You've been proposing and implementing a solution which needs an
> additional seek. Don't use double standards.

You're wrong. I am implementing a solution that allows two
methods -- one (SHDR) that needs an seek (but an continuous one
so it's likely served from the track buffer) but has some advantages
and another one (PHDR) that does not require a seek.

You two are arguing a method that always requires the seek
and has a couple of other drawbacks as earlier discussed too.

-Andi