2004-01-13 20:01:14

by Jeff Dike

Subject: [RFC] /dev/anon

UML has a need to free dirty pages in the middle of a file (which is described
in more detail below). The obvious way to do this, and one which has come up
before, is a sys_punch system call for making a hole in the middle of a file.
Since I need something like this now, and sys_punch is not immediately
forthcoming, I implemented a special device which implements the semantics
that UML needs.

/dev/anon acts just like anonymous memory, except for not being anonymous.
It implements the memory freeing semantics by keeping track of how many times
each page is mapped, and freeing pages when the map count goes to zero.
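
To make the bookkeeping concrete, here is a minimal user-space sketch of the
map-count idea. It is not the kernel code from the patch: the names
anon_model, model_map and model_unmap and the fixed NPAGES size are made up
for illustration, and "freeing" a page is modelled as just dropping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8 /* hypothetical size of the backing object, in pages */

/* Hypothetical user-space model of the per-page bookkeeping. */
struct anon_model {
	unsigned long map_count[NPAGES]; /* how many mappings cover each page */
	bool has_data[NPAGES];           /* whether the page still holds data */
};

/* Record one more mapping of pages [first, first + n). */
static int model_map(struct anon_model *m, size_t first, size_t n)
{
	if (first > NPAGES || n > NPAGES - first)
		return -1;
	for (size_t i = first; i < first + n; i++) {
		m->map_count[i]++;
		m->has_data[i] = true; /* pretend the mapper dirtied the page */
	}
	return 0;
}

/*
 * Drop one mapping of pages [first, first + n). A page whose count reaches
 * zero is "freed" (its data is discarded). Returns the number of pages
 * freed, or -1 if the range is out of bounds.
 */
static int model_unmap(struct anon_model *m, size_t first, size_t n)
{
	int freed = 0;

	if (first > NPAGES || n > NPAGES - first)
		return -1;
	for (size_t i = first; i < first + n; i++) {
		if (m->map_count[i] == 0)
			continue; /* was never mapped, nothing to do */
		if (--m->map_count[i] == 0) {
			m->has_data[i] = false;
			freed++;
		}
	}
	return freed;
}
```

With two overlapping mappings, unmapping the first only frees the pages that
the second does not also cover; the shared pages survive until the last
mapping goes away.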

I did it by mashing the map counting into tmpfs. I've attached the patch
(against 2.4.23) below. The main additions are:
	- a munmap file_operation, which is called from do_munmap
	- map counts in a structure that's parallel to the swap entries in tmpfs
	- a new misc device at minor 10

Comments on the implementation are welcome, as I'm not particularly proud of
it, but nothing else came to mind quickly.

If this should be maintained out-of-tree, should I get an official minor for
it anyway, or just unofficially use the first unused misc minor (10 in 2.4,
11 in 2.6)?

I'd also like opinions on whether this (or something like it) is sane enough to
go into mainline, or whether I should just keep it as an out-of-tree patch until
sys_punch shows up.

The immediate need for this would go away as soon as we got sys_punch.

On the other hand, a file backing anonymous memory seems superficially
attractive because it can get rid of the vma->vm_file == NULL special cases
scattered through the VM system. vm_file is expected to be the backing file
for the data, which doesn't work so well for in-memory filesystems. tmpfs
seems to know a lot about swapping pages out. It might be a good idea to
resurrect the old swapfs idea and turn tmpfs into that. Then I guess vm_pgoff
would become the equivalent of a swp_entry, which raises the question of what
goes into a pte of a swapped-out page. Current convention says it becomes
0, which seems a bit wasteful. It's also not clear what vm_pgoff for an
in-memory page would be. vmas like their data to be contiguous in the backing
file, which suggests that there would be a swapfs file per process, with
contiguous anonymous memory being contiguous in the swap file, if not on the
actual device.

Jeff


Rationale:

UML uses a mmapped tmp file for its physical memory. When it reads
from disk, it allocates a page of this memory for the page cache, then calls
the ubd block driver to fill it, which does a read() on the host from the
file backing the device. This results in two copies of the data in the host's
memory, one in the host page cache, and one in this tmp file for the UML page
cache. These duplicate copies can be eliminated by mmapping pages from the
device's file into UML physical memory. However, in order to get any memory
savings on the host, the tmp file page which was mapped over needs to be
freed, which no filesystem will do. Hence the need for sys_punch or /dev/anon.
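
The difference between the two paths can be sketched in user space. This is
only an illustration under assumptions, not UML's actual code: the function
names and the /tmp file locations are hypothetical, and two small temp files
stand in for the physical-memory file and the ubd device file.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an unlinked temp file of len bytes; fill it with `fill` if nonzero. */
static int make_tmp_file(size_t len, int fill)
{
	char path[] = "/tmp/devanon-sketch-XXXXXX"; /* hypothetical location */
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	if (fill) {
		char *buf = malloc(len);
		ssize_t n;

		if (buf == NULL) {
			close(fd);
			return -1;
		}
		memset(buf, fill, len);
		n = pwrite(fd, buf, len, 0);
		free(buf);
		if (n != (ssize_t)len) {
			close(fd);
			return -1;
		}
	}
	return fd;
}

/* Returns 0 if both paths make the "disk" data visible in "physical memory". */
static int demo_map_over(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	int mem_fd = make_tmp_file(4 * page, 0);  /* stands in for UML physmem */
	int disk_fd = make_tmp_file(page, 'D');   /* stands in for the ubd file */
	char *phys, *p;
	int ok;

	if (mem_fd < 0 || disk_fd < 0)
		return -1;
	phys = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE, MAP_SHARED,
		    mem_fd, 0);
	if (phys == MAP_FAILED)
		return -1;

	/* Copying path: read() the data into page 1, giving a second copy. */
	if (pread(disk_fd, phys + page, page, 0) != (ssize_t)page)
		return -1;

	/*
	 * Mapping path: map the disk file over page 2. No copy is made, but
	 * whatever the physmem file held at that offset is still allocated
	 * in the file, which is the page that needs freeing.
	 */
	p = mmap(phys + 2 * page, page, PROT_READ, MAP_SHARED | MAP_FIXED,
		 disk_fd, 0);
	if (p == MAP_FAILED)
		return -1;

	ok = phys[page] == 'D' && phys[2 * page] == 'D';
	munmap(phys, 4 * page);
	close(mem_fd);
	close(disk_fd);
	return ok ? 0 : 1;
}
```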

In the testing I've done, booting my Debian image takes about 25% less host
memory with ubd-mmap and /dev/anon than without (~21M vs ~28M), with a
corresponding 25% increase in the number of UMLs I can boot before the host
starts swapping (20 vs 16).
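
As a quick check of the arithmetic, taking roughly 28M without and 21M with
ubd-mmap and /dev/anon, and 16 versus 20 UMLs before swapping, both figures
work out to 25%. The helper name percent_change is just for illustration.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double percent_change(double before, double after)
{
	return (after - before) * 100.0 / before;
}
```

Here percent_change(28, 21) gives -25 (host memory per UML) and
percent_change(16, 20) gives +25 (UMLs booted before the host swaps).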

The patch:

diff -Naur host-2.4.23-skas3/drivers/char/mem.c host-2.4.23-skas3-devanon/drivers/char/mem.c
--- host-2.4.23-skas3/drivers/char/mem.c 2003-12-16 22:16:27.000000000 -0500
+++ host-2.4.23-skas3-devanon/drivers/char/mem.c 2004-01-09 04:09:26.000000000 -0500
@@ -664,6 +664,8 @@
write: write_full,
};

+extern struct file_operations anon_file_operations;
+
static int memory_open(struct inode * inode, struct file * filp)
{
switch (MINOR(inode->i_rdev)) {
@@ -693,6 +695,9 @@
case 9:
filp->f_op = &urandom_fops;
break;
+ case 10:
+ filp->f_op = &anon_file_operations;
+ break;
default:
return -ENXIO;
}
@@ -719,7 +724,8 @@
{5, "zero", S_IRUGO | S_IWUGO, &zero_fops},
{7, "full", S_IRUGO | S_IWUGO, &full_fops},
{8, "random", S_IRUGO | S_IWUSR, &random_fops},
- {9, "urandom", S_IRUGO | S_IWUSR, &urandom_fops}
+ {9, "urandom", S_IRUGO | S_IWUSR, &urandom_fops},
+ {10, "anon", S_IRUGO | S_IWUSR, &anon_file_operations},
};
int i;

diff -Naur host-2.4.23-skas3/include/linux/fs.h host-2.4.23-skas3-devanon/include/linux/fs.h
--- host-2.4.23-skas3/include/linux/fs.h 2003-12-16 22:16:36.000000000 -0500
+++ host-2.4.23-skas3-devanon/include/linux/fs.h 2004-01-10 00:59:17.000000000 -0500
@@ -864,6 +864,8 @@
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*munmap) (struct file *, struct vm_area_struct *,
+ unsigned long start, unsigned long len);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
diff -Naur host-2.4.23-skas3/include/linux/shmem_fs.h host-2.4.23-skas3-devanon/include/linux/shmem_fs.h
--- host-2.4.23-skas3/include/linux/shmem_fs.h 2003-09-02 15:44:03.000000000 -0400
+++ host-2.4.23-skas3-devanon/include/linux/shmem_fs.h 2004-01-09 04:09:26.000000000 -0500
@@ -22,6 +22,8 @@
unsigned long next_index;
swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
void **i_indirect; /* indirect blocks */
+ unsigned long map_direct[SHMEM_NR_DIRECT];
+ void **map_indirect;
unsigned long swapped; /* data pages assigned to swap */
unsigned long flags;
struct list_head list;
diff -Naur host-2.4.23-skas3/Makefile host-2.4.23-skas3-devanon/Makefile
--- host-2.4.23-skas3/Makefile 2003-12-16 22:16:23.000000000 -0500
+++ host-2.4.23-skas3-devanon/Makefile 2004-01-11 02:39:25.000000000 -0500
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 4
SUBLEVEL = 23
-EXTRAVERSION =
+EXTRAVERSION = -devanon

KERNELRELEASE=$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION)

diff -Naur host-2.4.23-skas3/mm/mmap.c host-2.4.23-skas3-devanon/mm/mmap.c
--- host-2.4.23-skas3/mm/mmap.c 2004-01-09 03:50:00.000000000 -0500
+++ host-2.4.23-skas3-devanon/mm/mmap.c 2004-01-09 04:09:26.000000000 -0500
@@ -995,6 +995,11 @@
remove_shared_vm_struct(mpnt);
mm->map_count--;

+ if((mpnt->vm_file != NULL) && (mpnt->vm_file->f_op != NULL) &&
+ (mpnt->vm_file->f_op->munmap != NULL))
+ mpnt->vm_file->f_op->munmap(mpnt->vm_file, mpnt, st,
+ size);
+
zap_page_range(mm, st, size);

/*
diff -Naur host-2.4.23-skas3/mm/shmem.c host-2.4.23-skas3-devanon/mm/shmem.c
--- host-2.4.23-skas3/mm/shmem.c 2003-12-16 22:16:36.000000000 -0500
+++ host-2.4.23-skas3-devanon/mm/shmem.c 2004-01-10 01:10:59.000000000 -0500
@@ -128,16 +128,17 @@
* +-> 48-51
* +-> 52-55
*/
-static swp_entry_t *shmem_swp_entry(struct shmem_inode_info *info, unsigned long index, unsigned long *page)
+static void *shmem_block(unsigned long index, unsigned long *page,
+ unsigned long *direct, void ***indirect)
{
unsigned long offset;
void **dir;

if (index < SHMEM_NR_DIRECT)
- return info->i_direct+index;
- if (!info->i_indirect) {
+ return direct+index;
+ if (!*indirect) {
if (page) {
- info->i_indirect = (void **) *page;
+ *indirect = (void **) *page;
*page = 0;
}
return NULL; /* need another page */
@@ -146,7 +147,7 @@
index -= SHMEM_NR_DIRECT;
offset = index % ENTRIES_PER_PAGE;
index /= ENTRIES_PER_PAGE;
- dir = info->i_indirect;
+ dir = *indirect;

if (index >= ENTRIES_PER_PAGE/2) {
index -= ENTRIES_PER_PAGE/2;
@@ -169,7 +170,21 @@
*dir = (void *) *page;
*page = 0;
}
- return (swp_entry_t *) *dir + offset;
+ return (unsigned long **) *dir + offset;
+}
+
+static swp_entry_t *shmem_swp_entry(struct shmem_inode_info *info, unsigned long index, unsigned long *page)
+{
+ return((swp_entry_t *) shmem_block(index, page,
+ (unsigned long *) info->i_direct,
+ &info->i_indirect));
+}
+
+static unsigned long *shmem_map_count(struct shmem_inode_info *info,
+ unsigned long index, unsigned long *page)
+{
+ return((unsigned long *) shmem_block(index, page, info->map_direct,
+ &info->map_indirect));
}

/*
@@ -838,6 +853,7 @@
ops = &shmem_vm_ops;
if (!S_ISREG(inode->i_mode))
return -EACCES;
+
UPDATE_ATIME(inode);
vma->vm_ops = ops;
return 0;
@@ -1723,4 +1739,131 @@
return 0;
}

+static int adjust_map_counts(struct shmem_inode_info *info,
+			     unsigned long offset, unsigned long len,
+			     int adjust)
+{
+	unsigned long idx, i, *count, page = 0;
+
+	spin_lock(&info->lock);
+	len >>= PAGE_SHIFT;
+	for(i = 0; i < len; i++){
+		idx = (i + offset) >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		while((count = shmem_map_count(info, idx, &page)) == NULL){
+			spin_unlock(&info->lock);
+			page = get_zeroed_page(GFP_KERNEL);
+			if(page == 0)
+				return(-ENOMEM);
+			spin_lock(&info->lock);
+		}
+
+		if(page != 0)
+			free_page(page);
+
+		*count += adjust;
+	}
+	spin_unlock(&info->lock);
+	return(0);
+}
+
EXPORT_SYMBOL(shmem_file_setup);
+
+struct file_operations anon_file_operations;
+
+static int anon_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct file *new;
+	struct inode *inode;
+	loff_t size = vma->vm_end - vma->vm_start;
+	int err;
+
+	if(file->private_data == NULL){
+		new = shmem_file_setup("dev/anon", size);
+		if(IS_ERR(new))
+			return(PTR_ERR(new));
+
+		new->f_op = &anon_file_operations;
+		file->private_data = new;
+	}
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	vma->vm_file = file->private_data;
+	get_file(vma->vm_file);
+
+	inode = vma->vm_file->f_dentry->d_inode;
+	err = adjust_map_counts(SHMEM_I(inode), vma->vm_pgoff, size, 1);
+	if(err)
+		return(err);
+
+	vma->vm_ops = &shmem_vm_ops;
+	return 0;
+}
+
+static void anon_munmap(struct file *file, struct vm_area_struct *vma,
+			unsigned long start, unsigned long len)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	pgd_t *pgd;
+	pmd_t *pmd;
+	pte_t *pte;
+	swp_entry_t *entry;
+	struct page *page;
+	unsigned long addr, idx, *count;
+
+	for(addr = start; addr < start + len; addr += PAGE_SIZE){
+		idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+
+		count = shmem_map_count(info, idx, NULL);
+		if(count == NULL)
+			continue;
+
+		(*count)--;
+		if(*count > 0)
+			continue;
+
+		pgd = pgd_offset(vma->vm_mm, addr);
+		if(pgd_none(*pgd))
+			continue;
+
+		pmd = pmd_offset(pgd, addr);
+		if(pmd_none(*pmd))
+			continue;
+
+		pte = pte_offset(pmd, addr);
+		if(!pte_present(*pte))
+			continue;
+
+		*pte = pte_mkclean(*pte);
+
+		page = pte_page(*pte);
+
+		LockPage(page);
+		lru_cache_del(page);
+		ClearPageDirty(page);
+		ClearPageUptodate(page);
+		remove_inode_page(page);
+		UnlockPage(page);
+
+		entry = shmem_swp_entry(info, idx, 0);
+		if(entry != NULL)
+			shmem_free_swp(entry, 1);
+
+		page_cache_release(page);
+	}
+}
+
+int anon_release(struct inode *inode, struct file *file)
+{
+	if(file->private_data != NULL)
+		fput(file->private_data);
+	return(0);
+}
+
+struct file_operations anon_file_operations = {
+	.mmap = anon_mmap,
+	.munmap = anon_munmap,
+	.release = anon_release,
+};



2004-01-13 20:38:15

by Davide Libenzi

Subject: Re: [RFC] /dev/anon

On Tue, 13 Jan 2004, Jeff Dike wrote:

> UML has a need to free dirty pages in the middle of a file (which is described
> in more detail below). The obvious way to do this, and one which has come up
> before, is a sys_punch system call for making a hole in the middle of a file.

Now I'm going to say something really stupid, but why won't
sys_madvise(MADV_DONTNEED) work for this?



- Davide


2004-01-14 01:21:01

by Jeff Dike

Subject: Re: [RFC] /dev/anon

On Tue, Jan 13, 2004 at 12:38:05PM -0800, Davide Libenzi wrote:
> Now I'm going to say something really stupid, but why won't
> sys_madvise(MADV_DONTNEED) work for this?
>

MADV_DONTNEED is fine for anonymous memory, but it can't make a filesystem
throw out data, which is what I need. If it did, then people wouldn't be
agitating for sys_punch.

Jeff

2004-01-14 04:46:43

by Davide Libenzi

Subject: Re: [RFC] /dev/anon

On Tue, 13 Jan 2004, Jeff Dike wrote:

> On Tue, Jan 13, 2004 at 12:38:05PM -0800, Davide Libenzi wrote:
> > Now I'm going to say something really stupid, but why won't
> > sys_madvise(MADV_DONTNEED) work for this?
> >
>
> MADV_DONTNEED is fine for anonymous memory, but it can't make a filesystem
> throw out data, which is what I need. If it did, then people wouldn't be
> agitating for sys_punch.

What do you mean by throwing out data? If you mean writing DONTNEED'ed
dirty pages to the backing file and releasing them to the page cache, it
does. If you mean stop handling page faults inside the DONTNEED'ed region,
it does not. If you mean zero-filling (ala ftruncate()) the DONTNEED'ed
region, it obviously does not. I thought your goal was to release memory
to the host, that's why I proposed sys_madvise(MADV_DONTNEED).



- Davide




2004-01-14 05:15:52

by Nuno Silva

Subject: Re: [RFC] /dev/anon



Davide Libenzi wrote:

[...]

>
> What do you mean by throwing out data? If you mean writing DONTNEED'ed

You can find a detailed text, by Jeff Dike, here:
http://user-mode-linux.sourceforge.net/devanon.html

Regards,
Nuno Silva

2004-01-14 14:20:19

by Jeff Dike

Subject: Re: [RFC] /dev/anon

On Tue, Jan 13, 2004 at 08:46:23PM -0800, Davide Libenzi wrote:
> What do you mean by throwing out data? If you mean writing DONTNEED'ed
> dirty pages to the backing file and releasing them to the page cache, it
> does.

Writing dirty pages to backing store isn't throwing them out.

> If you mean stop handling page faults inside the DONTNEED'ed region,
> it does not.

This is kind of moot since the region won't be mapped anywhere, so there
can't be page faults on it.

> If you mean zero-filling (ala ftruncate()) the DONTNEED'ed
> region, it obviously does not.

Yes, I mean this, as a side-effect of dropping the region as though it were
clean.

> I thought your goal was to release memory
> to the host, that's why I proposed sys_madvise(MADV_DONTNEED).

It is, I want memory released immediately as though it were clean, and
MADV_DONTNEED doesn't help.

Jeff

2004-01-14 18:23:24

by Davide Libenzi

Subject: Re: [RFC] /dev/anon

On Wed, 14 Jan 2004, Jeff Dike wrote:

> > I thought your goal was to release memory
> > to the host, that's why I proposed sys_madvise(MADV_DONTNEED).
>
> It is, I want memory released immediately as though it were clean, and
> MADV_DONTNEED doesn't help.

Strange, I didn't notice this before. If you look at the comment in
mm/madvise.c:madvise_dontneed, it advertises that dirty pages are actually
thrown away (that would be what you're actually looking for). But if you
go down to zap_page_range -> unmap_vmas -> unmap_page_range ->
zap_pmd_range -> zap_pte_range, if the page is dirty, set_page_dirty ->
__set_page_dirty_buffers pushes the page onto the mapping's dirty pages list
and __mark_inode_dirty pushes the inode onto the superblock dirty list. So
the comment seems to be wrong (I also verified this with a simple program,
and pages are actually flushed).
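
A test along these lines might look like the following sketch. It is not the
program mentioned above, just one way to observe the behaviour from user
space: the function names and the /tmp path are hypothetical. It dirties a
page, calls madvise(MADV_DONTNEED) on it, and reports what a later read of
the page sees, once for a MAP_SHARED file mapping and once for private
anonymous memory.

```c
#define _GNU_SOURCE /* for madvise() and MAP_ANONYMOUS */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one page of a MAP_SHARED file mapping, apply MADV_DONTNEED, and
 * return the first byte read back afterwards (-1 on setup failure).
 */
static int byte_after_dontneed_file(void)
{
	char path[] = "/tmp/dontneed-sketch-XXXXXX"; /* hypothetical location */
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);
	char *p;
	int seen;

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)page) < 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memset(p, 'x', page);
	if (madvise(p, page, MADV_DONTNEED) != 0)
		seen = -1;
	else
		seen = (unsigned char)p[0]; /* re-faults the page from the file */
	munmap(p, page);
	close(fd);
	return seen;
}

/* Same experiment on private anonymous memory. */
static int byte_after_dontneed_anon(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int seen;

	if (p == MAP_FAILED)
		return -1;
	memset(p, 'x', page);
	if (madvise(p, page, MADV_DONTNEED) != 0)
		seen = -1;
	else
		seen = (unsigned char)p[0]; /* faults in a fresh zero page */
	munmap(p, page);
	return seen;
}
```

On Linux the file-backed case still reads back 'x', since the dirty data
stays with the file, while the anonymous case reads back 0.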



- Davide






2004-01-14 22:33:56

by Andrew Morton

Subject: Re: [RFC] /dev/anon

Davide Libenzi wrote:
>
> On Wed, 14 Jan 2004, Jeff Dike wrote:
>
> > > I thought your goal was to release memory
> > > to the host, that's why I proposed sys_madvise(MADV_DONTNEED).
> >
> > It is, I want memory released immediately as though it were clean, and
> > MADV_DONTNEED doesn't help.
>
> Strange, I didn't notice this before. If you look at the comment in
> mm/madvise.c:madvise_dontneed, it advertises that dirty pages are actually
> thrown away (that would be what you're actually looking for). But if you
> go down to zap_page_range -> unmap_vmas -> unmap_page_range ->
> zap_pmd_range -> zap_pte_range, if the page is dirty, set_page_dirty ->
> __set_page_dirty_buffers pushes the page onto the mapping's dirty pages list
> and __mark_inode_dirty pushes the inode onto the superblock dirty list. So
> the comment seems to be wrong (I also verified this with a simple program,
> and pages are actually flushed).
>

We cannot invalidate the pages due to MADV_DONTNEED: there may be
freshly-allocated, unwritten file blocks associated with them. You'd have
to be playing games with write() and MAP_SHARED to do this.

2004-01-14 22:45:18

by Davide Libenzi

Subject: Re: [RFC] /dev/anon

On Wed, 14 Jan 2004, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > On Wed, 14 Jan 2004, Jeff Dike wrote:
> >
> > > > I thought your goal was to release memory
> > > > to the host, that's why I proposed sys_madvise(MADV_DONTNEED).
> > >
> > > It is, I want memory released immediately as though it were clean, and
> > > MADV_DONTNEED doesn't help.
> >
> > Strange, I didn't notice this before. If you look at the comment in
> > mm/madvise.c:madvise_dontneed, it advertises that dirty pages are actually
> > thrown away (that would be what you're actually looking for). But if you
> > go down to zap_page_range -> unmap_vmas -> unmap_page_range ->
> > zap_pmd_range -> zap_pte_range, if the page is dirty, set_page_dirty ->
> > __set_page_dirty_buffers pushes the page onto the mapping's dirty pages list
> > and __mark_inode_dirty pushes the inode onto the superblock dirty list. So
> > the comment seems to be wrong (I also verified this with a simple program,
> > and pages are actually flushed).
> >
>
> We cannot invalidate the pages due to MADV_DONTNEED: there may be
> freshly-allocated, unwritten file blocks associated with them. You'd have
> to be playing games with write() and MAP_SHARED to do this.

That's fine Andrew. It was the comment that looked strange to me. We
actually do sync dirty pages, while the comment says we throw them away.



- Davide


2004-01-16 12:16:17

by Geert Uytterhoeven

Subject: Re: [RFC] /dev/anon

On Tue, 13 Jan 2004, Jeff Dike wrote:
> If this should be maintained out-of-tree, should I get an official minor for
> it anyway, or just unofficially use the first unused misc minor (10 in 2.4,
> 11 in 2.6)?

Apparently there's a hole in the list in 2.6.[01] (/dev/kmsg has 11), so you
can use 10 for both 2.4 and 2.6.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2004-01-16 16:02:29

by Andries Brouwer

Subject: Re: [RFC] /dev/anon

On Fri, Jan 16, 2004 at 01:16:02PM +0100, Geert Uytterhoeven wrote:
> On Tue, 13 Jan 2004, Jeff Dike wrote:
> > If this should be maintained out-of-tree, should I get an official minor for
> > it anyway, or just unofficially use the first unused misc minor (10 in 2.4,
> > 11 in 2.6)?
>
> Apparently there's a hole in the list in 2.6.[01] (/dev/kmsg has 11), so you
> can use 10 for both 2.4 and 2.6.

Yes. 6 was /dev/core, added 0.98p3, removed 0.99p13X.
10 was reserved for /dev/aio, but when aio was
implemented it was done differently. So 10 has never been used.