2008-11-11 13:22:30

by Izik Eidus

[permalink] [raw]
Subject: [PATCH 0/4] ksm - dynamic page sharing driver for linux

KSM is a linux driver that allows dynamicly sharing identical memory pages
between one or more processes.

unlike tradtional page sharing that is made at the allocation of the
memory, ksm do it dynamicly after the memory was created.
Memory is periodically scanned; identical pages are identified and merged.
the sharing is unnoticeable by the process that use this memory.
(the shared pages are marked as readonly, and in case of write
do_wp_page() take care to create new copy of the page)

this driver is very useful for KVM as in cases of runing multiple guests
operation system of the same type, many pages are sharable.
this driver can be useful by OpenVZ as well.

KSM right now scan just memory that was registered to used by it, it
does not
scan the whole system memory (this can be changed, but the changes to
find
identical pages in normal linux system that doesnt run multiple guests)

KSM can run as kernel thread or as userspace application (or both (it is
allowed to run more than one scanner in a time)).

example for how to control the kernel thread:


ksmctl.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include "ksm.h"

int main(int argc, char *argv[])
{
int fd;
int used = 0;
int fd_start;
struct ksm_kthread_info info;


if (argc < 2) {
fprintf(stderr, "usage: %s {start npages sleep | stop |
info}\n", argv[0]);
exit(1);
}

fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
if (fd == -1) {
fprintf(stderr, "could not open /dev/ksm\n");
exit(1);
}

if (!strncmp(argv[1], "start", strlen(argv[1]))) {
used = 1;
if (argc < 5) {
fprintf(stderr, "usage: %s start npages_to_scan",
argv[0]);
fprintf(stderr, "npages_max_merge sleep\n");
exit(1);
}
info.pages_to_scan = atoi(argv[2]);
info.max_pages_to_merge = atoi(argv[3]);
info.sleep = atoi(argv[4]);
info.running = 1;

fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
if (fd_start == -1) {
fprintf(stderr, "KSM_START_KTHREAD failed\n");
exit(1);
}
printf("created scanner\n");
}

if (!strncmp(argv[1], "stop", strlen(argv[1]))) {
used = 1;
info.running = 0;
fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
if (fd_start == -1) {
fprintf(stderr, "KSM_START_STOP_KTHREAD failed\n");
exit(1);
}
printf("stopped scanner\n");
}

if (!strncmp(argv[1], "info", strlen(argv[1]))) {
used = 1;
fd_start = ioctl(fd, KSM_GET_INFO_KTHREAD, &info);
if (fd_start == -1) {
fprintf(stderr, "KSM_GET_INFO_KTHREAD failed\n");
exit(1);
}
printf("running %d, pages_to_scan %d pages_max_merge %d",
info.running, info.pages_to_scan,
info.max_pages_to_merge);
printf("sleep_time %d\n", info.sleep);
}

if (!used)
fprintf(stderr, "unknown command %s\n", argv[1]);

return 0;
}


example of how to register qemu to ksm (or any userspace application)

diff --git a/qemu/vl.c b/qemu/vl.c
index 4721fdd..7785bf9 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -21,6 +21,7 @@
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN
* THE SOFTWARE.
*/
+#include "ksm.h"
#include "hw/hw.h"
#include "hw/boards.h"
#include "hw/usb.h"
@@ -5799,6 +5800,37 @@ static void termsig_setup(void)

#endif

+int ksm_register_memory(void)
+{
+ int fd;
+ int ksm_fd;
+ int r = 1;
+ struct ksm_memory_region ksm_region;
+
+ fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
+ if (fd == -1)
+ goto out;
+
+ ksm_fd = ioctl(fd, KSM_CREATE_SHARED_MEMORY_AREA);
+ if (ksm_fd == -1)
+ goto out_free;
+
+ ksm_region.npages = phys_ram_size / TARGET_PAGE_SIZE;
+ ksm_region.addr = phys_ram_base;
+ r = ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &ksm_region);
+ if (r)
+ goto out_free1;
+
+ return r;
+
+out_free1:
+ close(ksm_fd);
+out_free:
+ close(fd);
+out:
+ return r;
+}
+
int main(int argc, char **argv)
{
#ifdef CONFIG_GDBSTUB
@@ -6735,6 +6767,8 @@ int main(int argc, char **argv)
/* init the dynamic translator */
cpu_exec_init_all(tb_size * 1024 * 1024);

+ ksm_register_memory();
+
bdrv_init();

/* we always create the cdrom drive, even if no disk is there */


2008-11-11 13:22:45

by Izik Eidus

[permalink] [raw]
Subject: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

From: Izik Eidus <[email protected]>

this function is needed in cases you want to change the userspace
virtual mapping into diffrent physical page,
KSM need this for merging the identical pages.

this function is working by removing the oldpage from the rmap and
calling put_page on it, and by setting the virtual address pte
to point into the new page.
(note that the new page (the page that we change the pte to map to)
cannot be anonymous page)

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/mm.h | 3 ++
mm/memory.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffee2f7..4da7fa8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);

+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+ struct page *newpage, pte_t orig_pte, pgprot_t prot);
+
struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
#define FOLL_WRITE 0x01 /* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 164951c..b2c542c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
}
EXPORT_SYMBOL(vm_insert_mixed);

+/**
+ * replace _page - replace the pte mapping related to vm area between two pages
+ * (from oldpage to newpage)
+ * NOTE: you should take into consideration the impact on the VM when replacing
+ * anonymous pages with kernel non swappable pages.
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+ struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int ret;
+
+ BUG_ON(PageAnon(newpage));
+
+ ret = -EFAULT;
+ addr = page_address_in_vma(oldpage, vma);
+ if (addr == -EFAULT)
+ goto out;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out;
+
+ if (!pte_same(*ptep, orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out;
+ }
+
+ ret = 0;
+ get_page(newpage);
+ page_add_file_rmap(newpage);
+
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ ptep_clear_flush(vma, addr, ptep);
+ set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
+
+ page_remove_rmap(oldpage, vma);
+ if (PageAnon(oldpage)) {
+ dec_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, file_rss);
+ }
+ put_page(oldpage);
+
+ pte_unmap_unlock(ptep, ptl);
+
+out:
+ return ret;
+}
+EXPORT_SYMBOL(replace_page);
+
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
--
1.6.0.3

2008-11-11 13:22:59

by Izik Eidus

[permalink] [raw]
Subject: [PATCH 3/4] add ksm kernel shared memory driver

From: Izik Eidus <[email protected]>

ksm is driver that allow merging identical pages between one or more
applications in way unvisible to the application that use it.
pages that are merged are marked as readonly and are COWed when any application
try to change them.

ksm is working by walking over the memory pages of the applications it scan
in order to find identical pages.
it uses an hash table to find in effective way the identical pages.

when ksm find two identical pages, it marked them as readonly and merge them
into single one page,
after the pages are marked as readonly and merged into one page, linux
will treat this pages as normal copy_on_write pages and will fork them
when write access will happen to them.

ksm scan just memory areas that were registred to be scanned by it.

Signed-off-by: Izik Eidus <[email protected]>
---
drivers/Kconfig | 5 +
include/linux/ksm.h | 53 ++
include/linux/miscdevice.h | 1 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/ksm.c | 1202 ++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 1265 insertions(+), 0 deletions(-)
create mode 100644 include/linux/ksm.h
create mode 100644 mm/ksm.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d38f43f..c1c701f 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -105,4 +105,9 @@ source "drivers/uio/Kconfig"
source "drivers/xen/Kconfig"

source "drivers/staging/Kconfig"
+
+config KSM
+ bool "KSM driver support"
+ help
+ ksm is driver for merging identical pages between applciations
endmenu
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 0000000..f873502
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,53 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <asm/types.h>
+#include <linux/ioctl.h>
+
+#define KSM_API_VERSION 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+ __u32 npages; /* number of pages to share */
+ __u32 pad;
+ __u64 addr; /* the begining of the virtual address */
+};
+
+struct ksm_user_scan {
+ __u32 pages_to_scan;
+ __u32 max_pages_to_merge;
+};
+
+struct ksm_kthread_info {
+ __u32 sleep; /* number of microsecoends to sleep */
+ __u32 pages_to_scan; /* number of pages to scan */
+ __u32 max_pages_to_merge;
+ __u32 running;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
+#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
+#define KSM_CREATE_SCAN _IO(KSMIO, 0x02) /* return SCAN fd */
+#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x03,\
+ struct ksm_kthread_info)
+#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x04,\
+ struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
+ struct ksm_memory_region)
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21)
+
+/* ioctls for SCAN fds */
+#define KSM_SCAN _IOW(KSMIO, 0x40,\
+ struct ksm_user_scan)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 26433ec..adc2435 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
#define TUN_MINOR 200
#define HPET_MINOR 228
#define KVM_MINOR 232
+#define KSM_MINOR 233

struct device;

diff --git a/mm/Kconfig b/mm/Kconfig
index 5b5790f..e7f0061 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -222,3 +222,6 @@ config UNEVICTABLE_LRU

config MMU_NOTIFIER
bool
+
+config KSM
+ bool
diff --git a/mm/Makefile b/mm/Makefile
index c06b45a..9722afe 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 0000000..977eb37
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1202 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 2008 Red Hat, Inc.
+ * Authors:
+ * Izik Eidus
+ * Andrea Arcangeli
+ * Chris Wright
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/vmalloc.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/spinlock.h>
+#include <linux/jhash.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>
+#include <linux/anon_inodes.h>
+#include <linux/ksm.h>
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
+#include <linux/random.h>
+#include <crypto/sha.h>
+
+#include <asm/tlbflush.h>
+
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+static int page_hash_size;
+module_param(page_hash_size, int, 0);
+MODULE_PARM_DESC(page_hash_size, "Hash table size for the pages checksum");
+
+static int rmap_hash_size;
+module_param(rmap_hash_size, int, 0);
+MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
+
+static int sha1_hash_size;
+module_param(sha1_hash_size, int, 0);
+MODULE_PARM_DESC(sha1_hash_size, "Hash table size for the sha1 caching");
+
+struct ksm_mem_slot {
+ struct list_head link;
+ struct list_head sma_link;
+ struct mm_struct *mm;
+ unsigned long addr; /* the begining of the virtual address */
+ int npages; /* number of pages to share */
+};
+
+/*
+ * sma - shared memory area, each process have its own sma that contain the
+ * information about the slots that it own
+ */
+struct ksm_sma {
+ struct list_head sma_slots;
+};
+
+struct ksm_scan {
+ struct ksm_mem_slot *slot_index; /* the slot we are scanning now */
+ int page_index; /* the page inside sma that is now being scanned */
+};
+
+struct page_hash_item {
+ struct hlist_node link;
+ struct mm_struct *mm;
+ unsigned long addr;
+};
+
+struct rmap_item {
+ struct hlist_node link;
+ struct page_hash_item *page_hash_item;
+ unsigned long oldindex;
+};
+
+struct sha1_item {
+ unsigned char sha1val[SHA1_DIGEST_SIZE];
+ unsigned long pfn;
+};
+
+static struct list_head slots;
+static struct rw_semaphore slots_lock;
+
+static DEFINE_MUTEX(sha1_lock);
+
+static int npages_hash;
+static struct hlist_head *page_hash_items;
+static int nrmaps_hash;
+static struct hlist_head *rmap_hash;
+static int nsha1s_hash;
+static struct sha1_item *sha1_hash;
+
+static struct kmem_cache *page_hash_item_cache;
+static struct kmem_cache *rmap_item_cache;
+
+static int kthread_sleep;
+static int kthread_pages_to_scan;
+static int kthread_max_npages;
+static struct ksm_scan kthread_ksm_scan;
+static int kthread_run;
+static struct task_struct *kthread;
+static wait_queue_head_t kthread_wait;
+static struct rw_semaphore kthread_lock;
+static struct crypto_hash *tfm;
+static unsigned char hmac_key[SHA1_DIGEST_SIZE];
+static DEFINE_MUTEX(tfm_mutex);
+
+static spinlock_t hash_lock;
+
+static int ksm_tfm_init(void)
+{
+ struct crypto_hash *hash;
+ int rc = 0;
+
+ mutex_lock(&tfm_mutex);
+ if (tfm)
+ goto out;
+
+ /* Must be called from user context before starting any scanning */
+ hash = crypto_alloc_hash("hmac(sha1)", 0, CRYPTO_ALG_ASYNC);
+ if (IS_ERR(hash)) {
+ rc = PTR_ERR(hash);
+ goto out;
+ }
+
+ get_random_bytes(hmac_key, sizeof(hmac_key));
+
+ rc = crypto_hash_setkey(hash, hmac_key, SHA1_DIGEST_SIZE);
+ if (rc) {
+ crypto_free_hash(hash);
+ goto out;
+ }
+ tfm = hash;
+out:
+ mutex_unlock(&tfm_mutex);
+ return rc;
+}
+
+static int ksm_slab_init(void)
+{
+ int ret = 1;
+
+ page_hash_item_cache = kmem_cache_create("ksm_page_hash_item",
+ sizeof(struct page_hash_item), 0, 0,
+ NULL);
+ if (!page_hash_item_cache)
+ goto out;
+
+ rmap_item_cache = kmem_cache_create("ksm_rmap_item",
+ sizeof(struct rmap_item), 0, 0,
+ NULL);
+ if (!rmap_item_cache)
+ goto out_free;
+ return 0;
+
+out_free:
+ kmem_cache_destroy(page_hash_item_cache);
+out:
+ return ret;
+}
+
+static void ksm_slab_free(void)
+{
+ kmem_cache_destroy(rmap_item_cache);
+ kmem_cache_destroy(page_hash_item_cache);
+}
+
+static struct page_hash_item *alloc_page_hash_item(void)
+{
+ void *obj;
+
+ obj = kmem_cache_zalloc(page_hash_item_cache, GFP_KERNEL);
+ return (struct page_hash_item *)obj;
+}
+
+static void free_page_hash_item(struct page_hash_item *page_hash_item)
+{
+ kmem_cache_free(page_hash_item_cache, page_hash_item);
+}
+
+static struct rmap_item *alloc_rmap_item(void)
+{
+ void *obj;
+
+ obj = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
+ return (struct rmap_item *)obj;
+}
+
+static void free_rmap_item(struct rmap_item *rmap_item)
+{
+ kmem_cache_free(rmap_item_cache, rmap_item);
+}
+
+static inline int PageKsm(struct page *page)
+{
+ return !PageAnon(page);
+}
+
+static int page_hash_init(void)
+{
+ if (!page_hash_size) {
+ struct sysinfo sinfo;
+
+ si_meminfo(&sinfo);
+ page_hash_size = sinfo.totalram;
+ page_hash_size /= 40;
+ }
+ npages_hash = page_hash_size;
+ page_hash_items = vmalloc(npages_hash * sizeof(struct page_hash_item));
+ if (!page_hash_items)
+ return 1;
+
+ memset(page_hash_items, 0, npages_hash * sizeof(struct page_hash_item));
+ return 0;
+}
+
+static void page_hash_free(void)
+{
+ int i;
+ struct hlist_head *bucket;
+ struct hlist_node *node, *n;
+ struct page_hash_item *page_hash_item;
+
+ for (i = 0; i < npages_hash; ++i) {
+ bucket = &page_hash_items[i];
+ hlist_for_each_entry_safe(page_hash_item, node, n, bucket, link) {
+ hlist_del(&page_hash_item->link);
+ free_page_hash_item(page_hash_item);
+ }
+ }
+ vfree(page_hash_items);
+}
+
+static int rmap_hash_init(void)
+{
+ if (!rmap_hash_size) {
+ struct sysinfo sinfo;
+
+ si_meminfo(&sinfo);
+ rmap_hash_size = sinfo.totalram;
+ rmap_hash_size /= 40;
+ }
+ nrmaps_hash = rmap_hash_size;
+ rmap_hash = vmalloc(nrmaps_hash *
+ sizeof(struct hlist_head));
+ if (!rmap_hash)
+ return 1;
+ memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
+ return 0;
+}
+
+static void rmap_hash_free(void)
+{
+ int i;
+ struct hlist_head *bucket;
+ struct hlist_node *node, *n;
+ struct rmap_item *rmap_item;
+
+ for (i = 0; i < nrmaps_hash; ++i) {
+ bucket = &rmap_hash[i];
+ hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
+ hlist_del(&rmap_item->link);
+ free_rmap_item(rmap_item);
+ }
+ }
+ vfree(rmap_hash);
+}
+
+static int sha1_hash_init(void)
+{
+ if (!sha1_hash_size) {
+ struct sysinfo sinfo;
+
+ si_meminfo(&sinfo);
+ sha1_hash_size = sinfo.totalram;
+ sha1_hash_size /= 128;
+ }
+ nsha1s_hash = sha1_hash_size;
+ sha1_hash = vmalloc(nsha1s_hash *
+ sizeof(struct sha1_item));
+ if (!sha1_hash)
+ return 1;
+ memset(sha1_hash, 0, nsha1s_hash * sizeof(struct sha1_item));
+ return 0;
+}
+
+static void sha1_hash_free(void)
+{
+ vfree(sha1_hash);
+}
+
+static inline u32 calc_hash_index(struct page *page)
+{
+ u32 hash;
+ void *addr = kmap_atomic(page, KM_USER0);
+ hash = jhash(addr, PAGE_SIZE, 17);
+ kunmap_atomic(addr, KM_USER0);
+ return hash % npages_hash;
+}
+
+static void remove_page_from_hash(struct mm_struct *mm, unsigned long addr)
+{
+ struct rmap_item *rmap_item;
+ struct hlist_head *bucket;
+ struct hlist_node *node, *n;
+
+ bucket = &rmap_hash[addr % nrmaps_hash];
+ hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
+ if (mm == rmap_item->page_hash_item->mm &&
+ rmap_item->page_hash_item->addr == addr) {
+ hlist_del(&rmap_item->page_hash_item->link);
+ free_page_hash_item(rmap_item->page_hash_item);
+ hlist_del(&rmap_item->link);
+ free_rmap_item(rmap_item);
+ return;
+ }
+ }
+}
+
+static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
+ struct ksm_memory_region *mem)
+{
+ struct ksm_mem_slot *slot;
+ int ret = -1;
+
+ if (!current->mm)
+ goto out;
+
+ slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
+ if (!slot)
+ goto out;
+
+ slot->mm = get_task_mm(current);
+ slot->addr = mem->addr;
+ slot->npages = mem->npages;
+
+ down_write(&slots_lock);
+
+ list_add_tail(&slot->link, &slots);
+ list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
+
+ up_write(&slots_lock);
+ ret = 0;
+out:
+ return ret;
+}
+
+static void remove_mm_from_hash(struct mm_struct *mm)
+{
+ struct ksm_mem_slot *slot;
+ int pages_count = 0;
+
+ list_for_each_entry(slot, &slots, link)
+ if (slot->mm == mm)
+ break;
+ if (!slot)
+ BUG();
+
+ spin_lock(&hash_lock);
+ while (pages_count < slot->npages) {
+ remove_page_from_hash(mm, slot->addr + pages_count * PAGE_SIZE);
+ pages_count++;
+ }
+ spin_unlock(&hash_lock);
+ list_del(&slot->link);
+}
+
+static int ksm_sma_ioctl_remove_memory_region(struct ksm_sma *ksm_sma)
+{
+ struct ksm_mem_slot *slot, *node;
+
+ down_write(&slots_lock);
+ list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) {
+ remove_mm_from_hash(slot->mm);
+ mmput(slot->mm);
+ list_del(&slot->sma_link);
+ kfree(slot);
+ }
+ up_write(&slots_lock);
+ return 0;
+}
+
+static int ksm_sma_release(struct inode *inode, struct file *filp)
+{
+ struct ksm_sma *ksm_sma = filp->private_data;
+ int r;
+
+ r = ksm_sma_ioctl_remove_memory_region(ksm_sma);
+ kfree(ksm_sma);
+ return r;
+}
+
+static long ksm_sma_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ struct ksm_sma *sma = filp->private_data;
+ void __user *argp = (void __user *)arg;
+ int r = EINVAL;
+
+ switch (ioctl) {
+ case KSM_REGISTER_MEMORY_REGION: {
+ struct ksm_memory_region ksm_memory_region;
+
+ r = -EFAULT;
+ if (copy_from_user(&ksm_memory_region, argp,
+ sizeof ksm_memory_region))
+ goto out;
+ r = ksm_sma_ioctl_register_memory_region(sma,
+ &ksm_memory_region);
+ break;
+ }
+ case KSM_REMOVE_MEMORY_REGION:
+ r = ksm_sma_ioctl_remove_memory_region(sma);
+ break;
+ }
+
+out:
+ return r;
+}
+
+static int insert_page_to_hash(struct ksm_scan *ksm_scan,
+ unsigned long hash_index,
+ struct page_hash_item *page_hash_item,
+ struct rmap_item *rmap_item)
+{
+ struct ksm_mem_slot *slot;
+ struct hlist_head *bucket;
+
+ slot = ksm_scan->slot_index;
+ page_hash_item->addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
+ page_hash_item->mm = slot->mm;
+ bucket = &page_hash_items[hash_index];
+ hlist_add_head(&page_hash_item->link, bucket);
+
+ rmap_item->page_hash_item = page_hash_item;
+ rmap_item->oldindex = hash_index;
+ bucket = &rmap_hash[page_hash_item->addr % nrmaps_hash];
+ hlist_add_head(&rmap_item->link, bucket);
+ return 0;
+}
+
+static void update_hash(struct ksm_scan *ksm_scan,
+ unsigned long hash_index)
+{
+ struct rmap_item *rmap_item;
+ struct ksm_mem_slot *slot;
+ struct hlist_head *bucket;
+ struct hlist_node *node, *n;
+ unsigned long addr;
+
+ slot = ksm_scan->slot_index;;
+ addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
+ bucket = &rmap_hash[addr % nrmaps_hash];
+ hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
+ if (slot->mm == rmap_item->page_hash_item->mm &&
+ rmap_item->page_hash_item->addr == addr) {
+ if (hash_index != rmap_item->oldindex) {
+ hlist_del(&rmap_item->page_hash_item->link);
+ free_page_hash_item(rmap_item->page_hash_item);
+ hlist_del(&rmap_item->link);
+ free_rmap_item(rmap_item);
+ }
+ return;
+ }
+ }
+}
+
+static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page)
+{
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ unsigned long addr;
+
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ if (unlikely(addr < vma->vm_start || addr >= vma->vm_end))
+ return -EFAULT;
+ return addr;
+}
+
+static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep = NULL;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map(pmd, addr);
+out:
+ return ptep;
+}
+
+static int is_present_pte(struct mm_struct *mm, unsigned long addr)
+{
+ pte_t *ptep;
+
+ ptep = get_pte(mm, addr);
+ if (!ptep)
+ return 0;
+
+ if (pte_present(*ptep))
+ return 1;
+ return 0;
+}
+
+#define PAGECMP_OFFSET 128
+#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
+/* hash the page */
+static void page_hash(struct page *page, unsigned char *digest)
+{
+ struct scatterlist sg;
+ struct hash_desc desc;
+
+ sg_init_table(&sg, 1);
+ sg_set_page(&sg, page, PAGEHASH_SIZE, 0);
+ desc.tfm = tfm;
+ desc.flags = 0;
+ crypto_hash_digest(&desc, &sg, PAGEHASH_SIZE, digest);
+}
+
+/* pages_identical
+ * calculate sha1 hash of each page, compare results,
+ * and return 1 if identical, 0 otherwise.
+ */
+static int pages_identical(struct page *oldpage, struct page *newpage, int new)
+{
+ int r;
+ unsigned char old_digest[SHA1_DIGEST_SIZE];
+ struct sha1_item *sha1_item;
+
+ page_hash(oldpage, old_digest);
+ /*
+ * If new = 1, it is never safe to use the sha1 value that is
+ * inside the cache, the reason is that the page can be released
+ * and then recreated and have diffrent sha1 value.
+ * (swapping as for now is not an issue, beacuse KsmPages cannot be
+ * swapped)
+ */
+ if (new) {
+ mutex_lock(&sha1_lock);
+ sha1_item = &sha1_hash[page_to_pfn(newpage) % nsha1s_hash];
+ page_hash(newpage, sha1_item->sha1val);
+ sha1_item->pfn = page_to_pfn(newpage);
+ r = !memcmp(old_digest, sha1_item->sha1val, SHA1_DIGEST_SIZE);
+ mutex_unlock(&sha1_lock);
+ } else {
+ mutex_lock(&sha1_lock);
+ sha1_item = &sha1_hash[page_to_pfn(newpage) % nsha1s_hash];
+ if (sha1_item->pfn != page_to_pfn(newpage)) {
+ page_hash(newpage, sha1_item->sha1val);
+ sha1_item->pfn = page_to_pfn(newpage);
+ }
+ r = !memcmp(old_digest, sha1_item->sha1val, SHA1_DIGEST_SIZE);
+ mutex_unlock(&sha1_lock);
+ }
+ if (PAGECMP_OFFSET && r) {
+ char *old_addr, *new_addr;
+ old_addr = kmap_atomic(oldpage, KM_USER0);
+ new_addr = kmap_atomic(newpage, KM_USER1);
+ r = !memcmp(old_addr+PAGECMP_OFFSET, new_addr+PAGECMP_OFFSET, PAGE_SIZE-PAGECMP_OFFSET);
+ kunmap_atomic(old_addr, KM_USER0);
+ kunmap_atomic(new_addr, KM_USER1);
+ }
+ return r;
+}
+
+/*
+ * try_to_merge_one_page - take two pages and merge them into one
+ * note:
+ * oldpage should be anon page while newpage should be file mapped page
+ */
+static int try_to_merge_one_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct page *oldpage,
+ struct page *newpage,
+ pgprot_t newprot,
+ int new)
+{
+ int ret = 0;
+ int odirect_sync;
+ unsigned long page_addr_in_vma;
+ pte_t orig_pte, *orig_ptep;
+
+ get_page(newpage);
+ get_page(oldpage);
+
+ down_read(&mm->mmap_sem);
+
+ page_addr_in_vma = addr_in_vma(vma, oldpage);
+ if (page_addr_in_vma == -EFAULT)
+ goto out_unlock;
+
+ orig_ptep = get_pte(mm, page_addr_in_vma);
+ if (!orig_ptep)
+ goto out_unlock;
+ orig_pte = *orig_ptep;
+ if (!pte_present(orig_pte))
+ goto out_unlock;
+ if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
+ goto out_unlock;
+ /*
+ * page_wrprotect check if the page is swapped or in swap cache,
+ * in the future we might want to run here if_present_pte and then
+ * swap_free
+ */
+ if (!page_wrprotect(oldpage, &odirect_sync))
+ goto out_unlock;
+ if (!odirect_sync)
+ goto out_unlock;
+
+ orig_pte = pte_wrprotect(orig_pte);
+
+ ret = 1;
+ if (pages_identical(oldpage, newpage, new))
+ ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
+
+ if (!ret)
+ ret = 1;
+ else
+ ret = 0;
+
+out_unlock:
+ up_read(&mm->mmap_sem);
+ put_page(oldpage);
+ put_page(newpage);
+ return ret;
+}
+
+static int try_to_merge_two_pages(struct mm_struct *mm1, struct page *page1,
+ struct mm_struct *mm2, struct page *page2,
+ unsigned long addr1, unsigned long addr2)
+{
+ struct vm_area_struct *vma;
+ pgprot_t prot;
+ int ret = 0;
+
+ /*
+ * If page2 isn't shared (it isn't PageKsm) we have to allocate a new
+ * file mapped page and make the two ptes of mm1(page1) and mm2(page2)
+ * point to it. If page2 is shared, we can just make the pte of
+ * mm1(page1) point to page2
+ */
+ if (PageKsm(page2)) {
+ vma = find_vma(mm1, addr1);
+ if (!vma)
+ return ret;
+ prot = vma->vm_page_prot;
+ pgprot_val(prot) &= ~VM_WRITE;
+ ret = try_to_merge_one_page(mm1, vma, page1, page2, prot, 0);
+ } else {
+ struct page *kpage;
+
+ kpage = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
+ if (!kpage)
+ return ret;
+
+ vma = find_vma(mm1, addr1);
+ if (!vma) {
+ page_cache_release(kpage);
+ return ret;
+ }
+ prot = vma->vm_page_prot;
+ pgprot_val(prot) &= ~VM_WRITE;
+
+ copy_highpage(kpage, page1);
+ ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot, 1);
+
+ if (ret) {
+ vma = find_vma(mm2, addr2);
+ if (!vma) {
+ page_cache_release(kpage);
+ return ret;
+ }
+
+ prot = vma->vm_page_prot;
+ pgprot_val(prot) &= ~VM_WRITE;
+
+ ret = try_to_merge_one_page(mm2, vma, page2, kpage,
+ prot, 0);
+ }
+ page_cache_release(kpage);
+ }
+ return ret;
+}
+
+static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
+{
+ struct hlist_head *bucket;
+ struct hlist_node *node, *n;
+ struct page_hash_item *page_hash_item;
+ struct ksm_mem_slot *slot;
+ unsigned long hash_index;
+ unsigned long addr;
+ int ret = 0;
+ int used = 0;
+
+ hash_index = calc_hash_index(page);
+ bucket = &page_hash_items[hash_index];
+
+ slot = ksm_scan->slot_index;
+ addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
+
+ spin_lock(&hash_lock);
+ /*
+ * update_hash must be called every time because there is a chance
+ * that the data in the page has changed since the page was inserted
+ * into the hash table to avoid inserting the page more than once.
+ */
+ update_hash(ksm_scan, hash_index);
+ spin_unlock(&hash_lock);
+
+ hlist_for_each_entry_safe(page_hash_item, node, n, bucket, link) {
+ int npages;
+ struct page *hash_page[1];
+
+ if (slot->mm == page_hash_item->mm && addr == page_hash_item->addr) {
+ used = 1;
+ continue;
+ }
+
+ down_read(&page_hash_item->mm->mmap_sem);
+ /*
+ * If the page is swapped out or in swap cache we don't want to
+ * scan it (it is just for performance).
+ */
+ if (!is_present_pte(page_hash_item->mm, page_hash_item->addr)) {
+ up_read(&page_hash_item->mm->mmap_sem);
+ continue;
+ }
+ npages = get_user_pages(current, page_hash_item->mm,
+ page_hash_item->addr,
+ 1, 0, 0, hash_page, NULL);
+ up_read(&page_hash_item->mm->mmap_sem);
+ if (npages != 1)
+ break;
+
+ /* Recalculate the page's hash index in case it has changed. */
+ if (calc_hash_index(hash_page[0]) == hash_index) {
+
+ ret = try_to_merge_two_pages(slot->mm, page,
+ page_hash_item->mm,
+ hash_page[0], addr,
+ page_hash_item->addr);
+ if (ret) {
+ page_cache_release(hash_page[0]);
+ return ret;
+ }
+ }
+ page_cache_release(hash_page[0]);
+ }
+ /* If node is NULL and used=0, the page is not in the hash table. */
+ if (!node && !used) {
+ struct page_hash_item *page_hash_item;
+ struct rmap_item *rmap_item;
+
+ page_hash_item = alloc_page_hash_item();
+ if (!page_hash_item)
+ return ret;
+
+ rmap_item = alloc_rmap_item();
+ if (!rmap_item) {
+ free_page_hash_item(page_hash_item);
+ return ret;
+ }
+
+ spin_lock(&hash_lock);
+ update_hash(ksm_scan, hash_index);
+ insert_page_to_hash(ksm_scan, hash_index, page_hash_item, rmap_item);
+ spin_unlock(&hash_lock);
+ }
+
+ return ret;
+}
+
+/* return 1 - no slots registered, nothing to be done */
+static int scan_get_next_index(struct ksm_scan *ksm_scan, int nscan)
+{
+ struct ksm_mem_slot *slot;
+
+ if (list_empty(&slots))
+ return 1;
+
+ slot = ksm_scan->slot_index;
+
+ /* Are there pages left in this slot to scan? */
+ if ((slot->npages - ksm_scan->page_index - nscan) > 0) {
+ ksm_scan->page_index += nscan;
+ return 0;
+ }
+
+ list_for_each_entry_from(slot, &slots, link) {
+ if (slot == ksm_scan->slot_index)
+ continue;
+ ksm_scan->page_index = 0;
+ ksm_scan->slot_index = slot;
+ return 0;
+ }
+
+ /* look like we finished scanning the whole memory, starting again */
+ ksm_scan->page_index = 0;
+ ksm_scan->slot_index = list_first_entry(&slots,
+ struct ksm_mem_slot, link);
+ return 0;
+}
+
+/*
+ * update slot_index so it point to vaild data, it is possible that by
+ * the time we are here the data that ksm_scan was pointed to was released
+ * so we have to call this function every time after taking the slots_lock
+ */
+static void scan_update_old_index(struct ksm_scan *ksm_scan)
+{
+ struct ksm_mem_slot *slot;
+
+ if (list_empty(&slots))
+ return;
+
+ list_for_each_entry(slot, &slots, link) {
+ if (ksm_scan->slot_index == slot)
+ return;
+ }
+
+ ksm_scan->slot_index = list_first_entry(&slots,
+ struct ksm_mem_slot, link);
+ ksm_scan->page_index = 0;
+}
+
+static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
+ int max_npages)
+{
+ struct ksm_mem_slot *slot;
+ struct page *page[1];
+ int val;
+ int ret = 0;
+
+ down_read(&slots_lock);
+
+ scan_update_old_index(ksm_scan);
+
+ while (scan_npages > 0 && max_npages > 0) {
+ if (scan_get_next_index(ksm_scan, 1)) {
+ /* we have no slots, another ret value should be used */
+ goto out;
+ }
+
+ slot = ksm_scan->slot_index;
+ down_read(&slot->mm->mmap_sem);
+ /*
+ * If the page is swapped out or in swap cache, we don't want to
+ * scan it (it is just for performance).
+ */
+ if (is_present_pte(slot->mm, slot->addr +
+ ksm_scan->page_index * PAGE_SIZE)) {
+ val = get_user_pages(current, slot->mm, slot->addr +
+ ksm_scan->page_index * PAGE_SIZE ,
+ 1, 0, 0, page, NULL);
+ up_read(&slot->mm->mmap_sem);
+ if (val == 1) {
+ if (!PageKsm(page[0]))
+ max_npages -=
+ cmp_and_merge_page(ksm_scan, page[0]);
+ page_cache_release(page[0]);
+ }
+ } else
+ up_read(&slot->mm->mmap_sem);
+ scan_npages--;
+ }
+
+ scan_get_next_index(ksm_scan, 1);
+out:
+ up_read(&slots_lock);
+ return ret;
+}
+
+static int ksm_scan_ioctl_start(struct ksm_scan *ksm_scan,
+ struct ksm_user_scan *scan)
+{
+ return ksm_scan_start(ksm_scan, scan->pages_to_scan,
+ scan->max_pages_to_merge);
+}
+
+static int ksm_scan_release(struct inode *inode, struct file *filp)
+{
+ struct ksm_scan *ksm_scan = filp->private_data;
+
+ kfree(ksm_scan);
+ return 0;
+}
+
+static long ksm_scan_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ struct ksm_scan *ksm_scan = filp->private_data;
+ void __user *argp = (void __user *)arg;
+ int r = EINVAL;
+
+ switch (ioctl) {
+ case KSM_SCAN: {
+ struct ksm_user_scan scan;
+
+ r = -EFAULT;
+ if (copy_from_user(&scan, argp,
+ sizeof(struct ksm_user_scan)))
+ break;
+
+ r = ksm_scan_ioctl_start(ksm_scan, &scan);
+ }
+ }
+ return r;
+}
+
+static struct file_operations ksm_sma_fops = {
+ .release = ksm_sma_release,
+ .unlocked_ioctl = ksm_sma_ioctl,
+ .compat_ioctl = ksm_sma_ioctl,
+};
+
+static int ksm_dev_ioctl_create_shared_memory_area(void)
+{
+ int fd = -1;
+ struct ksm_sma *ksm_sma;
+
+ ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
+ if (!ksm_sma)
+ goto out;
+
+ INIT_LIST_HEAD(&ksm_sma->sma_slots);
+
+ fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
+ if (fd < 0)
+ goto out_free;
+
+ return fd;
+out_free:
+ kfree(ksm_sma);
+out:
+ return fd;
+}
+
+static struct file_operations ksm_scan_fops = {
+ .release = ksm_scan_release,
+ .unlocked_ioctl = ksm_scan_ioctl,
+ .compat_ioctl = ksm_scan_ioctl,
+};
+
+static struct ksm_scan *ksm_scan_create(void)
+{
+ struct ksm_scan *ksm_scan;
+
+ ksm_scan = kzalloc(sizeof(struct ksm_scan), GFP_KERNEL);
+ return ksm_scan;
+}
+
+static int ksm_dev_ioctl_create_scan(void)
+{
+ int fd;
+ struct ksm_scan *ksm_scan;
+
+ if (!tfm) {
+ fd = ksm_tfm_init();
+ if (fd)
+ goto out;
+ }
+
+ fd = -ENOMEM;
+ ksm_scan = ksm_scan_create();
+ if (!ksm_scan)
+ goto out;
+
+ fd = anon_inode_getfd("ksm-scan", &ksm_scan_fops, ksm_scan, 0);
+ if (fd < 0)
+ goto out_free;
+ return fd;
+
+out_free:
+ kfree(ksm_scan);
+out:
+ return fd;
+}
+
+static int ksm_dev_ioctl_start_stop_kthread(struct ksm_kthread_info *info)
+{
+ int rc = 0;
+
+ /* Make sure crypto tfm is initialized before starting scanning */
+ if (info->running && !tfm) {
+ rc = ksm_tfm_init();
+ if (rc)
+ goto out;
+ }
+
+ down_write(&kthread_lock);
+
+ kthread_sleep = info->sleep;
+ kthread_pages_to_scan = info->pages_to_scan;
+ kthread_max_npages = info->max_pages_to_merge;
+ kthread_run = info->running;
+
+ up_write(&kthread_lock);
+
+ if (kthread_run)
+ wake_up_interruptible(&kthread_wait);
+
+out:
+ return rc;
+}
+
+static int ksm_dev_ioctl_get_info_kthread(struct ksm_kthread_info *info)
+{
+ down_read(&kthread_lock);
+
+ info->sleep = kthread_sleep;
+ info->pages_to_scan = kthread_pages_to_scan;
+ info->max_pages_to_merge = kthread_max_npages;
+ info->running = kthread_run;
+
+ up_read(&kthread_lock);
+ return 0;
+}
+
+static long ksm_dev_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ void __user *argp = (void __user *)arg;
+ long r = -EINVAL;
+
+ switch (ioctl) {
+ case KSM_GET_API_VERSION:
+ r = KSM_API_VERSION;
+ break;
+ case KSM_CREATE_SHARED_MEMORY_AREA:
+ r = ksm_dev_ioctl_create_shared_memory_area();
+ break;
+ case KSM_CREATE_SCAN:
+ r = ksm_dev_ioctl_create_scan();
+ break;
+ case KSM_START_STOP_KTHREAD: {
+ struct ksm_kthread_info info;
+
+ r = -EFAULT;
+ if (copy_from_user(&info, argp,
+ sizeof(struct ksm_kthread_info)))
+ break;
+
+ r = ksm_dev_ioctl_start_stop_kthread(&info);
+ break;
+ }
+ case KSM_GET_INFO_KTHREAD: {
+ struct ksm_kthread_info info;
+
+ r = ksm_dev_ioctl_get_info_kthread(&info);
+ if (r)
+ break;
+ r = -EFAULT;
+ if (copy_to_user(argp, &info,
+ sizeof(struct ksm_kthread_info)))
+ break;
+ r = 0;
+ }
+ default:
+ return r;
+ }
+ return r;
+}
+
+static int ksm_dev_open(struct inode *inode, struct file *filp)
+{
+ try_module_get(THIS_MODULE);
+ return 0;
+}
+
+static int ksm_dev_release(struct inode *inode, struct file *filp)
+{
+ module_put(THIS_MODULE);
+ return 0;
+}
+
+static struct file_operations ksm_chardev_ops = {
+ .open = ksm_dev_open,
+ .release = ksm_dev_release,
+ .unlocked_ioctl = ksm_dev_ioctl,
+ .compat_ioctl = ksm_dev_ioctl,
+};
+
+static struct miscdevice ksm_dev = {
+ KSM_MINOR,
+ "ksm",
+ &ksm_chardev_ops,
+};
+
+int kthread_ksm_scan_thread(void *nothing)
+{
+ while (!kthread_should_stop()) {
+ if(kthread_run) {
+ down_read(&kthread_lock);
+ ksm_scan_start(&kthread_ksm_scan,
+ kthread_pages_to_scan,
+ kthread_max_npages);
+ up_read(&kthread_lock);
+ schedule_timeout_interruptible(usecs_to_jiffies(kthread_sleep));
+ } else
+ wait_event_interruptible(kthread_wait, kthread_run);
+ }
+ return 0;
+}
+
+static int __init ksm_init(void)
+{
+ int r = 1;
+
+ r = ksm_slab_init();
+ if (r)
+ goto out;
+
+ r = page_hash_init();
+ if (r)
+ goto out_free1;
+
+ r = rmap_hash_init();
+ if (r)
+ goto out_free2;
+
+ r = sha1_hash_init();
+ if (r)
+ goto out_free3;
+
+ INIT_LIST_HEAD(&slots);
+ init_rwsem(&slots_lock);
+ spin_lock_init(&hash_lock);
+ init_rwsem(&kthread_lock);
+ init_waitqueue_head(&kthread_wait);
+
+ kthread = kthread_run(kthread_ksm_scan_thread, NULL, "kksmd");
+ if (IS_ERR(kthread)) {
+ printk(KERN_ERR "ksm: creating kthread failed\n");
+ goto out_free4;
+ }
+
+ r = misc_register(&ksm_dev);
+ if (r) {
+ printk(KERN_ERR "ksm: misc device register failed\n");
+ goto out_free5;
+ }
+
+ printk(KERN_WARNING "ksm loaded\n");
+ return 0;
+
+out_free5:
+ kthread_stop(kthread);
+out_free4:
+ sha1_hash_free();
+out_free3:
+ rmap_hash_free();
+out_free2:
+ page_hash_free();
+out_free1:
+ ksm_slab_free();
+out:
+ return r;
+}
+
+static void __exit ksm_exit(void)
+{
+ misc_deregister(&ksm_dev);
+ kthread_run = 1;
+ kthread_stop(kthread);
+ if (tfm)
+ crypto_free_hash(tfm);
+ sha1_hash_free();
+ rmap_hash_free();
+ page_hash_free();
+ ksm_slab_free();
+}
+
+module_init(ksm_init)
+module_exit(ksm_exit)
--
1.6.0.3

2008-11-11 13:23:48

by Izik Eidus

[permalink] [raw]
Subject: [PATCH 1/4] rmap: add page_wrprotect() function,

From: Izik Eidus <[email protected]>

this function is useful for cases you want to compare page and know
that its value wont change during you compare it.

this function is working by walking over the whole rmap of a page
and mark every pte related to the page as write_protect.

the odirect_sync paramter is used to notify the caller of
page_wrprotect() if one pte or more was not marked readonly
in order to avoid race with odirect.

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/rmap.h | 7 ++++
mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 89f0564..2a37fb7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,8 @@ static inline int try_to_munlock(struct page *page)
}
#endif

+int page_wrprotect(struct page *page, int *odirect_sync);
+
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)
@@ -135,6 +137,11 @@ static inline int page_mkclean(struct page *page)
return 0;
}

+static inline int page_wrprotect(struct page *page, int *odirect_sync)
+{
+ return 0;
+}
+

#endif /* CONFIG_MMU */

diff --git a/mm/rmap.c b/mm/rmap.c
index 1099394..3684edd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);

+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ address = vma_address(page, vma);
+ if (address == -EFAULT)
+ goto out;
+
+ pte = page_check_address(page, mm, address, &ptl, 0);
+ if (!pte)
+ goto out;
+
+ if (pte_write(*pte)) {
+ pte_t entry;
+
+ if (page_mapcount(page) != page_count(page)) {
+ *odirect_sync = 0;
+ goto out_unlock;
+ }
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ entry = ptep_clear_flush_notify(vma, address, pte);
+ entry = pte_wrprotect(entry);
+ set_pte_at(mm, address, pte, entry);
+ }
+ ret = 1;
+
+out_unlock:
+ pte_unmap_unlock(pte, ptl);
+out:
+ return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync)
+{
+ struct address_space *mapping;
+ struct prio_tree_iter iter;
+ struct vm_area_struct *vma;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ int ret = 0;
+
+ mapping = page_mapping(page);
+ if (!mapping)
+ return ret;
+
+ spin_lock(&mapping->i_mmap_lock);
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+ ret += page_wrprotect_one(page, vma, odirect_sync);
+
+ spin_unlock(&mapping->i_mmap_lock);
+
+ return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync)
+{
+ struct vm_area_struct *vma;
+ struct anon_vma *anon_vma;
+ int ret = 0;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return ret;
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+ ret += page_wrprotect_one(page, vma, odirect_sync);
+
+ page_unlock_anon_vma(anon_vma);
+
+ return ret;
+}
+
+/**
+ * set all the ptes pointed to a page as read only,
+ * odirect_sync is set to 0 in case we cannot protect against race with odirect
+ * return the number of ptes that were set as read only
+ * (ptes that were read only before this function was called are couned as well)
+ */
+int page_wrprotect(struct page *page, int *odirect_sync)
+{
+ int ret =0;
+
+ *odirect_sync = 1;
+ if (PageAnon(page))
+ ret = page_wrprotect_anon(page, odirect_sync);
+ else
+ ret = page_wrprotect_file(page, odirect_sync);
+
+ return ret;
+}
+EXPORT_SYMBOL(page_wrprotect);
+
/**
* __page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
--
1.6.0.3

2008-11-11 13:23:30

by Izik Eidus

[permalink] [raw]
Subject: [PATCH 4/4] MMU_NOTIFIRES: add set_pte_at_notify()

From: Izik Eidus <[email protected]>

this function is optimzation for kvm/users of mmu_notifiers for COW
pages, it is useful for kvm when ksm is used beacuse it allow kvm
not to have to recive VMEXIT and only then map the shared page into
the mmu shadow pages, but instead map it directly at the same time
linux map the page into the host page table.

this mmu notifer macro is working by calling to callback that will map
directly the physical page into the shadow page tables.

(users of mmu_notifiers that didnt implement the set_pte_at_notify()
call back will just recive the mmu_notifier_invalidate_page callback)

Signed-off-by: Izik Eidus <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.c | 55 ++++++++++++++++++++++++++++++++++-----
include/linux/mmu_notifier.h | 33 +++++++++++++++++++++++
mm/memory.c | 12 ++++++--
mm/mmu_notifier.c | 20 ++++++++++++++
virt/kvm/kvm_main.c | 14 ++++++++++
6 files changed, 125 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 65679d0..a5d01d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -748,5 +748,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
#define KVM_ARCH_WANT_MMU_NOTIFIER
int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);

#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 99c239c..652a51c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -663,7 +663,8 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
kvm_flush_remote_tlbs(kvm);
}

-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+ unsigned long data)
{
u64 *spte;
int need_tlb_flush = 0;
@@ -678,8 +679,41 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
return need_tlb_flush;
}

+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+ unsigned long data)
+{
+ u64 *spte, new_spte;
+ u64 *cur_spte;
+ pte_t *ptep = (pte_t *)data;
+ pte_t pte;
+ struct page *new_page;
+ struct page *old_page;
+
+ pte = *ptep;
+ new_page = pfn_to_page(pte_pfn(pte));
+ cur_spte = rmap_next(kvm, rmapp, NULL);
+ while (cur_spte) {
+ spte = cur_spte;
+ BUG_ON(!(*spte & PT_PRESENT_MASK));
+ rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+ new_spte = *spte &~ (PT64_BASE_ADDR_MASK);
+ new_spte |= pte_pfn(pte);
+ if (!pte_write(pte))
+ new_spte &= ~PT_WRITABLE_MASK;
+ old_page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+ get_page(new_page);
+ cur_spte = rmap_next(kvm, rmapp, spte);
+ set_shadow_pte(spte, new_spte);
+ kvm_flush_remote_tlbs(kvm);
+ put_page(old_page);
+ }
+ return 0;
+}
+
static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+ unsigned long data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+ unsigned long data))
{
int i;
int retval = 0;
@@ -700,11 +734,12 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
end = start + (memslot->npages << PAGE_SHIFT);
if (hva >= start && hva < end) {
gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
- retval |= handler(kvm, &memslot->rmap[gfn_offset]);
+ retval |= handler(kvm, &memslot->rmap[gfn_offset], data);
retval |= handler(kvm,
&memslot->lpage_info[
gfn_offset /
- KVM_PAGES_PER_HPAGE].rmap_pde);
+ KVM_PAGES_PER_HPAGE].rmap_pde,
+ data);
}
}

@@ -713,10 +748,16 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,

int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
{
- return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+ return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+{
+ kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
}

-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
+ unsigned long data)
{
u64 *spte;
int young = 0;
@@ -742,7 +783,7 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)

int kvm_age_hva(struct kvm *kvm, unsigned long hva)
{
- return kvm_handle_hva(kvm, hva, kvm_age_rmapp);
+ return kvm_handle_hva(kvm, hva, 0, kvm_age_rmapp);
}

#ifdef MMU_DEBUG
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..c2effe2 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);

+ /*
+ * change_pte is called in cases that pte mapping into page is changed
+ * for example when ksm mapped pte to point into a new shared page.
+ */
+ void (*change_pte)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte);
+
/*
* Before this is invoked any secondary MMU is still ok to
* read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address, pte_t pte);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address, pte_t pte)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_change_pte(mm, address, pte);
+}
+
static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
__young; \
})

+#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
+({ \
+ struct mm_struct *___mm = __mm; \
+ unsigned long ___address = __address; \
+ pte_t ___pte = __pte; \
+ \
+ set_pte_at(__mm, __address, __ptep, ___pte); \
+ mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
#else /* CONFIG_MMU_NOTIFIER */

static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address, pte_t pte)
+{
+}
+
static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
diff --git a/mm/memory.c b/mm/memory.c
index b2c542c..374d695 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1524,7 +1524,7 @@ int replace_page(struct vm_area_struct *vma, struct page *oldpage,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
- set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
+ set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));

page_remove_rmap(oldpage, vma);
if (PageAnon(oldpage)) {
@@ -1981,13 +1981,19 @@ gotten:
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
+ ptep_clear_flush(vma, address, page_table);
SetPageSwapBacked(new_page);
lru_cache_add_active_or_unevictable(new_page, vma);
page_add_new_anon_rmap(new_page, vma, address);

//TODO: is this safe? do_anonymous_page() does it this way.
- set_pte_at(mm, address, page_table, entry);
+ /*
+ * we call here for the notify macro beacuse in cases of using
+ * secondary mmu page table like kvm shadow page tables
+ * we want the new page to be mapped directly into the second
+ * page table.
+ */
+ set_pte_at_notify(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
/*
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5f4ef02..c3e8779 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -99,6 +99,26 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
return young;
}

+void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
+ pte_t pte)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->change_pte)
+ mn->ops->change_pte(mn, mm, address, pte);
+ /*
+ * some drivers dont have change_pte and therefor we must call
+ * for invalidate_page in that case
+ */
+ else if (mn->ops->invalidate_page)
+ mn->ops->invalidate_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+}
+
void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf0ab8e..00c12c4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -482,6 +482,19 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,

}

+static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+ spin_lock(&kvm->mmu_lock);
+ kvm->mmu_notifier_seq++;
+ kvm_set_spte_hva(kvm, address, pte);
+ spin_unlock(&kvm->mmu_lock);
+}
+
static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
@@ -554,6 +567,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
+ .change_pte = kvm_mmu_notifier_change_pte,
};
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

--
1.6.0.3

2008-11-11 18:31:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus <[email protected]> wrote:

> KSM is a linux driver that allows dynamicly sharing identical memory pages
> between one or more processes.
>
> unlike tradtional page sharing that is made at the allocation of the
> memory, ksm do it dynamicly after the memory was created.
> Memory is periodically scanned; identical pages are identified and merged.
> the sharing is unnoticeable by the process that use this memory.
> (the shared pages are marked as readonly, and in case of write
> do_wp_page() take care to create new copy of the page)
>
> this driver is very useful for KVM as in cases of runing multiple guests
> operation system of the same type, many pages are sharable.
> this driver can be useful by OpenVZ as well.

These benefits should be quantified, please. Also any benefits to any
other workloads should be identified and quantified.

The whole approach seems wrong to me. The kernel lost track of these
pages and then we run around post-facto trying to fix that up again.
Please explain (for the changelog) why the kernel cannot get this right
via the usual sharing, refcounting and COWing approaches.

2008-11-11 18:48:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Andrew Morton wrote:
> The whole approach seems wrong to me. The kernel lost track of these
> pages and then we run around post-facto trying to fix that up again.
> Please explain (for the changelog) why the kernel cannot get this right
> via the usual sharing, refcounting and COWing approaches.
>

For kvm, the kernel never knew those pages were shared. They are loaded
from independent (possibly compressed and encrypted) disk images. These
images are different; but some pages happen to be the same because they
came from the same installation media.

For OpenVZ the situation is less clear, but if you allow users to
independently upgrade their chroots you will eventually arrive at the
same scenario (unless of course you apply the same merging strategy at
the filesystem level).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 19:08:21

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Andrew Morton wrote:
> On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus <[email protected]> wrote:
>
>
>> KSM is a linux driver that allows dynamicly sharing identical memory pages
>> between one or more processes.
>>
>> unlike tradtional page sharing that is made at the allocation of the
>> memory, ksm do it dynamicly after the memory was created.
>> Memory is periodically scanned; identical pages are identified and merged.
>> the sharing is unnoticeable by the process that use this memory.
>> (the shared pages are marked as readonly, and in case of write
>> do_wp_page() take care to create new copy of the page)
>>
>> this driver is very useful for KVM as in cases of runing multiple guests
>> operation system of the same type, many pages are sharable.
>> this driver can be useful by OpenVZ as well.
>>
>
> These benefits should be quantified, please. Also any benefits to any
> other workloads should be identified and quantified.
>
Sure,
we have used KSM in production for about half year and the numbers that
came from our QA is:
using KSM for desktop (KSM was tested just for windows desktop workload)
you can run as many as
52 windows xp with 1 giga ram each on server with just 16giga ram. (this
is more than 300% overcommit)
the reason is that most of the kernel/dlls of this guests is shared and
in addition we are sharing the windows zero
(windows keep making all its free memory as zero, so every time windows
release memory we take the page back to the host)
there is slide that give this numbers you can find at:
http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_3.pdf
(slide 27)
beside more i gave presentation about ksm that can be found at:
http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_12.pdf

if more numbers are wanted for other workloads i can test it.
(the idea of ksm is to run it slowly slowy at low priority and let it
merge pages when no one need the cpu)

2008-11-11 19:09:54

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Avi Kivity wrote:
> Andrew Morton wrote:
>> The whole approach seems wrong to me. The kernel lost track of these
>> pages and then we run around post-facto trying to fix that up again.
>> Please explain (for the changelog) why the kernel cannot get this right
>> via the usual sharing, refcounting and COWing approaches.
>>
>
> For kvm, the kernel never knew those pages were shared. They are
> loaded from independent (possibly compressed and encrypted) disk
> images. These images are different; but some pages happen to be the
> same because they came from the same installation media.

As Avi said, in kvm we cannot know how the guest is going to map its
pages, we have nothing to do but to scan for the identical pages
(you can have pages that are shared that are in whole different offset
inside the guest)

>
> For OpenVZ the situation is less clear, but if you allow users to
> independently upgrade their chroots you will eventually arrive at the
> same scenario (unless of course you apply the same merging strategy at
> the filesystem level).
>

2008-11-11 19:12:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

On Tue, 11 Nov 2008 20:48:16 +0200
Avi Kivity <[email protected]> wrote:

> Andrew Morton wrote:
> > The whole approach seems wrong to me. The kernel lost track of these
> > pages and then we run around post-facto trying to fix that up again.
> > Please explain (for the changelog) why the kernel cannot get this right
> > via the usual sharing, refcounting and COWing approaches.
> >
>
> For kvm, the kernel never knew those pages were shared. They are loaded
> from independent (possibly compressed and encrypted) disk images. These
> images are different; but some pages happen to be the same because they
> came from the same installation media.

What userspace-only changes could fix this? Identify the common data,
write it to a flat file and mmap it, something like that?

> For OpenVZ the situation is less clear, but if you allow users to
> independently upgrade their chroots you will eventually arrive at the
> same scenario (unless of course you apply the same merging strategy at
> the filesystem level).

hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page. Could
this infrastructure be used for that? (And how much would we gain from
it?)

[I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
thing here ;) ]

2008-11-11 19:19:09

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Andrew Morton wrote:
> On Tue, 11 Nov 2008 20:48:16 +0200
> Avi Kivity <[email protected]> wrote:
>
>
>> Andrew Morton wrote:
>>
>>> The whole approach seems wrong to me. The kernel lost track of these
>>> pages and then we run around post-facto trying to fix that up again.
>>> Please explain (for the changelog) why the kernel cannot get this right
>>> via the usual sharing, refcounting and COWing approaches.
>>>
>>>
>> For kvm, the kernel never knew those pages were shared. They are loaded
>> from independent (possibly compressed and encrypted) disk images. These
>> images are different; but some pages happen to be the same because they
>> came from the same installation media.
>>
>
> What userspace-only changes could fix this? Identify the common data,
> write it to a flat file and mmap it, something like that?
>
>
>> For OpenVZ the situation is less clear, but if you allow users to
>> independently upgrade their chroots you will eventually arrive at the
>> same scenario (unless of course you apply the same merging strategy at
>> the filesystem level).
>>
>
> hm.
>
> There has been the occasional discussion about idenfifying all-zeroes
> pages and scavenging them, repointing them at the zero page. Could
> this infrastructure be used for that? (And how much would we gain from
> it?)
>
> [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
> thing here ;) ]
KSM is separate driver , it doesn't change anything in the VM but adding
two helper functions.

2008-11-11 19:22:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

On Tue, 11 Nov 2008 21:07:10 +0200
Izik Eidus <[email protected]> wrote:

> we have used KSM in production for about half year and the numbers that
> came from our QA is:
> using KSM for desktop (KSM was tested just for windows desktop workload)
> you can run as many as
> 52 windows xp with 1 giga ram each on server with just 16giga ram. (this
> is more than 300% overcommit)
> the reason is that most of the kernel/dlls of this guests is shared and
> in addition we are sharing the windows zero
> (windows keep making all its free memory as zero, so every time windows
> release memory we take the page back to the host)
> there is slide that give this numbers you can find at:
> http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_3.pdf
> (slide 27)
> beside more i gave presentation about ksm that can be found at:
> http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_12.pdf

OK, 300% isn't chicken feed.

It is quite important that information such as this be prepared, added to
the patch changelogs and maintained. For a start, without this basic
information, there is no reason for anyone to look at any of the code!

2008-11-11 19:30:24

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Andrew Morton wrote:
>> For kvm, the kernel never knew those pages were shared. They are loaded
>> from independent (possibly compressed and encrypted) disk images. These
>> images are different; but some pages happen to be the same because they
>> came from the same installation media.
>>
>
> What userspace-only changes could fix this? Identify the common data,
> write it to a flat file and mmap it, something like that?
>
>

This was considered. You can't scan the image, because it may be
encrypted/compressed/offset (typical images _are_ offset because the
first partition starts at sector 63...). The data may come from the
network and not a disk image. You can't scan in userspace because the
images belong to different users and contain sensitive data. Pages may
come from several images (multiple disk images per guest) so you end up
with one vma per page.

So you have to scan memory, after the guest has retrieved it from
disk/network/manufactured it somehow, decompressed and encrypted it,
written it to the offset it wants. You can't scan from userspace since
it's sensitive data, and of course the actual merging need to be done
atomically, which can only be done from the holy of holies, the vm.

>> For OpenVZ the situation is less clear, but if you allow users to
>> independently upgrade their chroots you will eventually arrive at the
>> same scenario (unless of course you apply the same merging strategy at
>> the filesystem level).
>>
>
> hm.
>
> There has been the occasional discussion about idenfifying all-zeroes
> pages and scavenging them, repointing them at the zero page. Could
> this infrastructure be used for that?

Yes, trivially. ksm may be an overkill for this, though.

> (And how much would we gain from
> it?)
>

A lot of zeros.

> [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
> thing here ;) ]
>

I sympathize -- us too. Consider the typical multiuser gnome
minicomputer with all 150 users reading lwn.net at the same time instead
of working. You could share the firefox rendered page cache, reducing
memory utilization drastically.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 19:33:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

On Tue, 11 Nov 2008 21:18:23 +0200
Izik Eidus <[email protected]> wrote:

> > hm.
> >
> > There has been the occasional discussion about idenfifying all-zeroes
> > pages and scavenging them, repointing them at the zero page. Could
> > this infrastructure be used for that? (And how much would we gain from
> > it?)
> >
> > [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
> > thing here ;) ]

^^ this?

> KSM is separate driver , it doesn't change anything in the VM but adding
> two helper functions.

What, you mean I should actually read the code? Oh well, OK.

2008-11-11 19:41:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/4] rmap: add page_wrprotect() function,

On Tue, 11 Nov 2008 15:21:38 +0200
Izik Eidus <[email protected]> wrote:

> From: Izik Eidus <[email protected]>
>
> this function is useful for cases you want to compare page and know
> that its value wont change during you compare it.
>
> this function is working by walking over the whole rmap of a page
> and mark every pte related to the page as write_protect.
>
> the odirect_sync paramter is used to notify the caller of
> page_wrprotect() if one pte or more was not marked readonly
> in order to avoid race with odirect.

The grammar is a bit mangled. Missing apostrophes. Sentences start
with capital letters.

Yeah, it's a nit, but we don't actually gain anything from deliberately
mangling the language.

>
> ...
>
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
> }
> EXPORT_SYMBOL_GPL(page_mkclean);
>
> +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
> + int *odirect_sync)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long address;
> + pte_t *pte;
> + spinlock_t *ptl;
> + int ret = 0;
> +
> + address = vma_address(page, vma);
> + if (address == -EFAULT)
> + goto out;
> +
> + pte = page_check_address(page, mm, address, &ptl, 0);
> + if (!pte)
> + goto out;
> +
> + if (pte_write(*pte)) {
> + pte_t entry;
> +
> + if (page_mapcount(page) != page_count(page)) {
> + *odirect_sync = 0;
> + goto out_unlock;
> + }
> + flush_cache_page(vma, address, pte_pfn(*pte));
> + entry = ptep_clear_flush_notify(vma, address, pte);
> + entry = pte_wrprotect(entry);
> + set_pte_at(mm, address, pte, entry);
> + }
> + ret = 1;
> +
> +out_unlock:
> + pte_unmap_unlock(pte, ptl);
> +out:
> + return ret;
> +}

OK. I think. We need to find a way of provoking Hugh to look at it.

> +static int page_wrprotect_file(struct page *page, int *odirect_sync)
> +{
> + struct address_space *mapping;
> + struct prio_tree_iter iter;
> + struct vm_area_struct *vma;
> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> + int ret = 0;
> +
> + mapping = page_mapping(page);

What pins *mapping in memory? Usually this is done by requiring that
the caller has locked the page. But no such precondition is documented
here.

> + if (!mapping)
> + return ret;
> +
> + spin_lock(&mapping->i_mmap_lock);
> +
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
> + ret += page_wrprotect_one(page, vma, odirect_sync);
> +
> + spin_unlock(&mapping->i_mmap_lock);
> +
> + return ret;
> +}
> +
> +static int page_wrprotect_anon(struct page *page, int *odirect_sync)
> +{
> + struct vm_area_struct *vma;
> + struct anon_vma *anon_vma;
> + int ret = 0;
> +
> + anon_vma = page_lock_anon_vma(page);
> + if (!anon_vma)
> + return ret;
> +
> + list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
> + ret += page_wrprotect_one(page, vma, odirect_sync);
> +
> + page_unlock_anon_vma(anon_vma);
> +
> + return ret;
> +}
> +
> +/**

This token means "this is a kerneldoc comment".

> + * set all the ptes pointed to a page as read only,
> + * odirect_sync is set to 0 in case we cannot protect against race with odirect
> + * return the number of ptes that were set as read only
> + * (ptes that were read only before this function was called are couned as well)
> + */

But it isn't.

I don't understand this odirect_sync thing. What race? Please expand
this comment to make the function of odirect_sync more understandable.

> +int page_wrprotect(struct page *page, int *odirect_sync)
> +{
> + int ret =0;
> +
> + *odirect_sync = 1;
> + if (PageAnon(page))
> + ret = page_wrprotect_anon(page, odirect_sync);
> + else
> + ret = page_wrprotect_file(page, odirect_sync);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(page_wrprotect);

What do you think about making all this new code dependent upon some
CONFIG_ switch which CONFIG_KVM can select?

2008-11-11 19:47:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008 15:21:39 +0200
Izik Eidus <[email protected]> wrote:

> From: Izik Eidus <[email protected]>
>
> this function is needed in cases you want to change the userspace
> virtual mapping into diffrent physical page,

Not sure that I understand that description. We want to replace a live
page in an anonymous VMA with a different one?

It looks that way.

page migration already kinda does that. Is there common ground?

> KSM need this for merging the identical pages.
>
> this function is working by removing the oldpage from the rmap and
> calling put_page on it, and by setting the virtual address pte
> to point into the new page.
> (note that the new page (the page that we change the pte to map to)
> cannot be anonymous page)
>

I don't understand the restrictions on anonymous pages. Please expand
the changelog so that reviewers can understand the reasons for this
restriction.


> ---
> include/linux/mm.h | 3 ++
> mm/memory.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 71 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ffee2f7..4da7fa8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
>
> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
> + struct page *newpage, pte_t orig_pte, pgprot_t prot);
> +
> struct page *follow_page(struct vm_area_struct *, unsigned long address,
> unsigned int foll_flags);
> #define FOLL_WRITE 0x01 /* check pte is writable */
> diff --git a/mm/memory.c b/mm/memory.c
> index 164951c..b2c542c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> }
> EXPORT_SYMBOL(vm_insert_mixed);
>
> +/**
> + * replace _page - replace the pte mapping related to vm area between two pages

s/replace _page/replace_page/

> + * (from oldpage to newpage)
> + * NOTE: you should take into consideration the impact on the VM when replacing
> + * anonymous pages with kernel non swappable pages.
> + */

This _is_ a kerneldoc comment, but kernedoc comments conventionally
document the arguments and the return value also.

> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
> + struct page *newpage, pte_t orig_pte, pgprot_t prot)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + unsigned long addr;
> + int ret;
> +
> + BUG_ON(PageAnon(newpage));
> +
> + ret = -EFAULT;
> + addr = page_address_in_vma(oldpage, vma);
> + if (addr == -EFAULT)
> + goto out;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!ptep)
> + goto out;
> +
> + if (!pte_same(*ptep, orig_pte)) {
> + pte_unmap_unlock(ptep, ptl);
> + goto out;
> + }
> +
> + ret = 0;
> + get_page(newpage);
> + page_add_file_rmap(newpage);
> +
> + flush_cache_page(vma, addr, pte_pfn(*ptep));
> + ptep_clear_flush(vma, addr, ptep);
> + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
> +
> + page_remove_rmap(oldpage, vma);
> + if (PageAnon(oldpage)) {
> + dec_mm_counter(mm, anon_rss);
> + inc_mm_counter(mm, file_rss);
> + }
> + put_page(oldpage);
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(replace_page);

Again, we could make the presence of this code selectable by subsystems
which want it.

2008-11-11 19:53:35

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Andrew Morton wrote:
> On Tue, 11 Nov 2008 21:18:23 +0200
> Izik Eidus <[email protected]> wrote:
>
>
>>> hm.
>>>
>>> There has been the occasional discussion about idenfifying all-zeroes
>>> pages and scavenging them, repointing them at the zero page. Could
>>> this infrastructure be used for that? (And how much would we gain from
>>> it?)
>>>
>>> [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
>>> thing here ;) ]
>>>
>
> ^^ this?
>
>
>> KSM is separate driver , it doesn't change anything in the VM but adding
>> two helper functions.
>>
>
> What, you mean I should actually read the code? Oh well, OK.
>
Andrea i think what is happening here is my fault
i will try to give here much more information about KSM:
first the bad things:
KSM shared pages are right now (we have patch that can change it but we
want to wait with it) unswappable
this mean that the entire memory of the guest is swappable but the pages
that are shared are not.
(when the pages are splited back by COW they become anonymous again with
the help of do_wp_page()
the reason that the pages are not swappable is beacuse the way the Linux
Rmap is working, this not allow us to create nonlinear anonymous pages
(we dont want to use nonlinear vma for kvm, as it will make swapping for
kvm very slow)
the reason that ksm pages need to have nonlinear reverse mapping is that
for one guest identical page can be found in whole diffrent offset than
other guest have it
(this is from the userspace VM point of view)

the rest is quite simple:
it is walking over the entire guest memory (or only some of it) and scan
for identical pages using hash table
it merge the pages into one single write protected page

numbers for ksm is something that i have just for desktops and just the
numbers i gave you
what is do know is:
big overcommit like 300% is possible just when you take into account
that some of the guest memory will be free
we are sharing mostly the DLLs/ KERNEL / ZERO pages, for the DLLS and
KERNEL PAGEs this pages likely will never break
but ZERO pages will be break when windows will allocate them and will
come back when windows will free the memory.
(i wouldnt suggest 300% overcommit for servers workload, beacuse you can
end up swapping in that case,
but for desktops after runing in production and passed some seiroes qa
tress tests it seems like 300% is a real number that can be use)

i just ran test on two fedora 8 guests and got that results (using GNOME
in both of them)
9959 root 15 0 730m 537m 281m S 8 3.4 0:44.28
kvm

9956 root 15 0 730m 537m 246m S 4 3.4 0:41.43 kvm
as you can see the physical sharing was 281mb and 246mb (kernel pages
are counted as shared)
there is small lie in this numbers beacuse pages that was shared across
two guests and was splited by writing from guest number 1 will still
have 1 refernce count to it
and will still be kernel page (untill the other guest (num 2) will write
to it as well)


anyway i am willing to make much better testing or everything that
needed for this patchs to be merged.
(just tell me what and i will do it)

beside that you should know that patch 4 is not a must, it is just nice
optimization...

thanks.

2008-11-11 19:55:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Hi Andrew,

thanks for looking into this.

On Tue, Nov 11, 2008 at 11:11:10AM -0800, Andrew Morton wrote:
> What userspace-only changes could fix this? Identify the common data,
> write it to a flat file and mmap it, something like that?

The whole idea is to do something that works transparently and isn't
tailored for kvm. The mmu notifier change_pte method can be dropped as
well if you want (I recommended not to have it in the first submission
but Izik preferred to keep it because it will optimize away a kvm
shadow pte minor fault the first time kvm access the page after
sharing it). The page_wrprotect and replace_page can also be embedded
in ksm.

So the idea is that while we could do something specific to ksm that
keeps most of the code in userland, it'd be more tricky as it'd
require some communication with the core VM anyway (we can't just do
it in userland with mprotect, memcpy, mmap(MAP_PRIVATE) as it wouldn't
be atomic and second it'd be inefficient in terms of vma-buildup for
the same reason nonlinear-vmas exist), but most important: it wouldn't
work for all other regular process. With KSM we can share anonymous
memory for the whole system, KVM is just a random user.

This implementation is on the simple side because it can't
swap. Swapping and perhaps the limitation of sharing anonymous memory
is the only trouble here but those will be addressed in the
future. ksm is a new device driver so it's like /dev/mem, so no
swapping isn't a blocker here.

By sharing anon pages, in short we're making anonymous vmas nonlinear,
and this isn't supported by the current rmap code. So swapping can't
work unless we mark those anon-vmas nonlinear and we either build the
equivalent of the old pte_chains on demand just for those nonlinear
shared pages, or we do a full scan of all ptes in the nonlinear
anon-vmas. An external rmap infrastructure can allow ksm to build
whatever it wants inside ksm.c to track the nonlinear anon-pages
inside a regular anon-vma and rmap.c can invoke those methods to find
the ptes for those nonlinear pages. The core VM won't get more complex
and ksm can decide if to do a full nonlinear scan of the vma, or to
build the equivalent of pte_chains. This again has to be added later
and once everybody sees ksm, it'll be easier to agree on a
external-rmap API to allow it to swap. While the pte_chains are very
inefficent to reverse the regular anonymous mappings, they're
efficient solution as an exception for the shared KSM pages that gets
scattered over the linear anon-vmas.

It's a bit like the initial kvm that was merged despite it couldn't
swap. Then we added mmu notifiers, and now kvm can swap. So we add ksm
now without swapping and later we build an external-rmap to allow ksm
to swap after we agree ksm is useful and people starts using it.

> There has been the occasional discussion about idenfifying all-zeroes
> pages and scavenging them, repointing them at the zero page. Could
> this infrastructure be used for that? (And how much would we gain from
> it?)

Zero pages makes a lot of difference for windows, but they're totally
useless for linux. With current ksm all guest pagecache is 100% shared
across hosts, so when you start an app the .text runs on the same
physical memory on both guests. Works fine and code is quite simple in
this round. Once we add swapping it'll be a bit more complex in VM
terms as it'll have to handle nonlinear anon-vmas.

If we ever decide to share MAP_SHARED pagecache it'll be even more
complicated than just adding the external-rmap... I think this can be
done incrementally if needed at all. OpenVZ if the install is smart
enough could share the pagecache by just hardlinking the equal
binaries.. but AFIK they don't do that normally. For the anon ram they
need this too, they can't solve equal anon ram in userland as it has
to be handled atomically at runtime.

The folks at CERN LHC (was visiting it last month) badly need KSM too
for certain apps they're running that are allocating huge arrays
(aligned) in anon memory and they're mostly equal for all
processes. They tried to work around it with fork but it's not working
well, KSM would solve their problem (it'd solve it both on the same OS
and across OS with kvm as virtualization engine running on the same host).

So I think this is good stuff, and I'd focus discussions and reviews
on the KSM API of /dev/ksm that if merged will be longstanding and
much more troublesome than the rest of the code if changed later (if
we change the ksm internals at any time nobody will notice), and
post-merging we can focus on the external-rmap to make KSM pages first
class citizens in VM terms. But then anything can be changed here, so
suggestions welcome!

Thanks!

2008-11-11 20:09:00

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

Izik Eidus wrote:
> Andrew Morton wrote:
>> On Tue, 11 Nov 2008 21:18:23 +0200
>> Izik Eidus <[email protected]> wrote:
>>
>>
>>>> hm.
>>>>
>>>> There has been the occasional discussion about idenfifying all-zeroes
>>>> pages and scavenging them, repointing them at the zero page. Could
>>>> this infrastructure be used for that? (And how much would we gain
>>>> from
>>>> it?)
>>>>
>>>> [I'm looking for reasons why this is more than a
>>>> muck-up-the-vm-for-kvm
>>>> thing here ;) ]
>>>>
>>
>> ^^ this?
>>
>>
>>> KSM is separate driver , it doesn't change anything in the VM but
>>> adding two helper functions.
>>>
>>
>> What, you mean I should actually read the code? Oh well, OK.
>>
> Andrea i think what is happening here is my fault
Sorry, meant to write here Andrew :-)
> i will try to give here much more information about KSM:
> first the bad things:
> KSM shared pages are right now (we have patch that can change it but
> we want to wait with it) unswappable
> this mean that the entire memory of the guest is swappable but the
> pages that are shared are not.
> (when the pages are splited back by COW they become anonymous again
> with the help of do_wp_page()
> the reason that the pages are not swappable is beacuse the way the
> Linux Rmap is working, this not allow us to create nonlinear anonymous
> pages
> (we dont want to use nonlinear vma for kvm, as it will make swapping
> for kvm very slow)
> the reason that ksm pages need to have nonlinear reverse mapping is
> that for one guest identical page can be found in whole diffrent
> offset than other guest have it
> (this is from the userspace VM point of view)
>
> the rest is quite simple:
> it is walking over the entire guest memory (or only some of it) and
> scan for identical pages using hash table
> it merge the pages into one single write protected page
>
> numbers for ksm is something that i have just for desktops and just
> the numbers i gave you
> what is do know is:
> big overcommit like 300% is possible just when you take into account
> that some of the guest memory will be free
> we are sharing mostly the DLLs/ KERNEL / ZERO pages, for the DLLS and
> KERNEL PAGEs this pages likely will never break
> but ZERO pages will be break when windows will allocate them and will
> come back when windows will free the memory.
> (i wouldnt suggest 300% overcommit for servers workload, beacuse you
> can end up swapping in that case,
> but for desktops after runing in production and passed some seiroes qa
> tress tests it seems like 300% is a real number that can be use)
>
> i just ran test on two fedora 8 guests and got that results (using
> GNOME in both of them)
> 9959 root 15 0 730m 537m 281m S 8 3.4 0:44.28
> kvm
>
> 9956 root 15 0 730m 537m 246m S 4 3.4 0:41.43 kvm
> as you can see the physical sharing was 281mb and 246mb (kernel pages
> are counted as shared)
> there is small lie in this numbers beacuse pages that was shared
> across two guests and was splited by writing from guest number 1 will
> still have 1 refernce count to it
> and will still be kernel page (untill the other guest (num 2) will
> write to it as well)
>
>
> anyway i am willing to make much better testing or everything that
> needed for this patchs to be merged.
> (just tell me what and i will do it)
>
> beside that you should know that patch 4 is not a must, it is just
> nice optimization...
>
> thanks.
>
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2008-11-11 20:38:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 1/4] rmap: add page_wrprotect() function,

On Tue, Nov 11, 2008 at 11:39:48AM -0800, Andrew Morton wrote:
> > +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
> > + int *odirect_sync)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + unsigned long address;
> > + pte_t *pte;
> > + spinlock_t *ptl;
> > + int ret = 0;
> > +
> > + address = vma_address(page, vma);
> > + if (address == -EFAULT)
> > + goto out;
> > +
> > + pte = page_check_address(page, mm, address, &ptl, 0);
> > + if (!pte)
> > + goto out;
> > +
> > + if (pte_write(*pte)) {
> > + pte_t entry;
> > +
> > + if (page_mapcount(page) != page_count(page)) {
> > + *odirect_sync = 0;
> > + goto out_unlock;
> > + }
> > + flush_cache_page(vma, address, pte_pfn(*pte));
> > + entry = ptep_clear_flush_notify(vma, address, pte);
> > + entry = pte_wrprotect(entry);
> > + set_pte_at(mm, address, pte, entry);
> > + }
> > + ret = 1;
> > +
> > +out_unlock:
> > + pte_unmap_unlock(pte, ptl);
> > +out:
> > + return ret;
> > +}
>
> OK. I think. We need to find a way of provoking Hugh to look at it.

Yes. Please focus on the page_mapcount != page_count, which is likely
missing from migrate.c too and in turn page migration currently breaks
O_DIRECT like fork() is buggy as well as discussed here:

http://marc.info/?l=linux-mm&m=122236799302540&w=2
http://marc.info/?l=linux-mm&m=122524107519182&w=2
http://marc.info/?l=linux-mm&m=122581116713932&w=2

The fix implemented in ksm currently handles older kernels (like
rhel/sles) not current mainline that does
get_user_pages_fast. get_user_pages_fast is unfixable yet (see my last
email to Nick above asking for a way to block gup_fast).

The fix proposed by Nick plus my additional fix, should stop the
corruption in fork the same way the above check fixes it for ksm. But
todate gup_fast remains unfixable.

> > +static int page_wrprotect_file(struct page *page, int *odirect_sync)
> > +{
> > + struct address_space *mapping;
> > + struct prio_tree_iter iter;
> > + struct vm_area_struct *vma;
> > + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > + int ret = 0;
> > +
> > + mapping = page_mapping(page);
>
> What pins *mapping in memory? Usually this is done by requiring that
> the caller has locked the page. But no such precondition is documented
> here.

Looks buggy but we never call it from ksm 8). I guess Izik added it
for completeness when preparing for mainline submission. We've the
option to get rid of page_wrprotect_file entirely and only implement a
page_wrprotect_anon! Otherwise we can add a BUG_ON(!PageLocked(page))
before the above page_mapping to protect against truncate.

> > + * set all the ptes pointed to a page as read only,
> > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
> > + * return the number of ptes that were set as read only
> > + * (ptes that were read only before this function was called are couned as well)
> > + */
>
> But it isn't.

What isn't?

> I don't understand this odirect_sync thing. What race? Please expand
> this comment to make the function of odirect_sync more understandable.

I should have answered this one with the above 3 links.

> What do you think about making all this new code dependent upon some
> CONFIG_ switch which CONFIG_KVM can select?

I like that too.

2008-11-11 20:40:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

On Tue, 11 Nov 2008 15:21:40 +0200
Izik Eidus <[email protected]> wrote:

> From: Izik Eidus <[email protected]>
>
> ksm is driver that allow merging identical pages between one or more
> applications in way unvisible to the application that use it.
> pages that are merged are marked as readonly and are COWed when any application
> try to change them.
>
> ksm is working by walking over the memory pages of the applications it scan
> in order to find identical pages.
> it uses an hash table to find in effective way the identical pages.
>
> when ksm find two identical pages, it marked them as readonly and merge them
> into single one page,
> after the pages are marked as readonly and merged into one page, linux
> will treat this pages as normal copy_on_write pages and will fork them
> when write access will happen to them.
>
> ksm scan just memory areas that were registred to be scanned by it.
>

This driver apepars to implement a new userspace interface, in /dev/ksm

Please fully document that interface in the changelog so that we can
review your decisions here. This is by far the most important
consideration - we can change all the code, but interfaces are for
ever.

> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index d38f43f..c1c701f 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -105,4 +105,9 @@ source "drivers/uio/Kconfig"
> source "drivers/xen/Kconfig"
>
> source "drivers/staging/Kconfig"
> +
> +config KSM
> + bool "KSM driver support"
> + help
> + ksm is driver for merging identical pages between applciations

s/is/is a/

"applications" is misspelt.

> endmenu
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> new file mode 100644
> index 0000000..f873502
> --- /dev/null
> +++ b/include/linux/ksm.h
> @@ -0,0 +1,53 @@
> +#ifndef __LINUX_KSM_H
> +#define __LINUX_KSM_H
> +
> +/*
> + * Userspace interface for /dev/ksm - kvm shared memory
> + */
> +
> +#include <asm/types.h>
> +#include <linux/ioctl.h>
> +
> +#define KSM_API_VERSION 1
> +
> +/* for KSM_REGISTER_MEMORY_REGION */
> +struct ksm_memory_region {
> + __u32 npages; /* number of pages to share */
> + __u32 pad;
> + __u64 addr; /* the begining of the virtual address */
> +};
> +
> +struct ksm_user_scan {
> + __u32 pages_to_scan;
> + __u32 max_pages_to_merge;
> +};
> +
> +struct ksm_kthread_info {
> + __u32 sleep; /* number of microsecoends to sleep */
> + __u32 pages_to_scan; /* number of pages to scan */
> + __u32 max_pages_to_merge;
> + __u32 running;
> +};
> +
> +#define KSMIO 0xAB
> +
> +/* ioctls for /dev/ksm */
> +#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
> +#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
> +#define KSM_CREATE_SCAN _IO(KSMIO, 0x02) /* return SCAN fd */
> +#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x03,\
> + struct ksm_kthread_info)
> +#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x04,\
> + struct ksm_kthread_info)
> +
> +
> +/* ioctls for SMA fds */
> +#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
> + struct ksm_memory_region)
> +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21)
> +
> +/* ioctls for SCAN fds */
> +#define KSM_SCAN _IOW(KSMIO, 0x40,\
> + struct ksm_user_scan)
> +
> +#endif

uh-oh, ioctls.

Please do document that interface for us.

> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> index 26433ec..adc2435 100644
> --- a/include/linux/miscdevice.h
> +++ b/include/linux/miscdevice.h
> @@ -30,6 +30,7 @@
> #define TUN_MINOR 200
> #define HPET_MINOR 228
> #define KVM_MINOR 232
> +#define KSM_MINOR 233
>
> struct device;
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 5b5790f..e7f0061 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -222,3 +222,6 @@ config UNEVICTABLE_LRU
>
> config MMU_NOTIFIER
> bool
> +
> +config KSM
> + bool
> diff --git a/mm/Makefile b/mm/Makefile
> index c06b45a..9722afe 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
> obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
> obj-$(CONFIG_SLOB) += slob.o
> obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
> +obj-$(CONFIG_KSM) += ksm.o
> obj-$(CONFIG_SLAB) += slab.o
> obj-$(CONFIG_SLUB) += slub.o
> obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> diff --git a/mm/ksm.c b/mm/ksm.c
> new file mode 100644
> index 0000000..977eb37
> --- /dev/null
> +++ b/mm/ksm.c
> @@ -0,0 +1,1202 @@
> +/*
> + * Memory merging driver for Linux
> + *
> + * This module enables dynamic sharing of identical pages found in different
> + * memory areas, even if they are not shared by fork()
> + *
> + * Copyright (C) 2008 Red Hat, Inc.
> + * Authors:
> + * Izik Eidus
> + * Andrea Arcangeli
> + * Chris Wright
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/mm.h>
> +#include <linux/fs.h>
> +#include <linux/miscdevice.h>
> +#include <linux/vmalloc.h>
> +#include <linux/file.h>
> +#include <linux/mman.h>
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <linux/pagemap.h>
> +#include <linux/vmalloc.h>
> +#include <linux/sched.h>
> +#include <linux/rmap.h>
> +#include <linux/spinlock.h>
> +#include <linux/jhash.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/wait.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/ksm.h>
> +#include <linux/crypto.h>
> +#include <linux/scatterlist.h>
> +#include <linux/random.h>
> +#include <crypto/sha.h>
> +
> +#include <asm/tlbflush.h>
> +
> +MODULE_AUTHOR("Red Hat, Inc.");
> +MODULE_LICENSE("GPL");
> +
> +static int page_hash_size;
> +module_param(page_hash_size, int, 0);
> +MODULE_PARM_DESC(page_hash_size, "Hash table size for the pages checksum");
> +
> +static int rmap_hash_size;
> +module_param(rmap_hash_size, int, 0);
> +MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
> +
> +static int sha1_hash_size;
> +module_param(sha1_hash_size, int, 0);
> +MODULE_PARM_DESC(sha1_hash_size, "Hash table size for the sha1 caching");
> +
> +struct ksm_mem_slot {
> + struct list_head link;
> + struct list_head sma_link;
> + struct mm_struct *mm;
> + unsigned long addr; /* the begining of the virtual address */
> + int npages; /* number of pages to share */
> +};
> +
> +/*
> + * sma - shared memory area, each process have its own sma that contain the
> + * information about the slots that it own
> + */
> +struct ksm_sma {
> + struct list_head sma_slots;
> +};
> +
> +struct ksm_scan {
> + struct ksm_mem_slot *slot_index; /* the slot we are scanning now */
> + int page_index; /* the page inside sma that is now being scanned */
> +};
> +
> +struct page_hash_item {
> + struct hlist_node link;
> + struct mm_struct *mm;
> + unsigned long addr;
> +};
> +
> +struct rmap_item {
> + struct hlist_node link;
> + struct page_hash_item *page_hash_item;
> + unsigned long oldindex;
> +};
> +
> +struct sha1_item {
> + unsigned char sha1val[SHA1_DIGEST_SIZE];
> + unsigned long pfn;
> +};
> +
> +static struct list_head slots;

Use LIST_HEAD(), remove the INIT_LIST_HEAD() from ksm_init().

> +static struct rw_semaphore slots_lock;

Use DECLARE_RWSEM(), remove the init_rwsem() from ksm_init().

> +static DEFINE_MUTEX(sha1_lock);
> +
> +static int npages_hash;
> +static struct hlist_head *page_hash_items;
> +static int nrmaps_hash;
> +static struct hlist_head *rmap_hash;
> +static int nsha1s_hash;
> +static struct sha1_item *sha1_hash;
> +
> +static struct kmem_cache *page_hash_item_cache;
> +static struct kmem_cache *rmap_item_cache;
> +
> +static int kthread_sleep;
> +static int kthread_pages_to_scan;
> +static int kthread_max_npages;
> +static struct ksm_scan kthread_ksm_scan;
> +static int kthread_run;
> +static struct task_struct *kthread;
> +static wait_queue_head_t kthread_wait;

initialise at compile-time.

> +static struct rw_semaphore kthread_lock;

initialise at compile-time.

> +static struct crypto_hash *tfm;
> +static unsigned char hmac_key[SHA1_DIGEST_SIZE];
> +static DEFINE_MUTEX(tfm_mutex);
> +
> +static spinlock_t hash_lock;

initialise at compile-time.

I'd suggest that the above globals have a decent amount of
documentation attached to them - describe what they're for, what is
their role in life. Documenting the data structures and the
relationships between them is an excellent way of helping readers to
understand the overall code.

> +static int ksm_tfm_init(void)
> +{
> + struct crypto_hash *hash;
> + int rc = 0;
> +
> + mutex_lock(&tfm_mutex);
> + if (tfm)
> + goto out;
> +
> + /* Must be called from user context before starting any scanning */
> + hash = crypto_alloc_hash("hmac(sha1)", 0, CRYPTO_ALG_ASYNC);
> + if (IS_ERR(hash)) {
> + rc = PTR_ERR(hash);
> + goto out;
> + }
> +
> + get_random_bytes(hmac_key, sizeof(hmac_key));
> +
> + rc = crypto_hash_setkey(hash, hmac_key, SHA1_DIGEST_SIZE);
> + if (rc) {
> + crypto_free_hash(hash);
> + goto out;
> + }
> + tfm = hash;
> +out:
> + mutex_unlock(&tfm_mutex);
> + return rc;
> +}

Did the Kconfig have the appropriate `depends on' lines for this?

> +static int ksm_slab_init(void)
> +{
> + int ret = 1;
> +
> + page_hash_item_cache = kmem_cache_create("ksm_page_hash_item",
> + sizeof(struct page_hash_item), 0, 0,
> + NULL);
> + if (!page_hash_item_cache)
> + goto out;
> +
> + rmap_item_cache = kmem_cache_create("ksm_rmap_item",
> + sizeof(struct rmap_item), 0, 0,
> + NULL);
> + if (!rmap_item_cache)
> + goto out_free;

You might be able to use KMEM_CACHE() here. I don't like KMEM_CACHE()
much, but let's be consistent here. We change the arguments to
kmem_cache_create() regularly, and KMEM_CACHE helps to reduce the
churn.

> + return 0;
> +
> +out_free:
> + kmem_cache_destroy(page_hash_item_cache);
> +out:
> + return ret;
> +}
> +
> +static void ksm_slab_free(void)
> +{
> + kmem_cache_destroy(rmap_item_cache);
> + kmem_cache_destroy(page_hash_item_cache);
> +}
> +
> +static struct page_hash_item *alloc_page_hash_item(void)
> +{
> + void *obj;
> +
> + obj = kmem_cache_zalloc(page_hash_item_cache, GFP_KERNEL);
> + return (struct page_hash_item *)obj;

Unneeded and undesirable case of void*.

> +}

But this whole function can be reduced to

static inline struct page_hash_item *alloc_page_hash_item(void)
{
return kmem_cache_zalloc(page_hash_item_cache, GFP_KERNEL);
}

> +static void free_page_hash_item(struct page_hash_item *page_hash_item)
> +{
> + kmem_cache_free(page_hash_item_cache, page_hash_item);
> +}

Suggest that this be inlined (the compiler might have decided to do
that anwyay, but it won't hurt to give gcc some help)

> +static struct rmap_item *alloc_rmap_item(void)
> +{
> + void *obj;
> +
> + obj = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> + return (struct rmap_item *)obj;
> +}

static inline struct rmap_item *alloc_rmap_item(void)
{
return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
}

> +static void free_rmap_item(struct rmap_item *rmap_item)
> +{
> + kmem_cache_free(rmap_item_cache, rmap_item);
> +}

Inline this.

> +static inline int PageKsm(struct page *page)
> +{
> + return !PageAnon(page);
> +}

What's going on here? Please fully document the _meaning_ of
PageKsm(page) and then write something which helps readers understand
how and why that identically maps onto !PageAnon.

Please don't just toss stuff like this in there and expect people to
somehow understand what it's doing :(

> +static int page_hash_init(void)
> +{
> + if (!page_hash_size) {
> + struct sysinfo sinfo;
> +
> + si_meminfo(&sinfo);
> + page_hash_size = sinfo.totalram;
> + page_hash_size /= 40;
> + }

This will go wrong on large i386 highmem machines. The hash should be
sized by the amount of lowmem, not by the total amount of memory.

> + npages_hash = page_hash_size;
> + page_hash_items = vmalloc(npages_hash * sizeof(struct page_hash_item));
> + if (!page_hash_items)
> + return 1;
> +
> + memset(page_hash_items, 0, npages_hash * sizeof(struct page_hash_item));
> + return 0;
> +}

Review of this code would be considerably more productive if
page_hash_items had had good documentation. As it stands, I'm left to
vaguely guess at its function.

> +static void page_hash_free(void)
> +{
> + int i;
> + struct hlist_head *bucket;
> + struct hlist_node *node, *n;
> + struct page_hash_item *page_hash_item;
> +
> + for (i = 0; i < npages_hash; ++i) {
> + bucket = &page_hash_items[i];
> + hlist_for_each_entry_safe(page_hash_item, node, n, bucket, link) {
> + hlist_del(&page_hash_item->link);
> + free_page_hash_item(page_hash_item);
> + }
> + }
> + vfree(page_hash_items);
> +}
> +
> +static int rmap_hash_init(void)
> +{
> + if (!rmap_hash_size) {
> + struct sysinfo sinfo;
> +
> + si_meminfo(&sinfo);
> + rmap_hash_size = sinfo.totalram;
> + rmap_hash_size /= 40;
> + }
> + nrmaps_hash = rmap_hash_size;
> + rmap_hash = vmalloc(nrmaps_hash *
> + sizeof(struct hlist_head));

unneeded linebreak.

> + if (!rmap_hash)
> + return 1;
> + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
> + return 0;
> +}

There are more modern ways than si_meminfo() of getting the info you
want. nr_free_buffer_pages() is one, and there are similar functions
around there.

> +static void rmap_hash_free(void)
> +{
> + int i;
> + struct hlist_head *bucket;
> + struct hlist_node *node, *n;
> + struct rmap_item *rmap_item;
> +
> + for (i = 0; i < nrmaps_hash; ++i) {
> + bucket = &rmap_hash[i];
> + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
> + hlist_del(&rmap_item->link);
> + free_rmap_item(rmap_item);
> + }
> + }
> + vfree(rmap_hash);
> +}
> +
> +static int sha1_hash_init(void)
> +{
> + if (!sha1_hash_size) {
> + struct sysinfo sinfo;
> +
> + si_meminfo(&sinfo);
> + sha1_hash_size = sinfo.totalram;
> + sha1_hash_size /= 128;
> + }
> + nsha1s_hash = sha1_hash_size;
> + sha1_hash = vmalloc(nsha1s_hash *
> + sizeof(struct sha1_item));

unneeded linebreak.

> + if (!sha1_hash)
> + return 1;
> + memset(sha1_hash, 0, nsha1s_hash * sizeof(struct sha1_item));
> + return 0;
> +}
> +
> +static void sha1_hash_free(void)
> +{
> + vfree(sha1_hash);
> +}
> +
> +static inline u32 calc_hash_index(struct page *page)
> +{
> + u32 hash;
> + void *addr = kmap_atomic(page, KM_USER0);
> + hash = jhash(addr, PAGE_SIZE, 17);
> + kunmap_atomic(addr, KM_USER0);
> + return hash % npages_hash;
> +}
> +
> +static void remove_page_from_hash(struct mm_struct *mm, unsigned long addr)
> +{
> + struct rmap_item *rmap_item;
> + struct hlist_head *bucket;
> + struct hlist_node *node, *n;
> +
> + bucket = &rmap_hash[addr % nrmaps_hash];
> + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
> + if (mm == rmap_item->page_hash_item->mm &&
> + rmap_item->page_hash_item->addr == addr) {
> + hlist_del(&rmap_item->page_hash_item->link);
> + free_page_hash_item(rmap_item->page_hash_item);
> + hlist_del(&rmap_item->link);
> + free_rmap_item(rmap_item);
> + return;
> + }
> + }
> +}

ah, so we hashed the pages by their contents.

Avoiding races in here would be a challenge. Alas, we are given no
description at all of how these problems have been solved.

> +static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
> + struct ksm_memory_region *mem)
> +{
> + struct ksm_mem_slot *slot;
> + int ret = -1;
> +
> + if (!current->mm)
> + goto out;

Why? Needs a comment.

> + slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
> + if (!slot)
> + goto out;

This duplicates the above `if (!current->mm)' test.

> + slot->mm = get_task_mm(current);
> + slot->addr = mem->addr;
> + slot->npages = mem->npages;
> +
> + down_write(&slots_lock);
> +
> + list_add_tail(&slot->link, &slots);
> + list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
> +
> + up_write(&slots_lock);
> + ret = 0;
> +out:
> + return ret;
> +}
> +
> +static void remove_mm_from_hash(struct mm_struct *mm)
> +{
> + struct ksm_mem_slot *slot;
> + int pages_count = 0;
> +
> + list_for_each_entry(slot, &slots, link)
> + if (slot->mm == mm)
> + break;
> + if (!slot)
> + BUG();

BUG_ON(!slot)

> + spin_lock(&hash_lock);
> + while (pages_count < slot->npages) {
> + remove_page_from_hash(mm, slot->addr + pages_count * PAGE_SIZE);
> + pages_count++;
> + }
> + spin_unlock(&hash_lock);
> + list_del(&slot->link);
> +}
> +
> +static int ksm_sma_ioctl_remove_memory_region(struct ksm_sma *ksm_sma)
> +{
> + struct ksm_mem_slot *slot, *node;
> +
> + down_write(&slots_lock);
> + list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) {
> + remove_mm_from_hash(slot->mm);
> + mmput(slot->mm);
> + list_del(&slot->sma_link);
> + kfree(slot);
> + }
> + up_write(&slots_lock);
> + return 0;
> +}
> +
> +static int ksm_sma_release(struct inode *inode, struct file *filp)
> +{
> + struct ksm_sma *ksm_sma = filp->private_data;
> + int r;
> +
> + r = ksm_sma_ioctl_remove_memory_region(ksm_sma);
> + kfree(ksm_sma);
> + return r;
> +}
> +
> +static long ksm_sma_ioctl(struct file *filp,
> + unsigned int ioctl, unsigned long arg)
> +{
> + struct ksm_sma *sma = filp->private_data;
> + void __user *argp = (void __user *)arg;
> + int r = EINVAL;
> +
> + switch (ioctl) {
> + case KSM_REGISTER_MEMORY_REGION: {
> + struct ksm_memory_region ksm_memory_region;
> +
> + r = -EFAULT;
> + if (copy_from_user(&ksm_memory_region, argp,
> + sizeof ksm_memory_region))
> + goto out;
> + r = ksm_sma_ioctl_register_memory_region(sma,
> + &ksm_memory_region);
> + break;
> + }
> + case KSM_REMOVE_MEMORY_REGION:
> + r = ksm_sma_ioctl_remove_memory_region(sma);
> + break;
> + }
> +
> +out:
> + return r;
> +}
> +
> +static int insert_page_to_hash(struct ksm_scan *ksm_scan,
> + unsigned long hash_index,
> + struct page_hash_item *page_hash_item,
> + struct rmap_item *rmap_item)
> +{
> + struct ksm_mem_slot *slot;
> + struct hlist_head *bucket;
> +
> + slot = ksm_scan->slot_index;
> + page_hash_item->addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
> + page_hash_item->mm = slot->mm;
> + bucket = &page_hash_items[hash_index];
> + hlist_add_head(&page_hash_item->link, bucket);
> +
> + rmap_item->page_hash_item = page_hash_item;
> + rmap_item->oldindex = hash_index;
> + bucket = &rmap_hash[page_hash_item->addr % nrmaps_hash];
> + hlist_add_head(&rmap_item->link, bucket);
> + return 0;
> +}
> +
> +static void update_hash(struct ksm_scan *ksm_scan,
> + unsigned long hash_index)
> +{
> + struct rmap_item *rmap_item;
> + struct ksm_mem_slot *slot;
> + struct hlist_head *bucket;
> + struct hlist_node *node, *n;
> + unsigned long addr;
> +
> + slot = ksm_scan->slot_index;;

double semicolon

> + addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
> + bucket = &rmap_hash[addr % nrmaps_hash];
> + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
> + if (slot->mm == rmap_item->page_hash_item->mm &&
> + rmap_item->page_hash_item->addr == addr) {
> + if (hash_index != rmap_item->oldindex) {
> + hlist_del(&rmap_item->page_hash_item->link);
> + free_page_hash_item(rmap_item->page_hash_item);
> + hlist_del(&rmap_item->link);
> + free_rmap_item(rmap_item);
> + }
> + return;
> + }
> + }
> +}
> +
> +static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page)
> +{
> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> + unsigned long addr;
> +
> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> + if (unlikely(addr < vma->vm_start || addr >= vma->vm_end))
> + return -EFAULT;
> + return addr;
> +}
> +
> +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep = NULL;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map(pmd, addr);
> +out:
> + return ptep;
> +}
> +
> +static int is_present_pte(struct mm_struct *mm, unsigned long addr)
> +{
> + pte_t *ptep;
> +
> + ptep = get_pte(mm, addr);
> + if (!ptep)
> + return 0;
> +
> + if (pte_present(*ptep))
> + return 1;
> + return 0;
> +}

I'm all lost now. Are we dealing with only anonymous vma's here? Or
can they be file-backed as well or what?

A meta-question is: why did I get lost reading this code?

> +#define PAGECMP_OFFSET 128
> +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
> +/* hash the page */
> +static void page_hash(struct page *page, unsigned char *digest)
> +{
> + struct scatterlist sg;
> + struct hash_desc desc;
> +
> + sg_init_table(&sg, 1);
> + sg_set_page(&sg, page, PAGEHASH_SIZE, 0);
> + desc.tfm = tfm;
> + desc.flags = 0;
> + crypto_hash_digest(&desc, &sg, PAGEHASH_SIZE, digest);
> +}
> +
> +/* pages_identical
> + * calculate sha1 hash of each page, compare results,
> + * and return 1 if identical, 0 otherwise.
> + */

ooh, a comment!

> +static int pages_identical(struct page *oldpage, struct page *newpage, int new)
> +{
> + int r;
> + unsigned char old_digest[SHA1_DIGEST_SIZE];
> + struct sha1_item *sha1_item;
> +
> + page_hash(oldpage, old_digest);
> + /*
> + * If new = 1, it is never safe to use the sha1 value that is
> + * inside the cache, the reason is that the page can be released
> + * and then recreated and have diffrent sha1 value.
> + * (swapping as for now is not an issue, beacuse KsmPages cannot be

"because"

> + * swapped)
> + */
> + if (new) {
> + mutex_lock(&sha1_lock);
> + sha1_item = &sha1_hash[page_to_pfn(newpage) % nsha1s_hash];
> + page_hash(newpage, sha1_item->sha1val);
> + sha1_item->pfn = page_to_pfn(newpage);
> + r = !memcmp(old_digest, sha1_item->sha1val, SHA1_DIGEST_SIZE);
> + mutex_unlock(&sha1_lock);
> + } else {
> + mutex_lock(&sha1_lock);
> + sha1_item = &sha1_hash[page_to_pfn(newpage) % nsha1s_hash];
> + if (sha1_item->pfn != page_to_pfn(newpage)) {
> + page_hash(newpage, sha1_item->sha1val);
> + sha1_item->pfn = page_to_pfn(newpage);
> + }
> + r = !memcmp(old_digest, sha1_item->sha1val, SHA1_DIGEST_SIZE);
> + mutex_unlock(&sha1_lock);
> + }
> + if (PAGECMP_OFFSET && r) {
> + char *old_addr, *new_addr;
> + old_addr = kmap_atomic(oldpage, KM_USER0);
> + new_addr = kmap_atomic(newpage, KM_USER1);
> + r = !memcmp(old_addr+PAGECMP_OFFSET, new_addr+PAGECMP_OFFSET, PAGE_SIZE-PAGECMP_OFFSET);
> + kunmap_atomic(old_addr, KM_USER0);
> + kunmap_atomic(new_addr, KM_USER1);
> + }
> + return r;
> +}
> +
> +/*
> + * try_to_merge_one_page - take two pages and merge them into one
> + * note:
> + * oldpage should be anon page while newpage should be file mapped page
> + */

The function's name implies that the function can fail. But the
function's description implies that it always succeeds.

> +static int try_to_merge_one_page(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + struct page *oldpage,
> + struct page *newpage,
> + pgprot_t newprot,
> + int new)
> +{
> + int ret = 0;
> + int odirect_sync;
> + unsigned long page_addr_in_vma;
> + pte_t orig_pte, *orig_ptep;
> +
> + get_page(newpage);
> + get_page(oldpage);
> +
> + down_read(&mm->mmap_sem);
> +
> + page_addr_in_vma = addr_in_vma(vma, oldpage);
> + if (page_addr_in_vma == -EFAULT)
> + goto out_unlock;
> +
> + orig_ptep = get_pte(mm, page_addr_in_vma);
> + if (!orig_ptep)
> + goto out_unlock;
> + orig_pte = *orig_ptep;
> + if (!pte_present(orig_pte))
> + goto out_unlock;
> + if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
> + goto out_unlock;
> + /*
> + * page_wrprotect check if the page is swapped or in swap cache,
> + * in the future we might want to run here if_present_pte and then
> + * swap_free
> + */
> + if (!page_wrprotect(oldpage, &odirect_sync))
> + goto out_unlock;
> + if (!odirect_sync)
> + goto out_unlock;
> +
> + orig_pte = pte_wrprotect(orig_pte);
> +
> + ret = 1;
> + if (pages_identical(oldpage, newpage, new))
> + ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
> +
> + if (!ret)
> + ret = 1;
> + else
> + ret = 0;

ret = !ret?

> +out_unlock:
> + up_read(&mm->mmap_sem);
> + put_page(oldpage);
> + put_page(newpage);
> + return ret;
> +}

By now I've lost track of what the return value of this function is on
success versus failure and what the semantic meaning of "failure" is.
And none of this was documented.

It is usual for kernel fucntions to return a -ve errno on failure.

> +static int try_to_merge_two_pages(struct mm_struct *mm1, struct page *page1,
> + struct mm_struct *mm2, struct page *page2,
> + unsigned long addr1, unsigned long addr2)
> +{
> + struct vm_area_struct *vma;
> + pgprot_t prot;
> + int ret = 0;
> +
> + /*
> + * If page2 isn't shared (it isn't PageKsm) we have to allocate a new
> + * file mapped page and make the two ptes of mm1(page1) and mm2(page2)
> + * point to it. If page2 is shared, we can just make the pte of
> + * mm1(page1) point to page2
> + */
> + if (PageKsm(page2)) {
> + vma = find_vma(mm1, addr1);
> + if (!vma)
> + return ret;
> + prot = vma->vm_page_prot;
> + pgprot_val(prot) &= ~VM_WRITE;
> + ret = try_to_merge_one_page(mm1, vma, page1, page2, prot, 0);
> + } else {
> + struct page *kpage;
> +
> + kpage = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);

Stray whitepace.

Replace with GFP_HIGHUSER.

> + if (!kpage)
> + return ret;

hm, silent failure. What are the consequences of this? (WOuld prefer
that this question be answered via code comment!)

> + vma = find_vma(mm1, addr1);
> + if (!vma) {
> + page_cache_release(kpage);

put_page() is the preferred way of releasing non-pagecache pages.

> + return ret;
> + }
> + prot = vma->vm_page_prot;
> + pgprot_val(prot) &= ~VM_WRITE;
> +
> + copy_highpage(kpage, page1);

Are you sure we didn't need copy_user_highpage()?

> + ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot, 1);
> +
> + if (ret) {
> + vma = find_vma(mm2, addr2);
> + if (!vma) {
> + page_cache_release(kpage);
> + return ret;
> + }
> +
> + prot = vma->vm_page_prot;
> + pgprot_val(prot) &= ~VM_WRITE;
> +
> + ret = try_to_merge_one_page(mm2, vma, page2, kpage,
> + prot, 0);
> + }
> + page_cache_release(kpage);

put_page(), please.

> + }
> + return ret;

please document the return value.

> +}
> +
> +static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
> +{
> + struct hlist_head *bucket;
> + struct hlist_node *node, *n;
> + struct page_hash_item *page_hash_item;
> + struct ksm_mem_slot *slot;
> + unsigned long hash_index;
> + unsigned long addr;
> + int ret = 0;
> + int used = 0;
> +
> + hash_index = calc_hash_index(page);
> + bucket = &page_hash_items[hash_index];
> +
> + slot = ksm_scan->slot_index;
> + addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;

Hang on, ->page_index is an `int'. This code will blow up at 16TB (and
maybe even 8TB, when things go negative).

> + spin_lock(&hash_lock);
> + /*
> + * update_hash must be called every time because there is a chance
> + * that the data in the page has changed since the page was inserted
> + * into the hash table to avoid inserting the page more than once.
> + */
> + update_hash(ksm_scan, hash_index);
> + spin_unlock(&hash_lock);
> +
> + hlist_for_each_entry_safe(page_hash_item, node, n, bucket, link) {
> + int npages;
> + struct page *hash_page[1];
> +
> + if (slot->mm == page_hash_item->mm && addr == page_hash_item->addr) {
> + used = 1;
> + continue;
> + }
> +
> + down_read(&page_hash_item->mm->mmap_sem);
> + /*
> + * If the page is swapped out or in swap cache we don't want to
> + * scan it (it is just for performance).
> + */
> + if (!is_present_pte(page_hash_item->mm, page_hash_item->addr)) {
> + up_read(&page_hash_item->mm->mmap_sem);
> + continue;
> + }
> + npages = get_user_pages(current, page_hash_item->mm,
> + page_hash_item->addr,
> + 1, 0, 0, hash_page, NULL);
> + up_read(&page_hash_item->mm->mmap_sem);
> + if (npages != 1)
> + break;
> +
> + /* Recalculate the page's hash index in case it has changed. */
> + if (calc_hash_index(hash_page[0]) == hash_index) {
> +
> + ret = try_to_merge_two_pages(slot->mm, page,
> + page_hash_item->mm,
> + hash_page[0], addr,
> + page_hash_item->addr);
> + if (ret) {
> + page_cache_release(hash_page[0]);
> + return ret;
> + }
> + }
> + page_cache_release(hash_page[0]);
> + }
> + /* If node is NULL and used=0, the page is not in the hash table. */
> + if (!node && !used) {
> + struct page_hash_item *page_hash_item;
> + struct rmap_item *rmap_item;
> +
> + page_hash_item = alloc_page_hash_item();
> + if (!page_hash_item)
> + return ret;
> +
> + rmap_item = alloc_rmap_item();
> + if (!rmap_item) {
> + free_page_hash_item(page_hash_item);
> + return ret;
> + }
> +
> + spin_lock(&hash_lock);
> + update_hash(ksm_scan, hash_index);
> + insert_page_to_hash(ksm_scan, hash_index, page_hash_item, rmap_item);
> + spin_unlock(&hash_lock);
> + }
> +
> + return ret;
> +}
> +
> +/* return 1 - no slots registered, nothing to be done */
> +static int scan_get_next_index(struct ksm_scan *ksm_scan, int nscan)
> +{
> + struct ksm_mem_slot *slot;
> +
> + if (list_empty(&slots))
> + return 1;
> +
> + slot = ksm_scan->slot_index;
> +
> + /* Are there pages left in this slot to scan? */
> + if ((slot->npages - ksm_scan->page_index - nscan) > 0) {
> + ksm_scan->page_index += nscan;
> + return 0;
> + }
> +
> + list_for_each_entry_from(slot, &slots, link) {
> + if (slot == ksm_scan->slot_index)
> + continue;
> + ksm_scan->page_index = 0;
> + ksm_scan->slot_index = slot;
> + return 0;
> + }
> +
> + /* look like we finished scanning the whole memory, starting again */
> + ksm_scan->page_index = 0;
> + ksm_scan->slot_index = list_first_entry(&slots,
> + struct ksm_mem_slot, link);
> + return 0;
> +}
> +
> +/*
> + * update slot_index so it point to vaild data, it is possible that by
> + * the time we are here the data that ksm_scan was pointed to was released
> + * so we have to call this function every time after taking the slots_lock
> + */
> +static void scan_update_old_index(struct ksm_scan *ksm_scan)
> +{
> + struct ksm_mem_slot *slot;
> +
> + if (list_empty(&slots))
> + return;
> +
> + list_for_each_entry(slot, &slots, link) {
> + if (ksm_scan->slot_index == slot)
> + return;
> + }
> +
> + ksm_scan->slot_index = list_first_entry(&slots,
> + struct ksm_mem_slot, link);
> + ksm_scan->page_index = 0;
> +}
> +
> +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
> + int max_npages)
> +{
> + struct ksm_mem_slot *slot;
> + struct page *page[1];
> + int val;
> + int ret = 0;
> +
> + down_read(&slots_lock);
> +
> + scan_update_old_index(ksm_scan);
> +
> + while (scan_npages > 0 && max_npages > 0) {
> + if (scan_get_next_index(ksm_scan, 1)) {
> + /* we have no slots, another ret value should be used */
> + goto out;
> + }
> +
> + slot = ksm_scan->slot_index;
> + down_read(&slot->mm->mmap_sem);
> + /*
> + * If the page is swapped out or in swap cache, we don't want to
> + * scan it (it is just for performance).
> + */
> + if (is_present_pte(slot->mm, slot->addr +
> + ksm_scan->page_index * PAGE_SIZE)) {
> + val = get_user_pages(current, slot->mm, slot->addr +
> + ksm_scan->page_index * PAGE_SIZE ,
> + 1, 0, 0, page, NULL);
> + up_read(&slot->mm->mmap_sem);
> + if (val == 1) {
> + if (!PageKsm(page[0]))
> + max_npages -=
> + cmp_and_merge_page(ksm_scan, page[0]);
> + page_cache_release(page[0]);
> + }
> + } else
> + up_read(&slot->mm->mmap_sem);
> + scan_npages--;
> + }
> +
> + scan_get_next_index(ksm_scan, 1);
> +out:
> + up_read(&slots_lock);
> + return ret;
> +}

I'm rather lost now. The code sorely needs documentation :(

> +static int ksm_scan_ioctl_start(struct ksm_scan *ksm_scan,
> + struct ksm_user_scan *scan)
> +{
> + return ksm_scan_start(ksm_scan, scan->pages_to_scan,
> + scan->max_pages_to_merge);
> +}
> +
> +static int ksm_scan_release(struct inode *inode, struct file *filp)
> +{
> + struct ksm_scan *ksm_scan = filp->private_data;
> +
> + kfree(ksm_scan);
> + return 0;
> +}
> +
> +static long ksm_scan_ioctl(struct file *filp,
> + unsigned int ioctl, unsigned long arg)
> +{
> + struct ksm_scan *ksm_scan = filp->private_data;
> + void __user *argp = (void __user *)arg;
> + int r = EINVAL;
> +
> + switch (ioctl) {
> + case KSM_SCAN: {
> + struct ksm_user_scan scan;
> +
> + r = -EFAULT;
> + if (copy_from_user(&scan, argp,
> + sizeof(struct ksm_user_scan)))
> + break;
> +
> + r = ksm_scan_ioctl_start(ksm_scan, &scan);
> + }
> + }
> + return r;
> +}
> +
> +static struct file_operations ksm_sma_fops = {
> + .release = ksm_sma_release,
> + .unlocked_ioctl = ksm_sma_ioctl,
> + .compat_ioctl = ksm_sma_ioctl,
> +};
> +
> +static int ksm_dev_ioctl_create_shared_memory_area(void)
> +{
> + int fd = -1;
> + struct ksm_sma *ksm_sma;
> +
> + ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
> + if (!ksm_sma)
> + goto out;
> +
> + INIT_LIST_HEAD(&ksm_sma->sma_slots);
> +
> + fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
> + if (fd < 0)
> + goto out_free;
> +
> + return fd;
> +out_free:
> + kfree(ksm_sma);
> +out:
> + return fd;
> +}

The term "shared memory" has specific meanings in Linux/Unix. I'm
suspecting that its use here was inappropriate but I have insufficient
information to be able to suggest alternatives.

> +static struct file_operations ksm_scan_fops = {
> + .release = ksm_scan_release,
> + .unlocked_ioctl = ksm_scan_ioctl,
> + .compat_ioctl = ksm_scan_ioctl,
> +};
> +
> +static struct ksm_scan *ksm_scan_create(void)
> +{
> + struct ksm_scan *ksm_scan;
> +
> + ksm_scan = kzalloc(sizeof(struct ksm_scan), GFP_KERNEL);
> + return ksm_scan;
> +}

Convert to one-liner.

> +static int ksm_dev_ioctl_create_scan(void)
> +{
> + int fd;
> + struct ksm_scan *ksm_scan;
> +
> + if (!tfm) {
> + fd = ksm_tfm_init();
> + if (fd)
> + goto out;
> + }
> +
> + fd = -ENOMEM;
> + ksm_scan = ksm_scan_create();
> + if (!ksm_scan)
> + goto out;
> +
> + fd = anon_inode_getfd("ksm-scan", &ksm_scan_fops, ksm_scan, 0);
> + if (fd < 0)
> + goto out_free;
> + return fd;
> +
> +out_free:
> + kfree(ksm_scan);
> +out:
> + return fd;
> +}

What the heck is all this doing?

> +static int ksm_dev_ioctl_start_stop_kthread(struct ksm_kthread_info *info)
> +{
> + int rc = 0;
> +
> + /* Make sure crypto tfm is initialized before starting scanning */
> + if (info->running && !tfm) {
> + rc = ksm_tfm_init();
> + if (rc)
> + goto out;
> + }
> +
> + down_write(&kthread_lock);
> +
> + kthread_sleep = info->sleep;
> + kthread_pages_to_scan = info->pages_to_scan;
> + kthread_max_npages = info->max_pages_to_merge;
> + kthread_run = info->running;
> +
> + up_write(&kthread_lock);
> +
> + if (kthread_run)
> + wake_up_interruptible(&kthread_wait);
> +
> +out:
> + return rc;
> +}

We have a well-known kernel-wide macro called "kthread_run()". The
creation of a file-wide global of the same name is unfortunate.

> +static int ksm_dev_ioctl_get_info_kthread(struct ksm_kthread_info *info)
> +{
> + down_read(&kthread_lock);
> +
> + info->sleep = kthread_sleep;
> + info->pages_to_scan = kthread_pages_to_scan;
> + info->max_pages_to_merge = kthread_max_npages;
> + info->running = kthread_run;
> +
> + up_read(&kthread_lock);
> + return 0;
> +}
> +
> +static long ksm_dev_ioctl(struct file *filp,
> + unsigned int ioctl, unsigned long arg)
> +{
> + void __user *argp = (void __user *)arg;
> + long r = -EINVAL;
> +
> + switch (ioctl) {
> + case KSM_GET_API_VERSION:
> + r = KSM_API_VERSION;
> + break;
> + case KSM_CREATE_SHARED_MEMORY_AREA:
> + r = ksm_dev_ioctl_create_shared_memory_area();
> + break;
> + case KSM_CREATE_SCAN:
> + r = ksm_dev_ioctl_create_scan();
> + break;
> + case KSM_START_STOP_KTHREAD: {
> + struct ksm_kthread_info info;
> +
> + r = -EFAULT;
> + if (copy_from_user(&info, argp,
> + sizeof(struct ksm_kthread_info)))
> + break;
> +
> + r = ksm_dev_ioctl_start_stop_kthread(&info);
> + break;
> + }
> + case KSM_GET_INFO_KTHREAD: {
> + struct ksm_kthread_info info;
> +
> + r = ksm_dev_ioctl_get_info_kthread(&info);
> + if (r)
> + break;
> + r = -EFAULT;
> + if (copy_to_user(argp, &info,
> + sizeof(struct ksm_kthread_info)))
> + break;
> + r = 0;
> + }
> + default:
> + return r;
> + }
> + return r;
> +}

Interface should be documented.

> +static int ksm_dev_open(struct inode *inode, struct file *filp)
> +{
> + try_module_get(THIS_MODULE);
> + return 0;
> +}
> +
> +static int ksm_dev_release(struct inode *inode, struct file *filp)
> +{
> + module_put(THIS_MODULE);
> + return 0;
> +}

What's going on here?

> +static struct file_operations ksm_chardev_ops = {
> + .open = ksm_dev_open,
> + .release = ksm_dev_release,
> + .unlocked_ioctl = ksm_dev_ioctl,
> + .compat_ioctl = ksm_dev_ioctl,
> +};
> +
> +static struct miscdevice ksm_dev = {
> + KSM_MINOR,
> + "ksm",
> + &ksm_chardev_ops,
> +};
> +
> +int kthread_ksm_scan_thread(void *nothing)
> +{
> + while (!kthread_should_stop()) {
> + if(kthread_run) {

checkpatch?

> + down_read(&kthread_lock);
> + ksm_scan_start(&kthread_ksm_scan,
> + kthread_pages_to_scan,
> + kthread_max_npages);
> + up_read(&kthread_lock);
> + schedule_timeout_interruptible(usecs_to_jiffies(kthread_sleep));
> + } else
> + wait_event_interruptible(kthread_wait, kthread_run);
> + }
> + return 0;
> +}
> +
> +static int __init ksm_init(void)
> +{
> + int r = 1;
> +
> + r = ksm_slab_init();
> + if (r)
> + goto out;
> +
> + r = page_hash_init();
> + if (r)
> + goto out_free1;
> +
> + r = rmap_hash_init();
> + if (r)
> + goto out_free2;
> +
> + r = sha1_hash_init();
> + if (r)
> + goto out_free3;
> +
> + INIT_LIST_HEAD(&slots);
> + init_rwsem(&slots_lock);
> + spin_lock_init(&hash_lock);
> + init_rwsem(&kthread_lock);
> + init_waitqueue_head(&kthread_wait);
> +
> + kthread = kthread_run(kthread_ksm_scan_thread, NULL, "kksmd");
> + if (IS_ERR(kthread)) {
> + printk(KERN_ERR "ksm: creating kthread failed\n");
> + goto out_free4;
> + }
> +
> + r = misc_register(&ksm_dev);
> + if (r) {
> + printk(KERN_ERR "ksm: misc device register failed\n");
> + goto out_free5;
> + }
> +
> + printk(KERN_WARNING "ksm loaded\n");
> + return 0;
> +
> +out_free5:
> + kthread_stop(kthread);
> +out_free4:
> + sha1_hash_free();
> +out_free3:
> + rmap_hash_free();
> +out_free2:
> + page_hash_free();
> +out_free1:
> + ksm_slab_free();
> +out:
> + return r;
> +}
> +
> +static void __exit ksm_exit(void)
> +{
> + misc_deregister(&ksm_dev);
> + kthread_run = 1;
> + kthread_stop(kthread);
> + if (tfm)
> + crypto_free_hash(tfm);
> + sha1_hash_free();
> + rmap_hash_free();
> + page_hash_free();
> + ksm_slab_free();
> +}
> +
> +module_init(ksm_init)
> +module_exit(ksm_exit)

Generally: this review was rather a waste of your time and of mine
because the code is inadequately documented.

2008-11-11 20:59:17

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Andrew Morton wrote:
> On Tue, 11 Nov 2008 15:21:39 +0200
> Izik Eidus <[email protected]> wrote:
>
>
>> From: Izik Eidus <[email protected]>
>>
>> this function is needed in cases you want to change the userspace
>> virtual mapping into diffrent physical page,
>>
>
> Not sure that I understand that description. We want to replace a live
> page in an anonymous VMA with a different one?
>
> It looks that way.
>
yes but it replace it with kernel allocated page.
> page migration already kinda does that. Is there common ground?
>
>
page migration as far as i saw cant migrate anonymous page into kernel page.
if you want we can change page_migration to do that, but i thought you
will rather have ksm changes separate.

>> KSM need this for merging the identical pages.
>>
>> this function is working by removing the oldpage from the rmap and
>> calling put_page on it, and by setting the virtual address pte
>> to point into the new page.
>> (note that the new page (the page that we change the pte to map to)
>> cannot be anonymous page)
>>
>>
>
> I don't understand the restrictions on anonymous pages. Please expand
> the changelog so that reviewers can understand the reasons for this
> restriction.
>
the page that we are going to map into the pte going to be nonlinear
from the point of view of anon-vma
therefore it cannot be anonymous.

>
>
>> ---
>> include/linux/mm.h | 3 ++
>> mm/memory.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 71 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ffee2f7..4da7fa8 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>> int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>> unsigned long pfn);
>>
>> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
>> + struct page *newpage, pte_t orig_pte, pgprot_t prot);
>> +
>> struct page *follow_page(struct vm_area_struct *, unsigned long address,
>> unsigned int foll_flags);
>> #define FOLL_WRITE 0x01 /* check pte is writable */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 164951c..b2c542c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>> }
>> EXPORT_SYMBOL(vm_insert_mixed);
>>
>> +/**
>> + * replace _page - replace the pte mapping related to vm area between two pages
>>
>
> s/replace _page/replace_page/
>
>
>> + * (from oldpage to newpage)
>> + * NOTE: you should take into consideration the impact on the VM when replacing
>> + * anonymous pages with kernel non swappable pages.
>> + */
>>
>
> This _is_ a kerneldoc comment, but kernedoc comments conventionally
> document the arguments and the return value also.
>
>
>> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
>> + struct page *newpage, pte_t orig_pte, pgprot_t prot)
>> +{
>> + struct mm_struct *mm = vma->vm_mm;
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *ptep;
>> + spinlock_t *ptl;
>> + unsigned long addr;
>> + int ret;
>> +
>> + BUG_ON(PageAnon(newpage));
>> +
>> + ret = -EFAULT;
>> + addr = page_address_in_vma(oldpage, vma);
>> + if (addr == -EFAULT)
>> + goto out;
>> +
>> + pgd = pgd_offset(mm, addr);
>> + if (!pgd_present(*pgd))
>> + goto out;
>> +
>> + pud = pud_offset(pgd, addr);
>> + if (!pud_present(*pud))
>> + goto out;
>> +
>> + pmd = pmd_offset(pud, addr);
>> + if (!pmd_present(*pmd))
>> + goto out;
>> +
>> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> + if (!ptep)
>> + goto out;
>> +
>> + if (!pte_same(*ptep, orig_pte)) {
>> + pte_unmap_unlock(ptep, ptl);
>> + goto out;
>> + }
>> +
>> + ret = 0;
>> + get_page(newpage);
>> + page_add_file_rmap(newpage);
>> +
>> + flush_cache_page(vma, addr, pte_pfn(*ptep));
>> + ptep_clear_flush(vma, addr, ptep);
>> + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
>> +
>> + page_remove_rmap(oldpage, vma);
>> + if (PageAnon(oldpage)) {
>> + dec_mm_counter(mm, anon_rss);
>> + inc_mm_counter(mm, file_rss);
>> + }
>> + put_page(oldpage);
>> +
>> + pte_unmap_unlock(ptep, ptl);
>> +
>> +out:
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(replace_page);
>>
>
> Again, we could make the presence of this code selectable by subsystems
> which want it.
>
> ]

sure.

2008-11-11 21:02:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/4] rmap: add page_wrprotect() function,

On Tue, 11 Nov 2008 21:38:06 +0100
Andrea Arcangeli <[email protected]> wrote:

> > > + * set all the ptes pointed to a page as read only,
> > > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
> > > + * return the number of ptes that were set as read only
> > > + * (ptes that were read only before this function was called are couned as well)
> > > + */
> >
> > But it isn't.
>
> What isn't?

This code comment had the kerneldoc marker ("/**") but it isn't a
kerneldoc comment.

> > I don't understand this odirect_sync thing. What race? Please expand
> > this comment to make the function of odirect_sync more understandable.
>
> I should have answered this one with the above 3 links.

OK, well can we please update the code so these things are clearer.

(It's a permanent problem I have. I ask "what is this", but I really
mean "the code should be changed so that readers will know what this is")

> > What do you think about making all this new code dependent upon some
> > CONFIG_ switch which CONFIG_KVM can select?
>
> I like that too.

2008-11-11 21:07:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 11:45:55AM -0800, Andrew Morton wrote:
> page migration already kinda does that. Is there common ground?

btw, page_migration likely is buggy w.r.t. o_direct too (and now
unfixable with gup_fast until the 2.4 brlock is added around it or
similar) if it does the same thing but without any page_mapcount vs
page_count check.

page_migration does too much for us, so us calling into migrate.c may
not be ideal. It has to convert a fresh page to a VM page. In KSM we
don't convert the newpage to be a VM page, we just replace the anon
page with another page. The new page in the KSM case is not a page
known by the VM, not in the lru etc...

The way to go could be to change the page_migration to use
replace_page (or __replace_page if called in some shared inner-lock
context) after preparing the newpage to be a regular VM page. If we
can do that, migrate.c will get the o_direct race fixed too for free.

> I don't understand the restrictions on anonymous pages. Please expand
> the changelog so that reviewers can understand the reasons for this
> restriction.

That's a good question but I don't have a definitive answer as I
didn't account for exactly how complex it would be yet.

Suppose a file has 0-4k equal to 4k-8k. A MAP_SHARED that maps both
pages with the same physical page sounds tricky. The shared pages are
KSM owned and invisible to the VM (later could be made visible with an
external-rmap), but pagecache can't be just KSM owned, they at least
must go in pagecache radix tree so that requires patching the radix
tree etc... All things we don't need for anon ram. I think first thing
to extend is to add external-rmap and make ksm swappable, then we can
think if something can be done about MAP_SHARED/MAP_PRIVATE too
(MAP_PRIVATE post-COW already works, the question is pre-COW). One
excuse of why I didn't think too much about it yet is because in
effect KSM it's mostly useful to the anon ram, the pagecache can be
solved in userland with hardlinks with openvz, and shared libs already
do all they can to share .text (the not-shared post dynamic link
should be covered by ksm already).

> Again, we could make the presence of this code selectable by subsystems
> which want it.

Indeed!!

2008-11-11 21:17:22

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 1/4] rmap: add page_wrprotect() function,

On Tue, Nov 11, 2008 at 01:01:49PM -0800, Andrew Morton wrote:
> On Tue, 11 Nov 2008 21:38:06 +0100
> Andrea Arcangeli <[email protected]> wrote:
>
> > > > + * set all the ptes pointed to a page as read only,
> > > > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
> > > > + * return the number of ptes that were set as read only
> > > > + * (ptes that were read only before this function was called are couned as well)
> > > > + */
> > >
> > > But it isn't.
> >
> > What isn't?
>
> This code comment had the kerneldoc marker ("/**") but it isn't a
> kerneldoc comment.

8) never mind, I thought it was referred to the quoted comment, like
if the comment pretended something the function wasn't doing.

> OK, well can we please update the code so these things are clearer.

Sure. Let's say this o_direct fix was done after confirmation that
this was the agreed fix to solve those kind of problems (same fix that
fork will need as in the third link).

> (It's a permanent problem I have. I ask "what is this", but I really
> mean "the code should be changed so that readers will know what this is")

Agreed, this deserves much more commentary. But not much effort was
done in this area, because fork (and likely migrate) has the same race
and it isn't fixed yet so this is still a work-in-progress area. The
fix has to be the same for all places that have this bug, and we have
not agreed on a way to fix gup_fast yet (all I provided as suggestion
that will work fine is brlock but surely Nick will try to find a way
that remains non-blocking, which is the core of the problem, if it
can't block, we've to use RCU but we can't wait there as far as I can
tell as all sort of synchronous memory regions are involved).

I think once the problem will get fixed and patches will go in
mainline, more docs will emerge (I hope ;). We'll most certainly need
changes to cover gup_fast (including if we use brlock, the read side
of the lock will have to be taken around it). This was a fix we did at
the last minute just to reflect the latest status.

I CC Nick to this thread btw.

2008-11-11 21:22:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008, Izik Eidus wrote:

> yes but it replace it with kernel allocated page.
> > page migration already kinda does that. Is there common ground?
> >
> >
> page migration as far as i saw cant migrate anonymous page into kernel page.
> if you want we can change page_migration to do that, but i thought you will
> rather have ksm changes separate.

What do you mean by kernel page? The kernel can allocate a page and then
point a user space pte to it. That is how page migration works.

2008-11-11 21:24:40

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Christoph Lameter wrote:
>> page migration as far as i saw cant migrate anonymous page into kernel page.
>> if you want we can change page_migration to do that, but i thought you will
>> rather have ksm changes separate.
>>
>
> What do you mean by kernel page? The kernel can allocate a page and then
> point a user space pte to it. That is how page migration works.
>
i mean filebacked page (!AnonPage())
ksm need the pte inside the vma to point from anonymous page into
filebacked page
can migrate.c do it without changes?

2008-11-11 21:27:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> btw, page_migration likely is buggy w.r.t. o_direct too (and now
> unfixable with gup_fast until the 2.4 brlock is added around it or
> similar) if it does the same thing but without any page_mapcount vs
> page_count check.

Details please?

> page_migration does too much for us, so us calling into migrate.c may
> not be ideal. It has to convert a fresh page to a VM page. In KSM we
> don't convert the newpage to be a VM page, we just replace the anon
> page with another page. The new page in the KSM case is not a page
> known by the VM, not in the lru etc...

A VM page as opposed to pages not in the VM? ???

page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

> The way to go could be to change the page_migration to use
> replace_page (or __replace_page if called in some shared inner-lock
> context) after preparing the newpage to be a regular VM page. If we
> can do that, migrate.c will get the o_direct race fixed too for free.

Define a regular VM page? A page on the LRU?

2008-11-11 21:31:58

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008, Izik Eidus wrote:

> > What do you mean by kernel page? The kernel can allocate a page and then
> > point a user space pte to it. That is how page migration works.
> >
> i mean filebacked page (!AnonPage())

ok.

> ksm need the pte inside the vma to point from anonymous page into filebacked
> page
> can migrate.c do it without changes?

So change anonymous to filebacked page?

Currently page migration assumes that the page will continue to be part
of the existing file or anon vma.

What you want sounds like assigning a swap pte to an anonymous page? That
way a anon page gains membership in a file backed mapping.

2008-11-11 21:36:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 03:21:45PM -0600, Christoph Lameter wrote:
> What do you mean by kernel page? The kernel can allocate a page and then
> point a user space pte to it. That is how page migration works.

Just to make an example, remove_migration_pte adds the page back to
rmap layer. We can't do that right now as rmap for the ksm pages will
be built inside ksm, or alternatively rmap.c will have to learn to
handle nonlinear anon-vma.

Migration simply migrates the page. The new page is identical to the
original one, just backed by different physical memory.

For us the new page is an entirely different beast that we build
ourself (we can't let migrate.c to pretend dealing with the newpage
like if it resembled the old page like it's doing now).

We replace a linear anon page with something that isn't an anonymous
page at all right now (in the future it may become a nonlinear anon
page if VM learns about it, or still an unknown page
external-rmappable if we go the external-rmap way).

There's clearly something to share, but the replace_page seem to be
the one that could be called from migrate.c. What is different is that
we don't need the migration pte placeholder, we never block releasing
locks, all atomic with pte wrprotected, and a final pte_same check
under PT lock before we replace the page. There isn't a whole lot to
share after all, but surely it'd be nice to share if we can. Us
calling into migrate.c isn't feasible right now without some
significant change to migrate.c where it would be misplaced IMHO as to
share we'd need migrate.c to call into VM core instead.

2008-11-11 21:39:08

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Christoph Lameter wrote:
>
>
> Currently page migration assumes that the page will continue to be part
> of the existing file or anon vma.
>

exactly, and ksm really need it to get out of the existing anon vma!

> What you want sounds like assigning a swap pte to an anonymous page? That
> way a anon page gains membership in a file backed mapping.
>
>
>

No, i want pte that is found inside vma and point to anonymous page,
will stop point into the anonymous page
and will point to a whole diffrent page that i chose (for ksm it is
needed because this way we are mapping alot
of ptes into the same write_protected page and save memory)

2008-11-11 21:40:20

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Christoph Lameter wrote:
> page migration requires the page to be on the LRU. That could be changed
> if you have a different means of isolating a page from its page tables.
>

Isn't rmap the means of isolating a page from its page tables? I guess
I'm misunderstanding something.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 21:48:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008, Avi Kivity wrote:

> Christoph Lameter wrote:
> > page migration requires the page to be on the LRU. That could be changed
> > if you have a different means of isolating a page from its page tables.
> >
>
> Isn't rmap the means of isolating a page from its page tables? I guess I'm
> misunderstanding something.

In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.

The user space portion involves changing the page tables so that faults
are generated.

The kernel portion isolates the page from the LRU (to exempt it from
kernel reclaim handling etc).

Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user
space isolation portion?

2008-11-11 21:55:59

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Avi Kivity wrote:
>
>
>> Christoph Lameter wrote:
>>
>>> page migration requires the page to be on the LRU. That could be changed
>>> if you have a different means of isolating a page from its page tables.
>>>
>>>
>> Isn't rmap the means of isolating a page from its page tables? I guess I'm
>> misunderstanding something.
>>
>
> In order to migrate a page one first has to make sure that userspace and
> the kernel cannot access the page in any way. User space must be made to
> generate page faults and all kernel references must be accounted for and
> not be in use.
>
This is really not the case for ksm,
in ksm we allow the page to be accessed all the time, we dont have to
swap the page
like migrate.c is doing...

2008-11-11 22:03:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

On Tue, Nov 11, 2008 at 12:38:51PM -0800, Andrew Morton wrote:
> Please fully document that interface in the changelog so that we can
> review your decisions here. This is by far the most important
> consideration - we can change all the code, but interfaces are for
> ever.

Yes, this is the most important point in my view. Even after we make
the ksm pages swappable it'll remain an invisible change to anybody
but us (it'll work better under VM pressure, but that's about it).

> uh-oh, ioctls.

Yes, it's all ioctl based. In very short, it assigns the task and
memory region to scan, and allows to start/stop the kernel thread that
does the scan while selecting how many pages to execute per scan and
how many scans to execute per second. The more pages per scan and the
more scans per second, the higher cpu utilization of the kernel thread.

It would also be possible to submit ksm in a way that has no API at
all (besides kernel module params tunable later by sysfs to set
pages-per-scan and scan-frequency). Doing that would allow us to defer
the API decisions. But then all anonymous memory in the system will be
scanned unconditionally even if there may be little to share for
certain tasks. It would perform quite well, usually the sharable part
is the largest part so the rest wouldn't generate an huge amount of
cpu waste. There's some ram waste too, as some memory has to be
allocated for every page we want to possibly share.

In some ways removing the API would make it simpler to use for non
virtualization environments where they may want to enable it
system-wide.

> ooh, a comment!

8)

> > + kpage = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
>
> Stray whitepace.
>
> Replace with GFP_HIGHUSER.

So not a cleanup, but an improvement (I agree highuser is better, this
isn't by far any higher-priority kernel alloc and it deserves to have
lower prio in the watermarks).

> The term "shared memory" has specific meanings in Linux/Unix. I'm
> suspecting that its use here was inappropriate but I have insufficient
> information to be able to suggest alternatives.

We could call it create_shared_anonymous_memory but then what if it
ever becomes capable of sharing pagecache too? (I doubt it will, not
in the short term at least ;)

Usually when we do these kind of tricks we use the word cloning, so
perhaps also create_cloned_memory_area, so you can later say cloned
anonymous memory or cloned shared memory page instead of just KSM
page. But then this module would become KCM and not KSM 8). Perhaps we
should just go recursive and use create_ksm_memory_area.

> Generally: this review was rather a waste of your time and of mine
> because the code is inadequately documented.

Well, this was a great review considering how little the code was
documented, definitely not a waste of time on our end, it was very
helpful and the good thing is I don't see any controversial stuff.

The two inner loops are the core of the ksm scan, and they're aren't
very simple I've to say. Documenting won't be trivial but it's surely
possible, Izik already working on it I think! Apologies if any time
was wasted on your side!!

2008-11-11 22:04:19

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

I don't claim to begin to really understand the deep VM side of this
patch, but I can certainly pick nits as I work through it...sorry for
the lack of anything more substantive.

> +static struct list_head slots;

Some of these file-static variable names seem a little..terse...

> +#define PAGECMP_OFFSET 128
> +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
> +/* hash the page */
> +static void page_hash(struct page *page, unsigned char *digest)

So is this really saying that you only hash the first 128 bytes, relying on
full compares for the rest? I assume there's a perfectly good reason for
doing it that way, but it's not clear to me from reading the code. Do you
gain performance which is not subsequently lost in the (presumably) higher
number of hash collisions?

> +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
> + int max_npages)
> +{
> + struct ksm_mem_slot *slot;
> + struct page *page[1];
> + int val;
> + int ret = 0;
> +
> + down_read(&slots_lock);
> +
> + scan_update_old_index(ksm_scan);
> +
> + while (scan_npages > 0 && max_npages > 0) {

Should this loop maybe check kthread_run too? It seems like you could loop
for a long time after kthread_run has been set to zero.

> +static int ksm_dev_open(struct inode *inode, struct file *filp)
> +{
> + try_module_get(THIS_MODULE);
> + return 0;
> +}
> +
> +static int ksm_dev_release(struct inode *inode, struct file *filp)
> +{
> + module_put(THIS_MODULE);
> + return 0;
> +}
> +
> +static struct file_operations ksm_chardev_ops = {
> + .open = ksm_dev_open,
> + .release = ksm_dev_release,
> + .unlocked_ioctl = ksm_dev_ioctl,
> + .compat_ioctl = ksm_dev_ioctl,
> +};

Why do you roll your own module reference counting? Is there a reason you
don't just set .owner and let the VFS handle it?

Given that the KSM_REGISTER_MEMORY_REGION ioctl() creates unswappable
memory, should there be some sort of capability check done there? A check
for starting/stopping the thread might also make sense. Or is that
expected to be handled via permissions on /dev/ksm?

Actually, it occurs to me that there's no sanity checks on any of the
values passed in by ioctl(). What happens if the user tells KSM to scan a
bogus range of memory?

Any benchmarks on the runtime cost of having KSM running?

jon

2008-11-11 22:18:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 03:26:57PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
>
> > btw, page_migration likely is buggy w.r.t. o_direct too (and now
> > unfixable with gup_fast until the 2.4 brlock is added around it or
> > similar) if it does the same thing but without any page_mapcount vs
> > page_count check.
>
> Details please?

spin_lock_irq(&mapping->tree_lock);

pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

expected_count = 2 + !!PagePrivate(page);
if (page_count(page) != expected_count ||

this page_count check done with only the tree_lock won't prevent a
task to start O_DIRECT after page_count has been read in the above line.

If a thread starts O_DIRECT on the page, and the o_direct is still in
flight by the time you copy the page to the new page, the read will
not be represented fully in the newpage leading to userland data
corruption.

> > page_migration does too much for us, so us calling into migrate.c may
> > not be ideal. It has to convert a fresh page to a VM page. In KSM we
> > don't convert the newpage to be a VM page, we just replace the anon
> > page with another page. The new page in the KSM case is not a page
> > known by the VM, not in the lru etc...
>
> A VM page as opposed to pages not in the VM? ???

Yes, you migrate old VM pages to new VM pages. We replace VM pages
with unknown stuff called KSM pages. So in the inner function where you
replace the pte-migration-placeholder with a pte pointing to the
newpage, you also rightfully do unconditional stuff we can't be doing
like page_add_*_rmap. It's VM pages you're dealing with. Not for us.

> page migration requires the page to be on the LRU. That could be changed
> if you have a different means of isolating a page from its page tables.

Yes we'd need to change those bits to be able to use migrate.c.

> Define a regular VM page? A page on the LRU?

Yes, pages owned, allocated and worked on by the VM. So they can be
swapped, collected, migrated etc... You can't possibly migrate a
device driver page for example and infact those device driver pages
can't be migrated either.

The KSM page initially is a driver page, later we'd like to teach the
VM how to swap it by introducing rmap methods and adding it to the
LRU. As long as it's only anonymous memory that we're sharing/cloning,
we won't have to patch pagecache radix tree and other stuff. BTW, if
we ever decice to clone pagecache we could generate immense metadata
ram overhead in the radix tree with just a single page of data. All
issues that don't exist for anon ram.

2008-11-11 22:18:32

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Jonathan Corbet wrote:
> I don't claim to begin to really understand the deep VM side of this
> patch, but I can certainly pick nits as I work through it...sorry for
> the lack of anything more substantive.
>
>
>> +static struct list_head slots;
>>
>
> Some of these file-static variable names seem a little..terse...
>
>
>> +#define PAGECMP_OFFSET 128
>> +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
>> +/* hash the page */
>> +static void page_hash(struct page *page, unsigned char *digest)
>>
>
> So is this really saying that you only hash the first 128 bytes, relying on
> full compares for the rest? I assume there's a perfectly good reason for
> doing it that way, but it's not clear to me from reading the code. Do you
> gain performance which is not subsequently lost in the (presumably) higher
> number of hash collisions?
>
>
>> +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
>> + int max_npages)
>> +{
>> + struct ksm_mem_slot *slot;
>> + struct page *page[1];
>> + int val;
>> + int ret = 0;
>> +
>> + down_read(&slots_lock);
>> +
>> + scan_update_old_index(ksm_scan);
>> +
>> + while (scan_npages > 0 && max_npages > 0) {
>>
>
> Should this loop maybe check kthread_run too? It seems like you could loop
> for a long time after kthread_run has been set to zero.
>
>
>> +static int ksm_dev_open(struct inode *inode, struct file *filp)
>> +{
>> + try_module_get(THIS_MODULE);
>> + return 0;
>> +}
>> +
>> +static int ksm_dev_release(struct inode *inode, struct file *filp)
>> +{
>> + module_put(THIS_MODULE);
>> + return 0;
>> +}
>> +
>> +static struct file_operations ksm_chardev_ops = {
>> + .open = ksm_dev_open,
>> + .release = ksm_dev_release,
>> + .unlocked_ioctl = ksm_dev_ioctl,
>> + .compat_ioctl = ksm_dev_ioctl,
>> +};
>>
>
> Why do you roll your own module reference counting? Is there a reason you
> don't just set .owner and let the VFS handle it?
>

Yes, I am taking get_task_mm() if the module will be removed before i
free the mms, things will go wrong

> Given that the KSM_REGISTER_MEMORY_REGION ioctl() creates unswappable
> memory, should there be some sort of capability check done there? A check
> for starting/stopping the thread might also make sense. Or is that
> expected to be handled via permissions on /dev/ksm?
>

Well, I think giving premmision to /dev/ksm default as root is enough.
No?

> Actually, it occurs to me that there's no sanity checks on any of the
> values passed in by ioctl(). What happens if the user tells KSM to scan a
> bogus range of memory?
>

Well get_user_pages() run in context of the process, therefore it should
fail in "bogus range of memory"

> Any benchmarks on the runtime cost of having KSM running?
>

This one is problematic, ksm can take anything from 0% to 100% cpu
its all depend on how fast you run it.
it have 3 parameters:
number of pages to scan before it go to sleep
maximum number of pages to merge while we scanning the above pages
(merging is expensive)
time to sleep (when runing from userspace using /dev/ksm, we actually do
it there (userspace)

> jon
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-11-11 22:25:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > ksm need the pte inside the vma to point from anonymous page into filebacked
> > page
> > can migrate.c do it without changes?
>
> So change anonymous to filebacked page?
>
> Currently page migration assumes that the page will continue to be part
> of the existing file or anon vma.
>
> What you want sounds like assigning a swap pte to an anonymous page? That
> way a anon page gains membership in a file backed mapping.

KSM needs to convert anonymous pages to PageKSM, which means a page
owned by ksm.c and only known by ksm.c. The Linux VM will free this
page in munmap but that's about it, all we do is to match the number
of anon-ptes pointing to the page with the page_count. So besides
freeing the page when the last user exit()s or cows it, the VM will do
nothing about it. Initially. Later it can swap it in a nonlinear way.

2008-11-11 22:26:22

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

On Wed, 12 Nov 2008 00:17:39 +0200
Izik Eidus <[email protected]> wrote:

> >> +static int ksm_dev_open(struct inode *inode, struct file *filp)
> >> +{
> >> + try_module_get(THIS_MODULE);
> >> + return 0;
> >> +}
> >> +
> >> +static int ksm_dev_release(struct inode *inode, struct file *filp)
> >> +{
> >> + module_put(THIS_MODULE);
> >> + return 0;
> >> +}
> >> +
> >> +static struct file_operations ksm_chardev_ops = {
> >> + .open = ksm_dev_open,
> >> + .release = ksm_dev_release,
> >> + .unlocked_ioctl = ksm_dev_ioctl,
> >> + .compat_ioctl = ksm_dev_ioctl,
> >> +};
> >>
> >
> > Why do you roll your own module reference counting? Is there a
> > reason you don't just set .owner and let the VFS handle it?
> >
>
> Yes, I am taking get_task_mm() if the module will be removed before i
> free the mms, things will go wrong

But...if you set .owner, the VFS will do the try_module_get() *before*
calling into your module (as an added bonus, it will actually check the
return value too). All you've succeeded in doing here is adding a
microscopic race to the module reference counting; otherwise the end
result is the same.

jon

2008-11-11 22:30:45

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

[Let's see if I can get through the rest without premature sends...]

On Wed, 12 Nov 2008 00:17:39 +0200
Izik Eidus <[email protected]> wrote:

> > Actually, it occurs to me that there's no sanity checks on any of
> > the values passed in by ioctl(). What happens if the user tells
> > KSM to scan a bogus range of memory?
> >
>
> Well get_user_pages() run in context of the process, therefore it
> should fail in "bogus range of memory"

But it will fail in a totally silent and mysterious way. Doesn't it
seem better to verify the values when you can return a meaningful error
code to the caller?

The other ioctl() calls have the same issue; you can start the thread
with nonsensical values for the number of pages to scan and the sleep
time.

>
> > Any benchmarks on the runtime cost of having KSM running?
> >
>
> This one is problematic, ksm can take anything from 0% to 100% cpu
> its all depend on how fast you run it.
> it have 3 parameters:
> number of pages to scan before it go to sleep
> maximum number of pages to merge while we scanning the above pages
> (merging is expensive)
> time to sleep (when runing from userspace using /dev/ksm, we actually
> do it there (userspace)

What about things like cache effects from scanning all those pages? My
guess is that, if you're trying to run dozens of Windows guests, cache
usage is not at the top of your list of concerns, but I could be
wrong. Usually am...

Thanks,

jon

2008-11-11 22:31:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> this page_count check done with only the tree_lock won't prevent a
> task to start O_DIRECT after page_count has been read in the above line.
>
> If a thread starts O_DIRECT on the page, and the o_direct is still in
> flight by the time you copy the page to the new page, the read will
> not be represented fully in the newpage leading to userland data
> corruption.

O_DIRECT does not take a refcount on the page in order to prevent this?

> > Define a regular VM page? A page on the LRU?
>
> Yes, pages owned, allocated and worked on by the VM. So they can be
> swapped, collected, migrated etc... You can't possibly migrate a
> device driver page for example and infact those device driver pages
> can't be migrated either.

Oh they could be migrated if you had a callback to the devices method for
giving up references. Same as slab defrag.

> The KSM page initially is a driver page, later we'd like to teach the
> VM how to swap it by introducing rmap methods and adding it to the
> LRU. As long as it's only anonymous memory that we're sharing/cloning,
> we won't have to patch pagecache radix tree and other stuff. BTW, if
> we ever decice to clone pagecache we could generate immense metadata
> ram overhead in the radix tree with just a single page of data. All
> issues that don't exist for anon ram.

Seems that we are tinkering around with the concept of what an anonymous
page is? Doesnt shmem have some means of converting pages to file backed?
Swizzling?

2008-11-11 22:32:32

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Jonathan Corbet wrote:
> On Wed, 12 Nov 2008 00:17:39 +0200
> Izik Eidus <[email protected]> wrote:
>
>
>>>> +static int ksm_dev_open(struct inode *inode, struct file *filp)
>>>> +{
>>>> + try_module_get(THIS_MODULE);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int ksm_dev_release(struct inode *inode, struct file *filp)
>>>> +{
>>>> + module_put(THIS_MODULE);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static struct file_operations ksm_chardev_ops = {
>>>> + .open = ksm_dev_open,
>>>> + .release = ksm_dev_release,
>>>> + .unlocked_ioctl = ksm_dev_ioctl,
>>>> + .compat_ioctl = ksm_dev_ioctl,
>>>> +};
>>>>
>>>>
>>> Why do you roll your own module reference counting? Is there a
>>> reason you don't just set .owner and let the VFS handle it?
>>>
>>>
>> Yes, I am taking get_task_mm() if the module will be removed before i
>> free the mms, things will go wrong
>>
>
> But...if you set .owner, the VFS will do the try_module_get() *before*
> calling into your module (as an added bonus, it will actually check the
> return value too).
Ohhh i see what you mean
you are right i had at least needed to check for the return value of
try_module_get(),
anyway will check this issue for V2.

2008-11-11 22:36:21

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Avi Kivity wrote:
>
>
>> Christoph Lameter wrote:
>>
>>> page migration requires the page to be on the LRU. That could be changed
>>> if you have a different means of isolating a page from its page tables.
>>>
>>>
>> Isn't rmap the means of isolating a page from its page tables? I guess I'm
>> misunderstanding something.
>>
>
> In order to migrate a page one first has to make sure that userspace and
> the kernel cannot access the page in any way. User space must be made to
> generate page faults and all kernel references must be accounted for and
> not be in use.
>
> The user space portion involves changing the page tables so that faults
> are generated.
>
> The kernel portion isolates the page from the LRU (to exempt it from
> kernel reclaim handling etc).
>
>

Thanks.

> Only then can the page and its metadata be copied to a new location.
>
> Guess you already have the LRU portion done. So you just need the user
> space isolation portion?
>

We don't want to limit all faults, just writes. So we write protect the
page before merging.

What do you mean by page metadata? obviously the dirty bit (Izik?),
accessed bit and position in the LRU are desirable (the last is quite
difficult for ksm -- the page occupied *two* positions in the LRU).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 22:38:57

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Jonathan Corbet wrote:
> [Let's see if I can get through the rest without premature sends...]
>
> On Wed, 12 Nov 2008 00:17:39 +0200
> Izik Eidus <[email protected]> wrote:
>
>
>>> Actually, it occurs to me that there's no sanity checks on any of
>>> the values passed in by ioctl(). What happens if the user tells
>>> KSM to scan a bogus range of memory?
>>>
>>>
>> Well get_user_pages() run in context of the process, therefore it
>> should fail in "bogus range of memory"
>>
>
> But it will fail in a totally silent and mysterious way. Doesn't it
> seem better to verify the values when you can return a meaningful error
> code to the caller?
>
>

Well I dont mind insert it (the above for sure is not a bug)
but even with that, the user can still free the memory that he gave to us
so this check if "nice to have check", we have nothing to do but to relay on
get_user_pages return value :)

> The other ioctl() calls have the same issue; you can start the thread
> with nonsensical values for the number of pages to scan and the sleep
> time.
>

well about this i agree, here it make alot of logic to check the values!


2008-11-11 22:40:43

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

On Tue, 11 Nov 2008 15:03:45 MST, Jonathan Corbet said:

> > +#define PAGECMP_OFFSET 128
> > +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
> > +/* hash the page */
> > +static void page_hash(struct page *page, unsigned char *digest)
>
> So is this really saying that you only hash the first 128 bytes, relying on
> full compares for the rest? I assume there's a perfectly good reason for
> doing it that way, but it's not clear to me from reading the code. Do you
> gain performance which is not subsequently lost in the (presumably) higher
> number of hash collisions?

Seems reasonably sane to me - only doing the first 128 bytes rather than
a full 4K page is some 32 times faster. Yes, you'll have the *occasional*
case where two pages were identical for 128 bytes but then differed, which is
why there's buckets. But the vast majority of the time, at least one bit
will be different in the first part.

In fact, I'd not be surprised if only going for 64 bytes works well...


Attachments:
(No filename) (226.00 B)

2008-11-11 22:43:33

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Jonathan Corbet wrote:
>> +static struct list_head slots;
>>
>
> Some of these file-static variable names seem a little..terse...
>

While ksm was written to be independent of a certain TLA-named kernel
subsystem developed two rooms away, they share some naming... this
refers to kvm 'memory slots' which correspond to DIMM banks.

I guess it should be renamed to merge_ranges or something.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 22:49:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Izik Eidus wrote:
>> Any benchmarks on the runtime cost of having KSM running?
>>
>
> This one is problematic, ksm can take anything from 0% to 100% cpu
> its all depend on how fast you run it.
> it have 3 parameters:
> number of pages to scan before it go to sleep
> maximum number of pages to merge while we scanning the above pages
> (merging is expensive)
> time to sleep (when runing from userspace using /dev/ksm, we actually
> do it there (userspace)

The scan process priority also has its effect. One strategy would be to
run it at idle priority as long as you have enough free memory, and
increase the priority as memory starts depleting.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 23:03:24

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Hi Jonathan,

On Tue, Nov 11, 2008 at 03:30:28PM -0700, Jonathan Corbet wrote:
> But it will fail in a totally silent and mysterious way. Doesn't it
> seem better to verify the values when you can return a meaningful error
> code to the caller?

I think you're right, but just because find_extend_vma will have the
effect of growing the kernel stack down. We clearly don't set it on a
stack with KVM as there's nothing to share on the stack usually - we
only set it in the guest physical memory range. And things are safe
regardless as get_user_pages is verifying the values for us. Problem
is it's using find_extend_vma because it behaves like a page fault. We
must not behave like a pagefault, we're much closer to follow_page
only than a page fault. Not a big deal, but it can be improved by
avoiding to extend the stack somehow (likely simplest is to call
find_vma twice, first time externally, we hold mmap_sem externally so
all right).

> What about things like cache effects from scanning all those pages? My
> guess is that, if you're trying to run dozens of Windows guests, cache
> usage is not at the top of your list of concerns, but I could be
> wrong. Usually am...

Oh that's not an issue. This is all about trading some CPU for lots of
free memory. It pays off big as so many more VM can run. With desktop
virtualization and 1G systems, you reach a RAM bottleneck much quicker
than a CPU bottleneck. Perhaps not so quick on server virtualization
but the point is this is intentional. It may be possible to compute
the jhash (that's where the cpu is spent) with instructions that don't
pollute the cpu cache but I doubt it's going to make much an huge
difference.

2008-11-11 23:03:39

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

Jonathan Corbet wrote:
>
> What about things like cache effects from scanning all those pages? My
> guess is that, if you're trying to run dozens of Windows guests, cache
> usage is not at the top of your list of concerns, but I could be
> wrong. Usually am...
>

Ok, ksm does make the cache of the cpu dirty when scanning the pages
(but the scanning happen slowly slowly and cache usually get dirty much
faster)
But infact it improve the cache by the fact that it make many ptes point
to the same page
so if before we had 12 process touching 12 diffrent physical page they
would dirty the page
but now they will touch just one...

so i guess it depend on how you see it...

> Thanks,
>
> jon
>

2008-11-11 23:18:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 04:30:22PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
>
> > this page_count check done with only the tree_lock won't prevent a
> > task to start O_DIRECT after page_count has been read in the above line.
> >
> > If a thread starts O_DIRECT on the page, and the o_direct is still in
> > flight by the time you copy the page to the new page, the read will
> > not be represented fully in the newpage leading to userland data
> > corruption.
>
> O_DIRECT does not take a refcount on the page in order to prevent this?

It definitely does, it's also the only thing it does.

The whole point is that O_DIRECT can start the instruction after
page_count returns as far as I can tell.

If you check the three emails I linked in answer to Andrew on the
topic, we agree the o_direct can't start under PT lock (or under
mmap_sem in write mode but migrate.c rightefully takes the read
mode). So the fix used in ksm page_wrprotect and in fork() is to check
page_count vs page_mapcount inside PT lock before doing anything on
the pte. If you just mark the page wprotect while O_DIRECT is in
flight, that's enough for fork() to generate data corruption in the
parent (not the child where the result would be undefined). But in the
parent the result of the o-direct is defined and it'd never corrupt if
this was a cached-I/O. The moment the parent pte is marked readonly, a thread
in the parent could write to the last 512bytes of the page, leading to
the first 512bytes coming with O_DIRECT from disk being lost (as the
write will trigger a cow before I/O is complete and the dma will
complete on the oldpage).

We do the check in page_wprotect because that's the _first_ place
where we mark a pte readonly. Same as fork. And we can't convert a pte
from writeable to wrprotected without doing this check inside PT lock
or we generate data corruption with o_direct (again same as the bug in
fork).

We don't have to check the page_count vs mapcount later in
replace_page because we know if anybody started an O_DIRECT read from
disk, it would have triggered a cow, and the pte_same check that we
have to do for other reasons would take care of bailing out the
replace_page.

> Oh they could be migrated if you had a callback to the devices method for
> giving up references. Same as slab defrag.

The oldpage is a regular anonymous page for us, not the 'strange'
object called PageKSM. And we need to migrate many anonymous pages
having destination a single PageKSM page already preallocated and not
being an anonymous page. We never migrate PageKSM to anything, that's
always the destination.

> Seems that we are tinkering around with the concept of what an anonymous
> page is? Doesnt shmem have some means of converting pages to file backed?
> Swizzling?

Anonymous page is anything with page->mapping pointing to an anon_vma
or swapcache_space instead of an address_space of a real inode backed
by the vfs.

2008-11-11 23:25:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, Nov 12, 2008 at 12:17:22AM +0100, Andrea Arcangeli wrote:
> We don't have to check the page_count vs mapcount later in
> replace_page because we know if anybody started an O_DIRECT read from
> disk, it would have triggered a cow, and the pte_same check that we
> have to do for other reasons would take care of bailing out the
> replace_page.

Ah, for completeness: above I didn't mention the case of O_DIRECT
writes to disk, because we never need to care about those. They're
never a problem. If the page is replaced and the cpu writes to the
page and by doing so triggers a cow that lead to the CPU write to go
in a different page (not the one under dma) it'll be like if the write
to disk completed before the cpu overwritten the page, so result is
undefined. I don't think we've to define the case of somebody doing a
direct read from a location where there's still an o_direct write in
flight either.

2008-11-12 00:44:59

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> > O_DIRECT does not take a refcount on the page in order to prevent this?
>
> It definitely does, it's also the only thing it does.

Then page migration will not occur because there is an unresolved
reference.

> The whole point is that O_DIRECT can start the instruction after
> page_count returns as far as I can tell.

But there must still be reference for the bio and whatever may be going on
at the time in order to perform the I/O operation.

> If you check the three emails I linked in answer to Andrew on the
> topic, we agree the o_direct can't start under PT lock (or under
> mmap_sem in write mode but migrate.c rightefully takes the read
> mode). So the fix used in ksm page_wrprotect and in fork() is to check
> page_count vs page_mapcount inside PT lock before doing anything on
> the pte. If you just mark the page wprotect while O_DIRECT is in
> flight, that's enough for fork() to generate data corruption in the
> parent (not the child where the result would be undefined). But in the
> parent the result of the o-direct is defined and it'd never corrupt if
> this was a cached-I/O. The moment the parent pte is marked readonly, a thread
> in the parent could write to the last 512bytes of the page, leading to
> the first 512bytes coming with O_DIRECT from disk being lost (as the
> write will trigger a cow before I/O is complete and the dma will
> complete on the oldpage).

Have you actually seen corruption or this conjecture? AFACT the page
count is elevated while I/O is in progress and thus this is safe.

2008-11-12 02:20:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, 11 Nov 2008 23:24:21 +0100
Andrea Arcangeli <[email protected]> wrote:

> On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > > ksm need the pte inside the vma to point from anonymous page into filebacked
> > > page
> > > can migrate.c do it without changes?
> >
> > So change anonymous to filebacked page?
> >
> > Currently page migration assumes that the page will continue to be part
> > of the existing file or anon vma.
> >
> > What you want sounds like assigning a swap pte to an anonymous page? That
> > way a anon page gains membership in a file backed mapping.
>
> KSM needs to convert anonymous pages to PageKSM, which means a page
> owned by ksm.c and only known by ksm.c. The Linux VM will free this
> page in munmap but that's about it, all we do is to match the number
> of anon-ptes pointing to the page with the page_count. So besides
> freeing the page when the last user exit()s or cows it, the VM will do
> nothing about it. Initially. Later it can swap it in a nonlinear way.
>
Can I make a question ? (I'm working for memory cgroup.)

Now, we do charge to anonymous page when
- charge(+1) when it's mapped firstly (mapcount 0->1)
- uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap().

My quesion is
- PageKSM pages are not necessary to be tracked by memory cgroup ?
- Can we know that "the page is just replaced and we don't necessary to do
charge/uncharge".
- annonymous page from KSM is worth to be tracked by memory cgroup ?
(IOW, it's on LRU and can be swapped-out ?)

Thanks,
-Kame

2008-11-12 02:27:22

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 06:27:09PM -0600, Christoph Lameter wrote:
> Then page migration will not occur because there is an unresolved
> reference.

So are you checking if there's an unresolved reference only in the
very place I just quoted in the previous email? If answer is yes: what
should prevent get_user_pages from running in parallel from another
thread? get_user_pages will trigger a minor fault and get the elevated
reference just after you read page_count. To you it looks like there
is no o_direct in progress when you proceed to the core of migration
code, but in effect o_direct just started a moment after you read the
page count.

What can protect you is PG lock or mmap_sem in _write_ mode (and
they've to be hold for the whole duration of the migration). I don't
see any of the two being hold while you read the page count... You
don't seem to be using stop_machine either (stop_machine pretty
expensive on the 4096 way I guess).

This wasn't reproduced in practice but it should be possible to
reproduce it by just writing a testcase with three threads, one forks
in a loop (child just quit) the other memset 0 the first 512bytes of a
page, and then o_direct read from a 0xff 512byte region and checks
that the first 512bytes are all non-zero in a loop, and the third
writes 1 byte to the last 512bytes of the page in a loop. Eventually
the comparison should show zero data in the page.

To reproduce with migration just start the thread that memset 0, reads
a 0xff region with o_direct, and checks it's all 0xff in a loop, and
then migrate the memory of this thread back and forth between two
nodes with the sys_move_pages (mpol is safe by luck because it
surrounds migrate_pages with the mmap_sem in write mode). Eventually
you should see zero bytes despite I/O is complete.

Reproducing this is normal life would take time and for the fork bug
it may not be reproducible depending of what the app is doing. Mixing
sys_move_pages with o_direct in the same process with on two different
threads, instead should eventually eventually reproduce it. And with
gup_fast is now unfixable until more infrastructure is added to
slowdown gup_fast a bit (unless Nick finds an RCU way of doing it).

2008-11-12 03:11:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> So are you checking if there's an unresolved reference only in the
> very place I just quoted in the previous email? If answer is yes: what
> should prevent get_user_pages from running in parallel from another
> thread? get_user_pages will trigger a minor fault and get the elevated
> reference just after you read page_count. To you it looks like there
> is no o_direct in progress when you proceed to the core of migration
> code, but in effect o_direct just started a moment after you read the
> page count.

get_user_pages() cannot get to it since the pagetables have already been
modified. If get_user_pages runs then the fault handling will occur
which will block the thread until migration is complete.

2008-11-12 10:06:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

KAMEZAWA Hiroyuki wrote:
> Can I make a question ? (I'm working for memory cgroup.)
>
> Now, we do charge to anonymous page when
> - charge(+1) when it's mapped firstly (mapcount 0->1)
> - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap().
>
> My quesion is
> - PageKSM pages are not necessary to be tracked by memory cgroup ?
> - Can we know that "the page is just replaced and we don't necessary to do
> charge/uncharge".
> - annonymous page from KSM is worth to be tracked by memory cgroup ?
> (IOW, it's on LRU and can be swapped-out ?)
>

My feeling is that shared pages should be accounted as if they were not
shared; that is, a share page should be accounted for each process that
shares it. Perhaps sharing within a cgroup should be counted as 1 page
for all the ptes pointing to it.


--
error compiling committee.c: too many arguments to function

2008-11-12 11:13:30

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Avi Kivity wrote:
> KAMEZAWA Hiroyuki wrote:
>> Can I make a question ? (I'm working for memory cgroup.)
>>
>> Now, we do charge to anonymous page when
>> - charge(+1) when it's mapped firstly (mapcount 0->1)
>> - uncharge(-1) it's fully unmapped (mapcount 1->0) vir
>> page_remove_rmap().
>>
>> My quesion is
>> - PageKSM pages are not necessary to be tracked by memory cgroup ?
When we reaplacing page using page_replace() we have:
oldpage - > anonymous page that is going to be replaced by newpage
newpage -> kernel allocated page (KsmPage)
so about oldpage we are calling page_remove_rmap() that will notify cgroup
and about newpage it wont be count inside cgroup beacuse it is file rmap
page
(we are calling to page_add_file_rmap), so right now PageKSM wont ever
be tracked by cgroup.

>> - Can we know that "the page is just replaced and we don't necessary
>> to do
>> charge/uncharge".

The caller of page_replace does know it, the only problem is that
page_remove_rmap()
automaticly change the cgroup for anonymous pages,
if we want it not to change the cgroup, we can:
increase the cgroup count before page_remove (but in that case what
happen if we reach to the limit???)
give parameter to page_remove_rmap() that we dont want the cgroup to be
changed.

>> - annonymous page from KSM is worth to be tracked by memory cgroup ?
>> (IOW, it's on LRU and can be swapped-out ?)

KSM have no anonymous pages (it share anonymous pages into KsmPAGE ->
kernel allocated page without mapping)
so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be
break by do_wp_page() the duplication will be able to swap.

>>
>
> My feeling is that shared pages should be accounted as if they were
> not shared; that is, a share page should be accounted for each process
> that shares it. Perhaps sharing within a cgroup should be counted as
> 1 page for all the ptes pointing to it.
>
>

2008-11-12 17:33:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> get_user_pages() cannot get to it since the pagetables have already been
> modified. If get_user_pages runs then the fault handling will occur
> which will block the thread until migration is complete.

migrate.c does nothing for ptes pointing to swap entries and
do_swap_page won't wait for them either. Assume follow_page in
migrate.c returns a swapcache not mapped but with a pte pointing to
it. That means page_count 1 (+1 after you isolate it from the lru),
page_mapcount 0, page_mapped 0, page_mapping = swap address space,
swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
one thread does o_direct read from disk that triggers a minor fault in
do_swap_cache called by get_user_pages. The other cpu is running
sys_move_pages and the expected count will match the page count in
migrate_page_move_mapping. Page is still in swapcache. So after the
expected count matches in the migrate.c thread, the other thread
continues in do_swap_page and runs lookup_swap_cache that succeeds
(the page wasn't removed from swapcache yet as migrate.c needs to bail
out if the expected count doesn't match, so it can't mess with the
oldpage until it's sure it can migrate it). After that do_swap_page
gets a reference on the swapcache (at that point migrate.c continues
despite the expected count isn't 2 anymore! just a second after having
verified that it was 2). lock_page blocks do_swap_page until migration
is complete but pte_same in do_swap_page won't fail because the pte is
still pointing to the same swapentry (it's just the swapcache inode
radix tree that points to a different page, the swapentry is still the
same as before the migration - is_swap_pte will succeed but
is_migration_entry failed when restoring the pte). Finally the pte is
overwritten with the old page and any data written to the new page in
between is lost.

However it's not exactly the same bug as the one in fork, I was
talking about before, it's also not o_direct specific. Still
page_wrprotect + replace_page is orders of magnitude simpler logic
than migrate.c and it has no bugs or at least it's certainly much
simpler to proof as correct. Furthermore we never 'stall' any userland
task while we do our work. We only mark the pte wrprotected, the task
can cow or takeover it if refcount allows anytime, and later we'll
bailout during replace_page if something has happened in between
page_wrprotect and replace_page. So our logic is simpler and tuned for
max performance and fewer interference with userland runtime. Not
really sure if it worth for us to call into migrate.c.

2008-11-12 20:08:23

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, 2008-11-12 at 18:32 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in
> migrate.c returns a swapcache not mapped but with a pte pointing to
> it. That means page_count 1 (+1 after you isolate it from the lru),
> page_mapcount 0, page_mapped 0, page_mapping = swap address space,
> swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
> one thread does o_direct read from disk that triggers a minor fault in
> do_swap_cache called by get_user_pages. The other cpu is running
> sys_move_pages and the expected count will match the page count in
> migrate_page_move_mapping. Page is still in swapcache. So after the
> expected count matches in the migrate.c thread, the other thread
> continues in do_swap_page and runs lookup_swap_cache that succeeds
> (the page wasn't removed from swapcache yet as migrate.c needs to bail
> out if the expected count doesn't match, so it can't mess with the
> oldpage until it's sure it can migrate it). After that do_swap_page
> gets a reference on the swapcache (at that point migrate.c continues
> despite the expected count isn't 2 anymore! just a second after having
> verified that it was 2). lock_page blocks do_swap_page until migration
> is complete but pte_same in do_swap_page won't fail because the pte is
> still pointing to the same swapentry (it's just the swapcache inode
> radix tree that points to a different page, the swapentry is still the
> same as before the migration - is_swap_pte will succeed but
> is_migration_entry failed when restoring the pte).

Ah. try_to_unmap_one() won't replace the pte entry with a
migration_pte() if the [anon] page is already in the swap cache. When
migration completes, we won't modify the page tables with the newpage
pte--we'll just let any subsequent swap page [minor] fault handle that.

That suggests a possible fix: instead of replacing the pte with a
duplicate swap entry in try_to_unmap_one(), go ahead and replace the pte
with a migration pte. Then back in try_to_unmap_anon(), after unmapping
all references, free the swap cache entry, so as not to leak it
[assuming we're in a lock state that allows that--I haven't checked that
far]. Then, the page table WILL have been modified by the time
migration unlocks the page.

Might want/need to check for migration entry in do_swap_page() and loop
back to migration_entry_wait() call when the changed pte is detected
rather than returning an error to the caller.

Does that sound reasonable?

> Finally the pte is
> overwritten with the old page and any data written to the new page in
> between is lost.

And wouldn't the new page potentially be leaked? That is, could it end
up on the lru with page_count == page_mapcount() >= 1, but no page table
reference to ever be unmapped to release the count?

>
> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still
> page_wrprotect + replace_page is orders of magnitude simpler logic
> than migrate.c and it has no bugs or at least it's certainly much
> simpler to proof as correct. Furthermore we never 'stall' any userland
> task while we do our work. We only mark the pte wrprotected, the task
> can cow or takeover it if refcount allows anytime, and later we'll
> bailout during replace_page if something has happened in between
> page_wrprotect and replace_page. So our logic is simpler and tuned for
> max performance and fewer interference with userland runtime. Not
> really sure if it worth for us to call into migrate.c.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2008-11-12 20:28:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in

If a anonymous page is a swap page then it has a mapping.
migrate_page_move_mapping() will lock the radix tree and ensure that no
additional reference (like done by do_swap_page) is established during
migration.

> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still

So far I have seen wild ideas not bugs.


2008-11-12 20:32:00

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, 12 Nov 2008, Lee Schermerhorn wrote:

> Might want/need to check for migration entry in do_swap_page() and loop
> back to migration_entry_wait() call when the changed pte is detected
> rather than returning an error to the caller.
>
> Does that sound reasonable?

The reference count freezing and the rechecking of the pte in
do_swap_page() does not work? Nick broke it during lock removal for the
lockless page cache?

2008-11-12 22:09:23

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, 2008-11-12 at 14:27 -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2008, Andrea Arcangeli wrote:
>
> > On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > > get_user_pages() cannot get to it since the pagetables have already been
> > > modified. If get_user_pages runs then the fault handling will occur
> > > which will block the thread until migration is complete.
> >
> > migrate.c does nothing for ptes pointing to swap entries and
> > do_swap_page won't wait for them either. Assume follow_page in
>
> If a anonymous page is a swap page then it has a mapping.
> migrate_page_move_mapping() will lock the radix tree and ensure that no
> additional reference (like done by do_swap_page) is established during
> migration.

So, it's Nick's reference freezing you asked about in response to my
mail that prevents do_swap_page() from getting another reference on the
page in the swap cache just after migrate_page_move_mapping() checks the
ref count and replaces the slot with new swap pte. Radix tree lock just
prevents other threads from modifying the slot, right? [Hmmm, looks
like we need to update the reference to "write lock" in the comments on
the 'deref_slot() and _replace_slot() definitions in radix-tree.h.]

Therefore, do_swap_page() will either get the old page and raise the ref
before migration check, or it will [possibly loop in find_get_page() and
then] get the new page.

Migration will bail out, for this pass anyway, in the former case. In
the second case, do_swap_page() will wait on the new page lock until
migration completes, deferring any direct IO.

Or am I still missing something?

>
> > However it's not exactly the same bug as the one in fork, I was
> > talking about before, it's also not o_direct specific. Still
>
> So far I have seen wild ideas not bugs.

Maybe not so wild, given the complexity of these interactions...

Later,
Lee

2008-11-13 02:01:24

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Wed, Nov 12, 2008 at 05:09:03PM -0500, Lee Schermerhorn wrote:
> Maybe not so wild, given the complexity of these interactions...

Perhaps Christoph's right it's just wild ideas, but see below.

You both seem to agree the first theory of the tree_lock is bogus
as it's lockless for find_get_page.

The second theory of why it shouldn't happen thanks to the refcount
freezing is bogus too...

CPU0 migrate.c CPU1 filemap.c
------- ----------
find_get_page
radix_tree_lookup_slot returns the oldpage
page_count still = expected_count
freeze_ref (oldpage->count = 0)
radix_tree_replace (too late, other side already got the oldpage)
unfreeze_ref (oldpage->count = 2)
page_cache_get_speculative(old_page)
set count to 3 and succeeds

Admittedly I couldn't understand what the freeze_ref was about, I
thought it was something related to the radix tree internals (which I
didn't check as they weren't relevant at that point besides being
lockless) as there was nothing inside that freeze/unfreeze critical
section that could affect find_get_page, so I ignored it. If if was
meant to stop find_get_page to get the oldpage it clearly isn't
working.

Perhaps I'm still missing something...

If I'm right my suggested fix is to simply change the
remove_migration_ptes to set the pte to point to the swapcache,
instead of leaving the swapentry in it. That will make do_swap_page
bailout like every other path in memory.c in the pte_same check, and
additionally it'll avoid an unnecessary minor fault.

2008-11-13 02:31:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote:
> CPU0 migrate.c CPU1 filemap.c
> ------- ----------
> find_get_page
> radix_tree_lookup_slot returns the oldpage
> page_count still = expected_count
> freeze_ref (oldpage->count = 0)
> radix_tree_replace (too late, other side already got the oldpage)
> unfreeze_ref (oldpage->count = 2)
> page_cache_get_speculative(old_page)
> set count to 3 and succeeds

After reading more of this lockless radix tree code, I realized this
below check is the one that was intended to restart find_get_page and
prevent it to return the oldpage:

if (unlikely(page != *pagep)) {

But there's no barrier there, atomic_add_unless would need to provide
an atomic smp_mb() _after_ atomic_add_unless executed. In the old days
the atomic_* routines had no implied memory barriers, you had to use
smp_mb__after_atomic_add_unless if you wanted to avoid the race. I
don't see much in the ppc implementation of atomic_add_unless that
would provide an implicit smb_mb after the page_cache_get_speculative
returns, so I can't see why the pagep can't be by find_get_page read
before the other cpu executes radix_tree_replace in the above
timeline.

I guess you intended to put an smp_mb() in between the
page_cache_get_speculative and the *pagep to make the code safe on ppc
too, but there isn't, and I think it must be fixed, either that or I
don't understand ppc assembly right. The other side has a smp_wmb
implicit inside radix_tree_replace_slot so it should be ok already to
ensure we see the refcount going to 0 before we see the pagep changed
(the fact the other side has a memory barrier, further confirms this
side needs it too).

BTW, the radix_tree_deref_slot might miss a rcu_barrier_depends()
after radix_tree_deref_slot returns but I'm not entirely sure and only
alpha would be affected in the worst case.

2008-11-13 04:02:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Thursday 13 November 2008 13:31, Andrea Arcangeli wrote:
> On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote:
> > CPU0 migrate.c CPU1 filemap.c
> > ------- ----------
> > find_get_page
> > radix_tree_lookup_slot returns the oldpage
> > page_count still = expected_count
> > freeze_ref (oldpage->count = 0)
> > radix_tree_replace (too late, other side already got the oldpage)
> > unfreeze_ref (oldpage->count = 2)
> > page_cache_get_speculative(old_page)
> > set count to 3 and succeeds
>
> After reading more of this lockless radix tree code, I realized this
> below check is the one that was intended to restart find_get_page and
> prevent it to return the oldpage:
>
> if (unlikely(page != *pagep)) {
>
> But there's no barrier there, atomic_add_unless would need to provide
> an atomic smp_mb() _after_ atomic_add_unless executed. In the old days
> the atomic_* routines had no implied memory barriers, you had to use
> smp_mb__after_atomic_add_unless if you wanted to avoid the race. I
> don't see much in the ppc implementation of atomic_add_unless that
> would provide an implicit smb_mb after the page_cache_get_speculative
> returns, so I can't see why the pagep can't be by find_get_page read
> before the other cpu executes radix_tree_replace in the above
> timeline.

All atomic functions that both return a value and modify their memory
are defined to provide full barriers before and after the operation.

powerpc should be OK

__asm__ __volatile__ (
LWSYNC_ON_SMP
"1: lwarx %0,0,%1 # atomic_add_unless\n\
cmpw 0,%0,%3 \n\
beq- 2f \n\
add %0,%2,%0 \n"
PPC405_ERR77(0,%2)
" stwcx. %0,0,%1 \n\
bne- 1b \n"
ISYNC_ON_SMP
" subf %0,%2,%0 \n\
2:"
: "=&r" (t)
: "r" (&v->counter), "r" (a), "r" (u)
: "cc", "memory");

lwsync instruction prevents all reorderings except allows loads to
pass stores. But because it is followed by a LL/SC sequence, the
store part of that sequence is prevented from passing stores, so
therefore it will fail if the load had executed earlier and the
value has changed.

isync prevents speculative execution, so the branch has to be
resolved before any subsequent instructions start. The branch
depends on the result of the LL/SC, so that has to be complete.

So that prevents loads from passing the LL/SC.

For the SC to complete, I think by definition the store has to be
visible, because it has to check the value has not changed (so it
would have to own the cacheline).

I think that's how it works. I'm not an expert at powerpc's weird
barriers, but I'm pretty sure they work.


> I guess you intended to put an smp_mb() in between the
> page_cache_get_speculative and the *pagep to make the code safe on ppc
> too, but there isn't, and I think it must be fixed, either that or I
> don't understand ppc assembly right. The other side has a smp_wmb
> implicit inside radix_tree_replace_slot so it should be ok already to
> ensure we see the refcount going to 0 before we see the pagep changed
> (the fact the other side has a memory barrier, further confirms this
> side needs it too).

I think it is OK. It should have a comment there however to say it
relies on a barrier in get_page_unless_zero (I thought I had one..)

2008-11-13 06:12:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

Thank you for answers.

On Wed, 12 Nov 2008 13:11:12 +0200
Izik Eidus <[email protected]> wrote:

> Avi Kivity wrote:
> > KAMEZAWA Hiroyuki wrote:
> >> Can I make a question ? (I'm working for memory cgroup.)
> >>
> >> Now, we do charge to anonymous page when
> >> - charge(+1) when it's mapped firstly (mapcount 0->1)
> >> - uncharge(-1) it's fully unmapped (mapcount 1->0) vir
> >> page_remove_rmap().
> >>
> >> My quesion is
> >> - PageKSM pages are not necessary to be tracked by memory cgroup ?
> When we reaplacing page using page_replace() we have:
> oldpage - > anonymous page that is going to be replaced by newpage
> newpage -> kernel allocated page (KsmPage)
> so about oldpage we are calling page_remove_rmap() that will notify cgroup
> and about newpage it wont be count inside cgroup beacuse it is file rmap
> page
> (we are calling to page_add_file_rmap), so right now PageKSM wont ever
> be tracked by cgroup.
>
If not in radix-tree, it's not tracked.
(But we don't want to track non-LRU pages which are not freeable.)


> >> - Can we know that "the page is just replaced and we don't necessary
> >> to do
> >> charge/uncharge".
>
> The caller of page_replace does know it, the only problem is that
> page_remove_rmap()
> automaticly change the cgroup for anonymous pages,
> if we want it not to change the cgroup, we can:
> increase the cgroup count before page_remove (but in that case what
> happen if we reach to the limit???)
> give parameter to page_remove_rmap() that we dont want the cgroup to be
> changed.

Hmm, current mem cgroup works via page_cgroup struct to track pages.

page <-> page_cgroup has one-to-one relation ship.

So, "exchanging page" itself causes trouble. But I may be able to provide
necessary hooks to you as I did in page migraiton.

>
> >> - annonymous page from KSM is worth to be tracked by memory cgroup ?
> >> (IOW, it's on LRU and can be swapped-out ?)
>
> KSM have no anonymous pages (it share anonymous pages into KsmPAGE ->
> kernel allocated page without mapping)
> so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be
> break by do_wp_page() the duplication will be able to swap.
>
Ok, thank you for confirmation.

> >>
> >
> > My feeling is that shared pages should be accounted as if they were
> > not shared; that is, a share page should be accounted for each process
> > that shares it. Perhaps sharing within a cgroup should be counted as
> > 1 page for all the ptes pointing to it.
> >
> >

If KSM pages are on radix-tree, it will be accounted automatically.
Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its
own LRU. So, just doing

- inode's radix-tree
- make all pages mlocked.
- provide special page fault handler for your purpose

is simple one. But ok, whatever implementation you'll do, I have to check it
and consider whether it should be tracked or not. Then, add codes to memcg to
track it or ignore it or comments on your patches ;)

It's helpful to add me to CC: when you post this set again.

Thanks,
-Kame





2008-11-13 06:14:01

by Eric Rannaud

[permalink] [raw]
Subject: Re: [PATCH 3/4] add ksm kernel shared memory driver

On Tue, 2008-11-11 at 17:40 -0500, [email protected] wrote:
> On Tue, 11 Nov 2008 15:03:45 MST, Jonathan Corbet said:
> Seems reasonably sane to me - only doing the first 128 bytes rather than
> a full 4K page is some 32 times faster. Yes, you'll have the *occasional*
> case where two pages were identical for 128 bytes but then differed, which is
> why there's buckets. But the vast majority of the time, at least one bit
> will be different in the first part.

In the same spirit, computing a CRC32 instead of a SHA1 would probably
be faster as well (faster to compute, and faster to compare the
digests). The increased rate of collision should be negligible.

Also, the upcoming SSE4.2 (Core i7) has a CRC32 instruction. (Support is
already in the kernel: arch/x86/crypto/crc32c-intel.c)

2008-11-13 10:38:49

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

ציטוט KAMEZAWA Hiroyuki:
> Thank you for answers.
>
> On Wed, 12 Nov 2008 13:11:12 +0200
> Izik Eidus <[email protected]> wrote:
>
>
>> Avi Kivity wrote:
>>
>>> KAMEZAWA Hiroyuki wrote:
>>>
>>>> Can I make a question ? (I'm working for memory cgroup.)
>>>>
>>>> Now, we do charge to anonymous page when
>>>> - charge(+1) when it's mapped firstly (mapcount 0->1)
>>>> - uncharge(-1) it's fully unmapped (mapcount 1->0) vir
>>>> page_remove_rmap().
>>>>
>>>> My quesion is
>>>> - PageKSM pages are not necessary to be tracked by memory cgroup ?
>>>>
>> When we reaplacing page using page_replace() we have:
>> oldpage - > anonymous page that is going to be replaced by newpage
>> newpage -> kernel allocated page (KsmPage)
>> so about oldpage we are calling page_remove_rmap() that will notify cgroup
>> and about newpage it wont be count inside cgroup beacuse it is file rmap
>> page
>> (we are calling to page_add_file_rmap), so right now PageKSM wont ever
>> be tracked by cgroup.
>>
>>
> If not in radix-tree, it's not tracked.
> (But we don't want to track non-LRU pages which are not freeable.)
>

Yes.

>
>
>>>> - Can we know that "the page is just replaced and we don't necessary
>>>> to do
>>>> charge/uncharge".
>>>>
>> The caller of page_replace does know it, the only problem is that
>> page_remove_rmap()
>> automaticly change the cgroup for anonymous pages,
>> if we want it not to change the cgroup, we can:
>> increase the cgroup count before page_remove (but in that case what
>> happen if we reach to the limit???)
>> give parameter to page_remove_rmap() that we dont want the cgroup to be
>> changed.
>>
>
> Hmm, current mem cgroup works via page_cgroup struct to track pages.
>
> page <-> page_cgroup has one-to-one relation ship.
>
> So, "exchanging page" itself causes trouble. But I may be able to provide
> necessary hooks to you as I did in page migraiton.
>

But if we dont track KsmPages, and we call to page_remove_rmap() on the
anonymous page that we
replace, why would it be a problem?
(should everything be correct in that case???)

>
>>>> - annonymous page from KSM is worth to be tracked by memory cgroup ?
>>>> (IOW, it's on LRU and can be swapped-out ?)
>>>>
>> KSM have no anonymous pages (it share anonymous pages into KsmPAGE ->
>> kernel allocated page without mapping)
>> so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be
>> break by do_wp_page() the duplication will be able to swap.
>>
>>
> Ok, thank you for confirmation.
>
>
>>>>
>>>>
>>> My feeling is that shared pages should be accounted as if they were
>>> not shared; that is, a share page should be accounted for each process
>>> that shares it. Perhaps sharing within a cgroup should be counted as
>>> 1 page for all the ptes pointing to it.
>>>
>>>
>>>
>
> If KSM pages are on radix-tree, it will be accounted automatically.
> Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its
> own LRU. So, just doing
>
> - inode's radix-tree
> - make all pages mlocked.
> - provide special page fault handler for your purpose
>

Well in this version that i am going to merge the pages arent going to
be swappable,
Latter after Ksm will get merged we will make the KsmPages swappable...
so i think working with cgroups would be effective / useful only when
KsmPages will start be swappable...
Do you agree?
(What i am saying is that right now lets dont count the KsmPages inside
the cgroup, lets do it when KsmPages
will be swappable)

If you feel this pages should be counted in the cgroup i have no problem
to do it via hooks like page migration is doing.

thanks.

> is simple one. But ok, whatever implementation you'll do, I have to check it
> and consider whether it should be tracked or not. Then, add codes to memcg to
> track it or ignore it or comments on your patches ;)
>
> It's helpful to add me to CC: when you post this set again.
>

Sure will.

> Thanks,
> -Kame
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-11-13 11:33:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

On Thu, 13 Nov 2008 12:38:07 +0200
Izik Eidus <[email protected]> wrote:
> > If KSM pages are on radix-tree, it will be accounted automatically.
> > Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its
> > own LRU. So, just doing
> >
> > - inode's radix-tree
> > - make all pages mlocked.
> > - provide special page fault handler for your purpose
> >
>
> Well in this version that i am going to merge the pages arent going to
> be swappable,
> Latter after Ksm will get merged we will make the KsmPages swappable...
good to hear

> so i think working with cgroups would be effective / useful only when
> KsmPages will start be swappable...
> Do you agree?
> (What i am saying is that right now lets dont count the KsmPages inside
> the cgroup, lets do it when KsmPages
> will be swappable)
>
ok.

> If you feel this pages should be counted in the cgroup i have no problem
> to do it via hooks like page migration is doing.
>
> thanks.
>
> > is simple one. But ok, whatever implementation you'll do, I have to check it
> > and consider whether it should be tracked or not. Then, add codes to memcg to
> > track it or ignore it or comments on your patches ;)
> >
> > It's helpful to add me to CC: when you post this set again.
> >
>
> Sure will.
>

If necessary, I'll have to add "ignore in this case" hook in memcg.
(ex. checking PageKSM flag in memcg.)

If you are sufferred from memcg in your test, cgroup_disable=memory boot option
will allow you to disable memcg.


Thanks,
-Kame