Message-ID: <49D20AE1.4060802@redhat.com>
Date: Tue, 31 Mar 2009 15:21:53 +0300
From: Izik Eidus
To: KAMEZAWA Hiroyuki
CC: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org, avi@redhat.com, aarcange@redhat.com, chrisw@redhat.com, riel@redhat.com, jeremy@goop.org, mtosatti@redhat.com, hugh@veritas.com, corbet@lwn.net, yaniv@redhat.com, dmonakhov@openvz.org
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.
In-Reply-To: <20090331111510.dbb712d2.kamezawa.hiroyu@jp.fujitsu.com>

KAMEZAWA Hiroyuki wrote:
> On Tue, 31 Mar 2009 02:59:20 +0300
> Izik Eidus wrote:
>
>> Ksm is a driver that allows merging identical pages between one or more
>> applications, in a way invisible to the applications that use it.
>> Pages that are merged are marked as read-only and are COWed when any
>> application tries to change them.
>>
>> Ksm is used for cases where using fork() is not suitable;
>> one such case is where the pages of the application keep changing
>> dynamically and the application cannot know in advance which pages are
>> going to be identical.
>>
>> Ksm works by walking over the memory pages of the applications it
>> scans in order to find identical pages.
>> It uses two sorted data structures, called the stable and unstable trees,
>> to find identical pages in an efficient way.
>>
>> When ksm finds two identical pages, it marks them as read-only and merges
>> them into a single page;
>> after the pages are marked as read-only and merged into one page, Linux
>> will treat these pages as normal copy-on-write pages and will copy them
>> when a write access to them happens.
>>
>> Ksm scans just the memory areas that were registered to be scanned by it.
>>
>> Ksm api:
>>
>> KSM_GET_API_VERSION:
>> Give userspace the api version of the module.
>>
>> KSM_CREATE_SHARED_MEMORY_AREA:
>> Create a shared memory region fd, which later allows the user to register
>> the memory regions to be scanned by using:
>> KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION
>>
>> KSM_START_STOP_KTHREAD:
>> Control the ksm kernel thread; the parameters are passed
>> using the ksm_kthread_info structure:
>> ksm_kthread_info:
>> __u32 sleep:
>> number of microseconds to sleep between each iteration of
>> scanning.
>>
>> __u32 pages_to_scan:
>> number of pages to scan for each iteration of scanning.
>>
>> __u32 max_pages_to_merge:
>> maximum number of pages to merge in each iteration of scanning
>> (so even if there are still more pages to scan, we stop this
>> iteration)
>>
>> __u32 flags:
>> flags to control ksmd (right now just ksm_control_flags_run
>> is available)
>>
>> KSM_REGISTER_MEMORY_REGION:
>> Register a userspace virtual address range to be scanned by ksm.
>> This ioctl uses the ksm_memory_region structure:
>> ksm_memory_region:
>> __u32 npages;
>> number of pages to share inside this memory region.
>> __u32 pad;
>> __u64 addr:
>> the beginning of the virtual address of this region.
>>
>> KSM_REMOVE_MEMORY_REGION:
>> Remove a memory region from ksm.
>>
>> Signed-off-by: Izik Eidus
>> ---
>> include/linux/ksm.h | 69 +++
>> include/linux/miscdevice.h | 1 +
>> mm/Kconfig | 6 +
>> mm/Makefile | 1 +
>> mm/ksm.c | 1431 ++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 1508 insertions(+), 0 deletions(-)
>> create mode 100644 include/linux/ksm.h
>> create mode 100644 mm/ksm.c
>>
>> diff --git a/include/linux/ksm.h b/include/linux/ksm.h >> new file mode 100644 >> index 0000000..5776dce >> --- /dev/null >> +++ b/include/linux/ksm.h >> @@ -0,0 +1,69 @@ >> +#ifndef __LINUX_KSM_H >> +#define __LINUX_KSM_H >> + >> +/* >> + * Userspace interface for /dev/ksm - kvm shared memory >> + */ >> + >> +#include >> +#include >> + >> +#include >> + >> +#define KSM_API_VERSION 1 >> + >> +#define ksm_control_flags_run 1 >> + >> +/* for KSM_REGISTER_MEMORY_REGION */ >> +struct ksm_memory_region { >> + __u32 npages; /* number of pages to share */ >> + __u32 pad; >> + __u64 addr; /* the beginning of the virtual address */ >> + __u64 reserved_bits; >> +}; >> + >> +struct ksm_kthread_info { >> + __u32 sleep; /* number of microseconds to sleep */ >> + __u32 pages_to_scan; /* number of pages to scan */ >> + __u32 flags; /* control flags */ >> + __u32 pad; >> + __u64 reserved_bits; >> +}; >> + >> +#define KSMIO 0xAB >> + >> +/* ioctls for /dev/ksm */ >> + >> +#define KSM_GET_API_VERSION _IO(KSMIO, 0x00) >> +/* >> + * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd >> + */ >> +#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */ >> +/* >> + * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed >> + * (can stop the kernel thread from working by setting running = 0) >> + */ >> +#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x02,\ >> + struct ksm_kthread_info) >> +/* >> + * KSM_GET_INFO_KTHREAD - return information about the kernel thread >> + * scanning speed. >> + */ >> +#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x03,\ >> + struct ksm_kthread_info) >> + >> + >> +/* ioctls for SMA fds */ >> + >> +/* >> + * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be >> + * scanned by ksm. >> + */ >> +#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\ >> + struct ksm_memory_region) >> +/* >> + * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
>> + */ >> +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21) >> + >> +#endif >> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h >> index a820f81..6d4f8df 100644 >> --- a/include/linux/miscdevice.h >> +++ b/include/linux/miscdevice.h >> @@ -29,6 +29,7 @@ >> #define HPET_MINOR 228 >> #define FUSE_MINOR 229 >> #define KVM_MINOR 232 >> +#define KSM_MINOR 233 >> #define MISC_DYNAMIC_MINOR 255 >> >> struct device; >> diff --git a/mm/Kconfig b/mm/Kconfig >> index a5b7781..2818223 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -216,3 +216,9 @@ config UNEVICTABLE_LRU >> >> config MMU_NOTIFIER >> bool >> + >> +config KSM >> + tristate "Enable KSM for page sharing" >> + help >> + Enable the KSM kernel module to allow page sharing of equal pages >> + among different tasks. >> diff --git a/mm/Makefile b/mm/Makefile >> index 72255be..e3bf7bf 100644 >> --- a/mm/Makefile >> +++ b/mm/Makefile >> @@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o >> obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o >> obj-$(CONFIG_SLOB) += slob.o >> obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o >> +obj-$(CONFIG_KSM) += ksm.o >> obj-$(CONFIG_SLAB) += slab.o >> obj-$(CONFIG_SLUB) += slub.o >> obj-$(CONFIG_FAILSLAB) += failslab.o >> diff --git a/mm/ksm.c b/mm/ksm.c >> new file mode 100644 >> index 0000000..eba4c09 >> --- /dev/null >> +++ b/mm/ksm.c >> @@ -0,0 +1,1431 @@ >> +/* >> + * Memory merging driver for Linux >> + * >> + * This module enables dynamic sharing of identical pages found in different >> + * memory areas, even if they are not shared by fork() >> + * >> + * Copyright (C) 2008 Red Hat, Inc. >> + * Authors: >> + * Izik Eidus >> + * Andrea Arcangeli >> + * Chris Wright >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2. 
>> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include >> + >> +MODULE_AUTHOR("Red Hat, Inc."); >> +MODULE_LICENSE("GPL"); >> + >> +static int rmap_hash_size; >> +module_param(rmap_hash_size, int, 0); >> +MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping"); >> + >> +/* >> + * ksm_mem_slot - hold information for an userspace scanning range >> + * (the scanning for this region will be from addr untill addr + >> + * npages * PAGE_SIZE inside mm) >> + */ >> +struct ksm_mem_slot { >> + struct list_head link; >> + struct list_head sma_link; >> + struct mm_struct *mm; >> + unsigned long addr; /* the begining of the virtual address */ >> + unsigned npages; /* number of pages to share */ >> +}; >> + >> +/* >> + * ksm_sma - shared memory area, each process have its own sma that contain the >> + * information about the slots that it own >> + */ >> +struct ksm_sma { >> + struct list_head sma_slots; >> +}; >> + >> +/** >> + * struct ksm_scan - cursor for scanning >> + * @slot_index: the current slot we are scanning >> + * @page_index: the page inside the sma that is currently being scanned >> + * >> + * ksm uses it to know what are the next pages it need to scan >> + */ >> +struct ksm_scan { >> + struct ksm_mem_slot *slot_index; >> + unsigned long page_index; >> +}; >> + >> +/* >> + * Few notes about ksm scanning progress (make it easier to understand the >> + * data structures below): >> + * >> + * In order to reduce excessive scanning, ksm sort the memory pages by their >> + * contents into a data strcture that hold pointer into the pages. >> + * >> + * Since the contents of the pages may change at any moment, ksm cant just >> + * insert the pages into normal sorted tree and expect it to find anything. >> + * >> + * For this purpuse ksm use two data strctures - stable and unstable trees, >> + * the stable tree hold pointers into all the merged pages (KsmPage) sorted by >> + * their contents, beacuse that each such page have to be write-protected, >> + * searching on this tree is fully assuranced to be working and therefore this >> + * tree is called the stable tree. >> + * >> + * In addition to the stable tree, ksm use another data strcture called the >> + * unstable tree, this specific tree hold pointers into pages that have >> + * been found to be "unchanged for period of time", the unstable tree sort this >> + * pages by their contents, but given the fact that this pages are not >> + * write-protected, ksm cant trust the unstable tree to be fully assuranced to >> + * work. >> + * For the reason that the unstable tree would become corrupted when some of >> + * the page inside itself would change, the tree is called unstable. >> + * Ksm solve this problem by two ways: >> + * 1) the unstable tree get flushed every time ksm finish to scan the whole >> + * memory, and then the tree is rebuild from the begining. >> + * 2) Ksm will only insert into the unstable tree, pages that their hash value >> + * was not changed during the whole progress of one circuler scanning of the >> + * memory. 
>> + * 3) The unstable tree is RedBlack Tree - meaning its balancing is based on >> + * the colors of the nodes and not their content, this assure that even when >> + * the tree get "corrupted" we wont get out of balance and the timing of >> + * scanning is the same, another issue is that searching and inserting nodes >> + * into rbtree is the same algorithem, therefore we have no overhead when we >> + * flush the tree and rebuild it. >> + * 4) Ksm never flush the stable tree, this mean that even if it would take 10 >> + * times to find page inside the unstable tree, as soon as we would find it, >> + * it will be secured inside the stable tree, >> + * (When we scan new page, we first compare it against the stable tree, and >> + * then against the unstable tree) >> + */ >> + >> +struct rmap_item; >> + >> +/* >> + * tree_item - object of the stable and unstable trees >> + */ >> +struct tree_item { >> + struct rb_node node; >> + struct rmap_item *rmap_item; >> +}; >> + >> +/* >> + * rmap_item - object of the rmap_hash hash table >> + * (it is holding the previous hash value (oldindex), >> + * pointer into the page_hash_item, and pointer into the tree_item) >> + */ >> + >> +/** >> + * struct rmap_item - reverse mapping item for virtual addresses >> + * @link: link into the rmap_hash hash table. >> + * @mm: the memory strcture the rmap_item is pointing to. >> + * @address: the virtual address the rmap_item is pointing to. >> + * @oldchecksum: old checksum result for the page belong the virtual address >> + * @stable_tree: when 1 rmap_item is used for stable_tree, 0 unstable tree >> + * @tree_item: pointer into the stable/unstable tree that hold the virtual >> + * address that the rmap_item is pointing to. >> + * @next: the next rmap item inside the stable/unstable tree that have that is >> + * found inside the same tree node. >> + */ >> + >> +struct rmap_item { >> + struct hlist_node link; >> + struct mm_struct *mm; >> + unsigned long address; >> + unsigned int oldchecksum; /* old checksum value */ >> + unsigned char stable_tree; /* 1 stable_tree 0 unstable tree */ >> + struct tree_item *tree_item; >> + struct rmap_item *next; >> + struct rmap_item *prev; >> +}; >> + >> +/* >> + * slots is linked list that hold all the memory regions that were registred >> + * to be scanned. >> + */ >> +static LIST_HEAD(slots); >> +/* >> + * slots_lock protect against removing and adding memory regions while a scanner >> + * is in the middle of scanning. >> + */ >> +static DECLARE_RWSEM(slots_lock); >> + >> +/* The stable and unstable trees heads. 
*/ >> +struct rb_root root_stable_tree = RB_ROOT; >> +struct rb_root root_unstable_tree = RB_ROOT; >> + >> + >> +/* The number of linked list members inside the hash table */ >> +static int nrmaps_hash; >> +/* rmap_hash hash table */ >> +static struct hlist_head *rmap_hash; >> + >> +static struct kmem_cache *tree_item_cache; >> +static struct kmem_cache *rmap_item_cache; >> + >> +static int kthread_sleep; /* sleep time of the kernel thread */ >> +static int kthread_pages_to_scan; /* npages to scan for the kernel thread */ >> +static struct ksm_scan kthread_ksm_scan; >> +static int ksmd_flags; >> +static struct task_struct *kthread; >> +static DECLARE_WAIT_QUEUE_HEAD(kthread_wait); >> +static DECLARE_RWSEM(kthread_lock); >> + >> +static int ksm_slab_init(void) >> +{ >> + int ret = -ENOMEM; >> + >> + tree_item_cache = KMEM_CACHE(tree_item, 0); >> + if (!tree_item_cache) >> + goto out; >> + >> + rmap_item_cache = KMEM_CACHE(rmap_item, 0); >> + if (!rmap_item_cache) >> + goto out_free; >> + >> + return 0; >> + >> +out_free: >> + kmem_cache_destroy(tree_item_cache); >> +out: >> + return ret; >> +} >> + >> +static void ksm_slab_free(void) >> +{ >> + kmem_cache_destroy(rmap_item_cache); >> + kmem_cache_destroy(tree_item_cache); >> +} >> + >> +static inline struct tree_item *alloc_tree_item(void) >> +{ >> + return kmem_cache_zalloc(tree_item_cache, GFP_KERNEL); >> +} >> + >> +static void free_tree_item(struct tree_item *tree_item) >> +{ >> + kmem_cache_free(tree_item_cache, tree_item); >> +} >> + >> +static inline struct rmap_item *alloc_rmap_item(void) >> +{ >> + return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL); >> +} >> + >> +static inline void free_rmap_item(struct rmap_item *rmap_item) >> +{ >> + kmem_cache_free(rmap_item_cache, rmap_item); >> +} >> + >> +/* >> + * PageKsm - this type of pages are the write protected pages that ksm map >> + * into multiple vmas (this is the "shared page") >> + * this page was allocated using alloc_page(), and every pte that point to it >> + * is always write protected (therefore its data content cant ever be changed) >> + * and this page cant be swapped. >> + */ >> +static inline int PageKsm(struct page *page) >> +{ >> + /* >> + * When ksm create new shared page, it create kernel allocated page >> + * using alloc_page(), therefore this page is not anonymous, taking into >> + * account that ksm scan just anonymous pages, we can relay on the fact >> + * that each time we see !PageAnon(page) we are hitting shared page. 
>> + */ >> + return !PageAnon(page); >> +} >> + >> +static int rmap_hash_init(void) >> +{ >> + if (!rmap_hash_size) { >> + struct sysinfo sinfo; >> + >> + si_meminfo(&sinfo); >> + rmap_hash_size = sinfo.totalram / 10; >> + } >> + nrmaps_hash = rmap_hash_size; >> + rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head)); >> + if (!rmap_hash) >> + return -ENOMEM; >> + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head)); >> + return 0; >> +} >> + >> +static void rmap_hash_free(void) >> +{ >> + int i; >> + struct hlist_head *bucket; >> + struct hlist_node *node, *n; >> + struct rmap_item *rmap_item; >> + >> + for (i = 0; i < nrmaps_hash; ++i) { >> + bucket = &rmap_hash[i]; >> + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) { >> + hlist_del(&rmap_item->link); >> + free_rmap_item(rmap_item); >> + } >> + } >> + vfree(rmap_hash); >> +} >> + >> +static inline u32 calc_checksum(struct page *page) >> +{ >> + u32 checksum; >> + void *addr = kmap_atomic(page, KM_USER0); >> + checksum = jhash(addr, PAGE_SIZE, 17); >> + kunmap_atomic(addr, KM_USER0); >> + return checksum; >> +} >> + >> +/* >> + * Return rmap_item for a given virtual address. >> + */ >> +static struct rmap_item *get_rmap_item(struct mm_struct *mm, unsigned long addr) >> +{ >> + struct rmap_item *rmap_item; >> + struct hlist_head *bucket; >> + struct hlist_node *node; >> + >> + bucket = &rmap_hash[addr % nrmaps_hash]; >> + hlist_for_each_entry(rmap_item, node, bucket, link) { >> + if (mm == rmap_item->mm && rmap_item->address == addr) { >> + return rmap_item; >> + } >> + } >> + return NULL; >> +} >> + >> +/* >> + * Removing rmap_item from stable or unstable tree. >> + * This function will free the rmap_item object, and if that rmap_item was >> + * insde the stable or unstable trees, it would remove the link from there >> + * as well. 
>> + */ >> +static void remove_rmap_item_from_tree(struct rmap_item *rmap_item) >> +{ >> + struct tree_item *tree_item; >> + >> + tree_item = rmap_item->tree_item; >> + rmap_item->tree_item = NULL; >> + >> + if (rmap_item->stable_tree) { >> + if (rmap_item->prev) { >> + BUG_ON(rmap_item->prev->next != rmap_item); >> + rmap_item->prev->next = rmap_item->next; >> + } >> + if (rmap_item->next) { >> + BUG_ON(rmap_item->next->prev != rmap_item); >> + rmap_item->next->prev = rmap_item->prev; >> + } >> + } >> + >> + if (tree_item) { >> + if (rmap_item->stable_tree) { >> + if (!rmap_item->next && !rmap_item->prev) { >> + rb_erase(&tree_item->node, &root_stable_tree); >> + free_tree_item(tree_item); >> + } else if (!rmap_item->prev) { >> + tree_item->rmap_item = rmap_item->next; >> + } else { >> + tree_item->rmap_item = rmap_item->prev; >> + } >> + } else if (!rmap_item->stable_tree) { >> + free_tree_item(tree_item); >> + } >> + } >> + >> + hlist_del(&rmap_item->link); >> + free_rmap_item(rmap_item); >> +} >> + >> +static void remove_page_from_tree(struct mm_struct *mm, >> + unsigned long addr) >> +{ >> + struct rmap_item *rmap_item; >> + >> + rmap_item = get_rmap_item(mm, addr); >> + if (!rmap_item) >> + return; >> + remove_rmap_item_from_tree(rmap_item); >> + return; >> +} >> + >> +static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma, >> + struct ksm_memory_region *mem) >> +{ >> + struct ksm_mem_slot *slot; >> + int ret = -EPERM; >> + >> + slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL); >> + if (!slot) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + >> + slot->mm = get_task_mm(current); >> + if (!slot->mm) >> + goto out_free; >> + slot->addr = mem->addr; >> + slot->npages = mem->npages; >> + >> + down_write(&slots_lock); >> + >> + list_add_tail(&slot->link, &slots); >> + list_add_tail(&slot->sma_link, &ksm_sma->sma_slots); >> + >> + up_write(&slots_lock); >> + return 0; >> + >> +out_free: >> + kfree(slot); >> +out: >> + return ret; >> +} >> + >> +static void remove_mm_from_hash_and_tree(struct mm_struct *mm) >> +{ >> + struct ksm_mem_slot *slot; >> + int pages_count; >> + >> + list_for_each_entry(slot, &slots, link) >> + if (slot->mm == mm) >> + break; >> + BUG_ON(!slot); >> + >> + root_unstable_tree = RB_ROOT; >> + for (pages_count = 0; pages_count < slot->npages; ++pages_count) >> + remove_page_from_tree(mm, slot->addr + >> + pages_count * PAGE_SIZE); >> + list_del(&slot->link); >> +} >> + >> +static int ksm_sma_ioctl_remove_memory_region(struct ksm_sma *ksm_sma) >> +{ >> + struct ksm_mem_slot *slot, *node; >> + >> + down_write(&slots_lock); >> + list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) { >> + remove_mm_from_hash_and_tree(slot->mm); >> + mmput(slot->mm); >> + list_del(&slot->sma_link); >> + kfree(slot); >> + } >> + up_write(&slots_lock); >> + return 0; >> +} >> + >> +static int ksm_sma_release(struct inode *inode, struct file *filp) >> +{ >> + struct ksm_sma *ksm_sma = filp->private_data; >> + int r; >> + >> + r = ksm_sma_ioctl_remove_memory_region(ksm_sma); >> + kfree(ksm_sma); >> + return r; >> +} >> + >> +static long ksm_sma_ioctl(struct file *filp, >> + unsigned int ioctl, unsigned long arg) >> +{ >> + struct ksm_sma *sma = filp->private_data; >> + void __user *argp = (void __user *)arg; >> + int r = EINVAL; >> + >> + switch (ioctl) { >> + case KSM_REGISTER_MEMORY_REGION: { >> + struct ksm_memory_region ksm_memory_region; >> + >> + r = -EFAULT; >> + if (copy_from_user(&ksm_memory_region, argp, >> + sizeof(ksm_memory_region))) >> + goto 
out; >> + r = ksm_sma_ioctl_register_memory_region(sma, >> + &ksm_memory_region); >> + break; >> + } >> + case KSM_REMOVE_MEMORY_REGION: >> + r = ksm_sma_ioctl_remove_memory_region(sma); >> + break; >> + } >> + >> +out: >> + return r; >> +} >> + >> +static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page) >> +{ >> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); >> + unsigned long addr; >> + >> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); >> + if (unlikely(addr < vma->vm_start || addr >= vma->vm_end)) >> + return -EFAULT; >> + return addr; >> +} >> + >> +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr) >> +{ >> + pgd_t *pgd; >> + pud_t *pud; >> + pmd_t *pmd; >> + pte_t *ptep = NULL; >> + >> + pgd = pgd_offset(mm, addr); >> + if (!pgd_present(*pgd)) >> + goto out; >> + >> + pud = pud_offset(pgd, addr); >> + if (!pud_present(*pud)) >> + goto out; >> + >> + pmd = pmd_offset(pud, addr); >> + if (!pmd_present(*pmd)) >> + goto out; >> + >> + ptep = pte_offset_map(pmd, addr); >> +out: >> + return ptep; >> +} >> + >> +static int is_present_pte(struct mm_struct *mm, unsigned long addr) >> +{ >> + pte_t *ptep; >> + int r; >> + >> + ptep = get_pte(mm, addr); >> + if (!ptep) >> + return 0; >> + >> + r = pte_present(*ptep); >> + pte_unmap(ptep); >> + >> + return r; >> +} >> + >> +static int memcmp_pages(struct page *page1, struct page *page2) >> +{ >> + char *addr1, *addr2; >> + int r; >> + >> + addr1 = kmap_atomic(page1, KM_USER0); >> + addr2 = kmap_atomic(page2, KM_USER1); >> + r = memcmp(addr1, addr2, PAGE_SIZE); >> + kunmap_atomic(addr1, KM_USER0); >> + kunmap_atomic(addr2, KM_USER1); >> + return r; >> +} >> + >> +/* pages_identical >> + * return 1 if identical, 0 otherwise. >> + */ >> +static inline int pages_identical(struct page *page1, struct page *page2) >> +{ >> + return !memcmp_pages(page1, page2); >> +} >> + >> +/* >> + * try_to_merge_one_page - take two pages and merge them into one >> + * @mm: mm_struct that hold vma pointing into oldpage >> + * @vma: the vma that hold the pte pointing into oldpage >> + * @oldpage: the page that we want to replace with newpage >> + * @newpage: the page that we want to map instead of oldpage >> + * @newprot: the new permission of the pte inside vma >> + * note: >> + * oldpage should be anon page while newpage should be file mapped page >> + * >> + * this function return 0 if the pages were merged, 1 otherwise. 
>> + */ >> +static int try_to_merge_one_page(struct mm_struct *mm, >> + struct vm_area_struct *vma, >> + struct page *oldpage, >> + struct page *newpage, >> + pgprot_t newprot) >> +{ >> + int ret = 1; >> + int odirect_sync; >> + unsigned long page_addr_in_vma; >> + pte_t orig_pte, *orig_ptep; >> + >> + get_page(newpage); >> + get_page(oldpage); >> + >> + down_read(&mm->mmap_sem); >> + >> + page_addr_in_vma = addr_in_vma(vma, oldpage); >> + if (page_addr_in_vma == -EFAULT) >> + goto out_unlock; >> + >> + orig_ptep = get_pte(mm, page_addr_in_vma); >> + if (!orig_ptep) >> + goto out_unlock; >> + orig_pte = *orig_ptep; >> + pte_unmap(orig_ptep); >> + if (!pte_present(orig_pte)) >> + goto out_unlock; >> + if (page_to_pfn(oldpage) != pte_pfn(orig_pte)) >> + goto out_unlock; >> + /* >> + * we need the page lock to read a stable PageSwapCache in >> + * page_wrprotect() >> + */ >> + if (!trylock_page(oldpage)) >> + goto out_unlock; >> + /* >> + * page_wrprotect check if the page is swapped or in swap cache, >> + * in the future we might want to run here if_present_pte and then >> + * swap_free >> + */ >> + if (!page_wrprotect(oldpage, &odirect_sync, 2)) { >> + unlock_page(oldpage); >> + goto out_unlock; >> + } >> + unlock_page(oldpage); >> + if (!odirect_sync) >> + goto out_unlock; >> + >> + orig_pte = pte_wrprotect(orig_pte); >> + >> + if (pages_identical(oldpage, newpage)) >> + ret = replace_page(vma, oldpage, newpage, orig_pte, newprot); >> + >> +out_unlock: >> + up_read(&mm->mmap_sem); >> + put_page(oldpage); >> + put_page(newpage); >> + return ret; >> +} >> + >> +/* >> + * try_to_merge_two_pages - take two identical pages and prepare them to be >> + * merged into one page. >> + * >> + * this function return 0 if we successfully mapped two identical pages into one >> + * page, 1 otherwise. >> + * (note in case we created KsmPage and mapped one page into it but the second >> + * page was not mapped we consider it as a failure and return 1) >> + */ >> +static int try_to_merge_two_pages(struct mm_struct *mm1, struct page *page1, >> + struct mm_struct *mm2, struct page *page2, >> + unsigned long addr1, unsigned long addr2) >> +{ >> + struct vm_area_struct *vma; >> + pgprot_t prot; >> + int ret = 1; >> + >> + /* >> + * If page2 isn't shared (it isn't PageKsm) we have to allocate a new >> + * file mapped page and make the two ptes of mm1(page1) and mm2(page2) >> + * point to it. 
If page2 is shared, we can just make the pte of >> + * mm1(page1) point to page2 >> + */ >> + if (PageKsm(page2)) { >> + down_read(&mm1->mmap_sem); >> + vma = find_vma(mm1, addr1); >> + up_read(&mm1->mmap_sem); >> + if (!vma) >> + return ret; >> + prot = vma->vm_page_prot; >> + pgprot_val(prot) &= ~_PAGE_RW; >> + ret = try_to_merge_one_page(mm1, vma, page1, page2, prot); >> + } else { >> + struct page *kpage; >> + >> + kpage = alloc_page(GFP_HIGHUSER); >> + if (!kpage) >> + return ret; >> + down_read(&mm1->mmap_sem); >> + vma = find_vma(mm1, addr1); >> + up_read(&mm1->mmap_sem); >> + if (!vma) { >> + put_page(kpage); >> + return ret; >> + } >> + prot = vma->vm_page_prot; >> + pgprot_val(prot) &= ~_PAGE_RW; >> + >> + copy_user_highpage(kpage, page1, addr1, vma); >> + ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot); >> + >> + if (!ret) { >> + down_read(&mm2->mmap_sem); >> + vma = find_vma(mm2, addr2); >> + up_read(&mm2->mmap_sem); >> + if (!vma) { >> + put_page(kpage); >> + ret = 1; >> + return ret; >> + } >> + >> + prot = vma->vm_page_prot; >> + pgprot_val(prot) &= ~_PAGE_RW; >> + >> + ret = try_to_merge_one_page(mm2, vma, page2, kpage, >> + prot); >> + /* >> + * If the second try_to_merge_one_page call failed, >> + * we are in a situation where we have a Ksm page that has >> + * just one pte pointing to it, in this case we break >> + * it. >> + */ >> + if (ret) { >> + struct page *tmppage[1]; >> + >> + down_read(&mm1->mmap_sem); >> + if (get_user_pages(current, mm1, addr1, 1, 1, >> + 0, tmppage, NULL)) { >> + put_page(tmppage[0]); >> + } >> + up_read(&mm1->mmap_sem); >> + } >> + } >> + put_page(kpage); >> + } >> + return ret; >> +} >>
>
> I'm sorry if I'm wrong. Is the above "kpage" free from the global LRU and never
> reclaimed (swapped out) by the global LRU?
>

kpage is actually what is going to be the KsmPage -> the shared page...
Right now these pages are not swappable; after ksm is merged we will make
these pages swappable as well.

> If so, please
> - show the amount of kpages
>
> - allow users to set a limit for the usage of kpages, or preserve kpages at boot
> or by user's command.
>

kpages actually save memory, and limiting their number would limit the
number of shared pages...

> Thanks,
> -Kame
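
For reference, here is a rough userspace sketch (not part of the patch) of how
the /dev/ksm interface described above could be driven. It assumes the quoted
ksm.h is installed as <linux/ksm.h> and that KSM_GET_API_VERSION returns the
version as the ioctl return value; the page count, fill pattern, sleep time and
pages_to_scan values are arbitrary illustrations, and error handling is
abbreviated.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ksm.h>	/* the header added by this patch (assumed installed) */

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long npages = 256;
	size_t len = npages * page_size;
	struct ksm_memory_region region;
	struct ksm_kthread_info info;
	void *buf;
	int ksm_fd, sma_fd;

	ksm_fd = open("/dev/ksm", O_RDWR);
	if (ksm_fd < 0) {
		perror("open /dev/ksm");
		return 1;
	}
	printf("ksm api version: %d\n", ioctl(ksm_fd, KSM_GET_API_VERSION));

	/* Anonymous memory filled with identical pages - a best case for ksm. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, len);

	/* Create an SMA fd and register the range for scanning. */
	sma_fd = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA);
	if (sma_fd < 0) {
		perror("KSM_CREATE_SHARED_MEMORY_AREA");
		return 1;
	}
	memset(&region, 0, sizeof(region));
	region.npages = npages;
	region.addr = (unsigned long)buf;
	if (ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region) < 0)
		perror("KSM_REGISTER_MEMORY_REGION");

	/* Start ksmd: scan 64 pages per iteration, sleep 5000us in between. */
	memset(&info, 0, sizeof(info));
	info.sleep = 5000;
	info.pages_to_scan = 64;
	info.flags = ksm_control_flags_run;
	if (ioctl(ksm_fd, KSM_START_STOP_KTHREAD, &info) < 0)
		perror("KSM_START_STOP_KTHREAD");

	sleep(10);	/* give ksmd time to merge the identical pages */

	ioctl(sma_fd, KSM_REMOVE_MEMORY_REGION);
	close(sma_fd);
	close(ksm_fd);
	munmap(buf, len);
	return 0;
}

Note that explicitly calling KSM_REMOVE_MEMORY_REGION is optional here:
as ksm_sma_release() in the patch shows, closing the SMA fd also removes
its registered regions.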