2009-04-09 04:02:00

by Izik Eidus

Subject: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

From v2 to v3:

1) Remove the unnecessary is_dirty_pte() check inside PageKsm().
We had added the is_dirty_pte() check to protect against the
reuse case inside do_wp_page().
Andrea pointed out that such a condition can never happen:
if VM_SHARED is set, no anonymous page can be in the vma, so such
a page can never become a KsmPage, and therefore KsmPages never
trigger the reuse case.
(See "From v1 to v2" below for more info.)

2) Add a !vm_file check in addition to PageKsm() when checking for a shared page.
Until now ksm decided whether pages are shared pages (KsmPages)
by just running get_user_pages() and checking that the page is not
an anonymous page.
The problem arises because ksm keeps virtual addresses inside its data
structures: if the user frees a page and a new !PageAnon page gets
mapped there, ksm might think this page is a shared page.
To solve this problem we have added an additional check: we verify
that vma->vm_file is NULL. When we see a virtual address whose
vma->vm_file is NULL and the page it points to is not an anonymous
page, we can safely conclude that this is a shared page (KsmPage).

3) Replace jhash() with jhash2().
Andrey Panin pointed out that we should use jhash2() as it is faster
than jhash().

Thanks.

(Below is info from previous posts)

From v1 to v2:

1) Fixed a security issue found by Chris Wright:
Ksm was checking whether a page is a shared page by running !PageAnon.
Because ksm scans only anonymous memory, all !PageAnon pages
inside ksm's data structures are shared pages. However, there is
a case in do_wp_page() when VM_SHARED is used where do_wp_page(),
instead of copying the page into a new anonymous page,
would reuse the page. It was fixed by adding a check for the
dirty bit of the virtual addresses pointing into the shared page.
I could not find any VM code that would clear the dirty bit of
this virtual address (due to the fact that we allocate the page
using alloc_page() - kernel allocated pages), ~but I still want
confirmation about this from the vm guys - thanks.~

2) Moved to sysfs to control ksm:
It was requested as a better way to control the ksm scanning
thread than ioctls.
The sysfs api (a small example of driving it follows the list):
dir: /sys/kernel/mm/ksm/

kernel_pages_allocated - how many kernel pages ksm has allocated;
these pages are not swappable, and each such page is used by ksm
to share pages with identical content

pages_shared - how many pages were shared by ksm

run - set to 1 when you want ksm to run, 0 when not

max_kernel_pages - the maximum number of kernel pages that may be
allocated by ksm; set to 0 for unlimited

pages_to_scan - how many pages to scan before ksm sleeps

sleep - how many usecs ksm sleeps
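
For illustration, a control tool could drive these files along these
lines (a minimal userspace sketch, assuming sysfs is mounted at /sys and
the ksm module is loaded; the values are arbitrary):

#include <stdio.h>

/* write a single integer value into one of the ksm sysfs files */
static int ksm_write(const char *name, long val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	ksm_write("max_kernel_pages", 8192);	/* cap on unswappable kernel pages */
	ksm_write("pages_to_scan", 128);	/* pages scanned per wakeup */
	ksm_write("sleep", 5000);		/* usecs to sleep between batches */
	return ksm_write("run", 1);		/* start the scanner thread */
}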

3) Add a sysfs parameter to control the maximum number of kernel pages
to be allocated by ksm.

4) Add statistics about how many pages are really shared.


One issue still to be discussed:
There was a suggestion to use madvise(SHAREABLE) instead of
ioctls to register memory that needs to be scanned by ksm.
Such a change is outside the area of ksm.c and would require adding
a new madvise API and changing some parts of the vm and kernel
code, so the first thing to do is decide whether we really want this.

I don't know of any other open issues.

Thanks.

This is from the first post:
(The kvm part, together with the kvm-userspace part, was posted with v1
about a week ago; whoever wants to test ksm may download the
patches from the lkml archive.)

KSM is a linux driver that allows dynamically sharing identical memory
pages between one or more processes.

Unlike traditional page sharing that is done at the allocation of the
memory, ksm does it dynamically after the memory was created.
Memory is periodically scanned; identical pages are identified and
merged.
The sharing is unnoticeable by the processes that use this memory.
(The shared pages are marked as readonly, and in case of a write
do_wp_page() takes care of creating a new copy of the page.)

To find identical pages ksm uses an algorithm that is split into three
primary levels:

1) Ksm starts scanning the memory and calculates a checksum for each
page that is registered to be scanned.
(In the first round of the scanning, ksm only calculates
this checksum for all the pages.)

2) Ksm walks the whole memory again and recalculates the
checksum of the pages; pages that are found to have the same
checksum value are considered "pages that most likely
won't change".
Ksm inserts these pages into an RB-tree sorted by page content that
is called the "unstable tree". The reason this tree is called
unstable is that the page contents might change
while they are still inside the tree, and therefore the tree could
become corrupted.
Due to this problem ksm takes two more steps in addition to the
checksum calculation:
a) Ksm throws away and recreates the entire unstable tree on each round
of memory scanning - so if we have corruption, it will be fixed
when we rebuild the tree.
b) Ksm uses an RB-tree, whose balancing is based on the node colors
and not on the content, so even if a page gets corrupted, it still
takes the same amount of time to search the tree.

3) In addition to the unstable tree, ksm holds another tree called the
"stable tree" - this tree is an RB-tree that is sorted by page
content and all its pages are write protected, and therefore it can't get
corrupted.
Each time ksm finds two identical pages using the unstable tree,
it creates a new write-protected shared page, and this page is
inserted into the stable tree and saved there. The
stable tree, unlike the unstable tree, is never thrown away, so each
page that we find is saved inside it.

Taking into account the three levels described above, the algorithm
works like this:

search primary tree (sorted by entire page contents, pages write protected)
- if match found, merge
- if no match found...
- search secondary tree (sorted by entire page contents, pages not write
protected)
- if match found, merge
- remove from secondary tree and insert merged page into primary tree
- if no match found...
- checksum
- if checksum hasn't changed
- insert into secondary tree
- if it has, store updated checksum (note: first time this page
is handled it won't have a checksum, so checksum will appear
as "changed", so it takes two passes w/ no other matches to
get into secondary tree)
- do not insert into any tree, will see it again on next pass

The basic idea of this algorithm is that even if the unstable tree doesn't
promise to find two identical pages in the first round, we will
probably find them in the second or the third or the tenth round.
Then, after we have found these two identical pages just once, we insert
them into the stable tree, and they are protected there forever.
So the whole idea of the unstable tree is just to build the stable tree,
and then we find the identical pages using it.
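
The content-ordered lookup that both trees rely on can be illustrated
outside the kernel; the toy below is only a sketch, with a qsort()/bsearch()
sorted array standing in for the kernel rbtree: it orders whole pages by
memcmp() of their contents and then finds a duplicate by binary search.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* order two tracked pages by the memcmp() of their full contents */
static int cmp_pages(const void *a, const void *b)
{
	return memcmp(*(unsigned char * const *)a,
		      *(unsigned char * const *)b, PAGE_SIZE);
}

int main(void)
{
	unsigned char *pages[4], **match;
	int i;

	for (i = 0; i < 4; i++) {
		pages[i] = calloc(1, PAGE_SIZE);
		pages[i][0] = i & 1;	/* pages 0/2 and 1/3 get equal content */
	}

	/* "insert": keep the first three pages ordered by their content */
	qsort(pages, 3, sizeof(pages[0]), cmp_pages);

	/* "search": look the fourth page up among the tracked ones */
	match = bsearch(&pages[3], pages, 3, sizeof(pages[0]), cmp_pages);
	printf("duplicate %sfound\n", match ? "" : "not ");
	return 0;
}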

The current implementation can be improved a lot:
we don't have to calculate an expensive checksum, we can just use the host
dirty bit.

Currently we don't support swapping of shared pages (other pages that are
not shared can be swapped - all the pages that we didn't find to be
identical to other pages).

While walking the trees we keep calling get_user_pages(); we could optimize
it by saving the pfn and using mmu notifiers to know when the virtual
address mapping has changed.

We currently scan just programs that were registered with ksm; we
would later want to add the ability to tell ksm to scan PIDs (so you can
scan closed binary applications as well).

Right now ksm scanning is done by just one thread; multiple scanner
support might be needed.

This driver is very useful for KVM when running multiple guest
operating systems of the same type.
(For desktop workloads we have achieved more than x2 memory overcommit
(more like x3).)

This driver has found users other than KVM, for example CERN,
Fons Rademakers:
"on many-core machines we run one large detector simulation program per core.
These simulation programs are identical but run each in their own process and
need about 2 - 2.5 GB RAM.
We typically buy machines with 2GB RAM per core and so have a problem to run
one of these programs per core.
Of the 2 - 2.5 GB about 700MB is identical data in the form of magnetic field
maps, detector geometry, etc.
Currently people have been trying to start one program, initialize the geometry
and field maps and then fork it N times, to have the data shared.
With KSM this would be done automatically by the system so it sounded extremely
attractive when Andrea presented it."

I am sending another series of patches for the kvm kernel and kvm-userspace
that will allow users of kvm to test ksm with it.
The kvm patches apply to Avi's git tree.



Izik Eidus (4):
MMU_NOTIFIERS: add set_pte_at_notify()
add page_wrprotect(): write protecting page.
add replace_page(): change the page pte is pointing to.
add ksm kernel shared memory driver.

include/linux/ksm.h | 48 ++
include/linux/miscdevice.h | 1 +
include/linux/mm.h | 5 +
include/linux/mmu_notifier.h | 34 +
include/linux/rmap.h | 11 +
mm/Kconfig | 6 +
mm/Makefile | 1 +
mm/ksm.c | 1674 ++++++++++++++++++++++++++++++++++++++++++
mm/memory.c | 90 +++-
mm/mmu_notifier.c | 20 +
mm/rmap.c | 139 ++++
11 files changed, 2027 insertions(+), 2 deletions(-)
create mode 100644 include/linux/ksm.h
create mode 100644 mm/ksm.c


2009-04-09 04:01:15

by Izik Eidus

Subject: [PATCH 3/4] add replace_page(): change the page pte is pointing to.

replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

This function works by removing oldpage from the rmap and calling
put_page() on it, and by setting the pte to point to newpage and
inserting newpage into the rmap using page_add_file_rmap().

Note: newpage must be a non-anonymous page. The reason for this is that
replace_page() is built to allow mapping one page at more than one
virtual address; the mapping of this page can happen at different
offsets inside each vma, and therefore we cannot trust page->index
anymore.

The side effect of this is that newpage cannot be anything but a
kernel allocated page that is not swappable.
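
For orientation, this is roughly how the ksm code in patch 4/4
(try_to_merge_one_page()) ends up calling it; the fragment below is a
condensed sketch with locking and error handling trimmed, not part of
this patch:

	pgprot_t prot = vma->vm_page_prot;
	int err;

	pgprot_val(prot) &= ~_PAGE_RW;	/* the shared page must stay readonly */

	/*
	 * oldpage is the anonymous page being replaced; newpage is the
	 * kernel allocated, non-anonymous KsmPage.  orig_pte was sampled
	 * earlier and is revalidated by replace_page() with pte_same()
	 * under the pte lock.
	 */
	if (pages_identical(oldpage, newpage))
		err = replace_page(vma, oldpage, newpage, orig_pte, prot);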

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/mm.h | 5 +++
mm/memory.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+ struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
#define FOLL_WRITE 0x01 /* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
}
EXPORT_SYMBOL(vm_insert_mixed);

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma: vma that holds the pte pointing to oldpage.
+ * @oldpage: the page we are replacing with newpage
+ * @newpage: the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+ struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int ret;
+
+ BUG_ON(PageAnon(newpage));
+
+ ret = -EFAULT;
+ addr = page_address_in_vma(oldpage, vma);
+ if (addr == -EFAULT)
+ goto out;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out;
+
+ if (!pte_same(*ptep, orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out;
+ }
+
+ ret = 0;
+ get_page(newpage);
+ page_add_file_rmap(newpage);
+
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ ptep_clear_flush(vma, addr, ptep);
+ set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+ page_remove_rmap(oldpage);
+ if (PageAnon(oldpage)) {
+ dec_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, file_rss);
+ }
+ put_page(oldpage);
+
+ pte_unmap_unlock(ptep, ptl);
+
+out:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
--
1.5.6.5

2009-04-09 04:01:37

by Izik Eidus

Subject: [PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()

This macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then getting a vmexit
in order to set it.

This function is an optimization for kvm/users of mmu_notifiers for COW
pages. It is useful for kvm when ksm is used, because it allows kvm
not to have to receive a VMEXIT and only then map the shared page into
the mmu shadow pages; instead the page is mapped directly at the same time
linux maps it into the host page table.

This mmu notifier macro works by calling a callback that maps
the physical page directly into the shadow page tables.

(Users of mmu_notifiers that didn't implement the change_pte()
callback will just receive the mmu_notifier_invalidate_page callback.)
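
As an illustration, a secondary-mmu driver would hook the new callback
roughly like this; this is a hypothetical sketch, not part of the patch.
my_change_pte(), my_set_secondary_pte() and my_invalidate_page() are
made-up names, only the change_pte signature comes from this patch:

#include <linux/mmu_notifier.h>

static void my_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
			  unsigned long address, pte_t pte)
{
	/* remap 'address' in the secondary mmu to the new pfn, no flush */
	my_set_secondary_pte(mn, address, pte_pfn(pte));
}

static const struct mmu_notifier_ops my_notifier_ops = {
	.change_pte		= my_change_pte,
	.invalidate_page	= my_invalidate_page,	/* fallback still needed */
	/* ... other callbacks as before ... */
};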

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/mmu_notifier.h | 34 ++++++++++++++++++++++++++++++++++
mm/memory.c | 10 ++++++++--
mm/mmu_notifier.c | 20 ++++++++++++++++++++
3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);

+ /*
+ * change_pte is called when the pte mapping of a page is changed,
+ * for example when ksm remaps a pte to point to a new shared page.
+ */
+ void (*change_pte)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte);
+
/*
* Before this is invoked any secondary MMU is still ok to
* read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address, pte_t pte);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address, pte_t pte)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_change_pte(mm, address, pte);
+}
+
static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
__young; \
})

+#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
+({ \
+ struct mm_struct *___mm = __mm; \
+ unsigned long ___address = __address; \
+ pte_t ___pte = __pte; \
+ \
+ set_pte_at(__mm, __address, __ptep, ___pte); \
+ mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
#else /* CONFIG_MMU_NOTIFIER */

static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address, pte_t pte)
+{
+}
+
static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)

#define ptep_clear_flush_young_notify ptep_clear_flush_young
#define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at

#endif /* CONFIG_MMU_NOTIFIER */

diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
+ ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
- set_pte_at(mm, address, page_table, entry);
+ /*
+ * We call the notify macro here because, when a secondary mmu
+ * page table such as a kvm shadow page table is in use, we want
+ * the new page to be mapped directly into the secondary page
+ * table.
+ */
+ set_pte_at_notify(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
/*
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5f4ef02..c3e8779 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -99,6 +99,26 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
return young;
}

+void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
+ pte_t pte)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->change_pte)
+ mn->ops->change_pte(mn, mm, address, pte);
+ /*
+ * some drivers don't have change_pte and therefore we must call
+ * invalidate_page for them in that case
+ */
+ else if (mn->ops->invalidate_page)
+ mn->ops->invalidate_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+}
+
void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
--
1.5.6.5

2009-04-09 04:02:32

by Izik Eidus

Subject: [PATCH 2/4] add page_wrprotect(): write protecting page.

This patch adds a new function called page_wrprotect().
page_wrprotect() is used to take a page and mark all the ptes that
point to it as readonly.

The function works by walking the rmap of the page and setting
each pte related to the page as readonly.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes as readonly,
as noted by Andrea Arcangeli:

"While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read
and writes to/from the first 512bytes of the page. We can lose
O_DIRECT reads, the very moment we mark any pte wrprotected..."
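
The intended calling pattern, condensed from how try_to_merge_one_page()
in patch 4/4 uses it (a sketch only, with the surrounding merge logic
trimmed):

	int odirect_sync;

	/* the page lock keeps PageSwapCache stable while the ptes are walked */
	if (!trylock_page(oldpage))
		goto out;
	/* count_offset == 2: the caller holds two extra references on oldpage */
	if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
		unlock_page(oldpage);	/* no pte could be write protected */
		goto out;
	}
	unlock_page(oldpage);
	if (!odirect_sync)
		goto out;	/* possible O_DIRECT race, do not merge this page */

	/* ... safe to memcmp and merge oldpage now ... */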

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/rmap.h | 11 ++++
mm/rmap.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
}
#endif

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
return 0;
}

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+ int count_offset)
+{
+ return 0;
+}
+#endif

#endif /* CONFIG_MMU */

diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync, int count_offset)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ address = vma_address(page, vma);
+ if (address == -EFAULT)
+ goto out;
+
+ pte = page_check_address(page, mm, address, &ptl, 0);
+ if (!pte)
+ goto out;
+
+ if (pte_write(*pte)) {
+ pte_t entry;
+
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ /*
+ * Ok, this is tricky: get_user_pages_fast() runs without
+ * taking any lock, therefore the check that we are going to make
+ * of the page count against the map count is racy and
+ * O_DIRECT can happen right after the check.
+ * So we clear the pte and flush the tlb before the check;
+ * this assures us that no O_DIRECT can start after the check
+ * or in the middle of the check.
+ */
+ entry = ptep_clear_flush(vma, address, pte);
+ /*
+ * Check that no O_DIRECT or similar I/O is in progress on the
+ * page
+ */
+ if ((page_mapcount(page) + count_offset) != page_count(page)) {
+ *odirect_sync = 0;
+ set_pte_at_notify(mm, address, pte, entry);
+ goto out_unlock;
+ }
+ entry = pte_wrprotect(entry);
+ set_pte_at_notify(mm, address, pte, entry);
+ }
+ ret = 1;
+
+out_unlock:
+ pte_unmap_unlock(pte, ptl);
+out:
+ return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+ int count_offset)
+{
+ struct address_space *mapping;
+ struct prio_tree_iter iter;
+ struct vm_area_struct *vma;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ int ret = 0;
+
+ mapping = page_mapping(page);
+ if (!mapping)
+ return ret;
+
+ spin_lock(&mapping->i_mmap_lock);
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+ ret += page_wrprotect_one(page, vma, odirect_sync,
+ count_offset);
+
+ spin_unlock(&mapping->i_mmap_lock);
+
+ return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+ int count_offset)
+{
+ struct vm_area_struct *vma;
+ struct anon_vma *anon_vma;
+ int ret = 0;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return ret;
+
+ /*
+ * If the page is inside the swap cache, its _count number was
+ * increased by one, therefore we have to increase count_offset by one.
+ */
+ if (PageSwapCache(page))
+ count_offset++;
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+ ret += page_wrprotect_one(page, vma, odirect_sync,
+ count_offset);
+
+ page_unlock_anon_vma(anon_vma);
+
+ return ret;
+}
+
+/**
+ * page_wrprotect - set all ptes pointing to a page as readonly
+ * @page: the page to set as readonly
+ * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
+ * marked as readonly because page_wrprotect_one() was not able
+ * to mark these ptes as readonly without opening a window for a
+ * race with O_DIRECT
+ * @count_offset: number of times page_wrprotect() caller had called get_page()
+ * on the page
+ *
+ * returns the number of ptes which were marked as readonly.
+ * (ptes that were readonly before this function was called are counted as well)
+ */
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
+{
+ int ret = 0;
+
+ /*
+ * Page lock is needed for anon pages for the PageSwapCache check,
+ * and for page_mapping for filebacked pages
+ */
+ BUG_ON(!PageLocked(page));
+
+ *odirect_sync = 1;
+ if (PageAnon(page))
+ ret = page_wrprotect_anon(page, odirect_sync, count_offset);
+ else
+ ret = page_wrprotect_file(page, odirect_sync, count_offset);
+
+ return ret;
+}
+EXPORT_SYMBOL(page_wrprotect);
+
+#endif
+
/**
* __page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
--
1.5.6.5

2009-04-09 04:02:53

by Izik Eidus

Subject: [PATCH 4/4] add ksm kernel shared memory driver.

Ksm is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it.
Pages that are merged are marked as readonly and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable;
one of these cases is where the pages of the application keep changing
dynamically and the application cannot know in advance what pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures, called the stable and unstable trees,
to find the identical pages in an effective way.

When ksm finds two identical pages, it marks them as readonly and merges
them into a single page;
after the pages are marked as readonly and merged into one page, linux
treats these pages as normal copy-on-write pages and copies them
when a write access happens to them.

Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api (a usage sketch follows the list):

KSM_GET_API_VERSION:
Gives userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Creates a shared memory region fd, which later allows the user to register
the memory regions to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_REGISTER_MEMORY_REGION:
Registers a userspace virtual address range to be scanned by ksm.
This ioctl uses the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
number of pages to share inside this memory region.
__u32 pad;
__u64 addr:
the beginning of the virtual address of this region.
__u64 reserved_bits;
reserved bits for future usage.

KSM_REMOVE_MEMORY_REGION:
Removes a memory region from ksm.
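
A minimal userspace sketch of driving this api (it assumes a /dev/ksm node
for the misc device exists; error handling is mostly trimmed):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ksm.h>

int main(void)
{
	struct ksm_memory_region region;
	long npages = 256;
	void *buf;
	int ksm_fd, sma_fd;

	ksm_fd = open("/dev/ksm", O_RDWR);
	if (ksm_fd < 0 || ioctl(ksm_fd, KSM_GET_API_VERSION) != KSM_API_VERSION)
		return 1;

	/* one SMA fd per process; memory regions are registered against it */
	sma_fd = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA);
	if (sma_fd < 0)
		return 1;

	buf = malloc(npages * getpagesize());	/* anonymous memory to be scanned */

	memset(&region, 0, sizeof(region));
	region.npages = npages;
	region.addr = (unsigned long)buf;
	if (ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region))
		return 1;

	/* ... run the workload; ksmd merges identical pages in this range ... */

	ioctl(sma_fd, KSM_REMOVE_MEMORY_REGION);	/* or simply close(sma_fd) */
	return 0;
}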

Signed-off-by: Izik Eidus <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/ksm.h | 48 ++
include/linux/miscdevice.h | 1 +
mm/Kconfig | 6 +
mm/Makefile | 1 +
mm/ksm.c | 1674 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 1730 insertions(+), 0 deletions(-)
create mode 100644 include/linux/ksm.h
create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 0000000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+ __u32 npages; /* number of pages to share */
+ __u32 pad;
+ __u64 addr; /* the begining of the virtual address */
+ __u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by ksm.
+ */
+#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
#define HPET_MINOR 228
#define FUSE_MINOR 229
#define KVM_MINOR 232
+#define KSM_MINOR 233
#define MISC_DYNAMIC_MINOR 255

struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT

config MMU_NOTIFIER
bool
+
+config KSM
+ tristate "Enable KSM for page sharing"
+ help
+ Enable the KSM kernel module to allow page sharing of equal pages
+ among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 0000000..a15a92d
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1674 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 2008 Red Hat, Inc.
+ * Authors:
+ * Izik Eidus
+ * Andrea Arcangeli
+ * Chris Wright
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/vmalloc.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/pagemap.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/spinlock.h>
+#include <linux/jhash.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>
+#include <linux/scatterlist.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/rbtree.h>
+#include <linux/anon_inodes.h>
+#include <linux/ksm.h>
+
+#include <asm/tlbflush.h>
+
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+static int rmap_hash_size;
+module_param(rmap_hash_size, int, 0);
+MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
+
+/*
+ * ksm_mem_slot - holds information for a userspace scanning range
+ * (the scanning for this region will be from addr until addr +
+ * npages * PAGE_SIZE inside mm)
+ */
+struct ksm_mem_slot {
+ struct list_head link;
+ struct list_head sma_link;
+ struct mm_struct *mm;
+ unsigned long addr; /* the begining of the virtual address */
+ unsigned npages; /* number of pages to share */
+};
+
+/*
+ * ksm_sma - shared memory area; each process has its own sma that contains
+ * the information about the slots that it owns
+ */
+struct ksm_sma {
+ struct list_head sma_slots;
+};
+
+/**
+ * struct ksm_scan - cursor for scanning
+ * @slot_index: the current slot we are scanning
+ * @page_index: the page inside the sma that is currently being scanned
+ *
+ * ksm uses it to know which pages it needs to scan next
+ */
+struct ksm_scan {
+ struct ksm_mem_slot *slot_index;
+ unsigned long page_index;
+};
+
+/*
+ * A few notes about ksm scanning progress (to make the data structures
+ * below easier to understand):
+ *
+ * In order to reduce excessive scanning, ksm sorts the memory pages by their
+ * contents into a data structure that holds pointers to the pages.
+ *
+ * Since the contents of the pages may change at any moment, ksm can't just
+ * insert the pages into a normal sorted tree and expect it to find anything.
+ *
+ * For this purpose ksm uses two data structures - the stable and unstable
+ * trees.  The stable tree holds pointers to all the merged pages (KsmPages)
+ * sorted by their contents; because each such page has to be write-protected,
+ * searching this tree is fully assured to work and therefore this
+ * tree is called the stable tree.
+ *
+ * In addition to the stable tree, ksm uses another data structure called the
+ * unstable tree: this tree holds pointers to pages that have
+ * been found to be "unchanged for a period of time".  The unstable tree sorts
+ * these pages by their contents, but since these pages are not
+ * write-protected, ksm can't trust the unstable tree to be fully assured to
+ * work.
+ * Because the unstable tree would become corrupted when some of
+ * the pages inside it change, the tree is called unstable.
+ * Ksm handles this problem in the following ways:
+ * 1) The unstable tree gets flushed every time ksm finishes scanning the
+ * whole memory, and then the tree is rebuilt from the beginning.
+ * 2) Ksm will only insert into the unstable tree pages whose hash value
+ * did not change during the whole progress of one circular scan of the
+ * memory.
+ * 3) The unstable tree is a RedBlack Tree - meaning its balancing is based on
+ * the colors of the nodes and not their content; this assures that even when
+ * the tree gets "corrupted" we won't get out of balance and the timing of
+ * scanning stays the same.  Another point is that searching and inserting
+ * nodes into an rbtree uses the same algorithm, therefore we have no overhead
+ * when we flush the tree and rebuild it.
+ * 4) Ksm never flushes the stable tree; this means that even if it takes 10
+ * rounds to find a page inside the unstable tree, as soon as we find it,
+ * it will be secured inside the stable tree.
+ * (When we scan a new page, we first compare it against the stable tree, and
+ * then against the unstable tree.)
+ */
+
+struct rmap_item;
+
+/*
+ * tree_item - object of the stable and unstable trees
+ */
+struct tree_item {
+ struct rb_node node;
+ struct rmap_item *rmap_item;
+};
+
+/*
+ * rmap_item - object of the rmap_hash hash table
+ * (it holds the previous hash value (oldchecksum),
+ * a pointer into the page_hash_item, and a pointer into the tree_item)
+ */
+
+/**
+ * struct rmap_item - reverse mapping item for virtual addresses
+ * @link: link into the rmap_hash hash table.
+ * @mm: the memory structure the rmap_item is pointing into.
+ * @address: the virtual address the rmap_item is pointing to.
+ * @oldchecksum: old checksum result for the page at the virtual address
+ * @stable_tree: when 1 the rmap_item is used for the stable tree, 0 for the
+ * unstable tree
+ * @kpage_outside_tree: when 1 this rmap_item points to a kpage outside any tree
+ * @tree_item: pointer into the stable/unstable tree node that holds the
+ * virtual address that the rmap_item is pointing to.
+ * @next: the next rmap_item that is found inside the same tree node.
+ */
+
+struct rmap_item {
+ struct hlist_node link;
+ struct mm_struct *mm;
+ unsigned long address;
+ unsigned int oldchecksum; /* old checksum value */
+ unsigned char stable_tree; /* 1 stable_tree 0 unstable tree */
+ unsigned char kpage_outside_tree;
+ struct tree_item *tree_item;
+ struct rmap_item *next;
+ struct rmap_item *prev;
+};
+
+/*
+ * slots is a linked list that holds all the memory regions that were
+ * registered to be scanned.
+ */
+static LIST_HEAD(slots);
+/*
+ * slots_lock protects against removing and adding memory regions while a
+ * scanner is in the middle of scanning.
+ */
+static DECLARE_RWSEM(slots_lock);
+
+/* The stable and unstable trees heads. */
+struct rb_root root_stable_tree = RB_ROOT;
+struct rb_root root_unstable_tree = RB_ROOT;
+
+
+/* The number of linked list members inside the hash table */
+static int nrmaps_hash;
+/* rmap_hash hash table */
+static struct hlist_head *rmap_hash;
+
+static struct kmem_cache *tree_item_cache;
+static struct kmem_cache *rmap_item_cache;
+
+/* the number of nodes inside the stable tree */
+static unsigned long nnodes_stable_tree;
+
+/* the number of kernel allocated pages outside the stable tree */
+static unsigned long nkpage_out_tree;
+
+static int kthread_sleep; /* sleep time of the kernel thread */
+static int kthread_pages_to_scan; /* npages to scan for the kernel thread */
+static int kthread_max_kernel_pages; /* number of unswappable pages allowed */
+static unsigned long ksm_pages_shared;
+static struct ksm_scan kthread_ksm_scan;
+static int ksmd_flags;
+static struct task_struct *kthread;
+static DECLARE_WAIT_QUEUE_HEAD(kthread_wait);
+static DECLARE_RWSEM(kthread_lock);
+
+
+static int ksm_slab_init(void)
+{
+ int ret = -ENOMEM;
+
+ tree_item_cache = KMEM_CACHE(tree_item, 0);
+ if (!tree_item_cache)
+ goto out;
+
+ rmap_item_cache = KMEM_CACHE(rmap_item, 0);
+ if (!rmap_item_cache)
+ goto out_free;
+
+ return 0;
+
+out_free:
+ kmem_cache_destroy(tree_item_cache);
+out:
+ return ret;
+}
+
+static void ksm_slab_free(void)
+{
+ kmem_cache_destroy(rmap_item_cache);
+ kmem_cache_destroy(tree_item_cache);
+}
+
+static inline struct tree_item *alloc_tree_item(void)
+{
+ return kmem_cache_zalloc(tree_item_cache, GFP_KERNEL);
+}
+
+static void free_tree_item(struct tree_item *tree_item)
+{
+ kmem_cache_free(tree_item_cache, tree_item);
+}
+
+static inline struct rmap_item *alloc_rmap_item(void)
+{
+ return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
+}
+
+static inline void free_rmap_item(struct rmap_item *rmap_item)
+{
+ kmem_cache_free(rmap_item_cache, rmap_item);
+}
+
+static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page)
+{
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ unsigned long addr;
+
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ if (unlikely(addr < vma->vm_start || addr >= vma->vm_end))
+ return -EFAULT;
+ return addr;
+}
+
+static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep = NULL;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map(pmd, addr);
+out:
+ return ptep;
+}
+
+
+static int is_present_pte(struct mm_struct *mm, unsigned long addr)
+{
+ pte_t *ptep;
+ int r;
+
+ ptep = get_pte(mm, addr);
+ if (!ptep)
+ return 0;
+
+ r = pte_present(*ptep);
+ pte_unmap(ptep);
+
+ return r;
+}
+
+/*
+ * PageKsm - these pages are the write protected pages that ksm maps
+ * into multiple vmas (this is the "shared page").
+ * Such a page was allocated using alloc_page(), and every pte that points to
+ * it is always write protected (therefore its data content can't ever change)
+ * and this page can't be swapped.
+ */
+static inline int PageKsm(struct page *page)
+{
+ /*
+ * When ksm creates a new shared page, it creates a kernel allocated
+ * page using alloc_page(), therefore this page is not anonymous.  Taking
+ * into account that ksm scans just anonymous pages, we can rely on the
+ * fact that each time we see !PageAnon(page) we are hitting a shared
+ * page; in addition to this check, to be 100% sure we are dealing with
+ * a KsmPage we have to check for !vm_file.
+ */
+ return !PageAnon(page);
+}
+
+static int rmap_hash_init(void)
+{
+ if (!rmap_hash_size) {
+ struct sysinfo sinfo;
+
+ si_meminfo(&sinfo);
+ rmap_hash_size = sinfo.totalram / 10;
+ }
+ nrmaps_hash = rmap_hash_size;
+ rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
+ if (!rmap_hash)
+ return -ENOMEM;
+ memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
+ return 0;
+}
+
+static void rmap_hash_free(void)
+{
+ int i;
+ struct hlist_head *bucket;
+ struct hlist_node *node, *n;
+ struct rmap_item *rmap_item;
+
+ for (i = 0; i < nrmaps_hash; ++i) {
+ bucket = &rmap_hash[i];
+ hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
+ hlist_del(&rmap_item->link);
+ free_rmap_item(rmap_item);
+ }
+ }
+ vfree(rmap_hash);
+}
+
+static inline u32 calc_checksum(struct page *page)
+{
+ u32 checksum;
+ void *addr = kmap_atomic(page, KM_USER0);
+ checksum = jhash2(addr, PAGE_SIZE / 4, 17);
+ kunmap_atomic(addr, KM_USER0);
+ return checksum;
+}
+
+/*
+ * Return rmap_item for a given virtual address.
+ */
+static struct rmap_item *get_rmap_item(struct mm_struct *mm, unsigned long addr)
+{
+ struct rmap_item *rmap_item;
+ struct hlist_head *bucket;
+ struct hlist_node *node;
+
+ bucket = &rmap_hash[addr % nrmaps_hash];
+ hlist_for_each_entry(rmap_item, node, bucket, link) {
+ if (mm == rmap_item->mm && rmap_item->address == addr) {
+ return rmap_item;
+ }
+ }
+ return NULL;
+}
+
+/*
+ * Removing rmap_item from stable or unstable tree.
+ * This function frees the rmap_item object, and if that rmap_item was
+ * inside the stable or unstable tree, it removes the link from there
+ * as well.
+ */
+static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
+{
+ struct tree_item *tree_item;
+
+ tree_item = rmap_item->tree_item;
+ rmap_item->tree_item = NULL;
+
+ if (rmap_item->stable_tree) {
+ ksm_pages_shared--;
+ if (rmap_item->prev) {
+ BUG_ON(rmap_item->prev->next != rmap_item);
+ rmap_item->prev->next = rmap_item->next;
+ }
+ if (rmap_item->next) {
+ BUG_ON(rmap_item->next->prev != rmap_item);
+ rmap_item->next->prev = rmap_item->prev;
+ }
+ } else if (rmap_item->kpage_outside_tree) {
+ ksm_pages_shared--;
+ nkpage_out_tree--;
+ }
+
+ if (tree_item) {
+ if (rmap_item->stable_tree) {
+ if (!rmap_item->next && !rmap_item->prev) {
+ rb_erase(&tree_item->node, &root_stable_tree);
+ free_tree_item(tree_item);
+ nnodes_stable_tree--;
+ } else if (!rmap_item->prev) {
+ tree_item->rmap_item = rmap_item->next;
+ } else {
+ tree_item->rmap_item = rmap_item->prev;
+ }
+ } else {
+ free_tree_item(tree_item);
+ }
+ }
+
+ hlist_del(&rmap_item->link);
+ free_rmap_item(rmap_item);
+}
+
+static void break_cow(struct mm_struct *mm, unsigned long addr)
+{
+ struct page *page[1];
+
+ down_read(&mm->mmap_sem);
+ if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL)) {
+ put_page(page[0]);
+ }
+ up_read(&mm->mmap_sem);
+}
+
+static void remove_page_from_tree(struct mm_struct *mm,
+ unsigned long addr)
+{
+ struct rmap_item *rmap_item;
+
+ rmap_item = get_rmap_item(mm, addr);
+ if (!rmap_item)
+ return;
+
+ if (rmap_item->stable_tree) {
+ /* We are breaking all the KsmPages of area that is removed */
+ break_cow(mm, addr);
+ } else {
+ /*
+ * If kpage_outside_tree is set, this item is a KsmPage outside
+ * the stable tree, therefore we have to break the COW and
+ * in addition we have to dec nkpage_out_tree.
+ */
+ if (rmap_item->kpage_outside_tree)
+ break_cow(mm, addr);
+ }
+
+ remove_rmap_item_from_tree(rmap_item);
+}
+
+static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
+ struct ksm_memory_region *mem)
+{
+ struct ksm_mem_slot *slot;
+ int ret = -EPERM;
+
+ slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
+ if (!slot) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ slot->mm = get_task_mm(current);
+ if (!slot->mm)
+ goto out_free;
+ slot->addr = mem->addr;
+ slot->npages = mem->npages;
+
+ down_write(&slots_lock);
+
+ list_add_tail(&slot->link, &slots);
+ list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
+
+ up_write(&slots_lock);
+ return 0;
+
+out_free:
+ kfree(slot);
+out:
+ return ret;
+}
+
+static void remove_mm_from_hash_and_tree(struct mm_struct *mm)
+{
+ struct ksm_mem_slot *slot;
+ int pages_count;
+
+ list_for_each_entry(slot, &slots, link)
+ if (slot->mm == mm)
+ break;
+ BUG_ON(!slot);
+
+ root_unstable_tree = RB_ROOT;
+ for (pages_count = 0; pages_count < slot->npages; ++pages_count)
+ remove_page_from_tree(mm, slot->addr +
+ pages_count * PAGE_SIZE);
+ list_del(&slot->link);
+}
+
+static int ksm_sma_ioctl_remove_memory_region(struct ksm_sma *ksm_sma)
+{
+ struct ksm_mem_slot *slot, *node;
+
+ down_write(&slots_lock);
+ list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) {
+ remove_mm_from_hash_and_tree(slot->mm);
+ mmput(slot->mm);
+ list_del(&slot->sma_link);
+ kfree(slot);
+ }
+ up_write(&slots_lock);
+ return 0;
+}
+
+static int ksm_sma_release(struct inode *inode, struct file *filp)
+{
+ struct ksm_sma *ksm_sma = filp->private_data;
+ int r;
+
+ r = ksm_sma_ioctl_remove_memory_region(ksm_sma);
+ kfree(ksm_sma);
+ return r;
+}
+
+static long ksm_sma_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ struct ksm_sma *sma = filp->private_data;
+ void __user *argp = (void __user *)arg;
+ int r = EINVAL;
+
+ switch (ioctl) {
+ case KSM_REGISTER_MEMORY_REGION: {
+ struct ksm_memory_region ksm_memory_region;
+
+ r = -EFAULT;
+ if (copy_from_user(&ksm_memory_region, argp,
+ sizeof(ksm_memory_region)))
+ goto out;
+ r = ksm_sma_ioctl_register_memory_region(sma,
+ &ksm_memory_region);
+ break;
+ }
+ case KSM_REMOVE_MEMORY_REGION:
+ r = ksm_sma_ioctl_remove_memory_region(sma);
+ break;
+ }
+
+out:
+ return r;
+}
+
+static int memcmp_pages(struct page *page1, struct page *page2)
+{
+ char *addr1, *addr2;
+ int r;
+
+ addr1 = kmap_atomic(page1, KM_USER0);
+ addr2 = kmap_atomic(page2, KM_USER1);
+ r = memcmp(addr1, addr2, PAGE_SIZE);
+ kunmap_atomic(addr1, KM_USER0);
+ kunmap_atomic(addr2, KM_USER1);
+ return r;
+}
+
+/* pages_identical
+ * return 1 if identical, 0 otherwise.
+ */
+static inline int pages_identical(struct page *page1, struct page *page2)
+{
+ return !memcmp_pages(page1, page2);
+}
+
+/*
+ * try_to_merge_one_page - take two pages and merge them into one
+ * @mm: mm_struct that holds the vma pointing into oldpage
+ * @vma: the vma that holds the pte pointing into oldpage
+ * @oldpage: the page that we want to replace with newpage
+ * @newpage: the page that we want to map instead of oldpage
+ * @newprot: the new permission of the pte inside vma
+ * note:
+ * oldpage should be an anon page while newpage should be a file mapped page
+ *
+ * this function returns 0 if the pages were merged, 1 otherwise.
+ */
+static int try_to_merge_one_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct page *oldpage,
+ struct page *newpage,
+ pgprot_t newprot)
+{
+ int ret = 1;
+ int odirect_sync;
+ unsigned long page_addr_in_vma;
+ pte_t orig_pte, *orig_ptep;
+
+ if (!PageAnon(oldpage))
+ goto out;
+
+ get_page(newpage);
+ get_page(oldpage);
+
+ down_read(&mm->mmap_sem);
+
+ page_addr_in_vma = addr_in_vma(vma, oldpage);
+ if (page_addr_in_vma == -EFAULT)
+ goto out_unlock;
+
+ orig_ptep = get_pte(mm, page_addr_in_vma);
+ if (!orig_ptep)
+ goto out_unlock;
+ orig_pte = *orig_ptep;
+ pte_unmap(orig_ptep);
+ if (!pte_present(orig_pte))
+ goto out_unlock;
+ if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
+ goto out_unlock;
+ /*
+ * we need the page lock to read a stable PageSwapCache in
+ * page_wrprotect()
+ */
+ if (!trylock_page(oldpage))
+ goto out_unlock;
+ /*
+ * page_wrprotect checks if the page is swapped or in the swap cache;
+ * in the future we might want to run is_present_pte here and then
+ * swap_free
+ */
+ if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
+ unlock_page(oldpage);
+ goto out_unlock;
+ }
+ unlock_page(oldpage);
+ if (!odirect_sync)
+ goto out_unlock;
+
+ orig_pte = pte_wrprotect(orig_pte);
+
+ if (pages_identical(oldpage, newpage))
+ ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
+
+out_unlock:
+ up_read(&mm->mmap_sem);
+ put_page(oldpage);
+ put_page(newpage);
+out:
+ return ret;
+}
+
+/*
+ * try_to_merge_two_pages_alloc - take two identical pages and prepare them
+ * to be merged into one page.
+ *
+ * this function returns 0 if we successfully mapped two identical pages into
+ * one page, 1 otherwise.
+ * (note this function will allocate a new kernel page; if one of the pages
+ * is already a shared page (KsmPage), then try_to_merge_two_pages_noalloc()
+ * should be called.)
+ */
+
+static int try_to_merge_two_pages_alloc(struct mm_struct *mm1,
+ struct page *page1,
+ struct mm_struct *mm2,
+ struct page *page2,
+ unsigned long addr1,
+ unsigned long addr2)
+{
+ struct vm_area_struct *vma;
+ pgprot_t prot;
+ int ret = 1;
+ struct page *kpage;
+
+ /*
+ * The number of nodes inside the stable tree +
+ * nkpage_out_tree is the same as the number of kernel pages that
+ * we hold.
+ */
+ if (kthread_max_kernel_pages &&
+ (nnodes_stable_tree + nkpage_out_tree) >=
+ kthread_max_kernel_pages)
+ return ret;
+
+ kpage = alloc_page(GFP_HIGHUSER);
+ if (!kpage)
+ return ret;
+ down_read(&mm1->mmap_sem);
+ vma = find_vma(mm1, addr1);
+ up_read(&mm1->mmap_sem);
+ if (!vma) {
+ put_page(kpage);
+ return ret;
+ }
+ prot = vma->vm_page_prot;
+ pgprot_val(prot) &= ~_PAGE_RW;
+
+ copy_user_highpage(kpage, page1, addr1, vma);
+ ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot);
+
+ if (!ret) {
+ down_read(&mm2->mmap_sem);
+ vma = find_vma(mm2, addr2);
+ up_read(&mm2->mmap_sem);
+ if (!vma) {
+ put_page(kpage);
+ break_cow(mm1, addr1);
+ ret = 1;
+ return ret;
+ }
+
+ prot = vma->vm_page_prot;
+ pgprot_val(prot) &= ~_PAGE_RW;
+
+ ret = try_to_merge_one_page(mm2, vma, page2, kpage,
+ prot);
+ /*
+ * If the second try_to_merge_one_page call failed,
+ * we are in a situation where we have a Ksm page that has
+ * just one pte pointing to it; in this case we break
+ * it.
+ */
+ if (ret) {
+ break_cow(mm1, addr1);
+ } else {
+ ksm_pages_shared += 2;
+ }
+ }
+
+ put_page(kpage);
+ return ret;
+}
+
+/*
+ * try_to_merge_two_pages_noalloc - the same as try_to_merge_two_pages_alloc,
+ * but no new kernel page is allocated (page2 should be a KsmPage)
+ */
+static int try_to_merge_two_pages_noalloc(struct mm_struct *mm1,
+ struct page *page1,
+ struct page *page2,
+ unsigned long addr1)
+{
+ struct vm_area_struct *vma;
+ pgprot_t prot;
+ int ret = 1;
+
+ /*
+ * If page2 is shared, we can just make the pte of mm1(page1) point to
+ * page2.
+ */
+ BUG_ON(!PageKsm(page2));
+ down_read(&mm1->mmap_sem);
+ vma = find_vma(mm1, addr1);
+ up_read(&mm1->mmap_sem);
+ if (!vma)
+ return ret;
+ prot = vma->vm_page_prot;
+ pgprot_val(prot) &= ~_PAGE_RW;
+ ret = try_to_merge_one_page(mm1, vma, page1, page2, prot);
+ if (!ret)
+ ksm_pages_shared++;
+
+ return ret;
+}
+
+/*
+ * is_zapped_item - check if the page belonging to the rmap_item was zapped.
+ *
+ * This function checks whether the page that the virtual address inside
+ * rmap_item is pointing to is still a KsmPage, and therefore whether we can
+ * trust the content of this page.
+ * Since this function already calls get_user_pages, it returns the
+ * pointer to the page as an optimization.
+ */
+static int is_zapped_item(struct rmap_item *rmap_item,
+ struct page **page)
+{
+ int ret = 0;
+ struct vm_area_struct *vma;
+
+ cond_resched();
+ if (is_present_pte(rmap_item->mm, rmap_item->address)) {
+ down_read(&rmap_item->mm->mmap_sem);
+ vma = find_vma(rmap_item->mm, rmap_item->address);
+ if (vma && !vma->vm_file) {
+ BUG_ON(vma->vm_flags & VM_SHARED);
+ ret = get_user_pages(current, rmap_item->mm,
+ rmap_item->address,
+ 1, 0, 0, page, NULL);
+ }
+ up_read(&rmap_item->mm->mmap_sem);
+ }
+
+ if (!ret)
+ return 1;
+
+ if (unlikely(!PageKsm(page[0]))) {
+ put_page(page[0]);
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * stable_tree_search - search page inside the stable tree
+ * @page: the page that we are searching identical pages to.
+ * @page2: pointer to the identical page that we have found inside the stable
+ * tree.
+ * @rmap_item: the reverse mapping item
+ *
+ * this function checks if there is a page inside the stable tree
+ * with identical content to the page that we are scanning right now.
+ *
+ * this function returns a pointer to the rmap_item of the identical item if
+ * found, NULL otherwise.
+ */
+static struct rmap_item *stable_tree_search(struct page *page,
+ struct page **page2,
+ struct rmap_item *rmap_item)
+{
+ struct rb_node *node = root_stable_tree.rb_node;
+ struct tree_item *tree_item;
+ struct rmap_item *found_rmap_item;
+
+ while (node) {
+ int ret;
+
+ tree_item = rb_entry(node, struct tree_item, node);
+ found_rmap_item = tree_item->rmap_item;
+ while (found_rmap_item) {
+ BUG_ON(!found_rmap_item->stable_tree);
+ BUG_ON(!found_rmap_item->tree_item);
+ if (!rmap_item ||
+ !(found_rmap_item->mm == rmap_item->mm &&
+ found_rmap_item->address == rmap_item->address)) {
+ if (!is_zapped_item(found_rmap_item, page2))
+ break;
+ remove_rmap_item_from_tree(found_rmap_item);
+ }
+ found_rmap_item = found_rmap_item->next;
+ }
+ if (!found_rmap_item)
+ goto out_didnt_find;
+
+ /*
+ * We can trust the value of the memcmp as we know the pages
+ * are write protected.
+ */
+ ret = memcmp_pages(page, page2[0]);
+
+ if (ret < 0) {
+ put_page(page2[0]);
+ node = node->rb_left;
+ } else if (ret > 0) {
+ put_page(page2[0]);
+ node = node->rb_right;
+ } else {
+ goto out_found;
+ }
+ }
+out_didnt_find:
+ found_rmap_item = NULL;
+out_found:
+ return found_rmap_item;
+}
+
+/*
+ * stable_tree_insert - insert into the stable tree a new rmap_item that is
+ * pointing to a new KsmPage.
+ *
+ * @page: the page that we are searching an identical page to inside the
+ * stable tree.
+ * @new_tree_item: the new tree item we are going to link into the stable tree.
+ * @rmap_item: pointer to the reverse mapping item.
+ *
+ * this function returns 0 on success, 1 otherwise.
+ */
+static int stable_tree_insert(struct page *page,
+ struct tree_item *new_tree_item,
+ struct rmap_item *rmap_item)
+{
+ struct rb_node **new = &(root_stable_tree.rb_node);
+ struct rb_node *parent = NULL;
+ struct tree_item *tree_item;
+ struct page *page2[1];
+
+ while (*new) {
+ int ret;
+ struct rmap_item *insert_rmap_item;
+
+ tree_item = rb_entry(*new, struct tree_item, node);
+ BUG_ON(!tree_item);
+ BUG_ON(!tree_item->rmap_item);
+
+ insert_rmap_item = tree_item->rmap_item;
+ while (insert_rmap_item) {
+ BUG_ON(!insert_rmap_item->stable_tree);
+ BUG_ON(!insert_rmap_item->tree_item);
+ if (!(insert_rmap_item->mm == rmap_item->mm &&
+ insert_rmap_item->address == rmap_item->address)) {
+ if (!is_zapped_item(insert_rmap_item, page2))
+ break;
+ remove_rmap_item_from_tree(insert_rmap_item);
+ }
+ insert_rmap_item = insert_rmap_item->next;
+ }
+ if (!insert_rmap_item)
+ return 1;
+
+ ret = memcmp_pages(page, page2[0]);
+
+ parent = *new;
+ if (ret < 0) {
+ put_page(page2[0]);
+ new = &((*new)->rb_left);
+ } else if (ret > 0) {
+ put_page(page2[0]);
+ new = &((*new)->rb_right);
+ } else {
+ /*
+ * It isn't a bug that we got here (the fact that we
+ * didn't find the page inside the stable tree), because
+ * when we searched for the page inside the stable tree
+ * it was still not write protected, and therefore it
+ * could have changed later.
+ */
+ return 1;
+ }
+ }
+
+ rb_link_node(&new_tree_item->node, parent, new);
+ rb_insert_color(&new_tree_item->node, &root_stable_tree);
+ nnodes_stable_tree++;
+ rmap_item->stable_tree = 1;
+ rmap_item->tree_item = new_tree_item;
+
+ return 0;
+}
+
+/*
+ * unstable_tree_search_insert - search and insert items into the unstable tree.
+ *
+ * @page: the page for which we search an identical page, or which we insert
+ * into the unstable tree
+ * @page2: pointer to the identical page that was found inside the unstable tree
+ * @page_rmap_item: the reverse mapping item of page
+ *
+ * this function searches whether a page identical to the page that we
+ * are scanning right now is found inside the unstable tree; in case no page
+ * with identical content exists inside the unstable tree, we insert
+ * page_rmap_item as a new object into the unstable tree.
+ *
+ * this function returns a pointer to the tree_item of the item that is found
+ * to be identical to the page that we are scanning right now, NULL otherwise.
+ *
+ * (this function does both searching and inserting, because
+ * searching and inserting share the same walking algorithm in rbtrees)
+ */
+static struct tree_item *unstable_tree_search_insert(struct page *page,
+ struct page **page2,
+ struct rmap_item *page_rmap_item)
+{
+ struct rb_node **new = &(root_unstable_tree.rb_node);
+ struct rb_node *parent = NULL;
+ struct tree_item *tree_item;
+ struct tree_item *new_tree_item;
+ struct rmap_item *rmap_item;
+
+ while (*new) {
+ int ret;
+
+ tree_item = rb_entry(*new, struct tree_item, node);
+ BUG_ON(!tree_item);
+ rmap_item = tree_item->rmap_item;
+ BUG_ON(!rmap_item);
+
+ /*
+ * We dont want to swap in pages
+ */
+ if (!is_present_pte(rmap_item->mm, rmap_item->address))
+ return NULL;
+
+ down_read(&rmap_item->mm->mmap_sem);
+ ret = get_user_pages(current, rmap_item->mm, rmap_item->address,
+ 1, 0, 0, page2, NULL);
+ up_read(&rmap_item->mm->mmap_sem);
+ if (!ret)
+ return NULL;
+
+ ret = memcmp_pages(page, page2[0]);
+
+ parent = *new;
+ if (ret < 0) {
+ put_page(page2[0]);
+ new = &((*new)->rb_left);
+ } else if (ret > 0) {
+ put_page(page2[0]);
+ new = &((*new)->rb_right);
+ } else {
+ return tree_item;
+ }
+ }
+
+ if (!page_rmap_item)
+ return NULL;
+
+ new_tree_item = alloc_tree_item();
+ if (!new_tree_item)
+ return NULL;
+
+ page_rmap_item->tree_item = new_tree_item;
+ page_rmap_item->stable_tree = 0;
+ new_tree_item->rmap_item = page_rmap_item;
+ rb_link_node(&new_tree_item->node, parent, new);
+ rb_insert_color(&new_tree_item->node, &root_unstable_tree);
+
+ return NULL;
+}
+
+/*
+ * update_tree - check if the page inside the tree got zapped,
+ * and if it got zapped, kick it out of the tree.
+ *
+ * we return 1 in case we removed the rmap_item.
+ */
+int update_tree(struct rmap_item *rmap_item)
+{
+ if (!rmap_item->stable_tree) {
+ if (unlikely(rmap_item->kpage_outside_tree)) {
+ remove_rmap_item_from_tree(rmap_item);
+ return 1;
+ }
+ /*
+		 * If the rmap_item is !stable_tree and in addition its
+		 * tree_item is != NULL, it means this rmap_item was inside
+		 * the unstable tree, therefore we have to free its tree_item
+		 * (because the unstable tree was already flushed by the time
+		 * we get here).
+ */
+ if (rmap_item->tree_item) {
+ free_tree_item(rmap_item->tree_item);
+ rmap_item->tree_item = NULL;
+ return 0;
+ }
+ return 0;
+ }
+ /*
+	 * If we are here it means the rmap_item was zapped: the rmap_item
+	 * was pointing into the stable tree, where all the pages should be
+	 * KsmPages, so it shouldn't have come here in the first place
+	 * (cmp_and_merge_page() shouldn't have been called).
+ */
+ remove_rmap_item_from_tree(rmap_item);
+ return 1;
+}
+
+static void create_new_rmap_item(struct rmap_item *rmap_item,
+ struct mm_struct *mm,
+ unsigned long addr,
+ unsigned int checksum)
+{
+ struct hlist_head *bucket;
+
+ rmap_item->mm = mm;
+ rmap_item->address = addr;
+ rmap_item->oldchecksum = checksum;
+ rmap_item->stable_tree = 0;
+ rmap_item->kpage_outside_tree = 0;
+ rmap_item->tree_item = NULL;
+
+ bucket = &rmap_hash[addr % nrmaps_hash];
+ hlist_add_head(&rmap_item->link, bucket);
+}
+
+/*
+ * cmp_and_merge_page - take a page, search the stable and unstable trees for
+ * a page with identical content, and in case such a page is found, call
+ * try_to_merge_two_pages().
+ *
+ * @ksm_scan: the ksm scanner structure.
+ * @page: the page that we are searching an identical page for.
+ */
+static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
+{
+ struct page *page2[1];
+ struct ksm_mem_slot *slot;
+ struct tree_item *tree_item;
+ struct rmap_item *rmap_item;
+ struct rmap_item *tree_rmap_item;
+ unsigned int checksum;
+ unsigned long addr;
+ int wait = 0;
+ int ret = 0;
+
+ slot = ksm_scan->slot_index;
+ addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
+ rmap_item = get_rmap_item(slot->mm, addr);
+ if (rmap_item) {
+ if (update_tree(rmap_item)) {
+ rmap_item = NULL;
+ wait = 1;
+ }
+ }
+
+ /* We first start with searching the page inside the stable tree */
+ tree_rmap_item = stable_tree_search(page, page2, rmap_item);
+ if (tree_rmap_item) {
+ struct rmap_item *tmp_rmap_item = NULL;
+
+ if (!rmap_item) {
+ tmp_rmap_item = alloc_rmap_item();
+ if (!tmp_rmap_item)
+ return ret;
+ }
+
+ BUG_ON(!tree_rmap_item->tree_item);
+ ret = try_to_merge_two_pages_noalloc(slot->mm, page, page2[0],
+ addr);
+ put_page(page2[0]);
+ if (!ret) {
+ /*
+			 * The page was successfully merged, let's insert its
+			 * rmap_item into the stable tree.
+ */
+
+ if (!rmap_item) {
+ create_new_rmap_item(tmp_rmap_item, slot->mm,
+ addr, 0);
+ rmap_item = tmp_rmap_item;
+ }
+
+ rmap_item->next = tree_rmap_item->next;
+ rmap_item->prev = tree_rmap_item;
+
+ if (tree_rmap_item->next)
+ tree_rmap_item->next->prev = rmap_item;
+
+ tree_rmap_item->next = rmap_item;
+
+ rmap_item->stable_tree = 1;
+ rmap_item->tree_item = tree_rmap_item->tree_item;
+ } else {
+ if (tmp_rmap_item)
+ free_rmap_item(tmp_rmap_item);
+ }
+ ret = !ret;
+ goto out;
+ }
+
+ /*
+	 * In case the hash value of the page has changed since the last time
+	 * we calculated it, this page is likely to change frequently, therefore
+	 * we don't want to insert it into the unstable tree, and we don't want
+	 * to waste our time searching for something identical to it there.
+ */
+ if (rmap_item) {
+ checksum = calc_checksum(page);
+ if (rmap_item->oldchecksum != checksum) {
+ rmap_item->oldchecksum = checksum;
+ goto out;
+ }
+ }
+
+ tree_item = unstable_tree_search_insert(page, page2, rmap_item);
+ if (tree_item) {
+ struct rmap_item *tmp_rmap_item = NULL;
+ struct rmap_item *merge_rmap_item;
+
+ merge_rmap_item = tree_item->rmap_item;
+ BUG_ON(!merge_rmap_item);
+
+ if (!rmap_item) {
+ tmp_rmap_item = alloc_rmap_item();
+ if (!tmp_rmap_item)
+ return ret;
+ }
+
+ ret = try_to_merge_two_pages_alloc(slot->mm, page,
+ merge_rmap_item->mm,
+ page2[0], addr,
+ merge_rmap_item->address);
+ /*
+		 * As soon as we successfully merge this page, we want to remove
+ * the rmap_item object of the page that we have merged with
+ * from the unstable_tree and instead insert it as a new stable
+ * tree node.
+ */
+ if (!ret) {
+ rb_erase(&tree_item->node, &root_unstable_tree);
+ /*
+			 * In case we fail to insert the page into the stable
+			 * tree, we will have 2 virtual addresses pointing into
+			 * a KsmPage that won't be inside the stable tree,
+			 * therefore we have to mark both of their rmap_items
+			 * with kpage_outside_tree = 1 and increment
+			 * nkpage_out_tree by 2.
+ */
+ if (stable_tree_insert(page2[0],
+ tree_item, merge_rmap_item)) {
+ merge_rmap_item->kpage_outside_tree = 1;
+ if (!rmap_item) {
+ create_new_rmap_item(tmp_rmap_item,
+ slot->mm,
+ addr, 0);
+ rmap_item = tmp_rmap_item;
+ }
+ rmap_item->kpage_outside_tree = 1;
+ nkpage_out_tree += 2;
+ } else {
+ if (tmp_rmap_item) {
+ create_new_rmap_item(tmp_rmap_item,
+ slot->mm, addr, 0);
+ rmap_item = tmp_rmap_item;
+ }
+ rmap_item->stable_tree = 1;
+ }
+ } else {
+ if (tmp_rmap_item)
+ free_rmap_item(tmp_rmap_item);
+ }
+ put_page(page2[0]);
+ ret = !ret;
+ goto out;
+ }
+ /*
+	 * When wait is 1, we don't want to calculate the hash value of the
+	 * page right now; instead we prefer to wait.
+ */
+ if (!wait && !rmap_item) {
+ rmap_item = alloc_rmap_item();
+ if (!rmap_item)
+ return ret;
+ checksum = calc_checksum(page);
+ create_new_rmap_item(rmap_item, slot->mm, addr, checksum);
+ }
+out:
+ return ret;
+}
+
+/* return -EAGAIN - no slots registered, nothing to be done */
+static int scan_get_next_index(struct ksm_scan *ksm_scan)
+{
+ struct ksm_mem_slot *slot;
+
+ if (list_empty(&slots))
+ return -EAGAIN;
+
+ slot = ksm_scan->slot_index;
+
+ /* Are there pages left in this slot to scan? */
+ if ((slot->npages - ksm_scan->page_index - 1) > 0) {
+ ksm_scan->page_index++;
+ return 0;
+ }
+
+ list_for_each_entry_from(slot, &slots, link) {
+ if (slot == ksm_scan->slot_index)
+ continue;
+ ksm_scan->page_index = 0;
+ ksm_scan->slot_index = slot;
+ return 0;
+ }
+
+	/* looks like we finished scanning the whole of memory, start again */
+ root_unstable_tree = RB_ROOT;
+ ksm_scan->page_index = 0;
+ ksm_scan->slot_index = list_first_entry(&slots,
+ struct ksm_mem_slot, link);
+ return 0;
+}
+
+/*
+ * update slot_index - make sure ksm_scan points to valid data;
+ * it is possible that by the time we get here the data that ksm_scan was
+ * pointing to has been released, so we have to call this function every time
+ * after taking the slots_lock
+ */
+static void scan_update_old_index(struct ksm_scan *ksm_scan)
+{
+ struct ksm_mem_slot *slot;
+
+ if (list_empty(&slots))
+ return;
+
+ list_for_each_entry(slot, &slots, link) {
+ if (ksm_scan->slot_index == slot)
+ return;
+ }
+
+ ksm_scan->slot_index = list_first_entry(&slots,
+ struct ksm_mem_slot, link);
+ ksm_scan->page_index = 0;
+}
+
+/**
+ * ksm_scan_start - the ksm scanner main worker function.
+ * @ksm_scan - the scanner.
+ * @scan_npages - number of pages we want to scan before we return from this
+ * function.
+ *
+ * (this function can be called from the kernel thread scanner, or from
+ * userspace ioctl context scanner)
+ *
+ * The function returns -EAGAIN in case there are no slots to scan.
+ */
+static int ksm_scan_start(struct ksm_scan *ksm_scan, unsigned int scan_npages)
+{
+ struct ksm_mem_slot *slot;
+ struct page *page[1];
+ int val;
+ int ret = 0;
+
+ down_read(&slots_lock);
+
+ scan_update_old_index(ksm_scan);
+
+ while (scan_npages > 0) {
+ ret = scan_get_next_index(ksm_scan);
+ if (ret)
+ goto out;
+
+ slot = ksm_scan->slot_index;
+
+ cond_resched();
+
+ /*
+ * If the page is swapped out or in swap cache, we don't want to
+ * scan it (it is just for performance).
+ */
+ if (is_present_pte(slot->mm, slot->addr +
+ ksm_scan->page_index * PAGE_SIZE)) {
+ down_read(&slot->mm->mmap_sem);
+ val = get_user_pages(current, slot->mm, slot->addr +
+					 ksm_scan->page_index * PAGE_SIZE,
+ 1, 0, 0, page, NULL);
+ up_read(&slot->mm->mmap_sem);
+ if (val == 1) {
+ if (!PageKsm(page[0]))
+ cmp_and_merge_page(ksm_scan, page[0]);
+ put_page(page[0]);
+ }
+ }
+ scan_npages--;
+ }
+ scan_get_next_index(ksm_scan);
+out:
+ up_read(&slots_lock);
+ return ret;
+}
+
+static struct file_operations ksm_sma_fops = {
+ .release = ksm_sma_release,
+ .unlocked_ioctl = ksm_sma_ioctl,
+ .compat_ioctl = ksm_sma_ioctl,
+};
+
+static int ksm_dev_ioctl_create_shared_memory_area(void)
+{
+ int fd = -1;
+ struct ksm_sma *ksm_sma;
+
+ ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
+ if (!ksm_sma)
+ goto out;
+
+ INIT_LIST_HEAD(&ksm_sma->sma_slots);
+
+ fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
+ if (fd < 0)
+ goto out_free;
+
+ return fd;
+out_free:
+ kfree(ksm_sma);
+out:
+ return fd;
+}
+
+static long ksm_dev_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ long r = -EINVAL;
+
+ switch (ioctl) {
+ case KSM_GET_API_VERSION:
+ r = KSM_API_VERSION;
+ break;
+ case KSM_CREATE_SHARED_MEMORY_AREA:
+ r = ksm_dev_ioctl_create_shared_memory_area();
+ break;
+ default:
+ break;
+ }
+ return r;
+}
+
+static struct file_operations ksm_chardev_ops = {
+ .unlocked_ioctl = ksm_dev_ioctl,
+ .compat_ioctl = ksm_dev_ioctl,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice ksm_dev = {
+ KSM_MINOR,
+ "ksm",
+ &ksm_chardev_ops,
+};
+
+int kthread_ksm_scan_thread(void *nothing)
+{
+ while (!kthread_should_stop()) {
+ if (ksmd_flags & ksm_control_flags_run) {
+ down_read(&kthread_lock);
+ ksm_scan_start(&kthread_ksm_scan,
+ kthread_pages_to_scan);
+ up_read(&kthread_lock);
+ schedule_timeout_interruptible(
+ usecs_to_jiffies(kthread_sleep));
+ } else {
+ wait_event_interruptible(kthread_wait,
+ ksmd_flags & ksm_control_flags_run ||
+ kthread_should_stop());
+ }
+ }
+ return 0;
+}
+
+#define KSM_ATTR_RO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+#define KSM_ATTR(_name) \
+ static struct kobj_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t sleep_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ unsigned int usecs;
+
+ down_read(&kthread_lock);
+ usecs = kthread_sleep;
+ up_read(&kthread_lock);
+
+ return sprintf(buf, "%u\n", usecs);
+}
+
+static ssize_t sleep_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long usecs;
+ int err;
+
+ err = strict_strtoul(buf, 10, &usecs);
+ if (err)
+ return 0;
+
+ /* TODO sanitize usecs */
+
+ down_write(&kthread_lock);
+ kthread_sleep = usecs;
+ up_write(&kthread_lock);
+
+ return count;
+}
+KSM_ATTR(sleep);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ unsigned long nr_pages;
+
+ down_read(&kthread_lock);
+ nr_pages = kthread_pages_to_scan;
+ up_read(&kthread_lock);
+
+ return sprintf(buf, "%lu\n", nr_pages);
+}
+
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long nr_pages;
+
+ err = strict_strtoul(buf, 10, &nr_pages);
+ if (err)
+ return 0;
+
+ down_write(&kthread_lock);
+ kthread_pages_to_scan = nr_pages;
+ up_write(&kthread_lock);
+
+ return count;
+}
+KSM_ATTR(pages_to_scan);
+
+static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ unsigned long run;
+
+ down_read(&kthread_lock);
+ run = ksmd_flags;
+ up_read(&kthread_lock);
+
+ return sprintf(buf, "%lu\n", run);
+}
+
+static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long k_flags;
+
+ err = strict_strtoul(buf, 10, &k_flags);
+ if (err)
+ return 0;
+
+ down_write(&kthread_lock);
+ ksmd_flags = k_flags;
+ up_write(&kthread_lock);
+
+ if (ksmd_flags)
+ wake_up_interruptible(&kthread_wait);
+
+ return count;
+}
+KSM_ATTR(run);
+
+static ssize_t pages_shared_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ /*
+ * Note: this number does not include the shared pages outside the
+ * stable tree.
+ */
+ return sprintf(buf, "%lu\n", ksm_pages_shared - nnodes_stable_tree);
+}
+KSM_ATTR_RO(pages_shared);
+
+static ssize_t kernel_pages_allocated_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", nnodes_stable_tree);
+}
+KSM_ATTR_RO(kernel_pages_allocated);
+
+static ssize_t max_kernel_pages_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long nr_pages;
+
+ err = strict_strtoul(buf, 10, &nr_pages);
+ if (err)
+ return 0;
+
+ down_write(&kthread_lock);
+ kthread_max_kernel_pages = nr_pages;
+ up_write(&kthread_lock);
+
+ return count;
+}
+
+static ssize_t max_kernel_pages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ unsigned long nr_pages;
+
+ down_read(&kthread_lock);
+ nr_pages = kthread_max_kernel_pages;
+ up_read(&kthread_lock);
+
+ return sprintf(buf, "%lu\n", nr_pages);
+}
+KSM_ATTR(max_kernel_pages);
+
+static struct attribute *ksm_attrs[] = {
+ &sleep_attr.attr,
+ &pages_to_scan_attr.attr,
+ &run_attr.attr,
+ &pages_shared_attr.attr,
+ &kernel_pages_allocated_attr.attr,
+ &max_kernel_pages_attr.attr,
+ NULL,
+};
+
+static struct attribute_group ksm_attr_group = {
+ .attrs = ksm_attrs,
+ .name = "ksm",
+};
+
+
+static int __init ksm_init(void)
+{
+ int r;
+
+ r = ksm_slab_init();
+ if (r)
+ goto out;
+
+ r = rmap_hash_init();
+ if (r)
+ goto out_free1;
+
+ kthread = kthread_run(kthread_ksm_scan_thread, NULL, "kksmd");
+ if (IS_ERR(kthread)) {
+ printk(KERN_ERR "ksm: creating kthread failed\n");
+ r = PTR_ERR(kthread);
+ goto out_free2;
+ }
+
+ r = misc_register(&ksm_dev);
+ if (r) {
+ printk(KERN_ERR "ksm: misc device register failed\n");
+ goto out_free3;
+ }
+
+ r = sysfs_create_group(mm_kobj, &ksm_attr_group);
+ if (r) {
+ printk(KERN_ERR "ksm: register sysfs failed\n");
+ goto out_free4;
+ }
+
+ printk(KERN_WARNING "ksm loaded\n");
+ return 0;
+
+out_free4:
+ misc_deregister(&ksm_dev);
+out_free3:
+ kthread_stop(kthread);
+out_free2:
+ rmap_hash_free();
+out_free1:
+ ksm_slab_free();
+out:
+ return r;
+}
+
+static void __exit ksm_exit(void)
+{
+ sysfs_remove_group(mm_kobj, &ksm_attr_group);
+ misc_deregister(&ksm_dev);
+ ksmd_flags = ksm_control_flags_run;
+ kthread_stop(kthread);
+ rmap_hash_free();
+ ksm_slab_free();
+}
+
+module_init(ksm_init)
+module_exit(ksm_exit)
--
1.5.6.5

2009-04-14 22:15:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

On Thu, 9 Apr 2009 06:58:37 +0300
Izik Eidus <[email protected]> wrote:

> KSM is a linux driver that allows dynamicly sharing identical memory
> pages between one or more processes.

Generally looks OK to me. But that doesn't mean much. We should rub
bottles with words like "hugh" and "nick" on them to be sure.


>
> ...
>
> include/linux/ksm.h | 48 ++
> include/linux/miscdevice.h | 1 +
> include/linux/mm.h | 5 +
> include/linux/mmu_notifier.h | 34 +
> include/linux/rmap.h | 11 +
> mm/Kconfig | 6 +
> mm/Makefile | 1 +
> mm/ksm.c | 1674 ++++++++++++++++++++++++++++++++++++++++++
> mm/memory.c | 90 +++-
> mm/mmu_notifier.c | 20 +
> mm/rmap.c | 139 ++++

And it's pretty unobtrusive for what it is. I expect we can get this
into 2.6.31 unless there are some pratfalls which I missed.

2009-04-14 22:15:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/4] add replace_page(): change the page pte is pointing to.

On Thu, 9 Apr 2009 06:58:40 +0300
Izik Eidus <[email protected]> wrote:

> replace_page() allow changing the mapping of pte from one physical page
> into diffrent physical page.

At a high level, this is very similar to what page migration does. Yet
this implementation shares nothing with the page migration code.

Can this situation be improved?

> this function works by removing oldpage from the rmap and calling
> put_page on it, and by setting the pte to point to newpage and
> inserting it into the rmap using page_add_file_rmap().
>
> note: newpage must be a non-anonymous page, the reason for this is:
> replace_page() is built to allow mapping one page into more than one
> virtual address, the mapping of this page can happen at different
> offsets inside each vma, and therefore we cannot trust page->index
> anymore.
>
> the side effect of this issue is that newpage cannot be anything but a
> kernel allocated page that is not swappable.
>

2009-04-14 22:16:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

On Thu, 9 Apr 2009 06:58:41 +0300
Izik Eidus <[email protected]> wrote:

> Ksm is a driver that allows merging identical pages between one or more
> applications in a way invisible to the applications that use it.
> Pages that are merged are marked as readonly and are COWed when any
> application tries to change them.
>
> Ksm is used for cases where using fork() is not suitable,
> one of these cases is where the pages of the application keep changing
> dynamically and the application cannot know in advance what pages are
> going to be identical.
>
> Ksm works by walking over the memory pages of the applications it
> scans in order to find identical pages.
> It uses two sorted data structures, called the stable and unstable trees,
> to find the identical pages in an effective way.
>
> When ksm finds two identical pages, it marks them as readonly and merges
> them into a single page;
> after the pages are marked as readonly and merged into one page, linux
> will treat these pages as normal copy_on_write pages and will COW them
> when write access happens to them.
>
> Ksm scans just the memory areas that were registered to be scanned by it.
>
> Ksm api:
>
> KSM_GET_API_VERSION:
> Give the userspace the api version of the module.
>
> KSM_CREATE_SHARED_MEMORY_AREA:
> Create a shared memory region fd, which later allows the user to register
> the memory regions to scan by using:
> KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION
>
> KSM_REGISTER_MEMORY_REGION:
> Register userspace virtual address range to be scanned by ksm.
> This ioctl is using the ksm_memory_region structure:
> ksm_memory_region:
> __u32 npages;
> number of pages to share inside this memory region.
> __u32 pad;
> __u64 addr:
> the beginning of the virtual address range of this region.
> __u64 reserved_bits;
> reserved bits for future usage.
>
> KSM_REMOVE_MEMORY_REGION:
> Remove memory region from ksm.
>
>
> ...
>
> +/* ioctls for /dev/ksm */

Confused. In the covering email you indicated that v2 of the patchset
had abandoned ioctls and had moved the interface to sysfs.

It would be good to completely (and briefly) describe KSM's proposed
userspace interfaces in the changelog or somewhere. I'm a bit confused.

>
> ...
>
> +/*
> + * slots_lock protect against removing and adding memory regions while a scanner
> + * is in the middle of scanning.
> + */

"protects"

> +static DECLARE_RWSEM(slots_lock);
> +
> +/* The stable and unstable trees heads. */
> +struct rb_root root_stable_tree = RB_ROOT;
> +struct rb_root root_unstable_tree = RB_ROOT;
> +
> +
> +/* The number of linked list members inside the hash table */
> +static int nrmaps_hash;

A signed type doesn't seem appropriate.

> +/* rmap_hash hash table */
> +static struct hlist_head *rmap_hash;
> +
> +static struct kmem_cache *tree_item_cache;
> +static struct kmem_cache *rmap_item_cache;
> +
> +/* the number of nodes inside the stable tree */
> +static unsigned long nnodes_stable_tree;
> +
> +/* the number of kernel allocated pages outside the stable tree */
> +static unsigned long nkpage_out_tree;
> +
> +static int kthread_sleep; /* sleep time of the kernel thread */
> +static int kthread_pages_to_scan; /* npages to scan for the kernel thread */
> +static int kthread_max_kernel_pages; /* number of unswappable pages allowed */

The name kthread_max_kernel_pages isn't very illuminating.

The use of "kthread" in the identifier makes it look like part of the
kthread subsystem.

> +static unsigned long ksm_pages_shared;
> +static struct ksm_scan kthread_ksm_scan;
> +static int ksmd_flags;
> +static struct task_struct *kthread;
> +static DECLARE_WAIT_QUEUE_HEAD(kthread_wait);
> +static DECLARE_RWSEM(kthread_lock);
> +
> +
>
> ...
>
> +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep = NULL;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map(pmd, addr);
> +out:
> + return ptep;
> +}

hm, this looks very generic. Does it duplicate anything which core
kernel already provides? If not, perhaps core kernel should provide
this (perhaps after some reorganisation).

>
> ...
>
> +static int rmap_hash_init(void)
> +{
> + if (!rmap_hash_size) {
> + struct sysinfo sinfo;
> +
> + si_meminfo(&sinfo);
> + rmap_hash_size = sinfo.totalram / 10;

One slot per ten pages of physical memory? Is this too large, too
small or just right?

> + }
> + nrmaps_hash = rmap_hash_size;
> + rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
> + if (!rmap_hash)
> + return -ENOMEM;
> + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
> + return 0;
> +}
> +
>
> ...
>
> +static void break_cow(struct mm_struct *mm, unsigned long addr)
> +{
> + struct page *page[1];
> +
> + down_read(&mm->mmap_sem);
> + if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL)) {
> + put_page(page[0]);
> + }
> + up_read(&mm->mmap_sem);
> +}

- unneeded braces around a single statement

- that single statement is over-indented.

- and it seems wrong. If get_user_pages() returned, say, -ENOMEM, we
end up doing put_page(random-uninitialised-address-from-stack-go-oops)?

>
> ...
>
> +static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
> + struct ksm_memory_region *mem)
> +{
> + struct ksm_mem_slot *slot;
> + int ret = -EPERM;
> +
> + slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
> + if (!slot) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + slot->mm = get_task_mm(current);
> + if (!slot->mm)
> + goto out_free;
> + slot->addr = mem->addr;
> + slot->npages = mem->npages;
> +
> + down_write(&slots_lock);
> +
> + list_add_tail(&slot->link, &slots);
> + list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
> +
> + up_write(&slots_lock);
> + return 0;
> +
> +out_free:
> + kfree(slot);
> +out:
> + return ret;
> +}

So this function pins the mm_struct. I wonder what the implications of
this are. Not much, I guess. Some comments in the code which explain
the object lifecycles would be nice.

> +static void remove_mm_from_hash_and_tree(struct mm_struct *mm)
> +{
> + struct ksm_mem_slot *slot;
> + int pages_count;
> +
> + list_for_each_entry(slot, &slots, link)
> + if (slot->mm == mm)
> + break;
> + BUG_ON(!slot);
> +
> + root_unstable_tree = RB_ROOT;
> + for (pages_count = 0; pages_count < slot->npages; ++pages_count)
> + remove_page_from_tree(mm, slot->addr +
> + pages_count * PAGE_SIZE);
> + list_del(&slot->link);
> +}

/* Called under slots_lock */

>
> ...
>
> +static int memcmp_pages(struct page *page1, struct page *page2)
> +{
> + char *addr1, *addr2;
> + int r;
> +
> + addr1 = kmap_atomic(page1, KM_USER0);
> + addr2 = kmap_atomic(page2, KM_USER1);
> + r = memcmp(addr1, addr2, PAGE_SIZE);
> + kunmap_atomic(addr1, KM_USER0);
> + kunmap_atomic(addr2, KM_USER1);
> + return r;
> +}

I wonder if this code all does enough cpu cache flushing to be able to
guarantee that it's looking at valid data. Not my area, and presumably
not an issue on x86.

>
> ...
>
> +static int try_to_merge_one_page(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + struct page *oldpage,
> + struct page *newpage,
> + pgprot_t newprot)
> +{
> + int ret = 1;
> + int odirect_sync;
> + unsigned long page_addr_in_vma;
> + pte_t orig_pte, *orig_ptep;
> +
> + if (!PageAnon(oldpage))
> + goto out;
> +
> + get_page(newpage);
> + get_page(oldpage);
> +
> + down_read(&mm->mmap_sem);
> +
> + page_addr_in_vma = addr_in_vma(vma, oldpage);
> + if (page_addr_in_vma == -EFAULT)
> + goto out_unlock;
> +
> + orig_ptep = get_pte(mm, page_addr_in_vma);
> + if (!orig_ptep)
> + goto out_unlock;
> + orig_pte = *orig_ptep;
> + pte_unmap(orig_ptep);
> + if (!pte_present(orig_pte))
> + goto out_unlock;
> + if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
> + goto out_unlock;
> + /*
> + * we need the page lock to read a stable PageSwapCache in
> + * page_wrprotect()
> + */
> + if (!trylock_page(oldpage))
> + goto out_unlock;

We need a comment here explaining why we can't use the much preferable
lock_page().

Why can't we use the much preferable lock_page()?

> + /*
> + * page_wrprotect check if the page is swapped or in swap cache,
> + * in the future we might want to run here if_present_pte and then
> + * swap_free
> + */
> + if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
> + unlock_page(oldpage);
> + goto out_unlock;
> + }
> + unlock_page(oldpage);
> + if (!odirect_sync)
> + goto out_unlock;
> +
> + orig_pte = pte_wrprotect(orig_pte);
> +
> + if (pages_identical(oldpage, newpage))
> + ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
> +
> +out_unlock:
> + up_read(&mm->mmap_sem);
> + put_page(oldpage);
> + put_page(newpage);
> +out:
> + return ret;
> +}
>
> ...
>
> +static int try_to_merge_two_pages_alloc(struct mm_struct *mm1,
> + struct page *page1,
> + struct mm_struct *mm2,
> + struct page *page2,
> + unsigned long addr1,
> + unsigned long addr2)
> +{
> + struct vm_area_struct *vma;
> + pgprot_t prot;
> + int ret = 1;
> + struct page *kpage;
> +
> + /*
> + * The number of the nodes inside the stable tree +
> + * nkpage_out_tree is the same as the number kernel pages that
> + * we hold.
> + */
> + if (kthread_max_kernel_pages &&
> + (nnodes_stable_tree + nkpage_out_tree) >=
> + kthread_max_kernel_pages)
> + return ret;
> +
> + kpage = alloc_page(GFP_HIGHUSER);
> + if (!kpage)
> + return ret;
> + down_read(&mm1->mmap_sem);
> + vma = find_vma(mm1, addr1);
> + up_read(&mm1->mmap_sem);
> + if (!vma) {
> + put_page(kpage);
> + return ret;
> + }
> + prot = vma->vm_page_prot;

What locking protects *vma here?

> + pgprot_val(prot) &= ~_PAGE_RW;
> +
> + copy_user_highpage(kpage, page1, addr1, vma);
> + ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot);
> +
> + if (!ret) {
> + down_read(&mm2->mmap_sem);
> + vma = find_vma(mm2, addr2);
> + up_read(&mm2->mmap_sem);
> + if (!vma) {
> + put_page(kpage);
> + break_cow(mm1, addr1);
> + ret = 1;
> + return ret;
> + }
> +
> + prot = vma->vm_page_prot;

And here.

> + pgprot_val(prot) &= ~_PAGE_RW;
> +
> + ret = try_to_merge_one_page(mm2, vma, page2, kpage,
> + prot);
> + /*
> + * If the secoend try_to_merge_one_page call was failed,
> + * we are in situation where we have Ksm page that have
> + * just one pte pointing to it, in this case we break
> + * it.
> + */
> + if (ret) {
> + break_cow(mm1, addr1);
> + } else {
> + ksm_pages_shared += 2;
> + }
> + }
> +
> + put_page(kpage);
> + return ret;
> +}
> +
> +/*
> + * try_to_merge_two_pages_noalloc - the same astry_to_merge_two_pages_alloc,
> + * but no new kernel page is allocated (page2 should be KsmPage)
> + */
> +static int try_to_merge_two_pages_noalloc(struct mm_struct *mm1,
> + struct page *page1,
> + struct page *page2,
> + unsigned long addr1)
> +{
> + struct vm_area_struct *vma;
> + pgprot_t prot;
> + int ret = 1;
> +
> + /*
> + * If page2 is shared, we can just make the pte of mm1(page1) point to
> + * page2.
> + */
> + BUG_ON(!PageKsm(page2));
> + down_read(&mm1->mmap_sem);
> + vma = find_vma(mm1, addr1);
> + up_read(&mm1->mmap_sem);
> + if (!vma)
> + return ret;
> + prot = vma->vm_page_prot;

etc.

> + pgprot_val(prot) &= ~_PAGE_RW;
> + ret = try_to_merge_one_page(mm1, vma, page1, page2, prot);
> + if (!ret)
> + ksm_pages_shared++;
> +
> + return ret;
> +}
> +
> +/*
> + * is_zapped_item - check if the page belong to the rmap_item was zapped.
> + *
> + * This function would check if the page that the virtual address inside
> + * rmap_item is poiting to is still KsmPage, and therefore we can trust the
> + * content of this page.
> + * Since that this function call already to get_user_pages it return the
> + * pointer to the page as an optimization.
> + */
> +static int is_zapped_item(struct rmap_item *rmap_item,
> + struct page **page)
> +{
> + int ret = 0;
> + struct vm_area_struct *vma;
> +
> + cond_resched();
> + if (is_present_pte(rmap_item->mm, rmap_item->address)) {
> + down_read(&rmap_item->mm->mmap_sem);
> + vma = find_vma(rmap_item->mm, rmap_item->address);
> + if (vma && !vma->vm_file) {
> + BUG_ON(vma->vm_flags & VM_SHARED);
> + ret = get_user_pages(current, rmap_item->mm,
> + rmap_item->address,
> + 1, 0, 0, page, NULL);
> + }
> + up_read(&rmap_item->mm->mmap_sem);
> + }
> +
> + if (!ret)
> + return 1;

Failed to check for get_user_pages() -ve return code?

> + if (unlikely(!PageKsm(page[0]))) {
> + put_page(page[0]);
> + return 1;
> + }
> + return 0;
> +}
> +
>
> ...
>
> +static struct tree_item *unstable_tree_search_insert(struct page *page,
> + struct page **page2,
> + struct rmap_item *page_rmap_item)
> +{
> + struct rb_node **new = &(root_unstable_tree.rb_node);
> + struct rb_node *parent = NULL;
> + struct tree_item *tree_item;
> + struct tree_item *new_tree_item;
> + struct rmap_item *rmap_item;
> +
> + while (*new) {
> + int ret;
> +
> + tree_item = rb_entry(*new, struct tree_item, node);
> + BUG_ON(!tree_item);
> + rmap_item = tree_item->rmap_item;
> + BUG_ON(!rmap_item);
> +
> + /*
> + * We dont want to swap in pages
> + */
> + if (!is_present_pte(rmap_item->mm, rmap_item->address))
> + return NULL;
> +
> + down_read(&rmap_item->mm->mmap_sem);
> + ret = get_user_pages(current, rmap_item->mm, rmap_item->address,
> + 1, 0, 0, page2, NULL);
> + up_read(&rmap_item->mm->mmap_sem);
> + if (!ret)
> + return NULL;

get_user_pages() return code..

> + ret = memcmp_pages(page, page2[0]);
> +
> + parent = *new;
> + if (ret < 0) {
> + put_page(page2[0]);
> + new = &((*new)->rb_left);
> + } else if (ret > 0) {
> + put_page(page2[0]);
> + new = &((*new)->rb_right);
> + } else {
> + return tree_item;
> + }
> + }
> +
> + if (!page_rmap_item)
> + return NULL;
> +
> + new_tree_item = alloc_tree_item();
> + if (!new_tree_item)
> + return NULL;
> +
> + page_rmap_item->tree_item = new_tree_item;
> + page_rmap_item->stable_tree = 0;
> + new_tree_item->rmap_item = page_rmap_item;
> + rb_link_node(&new_tree_item->node, parent, new);
> + rb_insert_color(&new_tree_item->node, &root_unstable_tree);
> +
> + return NULL;
> +}
> +
>
> ...
>
> +static struct file_operations ksm_sma_fops = {

const, please.

> + .release = ksm_sma_release,
> + .unlocked_ioctl = ksm_sma_ioctl,
> + .compat_ioctl = ksm_sma_ioctl,
> +};
> +
> +static int ksm_dev_ioctl_create_shared_memory_area(void)
> +{
> + int fd = -1;
> + struct ksm_sma *ksm_sma;
> +
> + ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
> + if (!ksm_sma)
> + goto out;

This will cause a return of -1. Returning -ENOMEM would be better.

> +
> + INIT_LIST_HEAD(&ksm_sma->sma_slots);
> +
> + fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
> + if (fd < 0)
> + goto out_free;
> +
> + return fd;
> +out_free:
> + kfree(ksm_sma);
> +out:
> + return fd;
> +}
> +
> +static long ksm_dev_ioctl(struct file *filp,
> + unsigned int ioctl, unsigned long arg)
> +{
> + long r = -EINVAL;
> +
> + switch (ioctl) {
> + case KSM_GET_API_VERSION:
> + r = KSM_API_VERSION;
> + break;
> + case KSM_CREATE_SHARED_MEMORY_AREA:
> + r = ksm_dev_ioctl_create_shared_memory_area();
> + break;
> + default:
> + break;
> + }
> + return r;
> +}
> +
> +static struct file_operations ksm_chardev_ops = {

const

> + .unlocked_ioctl = ksm_dev_ioctl,
> + .compat_ioctl = ksm_dev_ioctl,
> + .owner = THIS_MODULE,
> +};
> +
>
> ...
>

2009-04-15 11:26:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 3/4] add replace_page(): change the page pte is pointing to.

On Tue, Apr 14, 2009 at 03:09:25PM -0700, Andrew Morton wrote:
> On Thu, 9 Apr 2009 06:58:40 +0300
> Izik Eidus <[email protected]> wrote:
>
> > replace_page() allow changing the mapping of pte from one physical page
> > into diffrent physical page.
>
> At a high level, this is very similar to what page migration does. Yet
> this implementation shares nothing with the page migration code.
>
> Can this situation be improved?

This was discussed last time too. Basically the thing is that using
migration entry with its special page fault paths, for this looks a
bit of an overkill complexity and unnecessary dependency on the
migration code. All we need is to mark the pte readonly. replace_page
is a no brainer then. The brainer part is page_wrprotect
(page_wrprotect is like fork).

The data visibility in the final memcmp you mentioned in the other
mail is supposedly taken care of by page_wrprotect too. It already
does flush_cache_page for the virtual indexed and not physically
tagged caches. page_wrprotect has to also IPI all CPUs to nuke any not
wrprotected tlb entry. I don't think we need further smp memory
barriers when we're guaranteed all tlb entries are wrprotected in the
other cpus and an IPI and invlpg run in them, to be sure we read the
data stable during memcmp even if we read through the kernel
pagetables and the last userland write happened through userland ptes
before they become effective wrprotected by the IPI.

2009-04-15 22:39:49

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

Andrew Morton wrote:
> On Thu, 9 Apr 2009 06:58:41 +0300
> Izik Eidus <[email protected]> wrote:
>
>
>
> Confused. In the covering email you indicated that v2 of the patchset
> had abandoned ioctls and had moved the interface to sysfs.
>
We have abandoned the ioctls that control the ksm behavior (how much cpu
it takes, how many kernel pages it may allocate and so on...)
But we still use ioctls to register the application memory to be used
with ksm.
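
Roughly, the registration path from userspace ends up looking like this
(just a sketch; the exact contents of include/linux/ksm.h and the error
handling are assumed here, only the ioctl names and the ksm_memory_region
fields come from the patch):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ksm.h>		/* KSM_* ioctls, struct ksm_memory_region */

/* Sketch: register npages of anonymous memory starting at addr with ksm. */
static int ksm_register_region(void *addr, __u32 npages)
{
	struct ksm_memory_region region = {
		.npages = npages,
		.addr = (unsigned long)addr,
	};
	int dev_fd, sma_fd;

	dev_fd = open("/dev/ksm", O_RDWR);
	if (dev_fd < 0)
		return -1;
	if (ioctl(dev_fd, KSM_GET_API_VERSION) != KSM_API_VERSION)
		goto fail;
	/* per-process fd; ksm_sma_release drops its slots when it is closed */
	sma_fd = ioctl(dev_fd, KSM_CREATE_SHARED_MEMORY_AREA);
	if (sma_fd < 0)
		goto fail;
	if (ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region) < 0) {
		close(sma_fd);
		goto fail;
	}
	close(dev_fd);
	return sma_fd;
fail:
	close(dev_fd);
	return -1;
}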

> It would be good to completely (and briefly) describe KSM's proposed
> userspace interfaces in the changelog or somewhere. I'm a bit confused.
>

I will post a new, clean description of the ksm api with V4.

>
>
>
>>
>> +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *ptep = NULL;
>> +
>> + pgd = pgd_offset(mm, addr);
>> + if (!pgd_present(*pgd))
>> + goto out;
>> +
>> + pud = pud_offset(pgd, addr);
>> + if (!pud_present(*pud))
>> + goto out;
>> +
>> + pmd = pmd_offset(pud, addr);
>> + if (!pmd_present(*pmd))
>> + goto out;
>> +
>> + ptep = pte_offset_map(pmd, addr);
>> +out:
>> + return ptep;
>> +}
>>
>
> hm, this looks very generic. Does it duplicate anything which core
> kernel already provides?

I don't think so.

> If not, perhaps core kernel should provide
> this (perhaps after some reorganisation).
>

A quick grep over the code shows me at least 2 places that could use this
function: one is remove_migration_pte() inside migrate.c, and the other is
page_check_address() inside rmap.c.

I will post an inline get_ptep() function with V4; worst case I will get
nacked.
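
Something along these lines (just the walk from the get_pte() quoted above,
turned into a shared inline; the name and the final location are guesses at
this point):

/*
 * Sketch of the shared helper: same walk as ksm's get_pte().
 * The caller must pte_unmap() the returned pte when it is non-NULL.
 */
static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;

	pgd = pgd_offset(mm, addr);
	if (!pgd_present(*pgd))
		return NULL;

	pud = pud_offset(pgd, addr);
	if (!pud_present(*pud))
		return NULL;

	pmd = pmd_offset(pud, addr);
	if (!pmd_present(*pmd))
		return NULL;

	return pte_offset_map(pmd, addr);
}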

>
>> ...
>>
>> +static int rmap_hash_init(void)
>> +{
>> + if (!rmap_hash_size) {
>> + struct sysinfo sinfo;
>> +
>> + si_meminfo(&sinfo);
>> + rmap_hash_size = sinfo.totalram / 10;
>>
>
> One slot per ten pages of physical memory? Is this too large, too
> small or just right?
>

It highly depends on the number of processes / memory regions that will be
registered with ksm.
It is a module parameter, so the user can change it to whatever they want.
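
For a rough sense of scale (my own back-of-the-envelope numbers, assuming
4 KB pages and 64-bit pointers): a 4 GB host has about one million pages,
so totalram / 10 gives roughly 100,000 buckets, and at one pointer per
hlist_head that is on the order of 800 KB of vmalloc'ed memory.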

>
>> + }
>> + nrmaps_hash = rmap_hash_size;
>> + rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
>> + if (!rmap_hash)
>> + return -ENOMEM;
>> + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
>> + return 0;
>> +}
>> +
>>
>> ...
>>
>> +static void break_cow(struct mm_struct *mm, unsigned long addr)
>> +{
>> + struct page *page[1];
>> +
>> + down_read(&mm->mmap_sem);
>> + if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL)) {
>> + put_page(page[0]);
>> + }
>> + up_read(&mm->mmap_sem);
>> +}
>>
>
> - unneeded braces around a single statement
>
> - that single statement is over-indented.
>
> - and it seems wrong. If get_user_pages() returned, say, -ENOMEM, we
> end up doing put_page(random-uninitialised-address-from-stack-go-oops)?
>

Good catch.

>
>> ...
>>
>> +static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
>> + struct ksm_memory_region *mem)
>> +{
>> + struct ksm_mem_slot *slot;
>> + int ret = -EPERM;
>> +
>> + slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
>> + if (!slot) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + slot->mm = get_task_mm(current);
>> + if (!slot->mm)
>> + goto out_free;
>> + slot->addr = mem->addr;
>> + slot->npages = mem->npages;
>> +
>> + down_write(&slots_lock);
>> +
>> + list_add_tail(&slot->link, &slots);
>> + list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
>> +
>> + up_write(&slots_lock);
>> + return 0;
>> +
>> +out_free:
>> + kfree(slot);
>> +out:
>> + return ret;
>> +}
>>
>
> So this function pins the mm_struct. I wonder what the implications of
> this are.

The mm struct won't go away until the file is closed... (the application
closes the file descriptor, or the application dies)

> Not much, I guess. Some comments in the code which explain
> the object lifecycles would be nice.
>
>
>
>> ...
>>
>> +static int memcmp_pages(struct page *page1, struct page *page2)
>> +{
>> + char *addr1, *addr2;
>> + int r;
>> +
>> + addr1 = kmap_atomic(page1, KM_USER0);
>> + addr2 = kmap_atomic(page2, KM_USER1);
>> + r = memcmp(addr1, addr2, PAGE_SIZE);
>> + kunmap_atomic(addr1, KM_USER0);
>> + kunmap_atomic(addr2, KM_USER1);
>> + return r;
>> +}
>>
>
> I wonder if this code all does enough cpu cache flushing to be able to
> guarantee that it's looking at valid data. Not my area, and presumably
> not an issue on x86.
>

Andrea pointed out in a previous reply that, due to the fact that we are
running page_wrprotect() on these pages, memcmp_pages() should be stable.

>
>> ...
>>
>> +static int try_to_merge_one_page(struct mm_struct *mm,
>> + struct vm_area_struct *vma,
>> + struct page *oldpage,
>> + struct page *newpage,
>> + pgprot_t newprot)
>> +{
>> + int ret = 1;
>> + int odirect_sync;
>> + unsigned long page_addr_in_vma;
>> + pte_t orig_pte, *orig_ptep;
>> +
>> + if (!PageAnon(oldpage))
>> + goto out;
>> +
>> + get_page(newpage);
>> + get_page(oldpage);
>> +
>> + down_read(&mm->mmap_sem);
>> +
>> + page_addr_in_vma = addr_in_vma(vma, oldpage);
>> + if (page_addr_in_vma == -EFAULT)
>> + goto out_unlock;
>> +
>> + orig_ptep = get_pte(mm, page_addr_in_vma);
>> + if (!orig_ptep)
>> + goto out_unlock;
>> + orig_pte = *orig_ptep;
>> + pte_unmap(orig_ptep);
>> + if (!pte_present(orig_pte))
>> + goto out_unlock;
>> + if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
>> + goto out_unlock;
>> + /*
>> + * we need the page lock to read a stable PageSwapCache in
>> + * page_wrprotect()
>> + */
>> + if (!trylock_page(oldpage))
>> + goto out_unlock;
>>
>
> We need a comment here explaining why we can't use the much preferable
> lock_page().
>
> Why can't we use the much preferable lock_page()?
>

We can, but I am not sure we want to: why block on I/O when we can just
move on to the next page?
(We will come back to this page later...)

>
>> + /*
>> + * page_wrprotect check if the page is swapped or in swap cache,
>> + * in the future we might want to run here if_present_pte and then
>> + * swap_free
>> + */
>> + if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
>> + unlock_page(oldpage);
>> + goto out_unlock;
>> + }
>> + unlock_page(oldpage);
>> + if (!odirect_sync)
>> + goto out_unlock;
>> +
>> + orig_pte = pte_wrprotect(orig_pte);
>> +
>> + if (pages_identical(oldpage, newpage))
>> + ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
>> +
>> +out_unlock:
>> + up_read(&mm->mmap_sem);
>> + put_page(oldpage);
>> + put_page(newpage);
>> +out:
>> + return ret;
>> +}
>>
>> ...
>>
>> +static int try_to_merge_two_pages_alloc(struct mm_struct *mm1,
>> + struct page *page1,
>> + struct mm_struct *mm2,
>> + struct page *page2,
>> + unsigned long addr1,
>> + unsigned long addr2)
>> +{
>> + struct vm_area_struct *vma;
>> + pgprot_t prot;
>> + int ret = 1;
>> + struct page *kpage;
>> +
>> + /*
>> + * The number of the nodes inside the stable tree +
>> + * nkpage_out_tree is the same as the number kernel pages that
>> + * we hold.
>> + */
>> + if (kthread_max_kernel_pages &&
>> + (nnodes_stable_tree + nkpage_out_tree) >=
>> + kthread_max_kernel_pages)
>> + return ret;
>> +
>> + kpage = alloc_page(GFP_HIGHUSER);
>> + if (!kpage)
>> + return ret;
>> + down_read(&mm1->mmap_sem);
>> + vma = find_vma(mm1, addr1);
>> + up_read(&mm1->mmap_sem);
>> + if (!vma) {
>> + put_page(kpage);
>> + return ret;
>> + }
>> + prot = vma->vm_page_prot;
>>
>
> What locking protects *vma here?
>

Good catch.

>
>
>
>> + pgprot_val(prot) &= ~_PAGE_RW;
>> + ret = try_to_merge_one_page(mm1, vma, page1, page2, prot);
>> + if (!ret)
>> + ksm_pages_shared++;
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * is_zapped_item - check if the page belong to the rmap_item was zapped.
>> + *
>> + * This function would check if the page that the virtual address inside
>> + * rmap_item is poiting to is still KsmPage, and therefore we can trust the
>> + * content of this page.
>> + * Since that this function call already to get_user_pages it return the
>> + * pointer to the page as an optimization.
>> + */
>> +static int is_zapped_item(struct rmap_item *rmap_item,
>> + struct page **page)
>> +{
>> + int ret = 0;
>> + struct vm_area_struct *vma;
>> +
>> + cond_resched();
>> + if (is_present_pte(rmap_item->mm, rmap_item->address)) {
>> + down_read(&rmap_item->mm->mmap_sem);
>> + vma = find_vma(rmap_item->mm, rmap_item->address);
>> + if (vma && !vma->vm_file) {
>> + BUG_ON(vma->vm_flags & VM_SHARED);
>> + ret = get_user_pages(current, rmap_item->mm,
>> + rmap_item->address,
>> + 1, 0, 0, page, NULL);
>> + }
>> + up_read(&rmap_item->mm->mmap_sem);
>> + }
>> +
>> + if (!ret)
>> + return 1;
>>
>
> Failed to check for get_user_pages() -ve return code?
>
Right.

>
>> + if (unlikely(!PageKsm(page[0]))) {
>> + put_page(page[0]);
>> + return 1;
>> + }
>> + return 0;
>> +}
>> +
>>
>> ...
>>
>> +static struct tree_item *unstable_tree_search_insert(struct page *page,
>> + struct page **page2,
>> + struct rmap_item *page_rmap_item)
>> +{
>> + struct rb_node **new = &(root_unstable_tree.rb_node);
>> + struct rb_node *parent = NULL;
>> + struct tree_item *tree_item;
>> + struct tree_item *new_tree_item;
>> + struct rmap_item *rmap_item;
>> +
>> + while (*new) {
>> + int ret;
>> +
>> + tree_item = rb_entry(*new, struct tree_item, node);
>> + BUG_ON(!tree_item);
>> + rmap_item = tree_item->rmap_item;
>> + BUG_ON(!rmap_item);
>> +
>> + /*
>> + * We dont want to swap in pages
>> + */
>> + if (!is_present_pte(rmap_item->mm, rmap_item->address))
>> + return NULL;
>> +
>> + down_read(&rmap_item->mm->mmap_sem);
>> + ret = get_user_pages(current, rmap_item->mm, rmap_item->address,
>> + 1, 0, 0, page2, NULL);
>> + up_read(&rmap_item->mm->mmap_sem);
>> + if (!ret)
>> + return NULL;
>>
>
> get_user_pages() return code..
>
>
>> + ret = memcmp_pages(page, page2[0]);
>> +
>> + parent = *new;
>> + if (ret < 0) {
>> + put_page(page2[0]);
>> + new = &((*new)->rb_left);
>> + } else if (ret > 0) {
>> + put_page(page2[0]);
>> + new = &((*new)->rb_right);
>> + } else {
>> + return tree_item;
>> + }
>> + }
>> +
>> + if (!page_rmap_item)
>> + return NULL;
>> +
>> + new_tree_item = alloc_tree_item();
>> + if (!new_tree_item)
>> + return NULL;
>> +
>> + page_rmap_item->tree_item = new_tree_item;
>> + page_rmap_item->stable_tree = 0;
>> + new_tree_item->rmap_item = page_rmap_item;
>> + rb_link_node(&new_tree_item->node, parent, new);
>> + rb_insert_color(&new_tree_item->node, &root_unstable_tree);
>> +
>> + return NULL;
>> +}
>> +
>>
>> ...
>>
>> +static struct file_operations ksm_sma_fops = {
>>
>
> const, please.
>
>
>> + .release = ksm_sma_release,
>> + .unlocked_ioctl = ksm_sma_ioctl,
>> + .compat_ioctl = ksm_sma_ioctl,
>> +};
>> +
>> +static int ksm_dev_ioctl_create_shared_memory_area(void)
>> +{
>> + int fd = -1;
>> + struct ksm_sma *ksm_sma;
>> +
>> + ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
>> + if (!ksm_sma)
>> + goto out;
>>
>
> This will cause a return of -1. Returning -ENOMEM would be better.
>
Right.



Thanks for the review, I will cook V4, test it and send it.


Thanks.

2009-04-15 22:51:17

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 3/4] add replace_page(): change the page pte is pointing to.

Andrea Arcangeli wrote:
> On Tue, Apr 14, 2009 at 03:09:25PM -0700, Andrew Morton wrote:
>
>> On Thu, 9 Apr 2009 06:58:40 +0300
>> Izik Eidus <[email protected]> wrote:
>>
>>
>>> replace_page() allow changing the mapping of pte from one physical page
>>> into diffrent physical page.
>>>
>> At a high level, this is very similar to what page migration does. Yet
>> this implementation shares nothing with the page migration code.
>>
>> Can this situation be improved?
>>
>
> This was discussed last time too. Basically the thing is that using
> migration entry with its special page fault paths, for this looks a
> bit of an overkill complexity and unnecessary dependency on the
> migration code.

I agree about that.

> All we need is to mark the pte readonly. replace_page
> is a no brainer then. The brainer part is page_wrprotect
> (page_wrprotect is like fork).
>
> The data visibility in the final memcmp you mentioned in the other
> mail is supposedly taken care of by page_wrprotect too. It already
> does flush_cache_page for the virtual indexed and not physically
> tagged caches. page_wrprotect has to also IPI all CPUs to nuke any not
> wrprotected tlb entry. I don't think we need further smp memory
> barriers when we're guaranteed all tlb entries are wrprotected in the
> other cpus and an IPI and invlpg run in them, to be sure we read the
> data stable during memcmp even if we read through the kernel
> pagetables and the last userland write happened through userland ptes
> before they become effective wrprotected by the IPI.
>

Yup agree.

2009-04-15 22:57:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

On Thu, 16 Apr 2009 01:37:25 +0300
Izik Eidus <[email protected]> wrote:

> Andrew Morton wrote:
> > On Thu, 9 Apr 2009 06:58:41 +0300
> > Izik Eidus <[email protected]> wrote:
> >
> >
> >
> > Confused. In the covering email you indicated that v2 of the patchset
> > had abandoned ioctls and had moved the interface to sysfs.
> >
> We have abandoned the ioctls that control the ksm behavior (how much cpu
> it takes, how many kernel pages it may allocate and so on...)
> But we still use ioctls to register the application memory to be used
> with ksm.

hm. ioctls make kernel people weep and gnash teeth.

An appropriate interface would be to add new syscalls. But as ksm is
an optional thing and can even be modprobed, that doesn't work. And
having a driver in mm/ which can be modprobed is kinda neat.

I can't immediately think of a nicer interface. You could always poke
numbers into some pseudo-file but to me that seems as ugly, or uglier
than an ioctl (others seem to disagree).

Ho hum. Please design the ioctl interface so that it doesn't need any
compat handling if poss.
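
For reference, the ksm_memory_region described earlier already uses
fixed-width types and explicit padding, which is the usual way to keep the
32-bit and 64-bit layouts identical (a sketch of that layout as I read the
patch description; the comments are only a paraphrase of it):

struct ksm_memory_region {
	__u32 npages;		/* number of pages to share inside this region */
	__u32 pad;		/* explicit padding: addr sits at offset 8 on both ABIs */
	__u64 addr;		/* beginning of the virtual address range */
	__u64 reserved_bits;	/* reserved for future usage */
};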

2009-04-15 23:23:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

On Wed, Apr 15, 2009 at 03:50:58PM -0700, Andrew Morton wrote:
> an optional thing and can even be modprobed, that doesn't work. And
> having a driver in mm/ which can be modprobed is kinda neat.

Agreed. I think madvise with all its vma split requirements and
ksm-unregistering invoked at vma destruction time (under CONFIG_KSM ||
CONFIG_KSM_MODULE) is a clean approach only if ksm is considered a piece
of the core kernel VM. As long as only certain users out there use ksm
(i.e. only virtualization servers and LHC computations) the pseudochar
ioctl interface keeps it out of the kernel, so core kernel MM API
remains almost unaffected by ksm.

It's kinda neat that it's external as a self-contained module, but the whole
point is that to be self-contained it has to use ioctl.

Another thing is that madvise usually doesn't require mangling sysfs
to be effective. madvise without enabling ksm with sysfs would be
entirely useless. So doing it as madvise that returns success and has
no effect unless 'root' does something, is kind of weird.

Thinking about the absolute worst case: if this really turns out to be
the wrong decision, simply /dev/ksm won't exist anymore and no app could
ever break as they will gracefully handle the missing pseudochar. They
won't run the ioctl and just continue as if ksm.ko wasn't loaded. As
there are only a few (but critically important) apps using KSM,
converting them to fallback on madvise is a few liner trivial change
(kvm-userland will have 10 more lines to keep opening /dev/ksm before
calling madvise if we ever later decide KSM has to become a VM core
kernel functionality with madvise or its own per-arch syscall).

2009-04-16 00:43:22

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

Andrew Morton wrote:
>> +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *ptep = NULL;
>> +
>> + pgd = pgd_offset(mm, addr);
>> + if (!pgd_present(*pgd))
>> + goto out;
>> +
>> + pud = pud_offset(pgd, addr);
>> + if (!pud_present(*pud))
>> + goto out;
>> +
>> + pmd = pmd_offset(pud, addr);
>> + if (!pmd_present(*pmd))
>> + goto out;
>> +
>> + ptep = pte_offset_map(pmd, addr);
>> +out:
>> + return ptep;
>> +}
>>
>
> hm, this looks very generic. Does it duplicate anything which core
> kernel already provides? If not, perhaps core kernel should provide
> this (perhaps after some reorganisation).
>

It is lookup_address() which works on user addresses, and as such is
very useful. But it would need to deal with returning a level so it can
deal with large pages in usermode, and have some well-defined semantics
on whether the caller is responsible for unmapping the returned thing
(ie, only if its a pte).

I implemented this myself a couple of months ago, but I can't find it
anywhere...

>> +static int memcmp_pages(struct page *page1, struct page *page2)
>> +{
>> + char *addr1, *addr2;
>> + int r;
>> +
>> + addr1 = kmap_atomic(page1, KM_USER0);
>> + addr2 = kmap_atomic(page2, KM_USER1);
>> + r = memcmp(addr1, addr2, PAGE_SIZE);
>> + kunmap_atomic(addr1, KM_USER0);
>> + kunmap_atomic(addr2, KM_USER1);
>> + return r;
>> +}
>>
>
> I wonder if this code all does enough cpu cache flushing to be able to
> guarantee that it's looking at valid data. Not my area, and presumably
> not an issue on x86.
>

Shouldn't that be kmap_atomic's job anyway? Otherwise it would be hard
to use on any virtual-tag/indexed cache machine.

J

2009-04-16 00:59:40

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

Jeremy Fitzhardinge wrote:
> Andrew Morton wrote:
>>> +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
>>> +{
>>> + pgd_t *pgd;
>>> + pud_t *pud;
>>> + pmd_t *pmd;
>>> + pte_t *ptep = NULL;
>>> +
>>> + pgd = pgd_offset(mm, addr);
>>> + if (!pgd_present(*pgd))
>>> + goto out;
>>> +
>>> + pud = pud_offset(pgd, addr);
>>> + if (!pud_present(*pud))
>>> + goto out;
>>> +
>>> + pmd = pmd_offset(pud, addr);
>>> + if (!pmd_present(*pmd))
>>> + goto out;
>>> +
>>> + ptep = pte_offset_map(pmd, addr);
>>> +out:
>>> + return ptep;
>>> +}
>>>
>>
>> hm, this looks very generic. Does it duplicate anything which core
>> kernel already provides? If not, perhaps core kernel should provide
>> this (perhaps after some reorganisation).
>>
>
> It is lookup_address() which works on user addresses, and as such is
> very useful.

But ksm needs the pgd offset of an mm struct, not the kernel pgd, so
maybe changing it to take the pgd offset would be nice..

Another thing is that it is x86-only right now, so it would probably need
to move out to common code.

> But it would need to deal with returning a level so it can deal with
> large pages in usermode, and have some well-defined semantics on
> whether the caller is responsible for unmapping the returned thing
> (ie, only if its a pte).
>
> I implemented this myself a couple of months ago, but I can't find it
> anywhere...
>
>>> +static int memcmp_pages(struct page *page1, struct page *page2)
>>> +{
>>> + char *addr1, *addr2;
>>> + int r;
>>> +
>>> + addr1 = kmap_atomic(page1, KM_USER0);
>>> + addr2 = kmap_atomic(page2, KM_USER1);
>>> + r = memcmp(addr1, addr2, PAGE_SIZE);
>>> + kunmap_atomic(addr1, KM_USER0);
>>> + kunmap_atomic(addr2, KM_USER1);
>>> + return r;
>>> +}
>>>
>>
>> I wonder if this code all does enough cpu cache flushing to be able to
>> guarantee that it's looking at valid data. Not my area, and presumably
>> not an issue on x86.
>>
>
> Shouldn't that be kmap_atomic's job anyway? Otherwise it would be
> hard to use on any virtual-tag/indexed cache machine.
>
> J

2009-04-16 11:40:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

On Wed, Apr 15, 2009 at 05:43:03PM -0700, Jeremy Fitzhardinge wrote:
> Shouldn't that be kmap_atomic's job anyway? Otherwise it would be hard to

No because those are full noops in no-highmem kernels. I commented in
another email why I think it's safe thanks to the wrprotect + smp tlb
flush of the userland PTE.

2009-04-16 16:09:01

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

Andrea Arcangeli wrote:
> On Wed, Apr 15, 2009 at 05:43:03PM -0700, Jeremy Fitzhardinge wrote:
>
>> Shouldn't that be kmap_atomic's job anyway? Otherwise it would be hard to
>>
>
> No because those are full noops in no-highmem kernels. I commented in
> other email why I think it's safe thanks to the wrprotect + smp tlb
> flush of the userland PTE.
>

I think Andrew's query was about data cache synchronization in
architectures with virtually indexed d-cache. On x86 it's a non-issue,
but on architectures for which it is an issue, I assume kmap_atomic does
any necessary cache flushes, as it does tlb flushes on x86 (which may be
none at all, if no mapping actually happens).

J

2009-04-16 17:55:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

On Wednesday 15 April 2009 08:09:03 Andrew Morton wrote:
> On Thu, 9 Apr 2009 06:58:37 +0300
> Izik Eidus <[email protected]> wrote:
>
> > KSM is a linux driver that allows dynamicly sharing identical memory
> > pages between one or more processes.
>
> Generally looks OK to me. But that doesn't mean much. We should rub
> bottles with words like "hugh" and "nick" on them to be sure.

I haven't looked too closely at it yet sorry. Hugh has a great eye for
these details, though, hint hint :)

As everyone knows, my favourite thing is to say nasty things about any
new feature that adds complexity to common code. I feel like crying to
hear about how many more instances of MS Office we can all run, if only
we apply this patch. And the poorly written HPC app just sounds like
scrapings from the bottom of justification barrel.

I'm sorry, maybe I'm way off with my understanding of how important
this is. There isn't too much help in the changelog. A discussion of
where the memory savings comes from, and how far does things like
sharing of fs image, or ballooning goes and how much extra savings we
get from this... with people from other hypervisors involved as well.
Have I missed this kind of discussion?

Careful what you wish for, ay? :)

2009-04-16 18:29:19

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

Nick Piggin wrote:
> On Wednesday 15 April 2009 08:09:03 Andrew Morton wrote:
>
>> On Thu, 9 Apr 2009 06:58:37 +0300
>> Izik Eidus <[email protected]> wrote:
>>
>>
>>> KSM is a linux driver that allows dynamicly sharing identical memory
>>> pages between one or more processes.
>>>
>> Generally looks OK to me. But that doesn't mean much. We should rub
>> bottles with words like "hugh" and "nick" on them to be sure.
>>
>
> I haven't looked too closely at it yet sorry. Hugh has a great eye for
> these details, though, hint hint :)
>
> As everyone knows, my favourite thing is to say nasty things about any
> new feature that adds complexity to common code.

The whole idea, and the way I wrote it, is that it doesn't touch common
code; I didn't change the Linux mm logic anywhere. The worst thing we
have added is helper functions.

> I feel like crying to
> hear about how many more instances of MS Office we can all run, if only
> we apply this patch.

And more instances of linux guests...

> And the poorly written HPC app just sounds like
> scrapings from the bottom of justification barrel.
>

So if you have a big rendering application that loads gigabytes of
geometric data handled by many threads, and each thread sometimes
changes this geometric data without wanting the other threads to
notice, how would you share it in the traditional way? Once the shared
data gets COWed, how will you re-merge it when it becomes identical
again?
KSM does this for applications transparently.

The motivation for writing KSM was indeed KVM, where it is highly
needed; you may check what VMware says about having much better
overcommit than Hyper-V / Xen:

http://blogs.vmware.com/virtualreality/2008/03/cheap-hyperviso.html

It is important to understand that in virtualization environments there
are cases where memory is much more critical than any other resource
for achieving higher density.

Together with KSM, KVM will have the same memory overcommit abilities
as VMware has.

> I'm sorry, maybe I'm way off with my understanding of how important
> this is. There isn't too much help in the changelog. A discussion of
> where the memory savings comes from,

The memory savings come from identical libraries, identical kernels,
and zeroed pages; that is for virtualization.
Library code will always be identical among similar guests, so why keep
this code in multiple places in host memory?

> and how far does things like
> sharing of fs image, or ballooning goes and how much extra savings we
> get from this...

Ballooning is much worse when it comes to performance, because what it
does is shrink the guest memory. With KSM we find identical pages and
merge them into one page, so we don't lose guest performance.

> with people from other hypervisors involved as well.
> Have I missed this kind of discussion?
>
> Careful what you wish for, ay? :)
>

2009-04-17 07:08:22

by Jared Hulbert

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

> As everyone knows, my favourite thing is to say nasty things about any
> new feature that adds complexity to common code. I feel like crying to
> hear about how many more instances of MS Office we can all run, if only
> we apply this patch. And the poorly written HPC app just sounds like
> scrapings from the bottom of justification barrel.
>
> I'm sorry, maybe I'm way off with my understanding of how important
> this is. There isn't too much help in the changelog. A discussion of
> where the memory savings comes from, and how far does things like
> sharing of fs image, or ballooning goes and how much extra savings we
> get from this... with people from other hypervisors involved as well.
> Have I missed this kind of discussion?

Nick,

I don't know about other hypervisors, fs images, or ballooning, but I
have tried this out. It works. It works on apps I don't consider
"poorly written". I'm very excited about this. I got >10% savings on a
roughly off-the-shelf embedded system, with no user-noticeable
performance impact.

2009-04-18 14:59:49

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 4/4] add ksm kernel shared memory driver.

On Tue, Apr 14, 2009 at 03:09:29PM -0700, Andrew Morton wrote:
> We need a comment here explaining why we can't use the much preferable
> lock_page().
>
> Why can't we use the much preferable lock_page()?

We might, but then it'd risk wasting time waiting. It's not worth
waiting: we want kksmd to be allowed to keep one CPU (in the future
more than one, as we scale it SMP/NUMA) busy at all times running
memcmp, and not schedule (other than for need_resched()), so that it
frees memory at the fastest pace possible.
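
A minimal sketch of that tradeoff, with the function name and the
surrounding structure invented for illustration rather than taken from
the patch:

#include <linux/pagemap.h>	/* trylock_page, unlock_page */
#include <linux/sched.h>	/* cond_resched */

/*
 * Illustrative only: the scanner skips any page it cannot lock
 * immediately instead of sleeping in lock_page(), so the CPU stays
 * busy running comparisons and yields only via cond_resched().
 */
static void scan_one_page(struct page *page)
{
	if (!trylock_page(page))
		return;		/* busy right now: revisit on a later pass */

	/* ... compare against the tree and possibly merge ... */

	unlock_page(page);
	cond_resched();		/* the only scheduling point, per need_resched() */
}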

2009-04-21 03:00:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

On Friday 17 April 2009 17:08:07 Jared Hulbert wrote:
> > As everyone knows, my favourite thing is to say nasty things about any
> > new feature that adds complexity to common code. I feel like crying to
> > hear about how many more instances of MS Office we can all run, if only
> > we apply this patch. And the poorly written HPC app just sounds like
> > scrapings from the bottom of justification barrel.
> >
> > I'm sorry, maybe I'm way off with my understanding of how important
> > this is. There isn't too much help in the changelog. A discussion of
> > where the memory savings comes from, and how far does things like
> > sharing of fs image, or ballooning goes and how much extra savings we
> > get from this... with people from other hypervisors involved as well.
> > Have I missed this kind of discussion?
>
> Nick,
>
> I don't know about other hypervisors, fs and balloonings, but I have
> tried this out. It works. It works on apps I don't consider, "poorly
> written". I'm very excited about this. I got >10% saving in a
> roughly off the shelf embedded system. No user noticeable performance
> impact.

OK well that's what I want to hear. Thanks, that means a lot to me.