Hugh, so up to here we are in sync.
The question is: what do you want me to do now?
(Because we are skipping 2.6.31, it is OK for you to tell me something
like: "Shut up and let me see what I can get with this madvise" -
that is one option.
On the other hand, if you want me to do anything, please say so.)
Thanks.
Izik Eidus (3):
ksm: remove ksm from being a module.
ksm: remove page_wrprotect() from rmap.c
withdraw ksm-add-page_wrprotect-write-protecting-page.patch
include/linux/mm.h | 2 +-
include/linux/rmap.h | 12 ----
mm/Kconfig | 5 +-
mm/ksm.c | 92 +++++++++++++++++++++++----------
mm/memory.c | 2 +-
mm/rmap.c | 139 --------------------------------------------------
6 files changed, 69 insertions(+), 183 deletions(-)
Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/mm.h | 2 +-
include/linux/rmap.h | 4 ++--
mm/Kconfig | 5 ++---
mm/memory.c | 2 +-
mm/rmap.c | 2 +-
5 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e617bab..c1259bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1258,7 +1258,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#ifdef CONFIG_KSM
int replace_page(struct vm_area_struct *vma, struct page *oldpage,
struct page *newpage, pte_t orig_pte, pgprot_t prot);
#endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 469376d..8b98536 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,7 +118,7 @@ static inline int try_to_munlock(struct page *page)
}
#endif
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#ifdef CONFIG_KSM
int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
#endif
@@ -136,7 +136,7 @@ static inline int page_mkclean(struct page *page)
return 0;
}
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#ifdef CONFIG_KSM
static inline int page_wrprotect(struct page *page, int *odirect_sync,
int count_offset)
{
diff --git a/mm/Kconfig b/mm/Kconfig
index 5ebfd18..e7c118f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -227,10 +227,9 @@ config MMU_NOTIFIER
bool
config KSM
- tristate "Enable KSM for page sharing"
+ bool "Enable KSM for page sharing"
help
- Enable the KSM kernel module to allow page sharing of equal pages
- among different tasks.
+ Enable KSM to allow page sharing of equal pages among different tasks.
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
diff --git a/mm/memory.c b/mm/memory.c
index 8b4e40e..203bbd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1617,7 +1617,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
}
EXPORT_SYMBOL(vm_insert_mixed);
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#ifdef CONFIG_KSM
/**
* replace_page - replace page in vma with new page
diff --git a/mm/rmap.c b/mm/rmap.c
index f53074c..34a2029 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,7 +585,7 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#ifdef CONFIG_KSM
static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
int *odirect_sync, int count_offset)
--
1.5.6.5
Remove page_wrprotect() from rmap.c and instead embed the needed code
into ksm.c.
Hugh pointed out that for the KSM use case we don't have to walk the rmap
and write-protect page after page: when an anonymous page is mapped
more than once, it must already be write-protected, and when it is
mapped just once there is no need to walk the rmap at all - we can
write-protect it from inside ksm.c.
Thanks.
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Izik Eidus <[email protected]>
---
mm/ksm.c | 92 +++++++++++++++++++++++++++++++++++++++++++------------------
1 files changed, 65 insertions(+), 27 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 74d921b..3aee221 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
#include <linux/swap.h>
#include <linux/rbtree.h>
#include <linux/anon_inodes.h>
+#include <linux/mmu_notifier.h>
#include <linux/ksm.h>
#include <asm/tlbflush.h>
@@ -643,6 +644,66 @@ static inline int pages_identical(struct page *page1, struct page *page2)
}
/*
+ * If this anonymous page is mapped only here, its pte may need
+ * to be write-protected, If it`s mapped elsewhere, all its
+ * ptes are necessarily already write-protected. In either
+ * case, we need to lock and check page_count is not raised.
+ */
+static inline int write_protect_page(struct page *page,
+ struct vm_area_struct *vma,
+ pte_t *orig_pte)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ int swapped;
+ int ret = 1;
+
+ addr = addr_in_vma(vma, page);
+ if (addr == -EFAULT)
+ goto out;
+
+ ptep = page_check_address(page, mm, addr, &ptl, 0);
+ if (!ptep)
+ goto out;
+
+ if (pte_write(*ptep)) {
+ pte_t entry;
+
+ swapped = PageSwapCache(page);
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ /*
+ * Ok this is tricky, when get_user_pages_fast() run it doesnt
+ * take any lock, therefore the check that we are going to make
+ * with the pagecount against the mapcount is racey and
+ * O_DIRECT can happen right after the check.
+ * So we clear the pte and flush the tlb before the check
+ * this assure us that no O_DIRECT can happen after the check
+ * or in the middle of the check.
+ */
+ entry = ptep_clear_flush(vma, addr, ptep);
+ /*
+ * Check that no O_DIRECT or similar I/O is in progress on the
+ * page
+ */
+ if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
+ set_pte_at_notify(mm, addr, ptep, entry);
+ goto out_unlock;
+ }
+ entry = pte_wrprotect(entry);
+ set_pte_at_notify(mm, addr, ptep, entry);
+ }
+ *orig_pte = *ptep;
+ ret = 0;
+
+out_unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ return ret;
+}
+
+/*
* try_to_merge_one_page - take two pages and merge them into one
* @mm: mm_struct that hold vma pointing into oldpage
* @vma: the vma that hold the pte pointing into oldpage
@@ -661,9 +722,7 @@ static int try_to_merge_one_page(struct mm_struct *mm,
pgprot_t newprot)
{
int ret = 1;
- int odirect_sync;
- unsigned long page_addr_in_vma;
- pte_t orig_pte, *orig_ptep;
+ pte_t orig_pte = __pte(0);
if (!PageAnon(oldpage))
goto out;
@@ -671,42 +730,21 @@ static int try_to_merge_one_page(struct mm_struct *mm,
get_page(newpage);
get_page(oldpage);
- page_addr_in_vma = addr_in_vma(vma, oldpage);
- if (page_addr_in_vma == -EFAULT)
- goto out_putpage;
-
- orig_ptep = get_pte(mm, page_addr_in_vma);
- if (!orig_ptep)
- goto out_putpage;
- orig_pte = *orig_ptep;
- pte_unmap(orig_ptep);
- if (!pte_present(orig_pte))
- goto out_putpage;
- if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
- goto out_putpage;
/*
* we need the page lock to read a stable PageSwapCache in
- * page_wrprotect().
+ * write_protect_page().
* we use trylock_page() instead of lock_page(), beacuse we dont want to
* wait here, we prefer to continue scanning and merging diffrent pages
* and to come back to this page when it is unlocked.
*/
if (!trylock_page(oldpage))
goto out_putpage;
- /*
- * page_wrprotect check if the page is swapped or in swap cache,
- * in the future we might want to run here if_present_pte and then
- * swap_free
- */
- if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
+
+ if (write_protect_page(oldpage, vma, &orig_pte)) {
unlock_page(oldpage);
goto out_putpage;
}
unlock_page(oldpage);
- if (!odirect_sync)
- goto out_putpage;
-
- orig_pte = pte_wrprotect(orig_pte);
if (pages_identical(oldpage, newpage))
ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
--
1.5.6.5
Now that ksm has its private write_protect_page(), we don't need this helper
function to do the write-protecting work for us.
Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/rmap.h | 12 ----
mm/rmap.c | 139 --------------------------------------------------
2 files changed, 0 insertions(+), 151 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8b98536..350e76d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
}
#endif
-#ifdef CONFIG_KSM
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
-#endif
-
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
@@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
return 0;
}
-#ifdef CONFIG_KSM
-static inline int page_wrprotect(struct page *page, int *odirect_sync,
- int count_offset)
-{
- return 0;
-}
-#endif
-
#endif /* CONFIG_MMU */
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 34a2029..c3ba0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);
-#ifdef CONFIG_KSM
-
-static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
- int *odirect_sync, int count_offset)
-{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
- pte_t *pte;
- spinlock_t *ptl;
- int ret = 0;
-
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
- if (pte_write(*pte)) {
- pte_t entry;
-
- flush_cache_page(vma, address, pte_pfn(*pte));
- /*
- * Ok this is tricky, when get_user_pages_fast() run it doesnt
- * take any lock, therefore the check that we are going to make
- * with the pagecount against the mapcount is racey and
- * O_DIRECT can happen right after the check.
- * So we clear the pte and flush the tlb before the check
- * this assure us that no O_DIRECT can happen after the check
- * or in the middle of the check.
- */
- entry = ptep_clear_flush(vma, address, pte);
- /*
- * Check that no O_DIRECT or similar I/O is in progress on the
- * page
- */
- if ((page_mapcount(page) + count_offset) != page_count(page)) {
- *odirect_sync = 0;
- set_pte_at_notify(mm, address, pte, entry);
- goto out_unlock;
- }
- entry = pte_wrprotect(entry);
- set_pte_at_notify(mm, address, pte, entry);
- }
- ret = 1;
-
-out_unlock:
- pte_unmap_unlock(pte, ptl);
-out:
- return ret;
-}
-
-static int page_wrprotect_file(struct page *page, int *odirect_sync,
- int count_offset)
-{
- struct address_space *mapping;
- struct prio_tree_iter iter;
- struct vm_area_struct *vma;
- pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- int ret = 0;
-
- mapping = page_mapping(page);
- if (!mapping)
- return ret;
-
- spin_lock(&mapping->i_mmap_lock);
-
- vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
- ret += page_wrprotect_one(page, vma, odirect_sync,
- count_offset);
-
- spin_unlock(&mapping->i_mmap_lock);
-
- return ret;
-}
-
-static int page_wrprotect_anon(struct page *page, int *odirect_sync,
- int count_offset)
-{
- struct vm_area_struct *vma;
- struct anon_vma *anon_vma;
- int ret = 0;
-
- anon_vma = page_lock_anon_vma(page);
- if (!anon_vma)
- return ret;
-
- /*
- * If the page is inside the swap cache, its _count number was
- * increased by one, therefore we have to increase count_offset by one.
- */
- if (PageSwapCache(page))
- count_offset++;
-
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
- ret += page_wrprotect_one(page, vma, odirect_sync,
- count_offset);
-
- page_unlock_anon_vma(anon_vma);
-
- return ret;
-}
-
-/**
- * page_wrprotect - set all ptes pointing to a page as readonly
- * @page: the page to set as readonly
- * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
- * marked as readonly beacuse page_wrprotect_one() was not able
- * to mark this ptes as readonly without opening window to a race
- * with odirect
- * @count_offset: number of times page_wrprotect() caller had called get_page()
- * on the page
- *
- * returns the number of ptes which were marked as readonly.
- * (ptes that were readonly before this function was called are counted as well)
- */
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
-{
- int ret = 0;
-
- /*
- * Page lock is needed for anon pages for the PageSwapCache check,
- * and for page_mapping for filebacked pages
- */
- BUG_ON(!PageLocked(page));
-
- *odirect_sync = 1;
- if (PageAnon(page))
- ret = page_wrprotect_anon(page, odirect_sync, count_offset);
- else
- ret = page_wrprotect_file(page, odirect_sync, count_offset);
-
- return ret;
-}
-EXPORT_SYMBOL(page_wrprotect);
-
-#endif
-
/**
* __page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
--
1.5.6.5
On Sat, 13 Jun 2009, Izik Eidus wrote:
> Hugh, so untill here we are sync,
Yes, that fits with what I have here, thanks (or where it didn't
quite fit, e.g. ' versus `, I've adjusted to what you have!). And
thanks for fixing my *orig_pte = *ptep bug, you did point that out
before, but I misunderstood at first.
>
> Question is what you want me to do now?,
> (Beacuse we are skipping 2.6.31, It is ok to you to tell me something
> like: "Shut up and let me see what i can get with this madvise" -
> that from one side.
> From another side if you want me to do anything please say.
I had to get a bit further at my end before answering on that,
but now the answer is clear: please do some testing of your RFC
madvise() version (which is what I'm just tidying up a little),
and let me know any bugfixes you find. Try with SLAB or SLUB or
SLQB debug on e.g. CONFIG_SLUB=y, CONFIG_SLUB_DEBUG=y and boot
option "slub_debug".
I'm finding, whether with your RFC or my tidyup, that kksmd
soon oopses in get_next_mmlist (or perhaps find_vma): presumably
accessing a vma or mm which already got freed (if you don't have
slab debugging on, it's liable to hang instead).
(I've also not seen it actually merging yet: if you register
or madvise a large anon area and memset it, the /dev/ksm version
would merge all its pages, but I've not seen the madvise version
do so yet - though maybe there's something stupidly wrong in my
testing, really I'm more worried about the oopses at present.)
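(For reference, the kind of trivial test I mean for the madvise flavour is
just the below - an untested sketch, and the MADV_MERGEABLE define is only
a stand-in for whatever advice name and number your RFC actually uses:)
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12		/* stand-in for the RFC's advice value */
#endif

int main(void)
{
	size_t len = 100 << 20;		/* large anonymous area */
	char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (area == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(area, len, MADV_MERGEABLE))
		perror("madvise");	/* a kernel without the patch rejects it */

	memset(area, 0x01, len);	/* identical pages: all candidates for merging */
	pause();			/* keep the mm alive while kksmd scans */
	return 0;
}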
Note that mmotm includes a patch of Nick's which adds a function
madvise_behavior_valid() - you'll need to add your MADVs into its
list to get it to work at all there.
Here's a patch I added a month or so ago, when trying to experiment
with KSM on all mms: shouldn't be necessary if your mm refcounting
is right, but might help to avoid extra weirdness when things go
wrong: exit_mmap() leaves stale vma pointers around, reckoning
that nobody can be interested by now; but maybe KSM might peep
so better to tidy them up at least while debugging...
Thanks,
Hugh
--- old/mm/mmap.c 2009-05-01 13:47:45.000000000 +0100
+++ new/mm/mmap.c 2009-05-03 11:34:47.000000000 +0100
@@ -2112,6 +2112,14 @@ void exit_mmap(struct mm_struct *mm)
tlb_finish_mmu(tlb, 0, end);
/*
+ * Make sure get_user_pages() and find_vma() etc. will find nothing:
+ * this may be necessary for KSM.
+ */
+ mm->mmap = NULL;
+ mm->mmap_cache = NULL;
+ mm->mm_rb = RB_ROOT;
+
+ /*
* Walk the list again, actually closing and freeing it,
* with preemption enabled, without holding any MM locks.
*/
Hugh Dickins wrote:
> On Sat, 13 Jun 2009, Izik Eidus wrote:
>
>> Hugh, so untill here we are sync,
>>
>
> Yes, that fits with what I have here, thanks (or where it didn't
> quite fit, e.g. ' versus `, I've adjusted to what you have!). And
> thanks for fixing my *orig_pte = *ptep bug, you did point that out
> before, but I misunderstood at first.
>
>
>> Question is what you want me to do now?,
>> (Beacuse we are skipping 2.6.31, It is ok to you to tell me something
>> like: "Shut up and let me see what i can get with this madvise" -
>> that from one side.
>> From another side if you want me to do anything please say.
>>
>
> I had to get a bit further at my end before answering on that,
> but now the answer is clear: please do some testing of your RFC
> madvise() version (which is what I'm just tidying up a little),
> and let me know any bugfixes you find. Try with SLAB or SLUB or
> SLQB debug on e.g. CONFIG_SLUB=y, CONFIG_SLUB_DEBUG=y and boot
> option "slub_debug".
>
Sure, let me check it.
(You do have Andrea's patch that fixes the "used after free slab entries"?)
> I'm finding, whether with your RFC or my tidyup, that kksmd
> soon oopses in get_next_mmlist (or perhaps find_vma): presumably
> accessing a vma or mm which already got freed (if you don't have
> slab debugging on, it's liable to hang instead).
>
> (I've also not seen it actually merging yet: if you register
> or madvise a large anon area and memset it, the /dev/ksm version
> would merge all its pages, but I've not seen the madvise version
> do so yet - though maybe there's something stupidly wrong in my
> testing, really I'm more worried about the oopses at present.)
>
> Note that mmotm includes a patch of Nick's which adds a function
> madvise_behavior_valid() - you'll need to add your MADVs into its
> list to get it to work at all there.
>
> Here's a patch I added a month or so ago, when trying to experiment
> with KSM on all mms: shouldn't be necessary if your mm refcounting
> is right, but might help to avoid extra weirdness when things go
> wrong: exit_mmap() leaves stale vma pointers around, reckoning
> that nobody can be interested by now; but maybe KSM might peep
> so better to tidy them up at least while debugging...
>
> Thanks,
> Hugh
>
> --- old/mm/mmap.c 2009-05-01 13:47:45.000000000 +0100
> +++ new/mm/mmap.c 2009-05-03 11:34:47.000000000 +0100
> @@ -2112,6 +2112,14 @@ void exit_mmap(struct mm_struct *mm)
> tlb_finish_mmu(tlb, 0, end);
>
> /*
> + * Make sure get_user_pages() and find_vma() etc. will find nothing:
> + * this may be necessary for KSM.
> + */
> + mm->mmap = NULL;
> + mm->mmap_cache = NULL;
> + mm->mm_rb = RB_ROOT;
> +
> + /*
> * Walk the list again, actually closing and freeing it,
> * with preemption enabled, without holding any MM locks.
> */
>
Izik Eidus wrote:
> Hugh Dickins wrote:
>> On Sat, 13 Jun 2009, Izik Eidus wrote:
>>
>>> Hugh, so untill here we are sync,
>>>
>>
>> Yes, that fits with what I have here, thanks (or where it didn't
>> quite fit, e.g. ' versus `, I've adjusted to what you have!). And
>> thanks for fixing my *orig_pte = *ptep bug, you did point that out
>> before, but I misunderstood at first.
>>
>>
>>> Question is what you want me to do now?,
>>> (Beacuse we are skipping 2.6.31, It is ok to you to tell me something
>>> like: "Shut up and let me see what i can get with this madvise" -
>>> that from one side.
>>> From another side if you want me to do anything please say.
>>>
>>
>> I had to get a bit further at my end before answering on that,
>> but now the answer is clear: please do some testing of your RFC
>> madvise() version (which is what I'm just tidying up a little),
>> and let me know any bugfixes you find. Try with SLAB or SLUB or
>> SLQB debug on e.g. CONFIG_SLUB=y, CONFIG_SLUB_DEBUG=y and boot
>> option "slub_debug".
>>
>
> Sure, let me check it.
> (You do have Andrea patch that fix the "used after free slab entries" ?)
How fast does it oops for you? I compiled it and ran it here on
2.6.30-rc4-mm1 with:
"Enable SLQB debugging support" and "SLQB debugging on by default", and
it runs and merges (I am using qemu processes running virtual machines to
merge the pages between them).
("SLQB debugging on by default" means I don't have to add a boot parameter,
right?)
Maybe I should try updating to a newer version of the mm tree? (The last
commit here is Jul 22.)
>
>> I'm finding, whether with your RFC or my tidyup, that kksmd
>> soon oopses in get_next_mmlist (or perhaps find_vma): presumably
>> accessing a vma or mm which already got freed (if you don't have
>> slab debugging on, it's liable to hang instead).
>>
>> (I've also not seen it actually merging yet: if you register
>> or madvise a large anon area and memset it, the /dev/ksm version
>> would merge all its pages, but I've not seen the madvise version
>> do so yet - though maybe there's something stupidly wrong in my
>> testing, really I'm more worried about the oopses at present.)
>>
>> Note that mmotm includes a patch of Nick's which adds a function
>> madvise_behavior_valid() - you'll need to add your MADVs into its
>> list to get it to work at all there.
>>
>> Here's a patch I added a month or so ago, when trying to experiment
>> with KSM on all mms: shouldn't be necessary if your mm refcounting
>> is right, but might help to avoid extra weirdness when things go
>> wrong: exit_mmap() leaves stale vma pointers around, reckoning
>> that nobody can be interested by now; but maybe KSM might peep
>> so better to tidy them up at least while debugging...
>>
>> Thanks,
>> Hugh
>>
>> --- old/mm/mmap.c 2009-05-01 13:47:45.000000000 +0100
>> +++ new/mm/mmap.c 2009-05-03 11:34:47.000000000 +0100
>> @@ -2112,6 +2112,14 @@ void exit_mmap(struct mm_struct *mm)
>> tlb_finish_mmu(tlb, 0, end);
>>
>> /*
>> + * Make sure get_user_pages() and find_vma() etc. will find
>> nothing:
>> + * this may be necessary for KSM.
>> + */
>> + mm->mmap = NULL;
>> + mm->mmap_cache = NULL;
>> + mm->mm_rb = RB_ROOT;
>> +
>> + /*
>> * Walk the list again, actually closing and freeing it,
>> * with preemption enabled, without holding any MM locks.
>> */
>>
>
>
Izik Eidus wrote:
> Izik Eidus wrote:
>> Hugh Dickins wrote:
>>> On Sat, 13 Jun 2009, Izik Eidus wrote:
>>>
>>>> Hugh, so untill here we are sync,
>>>>
>>>
>>> Yes, that fits with what I have here, thanks (or where it didn't
>>> quite fit, e.g. ' versus `, I've adjusted to what you have!). And
>>> thanks for fixing my *orig_pte = *ptep bug, you did point that out
>>> before, but I misunderstood at first.
>>>
>>>
>>>> Question is what you want me to do now?,
>>>> (Beacuse we are skipping 2.6.31, It is ok to you to tell me something
>>>> like: "Shut up and let me see what i can get with this madvise" -
>>>> that from one side.
>>>> From another side if you want me to do anything please say.
>>>>
>>>
>>> I had to get a bit further at my end before answering on that,
>>> but now the answer is clear: please do some testing of your RFC
>>> madvise() version (which is what I'm just tidying up a little),
>>> and let me know any bugfixes you find. Try with SLAB or SLUB or
>>> SLQB debug on e.g. CONFIG_SLUB=y, CONFIG_SLUB_DEBUG=y and boot
>>> option "slub_debug".
>>>
>>
>> Sure, let me check it.
>> (You do have Andrea patch that fix the "used after free slab entries" ?)
>
> How fast is it crush opps to you?, I compiled it and ran it here on
> 2.6.30-rc4-mm1 with:
> "Enable SLQB debugging support" and "SLQB debugging on by default, and
> it run and merge (i am using qemu processes to run virtual machines to
> merge the pages between them)
>
> ("SLQB debugging on by defaul" mean i dont have to add boot pareameter
> right?)
>
> Maybe i should try update into newer version of the mm tree? (last
> commit here is Jul 22)
OK, bug on my side - I just got that oops. I will try to fix it and send a patch.
(Sorry for the noise.)
>
>>
>>> I'm finding, whether with your RFC or my tidyup, that kksmd
>>> soon oopses in get_next_mmlist (or perhaps find_vma): presumably
>>> accessing a vma or mm which already got freed (if you don't have
>>> slab debugging on, it's liable to hang instead).
>>>
>>> (I've also not seen it actually merging yet: if you register
>>> or madvise a large anon area and memset it, the /dev/ksm version
>>> would merge all its pages, but I've not seen the madvise version
>>> do so yet - though maybe there's something stupidly wrong in my
>>> testing, really I'm more worried about the oopses at present.)
>>>
>>> Note that mmotm includes a patch of Nick's which adds a function
>>> madvise_behavior_valid() - you'll need to add your MADVs into its
>>> list to get it to work at all there.
>>>
>>> Here's a patch I added a month or so ago, when trying to experiment
>>> with KSM on all mms: shouldn't be necessary if your mm refcounting
>>> is right, but might help to avoid extra weirdness when things go
>>> wrong: exit_mmap() leaves stale vma pointers around, reckoning
>>> that nobody can be interested by now; but maybe KSM might peep
>>> so better to tidy them up at least while debugging...
>>>
>>> Thanks,
>>> Hugh
>>>
>>> --- old/mm/mmap.c 2009-05-01 13:47:45.000000000 +0100
>>> +++ new/mm/mmap.c 2009-05-03 11:34:47.000000000 +0100
>>> @@ -2112,6 +2112,14 @@ void exit_mmap(struct mm_struct *mm)
>>> tlb_finish_mmu(tlb, 0, end);
>>>
>>> /*
>>> + * Make sure get_user_pages() and find_vma() etc. will find
>>> nothing:
>>> + * this may be necessary for KSM.
>>> + */
>>> + mm->mmap = NULL;
>>> + mm->mmap_cache = NULL;
>>> + mm->mm_rb = RB_ROOT;
>>> +
>>> + /*
>>> * Walk the list again, actually closing and freeing it,
>>> * with preemption enabled, without holding any MM locks.
>>> */
>>>
>>
>>
>
>
On Mon, 15 Jun 2009 03:05:14 +0300
Izik Eidus <[email protected]> wrote:
> Izik Eidus wrote:
> > Izik Eidus wrote:
> >> Hugh Dickins wrote:
> >>> On Sat, 13 Jun 2009, Izik Eidus wrote:
> >>>
> >>>> Hugh, so untill here we are sync,
> >>>>
> >>>
> >>> Yes, that fits with what I have here, thanks (or where it didn't
> >>> quite fit, e.g. ' versus `, I've adjusted to what you have!). And
> >>> thanks for fixing my *orig_pte = *ptep bug, you did point that out
> >>> before, but I misunderstood at first.
> >>>
> >>>
> >>>> Question is what you want me to do now?,
> >>>> (Beacuse we are skipping 2.6.31, It is ok to you to tell me
> >>>> something like: "Shut up and let me see what i can get with this
> >>>> madvise" - that from one side.
> >>>> From another side if you want me to do anything please say.
> >>>>
> >>>
> >>> I had to get a bit further at my end before answering on that,
> >>> but now the answer is clear: please do some testing of your RFC
> >>> madvise() version (which is what I'm just tidying up a little),
> >>> and let me know any bugfixes you find. Try with SLAB or SLUB or
> >>> SLQB debug on e.g. CONFIG_SLUB=y, CONFIG_SLUB_DEBUG=y and boot
> >>> option "slub_debug".
> >>>
> >>
> >> Sure, let me check it.
> >> (You do have Andrea patch that fix the "used after free slab
> >> entries" ?)
> >
> > How fast is it crush opps to you?, I compiled it and ran it here on
> > 2.6.30-rc4-mm1 with:
> > "Enable SLQB debugging support" and "SLQB debugging on by default,
> > and it run and merge (i am using qemu processes to run virtual
> > machines to merge the pages between them)
> >
> > ("SLQB debugging on by defaul" mean i dont have to add boot
> > pareameter right?)
> >
> > Maybe i should try update into newer version of the mm tree? (last
> > commit here is Jul 22)
>
> OK, bug on my side, just got that oppss, will try to fix and send
> patch.
>
> (Sorry for the noise)
>
> >
> >>
> >>> I'm finding, whether with your RFC or my tidyup, that kksmd
> >>> soon oopses in get_next_mmlist (or perhaps find_vma): presumably
> >>> accessing a vma or mm which already got freed (if you don't have
> >>> slab debugging on, it's liable to hang instead).
> >>>
> >>> (I've also not seen it actually merging yet: if you register
> >>> or madvise a large anon area and memset it, the /dev/ksm version
> >>> would merge all its pages, but I've not seen the madvise version
> >>> do so yet - though maybe there's something stupidly wrong in my
> >>> testing, really I'm more worried about the oopses at present.)
> >>>
> >>> Note that mmotm includes a patch of Nick's which adds a function
> >>> madvise_behavior_valid() - you'll need to add your MADVs into its
> >>> list to get it to work at all there.
> >>>
> >>> Here's a patch I added a month or so ago, when trying to
> >>> experiment with KSM on all mms: shouldn't be necessary if your mm
> >>> refcounting is right, but might help to avoid extra weirdness
> >>> when things go wrong: exit_mmap() leaves stale vma pointers
> >>> around, reckoning that nobody can be interested by now; but maybe
> >>> KSM might peep so better to tidy them up at least while
> >>> debugging...
> >>>
> >>> Thanks,
> >>> Hugh
> >>>
> >>> --- old/mm/mmap.c 2009-05-01 13:47:45.000000000 +0100
> >>> +++ new/mm/mmap.c 2009-05-03 11:34:47.000000000 +0100
> >>> @@ -2112,6 +2112,14 @@ void exit_mmap(struct mm_struct *mm)
> >>> tlb_finish_mmu(tlb, 0, end);
> >>>
> >>> /*
> >>> + * Make sure get_user_pages() and find_vma() etc. will find
> >>> nothing:
> >>> + * this may be necessary for KSM.
> >>> + */
> >>> + mm->mmap = NULL;
> >>> + mm->mmap_cache = NULL;
> >>> + mm->mm_rb = RB_ROOT;
> >>> +
> >>> + /*
> >>> * Walk the list again, actually closing and freeing it,
> >>> * with preemption enabled, without holding any MM locks.
> >>> */
> >>>
> >>
> >>
> >
> >
>
>
OK, below is an ugly fix for the oops.
From 3be1ad5a9f990113e8849fa1e74c4e74066af131 Mon Sep 17 00:00:00 2001
From: Izik Eidus <[email protected]>
Date: Mon, 15 Jun 2009 03:52:05 +0300
Subject: [PATCH] ksm: madvise-rfc: really ugly fix for the oops bug.
This patch is just so it can run without crashing with the madvise RFC patch.
The true fix for this, I think, is adding another list for ksm inside the mm struct.
In the meanwhile I will try to think of another way to fix this bug.
Hugh, I hope at least now you will be able to run it without it crashing on
you.
Signed-off-by: Izik Eidus <[email protected]>
---
kernel/fork.c | 11 ++++++-----
1 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index e5ef58c..771b89a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -484,17 +484,18 @@ void mmput(struct mm_struct *mm)
{
might_sleep();
+ spin_lock(&mmlist_lock);
if (atomic_dec_and_test(&mm->mm_users)) {
+ if (!list_empty(&mm->mmlist))
+ list_del(&mm->mmlist);
+ spin_unlock(&mmlist_lock);
exit_aio(mm);
exit_mmap(mm);
set_mm_exe_file(mm, NULL);
- if (!list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- list_del(&mm->mmlist);
- spin_unlock(&mmlist_lock);
- }
put_swap_token(mm);
mmdrop(mm);
+ } else {
+ spin_unlock(&mmlist_lock);
}
}
EXPORT_SYMBOL_GPL(mmput);
--
1.5.6.5
On Mon, 15 Jun 2009, Izik Eidus wrote:
> On Mon, 15 Jun 2009 03:05:14 +0300 Izik Eidus <[email protected]> wrote:
> > >>
> > >> Sure, let me check it.
> > >> (You do have Andrea patch that fix the "used after free slab
> > >> entries" ?)
I do, yes, and the other fix he posted at the same time.
> > >
> > > How fast is it crush opps to you?, I compiled it and ran it here on
> > > 2.6.30-rc4-mm1 with:
> > > "Enable SLQB debugging support" and "SLQB debugging on by default,
> > > and it run and merge (i am using qemu processes to run virtual
> > > machines to merge the pages between them)
I suspect too much of your testing has been on the sensible use of KSM,
not enough on foolish abuse of it!
> > >
> > > ("SLQB debugging on by defaul" mean i dont have to add boot
> > > pareameter right?)
That's right. (I never switch it on by default in the .config,
just love that we can switch it on by boot option when needed.)
> > OK, bug on my side, just got that oppss, will try to fix and send
> > patch.
Thanks a lot for getting there so quickly.
> Ok, below is ugly fix for the opss..
Right, for several reasons* not a patch we'd like to include in the
end, but enough to show that it indeed fixes the problems I was seeing:
well caught.
[* We don't want to serialize everyone on a global lock there. You
could make it a little better using atomic_dec_and_lock(), but still
unpleasant. And that placing of the list_del() is bad for swapping:
I ended up using list_del_init() there, and restoring the check after
calling set_mm_exe_file(). But never mind, I've a better fix.]
>
> From 3be1ad5a9f990113e8849fa1e74c4e74066af131 Mon Sep 17 00:00:00 2001
> From: Izik Eidus <[email protected]>
> Date: Mon, 15 Jun 2009 03:52:05 +0300
> Subject: [PATCH] ksm: madvise-rfc: really ugly fix for the oppss bug.
>
> This patch is just so it can run without to crush with the madvise rfc patch.
>
> True fix for this i think is adding another list for ksm inside the mm struct.
> In the meanwhile i will try to think about other way how to fix this bug.
I'm already working with a separate list for KSM inside the mm struct
(prefer not to be mixing these two things up), but I don't see how that
helps.
>
> Hugh, i hope at least now you will be able to run it without it will crush to
> you.
Yes, thanks, and helped me to think of a better fix (I hope!) below...
>
> Signed-off-by: Izik Eidus <[email protected]>
> ---
> kernel/fork.c | 11 ++++++-----
> 1 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e5ef58c..771b89a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -484,17 +484,18 @@ void mmput(struct mm_struct *mm)
> {
> might_sleep();
>
> + spin_lock(&mmlist_lock);
> if (atomic_dec_and_test(&mm->mm_users)) {
> + if (!list_empty(&mm->mmlist))
> + list_del(&mm->mmlist);
> + spin_unlock(&mmlist_lock);
> exit_aio(mm);
> exit_mmap(mm);
> set_mm_exe_file(mm, NULL);
> - if (!list_empty(&mm->mmlist)) {
> - spin_lock(&mmlist_lock);
> - list_del(&mm->mmlist);
> - spin_unlock(&mmlist_lock);
> - }
> put_swap_token(mm);
> mmdrop(mm);
> + } else {
> + spin_unlock(&mmlist_lock);
> }
> }
> EXPORT_SYMBOL_GPL(mmput);
I believe you can restore mmput() to how it was: patch below appears
to fix it from the get_next_mmlist() end, using atomic_inc_not_zero() -
swapoff's try_to_unuse() has to use that on mm_users too. But please
check and test it carefully, I'd usually want to test a lot more myself
before sending out.
And, hey, this very same patch appears to fix the other issue I mentioned,
that I wasn't yet seeing any merging (within a single mm): I think your
mm_users 1 test was badly placed before (whether or not we already have
an mm_slot makes a big difference to the meaning of mm_users 1).
Signed-off-by: Hugh Dickins <[email protected]>
--- ksm_madvise/mm/ksm.c 2009-06-14 11:15:33.000000000 +0100
+++ linux/mm/ksm.c 2009-06-15 12:49:12.000000000 +0100
@@ -1137,7 +1137,6 @@ static void insert_to_mm_slots_hash(stru
bucket = &mm_slots_hash[((unsigned long)mm / sizeof (struct mm_struct))
% nmm_slots_hash];
mm_slot->mm = mm;
- atomic_inc(&mm_slot->mm->mm_users);
INIT_LIST_HEAD(&mm_slot->rmap_list);
hlist_add_head(&mm_slot->link, bucket);
}
@@ -1253,21 +1252,21 @@ static struct mm_slot *get_next_mmlist(s
if (test_bit(MMF_VM_MERGEABLE, &mm->flags) ||
test_bit(MMF_VM_MERGEALL, &mm->flags)) {
mm_slot = get_mm_slot(mm);
- if (unlikely(atomic_read(&mm->mm_users) == 1)) {
- if (mm_slot)
+ if (mm_slot) {
+ if (atomic_read(&mm->mm_users) == 1) {
mm_slot->touched = 0;
- } else {
- if (!mm_slot) {
- insert_to_mm_slots_hash(mm,
- pre_alloc_mm_slot);
- *used_pre_alloc = 1;
- mm_slot = pre_alloc_mm_slot;
+ goto next;
}
- mm_slot->touched = 1;
- return mm_slot;
- }
+ } else if (atomic_inc_not_zero(&mm->mm_users)) {
+ insert_to_mm_slots_hash(mm, pre_alloc_mm_slot);
+ *used_pre_alloc = 1;
+ mm_slot = pre_alloc_mm_slot;
+ } else
+ goto next;
+ mm_slot->touched = 1;
+ return mm_slot;
}
-
+next:
cur = cur->next;
}
return NULL;
Hugh Dickins wrote:
> I suspect too much of your testing has been on the sensible use of KSM,
> not enough on foolish abuse of it!
>
Most of the testing was on the ioctl version of ksm; I thought this RFC
would end up with a "list of requested changes" that would force me to
rewrite it.
(I did mention that I am not sure everything is safe, and it was mostly
just to give an idea of the possible design and its problems.)
>
> Right, for several reasons* not a patch we'd like to include in the
> end, but enough to show that indeed fixes the problems I was seeing:
> well caught.
>
Yeah, that is why I said ugly :).
>
>
> I'm already working with a separate list for KSM inside the mm struct
> (prefer not to be mixing these two things up), but I don't see how that
> helps.
>
So you don't have mmlist at all? Good - I think I found more problems
with the way the RFC uses the mmlist.
Do you mind sending me the patch that moves to a different mm list, or do you
still want to wait?
>
>> Hugh, i hope at least now you will be able to run it without it will crush to
>> you.
>>
>
> Yes, thanks, and helped me to think of a better fix (I hope!) below...
>
>
>> Signed-off-by: Izik Eidus <[email protected]>
>> ---
>> kernel/fork.c | 11 ++++++-----
>> 1 files changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e5ef58c..771b89a 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -484,17 +484,18 @@ void mmput(struct mm_struct *mm)
>> {
>> might_sleep();
>>
>> + spin_lock(&mmlist_lock);
>> if (atomic_dec_and_test(&mm->mm_users)) {
>> + if (!list_empty(&mm->mmlist))
>> + list_del(&mm->mmlist);
>> + spin_unlock(&mmlist_lock);
>> exit_aio(mm);
>> exit_mmap(mm);
>> set_mm_exe_file(mm, NULL);
>> - if (!list_empty(&mm->mmlist)) {
>> - spin_lock(&mmlist_lock);
>> - list_del(&mm->mmlist);
>> - spin_unlock(&mmlist_lock);
>> - }
>> put_swap_token(mm);
>> mmdrop(mm);
>> + } else {
>> + spin_unlock(&mmlist_lock);
>> }
>> }
>> EXPORT_SYMBOL_GPL(mmput);
>>
>
> I believe you can restore mmput() to how it was: patch below appears
> to fix it from the get_next_mmlist() end, using atomic_inc_not_zero() -
> swapoff's try_to_unuse() has to use that on mm_users too. But please
> check and test it carefully, I'd usually want to test a lot more myself
> before sending out.
>
> And, hey, this very same patch appears to fix the other issue I mentioned,
> that I wasn't yet seeing any merging (within a single mm): I think your
> mm_users 1 test was badly placed before (whether or not we already have
> an mm_slot makes a big difference to the meaning of mm_users 1).
>
> Signed-off-by: Hugh Dickins <[email protected]>
>
> --- ksm_madvise/mm/ksm.c 2009-06-14 11:15:33.000000000 +0100
> +++ linux/mm/ksm.c 2009-06-15 12:49:12.000000000 +0100
> @@ -1137,7 +1137,6 @@ static void insert_to_mm_slots_hash(stru
> bucket = &mm_slots_hash[((unsigned long)mm / sizeof (struct mm_struct))
> % nmm_slots_hash];
> mm_slot->mm = mm;
> - atomic_inc(&mm_slot->mm->mm_users);
> INIT_LIST_HEAD(&mm_slot->rmap_list);
> hlist_add_head(&mm_slot->link, bucket);
> }
> @@ -1253,21 +1252,21 @@ static struct mm_slot *get_next_mmlist(s
> if (test_bit(MMF_VM_MERGEABLE, &mm->flags) ||
> test_bit(MMF_VM_MERGEALL, &mm->flags)) {
> mm_slot = get_mm_slot(mm);
> - if (unlikely(atomic_read(&mm->mm_users) == 1)) {
> - if (mm_slot)
> + if (mm_slot) {
> + if (atomic_read(&mm->mm_users) == 1) {
> mm_slot->touched = 0;
> - } else {
> - if (!mm_slot) {
> - insert_to_mm_slots_hash(mm,
> - pre_alloc_mm_slot);
> - *used_pre_alloc = 1;
> - mm_slot = pre_alloc_mm_slot;
> + goto next;
> }
> - mm_slot->touched = 1;
> - return mm_slot;
> - }
> + } else if (atomic_inc_not_zero(&mm->mm_users)) {
> + insert_to_mm_slots_hash(mm, pre_alloc_mm_slot);
> + *used_pre_alloc = 1;
> + mm_slot = pre_alloc_mm_slot;
> + } else
> + goto next;
> + mm_slot->touched = 1;
> + return mm_slot;
> }
> -
> +next:
> cur = cur->next;
> }
> return NULL;
>
Yeah, that trick with the inc_not_zero is much better; this patch does
make sense as a bug fix.
I really would like to see it running with your new list for ksm inside
the mm struct, because I think the mmlist usage has other problems.
So if you can send me a patch I will start stress testing; on the other
hand, I don't mind waiting.
Anyway, thanks.
On Mon, 15 Jun 2009, Izik Eidus wrote:
>
> So you dont have mmlist at all?, Good i think i found another problems with
> the usage the RFC made with the mmlist,
> Do you mind to send me the patch that move into diffrent mm list? or you still
> want to wait?
I'd really prefer to wait; there are too many questions I should
resolve one way or another now that I can test again, e.g. do we need
to put children onto the list when forking or not?
But don't let me hold you up: if you believe you've problems sharing
the mmlist, go ahead and separate out - I called the KSM one ksm_mmlist
(surprise!), hanging off init_mm.ksm_mmlist in the same way mmlist does.
And then you can get rid of the MMF flags etc (I've set aside your 4/4
as not something to get into right now - probably the release after).
At present I'm still using mmlist_lock, but that's probably a mistake:
we don't really want to have to consider unrelated lock ordering
constraints; a ksm_mmlist_lock would be better for now.
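(Just so we're picturing the same shape, an untested sketch - not the
actual patch, and the ksm_add_mm() helper here is purely illustrative:)
/* In struct mm_struct, alongside mmlist: */
	struct list_head ksm_mmlist;	/* chained off init_mm.ksm_mmlist */

/* In mm/ksm.c, with its own lock rather than mmlist_lock: */
static DEFINE_SPINLOCK(ksm_mmlist_lock);

/* Called when madvise registers an area; assumes mm_init() has done
 * INIT_LIST_HEAD(&mm->ksm_mmlist), so an mm is only added once. */
static void ksm_add_mm(struct mm_struct *mm)
{
	spin_lock(&ksm_mmlist_lock);
	if (list_empty(&mm->ksm_mmlist))
		list_add_tail(&mm->ksm_mmlist, &init_mm.ksm_mmlist);
	spin_unlock(&ksm_mmlist_lock);
}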
None of which precludes someone coming along later and combining
ksm_mmlist with mmlist in order to save space: but for now I
think we do better to keep them separate, and it sounds like
you've seen a strong reason for that.
Hugh