This patch set improves system availability by avoiding a system
panic in some corner cases during page collapsing, i.e. when a chunk
of normal pages is replaced by a THP (Transparent Huge Page).
Hidetoshi Seto (4):
Lock the new THP when collapsing pages
Free the collapsed pages after the new THP is mapped
Check whether pages are poisoned before copying
Check whether the new THP is poisoned before it is mapped to APL
mm/huge_memory.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 65 insertions(+), 10 deletions(-)
When collapsing pages, if the new THP is poisoned before it is
mapped to the APL (the application), memory_failure() can do nothing
except set the poison flag on the new THP, because the page is
neither locked nor mapped yet.
So lock the new THP to make sure that memory_failure() runs only
after the poisoned new THP has been mapped to the APL successfully.
This keeps the impact of a poisoned THP as small as possible: the
APL is killed instead of the whole system panicking.
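In short, the ordering the patch establishes is (a condensed sketch
of the hunks below, not the literal patch):

	anon_vma_unlock(vma->anon_vma);

	/*
	 * Block memory_failure() until the new THP is mapped to the
	 * process; until then it could only set the poison flag.
	 */
	lock_page_nosync(new_page);

	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
	/* ... set up the huge pmd and map new_page to the APL ... */
	khugepaged_pages_collapsed++;
	/* Now memory_failure() can find the mapping and kill the APL. */
	unlock_page(new_page);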
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Jin Dongming <[email protected]>
---
mm/huge_memory.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 113e35c..e02f687 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1869,6 +1869,12 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock(vma->anon_vma);
+ /*
+ * Lock the new THP to block memory_failure until it is
+ * mapped to the process.
+ */
+ lock_page_nosync(new_page);
+
__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
pte_unmap(pte);
__SetPageUptodate(new_page);
@@ -1900,6 +1906,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*hpage = NULL;
#endif
khugepaged_pages_collapsed++;
+ unlock_page(new_page);
out_up_write:
up_write(&mm->mmap_sem);
return;
--
1.7.1
Free the old pages all at once after they have been replaced by the
new THP, instead of freeing them one by one during the copy.
This change prepares for later patches, which allow page collapsing
to be cancelled.
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Jin Dongming <[email protected]>
---
mm/huge_memory.c | 31 +++++++++++++++++++++++++------
1 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e02f687..c62176a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1704,11 +1704,10 @@ out:
static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
struct vm_area_struct *vma,
- unsigned long address,
- spinlock_t *ptl)
+ unsigned long address)
{
pte_t *_pte;
- for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+ for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
@@ -1720,7 +1719,28 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
copy_user_highpage(page, src_page, address, vma);
VM_BUG_ON(page_mapcount(src_page) != 1);
VM_BUG_ON(page_count(src_page) != 2);
+ }
+
+ address += PAGE_SIZE;
+ page++;
+ }
+}
+
+static void __collapse_huge_page_free_old_pte(pte_t *pte,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ spinlock_t *ptl)
+{
+ pte_t *_pte;
+
+ for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++) {
+ pte_t pteval = *_pte;
+ struct page *src_page;
+
+ if (!pte_none(pteval)) {
+ src_page = pte_page(pteval);
release_pte_page(src_page);
+
/*
* ptl mostly unnecessary, but preempt has to
* be disabled to update the per-cpu stats
@@ -1736,9 +1756,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
}
-
address += PAGE_SIZE;
- page++;
}
}
@@ -1875,7 +1893,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
lock_page_nosync(new_page);
- __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+ __collapse_huge_page_copy(pte, new_page, vma, address);
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -1905,6 +1923,7 @@ static void collapse_huge_page(struct mm_struct *mm,
#ifndef CONFIG_NUMA
*hpage = NULL;
#endif
+ __collapse_huge_page_free_old_pte(pte, vma, address, ptl);
khugepaged_pages_collapsed++;
unlock_page(new_page);
out_up_write:
--
1.7.1
Whether it is one of the pages being collapsed or the new THP itself,
if a poisoned page is accessed during the page copy, an MCE will be
raised and the system will panic.
So to avoid the above problem, add poison checks for both the 4K pages
and the THP before copying in __collapse_huge_page_copy().
If a poisoned page is found, cancel the page collapse: the poisoned
4K page stays owned by the APL, or the poisoned THP is freed before
it is ever used.
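Condensed, the copy loop in __collapse_huge_page_copy() becomes the
following (a sketch of the change below; the function now returns 0
to cancel the collapse and 1 on success):

	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++) {
		if (PageHWPoison(page))
			return 0;	/* new THP poisoned: free it before use */
		...
		if (PageHWPoison(src_page))
			return 0;	/* 4K page poisoned: leave it with the APL */
		copy_user_highpage(page, src_page, address, vma);
		...
	}
	return 1;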
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Jin Dongming <[email protected]>
---
mm/huge_memory.c | 27 +++++++++++++++++++++++----
1 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c62176a..6345279 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1702,20 +1702,26 @@ out:
return isolated;
}
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
+static int __collapse_huge_page_copy(pte_t *pte, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
{
pte_t *_pte;
for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
+ if (PageHWPoison(page))
+ return 0;
+
if (pte_none(pteval)) {
clear_user_highpage(page, address);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
} else {
src_page = pte_page(pteval);
+ if (PageHWPoison(src_page))
+ return 0;
+
copy_user_highpage(page, src_page, address, vma);
VM_BUG_ON(page_mapcount(src_page) != 1);
VM_BUG_ON(page_count(src_page) != 2);
@@ -1724,6 +1730,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
address += PAGE_SIZE;
page++;
}
+
+ return 1;
}
static void __collapse_huge_page_free_old_pte(pte_t *pte,
@@ -1893,7 +1901,9 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
lock_page_nosync(new_page);
- __collapse_huge_page_copy(pte, new_page, vma, address);
+ if (__collapse_huge_page_copy(pte, new_page, vma, address) == 0)
+ goto out_poison;
+
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -1930,6 +1940,15 @@ out_up_write:
up_write(&mm->mmap_sem);
return;
+out_poison:
+ release_all_pte_pages(pte);
+ pte_unmap(pte);
+ spin_lock(&mm->page_table_lock);
+ BUG_ON(!pmd_none(*pmd));
+ set_pmd_at(mm, address, pmd, _pmd);
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(new_page);
+
out:
mem_cgroup_uncharge_page(new_page);
#ifdef CONFIG_NUMA
--
1.7.1
If the new THP is poisoned after the 4K pages have been copied into
it and it has been mapped to the APL, the APL will be killed by the
kernel with a SIGBUS signal. That is no doubt the right behavior,
but we can still do our best to minimize the impact of the poisoned
THP.
So add a final poison check for the new THP before it is mapped
to the APL. If the check finds poison, fall back to the 4K pages and
trash the THP.
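Condensed, the recovery path looks like this (a sketch of the change
below):

	old_pmd = _pmd;	/* saved right after pmdp_clear_flush_notify() */
	/* ... copy the 4K pages into new_page ... */
	for (p = new_page; p < new_page + HPAGE_PMD_NR; p++) {
		if (PageHWPoison(p)) {
			_pmd = old_pmd;	/* out_poison restores the 4K mapping */
			goto out_poison;	/* trash the THP */
		}
	}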
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Jin Dongming <[email protected]>
---
mm/huge_memory.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6345279..9aed3a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1776,13 +1776,14 @@ static void collapse_huge_page(struct mm_struct *mm,
{
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd, _pmd;
+ pmd_t *pmd, _pmd, old_pmd;
pte_t *pte;
pgtable_t pgtable;
struct page *new_page;
spinlock_t *ptl;
int isolated;
unsigned long hstart, hend;
+ struct page *p;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
#ifndef CONFIG_NUMA
@@ -1873,6 +1874,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* to avoid the risk of CPU bugs in that area.
*/
_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+ old_pmd = _pmd;
spin_unlock(&mm->page_table_lock);
spin_lock(ptl);
@@ -1904,7 +1906,6 @@ static void collapse_huge_page(struct mm_struct *mm,
if (__collapse_huge_page_copy(pte, new_page, vma, address) == 0)
goto out_poison;
- pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
VM_BUG_ON(page_count(pgtable) != 1);
@@ -1921,6 +1922,15 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
smp_wmb();
+ for (p = new_page; p < new_page + HPAGE_PMD_NR; p++) {
+ if (PageHWPoison(p)) {
+ _pmd = old_pmd;
+ goto out_poison;
+ }
+ }
+
+ pte_unmap(pte);
+
spin_lock(&mm->page_table_lock);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address);
--
1.7.1
On Thu, Mar 17, 2011 at 11:32:55AM +0900, Hidetoshi Seto wrote:
> So to avoid the above problem, add poison checks for both the 4K pages
> and the THP before copying in __collapse_huge_page_copy().
I don't think you need the check for the huge page -- it should not be allocated
in this case.
-Andi
(2011/03/17 13:14), Andi Kleen wrote:
> On Thu, Mar 17, 2011 at 11:32:55AM +0900, Hidetoshi Seto wrote:
>> So to avoid the above problem, add poison checks for both the 4K pages
>> and the THP before copying in __collapse_huge_page_copy().
>
> I don't think you need the check for the huge page -- it should not be allocated
> in this case.
Isn't there a case where the THP is already allocated but the poison
flag is set before the copy? Though I know it is a corner case,
having this check for the THP is better than nothing.
Thanks,
H.Seto
On Thu, Mar 17, 2011 at 02:20:42PM +0900, Hidetoshi Seto wrote:
> (2011/03/17 13:14), Andi Kleen wrote:
> > On Thu, Mar 17, 2011 at 11:32:55AM +0900, Hidetoshi Seto wrote:
> >> So to avoid the above problem, add poison checks for both the 4K pages
> >> and the THP before copying in __collapse_huge_page_copy().
> >
> > I don't think you need the check for the huge page -- it should not be allocated
> > in this case.
>
> Isn't there a case where the THP is already allocated but the poison
> flag is set before the copy? Though I know it is a corner case,
> having this check for the THP is better than nothing.
There's a theoretical window, but it's only a few instructions.
Probably not worth caring about, especially since you have a similar
window after your check again.
BTW, normally I don't recommend adding new poison checks without
first doing studies with page-types to see whether the case is likely
and affects enough memory.
-Andi
--
[email protected] -- Speaking for myself only.
(2011/03/17 15:26), Andi Kleen wrote:
> On Thu, Mar 17, 2011 at 02:20:42PM +0900, Hidetoshi Seto wrote:
>> (2011/03/17 13:14), Andi Kleen wrote:
>>> On Thu, Mar 17, 2011 at 11:32:55AM +0900, Hidetoshi Seto wrote:
>>>> So to avoid the above problem, add poison checks for both the 4K pages
>>>> and the THP before copying in __collapse_huge_page_copy().
>>>
>>> I don't think you need the check for the huge page -- it should not be allocated
>>> in this case.
>>
>> Isn't there a case where the THP is already allocated but the poison
>> flag is set before the copy? Though I know it is a corner case,
>> having this check for the THP is better than nothing.
>
> There's a theoretical window, but it's only a few instructions.
> Probably not worth caring about, especially since you have a similar
> window after your check again.
At least the copy to the last page of the huge page is performed
only after all the preceding copies have finished, so I'm not sure
the window is really just "a few" instructions.
Still, I think making the window smaller than it is now is worthwhile,
whether the change is from 0.1% to 0.01% or from 0.01% to 0.001%.
Or did you find a downside to the check here?
Thanks,
H.Seto
Hello Hidetoshi,
On Thu, Mar 17, 2011 at 04:43:03PM +0900, Hidetoshi Seto wrote:
> Still, I think making the window smaller than it is now is worthwhile,
> whether the change is from 0.1% to 0.01% or from 0.01% to 0.001%.
>
> Or did you find a downside to the check here?
Well, it makes the code a little more complicated for something
that looks impossible to trigger in the first place.
The slowdown from these changes is probably not significant, because
the 2M copy should dominate the collapse_huge_page cost, but it still
adds locked ops and loops that weren't there before, so technically
it is a micro-slowdown.
NOTE: if this closed the race window 100% I would not disagree with
it. If there's still a 0.001% chance of hitting the race, as Andi
hints, then I don't find it very attractive. I think memory failure
isn't 100% correct, and it's probably impossible to make it 100%
correct across the whole kernel (for example, compound_head is safe
for THP but it's still unsafe for hugetlbfs while the page is being
torn down). So it's probably ok that it tends to work 100% reliably
in practice when the task is running in userland but leaves holes
when the kernel is mangling the page structures and moving stuff
around; otherwise we litter the kernel code without much practical
benefit. I guess KSM has the same issues as khugepaged, for example.
So again, I'm not against making these changes, but I don't find them
very attractive, and I'm unsure whether we should go down this route
when the objective of the patch is only to reduce the race window
(which is incredibly small and not reproducible to start with; it's a
theoretical race that hardly anybody could trigger) instead of
actually closing the race completely. But thanks a lot for your
effort; I see your point, I'm just not sure it's worth it. I'll let
others comment on this...
> At least the copy to the last page of the huge page is performed
> only after all the preceding copies have finished, so I'm not sure
> the window is really just "a few" instructions.
> Still, I think making the window smaller than it is now is worthwhile,
> whether the change is from 0.1% to 0.01% or from 0.01% to 0.001%.
Note that hwpoison will never reach 100% coverage. That's impossible.
But to get nearer to 100% it's better to concentrate on the paths
that affect long time windows and significant amounts of memory.
Which those are is often non-obvious and needs measurement.
>
> Or did you find a downside to the check here?
The usual problem is how to test it. That tends to be harder
than just writing the code. If it's not tested it's probably
not worth having.
-Andi
--
[email protected] -- Speaking for myself only.
> isn't 100% correct, and it's probably impossible to make it 100%
> correct across the whole kernel (for example, compound_head is safe
> for THP but it's still unsafe for hugetlbfs while the page is being
> torn down). So it's probably ok that it tends to work 100% reliably
I would like to fix known oopses in the existing paths, so that should
probably be fixed.
> in practice when the task is running in userland but leaves holes
> when the kernel is mangling the page structures and moving stuff
> around; otherwise we litter the kernel code without much practical
> benefit. I guess KSM has the same issues as khugepaged, for example.
We measured KSM some time ago on some simple workloads (a couple
of Windows guests) and it turned out that KSM memory tends to be
only a very small fraction of total physical memory. So it was
deemed not very important for hwpoison.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Mar 17, 2011 at 04:25:59PM +0100, Andi Kleen wrote:
> > isn't 100% correct, and it's probably impossible to make it 100%
> > correct across the whole kernel (for example, compound_head is safe
> > for THP but it's still unsafe for hugetlbfs while the page is being
> > torn down). So it's probably ok that it tends to work 100% reliably
>
> I would like to fix known oopses in the existing paths, so that should
> probably be fixed.
I agree with that. And still an oops is better than silent memory
corruption.
> We measured KSM some time ago on some simple workloads (a couple
> of Windows guests) and it turned out that KSM memory tends to be
> only a very small fraction of total physical memory. So it was
> deemed not very important for hwpoison.
So it's your choice, I'm fine either way...
What I can tell is that with the default khugepaged scan rate,
collapse_huge_page will have a much smaller impact than KSM. It could
have more impact than KSM if you increase the khugepaged load to 100%
with sysfs (because khugepaged covers more memory than just the
shared portion covered by KSM). Then the window gets much bigger, but
still minor; if you can't trigger it with the testsuite, it's even
less likely to ever happen in practice.
Did you try the testsuite with khugepaged at 100% load? I think that's
a good indication of whether this window has any practical significance.
But note that 100% khugepaged is unrealistic: given how fast
khugepaged is, even a 10% CPU background scan load would be too
extreme even for huge amounts of memory.
So it's mostly up to you..
I think it needs more comments to explain why the extra loops are
there (only the lock_page has a comment); otherwise I guess over time
they'll get optimized away again by people reading the code and not
checking the ancient history in the git log. Best would also be to
make it conditional on CONFIG_MEMORY_FAILURE=y, but doing that for the
loops is a mess; at least the lock_page is doable (not that it matters
much, but it almost serves as a comment..).
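For example, the lock added in patch 1 could be guarded roughly like
this (a sketch of the suggestion, not part of the posted series; the
matching unlock_page() would need the same guard):

#ifdef CONFIG_MEMORY_FAILURE
	/*
	 * Block memory_failure() until the new THP is mapped to the
	 * process. Without memory failure support there is nothing
	 * to block, so compile the lock out.
	 */
	lock_page_nosync(new_page);
#endif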
> What I can tell is that with the default khugepaged scan rate,
> collapse_huge_page will have a much smaller impact than KSM. It could
> have more impact than KSM if you increase the khugepaged load to 100%
> with sysfs (because khugepaged covers more memory than just the
> shared portion covered by KSM). Then the window gets much bigger, but
> still minor; if you can't trigger it with the testsuite, it's even
> less likely to ever happen in practice.
You mean randomly injecting errors?
That tends to be hard and unreliable -- usually we try to have a
specific tester that is not random.
> Did you try the testsuite with khugepaged at 100% load? I think that's
> a good indication of whether this window has any practical significance.
The measurement is simple: run the workloads, take some dumps with
page-types, and check whether the page states covering most of the
memory can be handled by memory_failure().
AFAIK this hasn't been done so far with THP.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Mar 17, 2011 at 05:27:10PM +0100, Andi Kleen wrote:
> > What I can tell is that with the default khugepaged scan rate,
> > collapse_huge_page will have a much smaller impact than KSM. It could
> > have more impact than KSM if you increase the khugepaged load to 100%
> > with sysfs (because khugepaged covers more memory than just the
> > shared portion covered by KSM). Then the window gets much bigger, but
> > still minor; if you can't trigger it with the testsuite, it's even
> > less likely to ever happen in practice.
>
> You mean randomly injecting errors?
> That tends to be hard and unreliable -- usually we try to have a
> specific tester that is not random.
I meant the testsuite using MCE injection, called mce-test. I've run
it a couple of times for some hugetlbfs collision with THP (solved
some time ago).
> The measurement is simple: run the workloads, take some dumps with
> page-types, and check whether the page states covering most of the
> memory can be handled by memory_failure().
>
> AFAIK this hasn't been done so far with THP.
I'm unsure whether there's already coverage for it in mce-test; the
biggest test I ran was hugetlbfs related (MAP_ANONYMOUS|MAP_HUGETLB,
file-backed, or shm). Surely it'd be a good idea to add THP
coverage.
> I meant the testsuite using MCE injection, called mce-test. I've run
> it a couple of times for some hugetlbfs collision with THP (solved
> some time ago).
This won't hit small code windows like that.
> I'm unsure whether there's already coverage for it in mce-test; the
> biggest test I ran was hugetlbfs related (MAP_ANONYMOUS|MAP_HUGETLB,
> file-backed, or shm). Surely it'd be a good idea to add THP
> coverage.
Sounds like a good idea. Feel free to contribute a test case.
-Andi
Hi, Andi
(2011/03/18 0:21), Andi Kleen wrote:
>> At least the copy to the last page of the huge page is performed
>> only after all the preceding copies have finished, so I'm not sure
>> the window is really just "a few" instructions.
>> Still, I think making the window smaller than it is now is worthwhile,
>> whether the change is from 0.1% to 0.01% or from 0.01% to 0.001%.
>
> Note that hwpoison will never reach 100% coverage. That's impossible.
> But to get nearer to 100% it's better to concentrate on the paths
> that affect long time windows and significant amounts of memory.
> Which those are is often non-obvious and needs measurement.
>
>>
>> Or did you find a downside to the check here?
>
> The usual problem is how to test it. That tends to be harder
> than just writing the code. If it's not tested it's probably
> not worth having.
>
We tested this with our own test method, and the problem really did
happen as we expected.
The method needs a kernel part and a user part, listed as follows.
1. Kernel part
   A. Debug interface
      - Check whether the THP-aligned page belongs to a THP.
      - Set the page position to be poisoned.
      - Set the flag selecting whether a 4K page or the THP in the
        khugepaged daemon will be poisoned.
      - Split the requested THP into 4K pages.
   B. A daemon, poison_sched
      Make the poison_sched daemon call memory_failure().
   C. Changes in khugepaged for debugging
      - Check whether the requested page will be collapsed.
      - Set the poison information for the poison_sched daemon
        when the requested page is about to be collapsed.
      - Print the poison information to the kernel log
        when the page has been poisoned.
2. User part
   A test APL
      - Requests memory which may contain a THP.
      - Sets test conditions via the debug interface.
The steps of our test are as follows (a rough sketch of the user
part follows the steps):
1. The APL requests memory and checks via the debug interface whether
   the THP-aligned page is a THP. If it is not, the APL is restarted
   until a THP is mapped.
2. The APL sets, via the debug interface, the page position to be
   poisoned and the flag selecting whether a 4K page or the THP in
   the khugepaged daemon is poisoned.
3. The APL requests, via the debug interface, that the THP be split.
   Here the kernel must remember the address and pfn of the split THP
   for the later page collapse.
   (Waiting for the page collapse ...)
4. When the khugepaged daemon collapses the remembered split THP
   address and pfn, it sets the poison information for the
   poison_sched daemon.
5. The khugepaged daemon continues its work, and at the same time the
   poison_sched daemon calls memory_failure() to deal with the
   poisoned page.
6. The khugepaged daemon prints the poison information to the kernel
   log, and we check ourselves whether the APL is killed.
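For illustration, here is a rough user-space sketch of steps 1-3;
the debugfs path and the command strings are invented for this
sketch, since the real debug interface was not posted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)

int main(void)
{
	void *buf;
	FILE *dbg;

	/* Step 1: request THP-aligned memory and fault it in. */
	if (posix_memalign(&buf, HPAGE_SIZE, HPAGE_SIZE))
		return 1;
	memset(buf, 0, HPAGE_SIZE);

	/* Hypothetical debug interface provided by the kernel part. */
	dbg = fopen("/sys/kernel/debug/thp-poison-test/ctl", "w");
	if (!dbg)
		return 1;

	/* Step 2: choose what to poison during the collapse. */
	fprintf(dbg, "target %p poison-thp\n", buf);
	/* Step 3: split the THP so that khugepaged recollapses it. */
	fprintf(dbg, "split %p\n", buf);
	fclose(dbg);

	/* Wait for khugepaged; an unhandled SIGBUS kills the APL. */
	pause();
	return 0;
}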
After we confirmed the above problem, we implemented and tested the
patch set, and confirmed that it resolves the problem we hit.
Thanks.
Best Regards,
Jin Dongming
> -Andi
On Thu, Mar 17, 2011 at 03:04:01PM +0100, Andrea Arcangeli wrote:
> Hello Hidetoshi,
>
> On Thu, Mar 17, 2011 at 04:43:03PM +0900, Hidetoshi Seto wrote:
> > Still, I think making the window smaller than it is now is worthwhile,
> > whether the change is from 0.1% to 0.01% or from 0.01% to 0.001%.
> >
> > Or did you find a downside to the check here?
>
> Well, it makes the code a little more complicated for something
> that looks impossible to trigger in the first place.
>
> The slowdown from these changes is probably not significant, because
> the 2M copy should dominate the collapse_huge_page cost, but it still
> adds locked ops and loops that weren't there before, so technically
> it is a micro-slowdown.
>
> NOTE: if this closed the race window 100% I would not disagree with
> it. If there's still a 0.001% chance of hitting the race, as Andi
> hints, then I don't find it very attractive. I think memory failure
> isn't 100% correct, and it's probably impossible to make it 100%
> correct across the whole kernel (for example, compound_head is safe
> for THP but it's still unsafe for hugetlbfs while the page is being
> torn down). So it's probably ok that it tends to work 100% reliably
> in practice when the task is running in userland but leaves holes
> when the kernel is mangling the page structures and moving stuff
> around; otherwise we litter the kernel code without much practical
> benefit. I guess KSM has the same issues as khugepaged, for example.
>
On an extended note, I don't understand why the hwpoison code should not
handle KSM pages the same way as other user-space pages. While it is
known that certain races exist between the merging of pages by KSM
and the poisoning of pages, the limitation imposed on PageKsm() pages
doesn't seem to avoid them.
It appears that the restriction on PageKsm() could be removed from
the memory-failure.c code without making the races any better or worse.
Any opinions on that?
Thanks,
K.Prasad
> So again, I'm not against making these changes, but I don't find them
> very attractive, and I'm unsure whether we should go down this route
> when the objective of the patch is only to reduce the race window
> (which is incredibly small and not reproducible to start with; it's a
> theoretical race that hardly anybody could trigger) instead of
> actually closing the race completely. But thanks a lot for your
> effort; I see your point, I'm just not sure it's worth it. I'll let
> others comment on this...
> On an extended note, I don't understand why the hwpoison code should not
> handle KSM pages the same way as other user-space pages. While it is
No one has done the work for it so far.
However, we did some evaluation some time ago with some simple
KVM setups, and it turned out that KSM covered only very little
physical memory overall (< a few percent). Normally it's better to
focus on areas with more payoff (= more memory), unless you can find
a workload where KSM is actually significant.
> It appears that the restriction on PageKsm() could be removed from
> the memory-failure.c code without making the races any better or worse.
> Any opinions on that?
I don't know. The KSM code paths would need to be audited first.
Hugh added the checks originally and may have an opinion.
-Andi