2009-06-08 16:30:02

by Hugh Dickins

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Thu, 14 May 2009, Izik Eidus wrote:

> This is a request for comments on the KSM API changes.
> The following patches move the API to use madvise() instead of ioctls.

Thanks a lot for doing this.

I'm afraid more than three weeks have gone past, and the 2.6.31
merge window is almost upon us, and you haven't even got a comment
out of me: I apologize for that.

Although my (lack of) response is indistinguishable from a conspiracy
to keep KSM out of the kernel, I beg to assure you that's not the case.
I do want KSM to go in - though I never shared Andrew's optimism that it
is 2.6.31 material: I've too long a list of notes/doubts on the existing
implementation, which I've not had time to expand upon to you; but I
don't think there are any killer issues, we should be able to work
things out as 2.6.31 goes through its -rcs, and aim for 2.6.32.

But let's get this change of interface sorted out first.

I remain convinced that it's right to go the madvise() route,
though I don't necessarily like the details in your patches.
And I've come to the conclusion that the only way I can force
myself to contribute constructively, is to start from these
patches, and shift things around until it's as I think it
should be, then see what you think of the result.

I notice that you chose to integrate fully (though not fully enough)
with vmas, adding a VM_MERGEABLE flag. Fine, that's probably going
to be safest in the end, and I'll follow you; but it is further than
I was necessarily asking you to go - it might have been okay to use
the madvise() interface, but just to declare areas of address space
(not necessarily backed by mappings) to ksm.c, as you did via /dev/ksm.
But it's fairly likely that if you had stayed with that, it would have
proved problematic later, so let's go forward with the full integration
with vmas.
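For illustration, here is a minimal userspace sketch of the interface under discussion; the advice name follows this patchset's MADV_SHAREABLE naming, and its numeric value here is only a placeholder, not a settled ABI:

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#ifndef MADV_SHAREABLE
#define MADV_SHAREABLE	12	/* placeholder value, not a settled ABI */
#endif

int main(void)
{
	size_t len = 64 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 0, len);	/* identical pages: good merge candidates */

	/* ask the kernel to let KSM scan and merge this range */
	if (madvise(buf, len, MADV_SHAREABLE))
		perror("madvise");
	return 0;
}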

>
> Before I describe the patches, I want to note that I rewrote this
> patch series a lot of times; all the other methods that I tried had some
> fundamental issues with them.
> The current implementation does have some issues, but I believe they are
> all solvable and better than the other ways to do it.
> If you feel you have a better way to do it, please tell me :).
>
> When we changed KSM to use madvise instead of ioctls, we wanted to keep
> the following rules:
>
> Not to increase the host memory usage if KSM is not being used (even when it
> is compiled in); this means not adding fields to mm_struct / vm_area_struct...
>
> Not to affect system performance with notifiers that would have to block
> while KSM code is running under some lock - KSM is a helper, it should do its
> work quietly. This is why I dropped the patch that added mmu notifier
> support inside ksm.c and received notifications from the MM (for example
> when a vma is destroyed (invalidate_range...)).
>
> Not to change the MM logic.
>
> Trying to touch as little code as we can outside ksm.c.

These are well-intentioned goals, and thank you for making the effort
to follow them; but I'm probably going to depart from them. I'd
rather we put in what's necessary and appropriate, and then cut
that down if necessary.

>
> Taking all these rules into account, the end result that we came up with is:
> mmlist is now used not only by swapoff but by KSM as well; this means that
> each time you call madvise to set a vma as MERGEABLE, madvise will check
> whether the mm_struct is in the mmlist and will insert it if it isn't.
> It is believed that it is better to hurt the performance of swapoff a little
> than to add another list to the mm_struct.

That was a perfectly sensible thing for you to do, given your rules
above; but I don't really like the result, and think it'll be clearer
to have your own list. Whether by mm or by vma, I've not yet decided:
by mm won't add enough #ifdef CONFIG_KSM bloat to worry about; by vma,
we might be able to reuse some prio_tree fields, I've not checked yet.
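For reference, a minimal sketch of the mmlist reuse described in the quoted text above; the helper name is made up, but the double-checked locking pattern mirrors how swap already hooks an mm onto mmlist:

/* Hypothetical helper: what madvise(MADV_SHAREABLE) would do so that
 * KSM can later walk all mms containing mergeable vmas.  The function
 * name is illustrative; the list and lock are the existing swapoff
 * mmlist and mmlist_lock. */
static void ksm_hook_mm_onto_mmlist(struct mm_struct *mm)
{
	if (list_empty(&mm->mmlist)) {
		spin_lock(&mmlist_lock);
		if (list_empty(&mm->mmlist))
			list_add(&mm->mmlist, &init_mm.mmlist);
		spin_unlock(&mmlist_lock);
	}
}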

>
> One issue that should be noted: after an mm_struct goes into the mmlist, it
> won't be removed from it until the process dies (even if there are no more
> VM_MERGEABLE vmas). This doesn't mean memory is wasted, but it does mean KSM
> will spend a little more time doing cur = cur->next if(...).
>
> Another issue: when a process dies, KSM will have to find (when scanning)
> that its mm_users == 1 and then do mmput(); this means there might be a delay
> from the time someone does a kill until the mm is really freed -
> I am open to suggestions on how to improve this...

You've resisted putting in the callbacks you need. I think they were
always (i.e. even when using /dev/ksm) necessary, but should become
more obvious now we have this tighter integration with mm's vmas.

You seem to have no callback in fork: doesn't that mean that KSM
pages get into mms of which mm/ksm.c has no knowledge? You had
no callback in mremap move: doesn't that mean that KSM pages could
be moved into areas which mm/ksm.c never tracked? Though that's
probably no issue now we move over to vmas: they should now travel
with their VM flag. You have no callback in unmap: doesn't that
mean that KSM never knows when its pages have gone away?

(Closing the /dev/ksm fd used to clean up some of this, in the
end; but the lifetime of the fd can be so different from that of
the mapped area, I've felt very unsafe with that technique - a good
technique when you're trying to sneak in special handling for your
special driver, but not a good technique once you go to mainline.)

I haven't worked out the full consequences of these lost pages:
perhaps it's no worse than that you could never properly enforce
your ksm_thread_max_kernel_pages quota.

>
> (when someone does echo 0 > /sys/kernel/mm/ksm/run, KSM will throw away all the
> memory, so a condition where the memory is never freed won't happen)
>
>
> Another important thing: this is a request for comments; I am still not sure a few
> things we have done here are totally safe
> (the mmlist sync with drain_mmlist, the handle_vmas() function in madvise,
> the logic inside ksm for searching the next virtual address in the vmas,
> and so on...).
> The main purpose of this is to ask whether the new interface is what you guys
> want, and whether you like the implementation design.

It's in the right direction. But it would be silly for me to start
criticizing your details now: I need to try doing the same, that will
force me to think deeply enough about it, and I may then be led to
the same decisions as you made.

>
> (I have added an option to scan closed-source applications as well)

That's a nice detail that I'll find very useful for testing,
but we might want to hold it back longer than the rest. I just get
naturally more cautious when we consider interfaces for doing things
to other processes, and want to spend even longer over it.

>
>
> Thanks.
>
> Izik Eidus (4):
> madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls.

I didn't understand why you went over to VM_MERGEABLE but stuck
with MADV_SHAREABLE: there's a confusing mix of shareables and
mergeables, I'll head for mergeables throughout, though keep to "KSM".

> mmlist: share mmlist with ksm.
> ksm: change ksm api to use madvise instead of ioctls.
> ksm: add support for scanning procsses that were not modifided to use
> ksm

While I'm being communicative, let me mention two things,
not related to this RFC patchset, but to what's currently in mmotm.

I've a bugfix patch to scan_get_next_index(), I'll send that to you
in a few moments.

And a question on your page_wrprotect() addition to mm/rmap.c: though
it may contain some important checks (I'm thinking of the get_user_pages
protection), isn't it essentially redundant, and should be removed from
the patchset? If we have a private page which is mapped into more than
the one address space by which we arrive at it, then, quite independent
of KSM, it needs to be write-protected already to prevent mods in one
address space leaking into another - doesn't it? So I see no need for
the rmap'ped write-protection there, just make the checks and write
protect the pte you have in ksm.c. Or am I missing something?
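The mechanism behind that reasoning, paraphrased from memory of the COW handling in copy_one_pte() in mm/memory.c (not code from this patchset): fork() write-protects private mappings in both parent and child, so an anonymous page with mapcount greater than one is already read-only in every pte.

	/*
	 * If it's a COW mapping, write protect it both
	 * in the parent and the child
	 */
	if (is_cow_mapping(vm_flags)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}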

Hugh


2009-06-08 16:46:35

by Hugh Dickins

Subject: [PATCH mmotm] ksm: stop scan skipping pages

KSM can be slow to identify all mergeable pages. There's an off-by-one
in how ksm_scan_start() proceeds (see how it does a scan_get_next_index
at the head of its loop, but also on leaving its loop), which causes it
to skip over a page occasionally. Fix that.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/ksm.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)

--- mmotm/mm/ksm.c 2009-06-08 13:14:36.000000000 +0100
+++ fixed/mm/ksm.c 2009-06-08 13:18:30.000000000 +0100
@@ -1331,30 +1331,26 @@ out:
/* return -EAGAIN - no slots registered, nothing to be done */
static int scan_get_next_index(struct ksm_scan *ksm_scan)
{
- struct ksm_mem_slot *slot;
+ struct list_head *next;

if (list_empty(&slots))
return -EAGAIN;

- slot = ksm_scan->slot_index;
-
/* Are there pages left in this slot to scan? */
- if ((slot->npages - ksm_scan->page_index - 1) > 0) {
- ksm_scan->page_index++;
+ ksm_scan->page_index++;
+ if (ksm_scan->page_index < ksm_scan->slot_index->npages)
return 0;
- }

- list_for_each_entry_from(slot, &slots, link) {
- if (slot == ksm_scan->slot_index)
- continue;
- ksm_scan->page_index = 0;
- ksm_scan->slot_index = slot;
+ ksm_scan->page_index = 0;
+ next = ksm_scan->slot_index->link.next;
+ if (next != &slots) {
+ ksm_scan->slot_index =
+ list_entry(next, struct ksm_mem_slot, link);
return 0;
}

/* look like we finished scanning the whole memory, starting again */
root_unstable_tree = RB_ROOT;
- ksm_scan->page_index = 0;
ksm_scan->slot_index = list_first_entry(&slots,
struct ksm_mem_slot, link);
return 0;
@@ -1366,21 +1362,22 @@ static int scan_get_next_index(struct ks
* pointed to was released so we have to call this function every time after
* taking the slots_lock
*/
-static void scan_update_old_index(struct ksm_scan *ksm_scan)
+static int scan_update_old_index(struct ksm_scan *ksm_scan)
{
struct ksm_mem_slot *slot;

if (list_empty(&slots))
- return;
+ return -EAGAIN;

list_for_each_entry(slot, &slots, link) {
if (ksm_scan->slot_index == slot)
- return;
+ return 0;
}

ksm_scan->slot_index = list_first_entry(&slots,
struct ksm_mem_slot, link);
ksm_scan->page_index = 0;
+ return 0;
}

/**
@@ -1399,20 +1396,14 @@ static int ksm_scan_start(struct ksm_sca
struct ksm_mem_slot *slot;
struct page *page[1];
int val;
- int ret = 0;
+ int ret;

down_read(&slots_lock);
+ ret = scan_update_old_index(ksm_scan);

- scan_update_old_index(ksm_scan);
-
- while (scan_npages > 0) {
- ret = scan_get_next_index(ksm_scan);
- if (ret)
- goto out;
-
- slot = ksm_scan->slot_index;
-
+ while (scan_npages && !ret) {
cond_resched();
+ slot = ksm_scan->slot_index;

/*
* If the page is swapped out or in swap cache, we don't want to
@@ -1433,10 +1424,11 @@ static int ksm_scan_start(struct ksm_sca
} else {
up_read(&slot->mm->mmap_sem);
}
+
+ ret = scan_get_next_index(ksm_scan);
scan_npages--;
}
- scan_get_next_index(ksm_scan);
-out:
+
up_read(&slots_lock);
return ret;
}

2009-06-08 17:18:57

by Izik Eidus

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

Hugh Dickins wrote:
> On Thu, 14 May 2009, Izik Eidus wrote:
>
>
>> This is comment request for ksm api changes.
>> The following patchs move the api to use madvise instead of ioctls.
>>
>
> Thanks a lot for doing this.
>
> I'm afraid more than three weeks have gone past, and the 2.6.31
> merge window is almost upon us, and you haven't even got a comment
> out of me: I apologize for that.
>
> Although my (lack of) response is indistinguishable from a conspiracy
> to keep KSM out of the kernel, I beg to assure you that's not the case.
> I do want KSM to go in - though I never shared Andrew's optimism that it
> is 2.6.31 material: I've too long a list of notes/doubts on the existing
> implementation, which I've not had time to expand upon to you; but I
> don't think there are any killer issues, we should be able to work
> things out as 2.6.31 goes through its -rcs, and aim for 2.6.32.
>
> But let's get this change of interface sorted out first.
>

Agree.

> I remain convinced that it's right to go the madvise() route,
> though I don't necessarily like the details in your patches.
> And I've come to the conclusion that the only way I can force
> myself to contribute constructively, is to start from these
> patches, and shift things around until it's as I think it
> should be, then see what you think of the result.
>

Sounds like the perfect way to go.

> I notice that you chose to integrate fully (though not fully enough)
> with vmas, adding a VM_MERGEABLE flag. Fine, that's probably going
> to be safest in the end, and I'll follow you; but it is further than
> I was necessarily asking you to go - it might have been okay to use
> the madvise() interface, but just to declare areas of address space
> (not necessarily backed by mappings) to ksm.c, as you did via /dev/ksm.
> But it's fairly likely that if you had stayed with that, it would have
> proved problematic later, so let's go forward with the full integration
> with vmas.
>
>
>> Before i will describe the patchs, i want to note that i rewrote this
>> patch seires alot of times, all the other methods that i have tried had some
>> fandumatel issues with them.
>> The current implemantion does have some issues with it, but i belive they are
>> all solveable and better than the other ways to do it.
>> If you feel you have better way how to do it, please tell me :).
>>
>> Ok when we changed ksm to use madvise instead of ioctls we wanted to keep
>> the following rules:
>>
>> Not to increase the host memory usage if ksm is not being used (even when it
>> is compiled), this mean not to add fields into mm_struct / vm_area_struct...
>>
>> Not to effect the system performence with notifiers that will have to block
>> while ksm code is running under some lock - ksm is helper, it should do it
>> work quitely, - this why i dropped patch that i did that add mmu notifiers
>> support inside ksm.c and recived notifications from the MM (for example
>> when vma is destroyed (invalidate_range...)
>>
>> Not to change the MM logic.
>>
>> Trying to touch as less code as we can outisde ksm.c
>>
>
> These are well-intentioned goals, and thank you for making the effort
> to follow them; but I'm probably going to depart from them. I'd
> rather we put in what's necessary and appropriate, and then cut
> that down if necessary.
>

That's the way to go. I just didn't want to scare anyone (it was obvious to
me that it is needed; I just wanted you to say it is needed).

>
>> Taking into account all this rules, the end result that we have came with is:
>> mmlist is now not used only by swapoff, but by ksm as well, this mean that
>> each time you call to madvise for to set vma as MERGEABLE, madvise will check
>> if the mm_struct is inside the mmlist and will insert it in case it isnt.
>> It is belived that it is better to hurt little bit the performence of swapoff
>> than adding another list into the mm_struct.
>>
>
> That was a perfectly sensible thing for you to do, given your rules
> above; but I don't really like the result, and think it'll be clearer
> to have your own list. Whether by mm or by vma, I've not yet decided:
> by mm won't add enough #idef CONFIG_KSM bloat to worry about; by vma,
> we might be able to reuse some prio_tree fields, I've not checked yet.
>
>
>> One issue that should be note is: after mm_struct is going into the mmlist, it
>> wont be kicked from it until the procsses is die (even if there are no more
>> VM_MERGEABLE vmas), this doesnt mean memory is wasted, but it does mean ksm
>> will spend little more time in doing cur = cur->next if(...).
>>
>> Another issue is: when procsess is die, ksm will have to find (when scanning)
>> that its mm_users == 1 and then do mmput(), this mean that there might be dealy
>> from the time that someone do kill until the mm is really free -
>> i am open for suggestions on how to improve this...
>>
>
> You've resisted putting in the callbacks you need. I think they were
> always (i.e. even when using /dev/ksm) necessary, but should become
> more obvious now we have this tighter integration with mm's vmas.
>
> You seem to have no callback in fork: doesn't that mean that KSM
> pages get into mms of which mm/ksm.c has no knowledge?
What do you mean by this? Should the vma flags be copied into the child,
and therefore KSM will scan the vma?
(the only thing I have to check is: maybe the process itself won't go into
the mmlist, and therefore KSM won't know about it)

> You had
> no callback in mremap move: doesn't that mean that KSM pages could
> be moved into areas which mm/ksm.c never tracked? Though that's
> probably no issue now we move over to vmas: they should now travel
> with their VM flag. You have no callback in unmap: doesn't that
> mean that KSM never knows when its pages have gone away?
>

Yes, adding all these callbacks would make KSM much happier. Again, I
didn't want to scare anyone...

> (Closing the /dev/ksm fd used to clean up some of this, in the
> end; but the lifetime of the fd can be so different from that of
> the mapped area, I've felt very unsafe with that technique - a good
> technique when you're trying to sneak in special handling for your
> special driver, but not a good technique once you go to mainline.)
>
> I haven't worked out the full consequences of these lost pages:
> perhaps it's no worse than that you could never properly enforce
> your ksm_thread_max_kernel_pages quota.
>

You mean the shared pages outside the stable tree comment?

>
>> (when someone do echo 0 > /sys/kernel/mm/ksm/run ksm will throw away all the
>> memory, so condtion when the memory wont ever be free wont happen)
>>
>>
>> Another important thing is: this is request for comment, i still not sure few
>> things that we have made here are totaly safe:
>> (the mmlist sync with drain_mmlist, and the handle_vmas() function in madvise,
>> the logic inside ksm for searching the next virtual address on the vmas,
>> and so on...)
>> The main purpuse of this is to ask if the new interface is what you guys
>> want..., and if you like the impelmantion desgin.
>>
>
> It's in the right direction. But it would be silly for me to start
> criticizing your details now: I need to try doing the same, that will
> force me to think deeply enough about it, and I may then be led to
> the same decisions as you made.
>
>
>> (I have added option to scan closed support applications as well)
>>
>
> That's a nice detail that I'll find very useful for testing,
> but we might want to hold it back longer than the rest. I just get
> naturally more cautious when we consider interfaces for doing things
> to other processes, and want to spend even longer over it.
>
>
>> Thanks.
>>
>> Izik Eidus (4):
>> madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls.
>>
>
> I didn't understand why you went over to VM_MERGEABLE but stuck
> with MADV_SHAREABLE: there's a confusing mix of shareables and
> mergeables, I'll head for mergeables throughout, though keep to "KSM".
>
>
>> mmlist: share mmlist with ksm.
>> ksm: change ksm api to use madvise instead of ioctls.
>> ksm: add support for scanning procsses that were not modifided to use
>> ksm
>>
>
> While I'm being communicative, let me mention two things,
> not related to this RFC patchset, but to what's currently in mmotm.
>
> I've a bugfix patch to scan_get_next_index(), I'll send that to you
>

Thanks.


> in a few moments.
>
> And a question on your page_wrprotect() addition to mm/rmap.c: though
> it may contain some important checks (I'm thinking of the get_user_pages
> protection), isn't it essentially redundant, and should be removed from
> the patchset? If we have a private page which is mapped into more than
> the one address space by which we arrive at it, then, quite independent
> of KSM, it needs to be write-protected already to prevent mods in one
> address space leaking into another - doesn't it? So I see no need for
> the rmap'ped write-protection there, just make the checks and write
> protect the pte you have in ksm.c. Or am I missing something?
>

OK, so we have here 2 cases for KSM:
1:
When the page is anonymous and is mapped readonly between several
processes:
for this you say we shouldn't walk over the rmap and try to
write-protect what is already write-protected...

2:
When the page is anonymous and is mapped writable by just one process:
for this you say it is better to handle it directly from inside
KSM because we already know
the virtual address mapping of this page?

So about this: you are right that we might not
have to walk over the rmap of the page for pages with mapcount 1,
but isn't it cleaner to deal with it inside rmap.c?
Another thing: get_user_pages() protection is needed even in
that case, because get_user_pages_fast() is lockless, so O_DIRECT
can run under our legs after we write-protect the page.


Anyway, nothing critical; I don't mind moving
page_write_protect_one() into ksm.c, I still think the get_user_pages
protection is needed.


Thanks a lot for your time.
> Hugh
>

2009-06-08 17:43:31

by Izik Eidus

Subject: Re: [PATCH mmotm] ksm: stop scan skipping pages

Hugh Dickins wrote:
> KSM can be slow to identify all mergeable pages. There's an off-by-one
> in how ksm_scan_start() proceeds (see how it does a scan_get_next_index
> at the head of its loop, but also on leaving its loop), which causes it
> to skip over a page occasionally. Fix that.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/ksm.c | 46 +++++++++++++++++++---------------------------
> 1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- mmotm/mm/ksm.c 2009-06-08 13:14:36.000000000 +0100
> +++ fixed/mm/ksm.c 2009-06-08 13:18:30.000000000 +0100
> @@ -1331,30 +1331,26 @@ out:
> /* return -EAGAIN - no slots registered, nothing to be done */
> static int scan_get_next_index(struct ksm_scan *ksm_scan)
> {
> - struct ksm_mem_slot *slot;
> + struct list_head *next;
>
> if (list_empty(&slots))
> return -EAGAIN;
>
> - slot = ksm_scan->slot_index;
> -
> /* Are there pages left in this slot to scan? */
> - if ((slot->npages - ksm_scan->page_index - 1) > 0) {
> - ksm_scan->page_index++;
> + ksm_scan->page_index++;
> + if (ksm_scan->page_index < ksm_scan->slot_index->npages)
> return 0;
> - }
>
> - list_for_each_entry_from(slot, &slots, link) {
> - if (slot == ksm_scan->slot_index)
> - continue;
> - ksm_scan->page_index = 0;
> - ksm_scan->slot_index = slot;
> + ksm_scan->page_index = 0;
> + next = ksm_scan->slot_index->link.next;
> + if (next != &slots) {
> + ksm_scan->slot_index =
> + list_entry(next, struct ksm_mem_slot, link);
> return 0;
> }
>
> /* look like we finished scanning the whole memory, starting again */
> root_unstable_tree = RB_ROOT;
> - ksm_scan->page_index = 0;
> ksm_scan->slot_index = list_first_entry(&slots,
> struct ksm_mem_slot, link);
> return 0;
> @@ -1366,21 +1362,22 @@ static int scan_get_next_index(struct ks
> * pointed to was released so we have to call this function every time after
> * taking the slots_lock
> */
> -static void scan_update_old_index(struct ksm_scan *ksm_scan)
> +static int scan_update_old_index(struct ksm_scan *ksm_scan)
> {
> struct ksm_mem_slot *slot;
>
> if (list_empty(&slots))
> - return;
> + return -EAGAIN;
>
> list_for_each_entry(slot, &slots, link) {
> if (ksm_scan->slot_index == slot)
> - return;
> + return 0;
> }
>
> ksm_scan->slot_index = list_first_entry(&slots,
> struct ksm_mem_slot, link);
> ksm_scan->page_index = 0;
> + return 0;
> }
>
> /**
> @@ -1399,20 +1396,14 @@ static int ksm_scan_start(struct ksm_sca
> struct ksm_mem_slot *slot;
> struct page *page[1];
> int val;
> - int ret = 0;
> + int ret;
>
> down_read(&slots_lock);
> + ret = scan_update_old_index(ksm_scan);
>
> - scan_update_old_index(ksm_scan);
> -
> - while (scan_npages > 0) {
> - ret = scan_get_next_index(ksm_scan);
> - if (ret)
> - goto out;
> -
> - slot = ksm_scan->slot_index;
> -
> + while (scan_npages && !ret) {
> cond_resched();
> + slot = ksm_scan->slot_index;
>
> /*
> * If the page is swapped out or in swap cache, we don't want to
> @@ -1433,10 +1424,11 @@ static int ksm_scan_start(struct ksm_sca
> } else {
> up_read(&slot->mm->mmap_sem);
> }
> +
> + ret = scan_get_next_index(ksm_scan);
> scan_npages--;
> }
> - scan_get_next_index(ksm_scan);
> -out:
> +
> up_read(&slots_lock);
> return ret;
> }
>
ACK.

Thanks for the fix,
(I saw it while I wrote the RFC patch for the madvise, but because
I thought that the RFC fixes this (you can see the removal of the second
call to scan_get_next_index()), and we are moving to madvise, I thought that
no patch was needed for this code; guess I was wrong)

Thanks.

2009-06-08 18:12:24

by Hugh Dickins

Subject: Re: [PATCH mmotm] ksm: stop scan skipping pages

On Mon, 8 Jun 2009, Izik Eidus wrote:
>
> Thanks for the fix,
> (I saw it while i wrote the RFC patch for the madvise, but beacuse that i
> thought that the RFC fix this (you can see the removel of the second call to
> scan_get_next_index()), and we move to madvise, I thought that no patch is
> needed for this code, guess I was wrong)

Ah, no, I hadn't noticed that, this patch is from several weeks ago,
before you even posted the madvise() version.

I think myself that we ought to fix the algorithm as it stands now in
mmotm, rather than hiding the fix in amongst later interface changes.

But it's not a big deal, so long as it gets fixed in the end.

By the time Andrew sends KSM to Linus, it shouldn't be the
patches currently in mmotm with more on top: they should be
re-presented with all trace of /dev/ksm gone by then.

Hugh

2009-06-08 18:43:04

by Hugh Dickins

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Mon, 8 Jun 2009, Izik Eidus wrote:
> Hugh Dickins wrote:
> >
> > You seem to have no callback in fork: doesn't that mean that KSM
> > pages get into mms of which mm/ksm.c has no knowledge?
> What you mean by this?, should the vma flags be copyed into the child and
> therefore ksm will scan the vma?

That part should be happening automatically now
(you'd have to add code to avoid copying VM_ flags).

> (only thing i have to check is: maybe the process itself wont go into the
> mmlist, and therefore ksm wont know about it)

But that part needs something to make it happen. In the case of
swap, it's done over in copy_one_pte; the important thing is to
do it where the locking is such that it cannot get missed,
I can't say where is the right place yet.

>
> > You had
> > no callback in mremap move: doesn't that mean that KSM pages could
> > be moved into areas which mm/ksm.c never tracked? Though that's
> > probably no issue now we move over to vmas: they should now travel
> > with their VM flag. You have no callback in unmap: doesn't that
> > mean that KSM never knows when its pages have gone away?
> >
>
> Yes, Adding all this callbacks would make ksm much more happy, Again, i didnt
> want to scare anyone...

Izik, you're too kindly ;)

>
> > (Closing the /dev/ksm fd used to clean up some of this, in the
> > end; but the lifetime of the fd can be so different from that of
> > the mapped area, I've felt very unsafe with that technique - a good
> > technique when you're trying to sneak in special handling for your
> > special driver, but not a good technique once you go to mainline.)
> >
> > I haven't worked out the full consequences of these lost pages:
> > perhaps it's no worse than that you could never properly enforce
> > your ksm_thread_max_kernel_pages quota.
> >
>
> You mean the shared pages outside the stable tree comment?

I don't understand or even parse that question: which probably
means, no. I mean that if the merged pages end up in areas
outside of where your trees expect them, then you've got pages
counted which cannot be found to have gone. Whether you can get
too many resident pages that way, or just more counted than are
really there, I don't know: I've not worked through the accounting
yet, and would really really prefer it to be exact, instead of
pages getting lost outside the trees and raising these questions.

But you may tell me that I simply haven't understood it yet.

> > And a question on your page_wrprotect() addition to mm/rmap.c: though
> > it may contain some important checks (I'm thinking of the get_user_pages
> > protection), isn't it essentially redundant, and should be removed from
> > the patchset? If we have a private page which is mapped into more than
> > the one address space by which we arrive at it, then, quite independent
> > of KSM, it needs to be write-protected already to prevent mods in one
> > address space leaking into another - doesn't it? So I see no need for
> > the rmap'ped write-protection there, just make the checks and write
> > protect the pte you have in ksm.c. Or am I missing something?
> >
>
> Ok, so we have here 2 cases for ksm:
> 1:
> When the page is anonymous and is mapped readonly beteween serveal
> processes:
> for this you say we shouldnt walk over the rmap and try to
> writeprotect what is already writeprtected...

Yes.

>
> 2:
> When the page is anonymous and is mapped write by just one process:
> for this you say it is better to handle it directly from inside ksm
> beacuse we already know
> the virtual address mapping of this page?

Yes.

>
> so about this: you are right about the fact that we might dont have to
> walk over the rmap of the page for pages with mapcount 1
> but isnt it cleaner to deal it inside rmap.c?

If you needed to writeprotect pages with mapcount 2, 3, ... then
indeed you'd want to do it in rmap.c, and you wouldn't want to
specialcase mapcount 1. But I'm saying you don't need to do the
writeprotection with mapcount 2, 3, ...; therefore better just
to do it (again, no need to special case mapcount 1) in ksm.c.

Cutting out a body of code: that's as clean as clean can be.

> another thing, get_user_pages() protection is needed even in
> that case, beacuse get_user_pages_fast is lockless, so odirect
> can run under our legs after we write protecting the page.

That I don't dispute, I made a point of saying that "it may contain
some important checks (I'm thinking of the get_user_pages protection)",
meaning that you'd want to transfer any such checks to ksm.c.

(In my own preferred patchset structure, I'd actually like that
O_DIRECT get_user_pages consideration added as a separate patch
over the base: one needs to focus in a very different way to get
those issues, it's better to hold them up for separate inspection.
The patch before would be buggy, yes, but not in such a way that
anyone bisecting for an unrelated bug would get inconvenienced.)

>
> anyway, nothing critical, i dont mind to move page_write_protect_one()
> into ksm.c, i still think get_user_pages protection is needed.

Yes, something like that.

Hugh

2009-06-08 20:11:03

by Izik Eidus

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

Hugh Dickins wrote:
> On Mon, 8 Jun 2009, Izik Eidus wrote:
>
>> Hugh Dickins wrote:
>>
>>> You seem to have no callback in fork: doesn't that mean that KSM
>>> pages get into mms of which mm/ksm.c has no knowledge?
>>>
>> What you mean by this?, should the vma flags be copyed into the child and
>> therefore ksm will scan the vma?
>>
>
> That part should be happening automatically now
> (you'd have to add code to avoid copying VM_ flags).
>

Yea, should -> shouldn't

>
>> (only thing i have to check is: maybe the process itself wont go into the
>> mmlist, and therefore ksm wont know about it)
>>
>
> But that part needs something to make it happen. In the case of
> swap, it's done over in copy_one_pte; the important thing is to
> do it where the locking is such that it cannot get missed,
> I can't say where is the right place yet.
>
>
>>> You had
>>> no callback in mremap move: doesn't that mean that KSM pages could
>>> be moved into areas which mm/ksm.c never tracked? Though that's
>>> probably no issue now we move over to vmas: they should now travel
>>> with their VM flag. You have no callback in unmap: doesn't that
>>> mean that KSM never knows when its pages have gone away?
>>>
>>>
>> Yes, Adding all this callbacks would make ksm much more happy, Again, i didnt
>> want to scare anyone...
>>
>
> Izik, you're too kindly ;)
>
>
>>> (Closing the /dev/ksm fd used to clean up some of this, in the
>>> end; but the lifetime of the fd can be so different from that of
>>> the mapped area, I've felt very unsafe with that technique - a good
>>> technique when you're trying to sneak in special handling for your
>>> special driver, but not a good technique once you go to mainline.)
>>>
>>> I haven't worked out the full consequences of these lost pages:
>>> perhaps it's no worse than that you could never properly enforce
>>> your ksm_thread_max_kernel_pages quota.
>>>
>>>
>> You mean the shared pages outside the stable tree comment?
>>
>
> I don't understand or even parse that question: which probably
> means, no. I mean that if the merged pages end up in areas
> outside of where your trees expect them, then you've got pages
> counted which cannot be found to have gone. Whether you can get
> too many resident pages that way, or just more counted than are
> really there, I don't know: I've not worked through the accounting
> yet, and would really really prefer it to be exact, instead of
> pages getting lost outside the trees and raising these questions.
>

Agreed, but to have the exact number you would need a notification from
do_wp_page when it breaks the write-protected KSM pages, no?

(Right now KSM counts these pages somewhat "out of sync" with the
Linux VM; it doesn't mean that more unswappable pages can be allocated
by KSM than allowed by ksm_thread_max_kernel_pages, but it does mean that
for some period of time fewer kernel pages might be allocated than allowed,
because KSM will find out too late that some of the kernel pages it
already allocated were broken and became swappable)

> But you may tell me that I simply haven't understood it yet.
>
>
>>> And a question on your page_wrprotect() addition to mm/rmap.c: though
>>> it may contain some important checks (I'm thinking of the get_user_pages
>>> protection), isn't it essentially redundant, and should be removed from
>>> the patchset? If we have a private page which is mapped into more than
>>> the one address space by which we arrive at it, then, quite independent
>>> of KSM, it needs to be write-protected already to prevent mods in one
>>> address space leaking into another - doesn't it? So I see no need for
>>> the rmap'ped write-protection there, just make the checks and write
>>> protect the pte you have in ksm.c. Or am I missing something?
>>>
>>>
>> Ok, so we have here 2 cases for ksm:
>> 1:
>> When the page is anonymous and is mapped readonly beteween serveal
>> processes:
>> for this you say we shouldnt walk over the rmap and try to
>> writeprotect what is already writeprtected...
>>
>
> Yes.
>
>
>> 2:
>> When the page is anonymous and is mapped write by just one process:
>> for this you say it is better to handle it directly from inside ksm
>> beacuse we already know
>> the virtual address mapping of this page?
>>
>
> Yes.
>
>
>> so about this: you are right about the fact that we might dont have to
>> walk over the rmap of the page for pages with mapcount 1
>> but isnt it cleaner to deal it inside rmap.c?
>>
>
> If you needed to writeprotect pages with mapcount 2, 3, ... then
> indeed you'd want to do it in rmap.c, and you wouldn't want to
> specialcase mapcount 1. But I'm saying you don't need to do the
> writeprotection with mapcount 2, 3, ...; therefore better just
> to do it (again, no need to special case mapcount 1) in ksm.c.
>
> Cutting out a body of code: that's as clean as clean can be.
>
>
>
Makes sense, I will send a patch today that merges that code into ksm.c
(without all the rmap walking).

2009-06-08 20:13:18

by Izik Eidus

Subject: Re: [PATCH mmotm] ksm: stop scan skipping pages

Hugh Dickins wrote:
> On Mon, 8 Jun 2009, Izik Eidus wrote:
>
>> Thanks for the fix,
>> (I saw it while i wrote the RFC patch for the madvise, but beacuse that i
>> thought that the RFC fix this (you can see the removel of the second call to
>> scan_get_next_index()), and we move to madvise, I thought that no patch is
>> needed for this code, guess I was wrong)
>>
>
> Ah, no, I hadn't noticed that, this patch is from several weeks ago,
> before you even posted the madvise() version.
>
> I think myself that we ought to fix the algorithm as it stands now in
> mmotm, rather than hiding the fix in amongst later interface changes.
>
> But it's not a big deal, so long as it gets fixed in the end.
>
> By the time Andrew sends KSM to Linus, it shouldn't be the
> patches currently in mmotm with more on top:

So you want a repost of all the patches? In that case, is there any value
in keeping this stuff in Andrew's tree right now?
Wouldn't it speed up our development if we kept it outside, and then
reposted the whole thing?

> they should be
> re-presented with all trace of /dev/ksm gone by then.
>
> Hugh
>

2009-06-08 21:16:46

by Hugh Dickins

Subject: Re: [PATCH mmotm] ksm: stop scan skipping pages

On Mon, 8 Jun 2009, Izik Eidus wrote:
> Hugh Dickins wrote:
> >
> > By the time Andrew sends KSM to Linus, it shouldn't be the
> > patches currently in mmotm with more on top:
>
> So you want a repost of all the patchs?

Will do eventually. In the same way that Andrew merges fixes to
patches into the patches before sending to Linus. But in this
case, rather large pieces would vanish in such a process, plus it
would be unfair to expect Andrew to do that work; so it would
be better for you to submit replacements when we're ready.

> in that case any value to keep this
> stuff in Andrew tree right now?

I guess that depends on whether it's actually receiving any
testing or review while it's in there. I'm happy for it to remain
there for now - and the core of it is going to remain unchanged
or only trivially changed. Really a matter for you to decide
with Andrew: I don't want to pull it out of circulation myself.

> Wouldnt it speed our development if we will keep it outside,
> and then repost the whole thing?

I don't see that having a version there in mmotm will slow us
down (what could slow me down more than myself ;-?). But you
and Andrew are free to decide it's counterproductive to keep the
current version there, that's okay by me if you prefer to pull it.

Hugh

2009-06-08 22:58:18

by Andrea Arcangeli

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

Hi Hugh!

On Mon, Jun 08, 2009 at 05:18:57PM +0100, Hugh Dickins wrote:
> I notice that you chose to integrate fully (though not fully enough)
> with vmas, adding a VM_MERGEABLE flag. Fine, that's probably going
> to be safest in the end, and I'll follow you; but it is further than

After we decided to switch to madvise, I suggested taking the approach
in the RFC in this thread: an rmap_item list ordered by address that is
scanned nested inside the core vma loop searching for vmas with
VM_MERGEABLE set, so that madvise only has to set the VM_MERGEABLE bit
and we still avoid altering the VM at all. Replacing the ksm slots
(a kind of out-of-sync ksm vmas) with the nested loop allows us to
provide madvise semantics while still being out of sync with rmap_item
and tree_item.

> You've resisted putting in the callbacks you need. I think they were
> always (i.e. even when using /dev/ksm) necessary, but should become
> more obvious now we have this tighter integration with mm's vmas.

There is no need to litter the VM with new callbacks; KSM only has
to register an mmu notifier to tear down rmap_items (and the tree_item
when the last rmap_item of the tree_item list is gone) synchronously
when the mappings are invalidated during munmap/mremap or swapping
(anon pages can already be swapped while they're tracked by the unstable
tree, and later we hope stable-tree ksm pages can be swapped too, and
we get rid of those from the rbtree as we walk it with
is_present_pte and gup).
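For concreteness, a rough sketch of what such a registration could look like, assuming one notifier is allocated per tracked mm; the callback body and the ksm_remove_rmap_items() helper are hypothetical, and the locking constraints on these methods are the ones discussed below:

#include <linux/mmu_notifier.h>
#include <linux/slab.h>

/* hypothetical: drop rmap_items (and empty tree_items) in [start, end) */
static void ksm_remove_rmap_items(struct mm_struct *mm,
				  unsigned long start, unsigned long end);

static void ksm_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	/* must not sleep in here: one of the constraints mentioned below */
	ksm_remove_rmap_items(mm, start, end);
}

static const struct mmu_notifier_ops ksm_notifier_ops = {
	.invalidate_range_start	= ksm_invalidate_range_start,
};

/* would be called when an mm first gains a VM_MERGEABLE area */
static int ksm_track_mm(struct mm_struct *mm)
{
	struct mmu_notifier *mn = kzalloc(sizeof(*mn), GFP_KERNEL);

	if (!mn)
		return -ENOMEM;
	mn->ops = &ksm_notifier_ops;
	return mmu_notifier_register(mn, mm);
}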

After that we could also have tree_item store the kernel vaddr (or
struct page *) to avoid calling gup() during the walk of the rbtree.
However, such a change involves huge complexity for little gain, and
getting rid of orphaned rmap_item/tree_item entries synchronously
doesn't provide practical advantages. It takes a single pass to get
rid of all orphaned rmap_items/tree_items, and shutting down ksm gets
rid of everything, releasing any memory allocated by KSM (or at least
it should, especially if we don't allow it to be a module anymore
because it might not be worth it; the =N option is still useful for
embedded users who are sure they won't get much benefit from KSM in
their apps).

Izik has already been brave enough to try keeping rmap_item/tree_item in
sync with mmu notifiers, but one of the complexities was in the locking:
we can't schedule anywhere in the mmu notifier methods (this could be
relaxed, but it makes no sense to depend on that just to get rid of
orphaned rmap_item/tree_item entries synchronously). Another complexity
was the fact that replace_page, when it does set_pte_at_notify, will
recurse into KSM itself if KSM registers an MMU notifier.

Admittedly I didn't yet spend much time thinking about how everything is
supposed to work with mmu notifiers and in-sync rmap_item/tree_item, with
tree_item pointing to a physical page address without any struct page
pinning whatsoever (the very cool thing that mmu notifiers are all
about: no pinning and full swapping), because I doubt it will provide
any significant advantage, and I think being simpler has some value too,
at least until there are more users of the feature. Hence the current
implementation still has an out-of-sync scan and garbage collects
rmap_item/tree_item lazily. This is not KVM, which without mmu notifiers
can't swap at all. We can do all our work here in KSM out of sync by
just garbage collecting the orphaned entries lazily; KSM works with
virtual addresses and only takes temporary gup pins, so it's not forced
to use mmu notifiers to behave well.

So my suggestion is to go with out of sync, and then once this is all
solid and finished, we can try to use mmu notifiers to make it in sync
and benchmark to see if it is worth it, with the primary benefit of
being able to remove gup from the tree lookup (collecting orphaned
rmap_item/tree_item entries a bit sooner doesn't matter so much IMHO,
at least for KVM).

> And a question on your page_wrprotect() addition to mm/rmap.c: though
> it may contain some important checks (I'm thinking of the get_user_pages
> protection), isn't it essentially redundant, and should be removed from
> the patchset? If we have a private page which is mapped into more than
> the one address space by which we arrive at it, then, quite independent
> of KSM, it needs to be write-protected already to prevent mods in one
> address space leaking into another - doesn't it? So I see no need for
> the rmap'ped write-protection there, just make the checks and write
> protect the pte you have in ksm.c. Or am I missing something?

Makes sense: basically if the pte is already wrprotected we have to do
nothing, and if it's writable we only need to care about the current
pte because the mapcount is guaranteed to be 1. That can probably
provide a minor optimization to KSM performance. OTOH, if we ever decide
to merge more than anon pages this won't hold. That said, anon pages
are orders of magnitude simpler to merge than anything else because
they don't belong to other data structures like pagecache; I guess
we'll never merge pagecache given the significant additional
complexities involved, so I'm not against keeping replace_page local
and not walking the anon_vma list at all.

So let us know what you think about keeping rmap_item/tree_item out of
sync, or in sync with mmu notifiers. As said, Izik already did a
preliminary patch with mmu notifier registration. I doubt we want to
invest in that direction unless there's 100% agreement that it is
definitely the way to go, and the expectation that it will make a
substantial difference to KSM users. Minor optimizations that
increase complexity a lot can be left for later.

Thanks for looking into KSM!
Andrea

2009-06-09 04:50:21

by Izik Eidus

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Mon, 08 Jun 2009 23:10:30 +0300
Izik Eidus <[email protected]> wrote:

> Hugh Dickins wrote:
> > On Mon, 8 Jun 2009, Izik Eidus wrote:

> >
> > If you needed to writeprotect pages with mapcount 2, 3, ... then
> > indeed you'd want to do it in rmap.c, and you wouldn't want to
> > specialcase mapcount 1. But I'm saying you don't need to do the
> > writeprotection with mapcount 2, 3, ...; therefore better just
> > to do it (again, no need to special case mapcount 1) in ksm.c.
> >
> > Cutting out a body of code: that's as clean as clean can be.
> >
> >
> >
> Make sense, i will send patch today that merge that code into ksm.c
> (without all the rmap walking)
>
>

How does this look?


From f41b092bee1437f6f436faa74f5da56403f61009 Mon Sep 17 00:00:00 2001
From: Izik Eidus <[email protected]>
Date: Tue, 9 Jun 2009 07:24:25 +0300
Subject: [PATCH] ksm: remove page_wrprotect() from rmap.c

Remove page_wrprotect() from rmap.c and instead embed the needed code
into ksm.c.

Hugh pointed out that for the KSM use case we don't have to walk over the rmap
and write-protect page after page, because when an anonymous page is mapped
more than once it has to be write-protected already, and when it is mapped
just once there is no need to walk over the rmap; we can instead write-protect
it from inside ksm.c.

Thanks.

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/rmap.h | 12 ----
mm/ksm.c | 83 ++++++++++++++++++++++++++----
mm/rmap.c | 139 --------------------------------------------------
3 files changed, 73 insertions(+), 161 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 469376d..350e76d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
}
#endif

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
-#endif
-
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)
@@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
return 0;
}

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-static inline int page_wrprotect(struct page *page, int *odirect_sync,
- int count_offset)
-{
- return 0;
-}
-#endif
-
#endif /* CONFIG_MMU */

/*
diff --git a/mm/ksm.c b/mm/ksm.c
index 74d921b..9fce82b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
#include <linux/swap.h>
#include <linux/rbtree.h>
#include <linux/anon_inodes.h>
+#include <linux/mmu_notifier.h>
#include <linux/ksm.h>

#include <asm/tlbflush.h>
@@ -642,6 +643,75 @@ static inline int pages_identical(struct page *page1, struct page *page2)
return !memcmp_pages(page1, page2);
}

+static inline int write_protect_page(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pte_t orig_pte)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ int swapped;
+ int ret = 1;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out;
+
+ if (!pte_same(*ptep, orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out;
+ }
+
+ if (pte_write(*ptep)) {
+ pte_t entry;
+
+ swapped = PageSwapCache(page);
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ /*
+ * Ok this is tricky, when get_user_pages_fast() run it doesnt
+ * take any lock, therefore the check that we are going to make
+ * with the pagecount against the mapcount is racey and
+ * O_DIRECT can happen right after the check.
+ * So we clear the pte and flush the tlb before the check
+ * this assure us that no O_DIRECT can happen after the check
+ * or in the middle of the check.
+ */
+ entry = ptep_clear_flush(vma, addr, ptep);
+ /*
+ * Check that no O_DIRECT or similar I/O is in progress on the
+ * page
+ */
+ if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
+ set_pte_at_notify(mm, addr, ptep, entry);
+ goto out_unlock;
+ }
+ entry = pte_wrprotect(entry);
+ set_pte_at_notify(mm, addr, ptep, entry);
+ }
+ ret = 0;
+
+out_unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ return ret;
+}
+
/*
* try_to_merge_one_page - take two pages and merge them into one
* @mm: mm_struct that hold vma pointing into oldpage
@@ -661,7 +731,6 @@ static int try_to_merge_one_page(struct mm_struct *mm,
pgprot_t newprot)
{
int ret = 1;
- int odirect_sync;
unsigned long page_addr_in_vma;
pte_t orig_pte, *orig_ptep;

@@ -686,25 +755,19 @@ static int try_to_merge_one_page(struct mm_struct *mm,
goto out_putpage;
/*
* we need the page lock to read a stable PageSwapCache in
- * page_wrprotect().
+ * write_protect_page().
* we use trylock_page() instead of lock_page(), beacuse we dont want to
* wait here, we prefer to continue scanning and merging diffrent pages
* and to come back to this page when it is unlocked.
*/
if (!trylock_page(oldpage))
goto out_putpage;
- /*
- * page_wrprotect check if the page is swapped or in swap cache,
- * in the future we might want to run here if_present_pte and then
- * swap_free
- */
- if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
+
+ if (write_protect_page(oldpage, vma, page_addr_in_vma, orig_pte)) {
unlock_page(oldpage);
goto out_putpage;
}
unlock_page(oldpage);
- if (!odirect_sync)
- goto out_putpage;

orig_pte = pte_wrprotect(orig_pte);

diff --git a/mm/rmap.c b/mm/rmap.c
index f53074c..c3ba0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-
-static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
- int *odirect_sync, int count_offset)
-{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
- pte_t *pte;
- spinlock_t *ptl;
- int ret = 0;
-
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
- if (pte_write(*pte)) {
- pte_t entry;
-
- flush_cache_page(vma, address, pte_pfn(*pte));
- /*
- * Ok this is tricky, when get_user_pages_fast() run it doesnt
- * take any lock, therefore the check that we are going to make
- * with the pagecount against the mapcount is racey and
- * O_DIRECT can happen right after the check.
- * So we clear the pte and flush the tlb before the check
- * this assure us that no O_DIRECT can happen after the check
- * or in the middle of the check.
- */
- entry = ptep_clear_flush(vma, address, pte);
- /*
- * Check that no O_DIRECT or similar I/O is in progress on the
- * page
- */
- if ((page_mapcount(page) + count_offset) != page_count(page)) {
- *odirect_sync = 0;
- set_pte_at_notify(mm, address, pte, entry);
- goto out_unlock;
- }
- entry = pte_wrprotect(entry);
- set_pte_at_notify(mm, address, pte, entry);
- }
- ret = 1;
-
-out_unlock:
- pte_unmap_unlock(pte, ptl);
-out:
- return ret;
-}
-
-static int page_wrprotect_file(struct page *page, int *odirect_sync,
- int count_offset)
-{
- struct address_space *mapping;
- struct prio_tree_iter iter;
- struct vm_area_struct *vma;
- pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- int ret = 0;
-
- mapping = page_mapping(page);
- if (!mapping)
- return ret;
-
- spin_lock(&mapping->i_mmap_lock);
-
- vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
- ret += page_wrprotect_one(page, vma, odirect_sync,
- count_offset);
-
- spin_unlock(&mapping->i_mmap_lock);
-
- return ret;
-}
-
-static int page_wrprotect_anon(struct page *page, int *odirect_sync,
- int count_offset)
-{
- struct vm_area_struct *vma;
- struct anon_vma *anon_vma;
- int ret = 0;
-
- anon_vma = page_lock_anon_vma(page);
- if (!anon_vma)
- return ret;
-
- /*
- * If the page is inside the swap cache, its _count number was
- * increased by one, therefore we have to increase count_offset by one.
- */
- if (PageSwapCache(page))
- count_offset++;
-
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
- ret += page_wrprotect_one(page, vma, odirect_sync,
- count_offset);
-
- page_unlock_anon_vma(anon_vma);
-
- return ret;
-}
-
-/**
- * page_wrprotect - set all ptes pointing to a page as readonly
- * @page: the page to set as readonly
- * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
- * marked as readonly beacuse page_wrprotect_one() was not able
- * to mark this ptes as readonly without opening window to a race
- * with odirect
- * @count_offset: number of times page_wrprotect() caller had called get_page()
- * on the page
- *
- * returns the number of ptes which were marked as readonly.
- * (ptes that were readonly before this function was called are counted as well)
- */
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
-{
- int ret = 0;
-
- /*
- * Page lock is needed for anon pages for the PageSwapCache check,
- * and for page_mapping for filebacked pages
- */
- BUG_ON(!PageLocked(page));
-
- *odirect_sync = 1;
- if (PageAnon(page))
- ret = page_wrprotect_anon(page, odirect_sync, count_offset);
- else
- ret = page_wrprotect_file(page, odirect_sync, count_offset);
-
- return ret;
-}
-EXPORT_SYMBOL(page_wrprotect);
-
-#endif
-
/**
* __page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
--
1.5.6.5

2009-06-09 17:35:33

by Hugh Dickins

Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Tue, 9 Jun 2009, Izik Eidus wrote:
> How does this look like?

It looks about right to me, and how delightful to be rid of that
horrid odirect_sync argument!

For now I'll turn a blind eye to your idiosyncratic return code
convention (0 for success, 1 for failure: we're already schizoid
enough with 0 for failure, 1 for success boolean functions versus
0 for success, -errno for failure functions): you're being consistent
with yourself, let's run through ksm.c at a later date to sort that out.

One improvement to make now, though: you've elsewhere avoided
the pgd,pud,pmd,pte descent in ksm.c (using get_pte instead), and
page_check_address() is not static to rmap.c (filemap_xip wanted it),
so please continue to use that. It's not exported, right, but I think
Chris was already decisive that we should abandon modular KSM, yes?

So I think slip in a patch before this one to remove the modular
option, so as not to generate traffic from people who build all
modular kernels. Or if you think the modular option should remain,
append an EXPORT_SYMBOL to page_check_address().
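Concretely, the page_check_address() suggestion would replace the open-coded pgd/pud/pmd/pte descent at the top of write_protect_page() with something like the following sketch (only the lookup changes; the rest of the function stays as posted):

	pte_t *ptep;
	spinlock_t *ptl;

	/* page_check_address() returns the mapped pte for this page in this
	 * mm, already locked, or NULL if it is no longer mapped there */
	ptep = page_check_address(page, mm, addr, &ptl, 0);
	if (!ptep)
		goto out;

	if (!pte_same(*ptep, orig_pte)) {
		pte_unmap_unlock(ptep, ptl);
		goto out;
	}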

Thanks,
Hugh

>
> From f41b092bee1437f6f436faa74f5da56403f61009 Mon Sep 17 00:00:00 2001
> From: Izik Eidus <[email protected]>
> Date: Tue, 9 Jun 2009 07:24:25 +0300
> Subject: [PATCH] ksm: remove page_wrprotect() from rmap.c
>
> Remove page_wrprotect() from rmap.c and instead embedded the needed code
> into ksm.c
>
> Hugh pointed out that for the ksm usage case, we dont have to walk over the rmap
> and to write protected page after page beacuse when Anonymous page is mapped
> more than once, it have to be write protected already, and in a case that it
> mapped just once, no need to walk over the rmap, we can instead write protect
> it from inside ksm.c.
>
> Thanks.
>
> Signed-off-by: Izik Eidus <[email protected]>
> ---
> include/linux/rmap.h | 12 ----
> mm/ksm.c | 83 ++++++++++++++++++++++++++----
> mm/rmap.c | 139 --------------------------------------------------
> 3 files changed, 73 insertions(+), 161 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 469376d..350e76d 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
> }
> #endif
>
> -#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
> -int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
> -#endif
> -
> #else /* !CONFIG_MMU */
>
> #define anon_vma_init() do {} while (0)
> @@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
> return 0;
> }
>
> -#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
> -static inline int page_wrprotect(struct page *page, int *odirect_sync,
> - int count_offset)
> -{
> - return 0;
> -}
> -#endif
> -
> #endif /* CONFIG_MMU */
>
> /*
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 74d921b..9fce82b 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -37,6 +37,7 @@
> #include <linux/swap.h>
> #include <linux/rbtree.h>
> #include <linux/anon_inodes.h>
> +#include <linux/mmu_notifier.h>
> #include <linux/ksm.h>
>
> #include <asm/tlbflush.h>
> @@ -642,6 +643,75 @@ static inline int pages_identical(struct page *page1, struct page *page2)
> return !memcmp_pages(page1, page2);
> }
>
> +static inline int write_protect_page(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + pte_t orig_pte)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + int swapped;
> + int ret = 1;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!ptep)
> + goto out;
> +
> + if (!pte_same(*ptep, orig_pte)) {
> + pte_unmap_unlock(ptep, ptl);
> + goto out;
> + }
> +
> + if (pte_write(*ptep)) {
> + pte_t entry;
> +
> + swapped = PageSwapCache(page);
> + flush_cache_page(vma, addr, page_to_pfn(page));
> + /*
> + * Ok this is tricky, when get_user_pages_fast() run it doesnt
> + * take any lock, therefore the check that we are going to make
> + * with the pagecount against the mapcount is racey and
> + * O_DIRECT can happen right after the check.
> + * So we clear the pte and flush the tlb before the check
> + * this assure us that no O_DIRECT can happen after the check
> + * or in the middle of the check.
> + */
> + entry = ptep_clear_flush(vma, addr, ptep);
> + /*
> + * Check that no O_DIRECT or similar I/O is in progress on the
> + * page
> + */
> + if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
> + set_pte_at_notify(mm, addr, ptep, entry);
> + goto out_unlock;
> + }
> + entry = pte_wrprotect(entry);
> + set_pte_at_notify(mm, addr, ptep, entry);
> + }
> + ret = 0;
> +
> +out_unlock:
> + pte_unmap_unlock(ptep, ptl);
> +out:
> + return ret;
> +}
> +
> /*
> * try_to_merge_one_page - take two pages and merge them into one
> * @mm: mm_struct that hold vma pointing into oldpage
> @@ -661,7 +731,6 @@ static int try_to_merge_one_page(struct mm_struct *mm,
> pgprot_t newprot)
> {
> int ret = 1;
> - int odirect_sync;
> unsigned long page_addr_in_vma;
> pte_t orig_pte, *orig_ptep;
>
> @@ -686,25 +755,19 @@ static int try_to_merge_one_page(struct mm_struct *mm,
> goto out_putpage;
> /*
> * we need the page lock to read a stable PageSwapCache in
> - * page_wrprotect().
> + * write_protect_page().
> * we use trylock_page() instead of lock_page(), beacuse we dont want to
> * wait here, we prefer to continue scanning and merging diffrent pages
> * and to come back to this page when it is unlocked.
> */
> if (!trylock_page(oldpage))
> goto out_putpage;
> - /*
> - * page_wrprotect check if the page is swapped or in swap cache,
> - * in the future we might want to run here if_present_pte and then
> - * swap_free
> - */
> - if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
> +
> + if (write_protect_page(oldpage, vma, page_addr_in_vma, orig_pte)) {
> unlock_page(oldpage);
> goto out_putpage;
> }
> unlock_page(oldpage);
> - if (!odirect_sync)
> - goto out_putpage;
>
> orig_pte = pte_wrprotect(orig_pte);
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f53074c..c3ba0b9 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
> }
> EXPORT_SYMBOL_GPL(page_mkclean);
>
> -#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
> -
> -static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
> - int *odirect_sync, int count_offset)
> -{
> - struct mm_struct *mm = vma->vm_mm;
> - unsigned long address;
> - pte_t *pte;
> - spinlock_t *ptl;
> - int ret = 0;
> -
> - address = vma_address(page, vma);
> - if (address == -EFAULT)
> - goto out;
> -
> - pte = page_check_address(page, mm, address, &ptl, 0);
> - if (!pte)
> - goto out;
> -
> - if (pte_write(*pte)) {
> - pte_t entry;
> -
> - flush_cache_page(vma, address, pte_pfn(*pte));
> - /*
> - * Ok this is tricky, when get_user_pages_fast() run it doesnt
> - * take any lock, therefore the check that we are going to make
> - * with the pagecount against the mapcount is racey and
> - * O_DIRECT can happen right after the check.
> - * So we clear the pte and flush the tlb before the check
> - * this assure us that no O_DIRECT can happen after the check
> - * or in the middle of the check.
> - */
> - entry = ptep_clear_flush(vma, address, pte);
> - /*
> - * Check that no O_DIRECT or similar I/O is in progress on the
> - * page
> - */
> - if ((page_mapcount(page) + count_offset) != page_count(page)) {
> - *odirect_sync = 0;
> - set_pte_at_notify(mm, address, pte, entry);
> - goto out_unlock;
> - }
> - entry = pte_wrprotect(entry);
> - set_pte_at_notify(mm, address, pte, entry);
> - }
> - ret = 1;
> -
> -out_unlock:
> - pte_unmap_unlock(pte, ptl);
> -out:
> - return ret;
> -}
> -
> -static int page_wrprotect_file(struct page *page, int *odirect_sync,
> - int count_offset)
> -{
> - struct address_space *mapping;
> - struct prio_tree_iter iter;
> - struct vm_area_struct *vma;
> - pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> - int ret = 0;
> -
> - mapping = page_mapping(page);
> - if (!mapping)
> - return ret;
> -
> - spin_lock(&mapping->i_mmap_lock);
> -
> - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
> - ret += page_wrprotect_one(page, vma, odirect_sync,
> - count_offset);
> -
> - spin_unlock(&mapping->i_mmap_lock);
> -
> - return ret;
> -}
> -
> -static int page_wrprotect_anon(struct page *page, int *odirect_sync,
> - int count_offset)
> -{
> - struct vm_area_struct *vma;
> - struct anon_vma *anon_vma;
> - int ret = 0;
> -
> - anon_vma = page_lock_anon_vma(page);
> - if (!anon_vma)
> - return ret;
> -
> - /*
> - * If the page is inside the swap cache, its _count number was
> - * increased by one, therefore we have to increase count_offset by one.
> - */
> - if (PageSwapCache(page))
> - count_offset++;
> -
> - list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
> - ret += page_wrprotect_one(page, vma, odirect_sync,
> - count_offset);
> -
> - page_unlock_anon_vma(anon_vma);
> -
> - return ret;
> -}
> -
> -/**
> - * page_wrprotect - set all ptes pointing to a page as readonly
> - * @page: the page to set as readonly
> - * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
> - * marked as readonly beacuse page_wrprotect_one() was not able
> - * to mark this ptes as readonly without opening window to a race
> - * with odirect
> - * @count_offset: number of times page_wrprotect() caller had called get_page()
> - * on the page
> - *
> - * returns the number of ptes which were marked as readonly.
> - * (ptes that were readonly before this function was called are counted as well)
> - */
> -int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
> -{
> - int ret = 0;
> -
> - /*
> - * Page lock is needed for anon pages for the PageSwapCache check,
> - * and for page_mapping for filebacked pages
> - */
> - BUG_ON(!PageLocked(page));
> -
> - *odirect_sync = 1;
> - if (PageAnon(page))
> - ret = page_wrprotect_anon(page, odirect_sync, count_offset);
> - else
> - ret = page_wrprotect_file(page, odirect_sync, count_offset);
> -
> - return ret;
> -}
> -EXPORT_SYMBOL(page_wrprotect);
> -
> -#endif
> -
> /**
> * __page_set_anon_rmap - setup new anonymous rmap
> * @page: the page to add the mapping to
> --
> 1.5.6.5

2009-06-09 19:38:23

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Tue, 9 Jun 2009, Hugh Dickins wrote:
> On Tue, 9 Jun 2009, Izik Eidus wrote:
> > How does this look like?
>
> One improvement to make now, though: you've elsewhere avoided
> the pgd,pud,pmd,pte descent in ksm.c (using get_pte instead), and
> page_check_address() is not static to rmap.c (filemap_xip wanted it),
> so please continue to use that. It's not exported, right, but I think
> Chris was already decisive that we should abandon modular KSM, yes?

I think you can simplify it further, can't you? Isn't the get_pte()
preamble in try_to_merge_one_page() just unnecessary overhead now? See
untested code below. Or even move the trylock/unlock of the page into
write_protect_page if you prefer. Later on we'll uninline rmap.c's
vma_address() so you can use it instead of your addr_in_vma() copy.

Hugh

static inline int write_protect_page(struct page *page,
struct vm_area_struct *vma,
pte_t *orig_pte)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long addr;
pte_t *ptep;
spinlock_t *ptl;
int swapped;
int ret = 1;

addr = addr_in_vma(vma, page);
if (addr == -EFAULT)
goto out;

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
goto out;

if (pte_write(*ptep)) {
pte_t entry;

swapped = PageSwapCache(page);
flush_cache_page(vma, addr, page_to_pfn(page));
/*
* Ok this is tricky, when get_user_pages_fast() run it doesnt
* take any lock, therefore the check that we are going to make
* with the pagecount against the mapcount is racey and
* O_DIRECT can happen right after the check.
* So we clear the pte and flush the tlb before the check
* this assure us that no O_DIRECT can happen after the check
* or in the middle of the check.
*/
entry = ptep_clear_flush(vma, addr, ptep);
/*
* Check that no O_DIRECT or similar I/O is in progress on the
* page
*/
if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
set_pte_at_notify(mm, addr, ptep, entry);
goto out_unlock;
}
entry = pte_wrprotect(entry);
set_pte_at_notify(mm, addr, ptep, entry);
*orig_pte = *ptep;
}
ret = 0;

out_unlock:
pte_unmap_unlock(ptep, ptl);
out:
return ret;
}

/*
* try_to_merge_one_page - take two pages and merge them into one
* @mm: mm_struct that hold vma pointing into oldpage
* @vma: the vma that hold the pte pointing into oldpage
* @oldpage: the page that we want to replace with newpage
* @newpage: the page that we want to map instead of oldpage
* @newprot: the new permission of the pte inside vma
* note:
* oldpage should be anon page while newpage should be file mapped page
*
* this function return 0 if the pages were merged, 1 otherwise.
*/
static int try_to_merge_one_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct page *oldpage,
struct page *newpage,
pgprot_t newprot)
{
int ret = 1;
pte_t orig_pte;

if (!PageAnon(oldpage))
goto out;

get_page(newpage);
get_page(oldpage);

/*
* we need the page lock to read a stable PageSwapCache in
* write_protect_page().
* we use trylock_page() instead of lock_page(), beacuse we dont want to
* wait here, we prefer to continue scanning and merging diffrent pages
* and to come back to this page when it is unlocked.
*/
if (!trylock_page(oldpage))
goto out_putpage;

if (write_protect_page(oldpage, vma, &orig_pte)) {
unlock_page(oldpage);
goto out_putpage;
}
unlock_page(oldpage);

if (pages_identical(oldpage, newpage))
ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);

out_putpage:
put_page(oldpage);
put_page(newpage);
out:
return ret;
}
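
(To spell out the count check in write_protect_page(): the expectation, as I
read it, is roughly

	page_count(page) == page_mapcount(page)   /* one reference per pte mapping */
			    + swapped             /* one more if it sits in the swap cache */
			    + 2                   /* the two references KSM itself holds on
						     oldpage, hence count_offset == 2 in the
						     old page_wrprotect() call */

and anything above that means an O_DIRECT-style get_user_pages() user still
holds a transient reference, so we back out rather than merge.)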

2009-06-10 06:30:27

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Tue, 9 Jun 2009 20:27:29 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> On Tue, 9 Jun 2009, Hugh Dickins wrote:
> > On Tue, 9 Jun 2009, Izik Eidus wrote:
> > > How does this look like?
> >
> > One improvement to make now, though: you've elsewhere avoided
> > the pgd,pud,pmd,pte descent in ksm.c (using get_pte instead), and
> > page_check_address() is not static to rmap.c (filemap_xip wanted
> > it), so please continue to use that. It's not exported, right, but
> > I think Chris was already decisive that we should abandon modular
> > KSM, yes?
>
> I think you can simplify it further, can't you? Isn't the get_pte()
> preamble in try_to_merge_one_page() just unnecessary overhead now?
> See untested code below. Or even move the trylock/unlock of the page
> into write_protect_page if you prefer. Later on we'll uninline
> rmap.c's vma_address() so you can use it instead of your
> addr_in_vma() copy.
>
> Hugh


Great! What do you think about the below? Is there anything else we want to
add, or should we start sending it to Andrew?

Btw, may I add your Signed-off-by to this patch?
(The only things I changed were moving the *orig_pte = *ptep assignment down
out of the pte_write() block, so we also merge pages that are already
write-protected, and adding orig_pte = __pte(0) to avoid the annoying
warning about it being used uninitialized.)

>From 7304f4404d91a40e234b0530de6d3bfc8c5925a2 Mon Sep 17 00:00:00 2001
From: Izik Eidus <[email protected]>
Date: Tue, 9 Jun 2009 21:00:55 +0300
Subject: [PATCH 1/2] ksm: remove ksm from being a module.

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/mm.h | 2 +-
include/linux/rmap.h | 4 ++--
mm/Kconfig | 5 ++---
mm/memory.c | 2 +-
4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e617bab..cdc08d2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1258,7 +1258,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
int replace_page(struct vm_area_struct *vma, struct page *oldpage,
struct page *newpage, pte_t orig_pte, pgprot_t prot);
#endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 469376d..939c171 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,7 +118,7 @@ static inline int try_to_munlock(struct page *page)
}
#endif

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
#endif

@@ -136,7 +136,7 @@ static inline int page_mkclean(struct page *page)
return 0;
}

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
static inline int page_wrprotect(struct page *page, int *odirect_sync,
int count_offset)
{
diff --git a/mm/Kconfig b/mm/Kconfig
index 5ebfd18..e7c118f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -227,10 +227,9 @@ config MMU_NOTIFIER
bool

config KSM
- tristate "Enable KSM for page sharing"
+ bool "Enable KSM for page sharing"
help
- Enable the KSM kernel module to allow page sharing of equal pages
- among different tasks.
+ Enable KSM to allow page sharing of equal pages among different tasks.

config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
diff --git a/mm/memory.c b/mm/memory.c
index 8b4e40e..e23d4dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1617,7 +1617,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
}
EXPORT_SYMBOL(vm_insert_mixed);

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)

/**
* replace_page - replace page in vma with new page
--
1.5.6.5





>From 3d9975dea43ae848f21875b5f99accecf1366765 Mon Sep 17 00:00:00 2001
From: Izik Eidus <[email protected]>
Date: Wed, 10 Jun 2009 09:16:26 +0300
Subject: [PATCH 2/2] ksm: remove page_wrprotect() from rmap.c

Remove page_wrprotect() from rmap.c and instead embed the needed code
into ksm.c.

Hugh pointed out that for the ksm use case we don't have to walk the rmap
and write-protect page after page, because when an anonymous page is mapped
more than once it must already be write protected, and when it is mapped
just once there is no need to walk the rmap: we can write protect it from
inside ksm.c instead.

Thanks.

Signed-off-by: Izik Eidus <[email protected]>
---
include/linux/rmap.h | 12 ----
mm/ksm.c | 86 +++++++++++++++++++++----------
mm/rmap.c | 139 --------------------------------------------------
3 files changed, 59 insertions(+), 178 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 939c171..350e76d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
}
#endif

-#if defined(CONFIG_KSM)
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
-#endif
-
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)
@@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
return 0;
}

-#if defined(CONFIG_KSM)
-static inline int page_wrprotect(struct page *page, int *odirect_sync,
- int count_offset)
-{
- return 0;
-}
-#endif
-
#endif /* CONFIG_MMU */

/*
diff --git a/mm/ksm.c b/mm/ksm.c
index 74d921b..9d4be62 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
#include <linux/swap.h>
#include <linux/rbtree.h>
#include <linux/anon_inodes.h>
+#include <linux/mmu_notifier.h>
#include <linux/ksm.h>

#include <asm/tlbflush.h>
@@ -642,6 +643,60 @@ static inline int pages_identical(struct page *page1, struct page *page2)
return !memcmp_pages(page1, page2);
}

+static inline int write_protect_page(struct page *page,
+ struct vm_area_struct *vma,
+ pte_t *orig_pte)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ int swapped;
+ int ret = 1;
+
+ addr = addr_in_vma(vma, page);
+ if (addr == -EFAULT)
+ goto out;
+
+ ptep = page_check_address(page, mm, addr, &ptl, 0);
+ if (!ptep)
+ goto out;
+
+ if (pte_write(*ptep)) {
+ pte_t entry;
+
+ swapped = PageSwapCache(page);
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ /*
+ * Ok this is tricky, when get_user_pages_fast() run it doesnt
+ * take any lock, therefore the check that we are going to make
+ * with the pagecount against the mapcount is racey and
+ * O_DIRECT can happen right after the check.
+ * So we clear the pte and flush the tlb before the check
+ * this assure us that no O_DIRECT can happen after the check
+ * or in the middle of the check.
+ */
+ entry = ptep_clear_flush(vma, addr, ptep);
+ /*
+ * Check that no O_DIRECT or similar I/O is in progress on the
+ * page
+ */
+ if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
+ set_pte_at_notify(mm, addr, ptep, entry);
+ goto out_unlock;
+ }
+ entry = pte_wrprotect(entry);
+ set_pte_at_notify(mm, addr, ptep, entry);
+ }
+ *orig_pte = *ptep;
+ ret = 0;
+
+out_unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ return ret;
+}
+
/*
* try_to_merge_one_page - take two pages and merge them into one
* @mm: mm_struct that hold vma pointing into oldpage
@@ -661,9 +716,7 @@ static int try_to_merge_one_page(struct mm_struct *mm,
pgprot_t newprot)
{
int ret = 1;
- int odirect_sync;
- unsigned long page_addr_in_vma;
- pte_t orig_pte, *orig_ptep;
+ pte_t orig_pte = __pte(0);

if (!PageAnon(oldpage))
goto out;
@@ -671,42 +724,21 @@ static int try_to_merge_one_page(struct mm_struct *mm,
get_page(newpage);
get_page(oldpage);

- page_addr_in_vma = addr_in_vma(vma, oldpage);
- if (page_addr_in_vma == -EFAULT)
- goto out_putpage;
-
- orig_ptep = get_pte(mm, page_addr_in_vma);
- if (!orig_ptep)
- goto out_putpage;
- orig_pte = *orig_ptep;
- pte_unmap(orig_ptep);
- if (!pte_present(orig_pte))
- goto out_putpage;
- if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
- goto out_putpage;
/*
* we need the page lock to read a stable PageSwapCache in
- * page_wrprotect().
+ * write_protect_page().
* we use trylock_page() instead of lock_page(), beacuse we dont want to
* wait here, we prefer to continue scanning and merging diffrent pages
* and to come back to this page when it is unlocked.
*/
if (!trylock_page(oldpage))
goto out_putpage;
- /*
- * page_wrprotect check if the page is swapped or in swap cache,
- * in the future we might want to run here if_present_pte and then
- * swap_free
- */
- if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
+
+ if (write_protect_page(oldpage, vma, &orig_pte)) {
unlock_page(oldpage);
goto out_putpage;
}
unlock_page(oldpage);
- if (!odirect_sync)
- goto out_putpage;
-
- orig_pte = pte_wrprotect(orig_pte);

if (pages_identical(oldpage, newpage))
ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
diff --git a/mm/rmap.c b/mm/rmap.c
index f53074c..c3ba0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);

-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-
-static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
- int *odirect_sync, int count_offset)
-{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
- pte_t *pte;
- spinlock_t *ptl;
- int ret = 0;
-
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
- if (pte_write(*pte)) {
- pte_t entry;
-
- flush_cache_page(vma, address, pte_pfn(*pte));
- /*
- * Ok this is tricky, when get_user_pages_fast() run it doesnt
- * take any lock, therefore the check that we are going to make
- * with the pagecount against the mapcount is racey and
- * O_DIRECT can happen right after the check.
- * So we clear the pte and flush the tlb before the check
- * this assure us that no O_DIRECT can happen after the check
- * or in the middle of the check.
- */
- entry = ptep_clear_flush(vma, address, pte);
- /*
- * Check that no O_DIRECT or similar I/O is in progress on the
- * page
- */
- if ((page_mapcount(page) + count_offset) != page_count(page)) {
- *odirect_sync = 0;
- set_pte_at_notify(mm, address, pte, entry);
- goto out_unlock;
- }
- entry = pte_wrprotect(entry);
- set_pte_at_notify(mm, address, pte, entry);
- }
- ret = 1;
-
-out_unlock:
- pte_unmap_unlock(pte, ptl);
-out:
- return ret;
-}
-
-static int page_wrprotect_file(struct page *page, int *odirect_sync,
- int count_offset)
-{
- struct address_space *mapping;
- struct prio_tree_iter iter;
- struct vm_area_struct *vma;
- pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- int ret = 0;
-
- mapping = page_mapping(page);
- if (!mapping)
- return ret;
-
- spin_lock(&mapping->i_mmap_lock);
-
- vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
- ret += page_wrprotect_one(page, vma, odirect_sync,
- count_offset);
-
- spin_unlock(&mapping->i_mmap_lock);
-
- return ret;
-}
-
-static int page_wrprotect_anon(struct page *page, int *odirect_sync,
- int count_offset)
-{
- struct vm_area_struct *vma;
- struct anon_vma *anon_vma;
- int ret = 0;
-
- anon_vma = page_lock_anon_vma(page);
- if (!anon_vma)
- return ret;
-
- /*
- * If the page is inside the swap cache, its _count number was
- * increased by one, therefore we have to increase count_offset by one.
- */
- if (PageSwapCache(page))
- count_offset++;
-
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
- ret += page_wrprotect_one(page, vma, odirect_sync,
- count_offset);
-
- page_unlock_anon_vma(anon_vma);
-
- return ret;
-}
-
-/**
- * page_wrprotect - set all ptes pointing to a page as readonly
- * @page: the page to set as readonly
- * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
- * marked as readonly beacuse page_wrprotect_one() was not able
- * to mark this ptes as readonly without opening window to a race
- * with odirect
- * @count_offset: number of times page_wrprotect() caller had called get_page()
- * on the page
- *
- * returns the number of ptes which were marked as readonly.
- * (ptes that were readonly before this function was called are counted as well)
- */
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
-{
- int ret = 0;
-
- /*
- * Page lock is needed for anon pages for the PageSwapCache check,
- * and for page_mapping for filebacked pages
- */
- BUG_ON(!PageLocked(page));
-
- *odirect_sync = 1;
- if (PageAnon(page))
- ret = page_wrprotect_anon(page, odirect_sync, count_offset);
- else
- ret = page_wrprotect_file(page, odirect_sync, count_offset);
-
- return ret;
-}
-EXPORT_SYMBOL(page_wrprotect);
-
-#endif
-
/**
* __page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
--
1.5.6.5




>
> static inline int write_protect_page(struct page *page,
> struct vm_area_struct *vma,
> pte_t *orig_pte)
> {
> struct mm_struct *mm = vma->vm_mm;
> unsigned long addr;
> pte_t *ptep;
> spinlock_t *ptl;
> int swapped;
> int ret = 1;
>
> addr = addr_in_vma(vma, page);
> if (addr == -EFAULT)
> goto out;
>
> ptep = page_check_address(page, mm, addr, &ptl, 0);
> if (!ptep)
> goto out;
>
> if (pte_write(*ptep)) {
> pte_t entry;
>
> swapped = PageSwapCache(page);
> flush_cache_page(vma, addr, page_to_pfn(page));
> /*
> * Ok this is tricky, when get_user_pages_fast() run it doesnt
> * take any lock, therefore the check that we are going to make
> * with the pagecount against the mapcount is racey and
> * O_DIRECT can happen right after the check.
> * So we clear the pte and flush the tlb before the check
> * this assure us that no O_DIRECT can happen after the check
> * or in the middle of the check.
> */
> entry = ptep_clear_flush(vma, addr, ptep);
> /*
> * Check that no O_DIRECT or similar I/O is in progress on the
> * page
> */
> if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
> set_pte_at_notify(mm, addr, ptep, entry);
> goto out_unlock;
> }
> entry = pte_wrprotect(entry);
> set_pte_at_notify(mm, addr, ptep, entry);
> *orig_pte = *ptep;
> }
> ret = 0;
>
> out_unlock:
> pte_unmap_unlock(ptep, ptl);
> out:
> return ret;
> }
>
> /*
> * try_to_merge_one_page - take two pages and merge them into one
> * @mm: mm_struct that hold vma pointing into oldpage
> * @vma: the vma that hold the pte pointing into oldpage
> * @oldpage: the page that we want to replace with newpage
> * @newpage: the page that we want to map instead of oldpage
> * @newprot: the new permission of the pte inside vma
> * note:
> * oldpage should be anon page while newpage should be file mapped page
> *
> * this function return 0 if the pages were merged, 1 otherwise.
> */
> static int try_to_merge_one_page(struct mm_struct *mm,
> struct vm_area_struct *vma,
> struct page *oldpage,
> struct page *newpage,
> pgprot_t newprot)
> {
> int ret = 1;
> pte_t orig_pte;
>
> if (!PageAnon(oldpage))
> goto out;
>
> get_page(newpage);
> get_page(oldpage);
>
> /*
> * we need the page lock to read a stable PageSwapCache in
> * write_protect_page().
> * we use trylock_page() instead of lock_page(), beacuse we dont want to
> * wait here, we prefer to continue scanning and merging diffrent pages
> * and to come back to this page when it is unlocked.
> */
> if (!trylock_page(oldpage))
> goto out_putpage;
>
> if (write_protect_page(oldpage, vma, &orig_pte)) {
> unlock_page(oldpage);
> goto out_putpage;
> }
> unlock_page(oldpage);
>
> if (pages_identical(oldpage, newpage))
> ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
>
> out_putpage:
> put_page(oldpage);
> put_page(newpage);
> out:
> return ret;
> }

2009-06-11 16:59:06

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

On Wed, 10 Jun 2009, Izik Eidus wrote:
>
> Great! What do you think about the below?

A couple of things: please use #ifdef CONFIG_KSM throughout
now that you can, rather than #if defined(CONFIG_KSM).

And we ought to add a comment above the write_protect_page() call:
/*
* If this anonymous page is mapped only here, its pte may need
* to be write-protected. If it's mapped elsewhere, all its
* ptes are necessarily already write-protected. In either
* case, we need to lock and check page_count is not raised.
*/
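
So in try_to_merge_one_page() the call site would then read something like
this (just a sketch, pasting that comment into the version from earlier in
this thread):

	if (!trylock_page(oldpage))
		goto out_putpage;
	/*
	 * If this anonymous page is mapped only here, its pte may need
	 * to be write-protected.  If it's mapped elsewhere, all its
	 * ptes are necessarily already write-protected.  In either
	 * case, we need to lock and check page_count is not raised.
	 */
	if (write_protect_page(oldpage, vma, &orig_pte)) {
		unlock_page(oldpage);
		goto out_putpage;
	}
	unlock_page(oldpage);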

> Is there anything else we want to add, or should we start sending it to Andrew?

Umm, umm, umm, let's hold off sending Andrew for now. It would be
a bit silly to send him 2/2, when what's actually required is to
withdraw ksm-add-page_wrprotect-write-protecting-page.patch,
providing adjustments as necessary to the other patches.

But I don't want us fiddling with Andrew's collection every day
or two (forget my earlier scan fix if you prefer, if you're sure
you caught all of that in what you have now); and I rather doubt
mmotm's KSM is getting much testing at present. It would be nice
to get the withdrawn rmap.c changes out of everybody else's face,
but not worth messing mmotm around for that right now.

Whilst we can't let our divergence drift on very long, I think
it's better we give him a new collection of patches once we've
more or less settled the madvise switchover. That new collection
won't contain much in mm/rmap.c if at all.

Unless you or Andrew feel strongly that I'm wrong on this.

>
> Btw, may I add your Signed-off-by to this patch?

If you've tested and found it works, please, by all means.
If you've not tested, or found it broken, then I deny all
knowledge of these changes ;) But it's hypothetical: the
replacement collection to Andrew won't contain either of
these patches as such.

> (The only things I changed were moving the *orig_pte = *ptep assignment
> down out of the pte_write() block, so we also merge pages that are already
> write-protected, and adding orig_pte = __pte(0) to avoid the annoying
> warning about it being used uninitialized.)

Okay. We do have a macro to annotate such pacifying initializations,
but I've not used it before, and forget what it is or where to look
for an example, I don't see it in kernel.h or compiler.h. Maybe
Andrew will chime in and remind us.

Hugh

2009-06-12 21:50:01

by Izik Eidus

[permalink] [raw]
Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

Hugh Dickins wrote:
>
> Okay. We do have a macro to annotate such pacifying initializations,
> but I've not used it before, and forget what it is or where to look
> for an example, I don't see it in kernel.h or compiler.h. Maybe
> Andrew will chime in and remind us.
>
> Hugh
>
I have looked at compiler.h - it has a few things that deal with
warnings, but I didn't find anything that fits our case....

Does anyone know where this thing is found?

2009-06-13 15:05:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise

Hi Andrea,

On Tue, 9 Jun 2009, Andrea Arcangeli wrote:
>
> So let us know what you think about the rmap_item/tree_item out of
> sync, or in sync with mmu notifier. As said Izik already did a
> preliminary patch with mmu notifier registration. I doubt we want to
> invest in that direction unless there's 100% agreement that it is
> definitely the way to go, and the expectation that it will make a
> substantial difference to the KSM users. Minor optimizations that
> increase complexity a lot, can be left for later.

Thanks for your detailed mail, of which this is merely the final
paragraph. Thought I'd better respond with a little reassurance,
though I'm not yet ready to write in detail.

I agree 100% that KSM is entitled to be as "lazy" about clearing
up pages as it is about merging them in the first place: you're
absolutely right to avoid the unnecessary overhead of keeping
strictly in synch, and I recognize the lock ordering problems
that keeping strictly in synch would be likely to entail.

My remarks about "lost" pages came from the belief that operations
upon the vmas could move pages to where they thereafter escaped
the attention of KSM's scans for an _indefinite_ period (until
the process exited or even after): that's what has worried me,
but I've yet to demonstrate such a problem, and the rework
naturally changes what happens here.

So, rest assured, I'm not wanting to impose a stricter discipline and
tighter linkage, unless it's to fix a proven indefinite discrepancy.

Hugh