2012-06-07 17:01:43

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 0/3] uprobes fixes for 3.5

Hello,

This doesn't depend on other uprobes patches I sent, and I think
this is 3.5 material. However, please remember I do not understand
vm, iow please review.

write_opcode() becomes really ugly after 3/3, I'll try to cleanup
it later. In fact I think it needs some cleanups even without this
fix.

Oleg.


2012-06-07 17:02:30

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB

__replace_page() obviously can't work with the hugetlbfs mappings,
uprobe_register() will likely crash the kernel. Change valid_vma()
to check VM_HUGETLB as well.

As for PageTransHuge() no need to worry, vma->vm_file != NULL.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/events/uprobes.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6e3b181..48d53af 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -105,7 +105,8 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
if (!is_register)
return true;

- if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC))
+ if ((vma->vm_flags & (VM_HUGETLB|VM_READ|VM_WRITE|VM_EXEC|VM_SHARED))
+ == (VM_READ|VM_EXEC))
return true;

return false;
--
1.5.5.1

2012-06-07 17:02:36

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL

__copy_insn() blindly calls read_mapping_page(), this will crash
the kernel if ->readpage == NULL, add the necessary check. For
example, hugetlbfs_aops->readpage is NULL. Perhaps we should change
read_mapping_page() instead.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/events/uprobes.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 48d53af..9c53bc2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -616,6 +616,8 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins

if (!filp)
return -EINVAL;
+ if (!mapping->a_ops->readpage)
+ return -EIO;

idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT);
off1 = offset &= ~PAGE_MASK;
--
1.5.5.1

2012-06-07 17:02:44

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

write_opcode() gets old_page via get_user_pages() and then calls
__replace_page() which assumes that this old_page is still mapped
after pte_offset_map_lock().

This is not true if this old_page was already try_to_unmap()'ed,
and in this case everything __replace_page() does with old_page
is wrong. Just for example, put_page() is not balanced.

I think it is possible to teach __replace_page() to handle this
unlikely case correctly, but this patch simply changes it to use
page_check_address() and return -EAGAIN if it fails. The caller
should notice this error code and retry.

Note: write_opcode() asks for the cleanups, I'll try to do this
in a separate patch.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/events/uprobes.c | 34 +++++++++-------------------------
1 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9c53bc2..b3e8d5b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage)
{
struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
- spinlock_t *ptl;
unsigned long addr;
- int err = -EFAULT;
+ spinlock_t *ptl;
+ pte_t *ptep;

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
- goto out;
-
- pgd = pgd_offset(mm, addr);
- if (!pgd_present(*pgd))
- goto out;
-
- pud = pud_offset(pgd, addr);
- if (!pud_present(*pud))
- goto out;
-
- pmd = pmd_offset(pud, addr);
- if (!pmd_present(*pmd))
- goto out;
+ return -EFAULT;

- ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
- goto out;
+ return -EAGAIN;

get_page(kpage);
page_add_new_anon_rmap(kpage, vma, addr);
@@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct
try_to_free_swap(page);
put_page(page);
pte_unmap_unlock(ptep, ptl);
- err = 0;

-out:
- return err;
+ return 0;
}

/**
@@ -230,7 +212,7 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
struct uprobe *uprobe;
loff_t addr;
int ret;
-
+retry:
/* Read the page with vaddr into memory */
ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma);
if (ret <= 0)
@@ -297,6 +279,8 @@ unlock_out:
put_out:
put_page(old_page);

+ if (unlikely(ret == -EAGAIN))
+ goto retry;
return ret;
}

--
1.5.5.1

2012-06-08 08:47:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

On Thu, 2012-06-07 at 19:00 +0200, Oleg Nesterov wrote:
> write_opcode() gets old_page via get_user_pages() and then calls
> __replace_page() which assumes that this old_page is still mapped
> after pte_offset_map_lock().
>
> This is not true if this old_page was already try_to_unmap()'ed,
> and in this case everything __replace_page() does with old_page
> is wrong. Just for example, put_page() is not balanced.
>
> I think it is possible to teach __replace_page() to handle this
> unlikely case correctly, but this patch simply changes it to use
> page_check_address() and return -EAGAIN if it fails. The caller
> should notice this error code and retry.

Note that replace_page() was nicked from ksm, does that suffer a similar
problem?

2012-06-08 10:06:04

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

On 06/08, Peter Zijlstra wrote:
>
> On Thu, 2012-06-07 at 19:00 +0200, Oleg Nesterov wrote:
> > write_opcode() gets old_page via get_user_pages() and then calls
> > __replace_page() which assumes that this old_page is still mapped
> > after pte_offset_map_lock().
> >
> > This is not true if this old_page was already try_to_unmap()'ed,
> > and in this case everything __replace_page() does with old_page
> > is wrong. Just for example, put_page() is not balanced.
> >
> > I think it is possible to teach __replace_page() to handle this
> > unlikely case correctly, but this patch simply changes it to use
> > page_check_address() and return -EAGAIN if it fails. The caller
> > should notice this error code and retry.
>
> Note that replace_page() was nicked from ksm, does that suffer a similar
> problem?

Yes, I looked at replace_page() too. Afaics it looks fine, it does the
additional pte_same(pte, orig_pte) check.

Oleg.

2012-06-08 14:06:23

by Oleg Nesterov

[permalink] [raw]
Subject: oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5)

On 06/07, Oleg Nesterov wrote:
>
> This doesn't depend on other uprobes patches I sent, and I think
> this is 3.5 material.

And during the testing I found another thing which should be fixed
in 3.5 imho. I noticed that oom-killer goes crazy. In the simplest
case, when there is the single and "obvious" memory hog it kills
sshd daemon.

Hmm. oom_badness() does

if (has_capability_noaudit(p, CAP_SYS_ADMIN))
points -= 30 * totalpages / 1000;

very nice, but what if this underflows? points is unsigned long.
points += p->signal->oom_score_adj... looks suspicious too.

Looks like we should remove "unsigned" from oom_badness() and
its callers? Probably not, it does "return points ? points : 1".

Confused.

Oleg.

2012-06-08 14:26:55

by Dave Jones

[permalink] [raw]
Subject: Re: oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5)

On Fri, Jun 08, 2012 at 04:03:28PM +0200, Oleg Nesterov wrote:
> On 06/07, Oleg Nesterov wrote:
> >
> > This doesn't depend on other uprobes patches I sent, and I think
> > this is 3.5 material.
>
> And during the testing I found another thing which should be fixed
> in 3.5 imho. I noticed that oom-killer goes crazy. In the simplest
> case, when there is the single and "obvious" memory hog it kills
> sshd daemon.
>
> Hmm. oom_badness() does
>
> if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> points -= 30 * totalpages / 1000;
>
> very nice, but what if this underflows? points is unsigned long.
> points += p->signal->oom_score_adj... looks suspicious too.
>
> Looks like we should remove "unsigned" from oom_badness() and
> its callers? Probably not, it does "return points ? points : 1".

I've been running this from David for a week, but it still isn't right..

Dave

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ed0e196..416637f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p,
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
const nodemask_t *nodemask, unsigned long totalpages)
{
- unsigned long points;
+ long points;

if (oom_unkillable_task(p, memcg, nodemask))
return 0;
@@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
* Never return 0 for an eligible task regardless of the root bonus and
* oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
*/
- return points ? points : 1;
+ return points > 0 ? points : 1;
}

/*




2012-06-08 15:06:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5)

On 06/08, Dave Jones wrote:
>
> On Fri, Jun 08, 2012 at 04:03:28PM +0200, Oleg Nesterov wrote:
> >
> > Hmm. oom_badness() does
> >
> > if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> > points -= 30 * totalpages / 1000;
> >
> > very nice, but what if this underflows? points is unsigned long.
> > points += p->signal->oom_score_adj... looks suspicious too.
> >
> > Looks like we should remove "unsigned" from oom_badness() and
> > its callers? Probably not, it does "return points ? points : 1".
>
> I've been running this from David for a week, but it still isn't right..
>
> Dave
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ed0e196..416637f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p,
> unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> const nodemask_t *nodemask, unsigned long totalpages)
> {
> - unsigned long points;
> + long points;
>
> if (oom_unkillable_task(p, memcg, nodemask))
> return 0;
> @@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> * Never return 0 for an eligible task regardless of the root bonus and
> * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
> */
> - return points ? points : 1;
> + return points > 0 ? points : 1;
> }

I did the same to avoid the problem.

Even if it still isn't right, I think it is much better ;) Currently
oom_badness() is obviously and seriously broken.

Oleg.

2012-06-08 16:57:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

On 06/07, Oleg Nesterov wrote:
>
> The caller
> should notice this error code and retry.

Damn. But I didn't notice that the caller changes "vaddr" in the
middle.

> Note: write_opcode() asks for the cleanups, I'll try to do this
> in a separate patch.

Yes.


Please find v2 below. The only difference is that write_opcode()
uses the temporary variable instead of "vaddr &= ~PAGE_MASK".


---
Subject: [PATCH] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

write_opcode() gets old_page via get_user_pages() and then calls
__replace_page() which assumes that this old_page is still mapped
after pte_offset_map_lock().

This is not true if this old_page was already try_to_unmap()'ed,
and in this case everything __replace_page() does with old_page
is wrong. Just for example, put_page() is not balanced.

I think it is possible to teach __replace_page() to handle this
unlikely case correctly, but this patch simply changes it to use
page_check_address() and return -EAGAIN if it fails. The caller
should notice this error code and retry.

Note: write_opcode() asks for the cleanups, I'll try to do this
in a separate patch.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/events/uprobes.c | 41 +++++++++++++----------------------------
1 files changed, 13 insertions(+), 28 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9c53bc2..54c8780 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage)
{
struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
- spinlock_t *ptl;
unsigned long addr;
- int err = -EFAULT;
+ spinlock_t *ptl;
+ pte_t *ptep;

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
- goto out;
-
- pgd = pgd_offset(mm, addr);
- if (!pgd_present(*pgd))
- goto out;
-
- pud = pud_offset(pgd, addr);
- if (!pud_present(*pud))
- goto out;
+ return -EFAULT;

- pmd = pmd_offset(pud, addr);
- if (!pmd_present(*pmd))
- goto out;
-
- ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
- goto out;
+ return -EAGAIN;

get_page(kpage);
page_add_new_anon_rmap(kpage, vma, addr);
@@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct
try_to_free_swap(page);
put_page(page);
pte_unmap_unlock(ptep, ptl);
- err = 0;

-out:
- return err;
+ return 0;
}

/**
@@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
void *vaddr_old, *vaddr_new;
struct vm_area_struct *vma;
struct uprobe *uprobe;
+ unsigned long pgoff;
loff_t addr;
int ret;
-
+retry:
/* Read the page with vaddr into memory */
ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma);
if (ret <= 0)
@@ -275,9 +258,9 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
memcpy(vaddr_new, vaddr_old, PAGE_SIZE);

/* poke the new insn in, ASSUMES we don't cross page boundary */
- vaddr &= ~PAGE_MASK;
- BUG_ON(vaddr + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);
- memcpy(vaddr_new + vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ pgoff = (vaddr & ~PAGE_MASK);
+ BUG_ON(pgoff + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);
+ memcpy(vaddr_new + pgoff, &opcode, UPROBE_SWBP_INSN_SIZE);

kunmap_atomic(vaddr_new);
kunmap_atomic(vaddr_old);
@@ -297,6 +280,8 @@ unlock_out:
put_out:
put_page(old_page);

+ if (unlikely(ret == -EAGAIN))
+ goto retry;
return ret;
}

--
1.5.5.1

2012-06-08 20:21:30

by David Rientjes

[permalink] [raw]
Subject: [patch for-3.5-rc1] mm, oom: fix badness score underflow

If the privileges given to root threads (3% of allowable memory) or a
negative value of /proc/pid/oom_score_adj happen to exceed the amount of
rss of a thread, its badness score overflows as a result of a7f638f999ff
("mm, oom: normalize oom scores to oom_score_adj scale only for
userspace").

Fix this by making the type signed and return 1, meaning the thread is
still eligible for kill, if the value is negative.

Reported-by: Dave Jones <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p,
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
const nodemask_t *nodemask, unsigned long totalpages)
{
- unsigned long points;
+ long points;

if (oom_unkillable_task(p, memcg, nodemask))
return 0;
@@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
* Never return 0 for an eligible task regardless of the root bonus and
* oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
*/
- return points ? points : 1;
+ return points > 0 ? points : 1;
}

/*

2012-06-09 22:25:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.5-rc1] mm, oom: fix badness score underflow

(6/8/12 4:21 PM), David Rientjes wrote:
> If the privileges given to root threads (3% of allowable memory) or a
> negative value of /proc/pid/oom_score_adj happen to exceed the amount of
> rss of a thread, its badness score overflows as a result of a7f638f999ff
> ("mm, oom: normalize oom scores to oom_score_adj scale only for
> userspace").
>
> Fix this by making the type signed and return 1, meaning the thread is
> still eligible for kill, if the value is negative.
>
> Reported-by: Dave Jones<[email protected]>
> Acked-by: Oleg Nesterov<[email protected]>
> Signed-off-by: David Rientjes<[email protected]>
> ---
> mm/oom_kill.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p,
> unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> const nodemask_t *nodemask, unsigned long totalpages)
> {
> - unsigned long points;
> + long points;
>
> if (oom_unkillable_task(p, memcg, nodemask))
> return 0;
> @@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> * Never return 0 for an eligible task regardless of the root bonus and
> * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
> */
> - return points ? points : 1;
> + return points> 0 ? points : 1;
> }

Use long long. following line is dangerous.

points += p->signal->oom_score_adj * totalpages / 1000;

maximum oom_score_adj is 1000. then if system has >8G memory on 32bit
(i.e. LONG_MAX [pages] * 4096 [pagesize] / 1000), it might get an overflow.

Or, don't use normalized oom_score_adj. i.e, oom_score_adj_write() convert
oom_score_adj into rss based modifier.

This is oom-killer code. A micro optimization don't bring us a performance benefit.

2012-06-15 06:16:55

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

> ---
> Subject: [PATCH] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()
>
> write_opcode() gets old_page via get_user_pages() and then calls
> __replace_page() which assumes that this old_page is still mapped
> after pte_offset_map_lock().
>
> This is not true if this old_page was already try_to_unmap()'ed,
> and in this case everything __replace_page() does with old_page
> is wrong. Just for example, put_page() is not balanced.
>
> I think it is possible to teach __replace_page() to handle this
> unlikely case correctly, but this patch simply changes it to use
> page_check_address() and return -EAGAIN if it fails. The caller
> should notice this error code and retry.
>
> Note: write_opcode() asks for the cleanups, I'll try to do this
> in a separate patch.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/events/uprobes.c | 41 +++++++++++++----------------------------
> 1 files changed, 13 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 9c53bc2..54c8780 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
> static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage)
> {
> struct mm_struct *mm = vma->vm_mm;
> - pgd_t *pgd;
> - pud_t *pud;
> - pmd_t *pmd;
> - pte_t *ptep;
> - spinlock_t *ptl;
> unsigned long addr;
> - int err = -EFAULT;
> + spinlock_t *ptl;
> + pte_t *ptep;
>
> addr = page_address_in_vma(page, vma);
> if (addr == -EFAULT)
> - goto out;
> -
> - pgd = pgd_offset(mm, addr);
> - if (!pgd_present(*pgd))
> - goto out;
> -
> - pud = pud_offset(pgd, addr);
> - if (!pud_present(*pud))
> - goto out;
> + return -EFAULT;
>
> - pmd = pmd_offset(pud, addr);
> - if (!pmd_present(*pmd))
> - goto out;
> -
> - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + ptep = page_check_address(page, mm, addr, &ptl, 0);
> if (!ptep)
> - goto out;
> + return -EAGAIN;
>
> get_page(kpage);
> page_add_new_anon_rmap(kpage, vma, addr);
> @@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct
> try_to_free_swap(page);
> put_page(page);
> pte_unmap_unlock(ptep, ptl);
> - err = 0;
>
> -out:
> - return err;
> + return 0;
> }
>
> /**
> @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> void *vaddr_old, *vaddr_new;
> struct vm_area_struct *vma;
> struct uprobe *uprobe;
> + unsigned long pgoff;
> loff_t addr;
> int ret;
> -
> +retry:

Just a check on coding style: Shouldnt we have a preceeding blank
line before the goto label.

> /* Read the page with vaddr into memory */
> ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma);
> if (ret <= 0)
> @@ -275,9 +258,9 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
>
> /* poke the new insn in, ASSUMES we don't cross page boundary */
> - vaddr &= ~PAGE_MASK;
> - BUG_ON(vaddr + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);
> - memcpy(vaddr_new + vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> + pgoff = (vaddr & ~PAGE_MASK);
> + BUG_ON(pgoff + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);
> + memcpy(vaddr_new + pgoff, &opcode, UPROBE_SWBP_INSN_SIZE);
>
> kunmap_atomic(vaddr_new);
> kunmap_atomic(vaddr_old);
> @@ -297,6 +280,8 @@ unlock_out:
> put_out:
> put_page(old_page);
>
> + if (unlikely(ret == -EAGAIN))
> + goto retry;
> return ret;
> }
>

Looks good.
Acked-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar

2012-06-15 06:31:35

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL

* Oleg Nesterov <[email protected]> [2012-06-07 19:00:18]:

> __copy_insn() blindly calls read_mapping_page(), this will crash
> the kernel if ->readpage == NULL, add the necessary check. For
> example, hugetlbfs_aops->readpage is NULL. Perhaps we should change
> read_mapping_page() instead.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/events/uprobes.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 48d53af..9c53bc2 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -616,6 +616,8 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins
>
> if (!filp)
> return -EINVAL;
> + if (!mapping->a_ops->readpage)
> + return -EIO;

Nit: Should there be a blank line before the if. Ingo had insisted on
these coding style changes.

> idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT);
> off1 = offset &= ~PAGE_MASK;

Acked-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar

2012-06-15 06:36:05

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB

* Oleg Nesterov <[email protected]> [2012-06-07 19:00:02]:

> __replace_page() obviously can't work with the hugetlbfs mappings,
> uprobe_register() will likely crash the kernel. Change valid_vma()
> to check VM_HUGETLB as well.
>
> As for PageTransHuge() no need to worry, vma->vm_file != NULL.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/events/uprobes.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 6e3b181..48d53af 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -105,7 +105,8 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
> if (!is_register)
> return true;
>
> - if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC))
> + if ((vma->vm_flags & (VM_HUGETLB|VM_READ|VM_WRITE|VM_EXEC|VM_SHARED))
> + == (VM_READ|VM_EXEC))
> return true;
>
> return false;

Acked-by: Srikar Dronamraju <[email protected]>

2012-06-15 12:10:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL


* Srikar Dronamraju <[email protected]> wrote:

> * Oleg Nesterov <[email protected]> [2012-06-07 19:00:18]:
>
> > __copy_insn() blindly calls read_mapping_page(), this will crash
> > the kernel if ->readpage == NULL, add the necessary check. For
> > example, hugetlbfs_aops->readpage is NULL. Perhaps we should change
> > read_mapping_page() instead.
> >
> > Signed-off-by: Oleg Nesterov <[email protected]>
> > ---
> > kernel/events/uprobes.c | 2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 48d53af..9c53bc2 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -616,6 +616,8 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins
> >
> > if (!filp)
> > return -EINVAL;
> > + if (!mapping->a_ops->readpage)
> > + return -EIO;
>
> Nit: Should there be a blank line before the if. Ingo had insisted on
> these coding style changes.

Well, within sensible limits: if()'s can certainly be in a block
if they do related things (such as boundary checks, etc.), so
the above isn't ugly IMO.

Nor are separating newlines strictly necessary - it's also a
function of the complexity of the code in question - if it's
complex code then we want all visual and cleanliness help that
we can think of.

Thanks,

Ingo

2012-06-15 12:11:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()


* Srikar Dronamraju <[email protected]> wrote:

> > ---
> > Subject: [PATCH] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()
> >
> > write_opcode() gets old_page via get_user_pages() and then calls
> > __replace_page() which assumes that this old_page is still mapped
> > after pte_offset_map_lock().
> >
> > This is not true if this old_page was already try_to_unmap()'ed,
> > and in this case everything __replace_page() does with old_page
> > is wrong. Just for example, put_page() is not balanced.
> >
> > I think it is possible to teach __replace_page() to handle this
> > unlikely case correctly, but this patch simply changes it to use
> > page_check_address() and return -EAGAIN if it fails. The caller
> > should notice this error code and retry.
> >
> > Note: write_opcode() asks for the cleanups, I'll try to do this
> > in a separate patch.
> >
> > Signed-off-by: Oleg Nesterov <[email protected]>
> > ---
> > kernel/events/uprobes.c | 41 +++++++++++++----------------------------
> > 1 files changed, 13 insertions(+), 28 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 9c53bc2..54c8780 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
> > static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > - pgd_t *pgd;
> > - pud_t *pud;
> > - pmd_t *pmd;
> > - pte_t *ptep;
> > - spinlock_t *ptl;
> > unsigned long addr;
> > - int err = -EFAULT;
> > + spinlock_t *ptl;
> > + pte_t *ptep;
> >
> > addr = page_address_in_vma(page, vma);
> > if (addr == -EFAULT)
> > - goto out;
> > -
> > - pgd = pgd_offset(mm, addr);
> > - if (!pgd_present(*pgd))
> > - goto out;
> > -
> > - pud = pud_offset(pgd, addr);
> > - if (!pud_present(*pud))
> > - goto out;
> > + return -EFAULT;
> >
> > - pmd = pmd_offset(pud, addr);
> > - if (!pmd_present(*pmd))
> > - goto out;
> > -
> > - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + ptep = page_check_address(page, mm, addr, &ptl, 0);
> > if (!ptep)
> > - goto out;
> > + return -EAGAIN;
> >
> > get_page(kpage);
> > page_add_new_anon_rmap(kpage, vma, addr);
> > @@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct
> > try_to_free_swap(page);
> > put_page(page);
> > pte_unmap_unlock(ptep, ptl);
> > - err = 0;
> >
> > -out:
> > - return err;
> > + return 0;
> > }
> >
> > /**
> > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > void *vaddr_old, *vaddr_new;
> > struct vm_area_struct *vma;
> > struct uprobe *uprobe;
> > + unsigned long pgoff;
> > loff_t addr;
> > int ret;
> > -
> > +retry:
>
> Just a check on coding style: Shouldnt we have a preceeding blank
> line before the goto label.

Yeah, that's most likely helpful to readability.

Thanks,

Ingo

2012-06-15 15:50:30

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()

On 06/15, Ingo Molnar wrote:
>
> * Srikar Dronamraju <[email protected]> wrote:
>
> > > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > void *vaddr_old, *vaddr_new;
> > > struct vm_area_struct *vma;
> > > struct uprobe *uprobe;
> > > + unsigned long pgoff;
> > > loff_t addr;
> > > int ret;
> > > -
> > > +retry:
> >
> > Just a check on coding style: Shouldnt we have a preceeding blank
> > line before the goto label.
>
> Yeah, that's most likely helpful to readability.

Aaah. Srikar, sorry, I didn't notice this comment and I already
sent 1-15. But I added the blank line in 2/15 ;)

Ingo, please let me know if I need to re-diff and resend. Otherwise
I'll add the blank line later, write_opcode() needs more changes
anyway.

Oleg.

2012-06-16 07:11:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap()


* Oleg Nesterov <[email protected]> wrote:

> On 06/15, Ingo Molnar wrote:
> >
> > * Srikar Dronamraju <[email protected]> wrote:
> >
> > > > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > > void *vaddr_old, *vaddr_new;
> > > > struct vm_area_struct *vma;
> > > > struct uprobe *uprobe;
> > > > + unsigned long pgoff;
> > > > loff_t addr;
> > > > int ret;
> > > > -
> > > > +retry:
> > >
> > > Just a check on coding style: Shouldnt we have a preceeding blank
> > > line before the goto label.
> >
> > Yeah, that's most likely helpful to readability.
>
> Aaah. Srikar, sorry, I didn't notice this comment and I already
> sent 1-15. But I added the blank line in 2/15 ;)
>
> Ingo, please let me know if I need to re-diff and resend.
> Otherwise I'll add the blank line later, write_opcode() needs
> more changes anyway.

No need, I've added it before I committed the patch.

Thanks,

Ingo