2009-03-02 17:02:18

by Masami Hiramatsu

Subject: [RFC][PATCH] x86: make text_poke() atomic

Masami Hiramatsu wrote:
> Mathieu Desnoyers wrote:
>> * Masami Hiramatsu ([email protected]) wrote:
>>> Mathieu Desnoyers wrote:
>>>> * Masami Hiramatsu ([email protected]) wrote:
>>>>> Steven Rostedt wrote:
>>>>>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>>>>>>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many,
>>>>>>>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can
>>>>>>>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of
>>>>>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>>>>>>
>>>>>>>> text_poke requires allocating a page. Map the page into memory. Set up a
>>>>>>>> break point.
>>>>>>> text_poke does not _require_ a break point. text_poke can work with
>>>>>>> stop_machine.
>>>>>> It can? Doesn't text_poke require allocating pages? The code called by
>>>>>> stop_machine is all atomic. vmap does not give an option to allocate with
>>>>>> GFP_ATOMIC.
>>>>> Hi,
>>>>>
>>>>> With my patch, text_poke() never allocates pages any more :)
>>>>>
>>>>> BTW, IMHO, both of your methods are useful and have trade-offs.
>>>>>
>>>>> ftrace wants to change a massive amount of code at once. If we do
>>>>> that with text_poke(), we have to map/unmap pages each time and
>>>>> it will take a long time -- might be longer than one stop_machine_run().
>>>>>
>>>>> On the other hand, text_poke() users such as kprobes and tracepoints
>>>>> just want to change a small amount of code at once, and changes will be
>>>>> added/removed incrementally. If we do that with stop_machine_run(),
>>>>> we'll be annoyed by frequent machine stops. (Moreover, kprobes uses a
>>>>> breakpoint, so it doesn't need stop_machine_run().)
>>>>>
>>>> Hi Masami,
>>>>
>>>> Is this text_poke version executable in atomic context ? If yes, then
>>>> that would be good to add a comment saying it. Please see below for
>>>> comments.
>>> Thank you for comments!
>>> I think it could be. Ah, spin_lock might be changed to spin_lock_irqsave()...
>>>
>> You are right. If we plan to execute this in both atomic and non-atomic
>> context, spin_lock_irqsave would make sure we are always busy-looping
>> with interrupts off.
>
> Oops, when I tested spin_lock_irqsave, it caused warnings.
>
> ------------[ cut here ]------------
> WARNING: at /home/mhiramat/ksrc/linux-2.6/kernel/smp.c:329 smp_call_function_many+0x37/0x1c9()
> Hardware name: Precision WorkStation T5400
> Modules linked in:
> Pid: 1, comm: swapper Tainted: G W 2.6.29-rc6 #16
> Call Trace:
> [<c042f7b1>] warn_slowpath+0x71/0xa8
> [<c044dccb>] ? trace_hardirqs_on_caller+0x18/0x145
> [<c06dc42f>] ? _spin_unlock_irq+0x22/0x2f
> [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
> [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
> [<c044cfbb>] ? trace_hardirqs_off_caller+0x18/0xa3
> [<c045383b>] smp_call_function_many+0x37/0x1c9
> [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
> [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
> [<c04539e9>] smp_call_function+0x1c/0x23
> [<c0433ee1>] on_each_cpu+0xf/0x3a
> [<c04138c6>] flush_tlb_all+0x14/0x16
> [<c04946f7>] unmap_kernel_range+0xf/0x11
> [<c06dd78a>] text_poke+0xf1/0x12c
>
> unmap_kernel_range() requires irqs to be enabled, but spin_lock_irqsave()
> (and stop_machine too) disables irqs, so we have to solve this issue.
>
> I have some ideas:
>
> - export (just remove static from) vunmap_page_range() and don't use
> flush_tlb_all(). Because this vm_area is not used by other cpus, we
> don't need to flush the TLB of all cpus.
>
> - make unmap_kernel_range_local() function.
>
> - extend kmap_atomic_prot() to map lowmem page when the 'prot'
> is different.
>

I updated my patch based on the first idea.
I also checked that text_poke() can be called from stop_machine()
with this patch.
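
For reference, the second idea above (an unmap_kernel_range_local()
helper) would be just a local-flush variant of unmap_kernel_range().
A minimal sketch, assuming vunmap_page_range() is exported as in the
patch below; the helper itself is hypothetical:

/*
 * Hypothetical local variant of unmap_kernel_range(): same unmap,
 * but only flushes the TLB of the current CPU. Only valid when the
 * mapping has never been used on another CPU.
 */
void unmap_kernel_range_local(unsigned long addr, unsigned long size)
{
	unsigned long end = addr + size;

	flush_cache_vunmap(addr, end);
	vunmap_page_range(addr, end);
	local_flush_tlb();	/* the mapping is local to this CPU */
}
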
---

Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
directly because this mapping is temporary and local.

As a result of the above change, text_poke() becomes atomic and can be called
from stop_machine() etc.

Signed-off-by: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Nick Piggin <[email protected]>
---
arch/x86/include/asm/alternative.h | 1 +
arch/x86/kernel/alternative.c | 36 +++++++++++++++++++++++++++++-------
include/linux/vmalloc.h | 1 +
init/main.c | 3 +++
mm/vmalloc.c | 2 +-
5 files changed, 35 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/include/asm/alternative.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/alternative.h
+++ linux-2.6/arch/x86/include/asm/alternative.h
@@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
* The _early version expects the memory to already be RW.
*/

+extern void text_poke_init(void);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void *text_poke_early(void *addr, const void *opcode, size_t len);

Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
#include <asm/nmi.h>
#include <asm/vsyscall.h>
#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
#include <asm/io.h>

#define MAX_PATCH_LEN (255-1)
@@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
return addr;
}

+static struct vm_struct *text_poke_area[2];
+static DEFINE_SPINLOCK(text_poke_lock);
+
+void __init text_poke_init(void)
+{
+ text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
+ text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
+ BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
+}
+
/**
* text_poke - Update instructions on a live kernel
* @addr: address to modify
@@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
unsigned long flags;
char *vaddr;
int nr_pages = 2;
- struct page *pages[2];
- int i;
+ struct page *pages[2], **pgp = pages;
+ int i, ret;
+ struct vm_struct *vma;

if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
@@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
BUG_ON(!pages[0]);
if (!pages[1])
nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
- local_irq_save(flags);
+ spin_lock_irqsave(&text_poke_lock, flags);
+ vma = text_poke_area[nr_pages-1];
+ ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
+ BUG_ON(ret);
+ vaddr = vma->addr;
memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
- local_irq_restore(flags);
- vunmap(vaddr);
+ /* Ported from unmap_kernel_range() */
+ flush_cache_vunmap((unsigned long)vma->addr, (unsigned long)vma->size);
+ vunmap_page_range((unsigned long)vma->addr,
+ (unsigned long)vma->addr + (unsigned long)vma->size);
+ /*
+ * Since this mapping is temporary, local and protected by spinlock,
+ * we just need to flush TLB on local processor.
+ */
+ local_flush_tlb();
+ spin_unlock_irqrestore(&text_poke_lock, flags);
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
taskstats_init_early();
delayacct_init();

+#ifdef CONFIG_X86
+ text_poke_init();
+#endif
check_bugs();

acpi_early_init(); /* before LAPIC and SMP init */
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -71,7 +71,7 @@ static void vunmap_pud_range(pgd_t *pgd,
} while (pud++, addr = next, addr != end);
}

-static void vunmap_page_range(unsigned long addr, unsigned long end)
+void vunmap_page_range(unsigned long addr, unsigned long end)
{
pgd_t *pgd;
unsigned long next;
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h
+++ linux-2.6/include/linux/vmalloc.h
@@ -96,6 +96,7 @@ extern struct vm_struct *remove_vm_area(
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
struct page ***pages);
extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void vunmap_page_range(unsigned long addr, unsigned long end);

/* Allocate/destroy a 'vmalloc' VM area. */
extern struct vm_struct *alloc_vm_area(size_t size);

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]


2009-03-02 17:19:50

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

* Masami Hiramatsu ([email protected]) wrote:
> Masami Hiramatsu wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu ([email protected]) wrote:
>>>> Mathieu Desnoyers wrote:
>>>>> * Masami Hiramatsu ([email protected]) wrote:
>>>>>> Steven Rostedt wrote:
>>>>>>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>>>>>>>> Hmm, lets see. I simply set a bit in the PTE mappings.
>>>>>>>>> There's not many, since a lot are 2M pages, for x86_64.
>>>>>>>>> Call stop_machine, and now I can modify 1 or 20,000
>>>>>>>>> locations. Set the PTE bit back. Note, the changing of
>>>>>>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>>>>>>>
>>>>>>>>> text_poke requires allocating a page. Map the page into
>>>>>>>>> memory. Set up a break point.
>>>>>>>> text_poke does not _require_ a break point. text_poke can work with
>>>>>>>> stop_machine.
>>>>>>> It can? Doesn't text_poke require allocating pages? The code
>>>>>>> called by stop_machine is all atomic. vmap does not give an
>>>>>>> option to allocate with GFP_ATOMIC.
>>>>>> Hi,
>>>>>>
>>>>>> With my patch, text_poke() never allocates pages any more :)
>>>>>>
>>>>>> BTW, IMHO, both of your methods are useful and have trade-offs.
>>>>>>
>>>>>> ftrace wants to change a massive amount of code at once. If we do
>>>>>> that with text_poke(), we have to map/unmap pages each time and
>>>>>> it will take a long time -- might be longer than one stop_machine_run().
>>>>>>
>>>>>> On the other hand, text_poke() users such as kprobes and tracepoints
>>>>>> just want to change a small amount of code at once, and changes will be
>>>>>> added/removed incrementally. If we do that with stop_machine_run(),
>>>>>> we'll be annoyed by frequent machine stops. (Moreover, kprobes uses a
>>>>>> breakpoint, so it doesn't need stop_machine_run().)
>>>>>>
>>>>> Hi Masami,
>>>>>
>>>>> Is this text_poke version executable in atomic context ? If yes, then
>>>>> that would be good to add a comment saying it. Please see below for
>>>>> comments.
>>>> Thank you for comments!
>>>> I think it could be. Ah, spin_lock might be changed to spin_lock_irqsave()...
>>>>
>>> You are right. If we plan to execute this in both atomic and non-atomic
>>> context, spin_lock_irqsave would make sure we are always busy-looping
>>> with interrupts off.
>>
>> Oops, when I tested spin_lock_irqsave, it caused warnings.
>>
>> ------------[ cut here ]------------
>> WARNING: at /home/mhiramat/ksrc/linux-2.6/kernel/smp.c:329 smp_call_function_many+0x37/0x1c9()
>> Hardware name: Precision WorkStation T5400
>> Modules linked in:
>> Pid: 1, comm: swapper Tainted: G W 2.6.29-rc6 #16
>> Call Trace:
>> [<c042f7b1>] warn_slowpath+0x71/0xa8
>> [<c044dccb>] ? trace_hardirqs_on_caller+0x18/0x145
>> [<c06dc42f>] ? _spin_unlock_irq+0x22/0x2f
>> [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
>> [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
>> [<c044cfbb>] ? trace_hardirqs_off_caller+0x18/0xa3
>> [<c045383b>] smp_call_function_many+0x37/0x1c9
>> [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
>> [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
>> [<c04539e9>] smp_call_function+0x1c/0x23
>> [<c0433ee1>] on_each_cpu+0xf/0x3a
>> [<c04138c6>] flush_tlb_all+0x14/0x16
>> [<c04946f7>] unmap_kernel_range+0xf/0x11
>> [<c06dd78a>] text_poke+0xf1/0x12c
>>
>> unmap_kernel_range() requires irqs to be enabled, but spin_lock_irqsave()
>> (and stop_machine too) disables irqs, so we have to solve this issue.
>>
>> I have some ideas:
>>
>> - export (just remove static from) vunmap_page_range() and don't use
>> flush_tlb_all(). Because this vm_area is not used by other cpus, we
>> don't need to flush the TLB of all cpus.
>>
>> - make unmap_kernel_range_local() function.
>>
>> - extend kmap_atomic_prot() to map lowmem page when the 'prot'
>> is different.
>>
>
> I updated my patch based on the first idea.
> I also checked that text_poke() can be called from stop_machine()
> with this patch.
> ---
>
> Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
> and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
> directly because this mapping is temporary and local.
>
> As a result of the above change, text_poke() becomes atomic and can be called
> from stop_machine() etc.
>
> Signed-off-by: Masami Hiramatsu <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Nick Piggin <[email protected]>
> ---
> arch/x86/include/asm/alternative.h | 1 +
> arch/x86/kernel/alternative.c | 36 +++++++++++++++++++++++++++++-------
> include/linux/vmalloc.h | 1 +
> init/main.c | 3 +++
> mm/vmalloc.c | 2 +-
> 5 files changed, 35 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/alternative.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
> +++ linux-2.6/arch/x86/include/asm/alternative.h
> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
> * The _early version expects the memory to already be RW.
> */
>
> +extern void text_poke_init(void);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -12,6 +12,7 @@
> #include <asm/nmi.h>
> #include <asm/vsyscall.h>
> #include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
> #include <asm/io.h>
>
> #define MAX_PATCH_LEN (255-1)
> @@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
> return addr;
> }
>
> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> + text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> + text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> + BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> +}
> +
> /**
> * text_poke - Update instructions on a live kernel
> * @addr: address to modify
> @@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
> unsigned long flags;
> char *vaddr;
> int nr_pages = 2;
> - struct page *pages[2];
> - int i;
> + struct page *pages[2], **pgp = pages;
> + int i, ret;
> + struct vm_struct *vma;
>
> if (!core_kernel_text((unsigned long)addr)) {
> pages[0] = vmalloc_to_page(addr);
> @@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
> BUG_ON(!pages[0]);
> if (!pages[1])
> nr_pages = 1;
> - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> - BUG_ON(!vaddr);
> - local_irq_save(flags);
> + spin_lock_irqsave(&text_poke_lock, flags);
> + vma = text_poke_area[nr_pages-1];
> + ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
> + BUG_ON(ret);
> + vaddr = vma->addr;
> memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> - local_irq_restore(flags);
> - vunmap(vaddr);
> + /* Ported from unmap_kernel_range() */
> + flush_cache_vunmap((unsigned long)vma->addr, (unsigned long)vma->size);
> + vunmap_page_range((unsigned long)vma->addr,
> + (unsigned long)vma->addr + (unsigned long)vma->size);
> + /*
> + * Since this mapping is temporary, local and protected by spinlock,
> + * we just need to flush TLB on local processor.
> + */
> + local_flush_tlb();
> + spin_unlock_irqrestore(&text_poke_lock, flags);
> sync_core();
> /* Could also do a CLFLUSH here to speed up CPU recovery; but
> that causes hangs on some VIA CPUs. */
> Index: linux-2.6/init/main.c
> ===================================================================
> --- linux-2.6.orig/init/main.c
> +++ linux-2.6/init/main.c
> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
> taskstats_init_early();
> delayacct_init();
>
> +#ifdef CONFIG_X86
> + text_poke_init();
> +#endif

All good, except this above. There should be an empty text_poke_init()
in some header file, and an implementation for the X86 arch rather than
an ifdef in init/main.c.

Mathieu

> check_bugs();
>
> acpi_early_init(); /* before LAPIC and SMP init */
> Index: linux-2.6/mm/vmalloc.c
> ===================================================================
> --- linux-2.6.orig/mm/vmalloc.c
> +++ linux-2.6/mm/vmalloc.c
> @@ -71,7 +71,7 @@ static void vunmap_pud_range(pgd_t *pgd,
> } while (pud++, addr = next, addr != end);
> }
>
> -static void vunmap_page_range(unsigned long addr, unsigned long end)
> +void vunmap_page_range(unsigned long addr, unsigned long end)
> {
> pgd_t *pgd;
> unsigned long next;
> Index: linux-2.6/include/linux/vmalloc.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmalloc.h
> +++ linux-2.6/include/linux/vmalloc.h
> @@ -96,6 +96,7 @@ extern struct vm_struct *remove_vm_area(
> extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
> struct page ***pages);
> extern void unmap_kernel_range(unsigned long addr, unsigned long size);
> +extern void vunmap_page_range(unsigned long addr, unsigned long end);
>
> /* Allocate/destroy a 'vmalloc' VM area. */
> extern struct vm_struct *alloc_vm_area(size_t size);
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-02 18:28:54

by Arjan van de Ven

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

>
> Use map_vm_area() instead of vmap() in text_poke() to avoid page
> allocation and delayed unmapping, and call vunmap_page_range() and
> local_flush_tlb() directly because this mapping is temporary and
> local.
>
> As a result of the above change, text_poke() becomes atomic and can
> be called from stop_machine() etc.

.... but text_poke() realistically needs to call stop_machine() since
you can't poke live code.... so that makes me wonder how useful this
is...

2009-03-02 18:36:42

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

* Arjan van de Ven ([email protected]) wrote:
> >
> > Use map_vm_area() instead of vmap() in text_poke() to avoid page
> > allocation and delayed unmapping, and call vunmap_page_range() and
> > local_flush_tlb() directly because this mapping is temporary and
> > local.
> >
> > As a result of the above change, text_poke() becomes atomic and can
> > be called from stop_machine() etc.
>
> .... but text_poke() realistically needs to call stop_machine() since
> you can't poke live code.... so that makes me wonder how useful this
> is...

Hi Arjan,

Stop machine is not required when inserting a breakpoint. And cleverly
using this breakpoint technique can permit modifying other instructions
as well.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-02 18:44:38

by Linus Torvalds

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic



On Mon, 2 Mar 2009, Arjan van de Ven wrote:
>
> .... but text_poke() realistically needs to call stop_machine() since
> you can't poke live code.... so that makes me wonder how useful this
> is...

Well, not always. There are at least two cases where we don't need it:

- in the UP -> SMP transition.

- perhaps more interestingly, we're still kind of waiting for the
resolution of the whole "nop out the first byte to a single-byte 'int3'
trap instruction, then rewrite the rest of the instruction, and then
reset the first byte of the final instruction" thing.

IOW, there are possible non-stop_machine() models where rewriting
instructions in a live system does work, and quite frankly, I think we
need them. Making the rule be the (obviously safe) "we can only do this in
stop_machine" is quite possibly not going to be an acceptable rule and we
may need alternatives.

Linus

2009-03-02 18:55:14

by Arjan van de Ven

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

On Mon, 2 Mar 2009 13:36:17 -0500
Mathieu Desnoyers <[email protected]> wrote:

> * Arjan van de Ven ([email protected]) wrote:
> > >
> > > Use map_vm_area() instead of vmap() in text_poke() to avoid
> > > page allocation and delayed unmapping, and call
> > > vunmap_page_range() and local_flush_tlb() directly because this
> > > mapping is temporary and local.
> > >
> > > As a result of the above change, text_poke() becomes atomic and
> > > can be called from stop_machine() etc.
> >
> > .... but text_poke() realistically needs to call stop_machine()
> > since you can't poke live code.... so that makes me wonder how
> > useful this is...
>
> Hi Arjan,
>
> Stop machine is not required when inserting a breakpoint.

that is your assumption; when I spoke with CPU architects they
cringed ;(



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-03-02 19:14:45

by Masami Hiramatsu

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

Arjan van de Ven wrote:
> On Mon, 2 Mar 2009 13:36:17 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
>> * Arjan van de Ven ([email protected]) wrote:
>>>> Use map_vm_area() instead of vmap() in text_poke() to avoid
>>>> page allocation and delayed unmapping, and call
>>>> vunmap_page_range() and local_flush_tlb() directly because this
>>>> mapping is temporary and local.
>>>>
>>>> As a result of the above change, text_poke() becomes atomic and
>>>> can be called from stop_machine() etc.
>>> .... but text_poke() realistically needs to call stop_machine()
>>> since you can't poke live code.... so that makes me wonder how
>>> useful this is...
>> Hi Arjan,
>>
>> Stop machine is not required when inserting a breakpoint.
>
> that is your assumption; when I spoke with CPU architects they
> cringed ;(

Is that true even if we modify just one byte (like int3 insertion)
and we don't care about synchronous write (meaning the code modification
takes effect on other processors after a while)?

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-02 19:28:28

by H. Peter Anvin

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

Masami Hiramatsu wrote:
>>>
>>> Stop machine is not required when inserting a breakpoint.
>>
>> that is your assumption; when I spoke with CPU architects they
>> cringed ;(
>
> Is that true even if we modify just one byte (like int3 insertion)
> and we don't care about synchronous write (meaning the code modification
> takes effect on other processors after a while)?
>

The problem is that he's using it as part of a sequenced series of
steps. The lack of synchronization comes into play there.

-hpa

2009-03-02 19:52:49

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

* Arjan van de Ven ([email protected]) wrote:
> On Mon, 2 Mar 2009 13:36:17 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
> > * Arjan van de Ven ([email protected]) wrote:
> > > >
> > > > Use map_vm_area() instead of vmap() in text_poke() to avoid
> > > > page allocation and delayed unmapping, and call
> > > > vunmap_page_range() and local_flush_tlb() directly because this
> > > > mapping is temporary and local.
> > > >
> > > > As a result of the above change, text_poke() becomes atomic and
> > > > can be called from stop_machine() etc.
> > >
> > > .... but text_poke() realistically needs to call stop_machine()
> > > since you can't poke live code.... so that makes me wonder how
> > > useful this is...
> >
> > Hi Arjan,
> >
> > Stop machine is not required when inserting a breakpoint.
>
> that is your assumption; when I spoke with CPU architects they
> cringed ;(
>

Given you are not citing any technical material, I guess you are
referring to:

Intel® Core™2 Duo Processor E8000Δ and E7000Δ Series
http://download.intel.com/design/processor/specupdt/318733.pdf (page 46)

AW75. Unsynchronized Cross-Modifying Code Operations Can Cause
Unexpected Instruction Execution Results

Am I correct? This erratum has been around since the Pentium III and is
still valid today. Other current CPUs with this erratum:

Intel® Atom™ Processor Z5xxΔ Series
http://download.intel.com/design/processor/specupdt/319536.pdf (page 22)
AAE18 Unsynchronized Cross-Modifying Code Operations Can Cause
Unexpected Instruction Execution Results


First point: given your statement, kprobes would be buggy on x86 _and_
ia64. If this is true, then it should be addressed. If not, then we
should not worry about it.


The algorithm they propose to work around the architectural limitations
is stated here:
http://download.intel.com/design/PentiumII/manuals/24319202.pdf
7.1.3 Handling Self- and Cross-Modifying Code

It basically implies using something like stop_machine. However, if we
carefully read the small amount of information available in this erratum:

"The act of a processor writing data into a currently executing code
segment with the intent of executing that data as code is called
self-modifying code. Intel Architecture processors exhibit
model-specific behavior when executing self-modified code, depending
upon how far ahead of the current execution pointer the code has been
modified. As processor architectures become more complex and start to
speculatively execute code ahead of the retirement point (as in the P6
family processors), the rules regarding which code should execute, pre-
or post-modification, become blurred."

Basically, this points to speculative code execution as being the
core of the problems encountered with code modification. But given that
int3 *IS* a _serializing_ instruction, it is not affected by this erratum.
Quoting Richard J Moore from IBM, from a discussion we had a few years
ago:

* "There is another issue to consider when looking into using probes other
* then int3:
*
* Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
* practice of modifying code on one processor where another has prefetched
* the unmodified version of the code. Intel states that unpredictable general
* protection faults may result if a synchronizing instruction (iret, int,
* int3, cpuid, etc ) is not executed on the second processor before it
* executes the pre-fetched out-of-date copy of the instruction.
*
* When we became aware of this I had a long discussion with Intel's
* microarchitecture guys. It turns out that the reason for this erratum
* (which incidentally Intel does not intend to fix) is because the trace
* cache - the stream of micro-ops resulting from instruction interpretation -
* cannot be guaranteed to be valid. Reading between the lines I assume this
* issue arises because of optimization done in the trace cache, where it is
* no longer possible to identify the original instruction boundaries. If the
* CPU discovers that the trace cache has been invalidated because of
* unsynchronized cross-modification then instruction execution will be
* aborted with a GPF. Further discussion with Intel revealed that replacing
* the first opcode byte with an int3 would not be subject to this erratum.
*
* So, is cmpxchg reliable? One has to guarantee more than mere atomicity."

Therefore, I think it is ok to assume int3 is safe for _synchronized_ XMC.
The multi-step algorithm I use to perform code modification in my
immediate values patch based on int3 basically writes the int3, sends an
IPI to _each_ CPU to make sure they issue a synchronizing instruction
(cpuid), and then I can safely proceed to change the instruction,
including the first byte, because I know that all CPUs which could have
potentially seen the old instruction have seen the new version
(breakpoint) and have issued a synchronizing instruction (in that order).
Note that I put an smp_wmb() after the int3 write, and an smp_rmb() in the
IPI handler before the cpuid instruction.
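
As a rough illustration, that sequence could look like the following
sketch (the helpers are hypothetical, not the actual immediate values
code; it assumes a breakpoint handler is installed to deal with CPUs
that hit the int3 while the instruction is being rewritten):

/* Hypothetical sketch of the int3-based code patching sequence. */
static void sync_core_ipi(void *info)
{
	smp_rmb();	/* pairs with the smp_wmb() after the int3 write */
	sync_core();	/* cpuid: a synchronizing instruction */
}

static void patch_live_insn(u8 *addr, const u8 *newinsn, size_t len)
{
	static const u8 int3 = 0xcc;

	text_poke(addr, &int3, 1);	/* 1. breakpoint on the first byte */
	smp_wmb();
	smp_call_function(sync_core_ipi, NULL, 1);	/* 2. sync all CPUs */
	text_poke(addr + 1, newinsn + 1, len - 1);	/* 3. rewrite the tail */
	smp_call_function(sync_core_ipi, NULL, 1);
	text_poke(addr, newinsn, 1);	/* 4. restore the first byte */
	smp_call_function(sync_core_ipi, NULL, 1);
	sync_core();	/* serialize the patching CPU itself */
}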

Note that extra care will have to be taken to handle synchronization of
instruction and data caches on the Itanium, but this is a different
architecture and topic, which is not the primary focus of our discussion
here:
Cache Coherency in Itanium® Processor Software
http://cache-www.intel.com/cd/00/00/21/57/215792_215792.pdf

Mathieu



--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-02 22:16:33

by Masami Hiramatsu

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> Index: linux-2.6/init/main.c
>> ===================================================================
>> --- linux-2.6.orig/init/main.c
>> +++ linux-2.6/init/main.c
>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>> taskstats_init_early();
>> delayacct_init();
>>
>> +#ifdef CONFIG_X86
>> + text_poke_init();
>> +#endif
>
> All good, except this above. There should be an empty text_poke_init()
> in some header file, and an implementation for the X86 arch rather than
> a ifdef in init/main.c.

Hmm, I'd rather use a __weak function instead of defining it in some header
file, because text_poke() and alternatives exist only on x86.

I know that we need to discuss cross-modifying code on x86 with
Arjan or other Intel engineers. This patch may still be useful
for removing the unnecessary vm_area allocation in text_poke().

Thank you,

---
Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
directly because this mapping is temporary and local.

As a result of the above change, text_poke() becomes atomic and can be called
from stop_machine() etc.

Signed-off-by: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Nick Piggin <[email protected]>
---
arch/x86/include/asm/alternative.h | 1 +
arch/x86/kernel/alternative.c | 36 +++++++++++++++++++++++++++++-------
include/linux/vmalloc.h | 1 +
init/main.c | 5 +++++
mm/vmalloc.c | 2 +-
5 files changed, 37 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/include/asm/alternative.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/alternative.h
+++ linux-2.6/arch/x86/include/asm/alternative.h
@@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
* The _early version expects the memory to already be RW.
*/

+extern void text_poke_init(void);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void *text_poke_early(void *addr, const void *opcode, size_t len);

Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
#include <asm/nmi.h>
#include <asm/vsyscall.h>
#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
#include <asm/io.h>

#define MAX_PATCH_LEN (255-1)
@@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
return addr;
}

+static struct vm_struct *text_poke_area[2];
+static DEFINE_SPINLOCK(text_poke_lock);
+
+void __init text_poke_init(void)
+{
+ text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
+ text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
+ BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
+}
+
/**
* text_poke - Update instructions on a live kernel
* @addr: address to modify
@@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
unsigned long flags;
char *vaddr;
int nr_pages = 2;
- struct page *pages[2];
- int i;
+ struct page *pages[2], **pgp = pages;
+ int i, ret;
+ struct vm_struct *vma;

if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
@@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
BUG_ON(!pages[0]);
if (!pages[1])
nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
- local_irq_save(flags);
+ spin_lock_irqsave(&text_poke_lock, flags);
+ vma = text_poke_area[nr_pages-1];
+ ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
+ BUG_ON(ret);
+ vaddr = vma->addr;
memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
- local_irq_restore(flags);
- vunmap(vaddr);
+ /* Ported from unmap_kernel_range() */
+ flush_cache_vunmap((unsigned long)vma->addr, (unsigned long)vma->size);
+ vunmap_page_range((unsigned long)vma->addr,
+ (unsigned long)vma->addr + (unsigned long)vma->size);
+ /*
+ * Since this mapping is temporary, local and protected by spinlock,
+ * we just need to flush TLB on local processor.
+ */
+ local_flush_tlb();
+ spin_unlock_irqrestore(&text_poke_lock, flags);
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -526,6 +526,10 @@ void __init __weak thread_info_cache_ini
{
}

+void __init __weak text_poke_init(void)
+{
+}
+
asmlinkage void __init start_kernel(void)
{
char * command_line;
@@ -676,6 +680,7 @@ asmlinkage void __init start_kernel(void
taskstats_init_early();
delayacct_init();

+ text_poke_init();
check_bugs();

acpi_early_init(); /* before LAPIC and SMP init */
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -71,7 +71,7 @@ static void vunmap_pud_range(pgd_t *pgd,
} while (pud++, addr = next, addr != end);
}

-static void vunmap_page_range(unsigned long addr, unsigned long end)
+void vunmap_page_range(unsigned long addr, unsigned long end)
{
pgd_t *pgd;
unsigned long next;
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h
+++ linux-2.6/include/linux/vmalloc.h
@@ -96,6 +96,7 @@ extern struct vm_struct *remove_vm_area(
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
struct page ***pages);
extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void vunmap_page_range(unsigned long addr, unsigned long end);

/* Allocate/destroy a 'vmalloc' VM area. */
extern struct vm_struct *alloc_vm_area(size_t size);

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-02 22:24:18

by Ingo Molnar

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic


* Masami Hiramatsu <[email protected]> wrote:

> Mathieu Desnoyers wrote:
>> * Masami Hiramatsu ([email protected]) wrote:
>>> Index: linux-2.6/init/main.c
>>> ===================================================================
>>> --- linux-2.6.orig/init/main.c
>>> +++ linux-2.6/init/main.c
>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>> taskstats_init_early();
>>> delayacct_init();
>>>
>>> +#ifdef CONFIG_X86
>>> + text_poke_init();
>>> +#endif
>>
>> All good, except this above. There should be an empty text_poke_init()
>> in some header file, and an implementation for the X86 arch rather than
>> a ifdef in init/main.c.
>
> Hmm, I'd rather use a __weak function instead of defining it in some header
> file, because text_poke() and alternatives exist only on x86.
>
> I know that we need to discuss cross-modifying code on x86 with
> Arjan or other Intel engineers. This patch may still be useful
> for removing the unnecessary vm_area allocation in text_poke().
>
> Thank you,
>
> ---
>
> Use map_vm_area() instead of vmap() in text_poke() to
> avoid page allocation and delayed unmapping, and call
> vunmap_page_range() and local_flush_tlb() directly because
> this mapping is temporary and local.
>
> As a result of the above change, text_poke() becomes
> atomic and can be called from stop_machine() etc.

That looks like a good fix in itself - see a few minor details
below.

(Note, i could not try your patch because it has widespread
whitespace damage - please watch out for this for future
patches.)

> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> + text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> + text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> + BUG_ON(!text_poke_area[0] || !text_poke_area[1]);

BUG_ON() for non-100%-essential init code is a no-no. Please
change it to WARN_ON() so that people have a chance to report it.

Also, i think all these vma complications came from the decision
to use vmap - and vmap enhancements in .29 complicated this
supposedly-simple interface.

So perhaps another approach to (re-)consider would be to go back
to atomic fixmaps here. It spends 3 slots but that's no big
deal.

In exchange it will be conceptually simpler, and will also scale
much better than a global spinlock. What do you think?
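
A minimal sketch of such a fixmap-based text_poke() (the FIX_TEXT_POKE0
slot name is hypothetical, only the single-page case is handled, and a
lock would still be needed to serialize users of the slot):

static void *text_poke_fixmap(void *addr, const void *opcode, size_t len)
{
	unsigned long flags;
	char *vaddr;

	/* Sketch: core kernel text only, no page-crossing writes. */
	BUG_ON(len + offset_in_page(addr) > PAGE_SIZE);
	local_irq_save(flags);
	set_fixmap(FIX_TEXT_POKE0, page_to_phys(virt_to_page(addr)));
	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
	memcpy(vaddr + offset_in_page(addr), opcode, len);
	clear_fixmap(FIX_TEXT_POKE0);
	local_flush_tlb();	/* mapping was only used on this CPU */
	local_irq_restore(flags);
	sync_core();
	return addr;
}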

Ingo

2009-03-02 22:57:18

by Masami Hiramatsu

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic



Ingo Molnar wrote:
> * Masami Hiramatsu <[email protected]> wrote:
>
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu ([email protected]) wrote:
>>>> Index: linux-2.6/init/main.c
>>>> ===================================================================
>>>> --- linux-2.6.orig/init/main.c
>>>> +++ linux-2.6/init/main.c
>>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>>> taskstats_init_early();
>>>> delayacct_init();
>>>>
>>>> +#ifdef CONFIG_X86
>>>> + text_poke_init();
>>>> +#endif
>>> All good, except this above. There should be an empty text_poke_init()
>>> in some header file, and an implementation for the X86 arch rather than
>>> a ifdef in init/main.c.
>> Hmm, I'd rather use a __weak function instead of defining it in some header
>> file, because text_poke() and alternatives exist only on x86.
>>
>> I know that we need to discuss cross-modifying code on x86 with
>> Arjan or other Intel engineers. This patch may still be useful
>> for removing the unnecessary vm_area allocation in text_poke().
>>
>> Thank you,
>>
>> ---
>>
>> Use map_vm_area() instead of vmap() in text_poke() to
>> avoid page allocation and delayed unmapping, and call
>> vunmap_page_range() and local_flush_tlb() directly because
>> this mapping is temporary and local.
>>
>> As a result of the above change, text_poke() becomes
>> atomic and can be called from stop_machine() etc.
>
> That looks like a good fix in itself - see a few minor details
> below.

Thank you for review,

>
> (Note, i could not try your patch because it has widespread
> whitespace damage - please watch out for this for future
> patches.)

Oops, it was my mis-setting...

>
>> +static struct vm_struct *text_poke_area[2];
>> +static DEFINE_SPINLOCK(text_poke_lock);
>> +
>> +void __init text_poke_init(void)
>> +{
>> + text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> + text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
>> + BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
>
> BUG_ON() for non-100%-essential init code is a no-no. Please
> change it to WARN_ON() so that people have a chance to report it.

Sure.

>
> Also, i think all these vma complications came from the decision
> to use vmap - and vmap enhancements in .29 complicated this
> supposedly-simple interface.
>
> So perhaps another approach to (re-)consider would be to go back
> to atomic fixmaps here. It spends 3 slots but that's no big
> deal.

Oh, it's a good idea! fixmaps must make it simpler.

>
> In exchange it will be conceptually simpler, and will also scale
> much better than a global spinlock. What do you think?

I think even if I use fixmaps, we have to use a spinlock to protect
the fixmap area from other threads...

>
> Ingo

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-02 23:10:17

by Ingo Molnar

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic


* Masami Hiramatsu <[email protected]> wrote:

>
>
> Ingo Molnar wrote:
> > * Masami Hiramatsu <[email protected]> wrote:
> >
> >> Mathieu Desnoyers wrote:
> >>> * Masami Hiramatsu ([email protected]) wrote:
> >>>> Index: linux-2.6/init/main.c
> >>>> ===================================================================
> >>>> --- linux-2.6.orig/init/main.c
> >>>> +++ linux-2.6/init/main.c
> >>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
> >>>> taskstats_init_early();
> >>>> delayacct_init();
> >>>>
> >>>> +#ifdef CONFIG_X86
> >>>> + text_poke_init();
> >>>> +#endif
> >>> All good, except this above. There should be an empty text_poke_init()
> >>> in some header file, and an implementation for the X86 arch rather than
> >>> a ifdef in init/main.c.
> >> Hmm, I'd rather use a __weak function instead of defining it in some header
> >> file, because text_poke() and alternatives exist only on x86.
> >>
> >> I know that we need to discuss cross-modifying code on x86 with
> >> Arjan or other Intel engineers. This patch may still be useful
> >> for removing the unnecessary vm_area allocation in text_poke().
> >>
> >> Thank you,
> >>
> >> ---
> >>
> >> Use map_vm_area() instead of vmap() in text_poke() to
> >> avoid page allocation and delayed unmapping, and call
> >> vunmap_page_range() and local_flush_tlb() directly because
> >> this mapping is temporary and local.
> >>
> >> As a result of the above change, text_poke() becomes
> >> atomic and can be called from stop_machine() etc.
> >
> > That looks like a good fix in itself - see a few minor details
> > below.
>
> Thank you for review,
>
> >
> > (Note, i could not try your patch because it has widespread
> > whitespace damage - please watch out for this for future
> > patches.)
>
> Oops, it was my mis-setting...
>
> >
> >> +static struct vm_struct *text_poke_area[2];
> >> +static DEFINE_SPINLOCK(text_poke_lock);
> >> +
> >> +void __init text_poke_init(void)
> >> +{
> >> + text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >> + text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> >> + BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> >
> > BUG_ON() for non-100%-essential init code is a no-no. Please
> > change it to WARN_ON() so that people have a chance to report it.
>
> Sure.
>
> >
> > Also, i think all these vma complications came from the decision
> > to use vmap - and vmap enhancements in .29 complicated this
> > supposedly-simple interface.
> >
> > So perhaps another approach to (re-)consider would be to go back
> > to atomic fixmaps here. It spends 3 slots but that's no big
> > deal.
>
> Oh, it's a good idea! fixmaps must make it simpler.
>
> >
> > In exchange it will be conceptually simpler, and will also scale
> > much better than a global spinlock. What do you think?
>
> I think even if I use fixmaps, we have to use a spinlock to protect
> the fixmap area from other threads...

that's why i suggested to use an atomic-kmap, not a fixmap.

Ingo

2009-03-02 23:38:38

by Masami Hiramatsu

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

Ingo Molnar wrote:
>>> So perhaps another approach to (re-)consider would be to go back
>>> to atomic fixmaps here. It spends 3 slots but that's no big
>>> deal.
>> Oh, it's a good idea! fixmaps must make it simpler.
>>
>>> In exchange it will be conceptually simpler, and will also scale
>>> much better than a global spinlock. What do you think?
>> I think even if I use fixmaps, we have to use a spinlock to protect
>> the fixmap area from other threads...
>
> that's why i suggested to use an atomic-kmap, not a fixmap.

Even if the mapping is atomic, text_poke() has to protect pte
from other text_poke()s while changing code.
AFAIK, atomic-kmap itself doesn't ensure that, does it?

Thank you,

>
> Ingo

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-02 23:50:41

by Ingo Molnar

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic


* Masami Hiramatsu <[email protected]> wrote:

> Ingo Molnar wrote:
> >>> So perhaps another approach to (re-)consider would be to go back
> >>> to atomic fixmaps here. It spends 3 slots but that's no big
> >>> deal.
> >> Oh, it's a good idea! fixmaps must make it simpler.
> >>
> >>> In exchange it will be conceptually simpler, and will also scale
> >>> much better than a global spinlock. What do you think?
> >> I think even if I use fixmaps, we have to use a spinlock to protect
> >> the fixmap area from other threads...
> >
> > that's why i suggested to use an atomic-kmap, not a fixmap.
>
> Even if the mapping is atomic, text_poke() has to protect pte
> from other text_poke()s while changing code.
> AFAIK, atomic-kmap itself doesn't ensure that, does it?

Well, but text_poke() is not a serializing API to begin with.
It's normally used in code patching sequences when we 'know'
that there cannot be similar parallel activities. The kprobes
usage of text_poke() looks unsafe - and that needs to be fixed.

So indeed a new global lock is needed there.

It's fixable and we'll fix it, but text_poke() is really more
complex than i'd like it to be.

stop_machine_run() is essentially instantaneous in practice and
obviously serializing so it warrants a second look at least.
Have you tried to use it in kprobes?
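
(For illustration, arming a probe that way could look roughly like the
sketch below; the wrapper is hypothetical, not existing kprobes code.)

/* Hypothetical: arm a kprobe with all other CPUs held in stop_machine(). */
static int __arm_kprobe_stopped(void *data)
{
	struct kprobe *p = data;

	arch_arm_kprobe(p);	/* every other CPU is spinning, irqs off */
	return 0;
}

static void arm_kprobe_stop_machine(struct kprobe *p)
{
	stop_machine(__arm_kprobe_stopped, p, NULL);	/* NULL: any CPU */
}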

Ingo

2009-03-03 00:00:47

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

* Ingo Molnar ([email protected]) wrote:
>
> * Masami Hiramatsu <[email protected]> wrote:
>
> > Ingo Molnar wrote:
> > >>> So perhaps another approach to (re-)consider would be to go back
> > >>> to atomic fixmaps here. It spends 3 slots but that's no big
> > >>> deal.
> > >> Oh, it's a good idea! fixmaps must make it simpler.
> > >>
> > >>> In exchange it will be conceptually simpler, and will also scale
> > >>> much better than a global spinlock. What do you think?
> > >> I think even if I use fixmaps, we have to use a spinlock to protect
> > >> the fixmap area from other threads...
> > >
> > > that's why i suggested to use an atomic-kmap, not a fixmap.
> >
> > Even if the mapping is atomic, text_poke() has to protect pte
> > from other text_poke()s while changing code.
> > AFAIK, atomic-kmap itself doesn't ensure that, does it?
>
> Well, but text_poke() is not a serializing API to begin with.
> It's normally used in code patching sequences when we 'know'
> that there cannot be similar parallel activities. The kprobes
> usage of text_poke() looks unsafe - and that needs to be fixed.
>
> So indeed a new global lock is needed there.
>
> It's fixable and we'll fix it, but text_poke() is really more
> complex than i'd like it to be.
>
> stop_machine_run() is essentially instantaneous in practice and
> obviously serializing so it warrants a second look at least.
> Have you tried to use it in kprobes?
>
> Ingo

This is why I prepared

text-edit-lock-architecture-independent-code.patch
text-edit-lock-kprobes-architecture-independent-support.patch

a while ago. I'll post them right away.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 00:02:05

by Mathieu Desnoyers

Subject: [PATCH] Text Edit Lock - Architecture Independent Code

This is an architecture-independent synchronization around kernel text
modifications through use of a global mutex.

A mutex has been chosen so that kprobes, the main user of this, can sleep during
memory allocation between the memory read of the instructions it must replace
and the memory write of the breakpoint.

Another user of this interface: immediate values.

Paravirt and alternatives are always done when SMP is inactive, so there is no
need to use locks.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Ingo Molnar <[email protected]>
---
include/linux/memory.h | 7 +++++++
mm/memory.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

Index: linux-2.6-lttng/include/linux/memory.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/memory.h 2009-01-30 09:47:59.000000000 -0500
+++ linux-2.6-lttng/include/linux/memory.h 2009-01-30 10:25:33.000000000 -0500
@@ -99,4 +99,11 @@ enum mem_add_context { BOOT, HOTPLUG };
#define hotplug_memory_notifier(fn, pri) do { } while (0)
#endif

+/*
+ * Take and release the kernel text modification lock, used for code patching.
+ * Users of this lock can sleep.
+ */
+extern void kernel_text_lock(void);
+extern void kernel_text_unlock(void);
+
#endif /* _LINUX_MEMORY_H_ */
Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c 2009-01-30 09:47:59.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c 2009-01-30 10:26:01.000000000 -0500
@@ -48,6 +48,8 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/delayacct.h>
+#include <linux/kprobes.h>
+#include <linux/mutex.h>
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
@@ -99,6 +101,12 @@ int randomize_va_space __read_mostly =
2;
#endif

+/*
+ * mutex protecting text section modification (dynamic code patching).
+ * some users need to sleep (allocating memory...) while they hold this lock.
+ */
+static DEFINE_MUTEX(text_mutex);
+
static int __init disable_randmaps(char *s)
{
randomize_va_space = 0;
@@ -3192,3 +3200,29 @@ void might_fault(void)
}
EXPORT_SYMBOL(might_fault);
#endif
+
+/**
+ * kernel_text_lock - Take the kernel text modification lock
+ *
+ * Ensures mutual write exclusion of kernel and modules live text
+ * modification. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_lock(void)
+{
+ mutex_lock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_lock);
+
+/**
+ * kernel_text_unlock - Release the kernel text modification lock
+ *
+ * Ensures mutual write exclusion of kernel and modules live text
+ * modification. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_unlock(void)
+{
+ mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_unlock);

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 00:02:32

by Mathieu Desnoyers

Subject: [PATCH] Text Edit Lock - kprobes architecture independent support

Use the mutual exclusion provided by the text edit lock in the kprobes code. It
allows coherent manipulation of the kernel code by other subsystems.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.

(applies on 2.6.29-rc6)

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Ananth N Mavinakayanahalli <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: Roel Kluin <[email protected]>
---
kernel/kprobes.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c 2009-01-30 10:24:45.000000000 -0500
+++ linux-2.6-lttng/kernel/kprobes.c 2009-01-30 10:27:56.000000000 -0500
@@ -43,6 +43,7 @@
#include <linux/seq_file.h>
#include <linux/debugfs.h>
#include <linux/kdebug.h>
+#include <linux/memory.h>

#include <asm-generic/sections.h>
#include <asm/cacheflush.h>
@@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
goto out;
}

+ kernel_text_lock();
ret = arch_prepare_kprobe(p);
if (ret)
- goto out;
+ goto out_unlock_text;

INIT_HLIST_NODE(&p->hlist);
hlist_add_head_rcu(&p->hlist,
@@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr

if (kprobe_enabled)
arch_arm_kprobe(p);
-
+out_unlock_text:
+ kernel_text_unlock();
out:
mutex_unlock(&kprobe_mutex);

@@ -746,8 +749,11 @@ valid_p:
* enabled and not gone - otherwise, the breakpoint would
* already have been removed. We save on flushing icache.
*/
- if (kprobe_enabled && !kprobe_gone(old_p))
+ if (kprobe_enabled && !kprobe_gone(old_p)) {
+ kernel_text_lock();
arch_disarm_kprobe(p);
+ kernel_text_unlock();
+ }
hlist_del_rcu(&old_p->hlist);
} else {
if (p->break_handler && !kprobe_gone(p))
@@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
}

arch_prepare_kretprobe(ri, regs);
-
/* XXX(hch): why is there no hlist_move_head? */
INIT_HLIST_NODE(&ri->hlist);
kretprobe_table_lock(hash, &flags);
@@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
if (kprobe_enabled)
goto already_enabled;

+ kernel_text_lock();
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist)
if (!kprobe_gone(p))
arch_arm_kprobe(p);
}
+ kernel_text_unlock();

kprobe_enabled = true;
printk(KERN_INFO "Kprobes globally enabled\n");
@@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe

kprobe_enabled = false;
printk(KERN_INFO "Kprobes globally disabled\n");
+ kernel_text_lock();
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -1317,6 +1325,7 @@ static void __kprobes disable_all_kprobe
arch_disarm_kprobe(p);
}
}
+ kernel_text_unlock();

mutex_unlock(&kprobe_mutex);
/* Allow all currently running kprobes to complete */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 00:06:55

by Masami Hiramatsu

Subject: Re: [RFC][PATCH] x86: make text_poke() atomic



Ingo Molnar wrote:
> * Masami Hiramatsu <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>>>> So perhaps another approach to (re-)consider would be to go back
>>>>> to atomic fixmaps here. It spends 3 slots but that's no big
>>>>> deal.
>>>> Oh, it's a good idea! fixmaps must make it simpler.
>>>>
>>>>> In exchange it will be conceptually simpler, and will also scale
>>>>> much better than a global spinlock. What do you think?
>>>> I think even if I use fixmaps, we have to use a spinlock to protect
>>>> the fixmap area from other threads...
>>> that's why i suggested to use an atomic-kmap, not a fixmap.
>> Even if the mapping is atomic, text_poke() has to protect pte
>> from other text_poke()s while changing code.
>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>
> Well, but text_poke() is not a serializing API to begin with.
> It's normally used in code patching sequences when we 'know'
> that there cannot be similar parallel activities. The kprobes
> usage of text_poke() looks unsafe - and that needs to be fixed.

Oh, kprobes already prohibits parallel arming/disarming
by using kprobe_mutex. :-)

> So indeed a new global lock is needed there.
>
> It's fixable and we'll fix it, but text_poke() is really more
> complex than i'd like it to be.
>
> stop_machine_run() is essentially instantaneous in practice and
> obviously serializing so it warrants a second look at least.
> Have you tried to use it in kprobes?

No, but it seems to cost too much for incremental use (registration)
of kprobes...

Thank you,

>
> Ingo

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-03 00:11:20

by Masami Hiramatsu

Subject: Re: [PATCH] Text Edit Lock - kprobes architecture independent support

Mathieu Desnoyers wrote:
> Use the mutual exclusion provided by the text edit lock in the kprobes code. It
> allows coherent manipulation of the kernel code by other subsystems.

Oh, I see what you said...
This seems really useful.

Acked-by: Masami Hiramatsu <[email protected]>

>
> Changelog:
>
> Move the kernel_text_lock/unlock out of the for loops.
>
> (applies on 2.6.29-rc6)
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> Acked-by: Ananth N Mavinakayanahalli <[email protected]>
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: Roel Kluin <[email protected]>
> ---
> kernel/kprobes.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> Index: linux-2.6-lttng/kernel/kprobes.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/kprobes.c 2009-01-30 10:24:45.000000000 -0500
> +++ linux-2.6-lttng/kernel/kprobes.c 2009-01-30 10:27:56.000000000 -0500
> @@ -43,6 +43,7 @@
> #include <linux/seq_file.h>
> #include <linux/debugfs.h>
> #include <linux/kdebug.h>
> +#include <linux/memory.h>
>
> #include <asm-generic/sections.h>
> #include <asm/cacheflush.h>
> @@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
> goto out;
> }
>
> + kernel_text_lock();
> ret = arch_prepare_kprobe(p);
> if (ret)
> - goto out;
> + goto out_unlock_text;
>
> INIT_HLIST_NODE(&p->hlist);
> hlist_add_head_rcu(&p->hlist,
> @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
>
> if (kprobe_enabled)
> arch_arm_kprobe(p);
> -
> +out_unlock_text:
> + kernel_text_unlock();
> out:
> mutex_unlock(&kprobe_mutex);
>
> @@ -746,8 +749,11 @@ valid_p:
> * enabled and not gone - otherwise, the breakpoint would
> * already have been removed. We save on flushing icache.
> */
> - if (kprobe_enabled && !kprobe_gone(old_p))
> + if (kprobe_enabled && !kprobe_gone(old_p)) {
> + kernel_text_lock();
> arch_disarm_kprobe(p);
> + kernel_text_unlock();
> + }
> hlist_del_rcu(&old_p->hlist);
> } else {
> if (p->break_handler && !kprobe_gone(p))
> @@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
> }
>
> arch_prepare_kretprobe(ri, regs);
> -
> /* XXX(hch): why is there no hlist_move_head? */
> INIT_HLIST_NODE(&ri->hlist);
> kretprobe_table_lock(hash, &flags);
> @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> if (kprobe_enabled)
> goto already_enabled;
>
> + kernel_text_lock();
> for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> head = &kprobe_table[i];
> hlist_for_each_entry_rcu(p, node, head, hlist)
> if (!kprobe_gone(p))
> arch_arm_kprobe(p);
> }
> + kernel_text_unlock();
>
> kprobe_enabled = true;
> printk(KERN_INFO "Kprobes globally enabled\n");
> @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
>
> kprobe_enabled = false;
> printk(KERN_INFO "Kprobes globally disabled\n");
> + kernel_text_lock();
> for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> head = &kprobe_table[i];
> hlist_for_each_entry_rcu(p, node, head, hlist) {
> @@ -1317,6 +1325,7 @@ static void __kprobes disable_all_kprobe
> arch_disarm_kprobe(p);
> }
> }
> + kernel_text_unlock();
>
> mutex_unlock(&kprobe_mutex);
> /* Allow all currently running kprobes to complete */

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-03 00:23:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] x86: make text_poke() atomic


* Masami Hiramatsu <[email protected]> wrote:

>
>
> Ingo Molnar wrote:
> > * Masami Hiramatsu <[email protected]> wrote:
> >
> >> Ingo Molnar wrote:
> >>>>> So perhaps another approach to (re-)consider would be to go back
> >>>>> to atomic fixmaps here. It spends 3 slots but that's no big
> >>>>> deal.
> >>>> Oh, it's a good idea! fixmaps must make it simpler.
> >>>>
> >>>>> In exchange it will be conceptually simpler, and will also scale
> >>>>> much better than a global spinlock. What do you think?
> >>>> I think even if I use fixmaps, we have to use a spinlock to protect
> >>>> the fixmap area from other threads...
> >>> that's why i suggested to use an atomic-kmap, not a fixmap.
> >> Even if the mapping is atomic, text_poke() has to protect pte
> >> from other text_poke()s while changing code.
> >> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> >
> > Well, but text_poke() is not a serializing API to begin with.
> > It's normally used in code patching sequences when we 'know'
> > that there cannot be similar parallel activities. The kprobes
> > usage of text_poke() looks unsafe - and that needs to be fixed.
>
> Oh, kprobes already prohibited parallel arming/disarming
> by using kprobe_mutex. :-)

yeah, but still the API is somewhat unsafe.

In any case, you also answered your own question:

> >> Even if the mapping is atomic, text_poke() has to protect pte
> >> from other text_poke()s while changing code.
> >> AFAIK, atomic-kmap itself doesn't ensure that, does it?

kprobe_mutex does that.

Ingo

2009-03-03 00:32:22

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

Ingo Molnar wrote:
> * Masami Hiramatsu <[email protected]> wrote:
>
>>
>> Ingo Molnar wrote:
>>> * Masami Hiramatsu <[email protected]> wrote:
>>>
>>>> Ingo Molnar wrote:
>>>>>>> So perhaps another approach to (re-)consider would be to go back
>>>>>>> to atomic fixmaps here. It spends 3 slots but that's no big
>>>>>>> deal.
>>>>>> Oh, it's a good idea! fixmaps must make it simpler.
>>>>>>
>>>>>>> In exchange it will be conceptually simpler, and will also scale
>>>>>>> much better than a global spinlock. What do you think?
>>>>>> I think even if I use fixmaps, we have to use a spinlock to protect
>>>>>> the fixmap area from other threads...
>>>>> that's why i suggested to use an atomic-kmap, not a fixmap.
>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>> from other text_poke()s while changing code.
>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>>> Well, but text_poke() is not a serializing API to begin with.
>>> It's normally used in code patching sequences when we 'know'
>>> that there cannot be similar parallel activities. The kprobes
>>> usage of text_poke() looks unsafe - and that needs to be fixed.
>> Oh, kprobes already prohibited parallel arming/disarming
>> by using kprobe_mutex. :-)
>
> yeah, but still the API is somewhat unsafe.

Yeah, kprobe_mutex protects text_poke from other kprobes, but
not from other text_poke() users...

> In any case, you also answered your own question:
>
>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>> from other text_poke()s while changing code.
>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>
> kprobe_mutex does that.

Anyway, text_edit_lock ensures that.

By the way, I think set_fixmap/clear_fixmap seems simpler than the
kmap_atomic() variant. Do you think improving kmap_atomic_prot()
would be better?
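
For reference, a minimal sketch of the set_fixmap()-based poking under
discussion (FIX_TEXT_POKE0 is the slot proposed later in this thread;
the helper name is illustrative). The fixed slot is shared, which is
exactly why an outer lock such as text_edit_lock is still needed:

        static void poke_via_fixmap(struct page *page, unsigned long offset,
                                    const void *opcode, size_t len)
        {
                char *vaddr;

                /* single shared slot: caller must hold the text edit lock */
                set_fixmap(FIX_TEXT_POKE0, page_to_phys(page));
                vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
                memcpy(vaddr + offset, opcode, len);
                clear_fixmap(FIX_TEXT_POKE0);
                local_flush_tlb();      /* the mapping was only used locally */
        }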

>
> Ingo

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-03 00:33:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Text Edit Lock - Architecture Independent Code


* Mathieu Desnoyers <[email protected]> wrote:

> +/*
> + * Take and release the kernel text modification lock, used for code patching.
> + * Users of this lock can sleep.
> + */
> +extern void kernel_text_lock(void);
> +extern void kernel_text_unlock(void);

Locking APIs with hidden semantics are very ugly. Remember
lock_kernel()?

> +/*
> + * mutex protecting text section modification (dynamic code patching).
> + * some users need to sleep (allocating memory...) while they hold this lock.
> + */
> +static DEFINE_MUTEX(text_mutex);

Please update those sites to do an explicit:

mutex_lock(&text_mutex);

instead.

That way we save a function call, and we'll also see exactly
what type of lock is being taken, etc.
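
Concretely, a call site would go from the opaque wrapper to the named
mutex (a sketch based on the quoted patch):

        /* before: the wrapper hides which lock is taken */
        kernel_text_lock();
        arch_arm_kprobe(p);
        kernel_text_unlock();

        /* after: the lock is visible at the call site */
        mutex_lock(&text_mutex);
        arch_arm_kprobe(p);
        mutex_unlock(&text_mutex);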

Ingo

2009-03-03 00:39:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] Text Edit Lock - Architecture Independent Code

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > +/*
> > + * Take and release the kernel text modification lock, used for code patching.
> > + * Users of this lock can sleep.
> > + */
> > +extern void kernel_text_lock(void);
> > +extern void kernel_text_unlock(void);
>
> Locking APIs with hidden semantics are very ugly. Remember
> lock_kernel()?
>
> > +/*
> > + * mutex protecting text section modification (dynamic code patching).
> > + * some users need to sleep (allocating memory...) while they hold this lock.
> > + */
> > +static DEFINE_MUTEX(text_mutex);
>
> Please update those sites to do an explicit:
>
> mutex_lock(&text_mutex);
>
> instead.
>
> That way we save a function call, and we'll also see exactly
> what type of lock is being taken, etc.
>

OK. However we'll have to export the text_mutex symbol and use it in
various locations. As long as we are fine with that, I'll provide an
updated patch.

Mathieu

> Ingo

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 01:30:47

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] Text Edit Lock - Architecture Independent Code (v2)

This is an architecture-independent synchronization around kernel text
modifications through use of a global mutex.

A mutex has been chosen so that kprobes, the main user of this, can sleep during
memory allocation between the memory read of the instructions it must replace
and the memory write of the breakpoint.

Another user of this interface: immediate values.

Paravirt and alternatives are always done when SMP is inactive, so there is no
need to use locks.
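
As an illustration of the sleeping user mentioned above (a sketch only;
addr, len and saved are placeholders):

        static const unsigned char brk = 0xcc;  /* x86 INT3 */
        u8 *saved;

        mutex_lock(&text_mutex);
        /* sleeping is allowed here, e.g. to save the original insn */
        saved = kmalloc(len, GFP_KERNEL);
        if (saved) {
                memcpy(saved, addr, len);
                text_poke(addr, &brk, 1);       /* install the breakpoint */
        }
        mutex_unlock(&text_mutex);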

Changelog:
Export text_mutex directly.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Ingo Molnar <[email protected]>
---
include/linux/memory.h | 6 ++++++
mm/memory.c | 9 +++++++++
2 files changed, 15 insertions(+)

Index: linux-2.6-lttng/include/linux/memory.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/memory.h 2009-03-02 13:13:35.000000000 -0500
+++ linux-2.6-lttng/include/linux/memory.h 2009-03-02 19:23:52.000000000 -0500
@@ -99,4 +99,10 @@ enum mem_add_context { BOOT, HOTPLUG };
#define hotplug_memory_notifier(fn, pri) do { } while (0)
#endif

+/*
+ * Kernel text modification mutex, used for code patching. Users of this lock
+ * can sleep.
+ */
+extern struct mutex text_mutex;
+
#endif /* _LINUX_MEMORY_H_ */
Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c 2009-03-02 13:13:35.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c 2009-03-02 19:24:33.000000000 -0500
@@ -48,6 +48,8 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/delayacct.h>
+#include <linux/kprobes.h>
+#include <linux/mutex.h>
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
@@ -99,6 +101,13 @@ int randomize_va_space __read_mostly =
2;
#endif

+/*
+ * mutex protecting text section modification (dynamic code patching).
+ * some users need to sleep (allocating memory...) while they hold this lock.
+ */
+DEFINE_MUTEX(text_mutex);
+EXPORT_SYMBOL_GPL(text_mutex);
+
static int __init disable_randmaps(char *s)
{
randomize_va_space = 0;
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 01:31:37

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)

Use the mutual exclusion provided by the text edit lock in the kprobes code. It
allows coherent manipulation of the kernel code by other subsystems.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.
Use text_mutex directly instead of a function.

(applies on 2.6.29-rc6)

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Ananth N Mavinakayanahalli <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: Roel Kluin <[email protected]>
---
kernel/kprobes.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c 2009-03-02 19:22:51.000000000 -0500
+++ linux-2.6-lttng/kernel/kprobes.c 2009-03-02 19:27:26.000000000 -0500
@@ -43,6 +43,7 @@
#include <linux/seq_file.h>
#include <linux/debugfs.h>
#include <linux/kdebug.h>
+#include <linux/memory.h>

#include <asm-generic/sections.h>
#include <asm/cacheflush.h>
@@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
goto out;
}

+ mutex_lock(&text_mutex);
ret = arch_prepare_kprobe(p);
if (ret)
- goto out;
+ goto out_unlock_text;

INIT_HLIST_NODE(&p->hlist);
hlist_add_head_rcu(&p->hlist,
@@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr

if (kprobe_enabled)
arch_arm_kprobe(p);
-
+out_unlock_text:
+ mutex_unlock(&text_mutex);
out:
mutex_unlock(&kprobe_mutex);

@@ -746,8 +749,11 @@ valid_p:
* enabled and not gone - otherwise, the breakpoint would
* already have been removed. We save on flushing icache.
*/
- if (kprobe_enabled && !kprobe_gone(old_p))
+ if (kprobe_enabled && !kprobe_gone(old_p)) {
+ mutex_lock(&text_mutex);
arch_disarm_kprobe(p);
+ mutex_unlock(&text_mutex);
+ }
hlist_del_rcu(&old_p->hlist);
} else {
if (p->break_handler && !kprobe_gone(p))
@@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
}

arch_prepare_kretprobe(ri, regs);
-
/* XXX(hch): why is there no hlist_move_head? */
INIT_HLIST_NODE(&ri->hlist);
kretprobe_table_lock(hash, &flags);
@@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
if (kprobe_enabled)
goto already_enabled;

+ mutex_lock(&text_mutex);
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist)
if (!kprobe_gone(p))
arch_arm_kprobe(p);
}
+ mutex_unlock(&text_mutex);

kprobe_enabled = true;
printk(KERN_INFO "Kprobes globally enabled\n");
@@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe

kprobe_enabled = false;
printk(KERN_INFO "Kprobes globally disabled\n");
+ mutex_lock(&text_mutex);
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
arch_disarm_kprobe(p);
}
}
-
+ mutex_unlock(&text_mutex);
mutex_unlock(&kprobe_mutex);
/* Allow all currently running kprobes to complete */
synchronize_sched();
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 04:54:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] x86: make text_poke() atomic

On Mon, Mar 02, 2009 at 12:01:29PM -0500, Masami Hiramatsu wrote:
> ---
>
> Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
> and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
> directly because this mapping is temporary and local.
>
> As a result of the above change, text_poke() becomes atomic and can be called
> from stop_machine() etc.
>
> Signed-off-by: Masami Hiramatsu <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Nick Piggin <[email protected]>
> ---
> arch/x86/include/asm/alternative.h | 1 +
> arch/x86/kernel/alternative.c | 36 +++++++++++++++++++++++++++++-------
> include/linux/vmalloc.h | 1 +
> init/main.c | 3 +++
> mm/vmalloc.c | 2 +-
> 5 files changed, 35 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/alternative.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
> +++ linux-2.6/arch/x86/include/asm/alternative.h
> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
> * The _early version expects the memory to already be RW.
> */
>
> +extern void text_poke_init(void);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -12,6 +12,7 @@
> #include <asm/nmi.h>
> #include <asm/vsyscall.h>
> #include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
> #include <asm/io.h>
>
> #define MAX_PATCH_LEN (255-1)
> @@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
> return addr;
> }
>
> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> + text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> + text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> + BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> +}
> +
> /**
> * text_poke - Update instructions on a live kernel
> * @addr: address to modify
> @@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
> unsigned long flags;
> char *vaddr;
> int nr_pages = 2;
> - struct page *pages[2];
> - int i;
> + struct page *pages[2], **pgp = pages;
> + int i, ret;
> + struct vm_struct *vma;
>
> if (!core_kernel_text((unsigned long)addr)) {
> pages[0] = vmalloc_to_page(addr);

This is really good....

> @@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
> BUG_ON(!pages[0]);
> if (!pages[1])
> nr_pages = 1;
> - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> - BUG_ON(!vaddr);
^^^^^^^^^^^^^^
This really is a nasty bug in text_poke, and I never knew why it was
allowed to live for so long!
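
The point is that vmap() both allocates a vm_struct and can return
NULL, which the BUG_ON turned into a crash on the patching path.
Reserving the area once at boot, as the quoted patch does, confines the
failure to init time (variable names as in that patch):

        /* boot time: reserve the address range once; a failure here
         * can still safely BUG() */
        text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
        BUG_ON(!text_poke_area[0]);

        /* poke time: only wire pre-allocated pages into the reserved
         * range; no vm_struct allocation, so vmap()'s NULL return and
         * its BUG_ON disappear from the patching path */
        ret = map_vm_area(text_poke_area[0], PAGE_KERNEL, &pgp);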

Thanks,
Nick

2009-03-03 09:28:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)


* Mathieu Desnoyers <[email protected]> wrote:

> @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
>
> if (kprobe_enabled)
> arch_arm_kprobe(p);

hm, it's cleaner now, but there's serious locking dependency
problems visible in the patch:

> -
> +out_unlock_text:
> + mutex_unlock(&text_mutex);
> out:
> mutex_unlock(&kprobe_mutex);

this one creates a (text_mutex -> kprobe_mutex) dependency.
(also you removed a newline spuriously - don't do that)

> @@ -746,8 +749,11 @@ valid_p:
> * enabled and not gone - otherwise, the breakpoint would
> * already have been removed. We save on flushing icache.
> */
> - if (kprobe_enabled && !kprobe_gone(old_p))
> + if (kprobe_enabled && !kprobe_gone(old_p)) {
> + mutex_lock(&text_mutex);
> arch_disarm_kprobe(p);
> + mutex_unlock(&text_mutex);
> + }
> hlist_del_rcu(&old_p->hlist);

(kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.

> } else {
> if (p->break_handler && !kprobe_gone(p))
> @@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
> }
>
> arch_prepare_kretprobe(ri, regs);
> -

spurious (and wrong) newline removal.

> /* XXX(hch): why is there no hlist_move_head? */
> INIT_HLIST_NODE(&ri->hlist);
> kretprobe_table_lock(hash, &flags);
> @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> if (kprobe_enabled)
> goto already_enabled;
>
> + mutex_lock(&text_mutex);
> for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> head = &kprobe_table[i];
> hlist_for_each_entry_rcu(p, node, head, hlist)
> if (!kprobe_gone(p))
> arch_arm_kprobe(p);
> }
> + mutex_unlock(&text_mutex);

this one creates a (kprobe_mutex -> text_mutex) dependency
again.

> @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
>
> kprobe_enabled = false;
> printk(KERN_INFO "Kprobes globally disabled\n");
> + mutex_lock(&text_mutex);
> for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> head = &kprobe_table[i];
> hlist_for_each_entry_rcu(p, node, head, hlist) {
> @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> arch_disarm_kprobe(p);
> }
> }
> -
> + mutex_unlock(&text_mutex);
> mutex_unlock(&kprobe_mutex);

And this one in the reverse direction again.

The dependencies are totally wrong. The text lock (a low level
lock) should nest inside the kprobes mutex (which is the higher
level lock).

Have you reviewed the locking dependencies when writing this
patch, at all? That's one of the first things to do when adding
a new lock.
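
The nesting being asked for here, as a sketch: text_mutex taken only
while kprobe_mutex is already held:

        mutex_lock(&kprobe_mutex);      /* outer: high-level kprobes state */
        mutex_lock(&text_mutex);        /* inner: low-level text patching */
        arch_arm_kprobe(p);
        mutex_unlock(&text_mutex);
        mutex_unlock(&kprobe_mutex);

(As the follow-ups below note, the patch does take the locks in this
order at every site.)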

Ingo

by Ananth N Mavinakayanahalli

[permalink] [raw]
Subject: Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)

On Tue, Mar 03, 2009 at 10:27:50AM +0100, Ingo Molnar wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr

Hi Ingo,

> > if (kprobe_enabled)
> > arch_arm_kprobe(p);
>
> hm, it's cleaner now, but there's serious locking dependency
> problems visible in the patch:
>
> > -
> > +out_unlock_text:
> > + mutex_unlock(&text_mutex);
> > out:
> > mutex_unlock(&kprobe_mutex);
>
> this one creates a (text_mutex -> kprobe_mutex) dependency.
> (also you removed a newline spuriously - don't do that)

That is a mutex_unlock :-) ...

> > @@ -746,8 +749,11 @@ valid_p:
> > * enabled and not gone - otherwise, the breakpoint would
> > * already have been removed. We save on flushing icache.
> > */
> > - if (kprobe_enabled && !kprobe_gone(old_p))
> > + if (kprobe_enabled && !kprobe_gone(old_p)) {
> > + mutex_lock(&text_mutex);
> > arch_disarm_kprobe(p);
> > + mutex_unlock(&text_mutex);
> > + }
> > hlist_del_rcu(&old_p->hlist);
>
> (kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.

At this time the kprobe_mutex is already held.

...

> > @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> > if (kprobe_enabled)
> > goto already_enabled;
> >
> > + mutex_lock(&text_mutex);
> > for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > head = &kprobe_table[i];
> > hlist_for_each_entry_rcu(p, node, head, hlist)
> > if (!kprobe_gone(p))
> > arch_arm_kprobe(p);
> > }
> > + mutex_unlock(&text_mutex);
>
> this one creates a (kprobe_mutex -> text_mutex) dependency
> again.

kprobe_mutex is held here too...

> > @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
> >
> > kprobe_enabled = false;
> > printk(KERN_INFO "Kprobes globally disabled\n");
> > + mutex_lock(&text_mutex);
> > for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > head = &kprobe_table[i];
> > hlist_for_each_entry_rcu(p, node, head, hlist) {
> > @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> > arch_disarm_kprobe(p);
> > }
> > }
> > -
> > + mutex_unlock(&text_mutex);
> > mutex_unlock(&kprobe_mutex);
>
> And this one in the reverse direction again.

Unlock again :-)

> The dependencies are totally wrong. The text lock (a low level
> lock) should nest inside the kprobes mutex (which is the higher
> level lock).

From what I see, Mathieu has done just that and has gotten the order
right in all cases. Or maybe I am missing something?

(I recall having tested this patch with LOCKDEP turned on and it
didn't throw any errors).

Ananth

2009-03-03 14:33:32

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)

* Ananth N Mavinakayanahalli ([email protected]) wrote:
> On Tue, Mar 03, 2009 at 10:27:50AM +0100, Ingo Molnar wrote:
> >
> > * Mathieu Desnoyers <[email protected]> wrote:
> >
> > > @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
>
> Hi Ingo,
>
> > > if (kprobe_enabled)
> > > arch_arm_kprobe(p);
> >
> > hm, it's cleaner now, but there's serious locking dependency
> > problems visible in the patch:
> >
> > > -
> > > +out_unlock_text:
> > > + mutex_unlock(&text_mutex);
> > > out:
> > > mutex_unlock(&kprobe_mutex);
> >
> > this one creates a (text_mutex -> kprobe_mutex) dependency.
> > (also you removed a newline spuriously - don't do that)
>
> That is a mutex_unlock :-) ...
>
> > > @@ -746,8 +749,11 @@ valid_p:
> > > * enabled and not gone - otherwise, the breakpoint would
> > > * already have been removed. We save on flushing icache.
> > > */
> > > - if (kprobe_enabled && !kprobe_gone(old_p))
> > > + if (kprobe_enabled && !kprobe_gone(old_p)) {
> > > + mutex_lock(&text_mutex);
> > > arch_disarm_kprobe(p);
> > > + mutex_unlock(&text_mutex);
> > > + }
> > > hlist_del_rcu(&old_p->hlist);
> >
> > (kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.
>
> At this time the kprobe_mutex is already held.
>
> ...
>
> > > @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> > > if (kprobe_enabled)
> > > goto already_enabled;
> > >
> > > + mutex_lock(&text_mutex);
> > > for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > > head = &kprobe_table[i];
> > > hlist_for_each_entry_rcu(p, node, head, hlist)
> > > if (!kprobe_gone(p))
> > > arch_arm_kprobe(p);
> > > }
> > > + mutex_unlock(&text_mutex);
> >
> > this one creates a (kprobe_mutex -> text_mutex) dependency
> > again.
>
> kprobe_mutex is held here too...
>
> > > @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
> > >
> > > kprobe_enabled = false;
> > > printk(KERN_INFO "Kprobes globally disabled\n");
> > > + mutex_lock(&text_mutex);
> > > for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > > head = &kprobe_table[i];
> > > hlist_for_each_entry_rcu(p, node, head, hlist) {
> > > @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> > > arch_disarm_kprobe(p);
> > > }
> > > }
> > > -
> > > + mutex_unlock(&text_mutex);
> > > mutex_unlock(&kprobe_mutex);
> >
> > And this one in the reverse direction again.
>
> Unlock again :-)
>
> > The dependencies are totally wrong. The text lock (a low level
> > lock) should nest inside the kprobes mutex (which is the higher
> > level lock).
>
> From what I see, Mathieu has done just that and has gotten the order
> right in all cases. Or maybe I am missing something?
>
> (I recall having tested this patch with LOCKDEP turned on and it
> din't throw any errors).
>

Yes, I even moved all kprobe_mutexes out of arch_arm_kprobe/arch_disarm_kprobe
a while ago in preparation for this patch. :) I can repost without the
whitespace modifications.

Mathieu

> Ananth

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 14:33:50

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] Text Edit Lock - kprobes architecture independent support (v3)

Use the mutual exclusion provided by the text edit lock in the kprobes code. It
allows coherent manipulation of the kernel code by other subsystems.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.
Use text_mutex directly instead of a function.
Remove whitespace modifications.

(note: kprobe_mutex is always taken outside of text_mutex)

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Ananth N Mavinakayanahalli <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: Roel Kluin <[email protected]>
---
kernel/kprobes.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c 2009-03-02 19:22:51.000000000 -0500
+++ linux-2.6-lttng/kernel/kprobes.c 2009-03-03 09:11:17.000000000 -0500
@@ -43,6 +43,7 @@
#include <linux/seq_file.h>
#include <linux/debugfs.h>
#include <linux/kdebug.h>
+#include <linux/memory.h>

#include <asm-generic/sections.h>
#include <asm/cacheflush.h>
@@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
goto out;
}

+ mutex_lock(&text_mutex);
ret = arch_prepare_kprobe(p);
if (ret)
- goto out;
+ goto out_unlock_text;

INIT_HLIST_NODE(&p->hlist);
hlist_add_head_rcu(&p->hlist,
@@ -710,6 +712,8 @@ int __kprobes register_kprobe(struct kpr
if (kprobe_enabled)
arch_arm_kprobe(p);

+out_unlock_text:
+ mutex_unlock(&text_mutex);
out:
mutex_unlock(&kprobe_mutex);

@@ -746,8 +750,11 @@ valid_p:
* enabled and not gone - otherwise, the breakpoint would
* already have been removed. We save on flushing icache.
*/
- if (kprobe_enabled && !kprobe_gone(old_p))
+ if (kprobe_enabled && !kprobe_gone(old_p)) {
+ mutex_lock(&text_mutex);
arch_disarm_kprobe(p);
+ mutex_unlock(&text_mutex);
+ }
hlist_del_rcu(&old_p->hlist);
} else {
if (p->break_handler && !kprobe_gone(p))
@@ -1280,12 +1287,14 @@ static void __kprobes enable_all_kprobes
if (kprobe_enabled)
goto already_enabled;

+ mutex_lock(&text_mutex);
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist)
if (!kprobe_gone(p))
arch_arm_kprobe(p);
}
+ mutex_unlock(&text_mutex);

kprobe_enabled = true;
printk(KERN_INFO "Kprobes globally enabled\n");
@@ -1310,6 +1319,7 @@ static void __kprobes disable_all_kprobe

kprobe_enabled = false;
printk(KERN_INFO "Kprobes globally disabled\n");
+ mutex_lock(&text_mutex);
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -1318,6 +1328,7 @@ static void __kprobes disable_all_kprobe
}
}

+ mutex_unlock(&text_mutex);
mutex_unlock(&kprobe_mutex);
/* Allow all currently running kprobes to complete */
synchronize_sched();

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-03 14:54:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)


* Ananth N Mavinakayanahalli <[email protected]> wrote:

> On Tue, Mar 03, 2009 at 10:27:50AM +0100, Ingo Molnar wrote:
> >
> > * Mathieu Desnoyers <[email protected]> wrote:
> >
> > > @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
>
> Hi Ingo,
>
> > > if (kprobe_enabled)
> > > arch_arm_kprobe(p);
> >
> > hm, it's cleaner now, but there's serious locking dependency
> > problems visible in the patch:
> >
> > > -
> > > +out_unlock_text:
> > > + mutex_unlock(&text_mutex);
> > > out:
> > > mutex_unlock(&kprobe_mutex);
> >
> > this one creates a (text_mutex -> kprobe_mutex) dependency.
> > (also you removed a newline spuriously - don't do that)
>
> That is a mutex_unlock :-) ...
>
> > > @@ -746,8 +749,11 @@ valid_p:
> > > * enabled and not gone - otherwise, the breakpoint would
> > > * already have been removed. We save on flushing icache.
> > > */
> > > - if (kprobe_enabled && !kprobe_gone(old_p))
> > > + if (kprobe_enabled && !kprobe_gone(old_p)) {
> > > + mutex_lock(&text_mutex);
> > > arch_disarm_kprobe(p);
> > > + mutex_unlock(&text_mutex);
> > > + }
> > > hlist_del_rcu(&old_p->hlist);
> >
> > (kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.
>
> At this time the kprobe_mutex is already held.
>
> ...
>
> > > @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> > > if (kprobe_enabled)
> > > goto already_enabled;
> > >
> > > + mutex_lock(&text_mutex);
> > > for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > > head = &kprobe_table[i];
> > > hlist_for_each_entry_rcu(p, node, head, hlist)
> > > if (!kprobe_gone(p))
> > > arch_arm_kprobe(p);
> > > }
> > > + mutex_unlock(&text_mutex);
> >
> > this one creates a (kprobe_mutex -> text_mutex) dependency
> > again.
>
> kprobe_mutex is held here too...
>
> > > @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
> > >
> > > kprobe_enabled = false;
> > > printk(KERN_INFO "Kprobes globally disabled\n");
> > > + mutex_lock(&text_mutex);
> > > for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > > head = &kprobe_table[i];
> > > hlist_for_each_entry_rcu(p, node, head, hlist) {
> > > @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> > > arch_disarm_kprobe(p);
> > > }
> > > }
> > > -
> > > + mutex_unlock(&text_mutex);
> > > mutex_unlock(&kprobe_mutex);
> >
> > And this one in the reverse direction again.
>
> Unlock again :-)
>
> > The dependencies are totally wrong. The text lock (a low level
> > lock) should nest inside the kprobes mutex (which is the higher
> > level lock).
>
> From what I see, Mathieu has done just that and has gotten the
> order right in all cases. Or maybe I am missing something?

No, it's fine indeed, i got the locking order messed up ...
twice :-)

Ingo

2009-03-03 16:32:19

by Masami Hiramatsu

[permalink] [raw]
Subject: [PATCH] x86: make text_poke() atomic using fixmap

Masami Hiramatsu wrote:
> Ingo Molnar wrote:
>> * Masami Hiramatsu <[email protected]> wrote:
>>
>>> Ingo Molnar wrote:
>>>> * Masami Hiramatsu <[email protected]> wrote:
>>>>
>>>>> Ingo Molnar wrote:
>>>>>>>> So perhaps another approach to (re-)consider would be to go back
>>>>>>>> to atomic fixmaps here. It spends 3 slots but that's no big
>>>>>>>> deal.
>>>>>>> Oh, it's a good idea! fixmaps must make it simpler.
>>>>>>>
>>>>>>>> In exchange it will be conceptually simpler, and will also scale
>>>>>>>> much better than a global spinlock. What do you think?
>>>>>>> I think even if I use fixmaps, we have to use a spinlock to protect
>>>>>>> the fixmap area from other threads...
>>>>>> that's why i suggested to use an atomic-kmap, not a fixmap.
>>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>>> from other text_poke()s while changing code.
>>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>>>> Well, but text_poke() is not a serializing API to begin with.
>>>> It's normally used in code patching sequences when we 'know'
>>>> that there cannot be similar parallel activities. The kprobes
>>>> usage of text_poke() looks unsafe - and that needs to be fixed.
>>> Oh, kprobes already prohibited parallel arming/disarming
>>> by using kprobe_mutex. :-)
>> yeah, but still the API is somewhat unsafe.
>
> Yeah, kprobe_mutex protects text_poke from other kprobes, but
> not from other text_poke() users...
>
>> In any case, you also answered your own question:
>>
>>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>>> from other text_poke()s while changing code.
>>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>> kprobe_mutex does that.
>
> Anyway, text_edit_lock ensures that.
>
> By the way, I think set_fixmap/clear_fixmap seems simpler than
> kmap_atomic() variant. Would you think improving kmap_atomic_prot()
> is better?

Hi Ingo,

Here is the patch which uses fixmaps instead of vmap in text_poke().
This made the code much simpler than I thought :).

Thanks,

----
Use fixmaps instead of vmap/vunmap in text_poke() to avoid page allocation
and delayed unmapping.

As a result of the above change, text_poke() becomes atomic and can be called
from stop_machine() etc.
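
A minimal usage sketch of what that atomicity buys (patch_addr,
patch_opcode and patch_len are placeholders, not part of the patch):

        static int do_poke(void *unused)
        {
                /* legal now: text_poke() neither allocates nor sleeps */
                text_poke(patch_addr, patch_opcode, patch_len);
                return 0;
        }

        /* all CPUs are stopped while do_poke() runs */
        stop_machine(do_poke, NULL, NULL);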

Signed-off-by: Masami Hiramatsu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
---
arch/x86/include/asm/fixmap_32.h | 2 ++
arch/x86/include/asm/fixmap_64.h | 2 ++
arch/x86/kernel/alternative.c | 18 ++++++++++++------
3 files changed, 16 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/include/asm/fixmap_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/fixmap_32.h
+++ linux-2.6/arch/x86/include/asm/fixmap_32.h
@@ -81,6 +81,8 @@ enum fixed_addresses {
#ifdef CONFIG_PARAVIRT
FIX_PARAVIRT_BOOTMAP,
#endif
+ FIX_TEXT_POKE0, /* reserve 2 pages for text_poke() */
+ FIX_TEXT_POKE1,
__end_of_permanent_fixed_addresses,
/*
* 256 temporary boot-time mappings, used by early_ioremap(),
Index: linux-2.6/arch/x86/include/asm/fixmap_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/fixmap_64.h
+++ linux-2.6/arch/x86/include/asm/fixmap_64.h
@@ -49,6 +49,8 @@ enum fixed_addresses {
#ifdef CONFIG_PARAVIRT
FIX_PARAVIRT_BOOTMAP,
#endif
+ FIX_TEXT_POKE0, /* reserve 2 pages for text_poke() */
+ FIX_TEXT_POKE1,
__end_of_permanent_fixed_addresses,
#ifdef CONFIG_ACPI
FIX_ACPI_BEGIN,
Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -12,7 +12,9 @@
#include <asm/nmi.h>
#include <asm/vsyscall.h>
#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
#include <asm/io.h>
+#include <asm/fixmap.h>

#define MAX_PATCH_LEN (255-1)

@@ -495,12 +497,13 @@ void *text_poke_early(void *addr, const
* It means the size must be writable atomically and the address must be aligned
* in a way that permits an atomic write. It also makes sure we fit on a single
* page.
+ *
+ * Note: Must be called under text_mutex.
*/
void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
{
unsigned long flags;
char *vaddr;
- int nr_pages = 2;
struct page *pages[2];
int i;

@@ -513,14 +516,17 @@ void *__kprobes text_poke(void *addr, co
pages[1] = virt_to_page(addr + PAGE_SIZE);
}
BUG_ON(!pages[0]);
- if (!pages[1])
- nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
+ set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
+ if (pages[1])
+ set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+ vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
local_irq_save(flags);
memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
local_irq_restore(flags);
- vunmap(vaddr);
+ clear_fixmap(FIX_TEXT_POKE0);
+ if (pages[1])
+ clear_fixmap(FIX_TEXT_POKE1);
+ local_flush_tlb();
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-03 17:14:21

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] x86: make text_poke() atomic using fixmap

* Masami Hiramatsu ([email protected]) wrote:
> Masami Hiramatsu wrote:
> > Ingo Molnar wrote:
> >> * Masami Hiramatsu <[email protected]> wrote:
> >>
> >>> Ingo Molnar wrote:
> >>>> * Masami Hiramatsu <[email protected]> wrote:
> >>>>
> >>>>> Ingo Molnar wrote:
> >>>>>>>> So perhaps another approach to (re-)consider would be to go back
> >>>>>>>> to atomic fixmaps here. It spends 3 slots but that's no big
> >>>>>>>> deal.
> >>>>>>> Oh, it's a good idea! fixmaps must make it simpler.
> >>>>>>>
> >>>>>>>> In exchange it will be conceptually simpler, and will also scale
> >>>>>>>> much better than a global spinlock. What do you think?
> >>>>>>> I think even if I use fixmaps, we have to use a spinlock to protect
> >>>>>>> the fixmap area from other threads...
> >>>>>> that's why i suggested to use an atomic-kmap, not a fixmap.
> >>>>> Even if the mapping is atomic, text_poke() has to protect pte
> >>>>> from other text_poke()s while changing code.
> >>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> >>>> Well, but text_poke() is not a serializing API to begin with.
> >>>> It's normally used in code patching sequences when we 'know'
> >>>> that there cannot be similar parallel activities. The kprobes
> >>>> usage of text_poke() looks unsafe - and that needs to be fixed.
> >>> Oh, kprobes already prohibited parallel arming/disarming
> >>> by using kprobe_mutex. :-)
> >> yeah, but still the API is somewhat unsafe.
> >
> > Yeah, kprobe_mutex protects text_poke from other kprobes, but
> > not from other text_poke() users...
> >
> >> In any case, you also answered your own question:
> >>
> >>>>> Even if the mapping is atomic, text_poke() has to protect pte
> >>>>> from other text_poke()s while changing code.
> >>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> >> kprobe_mutex does that.
> >
> > Anyway, text_edit_lock ensures that.
> >
> > By the way, I think set_fixmap/clear_fixmap seems simpler than
> > kmap_atomic() variant. Would you think improving kmap_atomic_prot()
> > is better?
>
> Hi Ingo,
>
> Here is the patch which uses fixmaps instead of vmap in text_poke().
> This made the code much simpler than I thought :).
>
> Thanks,
>
> ----
> Use fixmaps instead of vmap/vunmap in text_poke() to avoid page allocation
> and delayed unmapping.
>
> As a result of the above change, text_poke() becomes atomic and can be called
> from stop_machine() etc.
>

It looks great, thanks !

Acked-by: Mathieu Desnoyers <[email protected]>

> Signed-off-by: Masami Hiramatsu <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> ---
> arch/x86/include/asm/fixmap_32.h | 2 ++
> arch/x86/include/asm/fixmap_64.h | 2 ++
> arch/x86/kernel/alternative.c | 18 ++++++++++++------
> 3 files changed, 16 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/fixmap_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/fixmap_32.h
> +++ linux-2.6/arch/x86/include/asm/fixmap_32.h
> @@ -81,6 +81,8 @@ enum fixed_addresses {
> #ifdef CONFIG_PARAVIRT
> FIX_PARAVIRT_BOOTMAP,
> #endif
> + FIX_TEXT_POKE0, /* reserve 2 pages for text_poke() */
> + FIX_TEXT_POKE1,
> __end_of_permanent_fixed_addresses,
> /*
> * 256 temporary boot-time mappings, used by early_ioremap(),
> Index: linux-2.6/arch/x86/include/asm/fixmap_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/fixmap_64.h
> +++ linux-2.6/arch/x86/include/asm/fixmap_64.h
> @@ -49,6 +49,8 @@ enum fixed_addresses {
> #ifdef CONFIG_PARAVIRT
> FIX_PARAVIRT_BOOTMAP,
> #endif
> + FIX_TEXT_POKE0, /* reserve 2 pages for text_poke() */
> + FIX_TEXT_POKE1,
> __end_of_permanent_fixed_addresses,
> #ifdef CONFIG_ACPI
> FIX_ACPI_BEGIN,
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -12,7 +12,9 @@
> #include <asm/nmi.h>
> #include <asm/vsyscall.h>
> #include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
> #include <asm/io.h>
> +#include <asm/fixmap.h>
>
> #define MAX_PATCH_LEN (255-1)
>
> @@ -495,12 +497,13 @@ void *text_poke_early(void *addr, const
> * It means the size must be writable atomically and the address must be aligned
> * in a way that permits an atomic write. It also makes sure we fit on a single
> * page.
> + *
> + * Note: Must be called under text_mutex.
> */
> void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> {
> unsigned long flags;
> char *vaddr;
> - int nr_pages = 2;
> struct page *pages[2];
> int i;
>
> @@ -513,14 +516,17 @@ void *__kprobes text_poke(void *addr, co
> pages[1] = virt_to_page(addr + PAGE_SIZE);
> }
> BUG_ON(!pages[0]);
> - if (!pages[1])
> - nr_pages = 1;
> - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> - BUG_ON(!vaddr);
> + set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> + if (pages[1])
> + set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> + vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
> local_irq_save(flags);
> memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> local_irq_restore(flags);
> - vunmap(vaddr);
> + clear_fixmap(FIX_TEXT_POKE0);
> + if (pages[1])
> + clear_fixmap(FIX_TEXT_POKE1);
> + local_flush_tlb();
> sync_core();
> /* Could also do a CLFLUSH here to speed up CPU recovery; but
> that causes hangs on some VIA CPUs. */
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-05 10:39:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: make text_poke() atomic using fixmap


* Masami Hiramatsu <[email protected]> wrote:

> Hi Ingo,
>
> Here is the patch which uses fixmaps instead of vmap in
> text_poke(). This made the code much simpler than I thought
> :).

Looks good to me at a quick glance albeit Linus had second
thoughts about using fixmaps for this in the past. But with
delayed-flush for vmaps i think fixmaps are again the simpler
and more robust - albeit more limited - choice ...

In any case, the x86 tree already unified fixmap.h so could you
please resend the whole series as a 0/3, 1/3, 2/3, 3/3 thing
against tip:master, starting a new thread on lkml? (this thread
is already way too deep)

Ingo

2009-03-06 14:07:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: make text_poke() atomic using fixmap


* Ingo Molnar <[email protected]> wrote:

>
> * Masami Hiramatsu <[email protected]> wrote:
>
> > Hi Ingo,
> >
> > Here is the patch which uses fixmaps instead of vmap in
> > text_poke(). This made the code much simpler than I thought
> > :).
>
> Looks good to me at a quick glance albeit Linus had second
> thoughts about using fixmaps for this in the past. But with
> delayed-flush for vmaps i think fixmaps are again the simpler
> and more robust - albeit more limited - choice ...
>
> In any case, the x86 tree already unified fixmap.h so could
> you please resend the whole series as a 0/3, 1/3, 2/3, 3/3
> thing against tip:master, starting a new thread on lkml? (this
> thread is already way too deep)

Ping? I think there's agreement and it would be nice to fix this
in .30. Looks too complex for .29 - maybe backportable to .29.1
if it stays problem-free in testing.

Ingo

2009-03-06 14:50:17

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH] x86: make text_poke() atomic using fixmap

Hi Ingo,

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> * Masami Hiramatsu <[email protected]> wrote:
>>
>>> Hi Ingo,
>>>
>>> Here is the patch which uses fixmaps instead of vmap in
>>> text_poke(). This made the code much simpler than I thought
>>> :).
>> Looks good to me at a quick glance albeit Linus had second
>> thoughts about using fixmaps for this in the past. But with
>> delayed-flush for vmaps i think fixmaps are again the simpler
>> and more robust - albeit more limited - choice ...
>>
>> In any case, the x86 tree already unified fixmap.h so could
>> you please resend the whole series as a 0/3, 1/3, 2/3, 3/3
>> thing against tip:master, starting a new thread on lkml? (this
>> thread is already way too deep)
>
> Ping? I think there's agreement and it would be nice to fix this
> in .30. Looks too complex for .29 - maybe backportable to .29.1
> if it stays problem-free in testing.

Sorry for the delay, I'll post it as soon as possible.

>
> Ingo

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]