On arm64, modules are allocated from a 128 MB window close to the
core kernel, so that relative direct branches are guaranteed to be
in range (except in some KASLR configurations). module_alloc() is
also in charge of allocating KASAN shadow memory when running with
KASAN enabled.
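For illustration, here is a paraphrased sketch of what that coupling
looks like on the allocation side (abbreviated from 4.20-era
arch/arm64/kernel/module.c; simplified, not an exact quote):

	void *module_alloc(unsigned long size)
	{
		void *p;

		/* allocate from the module region, close to the kernel */
		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
					 module_alloc_base + MODULES_VSIZE,
					 GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
					 NUMA_NO_NODE,
					 __builtin_return_address(0));

		/* KASAN shadow is set up for every caller, BPF included */
		if (p && (kasan_module_alloc(p, size) < 0)) {
			vfree(p);
			return NULL;
		}
		return p;
	}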
This means that the way BPF reuses module_alloc()/module_memfree() is
undesirable on arm64 (and potentially other architectures as well),
and so this series refactors BPF's use of those functions to permit
architectures to change this behavior.
Patch #1 breaks out the module_alloc() and module_memfree() calls into
__weak functions so they can be overridden.
Patch #2 implements the new alloc/free overrides for arm64.
Changes since v3:
- drop 'const' modifier for free() hook void* argument
- move the dedicated BPF region to before the module region, putting it
within 4GB of the module and kernel regions on non-KASLR kernels
Changes since v2:
- properly build- and runtime-tested this time (log after the diffstat)
- create a dedicated 128 MB region at the top of the vmalloc space for BPF
programs, ensuring that the programs will be in branching range of each
other (which we currently rely upon) but at an arbitrary distance from
the kernel and modules (which we don't care about)
Changes since v1:
- Drop misguided attempt to 'fix' and refactor the free path. Instead,
just add another __weak wrapper for the invocation of module_memfree()
Cc: Daniel Borkmann <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jessica Yu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Ard Biesheuvel (2):
bpf: add __weak hook for allocating executable memory
arm64/bpf: don't allocate BPF JIT programs in module memory
arch/arm64/include/asm/memory.h | 5 ++++-
arch/arm64/net/bpf_jit_comp.c | 13 +++++++++++++
kernel/bpf/core.c | 14 ++++++++++++--
3 files changed, 29 insertions(+), 3 deletions(-)
--
2.19.1
By default, BPF uses module_alloc() to allocate executable memory,
but this is not necessary on all arches and potentially undesirable
on some of them.
So break out the module_alloc() and module_memfree() calls into __weak
functions to allow them to be overridden in arch code.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
kernel/bpf/core.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1a796e0799ec..78e9b76201b3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -609,6 +609,16 @@ static void bpf_jit_uncharge_modmem(u32 pages)
atomic_long_sub(pages, &bpf_jit_current);
}
+void *__weak bpf_jit_alloc_exec(unsigned long size)
+{
+ return module_alloc(size);
+}
+
+void __weak bpf_jit_free_exec(void *addr)
+{
+ module_memfree(addr);
+}
+
struct bpf_binary_header *
bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
unsigned int alignment,
@@ -626,7 +636,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
if (bpf_jit_charge_modmem(pages))
return NULL;
- hdr = module_alloc(size);
+ hdr = bpf_jit_alloc_exec(size);
if (!hdr) {
bpf_jit_uncharge_modmem(pages);
return NULL;
@@ -650,7 +660,7 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
{
u32 pages = hdr->pages;
- module_memfree(hdr);
+ bpf_jit_free_exec(hdr);
bpf_jit_uncharge_modmem(pages);
}
--
2.19.1
The arm64 module region is a 128 MB region that is kept close to
the core kernel, in order to ensure that relative branches are
always in range. Using the same region for programs that do not
have this restriction is therefore wasteful, and preferably
avoided.
Now that the core BPF JIT code permits the alloc/free routines to
be overridden, implement them using __vmalloc_node_range()/vfree(),
allocating from a dedicated 128 MB region set aside for BPF
programs. This ensures that BPF programs are still in branching
range of each other, which is something the JIT currently depends
upon (and which is not guaranteed when using module_alloc() on
KASLR kernels, as we do currently).
It also ensures that placement of BPF programs does not correlate
with the placement of the core kernel or modules, making it less
likely that leaking the former will reveal the latter.
This also solves an issue under KASAN, where shadow memory is
needlessly allocated for all BPF programs (which don't require
KASAN shadow pages, since they are not KASAN instrumented).
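For reference, the resulting VA layout looks roughly like this
(derived from the memory.h hunk below; region widths not to scale):

	VA_START
	+------------------------+
	| KASAN shadow           |  KASAN_SHADOW_SIZE (0 without KASAN)
	+------------------------+  <-- BPF_JIT_REGION_START
	| BPF JIT region         |  128 MB
	+------------------------+  <-- BPF_JIT_REGION_END == MODULES_VADDR
	| module region          |  128 MB
	+------------------------+  <-- MODULES_END == KIMAGE_VADDR
	| kernel image, etc.     |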
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/memory.h | 5 ++++-
arch/arm64/net/bpf_jit_comp.c | 13 +++++++++++++
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b96442960aea..ee20fc63899c 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -62,8 +62,11 @@
#define PAGE_OFFSET (UL(0xffffffffffffffff) - \
(UL(1) << (VA_BITS - 1)) + 1)
#define KIMAGE_VADDR (MODULES_END)
+#define BPF_JIT_REGION_START (VA_START + KASAN_SHADOW_SIZE)
+#define BPF_JIT_REGION_SIZE (SZ_128M)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
#define MODULES_END (MODULES_VADDR + MODULES_VSIZE)
-#define MODULES_VADDR (VA_START + KASAN_SHADOW_SIZE)
+#define MODULES_VADDR (BPF_JIT_REGION_END)
#define MODULES_VSIZE (SZ_128M)
#define VMEMMAP_START (PAGE_OFFSET - VMEMMAP_SIZE)
#define PCI_IO_END (VMEMMAP_START - SZ_2M)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a6fdaea07c63..76c2ab40c02d 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -940,3 +940,16 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
tmp : orig_prog);
return prog;
}
+
+void *bpf_jit_alloc_exec(unsigned long size)
+{
+ return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
+ BPF_JIT_REGION_END, GFP_KERNEL,
+ PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
+ __builtin_return_address(0));
+}
+
+void bpf_jit_free_exec(void *addr)
+{
+ return vfree(addr);
+}
--
2.19.1
On Fri, 2018-11-23 at 23:18 +0100, Ard Biesheuvel wrote:
> [...]
It looks like some of the architectures call module_alloc directly in their
bpf_jit_compile implementations as well.
Rick
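For context, the direct-call pattern Rick describes would look roughly
like the sketch below in a classic (cBPF) JIT. This is a hypothetical
illustration, not code quoted from any particular architecture; the
point is that such a call never goes through the new hook:

	void bpf_jit_compile(struct bpf_prog *fp)
	{
		unsigned int image_size = 4096;	/* hypothetical size */
		u8 *image;

		/* direct call: bypasses bpf_jit_alloc_exec() entirely */
		image = module_alloc(image_size);
		if (image == NULL)
			return;

		/* ... emit instructions into image, set fp->bpf_func ... */
	}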
On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> [...]
> +void *bpf_jit_alloc_exec(unsigned long size)
> +{
> + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> + BPF_JIT_REGION_END, GFP_KERNEL,
> + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
I guess we'll want VM_IMMEDIATE_UNMAP here if Rick gets that merged. In the
meantime, I wonder if it's worth zeroing the region in bpf_jit_free_exec()?
(although we'd need the size information...).
Will
On Fri, 30 Nov 2018 at 19:26, Will Deacon <[email protected]> wrote:
>
> On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> > [...]
> > +void *bpf_jit_alloc_exec(unsigned long size)
> > +{
> > + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> > + BPF_JIT_REGION_END, GFP_KERNEL,
> > + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> > + __builtin_return_address(0));
>
> I guess we'll want VM_IMMEDIATE_UNMAP here if Rick gets that merged.
I think akpm already queued up that patch.
> In the
> meantime, I wonder if it's worth zeroing the region in bpf_jit_free_exec()?
> (although we'd need the size information...).
>
Not sure. What exactly would that achieve?
On Fri, Nov 30, 2018 at 08:20:06PM +0100, Ard Biesheuvel wrote:
> On Fri, 30 Nov 2018 at 19:26, Will Deacon <[email protected]> wrote:
> >
> > On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> > > [...]
> > > +void *bpf_jit_alloc_exec(unsigned long size)
> > > +{
> > > + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> > > + BPF_JIT_REGION_END, GFP_KERNEL,
> > > + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> > > + __builtin_return_address(0));
> >
> > I guess we'll want VM_IMMEDIATE_UNMAP here if Rick gets that merged.
>
> I think akpm already queued up that patch.
>
> > In the
> > meantime, I wonder if it's worth zeroing the region in bpf_jit_free_exec()?
> > (although we'd need the size information...).
> >
>
> Not sure. What exactly would that achieve?
I think the zero encoding is guaranteed to be undefined, so it would limit
the usefulness of any stale, executable TLB entries. However, we'd also need
cache maintenance to make that stuff visible to the I side, so it's probably
not worth it, especially if akpm has queued the stuff from Rick.
Maybe just add an:
/* FIXME: Remove this when VM_IMMEDIATE_UNMAP is supported */
#ifndef VM_IMMEDIATE_UNMAP
#define VM_IMMEDIATE_UNMAP 0
#endif
so we remember to come back and sort this out? Up to you.
Will
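For concreteness, a sketch of how Will's fallback define would slot
into the allocation path: the seventh __vmalloc_node_range() argument
is the vm_flags value that the patch currently hardcodes to 0. This
assumes the VM_IMMEDIATE_UNMAP semantics land as proposed in Rick's
series:

	/* FIXME: Remove this when VM_IMMEDIATE_UNMAP is supported */
	#ifndef VM_IMMEDIATE_UNMAP
	#define VM_IMMEDIATE_UNMAP 0
	#endif

	void *bpf_jit_alloc_exec(unsigned long size)
	{
		return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
					    BPF_JIT_REGION_END, GFP_KERNEL,
					    PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
					    NUMA_NO_NODE,
					    __builtin_return_address(0));
	}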
On Mon, 3 Dec 2018 at 13:49, Will Deacon <[email protected]> wrote:
>
> On Fri, Nov 30, 2018 at 08:20:06PM +0100, Ard Biesheuvel wrote:
> > On Fri, 30 Nov 2018 at 19:26, Will Deacon <[email protected]> wrote:
> > >
> > > On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> > > > [...]
> > > > +void *bpf_jit_alloc_exec(unsigned long size)
> > > > +{
> > > > + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> > > > + BPF_JIT_REGION_END, GFP_KERNEL,
> > > > + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> > > > + __builtin_return_address(0));
> > >
> > > I guess we'll want VM_IMMEDIATE_UNMAP here if Rick gets that merged.
> >
> > I think akpm already queued up that patch.
> >
> > > In the
> > > meantime, I wonder if it's worth zeroing the region in bpf_jit_free_exec()?
> > > (although we'd need the size information...).
> > >
> >
> > Not sure. What exactly would that achieve?
>
> I think the zero encoding is guaranteed to be undefined, so it would limit
> the usefulness of any stale, executable TLB entries. However, we'd also need
> cache maintenance to make that stuff visible to the I side, so it's probably
> not worth it, especially if akpm has queued the stuff from Rick.
>
> Maybe just add an:
>
> /* FIXME: Remove this when VM_IMMEDIATE_UNMAP is supported */
> #ifndef VM_IMMEDIATE_UNMAP
> #define VM_IMMEDIATE_UNMAP 0
> #endif
>
> so we remember to come back and sort this out? Up to you.
>
I'll just make a note to send out that patch once the definition lands via -akpm
Hi Will,
On 12/04/2018 04:45 PM, Ard Biesheuvel wrote:
> [...]
> I'll just make a note to send out that patch once the definition lands via -akpm
Could I get an ACK from you for this patch? Then I'd take the series into bpf-next.
Thanks,
Daniel
On Wed, Dec 05, 2018 at 01:24:17PM +0100, Daniel Borkmann wrote:
> [...]
> Could I get an ACK from you for this patch? Then I'd take the series into bpf-next.
Gah, thanks for the ping: I thought I acked this initially, but turns out I
didn't.
Acked-by: Will Deacon <[email protected]>
Will
On 12/05/2018 02:24 PM, Will Deacon wrote:
> [...]
> Gah, thanks for the ping: I thought I acked this initially, but turns out I
> didn't.
>
> Acked-by: Will Deacon <[email protected]>
Applied, thanks everyone!
On Mon, Nov 26, 2018 at 9:02 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Fri, 2018-11-23 at 23:18 +0100, Ard Biesheuvel wrote:
> > [...]
>
> It looks like some of the architectures call module_alloc directly in their
> bpf_jit_compile implementations as well.
Ew, good catch. :P
--
Kees Cook