On Wed, Oct 28, 2015 at 8:09 PM, Catalin Marinas
<[email protected]> wrote:
> On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
>> From: Tirumalesh Chalamarla <[email protected]>
>>
>> Increase the standard cacheline size to avoid having locks in the same
>> cacheline.
>>
>> Cavium's ThunderX core implements 128-byte cache lines. With the
>> current granule size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
>> share the same cache line, leading to performance degradation.
>> Increasing the size fixes that.
>>
>> Increasing the size has no negative impact on cache invalidation on
>> systems with a smaller cache line. There is an impact on memory usage,
>> but that's not too important for arm64 use cases.
>>
>> Signed-off-by: Tirumalesh Chalamarla <[email protected]>
>> Signed-off-by: Robert Richter <[email protected]>
>
> Applied. Thanks.
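As an aside, the false sharing the commit message above describes boils down
to something like this (a minimal, hypothetical sketch using the generic
<linux/cache.h> annotation, not code from the patch):

#include <linux/cache.h>
#include <linux/spinlock.h>

struct counters_bad {
        spinlock_t lock_a;      /* offset 0 */
        long count_a;
        spinlock_t lock_b;      /* < 128 bytes away: can share a ThunderX line */
        long count_b;
};

struct counters_good {
        spinlock_t lock_a;
        long count_a;
        /* pad lock_b out to its own L1_CACHE_BYTES-sized line */
        spinlock_t lock_b ____cacheline_aligned_in_smp;
        long count_b;
};

With L1_CACHE_SHIFT=6 the annotation only guarantees 64-byte alignment, so
the two locks can still land in the same 128-byte ThunderX line; with
L1_CACHE_SHIFT=7 they are kept on separate lines.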
This patch causes a BUG() on r8a7795/salvator-x, for which support is not
yet upstream.
My config (attached) uses SLAB. If I switch to SLUB, it works.
The arm64 defconfig works, even if I switch from SLUB to SLAB.
Booting Linux on physical CPU 0x0
Linux version 4.3.0-salvator-x-08136-gce303d3c239f417e-dirty
(geert@ramsan) (gcc version 4.9.0 (GCC) ) #264 SMP Tue Nov 3 11:40:59
CET 2015
Boot CPU: AArch64 Processor [411fd073]
bootconsole [uart0] enabled
debug: ignoring loglevel setting.
efi: Getting EFI parameters from FDT:
efi: UEFI not found.
On node 0 totalpages: 262144
DMA zone: 3584 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 262144 pages, LIFO batch:31
PERCPU: Embedded 16 pages/cpu @ffffffc03ffc4000 s27136 r8192 d30208 u65536
pcpu-alloc: s27136 r8192 d30208 u65536 alloc=16*4096
pcpu-alloc: [0] 0
Detected PIPT I-cache on CPU0
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258560
Kernel command line: earlycon ignore_loglevel ip=dhcp root=/dev/nfs rw
nfsroot=192.168.97.21:/home/salvator-x/debian-arm64
PID hash table entries: 4096 (order: 3, 32768 bytes)
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
software IO TLB [mem 0x7ae00000-0x7ee00000] (64MB) mapped at
[ffffffc03ae00000-ffffffc03edfffff]
Memory: 946232K/1048576K available (4159K kernel code, 395K rwdata,
1964K rodata, 244K init, 13383K bss, 102344K reserved, 0K cma-reserved)
Virtual kernel memory layout:
vmalloc : 0xffffff8000000000 - 0xffffffbdffff0000 ( 247 GB)
vmemmap : 0xffffffbe00000000 - 0xffffffbfc0000000 ( 7 GB maximum)
0xffffffbe00e00000 - 0xffffffbe01c00000 ( 14 MB actual)
fixed : 0xffffffbffa7fd000 - 0xffffffbffac00000 ( 4108 KB)
PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000 ( 16 MB)
modules : 0xffffffbffc000000 - 0xffffffc000000000 ( 64 MB)
memory : 0xffffffc000000000 - 0xffffffc040000000 ( 1024 MB)
.init : 0xffffffc00067d000 - 0xffffffc0006ba000 ( 244 KB)
.text : 0xffffffc000080000 - 0xffffffc00067cdb4 ( 6132 KB)
.data : 0xffffffc0006c7000 - 0xffffffc000729e00 ( 396 KB)
------------[ cut here ]------------
kernel BUG at mm/slab.c:2283!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted
4.3.0-salvator-x-08136-gce303d3c239f417e-dirty #264
Hardware name: Renesas Salvator-X board based on r8a7795 (DT)
task: ffffffc0006d22f0 ti: ffffffc0006c8000 task.ti: ffffffc0006c8000
PC is at __kmem_cache_create+0x21c/0x280
LR is at __kmem_cache_create+0x210/0x280
pc : [<ffffffc00014f9b4>] lr : [<ffffffc00014f9a8>] pstate: 800002c5
sp : ffffffc0006cbe50
x29: ffffffc0006cbe50 x28: 0000000000000000
x27: ffffffc000081198 x26: 000000004143f000
x25: 000000004143c000 x24: ffffffc03ffd6200
x23: 0000000080000000 x22: 0000000000000000
x21: 0000000080002000 x20: ffffffc03a800180
x19: 0000000000000020 x18: 000000007b73533d
x17: ffffffc00048c000 x16: 0000000000000000
x15: 0000000080000000 x14: 0000000000000081
x13: 0000000000001000 x12: 0000000000000001
x11: 0000000000000080 x10: ffffffc0006a7d58
x9 : 000000000000007f x8 : ffffffffffffff80
x7 : 0000000000000080 x6 : 0000000000001000
x5 : 0000000000000000 x4 : 0000000000000000
x3 : 0000000000000000 x2 : 0000000000000000
x1 : 0000000000000000 x0 : 0000000000000000
Process swapper (pid: 0, stack limit = 0xffffffc0006c8020)
Stack: (0xffffffc0006cbe50 to 0xffffffc0006cc000)
be40: ffffffc0006cbe90 ffffffc00068be50
be60: ffffffc03a800180 0000000000000080 ffffffc0005e90a4 0000000000002000
be80: ffffffc0013ed450 ffffffc0000d5890 ffffffc0006cbed0 ffffffc00068bed8
bea0: ffffffc03a800180 ffffffc0005e90a4 0000000000000080 ffffffc0005e90a4
bec0: 0000000000000080 0000000000002000 ffffffc0006cbf10 ffffffc00068bfc0
bee0: ffffffc0006ac170 ffffffc0013ed450 0000000000000000 ffffffc00014e8b4
bf00: ffffffc0006cbf30 0000000000002000 ffffffc0006cbf60 ffffffc00068db08
bf20: ffffffc0006a7c60 ffffffc0013ed000 ffffffc0006e3818 ffffffc0013ed450
bf40: ffffffc0013ed000 ffffffc03ffd6200 000000004143c000 ffffffc0013ed450
bf60: ffffffc0006cbfa0 ffffffc00067d7d8 ffffffc0006ce000 ffffffc00072d000
bf80: ffffffc00072d000 ffffffc0006ce000 ffffffc0006a6ed8 ffffffc00067d7d0
bfa0: 0000000000000000 0000000040488000 0000000000000400 0000000000000e11
bfc0: 0000000040079000 0000000000000003 000000007fe31b40 0000000040000000
bfe0: 0000000000000000 ffffffc0006a6ed8 0000000000000000 0000000000000000
Call trace:
[<ffffffc00014f9b4>] __kmem_cache_create+0x21c/0x280
[<ffffffc00068be50>] create_boot_cache+0x4c/0x80
[<ffffffc00068bed8>] create_kmalloc_cache+0x54/0x88
[<ffffffc00068bfc0>] create_kmalloc_caches+0x50/0xf4
[<ffffffc00068db08>] kmem_cache_init+0x104/0x118
[<ffffffc00067d7d8>] start_kernel+0x218/0x33c
[<0000000040488000>] 0x40488000
Code: 97ffa117 f9002280 f100401f 54000048 (d4210000)
---[ end trace cb88537fdc8fa200 ]---
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Tue, Nov 03, 2015 at 12:07:06PM +0100, Geert Uytterhoeven wrote:
> On Wed, Oct 28, 2015 at 8:09 PM, Catalin Marinas
> <[email protected]> wrote:
> > On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> >> From: Tirumalesh Chalamarla <[email protected]>
> >>
> >> Increase the standard cacheline size to avoid having locks in the same
> >> cacheline.
> >>
> >> Cavium's ThunderX core implements 128-byte cache lines. With the
> >> current granule size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
> >> share the same cache line, leading to performance degradation.
> >> Increasing the size fixes that.
> >>
> >> Increasing the size has no negative impact on cache invalidation on
> >> systems with a smaller cache line. There is an impact on memory usage,
> >> but that's not too important for arm64 use cases.
> >>
> >> Signed-off-by: Tirumalesh Chalamarla <[email protected]>
> >> Signed-off-by: Robert Richter <[email protected]>
> >
> > Applied. Thanks.
>
> This patch causes a BUG() on r8a7795/salvator-x, for which support is not
> yet upstream.
>
> My config (attached) uses SLAB. If I switch to SLUB, it works.
> The arm64 defconfig works, even if I switch from SLUB to SLAB.
[...]
> ------------[ cut here ]------------
> kernel BUG at mm/slab.c:2283!
> Internal error: Oops - BUG: 0 [#1] SMP
[...]
> Call trace:
> [<ffffffc00014f9b4>] __kmem_cache_create+0x21c/0x280
> [<ffffffc00068be50>] create_boot_cache+0x4c/0x80
> [<ffffffc00068bed8>] create_kmalloc_cache+0x54/0x88
> [<ffffffc00068bfc0>] create_kmalloc_caches+0x50/0xf4
> [<ffffffc00068db08>] kmem_cache_init+0x104/0x118
> [<ffffffc00067d7d8>] start_kernel+0x218/0x33c
I haven't managed to reproduce this on a Juno kernel. Could you please
add some debug printing to see what's being passed to kmalloc_slab()?
Is freelist_size going out of bounds?
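Something along these lines in __kmem_cache_create(), just before the
off-slab freelist lookup, should be enough (illustrative only):

        pr_info("%s: size=%zu freelist_size=%zu cache_line_size()=%d\n",
                cachep->name, size, freelist_size, cache_line_size());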
--
Catalin
On Tue, Nov 03, 2015 at 12:05:05PM +0000, Catalin Marinas wrote:
> On Tue, Nov 03, 2015 at 12:07:06PM +0100, Geert Uytterhoeven wrote:
> > On Wed, Oct 28, 2015 at 8:09 PM, Catalin Marinas
> > <[email protected]> wrote:
> > > On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> > >> From: Tirumalesh Chalamarla <[email protected]>
> > >>
> > >> Increase the standard cacheline size to avoid having locks in the same
> > >> cacheline.
> > >>
> > >> Cavium's ThunderX core implements 128-byte cache lines. With the
> > >> current granule size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
> > >> share the same cache line, leading to performance degradation.
> > >> Increasing the size fixes that.
> > >>
> > >> Increasing the size has no negative impact on cache invalidation on
> > >> systems with a smaller cache line. There is an impact on memory usage,
> > >> but that's not too important for arm64 use cases.
> > >>
> > >> Signed-off-by: Tirumalesh Chalamarla <[email protected]>
> > >> Signed-off-by: Robert Richter <[email protected]>
> > >
> > > Applied. Thanks.
> >
> > This patch causes a BUG() on r8a7795/salvator-x, for which support is not
> > yet upstream.
> >
> > My config (attached) uses SLAB. If I switch to SLUB, it works.
> > The arm64 defconfig works, even if I switch from SLUB to SLAB.
> [...]
> > ------------[ cut here ]------------
> > kernel BUG at mm/slab.c:2283!
> > Internal error: Oops - BUG: 0 [#1] SMP
> [...]
> > Call trace:
> > [<ffffffc00014f9b4>] __kmem_cache_create+0x21c/0x280
> > [<ffffffc00068be50>] create_boot_cache+0x4c/0x80
> > [<ffffffc00068bed8>] create_kmalloc_cache+0x54/0x88
> > [<ffffffc00068bfc0>] create_kmalloc_caches+0x50/0xf4
> > [<ffffffc00068db08>] kmem_cache_init+0x104/0x118
> > [<ffffffc00067d7d8>] start_kernel+0x218/0x33c
>
> I haven't managed to reproduce this on a Juno kernel.
I now managed to reproduce it with your config (slightly adapted to
allow Juno). I'll look into it.
--
Catalin
Hi Catalin,
On Tue, Nov 3, 2015 at 3:38 PM, Catalin Marinas <[email protected]> wrote:
> On Tue, Nov 03, 2015 at 12:05:05PM +0000, Catalin Marinas wrote:
>> On Tue, Nov 03, 2015 at 12:07:06PM +0100, Geert Uytterhoeven wrote:
>> > On Wed, Oct 28, 2015 at 8:09 PM, Catalin Marinas
>> > <[email protected]> wrote:
>> > > On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
>> > >> From: Tirumalesh Chalamarla <[email protected]>
>> > >>
>> > >> Increase the standard cacheline size to avoid having locks in the same
>> > >> cacheline.
>> > >>
>> > >> Cavium's ThunderX core implements 128-byte cache lines. With the
>> > >> current granule size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
>> > >> share the same cache line, leading to performance degradation.
>> > >> Increasing the size fixes that.
>> > >>
>> > >> Increasing the size has no negative impact on cache invalidation on
>> > >> systems with a smaller cache line. There is an impact on memory usage,
>> > >> but that's not too important for arm64 use cases.
>> > >>
>> > >> Signed-off-by: Tirumalesh Chalamarla <[email protected]>
>> > >> Signed-off-by: Robert Richter <[email protected]>
>> > >
>> > > Applied. Thanks.
>> >
>> > This patch causes a BUG() on r8a7795/salvator-x, for which support is not
>> > yet upstream.
>> >
>> > My config (attached) uses SLAB. If I switch to SLUB, it works.
>> > The arm64 defconfig works, even if I switch from SLUB to SLAB.
>> [...]
>> > ------------[ cut here ]------------
>> > kernel BUG at mm/slab.c:2283!
>> > Internal error: Oops - BUG: 0 [#1] SMP
>> [...]
>> > Call trace:
>> > [<ffffffc00014f9b4>] __kmem_cache_create+0x21c/0x280
>> > [<ffffffc00068be50>] create_boot_cache+0x4c/0x80
>> > [<ffffffc00068bed8>] create_kmalloc_cache+0x54/0x88
>> > [<ffffffc00068bfc0>] create_kmalloc_caches+0x50/0xf4
>> > [<ffffffc00068db08>] kmem_cache_init+0x104/0x118
>> > [<ffffffc00067d7d8>] start_kernel+0x218/0x33c
>>
>> I haven't managed to reproduce this on a Juno kernel.
>
> I now managed to reproduce it with your config (slightly adapted to
> allow Juno). I'll look into it.
Good to hear that!
BTW, I see this:
freelist_size = 32
cache_line_size() = 64
It seems like the value returned by cache_line_size() in
arch/arm64/include/asm/cache.h disagrees with L1_CACHE_SHIFT == 7:
static inline int cache_line_size(void)
{
        u32 cwg = cache_type_cwg();
        return cwg ? 4 << cwg : L1_CACHE_BYTES;
}
Making cache_line_size() always return L1_CACHE_BYTES doesn't help.
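For reference, cache_type_cwg() just extracts the CWG field, bits [27:24] of
CTR_EL0, and the granule is 4 << CWG bytes, so the 64 above corresponds to a
CWG value of 4. A standalone illustration of that arithmetic (a sketch with
made-up register values, not the kernel code):

#include <assert.h>
#include <stdint.h>

/* CWG encodes log2 of the writeback granule in words: granule = 4 << CWG */
static unsigned int ctr_to_granule(uint32_t ctr_el0)
{
        uint32_t cwg = (ctr_el0 >> 24) & 0xf;   /* CTR_EL0.CWG, bits [27:24] */

        return cwg ? 4U << cwg : 128U;          /* 0 means "not provided": fall back to L1_CACHE_BYTES */
}

int main(void)
{
        assert(ctr_to_granule(4U << 24) == 64);         /* CWG=4: what this SoC reports */
        assert(ctr_to_granule(5U << 24) == 128);        /* CWG=5: a 128-byte granule */
        return 0;
}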
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Tue, Nov 03, 2015 at 03:55:29PM +0100, Geert Uytterhoeven wrote:
> On Tue, Nov 3, 2015 at 3:38 PM, Catalin Marinas <[email protected]> wrote:
> > On Tue, Nov 03, 2015 at 12:05:05PM +0000, Catalin Marinas wrote:
> >> On Tue, Nov 03, 2015 at 12:07:06PM +0100, Geert Uytterhoeven wrote:
> >> > On Wed, Oct 28, 2015 at 8:09 PM, Catalin Marinas
> >> > <[email protected]> wrote:
> >> > > On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> >> > >> From: Tirumalesh Chalamarla <[email protected]>
> >> > >>
> >> > >> Increase the standard cacheline size to avoid having locks in the same
> >> > >> cacheline.
> >> > >>
> >> > >> Cavium's ThunderX core implements 128-byte cache lines. With the
> >> > >> current granule size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
> >> > >> share the same cache line, leading to performance degradation.
> >> > >> Increasing the size fixes that.
> >> > >>
> >> > >> Increasing the size has no negative impact on cache invalidation on
> >> > >> systems with a smaller cache line. There is an impact on memory usage,
> >> > >> but that's not too important for arm64 use cases.
> >> > >>
> >> > >> Signed-off-by: Tirumalesh Chalamarla <[email protected]>
> >> > >> Signed-off-by: Robert Richter <[email protected]>
> >> > >
> >> > > Applied. Thanks.
> >> >
> >> > This patch causes a BUG() on r8a7795/salvator-x, for which support is not
> >> > yet upstream.
> >> >
> >> > My config (attached) uses SLAB. If I switch to SLUB, it works.
> >> > The arm64 defconfig works, even if I switch from SLUB to SLAB.
> >> [...]
> >> > ------------[ cut here ]------------
> >> > kernel BUG at mm/slab.c:2283!
> >> > Internal error: Oops - BUG: 0 [#1] SMP
> >> [...]
> >> > Call trace:
> >> > [<ffffffc00014f9b4>] __kmem_cache_create+0x21c/0x280
> >> > [<ffffffc00068be50>] create_boot_cache+0x4c/0x80
> >> > [<ffffffc00068bed8>] create_kmalloc_cache+0x54/0x88
> >> > [<ffffffc00068bfc0>] create_kmalloc_caches+0x50/0xf4
> >> > [<ffffffc00068db08>] kmem_cache_init+0x104/0x118
> >> > [<ffffffc00067d7d8>] start_kernel+0x218/0x33c
> >>
> >> I haven't managed to reproduce this on a Juno kernel.
> >
> > I now managed to reproduce it with your config (slightly adapted to
> > allow Juno). I'll look into it.
>
> Good to hear that!
>
> BTW, I see this:
>
> freelist_size = 32
> cache_line_size() = 64
>
> It seems like the value returned by cache_line_size() in
> arch/arm64/include/asm/cache.h disagrees with L1_CACHE_SHIFT == 7:
>
> static inline int cache_line_size(void)
> {
>         u32 cwg = cache_type_cwg();
>         return cwg ? 4 << cwg : L1_CACHE_BYTES;
> }
>
> Making cache_line_size() always return L1_CACHE_BYTES doesn't help.
(cc'ing Joonsoo and Christoph; summary: slab failure with L1_CACHE_BYTES
of 128 and sizeof(kmem_cache_node) of 152)
If I revert commit 8fc9cf420b36 ("slab: make more slab management
structure off the slab") it works but I still need to figure out how
slab indices are calculated. The size_index[] array is overridden so
that 0..15 are 7 and 16..23 are 8. But the kmalloc_caches[7] has never
been populated, hence the BUG_ON. Another option may be to change
kmalloc_size and kmalloc_index to cope with KMALLOC_MIN_SIZE of 128.
I'll do some more investigation tomorrow.
--
Catalin
On Tue, 3 Nov 2015, Catalin Marinas wrote:
> (cc'ing Joonsoo and Christoph; summary: slab failure with L1_CACHE_BYTES
> of 128 and sizeof(kmem_cache_node) of 152)
Hmmm... Yes, that would mean using the 192-byte kmalloc cache, which is not a
power-of-two slab. But the code looks fine to me.
> If I revert commit 8fc9cf420b36 ("slab: make more slab management
> structure off the slab") it works but I still need to figure out how
> slab indices are calculated. The size_index[] array is overridden so
> that 0..15 are 7 and 16..23 are 8. But the kmalloc_caches[7] has never
> been populated, hence the BUG_ON. Another option may be to change
> kmalloc_size and kmalloc_index to cope with KMALLOC_MIN_SIZE of 128.
>
> I'll do some more investigation tomorrow.
The commit allows off slab management for sizes of PAGE_SIZE >> 5 (that is 128) and up.
After that commit kmem_cache_create would try to allocate an off slab
management structure which is not available during early boot.
But the slab_early_init is set which should prevent the use of an off slab
management infrastructure in kmem_cache_create().
However, the failure at line 2283 shows that the CFLGS_OFF_SLAB flag was
mistakenly set anyway! Something must have cleared slab_early_init?
(+ linux-mm)
On Tue, Nov 03, 2015 at 05:33:25PM -0600, Christoph Lameter wrote:
> On Tue, 3 Nov 2015, Catalin Marinas wrote:
> > (cc'ing Joonsoo and Christoph; summary: slab failure with L1_CACHE_BYTES
> > of 128 and sizeof(kmem_cache_node) of 152)
>
> Hmmm... Yes, that would mean using the 192-byte kmalloc cache, which is not a
> power-of-two slab. But the code looks fine to me.
I'm not entirely sure that gets used (or even created).
kmalloc_index(152) returns 8 (INDEX_NODE==8) since KMALLOC_MIN_SIZE==128
and the "kmalloc-node" cache size is 256.
> > If I revert commit 8fc9cf420b36 ("slab: make more slab management
> > structure off the slab") it works but I still need to figure out how
> > slab indices are calculated. The size_index[] array is overridden so
> > that 0..15 are 7 and 16..23 are 8. But the kmalloc_caches[7] has never
> > been populated, hence the BUG_ON. Another option may be to change
> > kmalloc_size and kmalloc_index to cope with KMALLOC_MIN_SIZE of 128.
> >
> > I'll do some more investigation tomorrow.
>
> The commit allows off slab management for sizes of PAGE_SIZE >> 5 (that is 128) and up.
This means that the first kmalloc cache to be created, "kmalloc-128", is
off slab.
> After that commit kmem_cache_create would try to allocate an off slab
> management structure which is not available during early boot.
> But the slab_early_init is set which should prevent the use of an off slab
> management infrastructure in kmem_cache_create().
>
> However, the failure at line 2283 shows that the CFLGS_OFF_SLAB flag was
> mistakenly set anyway! Something must have cleared slab_early_init?
slab_early_init is cleared after the "kmem_cache" and "kmalloc-node" caches
are successfully created. After that point, with KMALLOC_MIN_SIZE == 128,
even the smallest kmalloc cache qualifies for off-slab freelist management.
When creating "kmalloc-128" (via create_kmalloc_caches(), with
slab_early_init already 0), __kmem_cache_create() needs a 32-byte
allocation for the freelist (freelist_size). That size maps to index 7,
which is exactly the kmalloc_caches[7] entry we are in the middle of
creating.
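In code terms, the path that hits the BUG_ON is the hunk removed by the
patch below; annotated here for the "kmalloc-128" case (simplified):

        if (flags & CFLGS_OFF_SLAB) {
                /* freelist_size == 32; kmalloc_slab(32) maps to kmalloc_caches[7] */
                cachep->freelist_cache = kmalloc_slab(freelist_size, 0u);
                /*
                 * kmalloc_caches[7] is the very "kmalloc-128" cache being
                 * created here, so it is still NULL, hence the BUG at
                 * mm/slab.c:2283.
                 */
                BUG_ON(ZERO_OR_NULL_PTR(cachep->freelist_cache));
        }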
The simplest option would be to make sure that off slab isn't allowed
for caches of KMALLOC_MIN_SIZE or smaller, with the drawback that not
only "kmalloc-128" but any other such caches will be on slab.
I think a better option would be to first check that there is a
kmalloc_caches[] entry for freelist_size before deciding to go off-slab.
See below:
-----8<------------------------------
From ce27c5c6d156522ceaff20de8a7af281cf079b6f Mon Sep 17 00:00:00 2001
From: Catalin Marinas <[email protected]>
Date: Wed, 4 Nov 2015 12:19:00 +0000
Subject: [PATCH] mm: slab: Avoid BUG when KMALLOC_MIN_SIZE == (PAGE_SIZE >> 5)
The slab allocator, following commit 8fc9cf420b36 ("slab: make more slab
management structure off the slab"), tries to place slab management
off-slab when the object size is PAGE_SIZE >> 5 or larger. On arm64 with
KMALLOC_MIN_SIZE = L1_CACHE_BYTES = 128, "kmalloc-128" is the smallest
cache to be created after slab_early_init = 0. The current
__kmem_cache_create() implementation aims to place the management
structure off-slab. However, the kmalloc cache that
kmalloc_slab(freelist_size) maps to has not been populated yet,
triggering a BUG_ON on !cachep->freelist_cache.
This patch addresses the problem by keeping the management structure
on-slab if the corresponding kmalloc_caches[] entry is not populated yet.
Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
Cc: <[email protected]> # 3.15+
Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
---
mm/slab.c | 43 ++++++++++++++++++++++++-------------------
1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 4fcc5dd8d5a6..d4a21736eb5d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2246,16 +2246,33 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 	if (flags & CFLGS_OFF_SLAB) {
 		/* really off slab. No need for manual alignment */
-		freelist_size = calculate_freelist_size(cachep->num, 0);
+		size_t off_freelist_size = calculate_freelist_size(cachep->num, 0);
+
+		cachep->freelist_cache = kmalloc_slab(off_freelist_size, 0u);
+		if (ZERO_OR_NULL_PTR(cachep->freelist_cache)) {
+			/*
+			 * We don't have kmalloc_caches[] populated for
+			 * off_freelist_size yet. This can happen during
+			 * create_kmalloc_caches() when KMALLOC_MIN_SIZE >=
+			 * (PAGE_SHIFT >> 5) and CFLGS_OFF_SLAB is set. Move
+			 * the cache on-slab.
+			 */
+			flags &= ~CFLGS_OFF_SLAB;
+			left_over = calculate_slab_order(cachep, size, cachep->align, flags);
+		} else {
+			freelist_size = off_freelist_size;
 #ifdef CONFIG_PAGE_POISONING
-		/* If we're going to use the generic kernel_map_pages()
-		 * poisoning, then it's going to smash the contents of
-		 * the redzone and userword anyhow, so switch them off.
-		 */
-		if (size % PAGE_SIZE == 0 && flags & SLAB_POISON)
-			flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
+			/*
+			 * If we're going to use the generic kernel_map_pages()
+			 * poisoning, then it's going to smash the contents of
+			 * the redzone and userword anyhow, so switch them off.
+			 */
+			if (size % PAGE_SIZE == 0 && flags & SLAB_POISON)
+				flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
 #endif
+		}
+
 	}
cachep->colour_off = cache_line_size();
@@ -2271,18 +2288,6 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 	cachep->size = size;
 	cachep->reciprocal_buffer_size = reciprocal_value(size);
-	if (flags & CFLGS_OFF_SLAB) {
-		cachep->freelist_cache = kmalloc_slab(freelist_size, 0u);
-		/*
-		 * This is a possibility for one of the kmalloc_{dma,}_caches.
-		 * But since we go off slab only for object size greater than
-		 * PAGE_SIZE/8, and kmalloc_{dma,}_caches get created
-		 * in ascending order,this should not happen at all.
-		 * But leave a BUG_ON for some lucky dude.
-		 */
-		BUG_ON(ZERO_OR_NULL_PTR(cachep->freelist_cache));
-	}
-
 	err = setup_cpu_cache(cachep, gfp);
 	if (err) {
 		__kmem_cache_shutdown(cachep);
On Wed, 4 Nov 2015, Catalin Marinas wrote:
> The simplest option would be to make sure that off slab isn't allowed
> for caches of KMALLOC_MIN_SIZE or smaller, with the drawback that not
> only "kmalloc-128" but any other such caches will be on slab.
The reason for an off slab configuration is denser object packing.
> I think a better option would be to first check that there is a
> kmalloc_caches[] entry for freelist_size before deciding to go off-slab.
Hmmm.. Yes seems to be an option.
Maybe we simply revert commit 8fc9cf420b36 instead? That does not seem to
make too much sense to me and the goal of the commit cannot be
accomplished on ARM. Your patch essentially reverts the effect anyway.
Smaller slabs really do not need off slab management anyway since they
will only lose a few objects per slab page.
On Wed, Nov 04, 2015 at 07:53:50AM -0600, Christoph Lameter wrote:
> On Wed, 4 Nov 2015, Catalin Marinas wrote:
>
> > The simplest option would be to make sure that off slab isn't allowed
> > for caches of KMALLOC_MIN_SIZE or smaller, with the drawback that not
> > only "kmalloc-128" but any other such caches will be on slab.
>
> The reason for an off slab configuration is denser object packing.
>
> > I think a better option would be to first check that there is a
> > kmalloc_caches[] entry for freelist_size before deciding to go off-slab.
>
> Hmmm.. Yes seems to be an option.
>
> Maybe we simply revert commit 8fc9cf420b36 instead?
I'm fine with this. Also note that the arm64 commit changing
L1_CACHE_BYTES to 128 hasn't been pushed yet (it's queued for 4.4).
> That does not seem to make too much sense to me and the goal of the
> commit cannot be accomplished on ARM. Your patch essentially reverts
> the effect anyway.
In theory it only reverts the effect for the first kmalloc_cache
("kmalloc-128" in the arm64 case). Any other bigger cache which would
not be mergeable with an existing one still has the potential of
off-slab management.
> Smaller slabs really do not need off slab management anyway since they
> will only lose a few objects per slab page.
IIUC, starting with a 128-byte slab size for a 4KB page, you have 32
objects per page. The freelist takes 32 bytes (or 31), therefore you
waste a single slab object. However, only 1/4 of that object is actually
used for the freelist, and the waste gets bigger with a 256-byte slab
size, hence the original commit.
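A back-of-the-envelope check of those numbers (4KB pages, 1-byte freelist
entries; illustrative standalone code, not anything from mm/slab.c):

#include <assert.h>

int main(void)
{
        int page = 4096, obj = 128;
        int nr_objs = page / obj;       /* 32 objects fit in a 4KB slab page */
        int freelist = nr_objs * 1;     /* 32 one-byte freelist indices */

        /* On-slab, the freelist costs one whole 128-byte object slot... */
        assert(page - (nr_objs - 1) * obj == obj);
        /* ...of which only a quarter is actually used by the freelist. */
        assert(obj / freelist == 4);

        /*
         * Off-slab with KMALLOC_MIN_SIZE == 128, the 32-byte freelist would
         * itself be rounded up to a 128-byte kmalloc object, so nothing is
         * saved at this object size; the gain only appears from 256 bytes up.
         */
        return 0;
}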
BTW, assuming L1_CACHE_BYTES is 512 (I don't ever see this happening but
just in theory), we potentially have the same issue. What would save us
is that INDEX_NODE would match the first "kmalloc-512" cache, so we have
it pre-populated.
--
Catalin
On Wed, 4 Nov 2015, Catalin Marinas wrote:
> BTW, assuming L1_CACHE_BYTES is 512 (I don't ever see this happening but
> just in theory), we potentially have the same issue. What would save us
> is that INDEX_NODE would match the first "kmalloc-512" cache, so we have
> it pre-populated.
Ok maybe add some BUILD_BUG_ONs to ensure that builds fail until we have
addressed that.
On Wed, Nov 04, 2015 at 09:28:34AM -0600, Christoph Lameter wrote:
> On Wed, 4 Nov 2015, Catalin Marinas wrote:
>
> > BTW, assuming L1_CACHE_BYTES is 512 (I don't ever see this happening but
> > just in theory), we potentially have the same issue. What would save us
> > is that INDEX_NODE would match the first "kmalloc-512" cache, so we have
> > it pre-populated.
>
> Ok maybe add some BUILD_BUG_ONs to ensure that builds fail until we have
> addressed that.
A BUILD_BUG_ON should be fine.
Thinking some more, I think if KMALLOC_MIN_SIZE is 128, there is no gain
with off-slab management since the freelist allocation would still be
128 bytes. An alternative to reverting, while still keeping a little
off-slab benefit for 256-byte objects (rather than 512 as we would get
with the revert):
diff --git a/mm/slab.c b/mm/slab.c
index 4fcc5dd8d5a6..ac32b4a0f2ec 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2212,8 +2212,8 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 	 * it too early on. Always use on-slab management when
 	 * SLAB_NOLEAKTRACE to avoid recursive calls into kmemleak)
 	 */
-	if ((size >= (PAGE_SIZE >> 5)) && !slab_early_init &&
-	    !(flags & SLAB_NOLEAKTRACE))
+	if ((size >= (PAGE_SIZE >> 5)) && (size > KMALLOC_MIN_SIZE) &&
+	    !slab_early_init && !(flags & SLAB_NOLEAKTRACE))
 		/*
 		 * Size is large, assume best to place the slab management obj
 		 * off-slab (should allow better packing of objs).
Whichever you prefer.
--
Catalin
On Wed, Nov 04, 2015 at 03:39:10PM +0000, Catalin Marinas wrote:
> On Wed, Nov 04, 2015 at 09:28:34AM -0600, Christoph Lameter wrote:
> > On Wed, 4 Nov 2015, Catalin Marinas wrote:
> >
> > > BTW, assuming L1_CACHE_BYTES is 512 (I don't ever see this happening but
> > > just in theory), we potentially have the same issue. What would save us
> > > is that INDEX_NODE would match the first "kmalloc-512" cache, so we have
> > > it pre-populated.
> >
> > Ok maybe add some BUILD_BUG_ONs to ensure that builds fail until we have
> > addressed that.
>
> A BUILD_BUG_ON should be fine.
>
> Thinking some more, I think if KMALLOC_MIN_SIZE is 128, there is no gain
> with off-slab management since the freelist allocation would still be
> 128 bytes. An alternative to reverting, while still keeping a little
> off-slab benefit for 256-byte objects (rather than 512 as we would get
> with the revert):
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 4fcc5dd8d5a6..ac32b4a0f2ec 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2212,8 +2212,8 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
>  	 * it too early on. Always use on-slab management when
>  	 * SLAB_NOLEAKTRACE to avoid recursive calls into kmemleak)
>  	 */
> -	if ((size >= (PAGE_SIZE >> 5)) && !slab_early_init &&
> -	    !(flags & SLAB_NOLEAKTRACE))
> +	if ((size >= (PAGE_SIZE >> 5)) && (size > KMALLOC_MIN_SIZE) &&
> +	    !slab_early_init && !(flags & SLAB_NOLEAKTRACE))
>  		/*
>  		 * Size is large, assume best to place the slab management obj
>  		 * off-slab (should allow better packing of objs).
>
> Whichever you prefer.
Hello,
I prefer this simple way. It looks like it can solve the issue on
any arbitrary configuration.
Thanks.
Commit 8fc9cf420b36 ("slab: make more slab management structure off the
slab") enables off-slab management objects for sizes starting with
PAGE_SIZE >> 5. This means 128 bytes for a 4KB page configuration.
However, on systems with a KMALLOC_MIN_SIZE of 128 (arm64 in 4.4), such
optimisation does not make sense since the slab management allocation
would take 128 bytes anyway (even though freelist_size is 32) with the
additional overhead of another allocation.
This patch introduces an OFF_SLAB_MIN_SIZE macro which takes
KMALLOC_MIN_SIZE into account. It also solves a slab bug on arm64 where
the first kmalloc_cache to be initialised after slab_early_init = 0,
"kmalloc-128", fails to allocate off-slab management objects from the
same "kmalloc-128" cache.
Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
Cc: <[email protected]> # 3.15+
Reported-by: Geert Uytterhoeven <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
---
mm/slab.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 4fcc5dd8d5a6..419b649b395e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -282,6 +282,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
#define CFLGS_OFF_SLAB (0x80000000UL)
#define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB)
+#define OFF_SLAB_MIN_SIZE (max_t(size_t, PAGE_SIZE >> 5, KMALLOC_MIN_SIZE + 1))
#define BATCHREFILL_LIMIT 16
/*
@@ -2212,7 +2213,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 	 * it too early on. Always use on-slab management when
 	 * SLAB_NOLEAKTRACE to avoid recursive calls into kmemleak)
 	 */
-	if ((size >= (PAGE_SIZE >> 5)) && !slab_early_init &&
+	if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init &&
 	    !(flags & SLAB_NOLEAKTRACE))
 		/*
 		 * Size is large, assume best to place the slab management obj
@@ -2276,7 +2277,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 		/*
 		 * This is a possibility for one of the kmalloc_{dma,}_caches.
 		 * But since we go off slab only for object size greater than
-		 * PAGE_SIZE/8, and kmalloc_{dma,}_caches get created
+		 * OFF_SLAB_MIN_SIZE, and kmalloc_{dma,}_caches get created
 		 * in ascending order,this should not happen at all.
 		 * But leave a BUG_ON for some lucky dude.
 		 */
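To see what the new threshold works out to, a standalone illustration of the
max_t() expression above (not the kernel macro itself):

#include <assert.h>
#include <stddef.h>

/* Same shape as the OFF_SLAB_MIN_SIZE definition above, parameterised
 * so the two configurations can be compared side by side. */
static size_t off_slab_min_size(size_t page_size, size_t kmalloc_min)
{
        size_t a = page_size >> 5;
        size_t b = kmalloc_min + 1;

        return a > b ? a : b;
}

int main(void)
{
        /* arm64 with 4KB pages and KMALLOC_MIN_SIZE == 128: the threshold
         * becomes 129, so "kmalloc-128" keeps its freelist on-slab. */
        assert(off_slab_min_size(4096, 128) == 129);

        /* a typical KMALLOC_MIN_SIZE == 8 configuration is unchanged:
         * the threshold stays at PAGE_SIZE >> 5 == 128. */
        assert(off_slab_min_size(4096, 8) == 128);
        return 0;
}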
On Thu, 5 Nov 2015 11:50:35 +0000 Catalin Marinas <[email protected]> wrote:
> Commit 8fc9cf420b36 ("slab: make more slab management structure off the
> slab") enables off-slab management objects for sizes starting with
> PAGE_SIZE >> 5. This means 128 bytes for a 4KB page configuration.
> However, on systems with a KMALLOC_MIN_SIZE of 128 (arm64 in 4.4), such
> optimisation does not make sense since the slab management allocation
> would take 128 bytes anyway (even though freelist_size is 32) with the
> additional overhead of another allocation.
>
> This patch introduces an OFF_SLAB_MIN_SIZE macro which takes
> KMALLOC_MIN_SIZE into account. It also solves a slab bug on arm64 where
> the first kmalloc_cache to be initialised after slab_early_init = 0,
> "kmalloc-128", fails to allocate off-slab management objects from the
> same "kmalloc-128" cache.
That all seems to be quite minor stuff.
> Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
> Cc: <[email protected]> # 3.15+
Yet you believe the fix should be backported.
So, the usual refrain: when fixing a bug, please describe the end-user
visible effects of that bug.
On Thu, Nov 05, 2015 at 05:31:39AM -0800, Andrew Morton wrote:
> On Thu, 5 Nov 2015 11:50:35 +0000 Catalin Marinas <[email protected]> wrote:
>
> > Commit 8fc9cf420b36 ("slab: make more slab management structure off the
> > slab") enables off-slab management objects for sizes starting with
> > PAGE_SIZE >> 5. This means 128 bytes for a 4KB page configuration.
> > However, on systems with a KMALLOC_MIN_SIZE of 128 (arm64 in 4.4), such
> > optimisation does not make sense since the slab management allocation
> > would take 128 bytes anyway (even though freelist_size is 32) with the
> > additional overhead of another allocation.
> >
> > This patch introduces an OFF_SLAB_MIN_SIZE macro which takes
> > KMALLOC_MIN_SIZE into account. It also solves a slab bug on arm64 where
> > the first kmalloc_cache to be initialised after slab_early_init = 0,
> > "kmalloc-128", fails to allocate off-slab management objects from the
> > same "kmalloc-128" cache.
>
> That all seems to be quite minor stuff.
Apart from "it also solves a bug on arm64...". But I agree, the initial
commit log doesn't give any justification for cc stable.
> > Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
> > Cc: <[email protected]> # 3.15+
>
> Yet you believe the fix should be backported.
>
> So, the usual refrain: when fixing a bug, please describe the end-user
> visible effects of that bug.
What about the following (unless you prefer this slightly more intrusive fix:
http://article.gmane.org/gmane.linux.ports.sh.devel/50303):
------------------8<--------------------------
From fda8f306b6941f4ddbefcbcfaa59fedef4a679a3 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <[email protected]>
Date: Thu, 5 Nov 2015 11:14:48 +0000
Subject: [PATCH] mm: slab: Only move management objects off-slab for sizes
larger than KMALLOC_MIN_SIZE
On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
configurations defining ARCH_DMA_MINALIGN to 128), the first
kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
"kmalloc-128" with index 7. Depending on the debug kernel configuration,
sizeof(struct kmem_cache) can be larger than 128 resulting in an
INDEX_NODE of 8.
Commit 8fc9cf420b36 ("slab: make more slab management structure off the
slab") enables off-slab management objects for sizes starting with
PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
of the "kmalloc-128" cache would try to place the management objects
off-slab. However, since KMALLOC_MIN_SIZE is already 128 and
freelist_size == 32 in __kmem_cache_create(),
kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not
populated yet). This triggers the following bug on arm64:
[ 0.000000] kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
[ 0.000000] Internal error: Oops - BUG: 0 [#1] SMP
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
[ 0.000000] Hardware name: Juno (DT)
[ 0.000000] task: ffffffc0006962b0 ti: ffffffc00068c000 task.ti: ffffffc00068c000
[ 0.000000] PC is at __kmem_cache_create+0x21c/0x280
[ 0.000000] LR is at __kmem_cache_create+0x210/0x280
[...]
[ 0.000000] Call trace:
[ 0.000000] [<ffffffc000154948>] __kmem_cache_create+0x21c/0x280
[ 0.000000] [<ffffffc000652da4>] create_boot_cache+0x48/0x80
[ 0.000000] [<ffffffc000652e2c>] create_kmalloc_cache+0x50/0x88
[ 0.000000] [<ffffffc000652f14>] create_kmalloc_caches+0x4c/0xf4
[ 0.000000] [<ffffffc000654a9c>] kmem_cache_init+0x100/0x118
[ 0.000000] [<ffffffc0006447d4>] start_kernel+0x214/0x33c
This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
Cc: <[email protected]> # 3.15+
Reported-by: Geert Uytterhoeven <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
---
mm/slab.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 4fcc5dd8d5a6..419b649b395e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -282,6 +282,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
#define CFLGS_OFF_SLAB (0x80000000UL)
#define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB)
+#define OFF_SLAB_MIN_SIZE (max_t(size_t, PAGE_SIZE >> 5, KMALLOC_MIN_SIZE + 1))
#define BATCHREFILL_LIMIT 16
/*
@@ -2212,7 +2213,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 	 * it too early on. Always use on-slab management when
 	 * SLAB_NOLEAKTRACE to avoid recursive calls into kmemleak)
 	 */
-	if ((size >= (PAGE_SIZE >> 5)) && !slab_early_init &&
+	if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init &&
 	    !(flags & SLAB_NOLEAKTRACE))
 		/*
 		 * Size is large, assume best to place the slab management obj
@@ -2276,7 +2277,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
 		/*
 		 * This is a possibility for one of the kmalloc_{dma,}_caches.
 		 * But since we go off slab only for object size greater than
-		 * PAGE_SIZE/8, and kmalloc_{dma,}_caches get created
+		 * OFF_SLAB_MIN_SIZE, and kmalloc_{dma,}_caches get created
 		 * in ascending order,this should not happen at all.
 		 * But leave a BUG_ON for some lucky dude.
 		 */
On Thu, 5 Nov 2015, Catalin Marinas wrote:
> This patch introduces an OFF_SLAB_MIN_SIZE macro which takes
> KMALLOC_MIN_SIZE into account. It also solves a slab bug on arm64 where
> the first kmalloc_cache to be initialised after slab_early_init = 0,
> "kmalloc-128", fails to allocate off-slab management objects from the
> same "kmalloc-128" cache.
Acked-by: Christoph Lameter <[email protected]>
On Thu, Nov 5, 2015 at 5:08 PM, Catalin Marinas <[email protected]> wrote:
> From fda8f306b6941f4ddbefcbcfaa59fedef4a679a3 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas <[email protected]>
> Date: Thu, 5 Nov 2015 11:14:48 +0000
> Subject: [PATCH] mm: slab: Only move management objects off-slab for sizes
> larger than KMALLOC_MIN_SIZE
>
> On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
> configurations defining ARCH_DMA_MINALIGN to 128), the first
> kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
> "kmalloc-128" with index 7. Depending on the debug kernel configuration,
> sizeof(struct kmem_cache) can be larger than 128 resulting in an
> INDEX_NODE of 8.
>
> Commit 8fc9cf420b36 ("slab: make more slab management structure off the
> slab") enables off-slab management objects for sizes starting with
> PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
> of the "kmalloc-128" cache would try to place the management objects
> off-slab. However, since KMALLOC_MIN_SIZE is already 128 and
> freelist_size == 32 in __kmem_cache_create(),
> kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not
> populated yet). This triggers the following bug on arm64:
>
> [ 0.000000] kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
> [ 0.000000] Internal error: Oops - BUG: 0 [#1] SMP
> [ 0.000000] Modules linked in:
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
> [ 0.000000] Hardware name: Juno (DT)
> [ 0.000000] task: ffffffc0006962b0 ti: ffffffc00068c000 task.ti: ffffffc00068c000
> [ 0.000000] PC is at __kmem_cache_create+0x21c/0x280
> [ 0.000000] LR is at __kmem_cache_create+0x210/0x280
> [...]
> [ 0.000000] Call trace:
> [ 0.000000] [<ffffffc000154948>] __kmem_cache_create+0x21c/0x280
> [ 0.000000] [<ffffffc000652da4>] create_boot_cache+0x48/0x80
> [ 0.000000] [<ffffffc000652e2c>] create_kmalloc_cache+0x50/0x88
> [ 0.000000] [<ffffffc000652f14>] create_kmalloc_caches+0x4c/0xf4
> [ 0.000000] [<ffffffc000654a9c>] kmem_cache_init+0x100/0x118
> [ 0.000000] [<ffffffc0006447d4>] start_kernel+0x214/0x33c
>
> This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
> management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
>
>
> Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
> Cc: <[email protected]> # 3.15+
> Reported-by: Geert Uytterhoeven <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Signed-off-by: Catalin Marinas <[email protected]>
Thanks a lot!
For the record (the fix is already upstream):
Tested-by: Geert Uytterhoeven <[email protected]>
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds