2015-11-05 04:40:06

by Joonsoo Kim

Subject: Re: [PATCH] arm64: Increase the max granular size

On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> From: Tirumalesh Chalamarla <[email protected]>
>
> Increase the standard cacheline size to avoid having locks in the same
> cacheline.
>
> Cavium's ThunderX core implements cache lines of 128 bytes. With the
> current granular size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
> share the same cache line, leading to performance degradation.
> Increasing the size fixes that.

Besides the slab-side bug, I don't think this argument is valid.
Even if this change is applied, statically allocated spinlocks could
still share the same cache line.

If two locks should not share the same cache line, you'd better use a
compiler attribute such as ____cacheline_aligned_in_smp in the
appropriate place.

Thanks.

>
> Increasing the size has no negative impact to cache invalidation on
> systems with a smaller cache line. There is an impact on memory usage,
> but that's not too important for arm64 use cases.
>
> Signed-off-by: Tirumalesh Chalamarla <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> ---
> arch/arm64/include/asm/cache.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index bde449936e2f..5082b30bc2c0 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -18,7 +18,7 @@
>
> #include <asm/cachetype.h>
>
> -#define L1_CACHE_SHIFT 6
> +#define L1_CACHE_SHIFT 7
> #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>
> /*
> --
> 2.1.1
>


2015-11-05 10:32:20

by Catalin Marinas

Subject: Re: [PATCH] arm64: Increase the max granular size

On Thu, Nov 05, 2015 at 01:40:14PM +0900, Joonsoo Kim wrote:
> On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> > From: Tirumalesh Chalamarla <[email protected]>
> >
> > Increase the standard cacheline size to avoid having locks in the same
> > cacheline.
> >
> > Cavium's ThunderX core implements cache lines of 128 bytes. With the
> > current granular size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
> > share the same cache line, leading to performance degradation.
> > Increasing the size fixes that.
>
> Besides the slab-side bug, I don't think this argument is valid.
> Even if this change is applied, statically allocated spinlocks could
> still share the same cache line.

The benchmarks didn't show any difference with or without this patch
applied. What convinced me to apply it was this email:

http://lkml.kernel.org/g/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

On ARM we have a notion of cache writeback granule (CWG) which tells us
"the maximum size of memory that can be overwritten as a result of the
eviction of a cache entry that has had a memory location in it
modified". What we actually needed was ARCH_DMA_MINALIGN to be 128
(currently defined to the L1_CACHE_BYTES value). However, this wouldn't
have fixed the KMALLOC_MIN_SIZE, unless we somehow generate different
kmalloc_caches[] and kmalloc_dma_caches[] and probably introduce a
size_dma_index[].

> If two locks should not share the same cache line, you'd better use a
> compiler attribute such as ____cacheline_aligned_in_smp in the
> appropriate place.

We could decouple SMP_CACHE_BYTES from L1_CACHE_BYTES but see above for
the other issue we had to solve.

--
Catalin

2015-11-05 11:45:36

by Joonsoo Kim

Subject: Re: [PATCH] arm64: Increase the max granular size

2015-11-05 19:32 GMT+09:00 Catalin Marinas <[email protected]>:
> On Thu, Nov 05, 2015 at 01:40:14PM +0900, Joonsoo Kim wrote:
>> On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
>> > From: Tirumalesh Chalamarla <[email protected]>
>> >
>> > Increase the standard cacheline size to avoid having locks in the same
>> > cacheline.
>> >
>> > Cavium's ThunderX core implements cache lines of 128 bytes. With the
>> > current granular size of 64 bytes (L1_CACHE_SHIFT=6), two locks could
>> > share the same cache line, leading to performance degradation.
>> > Increasing the size fixes that.
>>
>> Besides the slab-side bug, I don't think this argument is valid.
>> Even if this change is applied, statically allocated spinlocks could
>> still share the same cache line.
>
> The benchmarks didn't show any difference with or without this patch
> applied. What convinced me to apply it was this email:
>
> http://lkml.kernel.org/g/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

Okay.

> On ARM we have a notion of cache writeback granule (CWG) which tells us
> "the maximum size of memory that can be overwritten as a result of the
> eviction of a cache entry that has had a memory location in it
> modified". What we actually needed was ARCH_DMA_MINALIGN to be 128
> (currently defined to the L1_CACHE_BYTES value). However, this wouldn't
> have fixed the KMALLOC_MIN_SIZE, unless we somehow generate different
> kmalloc_caches[] and kmalloc_dma_caches[] and probably introduce a
> size_dma_index[].

If we create separate kmalloc caches for DMA, can we apply this alignment
requirement only to the DMA caches? I guess some memory allocation
requests that will be used for DMA operations don't specify GFP_DMA
because they don't want memory from ZONE_DMA. In that case, we would
still have to apply the DMA alignment requirement to all types of caches.

In fact, I know someone who tried to implement this alignment separation,
as you mentioned, to reduce memory waste. I first suggested this solution
to him, but now I realize that it isn't possible for the reason above.

Am I missing something?

If it isn't possible, is there another way to reduce the memory waste
due to the increased DMA alignment requirement on arm64?

Thanks.

2015-11-05 12:17:12

by Catalin Marinas

Subject: Re: [PATCH] arm64: Increase the max granular size

On Thu, Nov 05, 2015 at 08:45:08PM +0900, Joonsoo Kim wrote:
> 2015-11-05 19:32 GMT+09:00 Catalin Marinas <[email protected]>:
> > On ARM we have a notion of cache writeback granule (CWG) which tells us
> > "the maximum size of memory that can be overwritten as a result of the
> > eviction of a cache entry that has had a memory location in it
> > modified". What we actually needed was ARCH_DMA_MINALIGN to be 128
> > (currently defined to the L1_CACHE_BYTES value). However, this wouldn't
> > have fixed the KMALLOC_MIN_SIZE, unless we somehow generate different
> > kmalloc_caches[] and kmalloc_dma_caches[] and probably introduce a
> > size_dma_index[].
>
> If we create separate kmalloc caches for DMA, can we apply this alignment
> requirement only to the DMA caches? I guess some memory allocation
> requests that will be used for DMA operations don't specify GFP_DMA
> because they don't want memory from ZONE_DMA. In that case, we would
> still have to apply the DMA alignment requirement to all types of caches.

I think you are right. While something like swiotlb (through the
streaming DMA API) could do bounce buffering and allocate a buffer from
ZONE_DMA, this is not guaranteed if the buffer's physical address happens
to match the dma_mask. Similarly, with an IOMMU no bouncing happens but
the alignment is still required.

> If it isn't possible, is there another way to reduce the memory waste
> due to the increased DMA alignment requirement on arm64?

I first need to see how significant the impact is (especially for
embedded/mobile platforms).

An alternative is to leave L1_CACHE_BYTES at 64 by default but warn if
the CWG is 128 on systems with non-coherent DMA (and hope that we won't
have any). It's not really a fix, rather an assumption. Anyway, I would
very much like the same kernel image for all platforms and no Kconfig
entry for the cache line size, but if the waste is significant, we may
add one for some specific builds (like a mobile phone; or such vendors
could patch the kernel themselves).

--
Catalin

2015-11-09 07:42:03

by Joonsoo Kim

Subject: Re: [PATCH] arm64: Increase the max granular size

2015-11-05 21:17 GMT+09:00 Catalin Marinas <[email protected]>:
> On Thu, Nov 05, 2015 at 08:45:08PM +0900, Joonsoo Kim wrote:
>> 2015-11-05 19:32 GMT+09:00 Catalin Marinas <[email protected]>:
>> > On ARM we have a notion of cache writeback granule (CWG) which tells us
>> > "the maximum size of memory that can be overwritten as a result of the
>> > eviction of a cache entry that has had a memory location in it
>> > modified". What we actually needed was ARCH_DMA_MINALIGN to be 128
>> > (currently defined to the L1_CACHE_BYTES value). However, this wouldn't
>> > have fixed the KMALLOC_MIN_SIZE, unless we somehow generate different
>> > kmalloc_caches[] and kmalloc_dma_caches[] and probably introduce a
>> > size_dma_index[].
>>
>> If we create separate kmalloc caches for DMA, can we apply this alignment
>> requirement only to the DMA caches? I guess some memory allocation
>> requests that will be used for DMA operations don't specify GFP_DMA
>> because they don't want memory from ZONE_DMA. In that case, we would
>> still have to apply the DMA alignment requirement to all types of caches.
>
> I think you are right. While something like swiotlb (through the
> streaming DMA API) could do bounce buffering and allocate a buffer from
> ZONE_DMA, this is not guaranteed if the buffer's physical address happens
> to match the dma_mask. Similarly, with an IOMMU no bouncing happens but
> the alignment is still required.
>
>> If it isn't possible, is there another way to reduce the memory waste
>> due to the increased DMA alignment requirement on arm64?
>
> I first need to see how significant the impact is (especially for
> embedded/mobile platforms).

I don't have any ARM64 device. What I have is just one report
about slab usage from one of our developers.

The report shows slab usage just after Android boot is done
on ARM64.

Total slab usage: 90 MB
kmalloc usage: 25 MB
kmalloc (<=64) usage: 7 MB

This was measured without slab_nomerge, so it is possible that
some kmem_cache usage was merged into the kmalloc (<=64) usage.

Anyway, if ARM64 increases L1_CACHE_BYTES to 128, roughly
7 MB would be wasted. I don't know how this picture varies at
runtime, but even as boot-time overhead, 7 MB looks large to me.

> An alternative is to leave L1_CACHE_BYTES at 64 by default but warn if
> the CWG is 128 on systems with non-coherent DMA (and hope that we won't
> have any). It's not really a fix, rather an assumption. Anyway, I would
> very much like the same kernel image for all platforms and no Kconfig
> entry for the cache line size, but if the waste is significant, we may
> add one for some specific builds (like a mobile phone; or such vendors
> could patch the kernel themselves).

Okay. In fact, I'm not very familiar with Android devices, so it'd be
better for you to ask some other Android developers about how significant
the impact is on Android or mobile devices.

Thanks.

2015-11-09 18:36:14

by Catalin Marinas

Subject: Re: [PATCH] arm64: Increase the max granular size

On Mon, Nov 09, 2015 at 04:41:58PM +0900, Joonsoo Kim wrote:
> 2015-11-05 21:17 GMT+09:00 Catalin Marinas <[email protected]>:
> > On Thu, Nov 05, 2015 at 08:45:08PM +0900, Joonsoo Kim wrote:
> >> If it isn't possible, is there another way to reduce the memory waste
> >> due to the increased DMA alignment requirement on arm64?
> >
> > I first need to see how significant the impact is (especially for
> > embedded/mobile platforms).
>
> I don't have any ARM64 device. What I have is just one report
> about slab usage from one of our developers.
>
> The report shows slab usage just after Android boot is done
> on ARM64.
>
> Total slab usage: 90 MB
> kmalloc usage: 25 MB
> kmalloc (<=64) usage: 7 MB
>
> This was measured without slab_nomerge, so it is possible that
> some kmem_cache usage was merged into the kmalloc (<=64) usage.
>
> Anyway, if ARM64 increases L1_CACHE_BYTES to 128, roughly
> 7 MB would be wasted. I don't know how this picture varies at
> runtime, but even as boot-time overhead, 7 MB looks large to me.

7MB is considerable, but I guess it wouldn't all be wasted with
L1_CACHE_BYTES == 128, maybe half or slightly over. It would be good to
know the other kmalloc caches, maybe up to 256.

I don't have an Android filesystem, but I just tried to boot Arch
(aarch64). Immediately after boot, with slab_nomerge and a 128-byte L1,
I get:

kmalloc-128: 6624
kmalloc-256: 1488

With a 64-byte L1, I get:

kmalloc-64: 5760
kmalloc-128: 1152
kmalloc-192: 1155
kmalloc-256: 320

So that's about 1.2MB vs 0.8MB. The ratio is 3:2, though I'm not sure it
will stay the same as the slab usage increases.

It would be good to get more numbers, we could add a Kconfig option just
for specific builds while keeping the default to 128.

--
Catalin

2015-11-10 00:19:25

by Joonsoo Kim

Subject: Re: [PATCH] arm64: Increase the max granular size

On Mon, Nov 09, 2015 at 06:36:09PM +0000, Catalin Marinas wrote:
> On Mon, Nov 09, 2015 at 04:41:58PM +0900, Joonsoo Kim wrote:
> > 2015-11-05 21:17 GMT+09:00 Catalin Marinas <[email protected]>:
> > > On Thu, Nov 05, 2015 at 08:45:08PM +0900, Joonsoo Kim wrote:
> > >> If it isn't possible, is there another way to reduce the memory waste
> > >> due to the increased DMA alignment requirement on arm64?
> > >
> > > I first need to see how significant the impact is (especially for
> > > embedded/mobile platforms).
> >
> > I don't have any ARM64 device. What I have is just one report
> > about slab usage from one of our developers.
> >
> > The report shows slab usage just after Android boot is done
> > on ARM64.
> >
> > Total slab usage: 90 MB
> > kmalloc usage: 25 MB
> > kmalloc (<=64) usage: 7 MB
> >
> > This was measured without slab_nomerge, so it is possible that
> > some kmem_cache usage was merged into the kmalloc (<=64) usage.
> >
> > Anyway, if ARM64 increases L1_CACHE_BYTES to 128, roughly
> > 7 MB would be wasted. I don't know how this picture varies at
> > runtime, but even as boot-time overhead, 7 MB looks large to me.
>
> 7MB is considerable, but I guess it wouldn't all be wasted with
> L1_CACHE_BYTES == 128, maybe half or slightly over. It would be good to
> know the other kmalloc caches, maybe up to 256.
>
> I don't have an Android filesystem, but I just tried to boot Arch
> (aarch64). Immediately after boot, with slab_nomerge and a 128-byte L1,
> I get:
>
> kmalloc-128: 6624
> kmalloc-256: 1488
>
> With a 64-byte L1, I get:
>
> kmalloc-64: 5760
> kmalloc-128: 1152
> kmalloc-192: 1155
> kmalloc-256: 320
>
> So that's about 1.2MB vs 0.8MB. The ratio is 3:2, though I'm not sure it
> will stay the same as the slab usage increases.
>
> It would be good to get more numbers, we could add a Kconfig option just
> for specific builds while keeping the default to 128.

Okay.

Thanks.