2002-12-05 19:55:57

by Andrew Morton

Subject: Re: [patch] kmalloc_percpu -- 2 of 2

Dipankar Sarma wrote:
>
> Hi Andrew,
>
> On Wed, Dec 04, 2002 at 08:32:58PM -0800, Andrew Morton wrote:
> > Where in the kernel is such a large number of 4-, 8- or 16-byte
> > objects being used?
>
> Well, kernel objects may not be that small, but one would expect
> the per-cpu parts of the kernel objects to be sometimes small, often down to
> a couple of counters counting statistics.

Sorry, "one would expect" is not sufficient grounds for incorporation of
a new allocator. As far as I can tell, all the proposed users are in
fact allocating decent-sized aggregates, and that will remain the usual
case.

The code exists, great. We can pull it in when there is a demonstrated
need for it. But until that need is shown, this is overdesign.

> >
> > The slab allocator will support caches right down to 1024 x 4-byte
> > objects per page. Why is that not appropriate?
>
> Well, if you allocated 4-byte objects directly from the slab allocator,
> you aren't guaranteed to *not* share a cache line with another object
> modified by a different cpu.

If that's a problem it can be addressed in the slab head arrays - make
sure that they are always filled and emptied in multiple-of-cacheline-sized
units for objects which are smaller than a cacheline. That benefits all
slab users.
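
One way to read "multiple-of-cacheline-sized units" is to round the batch count up to a whole number of cachelines' worth of objects. The following is a minimal userspace sketch of that arithmetic only; round_batch() and its parameters are hypothetical names, not part of the slab allocator.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a slab API: round a batch count up so that a
 * batch of obj_size-byte objects always covers whole cachelines.
 * Objects at least as large as a line are left alone.
 */
static size_t round_batch(size_t batch, size_t obj_size, size_t line_size)
{
	size_t per_line;

	assert(obj_size > 0 && line_size > 0);
	if (obj_size >= line_size)
		return batch;

	per_line = line_size / obj_size;	/* objects sharing one line */
	return (batch + per_line - 1) / per_line * per_line;
}
```

With 4-byte objects and an assumed 32-byte line, a requested batch of 10 becomes 16, so no line is ever split between two CPUs' batches.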

> >
> > Sorry, but you have what is basically a brand new allocator in
> > there, and we need a very good reason for including it. I'd like
> > to know what that reason is, please.
>
> The reason is concern about per-cpu allocation for small per-CPU
> parts (typically counters) of objects. If a driver has two counters
> counting reads and writes, you don't want to eat up a whole cacheline
> for them for each CPU per instance of the device.
>

I don't buy it.

- If the driver has two counters per device then the storage is
infinitesimal.

- If it has multiple counters per device (always the case) then
the driver will aggregate them anyway.
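
A minimal userspace sketch of what "aggregate them" could look like: all of a device's counters sit in one struct, and each CPU gets one padded copy. The struct names, NCPUS and the 32-byte line size are assumptions made for the illustration, not taken from any driver.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS      4	/* assumption for the sketch */
#define LINE_BYTES 32	/* assumed cacheline size */

/* All of a device's counters live in one per-CPU aggregate. */
struct dev_stats {
	unsigned long reads;
	unsigned long writes;
	unsigned long errors;
};

/* One cacheline-aligned slot per CPU, so CPUs never share a line. */
struct dev_stats_slot {
	_Alignas(LINE_BYTES) struct dev_stats s;
};

struct dev_percpu_stats {
	struct dev_stats_slot cpu[NCPUS];
};

/* Sum one counter over all CPUs; each CPU only writes its own slot. */
static unsigned long total_reads(const struct dev_percpu_stats *p)
{
	unsigned long sum = 0;

	assert(p != NULL);
	for (int i = 0; i < NCPUS; i++)
		sum += p->cpu[i].s.reads;
	return sum;
}
```

Under these assumptions the whole aggregate costs one line per CPU per device, no matter how many counters fit inside it.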

I am not aware of any situations in which a driver has a large
(or even medium) number of small, discrete counters of this nature,
let alone enough of them to justify a new allocator.

I'd suggest that you drop the new allocator until a compelling
need for it (in real, live 2.5/2.6 code) has been demonstrated.


2002-12-05 21:17:17

by Dipankar Sarma

Subject: Re: [patch] kmalloc_percpu -- 2 of 2

On Thu, Dec 05, 2002 at 09:10:16PM +0100, Andrew Morton wrote:
>
> I'd suggest that you drop the new allocator until a compelling
> need for it (in real, live 2.5/2.6 code) has been demonstrated.

Fine with me, since at least one workaround for fragmentation with small
allocations is known. I can't see anything in the 2.5 timeframe
requiring small per-cpu allocations.

Would you like me to resubmit a simple kmalloc-only version?

Thanks
Dipankar

2002-12-09 05:24:17

by Ravikiran G Thirumalai

Subject: Re: [patch] kmalloc_percpu -- 2 of 2

Hi Andrew,
Sorry for the delayed response. I was away and couldn't
reply earlier.

On Thu, Dec 05, 2002 at 12:02:51PM -0800, Andrew Morton wrote:
> Dipankar Sarma wrote:
> >
> > Hi Andrew,
> >
> > On Wed, Dec 04, 2002 at 08:32:58PM -0800, Andrew Morton wrote:
> > > Where in the kernel is such a large number of 4-, 8- or 16-byte
> > > objects being used?
> >
> > Well, kernel objects may not be that small, but one would expect
> > the per-cpu parts of the kernel objects to be sometimes small, often down to
> > a couple of counters counting statistics.
>
> Sorry, "one would expect" is not sufficient grounds for incorporation of
> a new allocator. As far as I can tell, all the proposed users are in
> fact allocating decent-sized aggregates, and that will remain the usual
> case.

The main objective of the interlaced allocator was cacheline utilisation
more than main memory fragmentation (that has been my understanding, at least).
Without the interlaced allocator, we'd just pad the data and lose
precious cacheline space. With a general-purpose object
allocator, one would want objects in different cachelines, as kmalloc
does, but that is not the case for kmalloc_percpu users. If objA and
objB exist on the same cacheline, at least objB does not take
another cacheline. If you hit objB after objA, you gain; if
you don't, you lose nothing.
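
A minimal userspace sketch of the interlaced layout described above, assuming one cacheline-sized block per CPU and objects carved at fixed offsets inside it. The arena, interlaced_ptr() and line_of() are hypothetical names for illustration and are not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPUS      4	/* assumption for the sketch */
#define LINE_BYTES 32	/* assumed cacheline size */

/* One cacheline-sized block per CPU; objects are carved at fixed offsets. */
static _Alignas(LINE_BYTES) unsigned char arena[NCPUS * LINE_BYTES];

/* Hypothetical accessor: CPU `cpu`'s copy of the object at `offset`. */
static void *interlaced_ptr(int cpu, size_t offset)
{
	assert(cpu >= 0 && cpu < NCPUS && offset < LINE_BYTES);
	return arena + (size_t)cpu * LINE_BYTES + offset;
}

/* Index of the cacheline that an address falls in. */
static uintptr_t line_of(const void *p)
{
	return (uintptr_t)p / LINE_BYTES;
}
```

With objA at offset 0 and objB at offset 8, both copies for a given CPU fall in the same line, while the copies belonging to different CPUs are always a full line apart.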

As for the object sizes:
1. We are assuming 32-byte cachelines in this thread, I suppose.
But ppc64 has a 128-byte cacheline and s390 a 256-byte jumbo cacheline.
I guess with larger cacheline sizes you have fewer cachelines,
which makes them all the more precious. (Right now, I am speaking
in ignorance of the ppc64 and s390 cache architectures; I
can just see L1_CACHE_SHIFT in the kernel sources.) So wouldn't
interlaced allocations help these archs, even when you have
64-byte objects?
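
The arithmetic behind this point can be sketched as follows. This is a minimal userspace illustration with hypothetical helper names; the line sizes are the ones quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes lost per CPU when one object is padded out to whole cachelines. */
static size_t pad_waste(size_t obj_size, size_t line_size)
{
	size_t rem;

	assert(line_size > 0);
	rem = obj_size % line_size;
	return rem ? line_size - rem : 0;
}

/* How many objects of obj_size fit in one line when packed instead. */
static size_t objs_per_line(size_t obj_size, size_t line_size)
{
	assert(obj_size > 0);
	return line_size / obj_size;
}
```

A 64-byte object wastes nothing on a 32-byte line, but padding it to a full line costs 64 bytes per CPU with 128-byte lines and 192 bytes per CPU with 256-byte lines.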

2. When we have a case for data structures to be per-cpued, not all
the members will be frequently modified or 'bouncy'. Take
netdevice stats: the rx and tx counters are likely to be hot
and bouncy, and the others not that hot. Making the whole
structure per-cpu might not be good, but we did not have
a clean workaround until kmalloc_percpu. So when you start
identifying the hot members of these data structures and making
per-cpu objects only of those, your object size
tends to go down, which makes a case for the interlaced allocator.
This is not possible without the interlaced allocator, no?

Does this make a reasonable case for the interlaced allocator now?
(Of course, blklist init in the patch has to be modified to create
blklists for objects of size 4, 8, ..., SMP_CACHE_BYTES/2.)
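
A minimal userspace sketch of enumerating those sizes, assuming the elided values are the powers of two in between. blklist_sizes() is a hypothetical name; the actual blklist init code in the patch is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill `sizes` with 4, 8, 16, ... up to and including line_bytes / 2 and
 * return how many entries were written (at most `max`).
 * Assumption: the sizes between 8 and line_bytes / 2 are powers of two.
 */
static size_t blklist_sizes(size_t line_bytes, size_t *sizes, size_t max)
{
	size_t n = 0;

	assert(sizes != NULL);
	for (size_t sz = 4; sz <= line_bytes / 2 && n < max; sz <<= 1)
		sizes[n++] = sz;
	return n;
}
```

Under that assumption a 32-byte line yields three lists (4, 8 and 16 bytes) and a 128-byte line yields five (4 through 64 bytes).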

Thanks,
Kiran