2001-12-06 16:11:18

by Niels Christiansen

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters


Hi Kiran,

> Are you concerned with increase in memory used per counter here? I
> suppose that must not be that much of an issue for a 16 processor box....

Nope, I'm concerned that if this mechanism is to be used for all counters,
the improvement in cache coherence might be less significant to the point
where the additional overhead isn't worth it.

Arjan van de Ven voiced similar concerns, but he also said:

> There's several things where per cpu data is useful; low frequency
> statistics is not one of them in my opinion.

...which may be true for 4-ways and even 8-ways, but when you get to
32-ways and greater, you start seeing cache problems. That was the
case on AIX, and per-cpu counters were one of the changes that helped
get the spectacular scalability on Regatta.

Anyway, since we just had a long thread going on NUMA topology, maybe
it would be proper to investigate whether there is a better way, such as
using the topology to decide where to put counters? I think so, seeing
that most Intel-based 8-ways and above will have at least some NUMA
in them.

> Well, I wrote a simple kernel module which just increments a shared
> global counter a million times per processor in parallel, and compared
> it with the statctr which would be incremented a million times per
> processor in parallel..

I suspected that. Would it be possible to do the test on the real
counters?

Niels Christiansen
IBM LTC, Kernel Performance


2001-12-07 08:50:36

by Dipankar Sarma

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

Hi Niels,

On Thu, Dec 06, 2001 at 10:10:47AM -0600, Niels Christiansen wrote:
>
> Hi Kiran,
>
> > Are you concerned with increase in memory used per counter here? I
> > suppose that must not be that much of an issue for a 16 processor box....
>
> Nope, I'm concerned that if this mechanism is to be used for all counters,
> the improvement in cache coherence might be less significant to the point
> where the additional overhead isn't worth it.

In a low-cpu-count SMP box, yes, this will be a concern. Kiran and I
do plan to study this and understand the impact.

>
> Arjan van de Ven voiced similar concerns, but he also said:
>
> > There's several things where per cpu data is useful; low frequency
> > statistics is not one of them in my opinion.
>
> ...which may be true for 4-ways and even 8-ways, but when you get to
> 32-ways and greater, you start seeing cache problems. That was the
> case on AIX, and per-cpu counters were one of the changes that helped
> get the spectacular scalability on Regatta.

Yes. It also helped us in DYNIX/ptx on Sequent boxes. What we
need to do is verify whether theory based on prior experience also
applies to Linux.

>
> Anyway, since we just had a long thread going on NUMA topology, maybe
> it would be proper to investigate whether there is a better way, such as
> using the topology to decide where to put counters? I think so, seeing
> that most Intel-based 8-ways and above will have at least some NUMA
> in them.

It should be easy to place the counters in appropriately close
memory if linux gets good NUMA APIs built on top of the topology
services. If we extend kmem_cache_alloc() to allocate memory
in a particular NUMA node, we could simply do this for placing the
counters -

static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags)
{
	void *addr;
	struct pcpu_ctr_blk *blkp;
	unsigned int save_flags;
	int i;

	if (!(blkp = pcpu_ctr_blkctl_alloc(ctl, flags)))
		return 0;

	/* Get per cpu cache lines for the block */
	for_each_cpu(cpu) {
		blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep,
						flags, CPU_TO_NODE(cpu));
		if (!(blkp->lineaddr[cpu]))
			goto exit1;
		memset(blkp->lineaddr[cpu], 0, PCPU_CTR_LINE_SIZE);
	}

This would put each CPU's block of counters in memory local to that
CPU's NUMA node. If more sophisticated APIs for selecting suitable
memory become available, those could be used here as well.
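
For illustration, a counter increment would then touch only the current
CPU's line, which the allocation above has made node-local. A rough
sketch, assuming the counter handle records its block and an offset
within the per-cpu line (the actual statctr layout may differ):

static inline void statctr_inc(statctr_t *ctr)
{
	/* Index into this CPU's private cache line; no other CPU
	 * ever writes this line, so it stays in the local cache. */
	unsigned long *p = (unsigned long *)
		((char *)ctr->blkp->lineaddr[smp_processor_id()] + ctr->off);

	(*p)++;
}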

Is this the kind of thing you are looking at?

Thanks
Dipankar
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2001-12-07 11:37:30

by Ravikiran G Thirumalai

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

Hi Niels, Arjan

On Thu, Dec 06, 2001 at 10:10:47AM -0600, Niels Christiansen wrote:
>
> > Well, I wrote a simple kernel module which just increments a shared
> > global counter a million times per processor in parallel, and compared
> > it with the statctr which would be incremented a million times per
> > processor in parallel..
>
> I suspected that. Would it be possible to do the test on the real
> counters?

Yep, I am gonna run a benchmark after changing some stat counters in the
Kernel. That should let us know if there are performance gains or otherwise..
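
Roughly, what the module does on each CPU in parallel is something like
the sketch below (simplified, not the exact module code; statctr_t and
statctr_inc() stand for the patch's counter interface, and the driver
that runs this on every CPU is omitted):

#define ITERATIONS	1000000

static atomic_t shared_ctr = ATOMIC_INIT(0);	/* naive shared counter */
static statctr_t stat_ctr;			/* per-cpu statctr under test */

static void bench_one_cpu(void *unused)
{
	unsigned long long t0, t1, t2;
	int i;

	t0 = get_cycles();
	for (i = 0; i < ITERATIONS; i++)
		atomic_inc(&shared_ctr);	/* cache line bounces between CPUs */
	t1 = get_cycles();

	for (i = 0; i < ITERATIONS; i++)
		statctr_inc(&stat_ctr);		/* touches only the local line */
	t2 = get_cycles();

	printk("cpu%d: shared %llu cycles, statctr %llu cycles\n",
	       smp_processor_id(), t1 - t0, t2 - t1);
}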

Kiran
--
Ravikiran G Thirumalai <[email protected]>
Linux Technology Center, IBM Software Labs,
Bangalore.

2001-12-08 14:51:57

by Anton Blanchard

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters


Hi,

> > There's several things where per cpu data is useful; low frequency
> > statistics is not one of them in my opinion.
>
> ...which may be true for 4-ways and even 8-ways, but when you get to
> 32-ways and greater, you start seeing cache problems. That was the
> case on AIX, and per-cpu counters were one of the changes that helped
> get the spectacular scalability on Regatta.

I agree there is a lot of improvement to be done wrt cacheline
ping-ponging (see my patch in 2.4.17-pre6 for one example), but we
should do our own benchmarking and not look at what AIX has been doing.

Anton
(ppc64 Linux Hacker)

2001-12-08 22:24:03

by Paul Jackson

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

On Fri, 7 Dec 2001, Dipankar Sarma wrote:
> .... If we extend kmem_cache_alloc() to allocate memory
> in a particular NUMA node, we could simply do this for placing the
> counters -
>
> static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags)
> {
> ...
>
> 	/* Get per cpu cache lines for the block */
> 	for_each_cpu(cpu) {
> 		blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep,
> 						flags, CPU_TO_NODE(cpu));
> 		...
> 	}
>
> This would put the block of counters corresponding to a CPU in
> memory local to the NUMA node.

Rather than baking into each call of kmem_cache_alloc_node()
the CPU_TO_NODE() transformation, how about having a
kmem_cache_alloc_cpu() call that allocates closest to a
specified cpu.

I would prefer to avoid spreading the assumption that for each
cpu there is an identifiable node that has a single memory
that is best for all cpus on that node. Instead, assume that
for each cpu, there is an identifiable best memory ... and let
the internal implementation of kmem_cache_alloc_cpu() find that
best memory for the specified cpu.

Given this change, the kmem_cache_alloc_cpu() implementation
could use the CpuMemSets NUMA infrastructure that my group is
working on to find the best memory. With CpuMemSets, the
kernel will have, for each cpu, a list of memories, sorted
by distance from that cpu. Just take the first memory block
off the selected cpu's memory list for the above purpose.
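
To make this concrete, the call could look something like the sketch
below; cpumemset_nearest_memory() is a made-up name, standing in for
whatever lookup CpuMemSets ends up exporting:

void *kmem_cache_alloc_cpu(kmem_cache_t *cachep, int flags, int cpu)
{
	/* first entry on the cpu's distance-sorted memory list */
	int mem = cpumemset_nearest_memory(cpu);

	return kmem_cache_alloc_node(cachep, flags, mem);
}

Dipankar's loop would then read

	blkp->lineaddr[cpu] = kmem_cache_alloc_cpu(ctl->cachep, flags, cpu);

and the caller never needs to know how cpus map onto memories.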


I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2001-12-09 03:46:49

by Jack Steiner

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

>
> On Fri, 7 Dec 2001, Dipankar Sarma wrote:
> > .... If we extend kmem_cache_alloc() to allocate memory
> > in a particular NUMA node, we could simply do this for placing the
> > counters -
> >
> > static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags)
> > {
> > ...
> >
> > 	/* Get per cpu cache lines for the block */
> > 	for_each_cpu(cpu) {
> > 		blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep,
> > 						flags, CPU_TO_NODE(cpu));
> > 		...
> > 	}
> >
> > This would put the block of counters corresponding to a CPU in
> > memory local to the NUMA node.
>
> Rather than baking into each call of kmem_cache_alloc_node()
> the CPU_TO_NODE() transformation, how about having a
> kmem_cache_alloc_cpu() call that allocates closest to a
> specified cpu.

I think it depends on whether the slab allocator manages memory by cpu or
node. Since the number of cpus per node is rather small (<=8) for
most NUMA systems, I would expect the slab allocator to manage by node.
Managing by cpu would likely add extra fragmentation and no real performance
benefit.


>
> I would prefer to avoid spreading the assumption that for each
> cpu there is an identifiable node that has a single memory
> that is best for all cpus on that node.

But this is already true for the page allocator (alloc_pages_node()).

The page pools are managed by node, not cpu. All memory on a node is
managed by a single pg_data_t structure. This structure contains/points-to
the tables for the memory on the node (page structs, free lists, etc).

If a cpu needs to allocate local memory, it determines its node_id.
This node_id is in the cpu_data structure for the cpu, so this is an easy
calculation (one memory reference). The node_id is used to find the
pg_data_t struct for the node (an index into an array of node-local
pointers, so again, no off-node references).

Assuming the slab allocator manages by node, kmem_cache_alloc_node() &
kmem_cache_alloc_cpu() would be identical (except for spelling :-).
Each would pick up the node_id from the cpu_data struct, then allocate
from the slab cache for that node.
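
In other words, something like this trivial wrapper (illustrative only,
assuming cpu_data[] carries the node id as described above):

void *kmem_cache_alloc_cpu(kmem_cache_t *cachep, int flags, int cpu)
{
	/* one memory reference to find the cpu's home node ... */
	int nodeid = cpu_data[cpu].node_id;

	/* ... then allocate from that node's slab cache */
	return kmem_cache_alloc_node(cachep, flags, nodeid);
}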


> Instead, assume that
> for each cpu, there is an identifiable best memory ... and let
> the internal implementation of kmem_cache_alloc_cpu() find that
> best memory for the specified cpu.
>
> Given this change, the kmem_cache_alloc_cpu() implementation
> could use the CpuMemSets NUMA infrastructure that my group is
> working on to find the best memory. With CpuMemSets, the
> kernel will have, for each cpu, a list of memories, sorted
> by distance from that cpu. Just take the first memory block
> off the selected cpus memory list for the above purpose.
>
>
> I won't rest till it's the best ...
> Manager, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373


--
Thanks

Jack Steiner (651-683-5302) (vnet 233-5302) [email protected]

2001-12-09 04:43:30

by Paul Jackson

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

I think Jack got his attribution wrong. Which is good for me,
since I wrote what Jack just gently demolished <grin>.


On Sat, 8 Dec 2001, Jack Steiner wrote:

> I think it depends on whether the slab allocator manages
> memory by cpu or node. Since the number of cpus per node
> is rather small (<=8) for most NUMA systems, I would expect
> the slab allocator to manage by node. Managing by cpu would
> likely add extra fragmentation and no real performance benefit.

I wasn't intending to suggest that the slab allocator manage by
cpu, rather than by node. Pretty clearly, that would be silly.
Rather I was doing two things:

1) Suggesting that if some code asking for memory wanted
it on a node near to some cpu, that it not convert that
cpu to a node _before_ the call, but rather, pass in the
cpu, and let the called routine, kmem_cache_alloc_node()
or renamed to kmem_cache_alloc_cpu() map the cpu to the
node, inside the call.

2) Suggesting (against common usage in the kernel, as Jack
describes, so probably I'm tilting at windmills) that
we distinguish between nodes and chunks of memory
that I've started calling memory blocks.

I think (1) is sensible enough - the entire discussion leading
up to the code example involved getting memory near to some
cpu or other - so let the call state just that, and let the
details of translating to whatever units the slab allocator
works with be handled inside the call. Don't make each
caller remember to perform a CPU_TO_NODE conversion - it's
just a little silly duplication of code (a kernel a few
bytes larger), a slightly less direct interface, and one
more detail to impose on each person coding such a call.

As to (2) I'd have to get Jack in a room with a white board, and
at this point, I'm placing no bets on what we would conclude
(well, actually I'd bet on Jack if forced ... his batting
average is pretty good ;).


I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2001-12-09 17:35:15

by Jack Steiner

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

>
> I think Jack got his attribution wrong. Which is good for me,
> since I wrote what Jack just gently demolished <grin>.

And I probably should not have been reading mail while I
debugged a weird system hang. :-) I missed the earlier
part of the thread - I thought you were referring to local
allocation.

I don't think I have a strong opinion yet about kmem_cache_alloc_node()
vs kmem_cache_alloc_cpu(). I would not be surprised to find that
both interfaces make sense.

If code wants to allocate close to a cpu, then kmem_cache_alloc_cpu()
is the best choice. However, I would also expect that some code
already knows the node. Then kmem_cache_alloc_node() is best.

Conversion of cpu->node is easy. Conversion of node->cpu
is slightly more difficult (currently) and has the ambiguity
that there may be multiple cpus on the node - which one should
you select? And does it matter?

As precedent, the page allocation routines are all node-based
(i.e., alloc_pages_node(), etc...).


--
Thanks

Jack Steiner (651-683-5302) (vnet 233-5302) [email protected]

2001-12-11 23:27:35

by Paul Jackson

Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

On Sun, 9 Dec 2001, Jack Steiner wrote:

> If code wants to allocate close to a cpu, then kmem_cache_alloc_cpu()
> is the best choice. However, I would also expect that some code
> already knows the node. Then kmem_cache_alloc_node() is best.

yup.


> As precedent, the page allocation routines are all node-based
> (i.e., alloc_pages_node(), etc...).

My inclinations would be to prefer more cpu-based allocators.
But until I happen to catch you in a room with a white board,
my inclinations are unlikely to go anywhere ... perhaps someday.


I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373