Hello, Kiran,
> Statistics counters are used in many places in the Linux kernel,
> including the storage and network I/O subsystems, etc. These counters
> are not atomic, since accuracy is not so important. Nevertheless,
> frequent updates of these counters result in cacheline bouncing among
> the various cpus in a multiprocessor environment. This patch introduces
> a new set of interfaces which should improve the performance of such
> counters in an MP environment. This implementation switches to code
> that is devoid of SMP overheads if these interfaces are used with a UP
> kernel.
>
> Comments are welcome :)
>
>Regards,
>Kiran
I'm wondering about the scope of this. My Ethernet adapter, with maybe
20 counter fields, would have 20 counters allocated for each of my 16
processors. The only way to get the total would be to use
statctr_read() to merge them. Same for the who-knows-how-many IP
counters, etc., etc.
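For concreteness, here is roughly what I assume such a counter looks
like; only statctr_read() is named in the patch announcement, so
statctr_inc() and the padded layout below are my guesses:

/* A minimal sketch of a per-CPU statistics counter, assuming a
 * padded per-CPU array; only statctr_read() is named in the patch,
 * statctr_inc() and this layout are guesses for illustration. */
#define NR_CPUS     16
#define CACHE_BYTES 64  /* assumed cacheline size */

struct statctr {
    struct {
        unsigned long val;
        char pad[CACHE_BYTES - sizeof(unsigned long)];
    } percpu[NR_CPUS];          /* one cacheline per CPU */
};

/* Update path: each CPU dirties only its own cacheline. */
static inline void statctr_inc(struct statctr *ctr, int cpu)
{
    ctr->percpu[cpu].val++;
}

/* Read path: rare, so it can afford to touch every CPU's line. */
static inline unsigned long statctr_read(struct statctr *ctr)
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < NR_CPUS; i++)
        sum += ctr->percpu[i].val;
    return sum;
}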
How many and which counters were converted for the test you refer to?
I do like the idea of a uniform access mechanism, though. It is well in
line with my thoughts about an architected interface for topology and
instrumentation, so I'll definitely get back to you as I try to collect
requirements.
Niels
Hi Niels,
On Wed, Dec 05, 2001 at 10:02:33AM -0500, Niels Christiansen wrote:
>
> I'm wondering about the scope of this. My Ethernet adapter, with maybe
> 20 counter fields, would have 20 counters allocated for each of my 16
> processors. The only way to get the total would be to use
> statctr_read() to merge them. Same for the who-knows-how-many IP
> counters, etc., etc.
Are you concerned about the increase in memory used per counter here? I
suppose that shouldn't be much of an issue for a 16-processor box...
>
> How many and which counters were converted for the test you refer to?
>
Well, I wrote a simple kernel module which increments a shared global
counter a million times per processor in parallel, and compared that
with a statctr incremented a million times per processor in parallel.
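Roughly, the module boiled down to the two loops below (a sketch of
what I just described, not the actual test code; statctr_inc() and
struct statctr follow the guessed interface sketched earlier in the
thread):

/* A sketch of the micro-benchmark described above (not the actual
 * module); each CPU runs one of these loops a million times. */
#define ITERATIONS 1000000UL

static volatile unsigned long shared_ctr;   /* case 1: one global counter */
static struct statctr stat_ctr;             /* case 2: per-CPU slots */

static void bench_shared(void)
{
    unsigned long i;

    /* every increment may steal the cacheline from the other CPUs */
    for (i = 0; i < ITERATIONS; i++)
        shared_ctr++;
}

static void bench_statctr(int cpu)
{
    unsigned long i;

    /* each CPU stays within its own cacheline */
    for (i = 0; i < ITERATIONS; i++)
        statctr_inc(&stat_ctr, cpu);
}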
> I do like the idea of a uniform access mechanism, though. It is well in
> line with my thoughts about an architected interface for topology and
> instrumentation, so I'll definitely get back to you as I try to collect
> requirements.
>
> Niels
Hope we can come up with a really cool and acceptable interface.
Kiran
--
Ravikiran G Thirumalai <[email protected]>
Linux Technology Center, IBM Software Labs,
Bangalore.
On Thu, 6 Dec 2001 18:03:53 +0530,
Ravikiran G Thirumalai <[email protected]> wrote:
>Hope we can come up with a really cool and acceptable interface.
How about a user space interface that runs at machine speed and
extracts counters without any syscall overhead? This proposal got very
little attention at the time so we put it aside until more people
were interested.
http://marc.theaimsgroup.com/?l=linux-kernel&m=98578952028153&w=2
Ralf Baechle has pointed out one problem: virtually indexed caches :(.
They prevent a single user space mmap over the scattered kernel pages;
kernel and user space addresses have to be in sync in the cache. So
user space sees the scattered pages and has to walk the structure
itself. No big deal, just a library function that converts an instance
name and cpu number into an address.
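Something along these lines, say (all of these names are hypothetical;
the real map layout would be dictated by the kernel):

/* Hypothetical user-space helper: resolve (instance name, cpu) to
 * the counter's address within the mmap'ed kernel pages. The
 * ctr_map layout is an assumption for illustration only. */
#include <stddef.h>
#include <string.h>

#define NR_CPUS 16

struct ctr_map {
    const char *name;           /* instance name, e.g. "eth0" */
    void       *page[NR_CPUS];  /* one mmap'ed page per CPU */
    size_t      offset;         /* counter offset within each page */
};

unsigned long *ctr_addr(struct ctr_map *maps, int nmaps,
                        const char *name, int cpu)
{
    int i;

    for (i = 0; i < nmaps; i++)
        if (strcmp(maps[i].name, name) == 0)
            return (unsigned long *)
                   ((char *)maps[i].page[cpu] + maps[i].offset);
    return NULL;                /* no such instance */
}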
Ravikiran G Thirumalai wrote:
>
> Hi Niels,
>
> On Wed, Dec 05, 2001 at 10:02:33AM -0500, Niels Christiansen wrote:
> >
> > I'm wondering about the scope of this. My Ethernet adapter, with maybe
> > 20 counter fields, would have 20 counters allocated for each of my 16
> > processors. The only way to get the total would be to use
> > statctr_read() to merge them. Same for the who-knows-how-many IP
> > counters, etc., etc.
>
> Are you concerned about the increase in memory used per counter here? I
> suppose that shouldn't be much of an issue for a 16-processor box...
>
> >
> > How many and which counters were converted for the test you refer to?
> >
>
> Well, I wrote a simple kernel module which increments a shared global
> counter a million times per processor in parallel, and compared that
> with a statctr incremented a million times per processor in parallel.
Would you care to point out a statistic in the kernel that is
incremented more than 10,000 times/second? (I'm giving you a factor of
100 of leeway here.) [One that isn't per-cpu yet, of course.]
Greetings,
Arjan van de Ven
On Thu, Dec 06, 2001 at 01:07:37PM +0000, Arjan van de Ven wrote:
> >
> > >
> > > How many and which counters were converted for the test you refer to?
> > >
> >
> > Well, I wrote a simple kernel module which increments a shared global
> > counter a million times per processor in parallel, and compared that
> > with a statctr incremented a million times per processor in parallel.
>
> Would you care to point out a statistic in the kernel that is
> incremented more than 10,000 times/second? (I'm giving you a factor of
> 100 of leeway here.) [One that isn't per-cpu yet, of course.]
Well, as I mentioned in my earlier post, we have only performed
"micro benchmarking", which does not reflect actual run-time kernel
conditions, so I guess you have to take these results with a pinch of
salt.
But you cannot deny that there are going to be a lot of cacheline
invalidations if you use a global counter. Using per-cpu versions is
definitely going to improve kernel performance.
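One caveat worth spelling out: per-cpu copies only help if each copy
sits in its own cacheline; a packed array bounces almost as badly as a
single global. A sketch of the pitfall (not code from the patch):

/* The pitfall: per-CPU counters packed together still share lines. */
unsigned long packed[16];    /* 16 longs span only a couple of
                              * cachelines: CPUs still invalidate
                              * each other on every increment */

struct padded_ctr {
    unsigned long val;
    char pad[64 - sizeof(unsigned long)];   /* 64 = assumed line size */
} padded[16];                /* one full cacheline per CPU: no bouncing */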
Kiran
--
Ravikiran G Thirumalai <[email protected]>
Linux Technology Center, IBM Software Labs,
Bangalore.
On Thu, Dec 06, 2001 at 07:39:40PM +0530, Ravikiran G Thirumalai wrote:
> Well, as I mentioned in my earlier post, we have only performed
> "micro benchmarking", which does not reflect actual run-time kernel
> conditions, so I guess you have to take these results with a pinch of
> salt.
>
> But you cannot deny that there are going to be a lot of cacheline
> invalidations if you use a global counter. Using per-cpu versions is
> definitely going to improve kernel performance.
There aren't that many counters, in fact. And if you care about a gige
counter, just bind the card to a specific CPU and you have ad-hoc
per-cpu counters...
The extra cost of getting to them (the extra indirection) makes each
access more expensive; in the end it might be a loss.
There are several things for which per-cpu data is useful; low-frequency
statistics are not one of them, in my opinion.
Greetings,
Arjan van de Ven
On Thu, Dec 06, 2001 at 09:10:15AM -0500, Arjan van de Ven wrote:
> On Thu, Dec 06, 2001 at 07:39:40PM +0530, Ravikiran G Thirumalai wrote:
> > But you cannot deny that there are going to be a lot of cacheline
> > invalidations if you use a global counter. Using per-cpu versions is
> > definitely going to improve kernel performance.
>
> There aren't that many counters, in fact. And if you care about a gige
> counter, just bind the card to a specific CPU and you have ad-hoc
> per-cpu counters...
>
> The extra cost of getting to them (the extra indirection) makes each
> access more expensive; in the end it might be a loss.
If it is a low-frequency statistic, then the expensive access wouldn't
really matter much, right? On the other hand, this will likely help,
especially with a larger number of CPUs.
>
> There's several things where per cpu data is useful; low frequency
> statistics is not one of them in my opinion.
It is quite possible that you are right. What we need is a measurement
effort to understand the impact.
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
--On Thursday, 06 December, 2001 1:07 PM +0000 Arjan van de Ven
<[email protected]> wrote:
> Would you care to point out a statistic in the kernel that is
> incremented more than 10,000 times/second? (I'm giving you a factor of
> 100 of leeway here.) [One that isn't per-cpu yet, of course.]
cat /proc/net/dev
80,000 increments a second here on at least 4 counters
--
Alex Bligh
On Fri, Dec 07, 2001 at 09:09:25PM -0000, Alex Bligh - linux-kernel wrote:
>
>
> --On Thursday, 06 December, 2001 1:07 PM +0000 Arjan van de Ven
> <[email protected]> wrote:
>
> > Would you care to point out a statistic in the kernel that is
> > incremented more than 10,000 times/second? (I'm giving you a factor of
> > 100 of leeway here.) [One that isn't per-cpu yet, of course.]
>
> cat /proc/net/dev
>
> 80,000 increments a second here on at least 4 counters
except that
1) you can (and should) bind nics to cpus
and
2) the nic stats should (for good drivers) live in a cacheline the ISR
pulls into the cpu ANYWAY to get at the device data -> no extra
cacheline pingpong
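i.e. something like the layout below (a sketch of the argument, not any
particular driver):

/* Sketch: when the stats live inside the per-device private struct,
 * the ISR has already pulled this data into the CPU to reach the
 * device state, so bumping a counter adds no extra cacheline traffic. */
struct nic_priv {
    void *regs;                 /* device registers, touched by the ISR anyway */
    unsigned long rx_packets;   /* nearby stats ride along for free */
    unsigned long rx_bytes;
};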