Hi all,
I've created very preliminary patches to add PPC Performance Monitor
support to oprofile cvs and linux 2.4.20-ben8:
http://bryanr.org/linux/oprofile/
Sadly, this approach seems very unstable on a 750CX imac (PVR 0008 2214).
The box freezes hard and requires a cold reboot after just a few minutes of
profiling. As benh hinted in a previous thread, I suspect there's an
undocumented erratum for this CPU related to decrementer + pm use.
If anyone has contacts in IBM, further info would be helpful to rule out
software error and possibly to obtain a workaround...
Assuming lots of PPC perfmon hardware is effectively useless with the
decrementer, some solutions are:
1. use the pm irq for all timer-related stuff in Linux, and turn off the
decrementer completely. May mean we need to disable idle loop HLTs, hurting
thermal dissipation / power consumption--I believe the PMC cycle counters
stop incrementing inside power_save_6xx().
2. use the decrementer for profiling. we'd forfeits the perfmon's ability to
sample when MSR[EE]=0 (irqs disabled), taking a big chunk out of
oprofile's appeal imo.
3. hybrid approach. when not using oprofile, the kernel runs as it does
now via the decrementer. when profiling, we switch everything over to the
pm system, disabling the power_save stuff. I suspect this is the best
approach on broken silicon, though there may be some minor nastiness with
swapping between different exception mechanisms for timer_interrupt()
I'm going to try the hybrid approach and will post new patches soon.
My work is commercially sponsored by Ixia and therefore focuses on 2.4
and the IBM PPC750FX/CXe. However, I'm happy to discuss a 2.5 port,
support for other chips, an eventual upstream merge, and any other issues
related to bringing up oprofile on the PPC.
-Bryan
On Mon, Mar 24, 2003 at 09:09:00PM -0800, Bryan Rittmeyer wrote:
> I'm going to try the hybrid approach and will post new patches soon.
done. patches are -v0002 at http://bryanr.org/linux/oprofile/
seems to be working pretty well, definitely _way_ more stable than
the patches that mix perfmon+decrementer... biggest kernel cycle
wasters during an imac ping -f run over ssh:
c0031344 20524 1.02949 kmalloc
c00d4ab4 22232 1.11516 netif_rx
c0074b98 24220 1.21488 n_tty_receive_buf
c00332e0 25583 1.28325 __free_pages_ok
c008ce68 28961 1.45269 gem_start_xmit
c00223a0 31661 1.58812 sys_rt_sigprocmask
c00eb92c 35584 1.7849 tcp_sendmsg
c003be24 38461 1.92921 fput
c004c55c 44923 2.25335 do_select
c0012804 47443 2.37975 memcpy
c0012980 50239 2.52 __copy_tofrom_user
c000452c 51645 2.59053 restore
c008c570 53061 2.66155 gem_rx
c004c7cc 56388 2.82844 sys_select
c008cae0 64265 3.22355 gem_interrupt
c0005b9c 148240 7.43576 idled
c000db34 242388 12.1583 openpic_read
wow. inside of open_pic.c, I bet it's code like this
while (openpic_read(addr) & OPENPIC_ACTIVITY);
that's showing up.
anyway, a short-term PPC oprofile TODO:
- further stress/stability testing (overnight etc)
- re-allow power_save_6xx() when not profiling
- actually use config options instead of hardcoding perfmon rate
- userspace profiling system call hooks
- fix op_start race between perfmon and decrementer (?)
- code to call timer_interrupt() at approximately HZ
independent of profile counter rate (or is it 2*HZ, heh)
- add support for other perfmon features (possible given that we have
to call timer_interrupt and thus need something periodic?)
If you play with this, please let me know how it goes. Ignoring the
rough edges it should be basically useable for kernel/module profiling,
even on code with irqs disabled.
-Bryan
On Tue, 2003-03-25 at 09:57, Bryan Rittmeyer wrote:
> wow. inside of open_pic.c, I bet it's code like this
>
> while (openpic_read(addr) & OPENPIC_ACTIVITY);
>
> that's showing up.
That's interesting. You are raising again an old debate of
wether to disable the IRQ during handling on openpic or not,
I beleive we don't need to disable it on this PIC, but we
do this for "safety" reasons.
In fact, I think our
openpic_ack_irq() should do something like
if (edge)
openpic_eoi();
and our openpic_end_irq() something like
if (level)
openpic_eoi();
Also, the fact that PIC access is slow is a generic "feature"
of such chips, I also think we could actually be smarter and only
soft-disable IRQs with a flag in the descriptor, and hard disable
them if and only if they actually occur while disabled.
Ben.
On Tue, Mar 25, 2003 at 05:43:09PM +0000, John Levon wrote:
> > done. patches are -v0002 at http://bryanr.org/linux/oprofile/
>
> Looks OK, modulo some minor style issues (see doc/CodingStyle).
will fix in v0004.
> The patch seems to be out of date already though. Does the new cpu speed
> code not work OK for you (for the default event value) ?
I was lagging cvs; v0003 merges to today's tree, including the new
CPU_SPEED code. posted at the url above.
> Then you should fix this generally, instead of adding the hack you do.
agree. here's a separate patch for x86+ia64. with it, there's one
extra argument to op_do_profile, and oprofile.c no longer uses op_arch.h
the architecture code directly passes eip and irq_enabled. tested on
i686 2.4.20 and ppc 2.4.20-benh.
http://bryanr.org/linux/oprofile/op_do_profile-refactor.patch
ppc v0003 depends on this change.
> > +/* TODO: fix upper level. [op_rtc_ops in ppc/ia64] is really lame. */
>
> Sure.
I'll let you handle this one.
BTW how do you feel about reworking add_sysctl and remove_sysctl to move
shared code inside oprofile.c? Right now the duplication is causing
inconsistency e.g ia64/op_pmu.c "next->mode = 0700;" vs x86/op_nmi.c
"next->mode = 0755;"
-Bryan
On Tue, Mar 25, 2003 at 10:45:02AM +0100, Benjamin Herrenschmidt wrote:
> I also think we could actually be smarter and only
> soft-disable IRQs with a flag in the descriptor, and hard disable
> them if and only if they actually occur while disabled.
yup. minimizing open_pic I/O could be a big win. the fact that
the PIC shows up _way_ ahead of the network driver and stack is
pretty freakish compared to a profile on x86 using io-apic
(the legacy x86 pic has similar but less severe issues as open_pic).
-Bryan
On Thu, Mar 27, 2003 at 02:37:34PM -0600, Andrew Fleming wrote:
> I'm 99.9% positive that the counters are different. In my experience,
> every new microarchitecture has a completely different set of counters.
yup. a kernel API to wrap all the low-level register config would be
nice but my interest for now is just the ppc750cx. If someone else wants to
build additional ppc32 support into oprofile, my patches could serve as a
basic starting point.
> Earlier on the list, the possibility of using the performance interrupt
> with the timebase was proposed.
the timebase is simpler but less configurable than the PMCs.
my latest patch runs everything (oprofile+timer_interrupt)
via the timebase--there is no real support yet for the more
sophisticated PMCs.
We can allow for more user configurability by using the timebase
along side PMC interrupts. However, then we have to identify
which counter(s) caused a given interrupt, and that probably means
keeping some past state for the timebase and PMCs. For now, I'm satisfied
with hardcoded 1500 Hz profiling via the timebase; finishing up the syscall
interception is more important.
> [dec/pm switch-over race in v0003]
known issue, I was pondering a solution for a few days. It seems like
the decrementer is not individually maskable or haltable so my v0004
approach is to just do mtspr(SPRN_DEC, 0x7fffffff) upon enabling perfmon
and again inside each pm_irq.
-Bryan
On Thu, Mar 27, 2003 at 01:01:21AM +0000, John Levon wrote:
> Please try to keep to the style.
and lo, v0004 was born:
- use (3 KHz timebase / 2) for driving profiler
- fix startup collision between decrementer and perfmon
- dynamically enable/disable power management on start/stop
- use PVR to make sure we're on a supported CPU
- actually start/stop inside pm_start_cpu/pm_stop_cpu
- begin to implement syscall interception
- build system-related cleanups
- further comments
- style cleanups
http://bryanr.org/linux/oprofile/
-Bryan