2007-01-30 21:41:20

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Arnd Bergmann wrote:

> On Monday 29 January 2007 20:48, Maynard Johnson wrote:
>
>>Subject: Add support to OProfile for profiling Cell BE SPUs
>>
>>From: Maynard Johnson <[email protected]>
>>
>>This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
>>to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory
>>was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
>>code.
>>
>>Signed-off-by: Carl Love <[email protected]>
>>Signed-off-by: Maynard Johnson <[email protected]>
>
>
> I can't really say much about the common oprofile files that you are
> touching, maybe someone from oprofile-list (Philippe?) to look over them
> and ack/nack them.
Anton (added to cc list) may be another good reviewer of
drivers/oprofile changes.
>
>
>>+#define number_of_online_nodes(nodes) { \
>>+ u32 cpu; u32 tmp; \
>>+ nodes = 0; \
>>+ for_each_online_cpu(cpu) { \
>>+ tmp = cbe_cpu_to_node(cpu) + 1;\
>>+ if (tmp > nodes) \
>>+ nodes++; \
>>+ } \
>>+}
>
>
> I've been discussing with benh about a better way to do this. We should
> change all the references to nodes and cpu numbers to something more
> correct in the future, so we get rid of the assumption that each
> numa node is a cell chip. It's probably best to leave your code as
> is for now, but we need to remember this one when cleaning it up.
>
> You should however convert this into an inline function
> of prototype
>
> static inline int number_of_online_nodes(void)
>
> instead of a macro.
OK.
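Something along these lines, keeping the same counting logic (which assumes
cbe_cpu_to_node() numbers the nodes consecutively from 0):

static inline int number_of_online_nodes(void)
{
	u32 cpu;
	u32 nodes = 0;

	for_each_online_cpu(cpu) {
		u32 tmp = cbe_cpu_to_node(cpu) + 1;
		if (tmp > nodes)
			nodes++;
	}
	return nodes;
}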
>
>
>>+/* Defines used for sync_start */
>>+#define SKIP_GENERIC_SYNC 0
>>+#define SYNC_START_ERROR -1
>>+#define DO_GENERIC_SYNC 1
>>+
>>+typedef struct vma_map
>>+{
>>+ struct vma_map *next;
>>+ unsigned int vma;
>>+ unsigned int size;
>>+ unsigned int offset;
>>+ unsigned int guard_ptr;
>>+ unsigned int guard_val;
>>+} vma_map_t;
>
>
> please don't typedef structures.
Sure.
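I.e., the same declaration without the typedef:

struct vma_map {
	struct vma_map *next;
	unsigned int vma;
	unsigned int size;
	unsigned int offset;
	unsigned int guard_ptr;
	unsigned int guard_val;
};

with callers using 'struct vma_map *' instead of 'vma_map_t *'.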
>
>
[snip]
>>+
>>+static int spu_prof_running = 0;
>>+static unsigned int profiling_interval = 0;
>>+
>>+extern int num_nodes;
>>+extern unsigned int khzfreq;
>
>
> You really can't have global variable with such generic names. For
> static variables, it's less of a problem, since they are not in the
> same name space, but for easier debugging, it's good to always have
> the name of your module (e.g. spu_prof_) as a prefix to the identifier.
OK, we'll add 'spu_prof' prefix to them.
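That is, roughly (exact names to be settled when we rework the code):

extern int spu_prof_num_nodes;		/* was: num_nodes */
extern unsigned int spu_prof_khzfreq;	/* was: khzfreq */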
>
> Of course, the best way would be to avoid global and static variables
> entirely, but that's not always possible.
>
>
[snip]
>>+ // process the collected SPU PC for each node
>>+ for_each_online_cpu(cpu) {
>>+ if (cbe_get_hw_thread_id(cpu))
>>+ continue;
>>+
>>+ node = cbe_cpu_to_node(cpu);
>>+ node_factor = node * SPUS_PER_NODE;
>>+ /* number of valid entries for this node */
>>+ entry = 0;
>>+
>>+ trace_addr = cbe_read_pm(cpu, trace_address);
>>+ while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
>>+ {
>>+ /* there is data in the trace buffer to process */
>>+ cbe_read_trace_buffer(cpu, trace_buffer);
>>+ spu_mask = 0xFFFF000000000000;
>>+
>>+ /* Each SPU PC is 16 bits; hence, four spus in each of
>>+ * the two 64-bit buffer entries that make up the
>>+ * 128-bit trace_buffer entry. Process the upper and
>>+ * lower 64-bit values simultaneously.
>>+ */
>>+ for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
>>+ spu_pc_lower = spu_mask & trace_buffer[0];
>>+ spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
>>+ * (SPUS_PER_TB_ENTRY-spu-1));
>>+
>>+ spu_pc_upper = spu_mask & trace_buffer[1];
>>+ spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
>>+ * (SPUS_PER_TB_ENTRY-spu-1));
>>+
>>+ spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
>>+
>>+ /* spu PC trace entry is upper 16 bits of the
>>+ * 18 bit SPU program counter
>>+ */
>>+ spu_pc_lower = spu_pc_lower << 2;
>>+ spu_pc_upper = spu_pc_upper << 2;
>>+
>>+ samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
>>+ = (u32) spu_pc_lower;
>>+ samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
>>+ = (u32) spu_pc_upper;
>>+ }
>>+
>>+ entry++;
>>+
>>+ if (entry >= TRACE_ARRAY_SIZE)
>>+ /* spu_samples is full */
>>+ break;
>>+
>>+ trace_addr = cbe_read_pm(cpu, trace_address);
>>+ }
>>+ samples_per_node[node] = entry;
>>+ }
>>+}
>
>
> While I can't see anything technically wrong with this function, it would be
> good to split it into smaller functions. Since you are nesting three
> loops, it should be possible to make a separate function from one of the
> inner loops without changing the actual logic behind it.
Will do.
>
>
>>+
>>+static int profile_spus(struct hrtimer * timer)
>>+{
>>+ ktime_t kt;
>>+ int cpu, node, k, num_samples, spu_num;
>
>
> whitespace damage
fixed
>
>
>>+
>>+ if (!spu_prof_running)
>>+ goto STOP;
>>+
>>+ cell_spu_pc_collection();
>>+ for_each_online_cpu(cpu) {
>>+ if (cbe_get_hw_thread_id(cpu))
>>+ continue;
>
>
> Here, you enter the same top-level loop again, why not make it
> for_each_online_cpu(cpu) {
> if (cbe_get_hw_thread_id(cpu))
> continue;
> num_samples = cell_spu_pc_collection(cpu);
> ...
Yes, good suggestion.
>
>
>>+ kt = ktime_set(0, profiling_interval);
>>+ if (!spu_prof_running)
>>+ goto STOP;
>>+ hrtimer_forward(timer, timer->base->get_time(), kt);
>>+ return HRTIMER_RESTART;
>
>
> is hrtimer_forward really the right interface here? You are ignoring
> the number of overruns anyway, so hrtimer_start(,,) sounds more
> correct to me.
According to Tom Gleixner, "hrtimer_forward is a convenience function to
move the expiry time of a timer forward in multiples of the interval, so
it is in the future. After setting the expiry time you restart the
timer either with [sic] a return HRTIMER_RESTART (if you are in the
timer callback function)."
>
>
>>+
>>+ STOP:
>
>
> labels should be in small letters.
OK
>
>
>>+ printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
>>+ return HRTIMER_NORESTART;
>>+}
>
>
>>+void start_spu_profiling(unsigned int cycles_reset) {
>>+
>>+ ktime_t kt;
>>+
>>+ /* To calculate a timeout in nanoseconds, the basic
>>+ * formula is ns = cycles_reset * (NSEC_PER_SEC / cpu frequency).
>>+ * To avoid floating point math, we use the scale math
>>+ * technique as described in linux/jiffies.h. We use
>>+ * a scale factor of SCALE_SHIFT, which provides 4 decimal places
>>+ * of precision, which is close enough for the purpose at hand.
>>+ */
>>+
>>+ /* Since cpufreq_quick_get returns frequency in kHz, we use
>>+ * USEC_PER_SEC here vs NSEC_PER_SEC.
>>+ */
>>+ unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
>>+ profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
>>+
>>+ pr_debug("timer resolution: %lu\n",
>>+ TICK_NSEC);
>
>
> Don't you need to adapt the profiling_interval at run time, when cpufreq
> changes the core frequency? You should probably use
> cpufreq_register_notifier() to update this.
Since OProfile is a statistical profiler, the exact frequency is not
critical. The user is going to be looking for hot spots in their code,
so it's all relative. With that said, I don't imagine using the
cpufreq notification would be a big deal. We'll look at it.
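A rough sketch of what that could look like (the callback name and where the
interval gets recomputed are assumptions on our side, not part of the current
patch; needs <linux/cpufreq.h>):

static int spu_prof_cpufreq_notify(struct notifier_block *nb,
				   unsigned long val, void *data)
{
	struct cpufreq_freqs *frq = data;

	if (val == CPUFREQ_POSTCHANGE) {
		/* reported in kHz, like cpufreq_quick_get() */
		khzfreq = frq->new;
		/* recompute profiling_interval from the new frequency */
	}
	return 0;
}

static struct notifier_block spu_prof_cpufreq_nb = {
	.notifier_call = spu_prof_cpufreq_notify,
};

/* registered once at module init:
 *	cpufreq_register_notifier(&spu_prof_cpufreq_nb,
 *				  CPUFREQ_TRANSITION_NOTIFIER);
 */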
>
>
>>+ kt = ktime_set(0, profiling_interval);
>>+ hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_REL);
>>+ timer.expires = kt;
>>+ timer.function = profile_spus;
>>+
>>+ /* Allocate arrays for collecting SPU PC samples */
>>+ samples = (u32 *) kzalloc(num_nodes * SPUS_PER_NODE * TRACE_ARRAY_SIZE * sizeof(u32), GFP_ATOMIC);
>
>
> Try to avoid atomic allocations. I don't think you are in an atomic
> context here at all, so you can just use GFP_KERNEL.
OK, I'll check it out.
>
>
>>+ samples_per_node = (u32 *) kzalloc(num_nodes * sizeof(u32), GFP_ATOMIC);
>
>
> Since MAX_NUMNODES is small, it's probably more efficient to just allocate this
> statically.
OK.
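I.e., just:

static u32 samples_per_node[MAX_NUMNODES];

sized for the maximum rather than allocated at start time.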
>
>
>>+
>>+ spu_prof_running = 1;
>>+ hrtimer_start(&timer, kt, HRTIMER_REL);
>>+}
>>
>>+
>>+void stop_spu_profiling(void)
>>+{
>>+
>>+ hrtimer_cancel(&timer);
>>+ kfree(samples);
>>+ kfree(samples_per_node);
>>+ pr_debug("SPU_PROF: stop_spu_profiling issued\n");
>>+ spu_prof_running = 0;
>>+}
>
>
> shouldn't you set spu_prof_running = 0 before doing any of the other things?
> It looks to me like you could otherwise get into a use-after-free
> situation. If I'm wrong with that, you probably don't need spu_prof_running
> at all ;-)
No, you're right. :-)
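So the fix is simply to clear the flag first, e.g.:

void stop_spu_profiling(void)
{
	spu_prof_running = 0;
	hrtimer_cancel(&timer);
	kfree(samples);
	kfree(samples_per_node);
	pr_debug("SPU_PROF: stop_spu_profiling issued\n");
}

(same body as before, just reordered so the timer callback can never touch
freed buffers).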
>
>
>>+/* Container for caching information about an active SPU task.
>>+ * :map -- pointer to a list of vma_maps
>>+ * :spu -- the spu for this active SPU task
>>+ * :list -- potentially could be used to contain the cached_infos
>>+ * for inactive SPU tasks.
>
>
> Documenting structures is good, but please use the common kerneldoc format
> for it. There are a number of examples for this in include/linux/
OK
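For example, in kerneldoc style the comment for struct cached_info (quoted a
bit further below) would become roughly:

/**
 * struct cached_info - cached information about an active SPU task
 * @map:       pointer to a list of vma_maps
 * @the_spu:   the spu for this active SPU task
 * @cache_ref: kref used to manage the lifetime of this object
 * @list:      could be used to hold cached_infos for inactive SPU tasks
 */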
>
>
>>+ *
>>+ * Ideally, we would like to be able to create the cached_info for
>>+ * an SPU task just one time -- when libspe first loads the SPU
>>+ * binary file. We would store the cached_info in a list. Then, as
>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>+ * for inactive tasks would be kept, and the active one would be placed
>>+ * at the head of the list. But this technique may not work with
>>+ * current spufs functionality since the spu used in bind_context may
>>+ * be a different spu than was used in a previous bind_context for a
>>+ * reactivated SPU task. Additionally, a reactivated SPU task may be
>>+ * assigned to run on a different physical SPE. We will investigate
>>+ * further if this can be done.
>>+ *
>>+ */
>
>
> You should stuff a pointer to cached_info into struct spu_context,
> e.g. 'void *profile_private'.
>
>
>>+struct cached_info {
>>+ vma_map_t * map;
>>+ struct spu * the_spu;
>>+ struct kref cache_ref;
>>+ struct list_head list;
>>+};
>
>
> And replace the 'the_spu' member with a back pointer to the
> spu_context if you need it.
>
>
>>+
>>+/* A data structure for cached information about active SPU tasks.
>>+ * Storage is dynamically allocated, sized as
>>+ * "number of active nodes multplied by 8".
>>+ * The info_list[n] member holds 0 or more
>>+ * 'struct cached_info' objects for SPU#=n.
>>+ *
>>+ * As currently implemented, there will only ever be one cached_info
>>+ * in the list for a given SPU. If we can devise a way to maintain
>>+ * multiple cached_infos in our list, then it would make sense
>>+ * to also cache the dcookie representing the PPU task application.
>>+ * See above description of struct cached_info for more details.
>>+ */
>>+struct spu_info_stacks {
>>+ struct list_head * info_list;
>>+};
>
>
> Why do you store pointers to list_head structures? If you want to store
> lists, you should have a list_head itself in here.
info_list is an array of n lists, where n is the number of SPUs.
>
> Why do you store them per spu in the first place? The physical spu
> doesn't have any relevance to this at all, the only data that is
> per spu is the sample data collected on a profiling interrupt,
> which you can then copy in the per-context data on a context switch.
The sample data is written out to the event buffer on every profiling
interrupt. But we don't write out the SPU program counter samples
directly to the event buffer. First, we have to find the cached_info
for the appropriate SPU context to retrieve the cached vma-to-fileoffset
map. Then we do the vma_map_lookup to find the fileoffset corresponding
to the SPU PC sample, which we then write out to the event buffer. This
is one of the most time-critical pieces of the SPU profiling code, so I
used an array to hold the cached_info for fast random access. But as I
stated in a code comment above, the negative implication of this current
implementation is that the array can only hold the cached_info for
currently running SPU tasks. I need to give this some more thought.
>
>
>
>>+/* Looks for cached info for the passed spu. If not found, the
>>+ * cached info is created for the passed spu.
>>+ * Returns 0 for success; otherwise, -1 for error.
>>+ */
>>+static int
>>+prepare_cached_spu_info(struct spu * spu, unsigned int objectId)
>>+{
>
>
> see above, this should get the spu_context pointer as its argument,
> not the spu.
>
>
>>+ vma_map_t * new_map;
>>+ unsigned long flags = 0;
>>+ int retval = 0;
>>+ /* spu->number is a system-wide value, not a per-node value. */
>>+ struct cached_info * info = get_cached_info(spu->number);
>>+ if (info == NULL) {
>
>
> if you revert the logic to
>
> if (info)
> goto out;
>
> then the bulk of your function doesn't need to get indented,
> which helps readability.
OK
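Putting that together with the GFP_KERNEL change noted just below, the
function would end up looking roughly like this (a sketch only; the kfree()
on the create_vma_map() failure path is an addition, since the original
leaked 'info' there):

static int prepare_cached_spu_info(struct spu *spu, unsigned int objectId)
{
	struct vma_map *new_map;
	unsigned long flags;
	struct cached_info *info = get_cached_info(spu->number);

	if (info) {
		/* Just checking that it exists; put the reference back. */
		put_cached_info(info);
		pr_debug("Found cached SPU info.\n");
		return 0;
	}

	/* create cached_info and add it to the list for SPU #<n> */
	info = kzalloc(sizeof(*info), GFP_KERNEL);
	if (!info)
		goto err;

	new_map = create_vma_map(spu, objectId);
	if (!new_map) {
		kfree(info);
		goto err;
	}

	pr_debug("Created vma_map\n");
	info->map = new_map;
	info->the_spu = spu;
	kref_init(&info->cache_ref);
	spin_lock_irqsave(&cache_lock, flags);
	list_add(&info->list, &spu_info->info_list[spu->number]);
	spin_unlock_irqrestore(&cache_lock, flags);
	return 0;

err:
	printk(KERN_ERR "SPU_PROF: %s, line %d: create vma_map failed\n",
	       __FUNCTION__, __LINE__);
	return -1;
}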
>
>
>>+ /* create cached_info and add it to the list for SPU #<n>.*/
>>+ info = kzalloc(sizeof(struct cached_info), GFP_ATOMIC);
>
>
> GFP_KERNEL
OK
>
>
>>+ if (!info) {
>>+ printk(KERN_ERR "SPU_PROF: "
>>+ "%s, line %d: create vma_map failed\n",
>>+ __FUNCTION__, __LINE__);
>>+ goto ERR_ALLOC;
>>+ }
>>+ new_map = create_vma_map(spu, objectId);
>>+ if (!new_map) {
>>+ printk(KERN_ERR "SPU_PROF: "
>>+ "%s, line %d: create vma_map failed\n",
>>+ __FUNCTION__, __LINE__);
>>+ goto ERR_ALLOC;
>>+ }
>>+
>>+ pr_debug("Created vma_map\n");
>>+ info->map = new_map;
>>+ info->the_spu = spu;
>>+ kref_init(&info->cache_ref);
>>+ spin_lock_irqsave(&cache_lock, flags);
>>+ list_add(&info->list, &spu_info->info_list[spu->number]);
>>+ spin_unlock_irqrestore(&cache_lock, flags);
>>+ goto OUT;
>>+ } else {
>>+ /* Immediately put back reference to cached_info since we don't
>>+ * really need it -- just checking whether we have it.
>>+ */
>>+ put_cached_info(info);
>>+ pr_debug("Found cached SPU info.\n");
>>+ }
>>+
>>+ERR_ALLOC:
>>+ retval = -1;
>>+OUT:
>>+ return retval;
>>+}
>
>
>>+/* Look up the dcookie for the task's first VM_EXECUTABLE mapping,
>>+ * which corresponds loosely to "application name". Also, determine
>>+ * the offset for the SPU ELF object. If computed offset is
>>+ * non-zero, it implies an embedded SPU object; otherwise, it's a
>>+ * separate SPU binary, in which case we retrieve its dcookie.
>>+ */
>>+static unsigned long
>>+get_exec_dcookie_and_offset(
>>+ struct spu * spu, unsigned int * offsetp,
>>+ unsigned long * spu_bin_dcookie,
>>+ unsigned int spu_ref)
>>+{
>>+ unsigned long cookie = 0;
>>+ unsigned int my_offset = 0;
>>+ struct vm_area_struct * vma;
>>+ struct mm_struct * mm = spu->mm;
>
>
> indenting
uh-huh
>
>
>>+ if (!mm)
>>+ goto OUT;
>>+
>>+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
>>+ if (!vma->vm_file)
>>+ continue;
>>+ if (!(vma->vm_flags & VM_EXECUTABLE))
>>+ continue;
>>+ cookie = fast_get_dcookie(vma->vm_file->f_dentry,
>>+ vma->vm_file->f_vfsmnt);
>>+ pr_debug("got dcookie for %s\n",
>>+ vma->vm_file->f_dentry->d_name.name);
>>+ break;
>>+ }
>>+
>>+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
>>+ if (vma->vm_start > spu_ref || vma->vm_end < spu_ref)
>>+ continue;
>>+ my_offset = spu_ref - vma->vm_start;
>>+ pr_debug("Found spu ELF at "
>>+ " %X for file %s\n", my_offset,
>>+ vma->vm_file->f_dentry->d_name.name);
>>+ *offsetp = my_offset;
>>+ if (my_offset == 0) {
>>+ if (!vma->vm_file) {
>>+ goto FAIL_NO_SPU_COOKIE;
>>+ }
>>+ *spu_bin_dcookie = fast_get_dcookie(
>>+ vma->vm_file->f_dentry,
>>+ vma->vm_file->f_vfsmnt);
>>+ pr_debug("got dcookie for %s\n",
>>+ vma->vm_file->f_dentry->d_name.name);
>>+ }
>>+ break;
>>+ }
>>+
>>+OUT:
>>+ return cookie;
>>+
>>+FAIL_NO_SPU_COOKIE:
>>+ printk(KERN_ERR "SPU_PROF: "
>>+ "%s, line %d: Cannot find dcookie for SPU binary\n",
>>+ __FUNCTION__, __LINE__);
>>+ goto OUT;
>>+}
>>+
>>+
>>+
>>+/* This function finds or creates cached context information for the
>>+ * passed SPU and records SPU context information into the OProfile
>>+ * event buffer.
>>+ */
>>+static int process_context_switch(struct spu * spu, unsigned int objectId)
>>+{
>>+ unsigned long flags = 0;
>>+ int retval = 0;
>>+ unsigned int offset = 0;
>>+ unsigned long spu_cookie = 0, app_dcookie = 0;
>>+ retval = prepare_cached_spu_info(spu, objectId);
>>+ if (retval == -1) {
>>+ goto OUT;
>>+ }
>>+ /* Get dcookie first because a mutex_lock is taken in that
>>+ * code path, so interrupts must not be disabled.
>>+ */
>>+ app_dcookie = get_exec_dcookie_and_offset(spu, &offset,
>>+ &spu_cookie, objectId);
>>+
>>+ /* Record context info in event buffer */
>>+ spin_lock_irqsave(&buffer_lock, flags);
>>+ add_event_entry(ESCAPE_CODE);
>>+ add_event_entry(SPU_CTX_SWITCH_CODE);
>>+ add_event_entry(spu->number);
>>+ add_event_entry(spu->pid);
>>+ add_event_entry(spu->tgid);
>>+ add_event_entry(app_dcookie);
>>+
>>+ add_event_entry(ESCAPE_CODE);
>>+ if (offset) {
>>+ /* When offset is non-zero, this means the SPU ELF was embedded;
>>+ * otherwise, it was loaded from a separate binary file. For the
>>+ * embedded case, we record the offset of the SPU ELF into the PPU
>>+ * executable; for the non-embedded case, we record a dcookie that
>>+ * points to the location of the SPU binary that was loaded.
>>+ */
>>+ add_event_entry(SPU_OFFSET_CODE);
>>+ add_event_entry(offset);
>>+ } else {
>>+ add_event_entry(SPU_COOKIE_CODE);
>>+ add_event_entry(spu_cookie);
>>+ }
>
>
> I don't get it. What is the app_dcookie used for? If the spu binary is
> embedded into a library, you are still missing the dcookie to that .so file,
> because you return only an offset.
For an embedded SPU app, the post-processing tool opens the PPE binary app
file and obtains the SPU ELF embedded therein, and from that, we obtain
the name of the SPU binary. Also, the app name is included in the
report, along with the SPU binary filename, if the report contains
samples from more than one application.
>
> <nitpicking>
>
>>+ unsigned long flags = 0;
>
>
> no need to initialize
>
>
>>+ struct spu * the_spu = (struct spu *) data;
>
>
> no need for the cast
>
>
>>+ pr_debug("SPU event notification arrived\n");
>>+ if (val == 0){
>
>
> if (!val)
>
>
>>+ pr_debug("spu_sync_start -- running.\n");
>>+OUT:
>
>
> out:
>
>
>>+ return ret;
>>+}
>
>
> </nitpicking>
OK, we'll fix up.
>
>
>
>>@@ -480,7 +491,22 @@
>> struct op_system_config *sys, int num_ctrs)
>> {
>> int i, j, cpu;
>>+ spu_cycle_reset = 0;
>>
>>+ /* The cpufreq_quick_get function requires that cbe_cpufreq module
>>+ * be loaded. This function is not actually provided and exported
>>+ * by cbe_cpufreq, but it relies on cbe_cpufreq to initialize kernel
>>+ * data structures. Since there's no way for depmod to realize
>>+ * that our OProfile module depends on cbe_cpufreq, we currently
>>+ * are letting the userspace tool, opcontrol, ensure that the
>>+ * cbe_cpufreq module is loaded.
>>+ */
>>+ khzfreq = cpufreq_quick_get(smp_processor_id());
>
>
> You should probably have a fallback in here in case the cpufreq module
> is not loaded. There is a global variable ppc_proc_freq (in Hz) that
> you can access.
Our userspace tool ensures the cpufreq module is loaded.
>
>> ;
>> }
>>
>>-static void cell_global_start(struct op_counter_config *ctr)
>>+static int calculate_lfsr(int n)
>>+{
>>+#define size 24
>>+ int i;
>>+ unsigned int newlfsr0;
>>+ unsigned int lfsr = 0xFFFFFF;
>>+ unsigned int howmany = lfsr - n;
>>+
>>+ for (i = 2; i < howmany + 2; i++) {
>>+ newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
>>+ ((lfsr >> (size - 1 - 1)) & 1) ^
>>+ (((lfsr >> (size - 1 - 6)) & 1) ^
>>+ ((lfsr >> (size - 1 - 23)) & 1)));
>>+
>>+ lfsr >>= 1;
>>+ lfsr = lfsr | (newlfsr0 << (size - 1));
>>+ }
>>+ return lfsr;
>>+
>>+}
>
>
> I don't have the slightest idea what this code is about, but
Me neither. Carl, can you comment?
> it certainly looks inefficient to loop 16 million times to
> compute a constant. Could you use a faster algorithm instead,
> or at least add a comment about why you do it this way?
>
>
>>+static void cell_global_stop(void)
>>+{
>>+ if (spu_cycle_reset) {
>>+ cell_global_stop_spu();
>>+ } else {
>>+ cell_global_stop_ppu();
>>+ }
>>+
>>+}
>
>
> This looks weird as well. I suppose it's a limitation of the hardware
> that you can only do either ppu or spu profiling. However, making that
> dependent of whether the 'spu_cycle_reset' variable is set sounds
> rather bogus.
It is the only file-scoped variable relating to SPU profiling, and will
always be non-zero when the user selects to perform SPU profiling.
Seemed like a logical-enough choice to me.
>
> I don't know what the best interface for choosing the target from
> user space would be, but you probably also want to be able to
> switch between them at run time.
The hardware setup is so completely different, I don't think there's a
viable way of switching between PPU profiling and SPU profiling without
a "stop" in between.
>
> Arnd <><



2007-01-30 22:54:11

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Maynard Johnson wrote:

> Arnd Bergmann wrote:
>
>
>>On Monday 29 January 2007 20:48, Maynard Johnson wrote:
>>
>>
>>>Subject: Add support to OProfile for profiling Cell BE SPUs
>>>
>>>From: Maynard Johnson <[email protected]>
>>>
>>>This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
>>>to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory
>>>was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
>>>code.
>>>
>>>Signed-off-by: Carl Love <[email protected]>
>>>Signed-off-by: Maynard Johnson <[email protected]>
>>
[snip]
>>
>>>+ *
>>>+ * Ideally, we would like to be able to create the cached_info for
>>>+ * an SPU task just one time -- when libspe first loads the SPU
>>>+ * binary file. We would store the cached_info in a list. Then, as
>>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>>+ * for inactive tasks would be kept, and the active one would be placed
>>>+ * at the head of the list. But this technique may not work with
>>>+ * current spufs functionality since the spu used in bind_context may
>>>+ * be a different spu than was used in a previous bind_context for a
>>>+ * reactivated SPU task. Additionally, a reactivated SPU task may be
>>>+ * assigned to run on a different physical SPE. We will investigate
>>>+ * further if this can be done.
>>>+ *
>>>+ */
>>
>>
>>You should stuff a pointer to cached_info into struct spu_context,
>>e.g. 'void *profile_private'.
>>
>>
>>
>>>+struct cached_info {
>>>+ vma_map_t * map;
>>>+ struct spu * the_spu;
>>>+ struct kref cache_ref;
>>>+ struct list_head list;
>>>+};
>>
>>
>>And replace the 'the_spu' member with a back pointer to the
>>spu_context if you need it.
>>
>>
>>
>>>+
>>>+/* A data structure for cached information about active SPU tasks.
>>>+ * Storage is dynamically allocated, sized as
>>>+ * "number of active nodes multplied by 8".
>>>+ * The info_list[n] member holds 0 or more
>>>+ * 'struct cached_info' objects for SPU#=n.
>>>+ *
>>>+ * As currently implemented, there will only ever be one cached_info
>>>+ * in the list for a given SPU. If we can devise a way to maintain
>>>+ * multiple cached_infos in our list, then it would make sense
>>>+ * to also cache the dcookie representing the PPU task application.
>>>+ * See above description of struct cached_info for more details.
>>>+ */
>>>+struct spu_info_stacks {
>>>+ struct list_head * info_list;
>>>+};
>>
>>
>>Why do you store pointers to list_head structures? If you want to store
>>lists, you should have a list_head itself in here.
>
> info_list is an array of n lists, where n is the number of SPUs.
>
>>Why do you store them per spu in the first place? The physical spu
>>doesn't have any relevance to this at all, the only data that is
>>per spu is the sample data collected on a profiling interrupt,
>>which you can then copy in the per-context data on a context switch.
>
> The sample data is written out to the event buffer on every profiling
> interrupt. But we don't write out the SPU program counter samples
> directly to the event buffer. First, we have to find the cached_info
> for the appropriate SPU context to retrieve the cached vma-to-fileoffset
> map. Then we do the vma_map_lookup to find the fileoffset corresponding
> to the SPU PC sample, which we then write out to the event buffer. This
> is one of the most time-critical pieces of the SPU profiling code, so I
> used an array to hold the cached_info for fast random access. But as I
> stated in a code comment above, the negative implication of this current
> implementation is that the array can only hold the cached_info for
> currently running SPU tasks. I need to give this some more thought.

I've given this some more thought, and I'm coming to the conclusion that
a pure array-based implementation for holding cached_info (getting rid
of the lists) would work well for the vast majority of cases in which
OProfile will be used. Yes, it is true that the mapping of an SPU
context to a physical spu-numbered array location cannot be guaranteed
to stay valid, and that's why I discard the cached_info at that array
location when the SPU task is switched out. Yes, it would be terribly
inefficient if the same SPU task gets switched back in later and we
would have to recreate the cached_info. However, I contend that
OProfile users are interested in profiling one application at a time.
They are not going to want to muddy the waters with multiple SPU apps
running at the same time. I can't think of any reason why someone would
consciously choose to do that.

Any thoughts from the general community, especially OProfile users?

Thanks.
-Maynard
>
>>
[snip]

2007-01-30 23:31:13

by Carl Love

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Tue, 2007-01-30 at 15:41 -0600, Maynard Johnson wrote:
[snip]
> [snip]
> >>+ // process the collected SPU PC for each node
> >>+ for_each_online_cpu(cpu) {
> >>+ if (cbe_get_hw_thread_id(cpu))
> >>+ continue;
> >>+
> >>+ node = cbe_cpu_to_node(cpu);
> >>+ node_factor = node * SPUS_PER_NODE;
> >>+ /* number of valid entries for this node */
> >>+ entry = 0;
> >>+
> >>+ trace_addr = cbe_read_pm(cpu, trace_address);
> >>+ while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
> >>+ {
> >>+ /* there is data in the trace buffer to process */
> >>+ cbe_read_trace_buffer(cpu, trace_buffer);
> >>+ spu_mask = 0xFFFF000000000000;
> >>+
> >>+ /* Each SPU PC is 16 bits; hence, four spus in each of
> >>+ * the two 64-bit buffer entries that make up the
> >>+ * 128-bit trace_buffer entry. Process the upper and
> >>+ * lower 64-bit values simultaneously.
> >>+ */
> >>+ for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
> >>+ spu_pc_lower = spu_mask & trace_buffer[0];
> >>+ spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
> >>+ * (SPUS_PER_TB_ENTRY-spu-1));
> >>+
> >>+ spu_pc_upper = spu_mask & trace_buffer[1];
> >>+ spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
> >>+ * (SPUS_PER_TB_ENTRY-spu-1));
> >>+
> >>+ spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
> >>+
> >>+ /* spu PC trace entry is upper 16 bits of the
> >>+ * 18 bit SPU program counter
> >>+ */
> >>+ spu_pc_lower = spu_pc_lower << 2;
> >>+ spu_pc_upper = spu_pc_upper << 2;
> >>+
> >>+ samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
> >>+ = (u32) spu_pc_lower;
> >>+ samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
> >>+ = (u32) spu_pc_upper;
> >>+ }
> >>+
> >>+ entry++;
> >>+
> >>+ if (entry >= TRACE_ARRAY_SIZE)
> >>+ /* spu_samples is full */
> >>+ break;
> >>+
> >>+ trace_addr = cbe_read_pm(cpu, trace_address);
> >>+ }
> >>+ samples_per_node[node] = entry;
> >>+ }
> >>+}
> >
> >
> > While I can't see anything technically wrong with this function, it would be
> > good to split it into smaller functions. Since you are nesting three
> > loops, it should be possible to make a separate function from one of the
> > inner loops without changing the actual logic behind it.
> Will do.
> >
> >
> >>+
> >>+static int profile_spus(struct hrtimer * timer)
> >>+{
> >>+ ktime_t kt;
> >>+ int cpu, node, k, num_samples, spu_num;
> >
> >
> > whitespace damage
> fixed
> >
> >
> >>+
> >>+ if (!spu_prof_running)
> >>+ goto STOP;
> >>+
> >>+ cell_spu_pc_collection();
> >>+ for_each_online_cpu(cpu) {
> >>+ if (cbe_get_hw_thread_id(cpu))
> >>+ continue;
> >
> >
> > Here, you enter the same top-level loop again, why not make it
> > for_each_online_cpu(cpu) {
> > if (cbe_get_hw_thread_id(cpu))
> > continue;
> > num_samples = cell_spu_pc_collection(cpu);
> > ...
> Yes, good suggestion.
I believe what you are asking here is why the cell_spu_pc_collection()
function can't put the data into the samples array for a given node, in
the loop that then processes the samples array for that node. Yes, I
believe that this can be done. The only restriction is that
cell_spu_pc_collection() will have to extract the SPU program counter
data for all SPUs on that node, because the data for the 8 SPUs are all
stored in a single entry of the hardware trace buffer. Once the
hardware trace buffer is read, the hardware advances the read pointer,
so there is no way to go back and re-read the entry.
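So the caller would presumably end up looking roughly like Arnd's sketch
above, with cell_spu_pc_collection() reworked to take a cpu and return the
number of valid entries it extracted for that cpu's node (a sketch only, not
final code):

	for_each_online_cpu(cpu) {
		if (cbe_get_hw_thread_id(cpu))
			continue;
		/*
		 * Drains the hardware trace buffer for this cpu's node in
		 * one pass, filling the samples array for all 8 SPUs of the
		 * node, and returns the number of valid entries per SPU.
		 */
		num_samples = cell_spu_pc_collection(cpu);
		if (!num_samples)
			continue;
		/* hand this node's samples to the SPU task sync code */
	}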

[snip]


[snip]
> >>+static int calculate_lfsr(int n)
> >>+{
> >>+#define size 24
> >>+ int i;
> >>+ unsigned int newlfsr0;
> >>+ unsigned int lfsr = 0xFFFFFF;
> >>+ unsigned int howmany = lfsr - n;
> >>+
> >>+ for (i = 2; i < howmany + 2; i++) {
> >>+ newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
> >>+ ((lfsr >> (size - 1 - 1)) & 1) ^
> >>+ (((lfsr >> (size - 1 - 6)) & 1) ^
> >>+ ((lfsr >> (size - 1 - 23)) & 1)));
> >>+
> >>+ lfsr >>= 1;
> >>+ lfsr = lfsr | (newlfsr0 << (size - 1));
> >>+ }
> >>+ return lfsr;
> >>+
> >>+}
> >
> >
> > I don't have the slightest idea what this code is about, but
> Me neither. Carl, can you comment?
> > it certainly looks inefficient to loop 16 million times to
> > compute a constant. Could you use a faster algorithm instead,
> > or at least add a comment about why you do it this way?
> >
> >

An LFSR sequence is similar to a pseudo-random number sequence. For a 24-bit
LFSR sequence, each number between 0 and 2^24 will occur once in the
sequence, but not in normal counting order. The hardware uses the LFSR
sequence to count, since it is much simpler to implement in hardware
than a normal counter. Unfortunately, the only way we know of to figure
out the LFSR value that corresponds to the number in the sequence that
is N before the last value (0xFFFFFF) is to calculate the previous value
N times. It is like asking: what is the pseudo-random number that comes
N before this pseudo-random number?

I will add a short comment to the code that will summarize the above
paragraph.
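For reference, the loop body above is just one step of that 24-bit LFSR;
factored out as a helper (equivalent to the expression in calculate_lfsr()),
a single step is:

static unsigned int lfsr_one_step(unsigned int lfsr)
{
	/* XOR of bits 23, 22, 17 and 0, i.e. the taps used above */
	unsigned int newbit = ((lfsr >> 23) & 1) ^ ((lfsr >> 22) & 1) ^
			      ((lfsr >> 17) & 1) ^ (lfsr & 1);

	return (lfsr >> 1) | (newbit << 23);
}

and calculate_lfsr(n) applies it (0xFFFFFF - n) times starting from 0xFFFFFF.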
[snip]

2007-01-30 23:34:32

by Benjamin Herrenschmidt

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update


> I've given this some more thought, and I'm coming to the conclusion that
> a pure array-based implementation for holding cached_info (getting rid
> of the lists) would work well for the vast majority of cases in which
> OProfile will be used. Yes, it is true that the mapping of an SPU
> context to a physical spu-numbered array location cannot be guaranteed
> to stay valid, and that's why I discard the cached_info at that array
> location when the SPU task is switched out. Yes, it would be terribly
> inefficient if the same SPU task gets switched back in later and we
> would have to recreate the cached_info. However, I contend that
> OProfile users are interested in profiling one application at a time.
> They are not going to want to muddy the waters with multiple SPU apps
> running at the same time. I can't think of any reason why someone would
> consciously choose to do that.
>
> Any thoughts from the general community, especially OProfile users?

Well, it's my understanding that quite a few typical usage scenarios
involve different tasks running on different SPUs passing each other
data around.

Ben.


2007-01-31 00:29:45

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Benjamin Herrenschmidt wrote:

>>I've given this some more thought, and I'm coming to the conclusion that
>>a pure array-based implementation for holding cached_info (getting rid
>>of the lists) would work well for the vast majority of cases in which
>>OProfile will be used. Yes, it is true that the mapping of an SPU
>>context to a physical spu-numbered array location cannot be guaranteed
>>to stay valid, and that's why I discard the cached_info at that array
>>location when the SPU task is switched out. Yes, it would be terribly
>>inefficient if the same SPU task gets switched back in later and we
>>would have to recreate the cached_info. However, I contend that
>>OProfile users are interested in profiling one application at a time.
>>They are not going to want to muddy the waters with multiple SPU apps
>>running at the same time. I can't think of any reason why someone would
>>consciously choose to do that.
>>
>>Any thoughts from the general community, especially OProfile users?
>
>
> Well, it's my understanding that quite a few typical usage scenarios
> involve different tasks running on different SPUs passing each other
> data around.
That shouldn't be a problem. I would consider this to be "one large
application" consisting of multiple SPU binaries running simultaneously.
Such a scenario can be handled with no negative performance impact
using a simple 16 element array of cached_info objects -- as long as
there isn't (much) SPU task switching being done.

-Maynard
>
> Ben.
>
>


2007-01-31 01:50:50

by Christian Krafft

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Tue, 30 Jan 2007 15:31:09 -0800
Carl Love <[email protected]> wrote:

> An LFSR sequence is similar to a pseudo-random number sequence. For a 24-bit
> LFSR sequence, each number between 0 and 2^24 will occur once in the
> sequence, but not in normal counting order. The hardware uses the LFSR
> sequence to count, since it is much simpler to implement in hardware
> than a normal counter. Unfortunately, the only way we know of to figure
> out the LFSR value that corresponds to the number in the sequence that
> is N before the last value (0xFFFFFF) is to calculate the previous value
> N times. It is like asking: what is the pseudo-random number that comes
> N before this pseudo-random number?

That should be no problem.
You can just reverse your algorithm and let it run x times instead of 0xFFFFFF-x.

>
> I will add a short comment to the code that will summarize the above
> paragraph.
> [snip]
>


--
Mit freundlichen Grüssen,
kind regards,

Christian Krafft
IBM Systems & Technology Group,
Linux Kernel Development
IT Specialist

2007-01-31 05:58:05

by Arnd Bergmann

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Tuesday 30 January 2007 22:41, Maynard Johnson wrote:
> Arnd Bergmann wrote:

> >>+ kt = ktime_set(0, profiling_interval);
> >>+ if (!spu_prof_running)
> >>+ goto STOP;
> >>+ hrtimer_forward(timer, timer->base->get_time(), kt);
> >>+ return HRTIMER_RESTART;
> >
> >
> > is hrtimer_forward really the right interface here? You are ignoring
> > the number of overruns anyway, so hrtimer_start(,,) sounds more
> > correct to me.
> According to Tom Gleixner, "hrtimer_forward is a convenience function to
> move the expiry time of a timer forward in multiples of the interval, so
> it is in the future. After setting the expiry time you restart the
> timer either with [sic] a return HRTIMER_RESTART (if you are in the
> timer callback function)."
> >

Ok, I see. Have you seen the timer actually coming in late, resulting
in hrtimer_forward returning non-zero? I guess it's not a big problem
for statistical data collection if that happens, but you might still want
to be able to see it.

> >>+ /* Since cpufreq_quick_get returns frequency in kHz, we use
> >>+ * USEC_PER_SEC here vs NSEC_PER_SEC.
> >>+ */
> >>+ unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
> >>+ profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
> >>+
> >>+ pr_debug("timer resolution: %lu\n",
> >>+ TICK_NSEC);
> >
> >
> > Don't you need to adapt the profiling_interval at run time, when cpufreq
> > changes the core frequency? You should probably use
> > cpufreq_register_notifier() to update this.
> Since OProfile is a statistical profiler, the exact frequency is not
> critical. The user is going to be looking for hot spots in their code,
> so it's all relative. With that said, I don't imagine using the
> cpufreq notification would be a big deal. We'll look at it.
>
> >>@@ -480,7 +491,22 @@
> >> struct op_system_config *sys, int num_ctrs)
> >> {
> >> int i, j, cpu;
> >>+ spu_cycle_reset = 0;
> >>
> >>+ /* The cpufreq_quick_get function requires that cbe_cpufreq module
> >>+ * be loaded. This function is not actually provided and exported
> >>+ * by cbe_cpufreq, but it relies on cbe_cpufreq to initialize kernel
> >>+ * data structures. Since there's no way for depmod to realize
> >>+ * that our OProfile module depends on cbe_cpufreq, we currently
> >>+ * are letting the userspace tool, opcontrol, ensure that the
> >>+ * cbe_cpufreq module is loaded.
> >>+ */
> >>+ khzfreq = cpufreq_quick_get(smp_processor_id());
> >
> >
> > You should probably have a fallback in here in case the cpufreq module
> > is not loaded. There is a global variable ppc_proc_freq (in Hz) that
> > you can access.
>
> Our userspace tool ensures the cpufreq module is loaded.

You should not rely on user space tools to do the right thing in the kernel.

Moreover, if the exact frequency is not that important, as you mentioned
above, you can probably just hardcode a compile-time constant here.

> >>+ *
> >>+ * Ideally, we would like to be able to create the cached_info for
> >>+ * an SPU task just one time -- when libspe first loads the SPU
> >>+ * binary file. We would store the cached_info in a list. Then, as
> >>+ * SPU tasks are switched out and new ones switched in, the cached_info
> >>+ * for inactive tasks would be kept, and the active one would be placed
> >>+ * at the head of the list. But this technique may not work with
> >>+ * current spufs functionality since the spu used in bind_context may
> >>+ * be a different spu than was used in a previous bind_context for a
> >>+ * reactivated SPU task. Additionally, a reactivated SPU task may be
> >>+ * assigned to run on a different physical SPE. We will investigate
> >>+ * further if this can be done.
> >>+ *
> >>+ */
> >
> >
> > You should stuff a pointer to cached_info into struct spu_context,
> > e.g. 'void *profile_private'.
> >
> >
> >>+struct cached_info {
> >>+ vma_map_t * map;
> >>+ struct spu * the_spu;
> >>+ struct kref cache_ref;
> >>+ struct list_head list;
> >>+};
> >
> >
> > And replace the 'the_spu' member with a back pointer to the
> > spu_context if you need it.
> >
> >
> >>+
> >>+/* A data structure for cached information about active SPU tasks.
> >>+ * Storage is dynamically allocated, sized as
> >>+ * "number of active nodes multplied by 8".
> >>+ * The info_list[n] member holds 0 or more
> >>+ * 'struct cached_info' objects for SPU#=n.
> >>+ *
> >>+ * As currently implemented, there will only ever be one cached_info
> >>+ * in the list for a given SPU. If we can devise a way to maintain
> >>+ * multiple cached_infos in our list, then it would make sense
> >>+ * to also cache the dcookie representing the PPU task application.
> >>+ * See above description of struct cached_info for more details.
> >>+ */
> >>+struct spu_info_stacks {
> >>+ struct list_head * info_list;
> >>+};
> >
> >
> > Why do you store pointers to list_head structures? If you want to store
> > lists, you should have a list_head itself in here.
> info_list is an array of n lists, where n is the number of SPUs.

My point was that it's not an array of lists, but an array of pointers
to lists. The way that include/list.h works is by having a struct list_head
as the anchor and then adding nodes to it. By simply pointing to a list_head,
you won't be able to use the list_for_each_entry() macro the way it is meant.
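In other words, something along these lines (the sizing is only illustrative):

struct spu_info_stacks {
	/* one list anchor per SPU, usable with list_for_each_entry() */
	struct list_head info_list[MAX_NUMNODES * SPUS_PER_NODE];
};

with each anchor set up via INIT_LIST_HEAD() instead of being allocated
separately.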

> >
> > I don't get it. What is the app_dcookie used for? If the spu binary is
> > embedded into a library, you are still missing the dcookie to that .so file,
> > because you return only an offset.
> For an embedded SPU app, the post-processing tool opens the PPE binary app
> file and obtains the SPU ELF embedded therein, and from that, we obtain
> the name of the SPU binary. Also, the app name is included in the
> report, along with the SPU binary filename, if the report contains
> samples from more than one application.

That's not what I meant. If you have an embedded spu ELF file in a shared
library object, you end up recording the file of the main PPE binary, and
the offset of the SPU ELF inside of the .so file, but not the name of the
.so file itself.

You also say that you record the name of the application on purpose for
separate SPU ELF files. My assumption was that this is not necessary, but
I don't know much about that aspect of oprofile. My feeling is that the
samples for an external SPU ELF file should be handled the same way that
oprofile handles events in shared libraries on the PPU (e.g. in libc.so).

Arnd <><

2007-01-31 06:06:25

by Arnd Bergmann

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Wednesday 31 January 2007 00:31, Carl Love wrote:
> Unfortunately, the only way we know of to figure out the LFSR value
> that corresponds to the number in the sequence that is N before the
> last value (0xFFFFFF) is to calculate the previous value N times. It
> is like asking: what is the pseudo-random number that comes N before
> this pseudo-random number?

Well, you can at least implement the lfsr both ways, and choose the one
that is faster to get at, like

u32 get_lfsr(u32 v)
{
	int i;
	u32 r = 0xffffff;

	if (v < 0x7fffff) {
		for (i = 0; i < v; i++)
			r = lfsr_forwards(r);
	} else {
		for (i = 0; i < (0x1000000 - v); i++)
			r = lfsr_backwards(r);
	}
	return r;
}

Also, if the value doesn't have to be really exact, you could have
a small lookup table with precomputed values, like:

u32 get_lfsr(u32 v)
{
	static const u32 lookup[256] = {
		0xab3492, 0x3e3f34, 0xc47610c, ... /* insert actual values */
	};

	return lookup[v >> 16];
}

Arnd <><

2007-01-31 06:53:09

by Arnd Bergmann

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Tuesday 30 January 2007 23:54, Maynard Johnson wrote:

> >>Why do you store them per spu in the first place? The physical spu
> >>doesn't have any relevance to this at all, the only data that is
> >>per spu is the sample data collected on a profiling interrupt,
> >>which you can then copy in the per-context data on a context switch.
> >
> > The sample data is written out to the event buffer on every profiling
> > interrupt. But we don't write out the SPU program counter samples
> > directly to the event buffer. First, we have to find the cached_info
> > for the appropriate SPU context to retrieve the cached vma-to-fileoffset
> > map. Then we do the vma_map_lookup to find the fileoffset corresponding
> > to the SPU PC sample, which we then write out to the event buffer. This
> > is one of the most time-critical pieces of the SPU profiling code, so I
> > used an array to hold the cached_info for fast random access. But as I
> > stated in a code comment above, the negative implication of this current
> > implementation is that the array can only hold the cached_info for
> > currently running SPU tasks. I need to give this some more thought.
>
> I've given this some more thought, and I'm coming to the conclusion that
> a pure array-based implementation for holding cached_info (getting rid
> of the lists) would work well for the vast majority of cases in which
> OProfile will be used. Yes, it is true that the mapping of an SPU
> context to a physical spu-numbered array location cannot be guaranteed
> to stay valid, and that's why I discard the cached_info at that array
> location when the SPU task is switched out. Yes, it would be terribly
> inefficient if the same SPU task gets switched back in later and we
> would have to recreate the cached_info. However, I contend that
> OProfile users are interested in profiling one application at a time.
> They are not going to want to muddy the waters with multiple SPU apps
> running at the same time. I can't think of any reason why someone would
> consciously choose to do that.
>
> Any thoughts from the general community, especially OProfile users?
>
Please assume that in the near future we will be scheduling SPU contexts
in and out multiple times a second. Even in a single application, you
can easily have more contexts than you have physical SPUs.

The event buffer by definition needs to be per context. If you for some
reason want to collect the samples per physical SPU during an event
interrupt, you should at least make sure that they are copied into the
per-context event buffer on a context switch.

At the context switch point, you probably also want to drain the
hw event counters, so that you account all events correctly.

We also want to be able to profile the context switch code itself, which
means that we also need one event buffer associated with the kernel to
collect events that come in for a zero context_id.

Of course, the recording of raw samples in the per-context buffer does
not need to have the dcookies along with it, you can still resolve
the pointers when the SPU context gets destroyed (or an object gets
unmapped).

Arnd <><

2007-02-02 16:47:33

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Arnd Bergmann wrote:

> On Tuesday 30 January 2007 23:54, Maynard Johnson wrote:
>
>
>>>>Why do you store them per spu in the first place? The physical spu
>>>>doesn't have any relevance to this at all, the only data that is
>>>>per spu is the sample data collected on a profiling interrupt,
>>>>which you can then copy in the per-context data on a context switch.
>>>
>>>The sample data is written out to the event buffer on every profiling
>>>interrupt. But we don't write out the SPU program counter samples
>>>directly to the event buffer. First, we have to find the cached_info
>>>for the appropriate SPU context to retrieve the cached vma-to-fileoffset
>>>map. Then we do the vma_map_lookup to find the fileoffset corresponding
>>>to the SPU PC sample, which we then write out to the event buffer. This
>>>is one of the most time-critical pieces of the SPU profiling code, so I
>>>used an array to hold the cached_info for fast random access. But as I
>>>stated in a code comment above, the negative implication of this current
>>>implementation is that the array can only hold the cached_info for
>>>currently running SPU tasks. I need to give this some more thought.
>>
>>I've given this some more thought, and I'm coming to the conclusion that
>>a pure array-based implementation for holding cached_info (getting rid
>>of the lists) would work well for the vast majority of cases in which
>>OProfile will be used. Yes, it is true that the mapping of an SPU
>>context to a physical spu-numbered array location cannot be guaranteed
>>to stay valid, and that's why I discard the cached_info at that array
>>location when the SPU task is switched out. Yes, it would be terribly
>>inefficient if the same SPU task gets switched back in later and we
>>would have to recreate the cached_info. However, I contend that
>>OProfile users are interested in profiling one application at a time.
>>They are not going to want to muddy the waters with multiple SPU apps
>>running at the same time. I can't think of any reason why someone would
>>consciously choose to do that.
>>
>>Any thoughts from the general community, especially OProfile users?
>>
>
> Please assume that in the near future we will be scheduling SPU contexts
> in and out multiple times a second. Even in a single application, you
> can easily have more contexts than you have physical SPUs.
Arnd, thanks for pointing this out. That's definitely a good reason why
my simplistic approach won't work. I'll look at other options.
>
> The event buffer by definition needs to be per context. If you for some
Yes, and it is. Right now, with the current simplistic approach, the
context and the physical SPU are kept in sync.
> reason want to collect the samples per physical SPU during an event
> interrupt, you should at least make sure that they are copied into the
> per-context event buffer on a context switch.
>
> At the context switch point, you probably also want to drain the
> hw event counters, so that you account all events correctly.
Yeah, that's a good idea. The few extraneous invalid samples would
probably never rise above the noise level, but we should do this anyway
for completeness.
>
> We also want to be able to profile the context switch code itself, which
> means that we also need one event buffer associated with the kernel to
> collect events that come in for a zero context_id.
The hardware design precludes tracing both SPU and PPU simultaneously.

-Maynard
>
> Of course, the recording of raw samples in the per-context buffer does
> not need to have the dcookies along with it, you can still resolve
> the pointers when the SPU context gets destroyed (or an object gets
> unmapped).
>
> Arnd <><


2007-02-02 19:27:14

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Arnd Bergmann wrote:

> On Tuesday 30 January 2007 22:41, Maynard Johnson wrote:
>
>>Arnd Bergmann wrote:
>
>
>>>>+ kt = ktime_set(0, profiling_interval);
>>>>+ if (!spu_prof_running)
>>>>+ goto STOP;
>>>>+ hrtimer_forward(timer, timer->base->get_time(), kt);
>>>>+ return HRTIMER_RESTART;
>>>
>>>
>>>is hrtimer_forward really the right interface here? You are ignoring
>>>the number of overruns anyway, so hrtimer_start(,,) sounds more
>>>correct to me.
>>
>>According to Tom Gleixner, "hrtimer_forward is a convenience function to
>>move the expiry time of a timer forward in multiples of the interval, so
>>it is in the future. After setting the expiry time you restart the
>>timer either with [sic] a return HRTIMER_RESTART (if you are in the
>>timer callback function)."
>>
>
> Ok, I see. Have you seen the timer actually coming in late, resulting
> in hrtimer_forward returning non-zero? I guess it's not a big problem
> for statistical data collection if that happens, but you might still want
> to be able to see it.
I don't think there's much point in knowing if we have overrun(s).
We're not going to schedule the timer any differently. We want to keep
the timer interrupts as consistent as possible according to the user's
request.
>
>
>>>>+ /* Since cpufreq_quick_get returns frequency in kHz, we use
>>>>+ * USEC_PER_SEC here vs NSEC_PER_SEC.
>>>>+ */
>>>>+ unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
>>>>+ profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
>>>>+
>>>>+ pr_debug("timer resolution: %lu\n",
>>>>+ TICK_NSEC);
>>>
>>>
>>>Don't you need to adapt the profiling_interval at run time, when cpufreq
>>>changes the core frequency? You should probably use
>>>cpufreq_register_notifier() to update this.
>>
>>Since OProfile is a statistical profiler, the exact frequency is not
>>critical. The user is going to be looking for hot spots in their code,
>>so it's all relative. With that said, I don't imagine using the
>>cpufreq notification would be a big deal. We'll look at it.
>>
>>
>>>>@@ -480,7 +491,22 @@
>>>> struct op_system_config *sys, int num_ctrs)
>>>>{
>>>> int i, j, cpu;
>>>>+ spu_cycle_reset = 0;
>>>>
>>>>+ /* The cpufreq_quick_get function requires that cbe_cpufreq module
>>>>+ * be loaded. This function is not actually provided and exported
>>>>+ * by cbe_cpufreq, but it relies on cbe_cpufreq to initialize kernel
>>>>+ * data structures. Since there's no way for depmod to realize
>>>>+ * that our OProfile module depends on cbe_cpufreq, we currently
>>>>+ * are letting the userspace tool, opcontrol, ensure that the
>>>>+ * cbe_cpufreq module is loaded.
>>>>+ */
>>>>+ khzfreq = cpufreq_quick_get(smp_processor_id());
>>>
>>>
>>>You should probably have a fallback in here in case the cpufreq module
>>>is not loaded. There is a global variable ppc_proc_freq (in Hz) that
>>>you can access.
>>
>>Our userspace tool ensures the cpufreq module is loaded.
>
>
> You should not rely on user space tools to do the right thing in the kernel.
Ok, we'll look at the fallback option you suggest. I don't recall if I
even knew about ppc_proc_freq before or why I originally chose
cpufreq_quick_get. Maybe we can do without the cpufreq module and use
ppc_proc_freq all the time. We'll see . . .
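A minimal sketch of that fallback (assuming khzfreq stays in kHz; ppc_proc_freq
is in Hz, and cpufreq_quick_get() returns 0 when no cpufreq policy is set up):

	khzfreq = cpufreq_quick_get(smp_processor_id());
	if (!khzfreq)
		khzfreq = ppc_proc_freq / 1000;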
>
> Moreover, if the exact frequency is not that important, as you mentioned
> above, you can probably just hardcode a compile-time constant here.
Well, exact frequency isn't critical, but it should, as close as
possible, correspond with the user's requested value for "spu cycle reset".
>
>
>>>>+ *
>>>>+ * Ideally, we would like to be able to create the cached_info for
>>>>+ * an SPU task just one time -- when libspe first loads the SPU
>>>>+ * binary file. We would store the cached_info in a list. Then, as
>>>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>>>+ * for inactive tasks would be kept, and the active one would be placed
>>>>+ * at the head of the list. But this technique may not work with
>>>>+ * current spufs functionality since the spu used in bind_context may
>>>>+ * be a different spu than was used in a previous bind_context for a
>>>>+ * reactivated SPU task. Additionally, a reactivated SPU task may be
>>>>+ * assigned to run on a different physical SPE. We will investigate
>>>>+ * further if this can be done.
>>>>+ *
>>>>+ */
>>>
>>>
>>>You should stuff a pointer to cached_info into struct spu_context,
>>>e.g. 'void *profile_private'.
>>>
>>>
>>>
>>>>+struct cached_info {
>>>>+ vma_map_t * map;
>>>>+ struct spu * the_spu;
>>>>+ struct kref cache_ref;
>>>>+ struct list_head list;
>>>>+};
>>>
>>>
>>>And replace the 'the_spu' member with a back pointer to the
>>>spu_context if you need it.
>>>
>>>
>>>
>>>>+
>>>>+/* A data structure for cached information about active SPU tasks.
>>>>+ * Storage is dynamically allocated, sized as
>>>>+ * "number of active nodes multplied by 8".
>>>>+ * The info_list[n] member holds 0 or more
>>>>+ * 'struct cached_info' objects for SPU#=n.
>>>>+ *
>>>>+ * As currently implemented, there will only ever be one cached_info
>>>>+ * in the list for a given SPU. If we can devise a way to maintain
>>>>+ * multiple cached_infos in our list, then it would make sense
>>>>+ * to also cache the dcookie representing the PPU task application.
>>>>+ * See above description of struct cached_info for more details.
>>>>+ */
>>>>+struct spu_info_stacks {
>>>>+ struct list_head * info_list;
>>>>+};
>>>
>>>
>>>Why do you store pointers to list_head structures? If you want to store
>>>lists, you should have a list_head itself in here.
>>
>>info_list is an array of n lists, where n is the number of SPUs.
>
>
> My point was that it's not an array of lists, but an array of pointers
> to lists. The way that include/list.h works is by having a struct list_head
> as the anchor and then adding nodes to it. By simply pointing to a list_head,
> you won't be able to use the list_for_each_entry() macro the way it is meant.
I've got to rework this implementation anyway . . .
>
>
[snip]

-Maynard

>
> Arnd <><


2007-02-03 07:40:25

by Arnd Bergmann

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Friday 02 February 2007 17:47, Maynard Johnson wrote:
>
> > We also want to be able to profile the context switch code itself, which
> > means that we also need one event buffer associated with the kernel to
> > collect events that come in for a zero context_id.
> The hardware design precludes tracing both SPU and PPU simultaneously.
>
I mean the SPU-side part of the context switch code, which you can find
in arch/powerpc/platforms/cell/spufs/spu_{save,restore}*.

This code is the one that runs when context_id == 0 is passed to the
callback.

Arnd <><

2007-02-03 20:03:10

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Arnd Bergmann wrote:

>On Friday 02 February 2007 17:47, Maynard Johnson wrote:
>
>
>>>We also want to be able to profile the context switch code itself, which
>>>means that we also need one event buffer associated with the kernel to
>>>collect events that come in for a zero context_id.
>>>
>>>
>>The hardware design precludes tracing both SPU and PPU simultaneously.
>>
>>
>>
>I mean the SPU-side part of the context switch code, which you can find
>in arch/powerpc/platforms/cell/spufs/spu_{save,restore}*.
>
>This code is the one that runs when context_id == 0 is passed to the
>callback.
>
>
I presume you mean 'object_id'. What you're asking for is a new
requirement, and one which I don't believe is achievable in the current
timeframe. Since this is spufs code that's dynamically loaded into the
SPU at runtime, the symbol information for this code is not accessible
to the userspace post-processing tools. It would require an altogether
different mechanism to record samples along with necessary information,
not to mention the changes required in the post-processing tools. This
will have to be a future enhancement.

-Maynard

> Arnd <><
>
>


2007-02-04 02:42:55

by Arnd Bergmann

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

On Saturday 03 February 2007 21:03, Maynard Johnson wrote:
> I presume you mean 'object_id'.

Right, sorry for the confusion.

> What you're asking for is a new
> requirement, and one which I don't believe is achievable in the current
> timeframe. Since this is spufs code that's dynamically loaded into the
> SPU at runtime, the symbol information for this code is not accessible
> to the userspace post-processing tools.

We can always fix the user space tool later, but it is important to
get at least the kernel interface right, so we can add the functionality
later without breaking new user space on old kernels or vice versa.

> It would require an altogether
> different mechanism to record samples along with necessary information,
> not to mention the changes required in the post-processing tools. This
> will have to be a future enhancement.

So what do you do with the samples for object_id == 0? I would expect them
to be simply added to the sample buffer for the kernel, which is sort
of special in oprofile anyway.

Arnd <><

2007-02-04 17:11:56

by Maynard Johnson

Subject: Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Arnd Bergmann wrote:

>On Saturday 03 February 2007 21:03, Maynard Johnson wrote:
>
>
>>I presume you mean 'object_id'.
>>
>>
>
>Right, sorry for the confusion.
>
>
>
>>What you're asking for is a new
>>requirement, and one which I don't believe is achievable in the current
>>timeframe. Since this is spufs code that's dynamically loaded into the
>>SPU at runtime, the symbol information for this code is not accessible
>>to the userspace post-processing tools.
>>
>>
>
>We can always fix the user space tool later, but it is important to
>get at least the kernel interface right, so we can add the functionality
>later without breaking new user space on old kernels or vice versa.
>
>
There's no obvious solution to this problem, so it's going to take some
design effort to come up with one. We can work on it, but I don't think
we can allow it to get in the way of getting our currently proposed SPU
profiling functionality into the kernel.

>
>
>>It would require an altogether
>>different mechanism to record samples along with necessary information,
>>not to mention the changes required in the post-processing tools. This
>>will have to be a future enhancement.
>>
>>
>
>So what do you do with the samples for object_id == 0? I would expect them
>to be simply added to the sample buffer for the kernel, which is sort
>of special in oprofile anyway.
>
>
There is no sample buffer for the kernel in SPU profiling. When
OProfile gets notified of an SPU task switching out (object_id == 0), we
stop recording samples for the corresponding SPU. If SPUFs sends the
notification after the spu_save operation, we still would be collecting
samples during that time; however, since the VMAs of these samples would
not map to any fileoffset in the SPU binary that's executing, our
current implementation throws them away. We could change that behavior
and record them in the samples buffer as "anonymous samples". Not sure
if that would help too much, as you wouldn't be able to distinguish
between spu_save samples and samples from generated stubs that are
executing from the stack.

-Maynard

> Arnd <><
>
>