LinuxLists.cc - [patch 03/10] spufs: kspu documentation

2007-08-16 20:05:55

Subject: [patch 03/10] spufs: kspu documentation

Documentation on how to use kspu from the PPU and SPU side.

Signed-off-by: Sebastian Siewior <[email protected]>
--- /dev/null
+++ b/Documentation/powerpc/kspu.txt
@@ -0,0 +1,243 @@
+ KSPU: Utilization of SPUs for kernel tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+0. KSPU design
+==============
+
+The idea is to offload individual time-consuming tasks to the SPU. Those tasks
+are fed with the data that they have to process.
+Once the function on the SPU side is invoked, the input data is already
+available. After the job is done, the offloaded function must kick off a DMA
+transfer that copies the result back to main memory.
+On the PPU, the KSPU user first queues the job in a linked list and later
+receives a callback to place the job directly in the SPU's ring buffer. This
+intermediate stop is required for two reasons:
+- It must be possible to queue work items from softirq context.
+- All requests must be accepted, even if the ring buffer is full. Waiting (until
+ a slot becomes available) is not an option.
+
+The callback (for the enqueue process on the SPU) runs in kthread context, so
+a mutex may be held. However, there is only one kthread for this job, so
+every delay has a global impact.
+The user should enqueue only one job item per enqueue request. The user
+may enqueue more than one job item if _really_ necessary. If there are not
+enough free slots, then the enqueue function will be called first, before any
+other enqueue function, once free slots are available again.
+After the offloaded function has completed the job, the kthread calls the
+completion callback (it is the same kthread that is used for enqueue).
+Double/multi buffering is performed by KSPU.
+
+0.5 SPU usage
+=============
+Currently only one SPU is allocated and used. Allocation occurs during KSPU
+module initialization via the spufs interface. Therefore the physical SPU
+takes part in the scheduling process (and is "shared" with user space). Right
+now, it is not "easy" to find out
+- how many SPUs may be taken (i.e. not used by user space)
+- how many SPUs are useful to take (depending on the workload)
+The latter is (theoretically) a simple accounting problem if there are no
+dependencies in processing (and two jobs of the same kind may be processed on
+two SPUs in parallel).
+A second SPU (context) may be required if the local store memory is used up.
+This can be prevented if "overlays" are used. The advantages over several SPU
+contexts:
+- less complexity (since there is only one kind of SPU code)
+- no need to track which function is in which context. In addition, overlay
+ code probably switches the binary faster than the scheduler does.
+
+1. Overview of memory layout
+=============================
+
+ ------------------ 256 KiB
+ |    RB ENTRY    |
+ ------------------ Ring buffer entries
+ |   .........    |
+ ------------------
+ |    RB ENTRY    |
+ ------------------
+ |    RB state    | (consumed + outstanding)
+ ------------------
+ |     STACK      | Stack growing downwards
+ |       ||       |
+ |       \/       |
+ ------------------
+ |    .......     | unused / reserved :)
+ ------------------
+ |      Data      |
+ |  DMA Buffers,  |
+ |   functions'   |
+ |  private data  |
+ ------------------
+ |      Code      |
+ | offloaded SPU  |
+ |   functions    |
+ ------------------
+ |  multiplexor   | spu_main.c
+ ------------------ 0
+
+The type of a ring buffer entry is struct kspu_job.
+The number of ring buffer entries is determined by RB_SLOTS.
+The number of DMA buffers is determined by DMA_BUFFERS.
+The stack grows unchecked. There is no (cheap) way to notice a stack
+overflow. After adding a new SPU program, the developer is encouraged to check
+the stack usage and make sure the stack will never hit the data segment. This
+task is not required if recursive functions are used (I hope the suicide part
+has been understood).
+
+1.1 Ring buffer
+===============
+The ring buffer has been chosen because this data structure allows exchange of
+data (PPU <-> SPU) without any locking. A ring buffer entry consists of two
+parts:
+- Public data, which is known by the KSPU.
+- Private data, which is known only by the user (hidden from KSPU).
+Public data contains the function parameters of the offloaded SPU program.
+Private data is meaningless to the KSPU and may contain algorithm-specific
+information (like where to put the result).
+The number of ring buffer entries (RB_SLOTS) has two constraints (besides
+LS_SIZE :D):
+- it must be a power of 2.
+- it must be at least DMA_BUFFERS*2.
+
+1.2 DMA Buffers
+===============
+Every DMA buffer is DMA_MAX_TRANS_SIZE bytes in size. The size reflects the
+maximum transfer size that may be requested by the SPU. Therefore the same
+requirements apply here as to the MFC DMA size: it must be a multiple of 16
+and may not be larger than 16KiB.
+The only limit for the number of available DMA buffers (DMA_BUFFERS) (besides
+the available space) is that "DMA_BUFFERS*2 <= RB_SLOTS" must be true. The
+reason for this constraint is that the "multiplexor", once started, requests
+DMA_BUFFERS buffers and starts processing. While processing the first batch,
+the next DMA_BUFFERS are requested (to get into streaming mode). After
+processing DMA_BUFFERS*2 requests, the first point is reached at which the SPU
+notifies the PPU about completed requests and may stop. Therefore the
+shortest run is DMA_BUFFERS*2 requests. If there are not enough requests
+available, KSPU fills the ring buffer with NOPs to fit. A NOP is a DMA
+transfer of size zero (a nop for the MFC) with just a return statement as
+the job function.
+
+2. Offloading a task to SPU
+===========================
+Three steps are required to offload a task to SPU:
+- PPU code
+- SPU code
+- Update header files & Makefile
+
+The example code shows how to offload an 'add operation' via spu_my_add on
+the SPU. The complete implementation is in the skeleton files.
+
+2.1 PPU code
+============
+1. Init
+- Prepare a struct with 'struct kspu_work_item' embedded in it.
+  struct my_spu_req {
+    struct kspu_work_item kspu_work;
+    void *data;
+  };
+
+- Get the global kspu context.
+  struct kspu_context *kctx = kspu_get_kctx();
+
+2. Enqueue a specific request. (struct my_spu_req spe_req)
+- Set up the enqueue callback.
+  spe_req.kspu_work.enqueue = my_enqueue_func;
+
+- Enqueue it in kspu.
+  kspu_enqueue_work_item(kctx, &spe_req.kspu_work);
+
+3. Wait for the callback, then enqueue the job on the SPU.
+- Get an empty slot.
+  struct kspu_job *work_item = kspu_get_rb_slot(kctx);
+
+- Fill it.
+  work_item->operation = MY_ADD;
+  work_item->in = spe_req.data;
+  work_item->in_size = 16;
+
+- Mark it as ready.
+  kspu_mark_rb_slot_ready(kctx, &spe_req.kspu_work);
+
+- Set the finish callback.
+  spe_req.kspu_work.notify = my_notify_func;
+
+4. Wait for the "finish" callback.
+- The job is finished.
+
+2.2 SPU code
+============
+- Prepare a function with the following signature:
+ void spu_my_add(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
+
+ Use init_put_data() to write data back to main memory. It is just a wrapper
+ around mfc_putf(). Use the supplied buf_num as the tag.
+ init_put_data(buffer, out, length, buf_num);
+
+2.3 Update files
+================
+- Define your private data structures which are visible from your PPU program
+ and from your SPU program. They later become part of struct kspu_job if you
+ need them for parameters. Keep them as small as possible.
+
+- attach the function to SPU_OPS in
+ include/asm-powerpc/kspu/merged_code.h before TOTAL_SPU_OPS
+
+- attach the function to spu_funcs[] in arch/powerpc/platforms/cell/spufs/spu_main.c
+
+2.4 Skeleton files
+==================
+PPU code in Documentation/powerpc/kspu_ppu_skeleton.c
+SPU code in Documentation/powerpc/kspu_spu_skeleton.[ch]
+
+Merge both into kspu:
+--- a/arch/powerpc/platforms/cell/spufs/Makefile
++++ b/arch/powerpc/platforms/cell/spufs/Makefile
+@@ -24,6 +24,7 @@ kspu-y += kspu_helper.o kspu_code.o
+ $(obj)/kspu_code.o: $(obj)/spu_kspu_dump.h
+
+ spu_kspu_code_obj-y += $(obj)/spu_main.o $(obj)/spu_runtime.o
++spu_kspu_code_obj-y += $(obj)/spu_kspu_ppu_skeleton.o
+ spu_kspu_code_obj-y += $(spu_kspu_code_obj-m)
+
+ $(obj)/spu_kspu: $(spu_kspu_code_obj-y)
+
+--- a/arch/powerpc/platforms/cell/spufs/spu_main.c
++++ b/arch/powerpc/platforms/cell/spufs/spu_main.c
+@@ -13,6 +13,7 @@
+
+ static spu_operation_t spu_ops[TOTAL_SPU_FUNCS] __attribute__((aligned(16))) = {
+ [SPU_OP_nop] = spu_nop,
++ [SPU_OP_my_add] = spu_my_add
+ };
+
+ static unsigned char kspu_buff[DMA_BUFFERS][DMA_MAX_TRANS_SIZE];
+
+--- a/arch/powerpc/platforms/cell/spufs/spu_runtime.h
++++ b/arch/powerpc/platforms/cell/spufs/spu_runtime.h
+@@ -25,5 +25,6 @@ void memcpy_aligned(void *dest, const vo
+ /* exported offloaded functions */
+ void spu_nop(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+
++void spu_my_add(struct kspu_job *kjob, void *buffer,
++ unsigned int buf_num);
+
+ #endif
+
+--- a/include/asm-powerpc/kspu/merged_code.h
++++ b/include/asm-powerpc/kspu/merged_code.h
+@@ -14,6 +14,7 @@
+
+ enum SPU_OPERATIONS {
+ SPU_OP_nop,
++ SPU_OP_my_add,
+
+ TOTAL_OP_FUNCS,
+ };
+@@ -23,6 +24,7 @@ struct kspu_job {
+ unsigned long long in __attribute__((aligned(16)));
+ unsigned int in_size __attribute__((aligned(16)));
+ union {
++ struct my_sum my_sum;
+ } __attribute__((aligned(16)));
+ };
+
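
Note on sections 1.1 and 1.2 of kspu.txt: the constraints on RB_SLOTS,
DMA_BUFFERS and DMA_MAX_TRANS_SIZE could be checked at compile time, roughly
as sketched here. This is only an illustration in plain C, not code from the
patch. The numeric values are assumed, and the helpers rb_slot_index(),
rb_free_slots() and nops_to_pad() are hypothetical names. The sketch also
assumes that the power-of-2 requirement exists so that free-running counters
can be wrapped with a mask, which the documentation does not state.

```c
#include <assert.h>

/* Assumed example values; the real definitions live in the KSPU headers. */
#define RB_SLOTS		16u
#define DMA_BUFFERS		4u
#define DMA_MAX_TRANS_SIZE	(16u * 1024u)

/* Section 1.1: RB_SLOTS must be a power of 2 and at least DMA_BUFFERS*2. */
_Static_assert((RB_SLOTS & (RB_SLOTS - 1u)) == 0u,
	       "RB_SLOTS must be a power of 2");
_Static_assert(RB_SLOTS >= DMA_BUFFERS * 2u,
	       "RB_SLOTS must be at least DMA_BUFFERS*2");

/* Section 1.2: a multiple of 16 and not larger than 16 KiB. */
_Static_assert(DMA_MAX_TRANS_SIZE % 16u == 0u,
	       "DMA_MAX_TRANS_SIZE must be a multiple of 16");
_Static_assert(DMA_MAX_TRANS_SIZE <= 16u * 1024u,
	       "DMA_MAX_TRANS_SIZE must not exceed 16 KiB");

/* Hypothetical helper: map a free-running counter to a slot index. */
unsigned int rb_slot_index(unsigned int counter)
{
	return counter & (RB_SLOTS - 1u);
}

/*
 * Hypothetical helper: free slots, given free-running "produced" and
 * "consumed" counters. The unsigned subtraction stays correct when the
 * counters wrap around.
 */
unsigned int rb_free_slots(unsigned int produced, unsigned int consumed)
{
	unsigned int used = produced - consumed;

	assert(used <= RB_SLOTS);
	return RB_SLOTS - used;
}

/*
 * Hypothetical helper: number of NOP jobs needed so that a run reaches the
 * shortest possible length of DMA_BUFFERS*2 requests.
 */
unsigned int nops_to_pad(unsigned int pending)
{
	unsigned int min_run = DMA_BUFFERS * 2u;

	return pending < min_run ? min_run - pending : 0u;
}
```

With these assumed values, three pending requests would be padded with five
NOPs to reach the shortest run of eight.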

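Note on sections 2.2 and 2.3 of kspu.txt: the way the "multiplexor" selects an
offloaded function through a table indexed by the operation number can be
tried out on a host machine with a small mock. Everything here is a stand-in
and not the code from the patch: struct kspu_job is reduced to two fields,
mock_put_data() replaces init_put_data() and the DMA transfer with a memcpy(),
dispatch_job() is a hypothetical name, TOTAL_SPU_OPS is assumed as the name of
the terminating enum entry, and "add" is assumed to mean "add one to every
input byte".

```c
#include <assert.h>
#include <string.h>

/* Operation numbers, following the enum layout of the skeleton diff. */
enum SPU_OPERATIONS {
	SPU_OP_nop,
	SPU_OP_my_add,

	TOTAL_SPU_OPS,	/* assumed name of the terminating entry */
};

/* Reduced stand-in for struct kspu_job. */
struct kspu_job {
	unsigned int operation;
	unsigned int in_size;
};

typedef void (*spu_operation_t)(struct kspu_job *kjob, void *buffer,
				unsigned int buf_num);

/* Stand-in for main memory on the PPU side. */
unsigned char mock_main_memory[16];

/* Stand-in for init_put_data(): a memcpy() instead of a DMA transfer. */
void mock_put_data(const void *buffer, unsigned int length)
{
	assert(length <= sizeof(mock_main_memory));
	memcpy(mock_main_memory, buffer, length);
}

void spu_nop(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
{
	(void)kjob;
	(void)buffer;
	(void)buf_num;
}

/* Assumed semantics: add one to every input byte, then write back. */
void spu_my_add(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
{
	unsigned char *data = buffer;
	unsigned int i;

	(void)buf_num;	/* the real code would use this as the DMA tag */
	for (i = 0; i < kjob->in_size; i++)
		data[i] += 1;
	mock_put_data(data, kjob->in_size);
}

static spu_operation_t spu_ops[TOTAL_SPU_OPS] = {
	[SPU_OP_nop] = spu_nop,
	[SPU_OP_my_add] = spu_my_add,
};

/* Hypothetical dispatcher: look up the job function and call it. */
int dispatch_job(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
{
	if (kjob->operation >= TOTAL_SPU_OPS)
		return -1;
	spu_ops[kjob->operation](kjob, buffer, buf_num);
	return 0;
}
```

Adding a new operation in this mock takes the same three touches that section
2.3 lists for the real code: a new enum entry before the terminating one, a
new table entry, and a prototype for the function.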
--