This patchset aims to solve issues I found in [1], and at the same time
simplify the composition algorithm.
I sent more igt-gpu-tools test [2] to cover more things and detect the
issues in [1].
This patchset is based on [1].
Patch 1/2: This patch is a no-op, but make the code more readable
regarding the pixel_read functions.
Patch 2/2: This patch is more complex. My main target was to solve issues
I found in [1], but as it was very complex to do it "in place", I choose
to rework the composition function.
The main two advantages are:
- It's now possible to create conversion function for packed & grouped
pixels. Some pixel formats need absolute x/y position and not only an
offset in the buffer to extract the correct value. This part also solve
the issues I found in [1].
- The rotation management is now way easier to understand, there is no
more switch case in different places and instead of copy/pasting rotation
formula I used drm_rect_* helpers.
[1]: https://lore.kernel.org/dri-devel/[email protected]/
[2]: https://lore.kernel.org/igt-dev/[email protected]/T/#t
Signed-off-by: Louis Chauvet <[email protected]>
---
Louis Chauvet (2):
drm/vkms: Create a type to check a function pointer validity
drm/vkms: Use a simpler composition function
drivers/gpu/drm/vkms/vkms_composer.c | 97 ++++++++-----
drivers/gpu/drm/vkms/vkms_drv.h | 32 ++++-
drivers/gpu/drm/vkms/vkms_formats.c | 254 ++++++++++++++++++-----------------
drivers/gpu/drm/vkms/vkms_formats.h | 2 +-
drivers/gpu/drm/vkms/vkms_plane.c | 13 +-
5 files changed, 236 insertions(+), 162 deletions(-)
---
base-commit: 5d189d57bb335a87ec38ea26fe43a5f3ed31ced7
change-id: 20240201-yuv-1337d90d9576
Best regards,
--
Louis Chauvet <[email protected]>
Change the composition algorithm to iterate over pixels instead of lines.
It allows a simpler management of rotation and pixel access for complex formats.
This new algorithm allows read_pixel function to have access to x/y
coordinates and make it possible to read the correct thing in a block
when block_w and block_h are not 1.
The iteration pixel-by-pixel in the same method also allows a simpler
management of rotation with drm_rect_* helpers. This way it's not needed
anymore to have misterious switch-case distributed in multiple places.
This patch is tested with IGT:
- kms_plane@pixel_format
- kms_plane@pixel-format-source-clamping
- kms_plane@plane-position-covered
- kms_plane@plane-position-hole
- kms_plane@plane-position-hole-dpms
- kms_plane@plane-panning-top-left
- kms_plane@plane-panning-bottom-right
- kms_plane@test-odd-size-with-yuv - See [1]
- kms_cursor_crc@cursor-size-change
- kms_cursor_crc@cursor-alpha-opaque
- kms_cursor_crc@cursor-alpha-transparent
- kms_cursor_crc@cursor-dpms
- kms_cursor_crc@cursor-onscreen-*
- kms_cursor_crc@cursor-offscreen-*
- kms_cursor_crc@cursor-sliding-*
- kms_cursor_crc@cursor-random-*
- kms_cursor_crc@cursor-rapid-movement-* - FAIL, but with Overflow of
CRC buffer
- kms_rotation_crc@primary-rotation-* - See [1]
- kms_rotation_crc@sprite-rotation-* - See [1]
- kms_rotation_crc@cursor-rotation-* - See [1]
- kms_rotation_crc@sprite-rotation-90-pos-100-0 - See [1]
- kms_rotation_crc@multiplane-rotation - See [1]
- kms_rotation_crc@multiplane-rotation-cropping-* - See [1]
[1]: https://lore.kernel.org/igt-dev/[email protected]/T/#t
Signed-off-by: Louis Chauvet <[email protected]>
---
drivers/gpu/drm/vkms/vkms_composer.c | 97 +++++++++-----
drivers/gpu/drm/vkms/vkms_drv.h | 21 ++-
drivers/gpu/drm/vkms/vkms_formats.c | 250 ++++++++++++++++++-----------------
drivers/gpu/drm/vkms/vkms_plane.c | 13 +-
4 files changed, 221 insertions(+), 160 deletions(-)
diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 3c99fb8b54e2..4c5439209907 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -43,7 +43,7 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
{
int x_dst = frame_info->dst.x1;
struct pixel_argb_u16 *out = output_buffer->pixels + x_dst;
- struct pixel_argb_u16 *in = stage_buffer->pixels;
+ struct pixel_argb_u16 *in = stage_buffer->pixels + x_dst;
int x_limit = min_t(size_t, drm_rect_width(&frame_info->dst),
stage_buffer->n_pixels);
@@ -55,33 +55,6 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
}
}
-static int get_y_pos(struct vkms_frame_info *frame_info, int y)
-{
- if (frame_info->rotation & DRM_MODE_REFLECT_Y)
- return drm_rect_height(&frame_info->rotated) - y - 1;
-
- switch (frame_info->rotation & DRM_MODE_ROTATE_MASK) {
- case DRM_MODE_ROTATE_90:
- return frame_info->rotated.x2 - y - 1;
- case DRM_MODE_ROTATE_270:
- return y + frame_info->rotated.x1;
- default:
- return y;
- }
-}
-
-static bool check_limit(struct vkms_frame_info *frame_info, int pos)
-{
- if (drm_rotation_90_or_270(frame_info->rotation)) {
- if (pos >= 0 && pos < drm_rect_width(&frame_info->rotated))
- return true;
- } else {
- if (pos >= frame_info->rotated.y1 && pos < frame_info->rotated.y2)
- return true;
- }
-
- return false;
-}
static void fill_background(const struct pixel_argb_u16 *background_color,
struct line_buffer *output_buffer)
@@ -182,18 +155,74 @@ static void blend(struct vkms_writeback_job *wb,
const struct pixel_argb_u16 background_color = { .a = 0xffff };
size_t crtc_y_limit = crtc_state->base.crtc->mode.vdisplay;
+ size_t crtc_x_limit = crtc_state->base.crtc->mode.hdisplay;
+ /*
+ * Iterating over each pixel to simplify the handling of rotations by using drm_rect_*
+ * helpers.
+ * Iteration per pixel also allosw a simple management of complex pixel formats.
+ *
+ * If needed, this triple loop might be a good place for optimizations.
+ */
for (size_t y = 0; y < crtc_y_limit; y++) {
fill_background(&background_color, output_buffer);
/* The active planes are composed associatively in z-order. */
for (size_t i = 0; i < n_active_planes; i++) {
- y_pos = get_y_pos(plane[i]->frame_info, y);
-
- if (!check_limit(plane[i]->frame_info, y_pos))
- continue;
-
- vkms_compose_row(stage_buffer, plane[i], y_pos);
+ for (size_t x = 0; x < crtc_x_limit; x++) {
+ /*
+ * @x and @y are the coordinate in the output crtc.
+ * @dst_px is used to easily check if a pixel must be blended.
+ * @src_px is used to handle rotation. Just before blending, it
+ * will contain the coordinate of the wanted source pixel in
+ * the source buffer.
+ * @tmp_src is the conversion of src rectangle to integer.
+ */
+
+ struct drm_rect dst_px = DRM_RECT_INIT(x, y, 1, 1);
+ struct drm_rect src_px = DRM_RECT_INIT(x, y, 1, 1);
+ struct drm_rect tmp_src;
+
+ drm_rect_fp_to_int(&tmp_src, &plane[i]->frame_info->src);
+
+ /*
+ * Check that dst_px is inside the plane output
+ * Note: This is destructive for dst_px, if you need this
+ * rectangle you have to do a copy
+ */
+ if (!drm_rect_intersect(&dst_px, &plane[i]->frame_info->dst))
+ continue;
+
+ /*
+ * Transform the coordinate x/y from the crtc to coordinates into
+ * coordinates for the src buffer.
+ *
+ * First step is to cancel the offset of the dst buffer.
+ * Then t will have to invert the rotation. This assumes that
+ * dst = drm_rect_rotate(src, rotation) (dst and src have the
+ * space size, but can be rotated).
+ * And then, apply the offset of the source rectangle to the
+ * coordinate.
+ */
+ drm_rect_translate(&src_px, -plane[i]->frame_info->dst.x1,
+ -plane[i]->frame_info->dst.y1);
+ drm_rect_rotate_inv(&src_px,
+ drm_rect_width(&tmp_src),
+ drm_rect_height(&tmp_src),
+ plane[i]->frame_info->rotation);
+ drm_rect_translate(&src_px, tmp_src.x1, tmp_src.y1);
+
+ /*
+ * Now, as the src_px contains the correct position, we can use
+ * it to convert the pixel color
+ */
+ plane[i]->pixel_read(plane[i]->frame_info,
+ src_px.x1,
+ src_px.y1,
+ &stage_buffer->pixels[x],
+ plane[i]->base.base.color_encoding,
+ plane[i]->base.base.color_range);
+ }
pre_mul_alpha_blend(plane[i]->frame_info, stage_buffer,
output_buffer);
}
diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index cb20bab26cae..ba3c063adc5f 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -25,10 +25,20 @@
#define VKMS_LUT_SIZE 256
+/**
+ * struct vkms_frame_info - structure to store the state of a frame
+ *
+ * @fb: backing drm framebuffer
+ * @src: source rectangle of this frame in the source framebuffer
+ * @dst: destination rectangle in the crtc buffer
+ * @map: see drm_shadow_plane_state@data
+ * @rotation: rotation applied to the source.
+ *
+ * @src and @dst should have the same size modulo the rotation.
+ */
struct vkms_frame_info {
struct drm_framebuffer *fb;
struct drm_rect src, dst;
- struct drm_rect rotated;
struct iosys_map map[DRM_FORMAT_MAX_PLANES];
unsigned int rotation;
};
@@ -51,14 +61,15 @@ struct vkms_writeback_job {
/**
* typedef pixel_read_t - These functions are used to read the pixels in the source frame, convert
* them to argb16 and write them to out_pixel.
- * It assumes that src_pixels point to a valid pixel (not a block, or a block of 1x1 pixel)
*
- * @src_pixels: Source pointer to a pixel
+ * @frame_info: Frame used as source for the pixel value
+ * @x: X (width) coordinate in the source buffer
+ * @y: Y (height) coordinate in the source buffer
* @out_pixel: Pointer where to write the pixel value
* @encoding: Color encoding to use (mainly used for YUV formats)
* @range: Color range to use (mainly used for YUV formats)
*/
-typedef void (*pixel_read_t)(u8 **src_pixels, int y,
+typedef void (*pixel_read_t)(struct vkms_frame_info *frame_info, int x, int y,
struct pixel_argb_u16 *out_pixel, enum drm_color_encoding enconding,
enum drm_color_range range);
@@ -66,6 +77,8 @@ typedef void (*pixel_read_t)(u8 **src_pixels, int y,
* vkms_plane_state - Driver specific plane state
* @base: base plane state
* @frame_info: data required for composing computation
+ * @pixel_read: function to read a pixel in this plane. The creator of a vkms_plane_state must
+ * ensure that this pointer is valid
*/
struct vkms_plane_state {
struct drm_shadow_plane_state base;
diff --git a/drivers/gpu/drm/vkms/vkms_formats.c b/drivers/gpu/drm/vkms/vkms_formats.c
index c6376db58d38..8ff85ffda94f 100644
--- a/drivers/gpu/drm/vkms/vkms_formats.c
+++ b/drivers/gpu/drm/vkms/vkms_formats.c
@@ -11,79 +11,88 @@
#include "vkms_formats.h"
-static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t index)
+
+/**
+ * packed_pixels_offset() - Get the offset of the block containing the pixel at coordinates x/y
+ *
+ * @frame_info: Buffer metadata
+ * @x: The x coordinate of the wanted pixel in the buffer
+ * @y: The y coordinate of the wanted pixel in the buffer
+ * @plane: The index of the plane to use
+ *
+ * The caller must be aware that this offset is not always a pointer to a pixel. If individual
+ * pixel values are needed, they have to be extracted from the block.
+ */
+static size_t packed_pixels_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t plane)
{
struct drm_framebuffer *fb = frame_info->fb;
-
- return fb->offsets[index] + (y * fb->pitches[index])
- + (x * fb->format->cpp[index]);
+ const struct drm_format_info *format = frame_info->fb->format;
+ /* Directly using x and y to multiply pitches and format->ccp is not sufficient because
+ * in some formats a block can represent multiple pixels.
+ *
+ * Dividing x and y by the block size allows to extract the correct offset of the block
+ * containing the pixel.
+ */
+ return fb->offsets[plane] +
+ (y / drm_format_info_block_width(format, plane)) * fb->pitches[plane] +
+ (x / drm_format_info_block_height(format, plane)) * format->char_per_block[plane];
}
/*
- * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
+ * packed_pixels_addr - Get the pointer to the block containing the pixel at the given coordinate
*
* @frame_info: Buffer metadata
- * @x: The x(width) coordinate of the 2D buffer
- * @y: The y(Heigth) coordinate of the 2D buffer
+ * @x: The x(width) coordinate inside the 2D buffer
+ * @y: The y(Heigth) coordinate inside the 2D buffer
* @index: The index of the plane on the 2D buffer
*
- * Takes the information stored in the frame_info, a pair of coordinates, and
- * returns the address of the first color channel on the desired index. The
- * format's specific subsample is applied.
+ * Takes the information stored in the frame_info, a pair of coordinates, and returns the address
+ * of the block containing this pixel.
+ * The caller must be aware that this pointer is sometimes not directly a pixel, it needs some
+ * additional work to extract pixel color from this block.
*/
static void *packed_pixels_addr(const struct vkms_frame_info *frame_info,
int x, int y, size_t index)
{
- int vsub = index == 0 ? 1 : frame_info->fb->format->vsub;
- int hsub = index == 0 ? 1 : frame_info->fb->format->hsub;
- size_t offset = pixel_offset(frame_info, x / hsub, y / vsub, index);
-
- return (u8 *)frame_info->map[0].vaddr + offset;
+ return (u8 *)frame_info->map[0].vaddr + packed_pixels_offset(frame_info, x, y, index);
}
-static void *get_packed_src_addr(const struct vkms_frame_info *frame_info, int y, size_t index)
+static void ARGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
- int x_src = frame_info->src.x1 >> 16;
- int y_src = y - frame_info->rotated.y1 + (frame_info->src.y1 >> 16);
-
- return packed_pixels_addr(frame_info, x_src, y_src, index);
-}
+ u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
-static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x)
-{
- if (frame_info->rotation & (DRM_MODE_REFLECT_X | DRM_MODE_ROTATE_270))
- return limit - x - 1;
- return x;
-}
-
-static void ARGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
-{
/*
* The 257 is the "conversion ratio". This number is obtained by the
* (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
* the best color value in a pixel format with more possibilities.
* A similar idea applies to others RGB color conversions.
*/
- out_pixel->a = (u16)src_pixels[0][3] * 257;
- out_pixel->r = (u16)src_pixels[0][2] * 257;
- out_pixel->g = (u16)src_pixels[0][1] * 257;
- out_pixel->b = (u16)src_pixels[0][0] * 257;
+ out_pixel->a = (u16)src_pixel[3] * 257;
+ out_pixel->r = (u16)src_pixel[2] * 257;
+ out_pixel->g = (u16)src_pixel[1] * 257;
+ out_pixel->b = (u16)src_pixel[0] * 257;
}
-static void XRGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
+static void XRGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
+ u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
out_pixel->a = (u16)0xffff;
- out_pixel->r = (u16)src_pixels[0][2] * 257;
- out_pixel->g = (u16)src_pixels[0][1] * 257;
- out_pixel->b = (u16)src_pixels[0][0] * 257;
+ out_pixel->r = (u16)src_pixel[2] * 257;
+ out_pixel->g = (u16)src_pixel[1] * 257;
+ out_pixel->b = (u16)src_pixel[0] * 257;
}
-static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
+static void ARGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel,
+ enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
- u16 *pixels = (u16 *)src_pixels[0];
+ u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
+ u16 *pixels = (u16 *)src_pixel;
out_pixel->a = le16_to_cpu(pixels[3]);
out_pixel->r = le16_to_cpu(pixels[2]);
@@ -91,10 +100,13 @@ static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out
out_pixel->b = le16_to_cpu(pixels[0]);
}
-static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
+static void XRGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel,
+ enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
- u16 *pixels = (u16 *)src_pixels[0];
+ u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
+ u16 *pixels = (u16 *)src_pixel;
out_pixel->a = (u16)0xffff;
out_pixel->r = le16_to_cpu(pixels[2]);
@@ -102,10 +114,12 @@ static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out
out_pixel->b = le16_to_cpu(pixels[0]);
}
-static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
+static void RGB565_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
- u16 *pixels = (u16 *)src_pixels[0];
+ u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
+ u16 *pixels = (u16 *)src_pixel;
s64 fp_rb_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(31));
s64 fp_g_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(63));
@@ -121,12 +135,19 @@ static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel
out_pixel->b = drm_fixp2int_round(drm_fixp_mul(fp_b, fp_rb_ratio));
}
+/**
+ * ycbcr2rgb() - helper to convert ycbcr 8 bits to rbg 16 bits
+ *
+ * @m: conversion matrix to use
+ * @y, @cb, @cr: component of the ycbcr pixel
+ * @r, @g, @b: pointers to write the rbg pixel
+ */
static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r, u8 *g, u8 *b)
{
s32 y_16, cb_16, cr_16;
s32 r_16, g_16, b_16;
- y_16 = y - y_offset;
+ y_16 = y - y_offset;
cb_16 = cb - 128;
cr_16 = cr - 128;
@@ -139,6 +160,14 @@ static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r,
*b = clamp(b_16, 0, 0xffff) >> 8;
}
+/**
+ * yuv_u8_to_argb_u16() - helper to convert yuv 8 bits to argb 16 bits
+ *
+ * @argb_u16: pointer to write the converted pixel
+ * @yuv_u8: pointer to the source yuv pixel
+ * @encoding: encoding to use
+ * @range: range to use
+ */
VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16,
const struct pixel_yuv_u8 *yuv_u8,
enum drm_color_encoding encoding,
@@ -205,104 +234,89 @@ VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16,
}
EXPORT_SYMBOL_IF_KUNIT(yuv_u8_to_argb_u16);
-static void semi_planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
+static void semi_planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x,
+ int y, struct pixel_argb_u16 *out_pixel,
enum drm_color_encoding encoding,
enum drm_color_range range)
{
struct pixel_yuv_u8 yuv_u8;
- yuv_u8.y = src_pixels[0][0];
- yuv_u8.u = src_pixels[1][0];
- yuv_u8.v = src_pixels[1][1];
+ /* Subsampling must be applied for semi-planar yuv formats */
+ int vsub = frame_info->fb->format->vsub;
+ int hsub = frame_info->fb->format->hsub;
+
+ u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+ u8 *src_uv = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+
+ yuv_u8.y = src_y[0];
+ yuv_u8.u = src_uv[0];
+ yuv_u8.v = src_uv[1];
yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
}
-static void semi_planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
+static void semi_planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x,
+ int y, struct pixel_argb_u16 *out_pixel,
enum drm_color_encoding encoding,
enum drm_color_range range)
{
struct pixel_yuv_u8 yuv_u8;
- yuv_u8.y = src_pixels[0][0];
- yuv_u8.v = src_pixels[1][0];
- yuv_u8.u = src_pixels[1][1];
+ /* Subsampling must be applied for semi-planar yuv formats */
+ int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
+ int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
+ u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+ u8 *src_vu = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+
+ yuv_u8.y = src_y[0];
+ yuv_u8.v = src_vu[0];
+ yuv_u8.u = src_vu[1];
yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
}
-static void planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
+static void planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel,
+ enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
struct pixel_yuv_u8 yuv_u8;
- yuv_u8.y = src_pixels[0][0];
- yuv_u8.u = src_pixels[1][0];
- yuv_u8.v = src_pixels[2][0];
+ /* Subsampling must be applied for planar yuv formats */
+ int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
+ int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
+
+ u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+ u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+ u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2);
+
+ yuv_u8.y = src_y[0];
+ yuv_u8.u = src_u[0];
+ yuv_u8.v = src_v[0];
yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
}
-static void planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding encoding, enum drm_color_range range)
+static void planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+ struct pixel_argb_u16 *out_pixel,
+ enum drm_color_encoding encoding,
+ enum drm_color_range range)
{
struct pixel_yuv_u8 yuv_u8;
- yuv_u8.y = src_pixels[0][0];
- yuv_u8.v = src_pixels[1][0];
- yuv_u8.u = src_pixels[2][0];
+ /* Subsampling must be applied for semi-planar yuv formats */
+ int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
+ int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
- yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
-}
+ u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+ u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+ u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2);
-/**
- * vkms_compose_row - compose a single row of a plane
- * @stage_buffer: output line with the composed pixels
- * @plane: state of the plane that is being composed
- * @y: y coordinate of the row
- *
- * This function composes a single row of a plane. It gets the source pixels
- * through the y coordinate (see get_packed_src_addr()) and goes linearly
- * through the source pixel, reading the pixels and converting it to
- * ARGB16161616 (see the pixel_read() callback). For rotate-90 and rotate-270,
- * the source pixels are not traversed linearly. The source pixels are queried
- * on each iteration in order to traverse the pixels vertically.
- */
-void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
-{
- struct pixel_argb_u16 *out_pixels = stage_buffer->pixels;
- struct vkms_frame_info *frame_info = plane->frame_info;
- const struct drm_format_info *frame_format = frame_info->fb->format;
- int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
- u8 *src_pixels[DRM_FORMAT_MAX_PLANES];
-
- enum drm_color_encoding encoding = plane->base.base.color_encoding;
- enum drm_color_range range = plane->base.base.color_range;
-
- for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] = get_packed_src_addr(frame_info, y, i);
-
- for (size_t x = 0; x < limit; x++) {
- int x_pos = get_x_position(frame_info, limit, x);
-
- bool shoud_inc = !((x + 1) % frame_format->num_planes);
-
- if (drm_rotation_90_or_270(frame_info->rotation)) {
- for (size_t i = 0; i < frame_format->num_planes; i++) {
- src_pixels[i] = get_packed_src_addr(frame_info,
- x + frame_info->rotated.y1, i);
- if (!i || shoud_inc)
- src_pixels[i] += frame_format->cpp[i] * y;
- }
- }
-
- plane->pixel_read(src_pixels, &out_pixels[x_pos], encoding, range);
-
- for (size_t i = 0; i < frame_format->num_planes; i++) {
- if (!i || shoud_inc)
- src_pixels[i] += frame_format->cpp[i];
- }
- }
+ yuv_u8.y = src_y[0];
+ yuv_u8.v = src_u[0];
+ yuv_u8.u = src_v[0];
+
+ yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
}
/*
diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
index 932736fc3ee9..d818ed246a46 100644
--- a/drivers/gpu/drm/vkms/vkms_plane.c
+++ b/drivers/gpu/drm/vkms/vkms_plane.c
@@ -118,13 +118,20 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
return;
fmt = fb->format->format;
+ pixel_read_t pixel_read = get_pixel_conversion_function(fmt);
+
+ if (!pixel_read) {
+ DRM_WARN("The requested pixel format is not supported by VKMS. The new state is "
+ "not applied.\n");
+ return;
+ }
+
vkms_plane_state = to_vkms_plane_state(new_state);
shadow_plane_state = &vkms_plane_state->base;
frame_info = vkms_plane_state->frame_info;
memcpy(&frame_info->src, &new_state->src, sizeof(struct drm_rect));
memcpy(&frame_info->dst, &new_state->dst, sizeof(struct drm_rect));
- memcpy(&frame_info->rotated, &new_state->dst, sizeof(struct drm_rect));
frame_info->fb = fb;
memcpy(&frame_info->map, &shadow_plane_state->data, sizeof(frame_info->map));
drm_framebuffer_get(frame_info->fb);
@@ -134,10 +141,8 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
DRM_MODE_REFLECT_X |
DRM_MODE_REFLECT_Y);
- drm_rect_rotate(&frame_info->rotated, drm_rect_width(&frame_info->rotated),
- drm_rect_height(&frame_info->rotated), frame_info->rotation);
- vkms_plane_state->pixel_read = get_pixel_conversion_function(fmt);
+ vkms_plane_state->pixel_read = pixel_read;
}
static int vkms_plane_atomic_check(struct drm_plane *plane,
--
2.43.0
Add the pixel_read_t type to check function prototype in structures
and functions.
It avoids casting to (void *) and at the same occasion allows the
compiler to check the type properly.
Signed-off-by: Louis Chauvet <[email protected]>
---
drivers/gpu/drm/vkms/vkms_drv.h | 17 +++++++++++++++--
drivers/gpu/drm/vkms/vkms_formats.c | 4 ++--
drivers/gpu/drm/vkms/vkms_formats.h | 2 +-
3 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index 51349a0c32d8..cb20bab26cae 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -48,6 +48,20 @@ struct vkms_writeback_job {
void (*pixel_write)(u8 *dst_pixels, struct pixel_argb_u16 *in_pixel);
};
+/**
+ * typedef pixel_read_t - These functions are used to read the pixels in the source frame, convert
+ * them to argb16 and write them to out_pixel.
+ * It assumes that src_pixels point to a valid pixel (not a block, or a block of 1x1 pixel)
+ *
+ * @src_pixels: Source pointer to a pixel
+ * @out_pixel: Pointer where to write the pixel value
+ * @encoding: Color encoding to use (mainly used for YUV formats)
+ * @range: Color range to use (mainly used for YUV formats)
+ */
+typedef void (*pixel_read_t)(u8 **src_pixels, int y,
+ struct pixel_argb_u16 *out_pixel, enum drm_color_encoding enconding,
+ enum drm_color_range range);
+
/**
* vkms_plane_state - Driver specific plane state
* @base: base plane state
@@ -56,8 +70,7 @@ struct vkms_writeback_job {
struct vkms_plane_state {
struct drm_shadow_plane_state base;
struct vkms_frame_info *frame_info;
- void (*pixel_read)(u8 **src_buffer, struct pixel_argb_u16 *out_pixel,
- enum drm_color_encoding enconding, enum drm_color_range range);
+ pixel_read_t pixel_read;
};
struct vkms_plane {
diff --git a/drivers/gpu/drm/vkms/vkms_formats.c b/drivers/gpu/drm/vkms/vkms_formats.c
index e06bbd7c0a67..c6376db58d38 100644
--- a/drivers/gpu/drm/vkms/vkms_formats.c
+++ b/drivers/gpu/drm/vkms/vkms_formats.c
@@ -390,7 +390,7 @@ void vkms_writeback_row(struct vkms_writeback_job *wb,
wb->pixel_write(dst_pixels, &in_pixels[x]);
}
-void *get_pixel_conversion_function(u32 format)
+pixel_read_t get_pixel_conversion_function(u32 format)
{
switch (format) {
case DRM_FORMAT_ARGB8888:
@@ -420,7 +420,7 @@ void *get_pixel_conversion_function(u32 format)
case DRM_FORMAT_YVU444:
return &planar_yvu_to_argb_u16;
default:
- return NULL;
+ return (pixel_read_t)NULL;
}
}
diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
index 0cf835292cec..04e31e126ab1 100644
--- a/drivers/gpu/drm/vkms/vkms_formats.h
+++ b/drivers/gpu/drm/vkms/vkms_formats.h
@@ -5,7 +5,7 @@
#include "vkms_drv.h"
-void *get_pixel_conversion_function(u32 format);
+pixel_read_t get_pixel_conversion_function(u32 format);
void *get_pixel_write_function(u32 format);
--
2.43.0
Hi Louis,
Thanks for your patches! Could you please rebase them on top of
drm-misc-next? It would make it easier for me to review and test the
patches.
Best Regards,
- Maíra
On 2/1/24 14:31, Louis Chauvet wrote:
> This patchset aims to solve issues I found in [1], and at the same time
> simplify the composition algorithm.
>
> I sent more igt-gpu-tools test [2] to cover more things and detect the
> issues in [1].
>
> This patchset is based on [1].
>
> Patch 1/2: This patch is a no-op, but make the code more readable
> regarding the pixel_read functions.
>
> Patch 2/2: This patch is more complex. My main target was to solve issues
> I found in [1], but as it was very complex to do it "in place", I choose
> to rework the composition function.
> The main two advantages are:
> - It's now possible to create conversion function for packed & grouped
> pixels. Some pixel formats need absolute x/y position and not only an
> offset in the buffer to extract the correct value. This part also solve
> the issues I found in [1].
> - The rotation management is now way easier to understand, there is no
> more switch case in different places and instead of copy/pasting rotation
> formula I used drm_rect_* helpers.
>
> [1]: https://lore.kernel.org/dri-devel/[email protected]/
> [2]: https://lore.kernel.org/igt-dev/[email protected]/T/#t
>
> Signed-off-by: Louis Chauvet <[email protected]>
> ---
> Louis Chauvet (2):
> drm/vkms: Create a type to check a function pointer validity
> drm/vkms: Use a simpler composition function
>
> drivers/gpu/drm/vkms/vkms_composer.c | 97 ++++++++-----
> drivers/gpu/drm/vkms/vkms_drv.h | 32 ++++-
> drivers/gpu/drm/vkms/vkms_formats.c | 254 ++++++++++++++++++-----------------
> drivers/gpu/drm/vkms/vkms_formats.h | 2 +-
> drivers/gpu/drm/vkms/vkms_plane.c | 13 +-
> 5 files changed, 236 insertions(+), 162 deletions(-)
> ---
> base-commit: 5d189d57bb335a87ec38ea26fe43a5f3ed31ced7
> change-id: 20240201-yuv-1337d90d9576
>
> Best regards,
Le 01/02/24 - 19:07, Maira Canal a ?crit :
> Hi Louis,
>
> Thanks for your patches! Could you please rebase them on top of
> drm-misc-next? It would make it easier for me to review and test the
> patches.
>
> Best Regards,
> - Ma?ra
Hi Ma?ra,
Do you want me to rebase the whole YUV [1] series or should I extract and
make my two patches independent?
[1]: https://lore.kernel.org/dri-devel/[email protected]/
Best regards,
Louis Chauvet
> On 2/1/24 14:31, Louis Chauvet wrote:
> > This patchset aims to solve issues I found in [1], and at the same time
> > simplify the composition algorithm.
> >
> > I sent more igt-gpu-tools test [2] to cover more things and detect the
> > issues in [1].
> >
> > This patchset is based on [1].
> >
> > Patch 1/2: This patch is a no-op, but make the code more readable
> > regarding the pixel_read functions.
> >
> > Patch 2/2: This patch is more complex. My main target was to solve issues
> > I found in [1], but as it was very complex to do it "in place", I choose
> > to rework the composition function.
> > The main two advantages are:
> > - It's now possible to create conversion function for packed & grouped
> > pixels. Some pixel formats need absolute x/y position and not only an
> > offset in the buffer to extract the correct value. This part also solve
> > the issues I found in [1].
> > - The rotation management is now way easier to understand, there is no
> > more switch case in different places and instead of copy/pasting rotation
> > formula I used drm_rect_* helpers.
> >
> > [1]: https://lore.kernel.org/dri-devel/[email protected]/
> > [2]: https://lore.kernel.org/igt-dev/[email protected]/T/#t
> >
> > Signed-off-by: Louis Chauvet <[email protected]>
> > ---
> > Louis Chauvet (2):
> > drm/vkms: Create a type to check a function pointer validity
> > drm/vkms: Use a simpler composition function
> >
> > drivers/gpu/drm/vkms/vkms_composer.c | 97 ++++++++-----
> > drivers/gpu/drm/vkms/vkms_drv.h | 32 ++++-
> > drivers/gpu/drm/vkms/vkms_formats.c | 254 ++++++++++++++++++-----------------
> > drivers/gpu/drm/vkms/vkms_formats.h | 2 +-
> > drivers/gpu/drm/vkms/vkms_plane.c | 13 +-
> > 5 files changed, 236 insertions(+), 162 deletions(-)
> > ---
> > base-commit: 5d189d57bb335a87ec38ea26fe43a5f3ed31ced7
> > change-id: 20240201-yuv-1337d90d9576
> >
> > Best regards,
On Thu, 01 Feb 2024 18:31:32 +0100
Louis Chauvet <[email protected]> wrote:
> Change the composition algorithm to iterate over pixels instead of lines.
> It allows a simpler management of rotation and pixel access for complex formats.
>
> This new algorithm allows read_pixel function to have access to x/y
> coordinates and make it possible to read the correct thing in a block
> when block_w and block_h are not 1.
> The iteration pixel-by-pixel in the same method also allows a simpler
> management of rotation with drm_rect_* helpers. This way it's not needed
> anymore to have misterious switch-case distributed in multiple places.
Hi,
there was a very good reason to write this code using lines:
performance. Before lines, it was indeed operating on individual pixels.
Please, include performance measurements before and after this series
to quantify the impact on the previously already supported pixel
formats, particularly the 32-bit-per-pixel RGB variants.
VKMS will be used more and more in CI for userspace projects, and
performance actually matters there.
I'm worrying that this performance degradation here is significant. I
believe it is possible to keep blending with lines, if you add new line
getters for reading from rotated, sub-sampled etc. images. That way you
don't have to regress the most common formats' performance.
Thanks,
pq
>
> This patch is tested with IGT:
> - kms_plane@pixel_format
> - kms_plane@pixel-format-source-clamping
> - kms_plane@plane-position-covered
> - kms_plane@plane-position-hole
> - kms_plane@plane-position-hole-dpms
> - kms_plane@plane-panning-top-left
> - kms_plane@plane-panning-bottom-right
> - kms_plane@test-odd-size-with-yuv - See [1]
> - kms_cursor_crc@cursor-size-change
> - kms_cursor_crc@cursor-alpha-opaque
> - kms_cursor_crc@cursor-alpha-transparent
> - kms_cursor_crc@cursor-dpms
> - kms_cursor_crc@cursor-onscreen-*
> - kms_cursor_crc@cursor-offscreen-*
> - kms_cursor_crc@cursor-sliding-*
> - kms_cursor_crc@cursor-random-*
> - kms_cursor_crc@cursor-rapid-movement-* - FAIL, but with Overflow of
> CRC buffer
> - kms_rotation_crc@primary-rotation-* - See [1]
> - kms_rotation_crc@sprite-rotation-* - See [1]
> - kms_rotation_crc@cursor-rotation-* - See [1]
> - kms_rotation_crc@sprite-rotation-90-pos-100-0 - See [1]
> - kms_rotation_crc@multiplane-rotation - See [1]
> - kms_rotation_crc@multiplane-rotation-cropping-* - See [1]
>
> [1]: https://lore.kernel.org/igt-dev/[email protected]/T/#t
>
> Signed-off-by: Louis Chauvet <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_composer.c | 97 +++++++++-----
> drivers/gpu/drm/vkms/vkms_drv.h | 21 ++-
> drivers/gpu/drm/vkms/vkms_formats.c | 250 ++++++++++++++++++-----------------
> drivers/gpu/drm/vkms/vkms_plane.c | 13 +-
> 4 files changed, 221 insertions(+), 160 deletions(-)
>
> diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> index 3c99fb8b54e2..4c5439209907 100644
> --- a/drivers/gpu/drm/vkms/vkms_composer.c
> +++ b/drivers/gpu/drm/vkms/vkms_composer.c
> @@ -43,7 +43,7 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
> {
> int x_dst = frame_info->dst.x1;
> struct pixel_argb_u16 *out = output_buffer->pixels + x_dst;
> - struct pixel_argb_u16 *in = stage_buffer->pixels;
> + struct pixel_argb_u16 *in = stage_buffer->pixels + x_dst;
> int x_limit = min_t(size_t, drm_rect_width(&frame_info->dst),
> stage_buffer->n_pixels);
>
> @@ -55,33 +55,6 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
> }
> }
>
> -static int get_y_pos(struct vkms_frame_info *frame_info, int y)
> -{
> - if (frame_info->rotation & DRM_MODE_REFLECT_Y)
> - return drm_rect_height(&frame_info->rotated) - y - 1;
> -
> - switch (frame_info->rotation & DRM_MODE_ROTATE_MASK) {
> - case DRM_MODE_ROTATE_90:
> - return frame_info->rotated.x2 - y - 1;
> - case DRM_MODE_ROTATE_270:
> - return y + frame_info->rotated.x1;
> - default:
> - return y;
> - }
> -}
> -
> -static bool check_limit(struct vkms_frame_info *frame_info, int pos)
> -{
> - if (drm_rotation_90_or_270(frame_info->rotation)) {
> - if (pos >= 0 && pos < drm_rect_width(&frame_info->rotated))
> - return true;
> - } else {
> - if (pos >= frame_info->rotated.y1 && pos < frame_info->rotated.y2)
> - return true;
> - }
> -
> - return false;
> -}
>
> static void fill_background(const struct pixel_argb_u16 *background_color,
> struct line_buffer *output_buffer)
> @@ -182,18 +155,74 @@ static void blend(struct vkms_writeback_job *wb,
> const struct pixel_argb_u16 background_color = { .a = 0xffff };
>
> size_t crtc_y_limit = crtc_state->base.crtc->mode.vdisplay;
> + size_t crtc_x_limit = crtc_state->base.crtc->mode.hdisplay;
>
> + /*
> + * Iterating over each pixel to simplify the handling of rotations by using drm_rect_*
> + * helpers.
> + * Iteration per pixel also allosw a simple management of complex pixel formats.
> + *
> + * If needed, this triple loop might be a good place for optimizations.
> + */
> for (size_t y = 0; y < crtc_y_limit; y++) {
> fill_background(&background_color, output_buffer);
>
> /* The active planes are composed associatively in z-order. */
> for (size_t i = 0; i < n_active_planes; i++) {
> - y_pos = get_y_pos(plane[i]->frame_info, y);
> -
> - if (!check_limit(plane[i]->frame_info, y_pos))
> - continue;
> -
> - vkms_compose_row(stage_buffer, plane[i], y_pos);
> + for (size_t x = 0; x < crtc_x_limit; x++) {
> + /*
> + * @x and @y are the coordinate in the output crtc.
> + * @dst_px is used to easily check if a pixel must be blended.
> + * @src_px is used to handle rotation. Just before blending, it
> + * will contain the coordinate of the wanted source pixel in
> + * the source buffer.
> + * @tmp_src is the conversion of src rectangle to integer.
> + */
> +
> + struct drm_rect dst_px = DRM_RECT_INIT(x, y, 1, 1);
> + struct drm_rect src_px = DRM_RECT_INIT(x, y, 1, 1);
> + struct drm_rect tmp_src;
> +
> + drm_rect_fp_to_int(&tmp_src, &plane[i]->frame_info->src);
> +
> + /*
> + * Check that dst_px is inside the plane output
> + * Note: This is destructive for dst_px, if you need this
> + * rectangle you have to do a copy
> + */
> + if (!drm_rect_intersect(&dst_px, &plane[i]->frame_info->dst))
> + continue;
> +
> + /*
> + * Transform the coordinate x/y from the crtc to coordinates into
> + * coordinates for the src buffer.
> + *
> + * First step is to cancel the offset of the dst buffer.
> + * Then t will have to invert the rotation. This assumes that
> + * dst = drm_rect_rotate(src, rotation) (dst and src have the
> + * space size, but can be rotated).
> + * And then, apply the offset of the source rectangle to the
> + * coordinate.
> + */
> + drm_rect_translate(&src_px, -plane[i]->frame_info->dst.x1,
> + -plane[i]->frame_info->dst.y1);
> + drm_rect_rotate_inv(&src_px,
> + drm_rect_width(&tmp_src),
> + drm_rect_height(&tmp_src),
> + plane[i]->frame_info->rotation);
> + drm_rect_translate(&src_px, tmp_src.x1, tmp_src.y1);
> +
> + /*
> + * Now, as the src_px contains the correct position, we can use
> + * it to convert the pixel color
> + */
> + plane[i]->pixel_read(plane[i]->frame_info,
> + src_px.x1,
> + src_px.y1,
> + &stage_buffer->pixels[x],
> + plane[i]->base.base.color_encoding,
> + plane[i]->base.base.color_range);
> + }
> pre_mul_alpha_blend(plane[i]->frame_info, stage_buffer,
> output_buffer);
> }
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
> index cb20bab26cae..ba3c063adc5f 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.h
> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
> @@ -25,10 +25,20 @@
>
> #define VKMS_LUT_SIZE 256
>
> +/**
> + * struct vkms_frame_info - structure to store the state of a frame
> + *
> + * @fb: backing drm framebuffer
> + * @src: source rectangle of this frame in the source framebuffer
> + * @dst: destination rectangle in the crtc buffer
> + * @map: see drm_shadow_plane_state@data
> + * @rotation: rotation applied to the source.
> + *
> + * @src and @dst should have the same size modulo the rotation.
> + */
> struct vkms_frame_info {
> struct drm_framebuffer *fb;
> struct drm_rect src, dst;
> - struct drm_rect rotated;
> struct iosys_map map[DRM_FORMAT_MAX_PLANES];
> unsigned int rotation;
> };
> @@ -51,14 +61,15 @@ struct vkms_writeback_job {
> /**
> * typedef pixel_read_t - These functions are used to read the pixels in the source frame, convert
> * them to argb16 and write them to out_pixel.
> - * It assumes that src_pixels point to a valid pixel (not a block, or a block of 1x1 pixel)
> *
> - * @src_pixels: Source pointer to a pixel
> + * @frame_info: Frame used as source for the pixel value
> + * @x: X (width) coordinate in the source buffer
> + * @y: Y (height) coordinate in the source buffer
> * @out_pixel: Pointer where to write the pixel value
> * @encoding: Color encoding to use (mainly used for YUV formats)
> * @range: Color range to use (mainly used for YUV formats)
> */
> -typedef void (*pixel_read_t)(u8 **src_pixels, int y,
> +typedef void (*pixel_read_t)(struct vkms_frame_info *frame_info, int x, int y,
> struct pixel_argb_u16 *out_pixel, enum drm_color_encoding enconding,
> enum drm_color_range range);
>
> @@ -66,6 +77,8 @@ typedef void (*pixel_read_t)(u8 **src_pixels, int y,
> * vkms_plane_state - Driver specific plane state
> * @base: base plane state
> * @frame_info: data required for composing computation
> + * @pixel_read: function to read a pixel in this plane. The creator of a vkms_plane_state must
> + * ensure that this pointer is valid
> */
> struct vkms_plane_state {
> struct drm_shadow_plane_state base;
> diff --git a/drivers/gpu/drm/vkms/vkms_formats.c b/drivers/gpu/drm/vkms/vkms_formats.c
> index c6376db58d38..8ff85ffda94f 100644
> --- a/drivers/gpu/drm/vkms/vkms_formats.c
> +++ b/drivers/gpu/drm/vkms/vkms_formats.c
> @@ -11,79 +11,88 @@
>
> #include "vkms_formats.h"
>
> -static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t index)
> +
> +/**
> + * packed_pixels_offset() - Get the offset of the block containing the pixel at coordinates x/y
> + *
> + * @frame_info: Buffer metadata
> + * @x: The x coordinate of the wanted pixel in the buffer
> + * @y: The y coordinate of the wanted pixel in the buffer
> + * @plane: The index of the plane to use
> + *
> + * The caller must be aware that this offset is not always a pointer to a pixel. If individual
> + * pixel values are needed, they have to be extracted from the block.
> + */
> +static size_t packed_pixels_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t plane)
> {
> struct drm_framebuffer *fb = frame_info->fb;
> -
> - return fb->offsets[index] + (y * fb->pitches[index])
> - + (x * fb->format->cpp[index]);
> + const struct drm_format_info *format = frame_info->fb->format;
> + /* Directly using x and y to multiply pitches and format->ccp is not sufficient because
> + * in some formats a block can represent multiple pixels.
> + *
> + * Dividing x and y by the block size allows to extract the correct offset of the block
> + * containing the pixel.
> + */
> + return fb->offsets[plane] +
> + (y / drm_format_info_block_width(format, plane)) * fb->pitches[plane] +
> + (x / drm_format_info_block_height(format, plane)) * format->char_per_block[plane];
> }
>
> /*
> - * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
> + * packed_pixels_addr - Get the pointer to the block containing the pixel at the given coordinate
> *
> * @frame_info: Buffer metadata
> - * @x: The x(width) coordinate of the 2D buffer
> - * @y: The y(Heigth) coordinate of the 2D buffer
> + * @x: The x(width) coordinate inside the 2D buffer
> + * @y: The y(Heigth) coordinate inside the 2D buffer
> * @index: The index of the plane on the 2D buffer
> *
> - * Takes the information stored in the frame_info, a pair of coordinates, and
> - * returns the address of the first color channel on the desired index. The
> - * format's specific subsample is applied.
> + * Takes the information stored in the frame_info, a pair of coordinates, and returns the address
> + * of the block containing this pixel.
> + * The caller must be aware that this pointer is sometimes not directly a pixel, it needs some
> + * additional work to extract pixel color from this block.
> */
> static void *packed_pixels_addr(const struct vkms_frame_info *frame_info,
> int x, int y, size_t index)
> {
> - int vsub = index == 0 ? 1 : frame_info->fb->format->vsub;
> - int hsub = index == 0 ? 1 : frame_info->fb->format->hsub;
> - size_t offset = pixel_offset(frame_info, x / hsub, y / vsub, index);
> -
> - return (u8 *)frame_info->map[0].vaddr + offset;
> + return (u8 *)frame_info->map[0].vaddr + packed_pixels_offset(frame_info, x, y, index);
> }
>
> -static void *get_packed_src_addr(const struct vkms_frame_info *frame_info, int y, size_t index)
> +static void ARGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> - int x_src = frame_info->src.x1 >> 16;
> - int y_src = y - frame_info->rotated.y1 + (frame_info->src.y1 >> 16);
> -
> - return packed_pixels_addr(frame_info, x_src, y_src, index);
> -}
> + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
>
> -static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x)
> -{
> - if (frame_info->rotation & (DRM_MODE_REFLECT_X | DRM_MODE_ROTATE_270))
> - return limit - x - 1;
> - return x;
> -}
> -
> -static void ARGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> -{
> /*
> * The 257 is the "conversion ratio". This number is obtained by the
> * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
> * the best color value in a pixel format with more possibilities.
> * A similar idea applies to others RGB color conversions.
> */
> - out_pixel->a = (u16)src_pixels[0][3] * 257;
> - out_pixel->r = (u16)src_pixels[0][2] * 257;
> - out_pixel->g = (u16)src_pixels[0][1] * 257;
> - out_pixel->b = (u16)src_pixels[0][0] * 257;
> + out_pixel->a = (u16)src_pixel[3] * 257;
> + out_pixel->r = (u16)src_pixel[2] * 257;
> + out_pixel->g = (u16)src_pixel[1] * 257;
> + out_pixel->b = (u16)src_pixel[0] * 257;
> }
>
> -static void XRGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> +static void XRGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
> out_pixel->a = (u16)0xffff;
> - out_pixel->r = (u16)src_pixels[0][2] * 257;
> - out_pixel->g = (u16)src_pixels[0][1] * 257;
> - out_pixel->b = (u16)src_pixels[0][0] * 257;
> + out_pixel->r = (u16)src_pixel[2] * 257;
> + out_pixel->g = (u16)src_pixel[1] * 257;
> + out_pixel->b = (u16)src_pixel[0] * 257;
> }
>
> -static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> +static void ARGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel,
> + enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> - u16 *pixels = (u16 *)src_pixels[0];
> + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
> + u16 *pixels = (u16 *)src_pixel;
>
> out_pixel->a = le16_to_cpu(pixels[3]);
> out_pixel->r = le16_to_cpu(pixels[2]);
> @@ -91,10 +100,13 @@ static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out
> out_pixel->b = le16_to_cpu(pixels[0]);
> }
>
> -static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> +static void XRGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel,
> + enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> - u16 *pixels = (u16 *)src_pixels[0];
> + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
> + u16 *pixels = (u16 *)src_pixel;
>
> out_pixel->a = (u16)0xffff;
> out_pixel->r = le16_to_cpu(pixels[2]);
> @@ -102,10 +114,12 @@ static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out
> out_pixel->b = le16_to_cpu(pixels[0]);
> }
>
> -static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> +static void RGB565_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> - u16 *pixels = (u16 *)src_pixels[0];
> + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
> + u16 *pixels = (u16 *)src_pixel;
>
> s64 fp_rb_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(31));
> s64 fp_g_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(63));
> @@ -121,12 +135,19 @@ static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel
> out_pixel->b = drm_fixp2int_round(drm_fixp_mul(fp_b, fp_rb_ratio));
> }
>
> +/**
> + * ycbcr2rgb() - helper to convert ycbcr 8 bits to rbg 16 bits
> + *
> + * @m: conversion matrix to use
> + * @y, @cb, @cr: component of the ycbcr pixel
> + * @r, @g, @b: pointers to write the rbg pixel
> + */
> static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r, u8 *g, u8 *b)
> {
> s32 y_16, cb_16, cr_16;
> s32 r_16, g_16, b_16;
>
> - y_16 = y - y_offset;
> + y_16 = y - y_offset;
> cb_16 = cb - 128;
> cr_16 = cr - 128;
>
> @@ -139,6 +160,14 @@ static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r,
> *b = clamp(b_16, 0, 0xffff) >> 8;
> }
>
> +/**
> + * yuv_u8_to_argb_u16() - helper to convert yuv 8 bits to argb 16 bits
> + *
> + * @argb_u16: pointer to write the converted pixel
> + * @yuv_u8: pointer to the source yuv pixel
> + * @encoding: encoding to use
> + * @range: range to use
> + */
> VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16,
> const struct pixel_yuv_u8 *yuv_u8,
> enum drm_color_encoding encoding,
> @@ -205,104 +234,89 @@ VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16,
> }
> EXPORT_SYMBOL_IF_KUNIT(yuv_u8_to_argb_u16);
>
> -static void semi_planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> +static void semi_planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x,
> + int y, struct pixel_argb_u16 *out_pixel,
> enum drm_color_encoding encoding,
> enum drm_color_range range)
> {
> struct pixel_yuv_u8 yuv_u8;
>
> - yuv_u8.y = src_pixels[0][0];
> - yuv_u8.u = src_pixels[1][0];
> - yuv_u8.v = src_pixels[1][1];
> + /* Subsampling must be applied for semi-planar yuv formats */
> + int vsub = frame_info->fb->format->vsub;
> + int hsub = frame_info->fb->format->hsub;
> +
> + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
> + u8 *src_uv = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
> +
> + yuv_u8.y = src_y[0];
> + yuv_u8.u = src_uv[0];
> + yuv_u8.v = src_uv[1];
>
> yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
> }
>
> -static void semi_planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> +static void semi_planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x,
> + int y, struct pixel_argb_u16 *out_pixel,
> enum drm_color_encoding encoding,
> enum drm_color_range range)
> {
> struct pixel_yuv_u8 yuv_u8;
>
> - yuv_u8.y = src_pixels[0][0];
> - yuv_u8.v = src_pixels[1][0];
> - yuv_u8.u = src_pixels[1][1];
> + /* Subsampling must be applied for semi-planar yuv formats */
> + int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
> + int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
> + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
> + u8 *src_vu = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
> +
> + yuv_u8.y = src_y[0];
> + yuv_u8.v = src_vu[0];
> + yuv_u8.u = src_vu[1];
>
> yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
> }
>
> -static void planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> +static void planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel,
> + enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> struct pixel_yuv_u8 yuv_u8;
>
> - yuv_u8.y = src_pixels[0][0];
> - yuv_u8.u = src_pixels[1][0];
> - yuv_u8.v = src_pixels[2][0];
> + /* Subsampling must be applied for planar yuv formats */
> + int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
> + int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
> +
> + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
> + u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
> + u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2);
> +
> + yuv_u8.y = src_y[0];
> + yuv_u8.u = src_u[0];
> + yuv_u8.v = src_v[0];
>
> yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
> }
>
> -static void planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding encoding, enum drm_color_range range)
> +static void planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
> + struct pixel_argb_u16 *out_pixel,
> + enum drm_color_encoding encoding,
> + enum drm_color_range range)
> {
> struct pixel_yuv_u8 yuv_u8;
>
> - yuv_u8.y = src_pixels[0][0];
> - yuv_u8.v = src_pixels[1][0];
> - yuv_u8.u = src_pixels[2][0];
> + /* Subsampling must be applied for semi-planar yuv formats */
> + int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
> + int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
>
> - yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
> -}
> + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
> + u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
> + u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2);
>
> -/**
> - * vkms_compose_row - compose a single row of a plane
> - * @stage_buffer: output line with the composed pixels
> - * @plane: state of the plane that is being composed
> - * @y: y coordinate of the row
> - *
> - * This function composes a single row of a plane. It gets the source pixels
> - * through the y coordinate (see get_packed_src_addr()) and goes linearly
> - * through the source pixel, reading the pixels and converting it to
> - * ARGB16161616 (see the pixel_read() callback). For rotate-90 and rotate-270,
> - * the source pixels are not traversed linearly. The source pixels are queried
> - * on each iteration in order to traverse the pixels vertically.
> - */
> -void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
> -{
> - struct pixel_argb_u16 *out_pixels = stage_buffer->pixels;
> - struct vkms_frame_info *frame_info = plane->frame_info;
> - const struct drm_format_info *frame_format = frame_info->fb->format;
> - int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
> - u8 *src_pixels[DRM_FORMAT_MAX_PLANES];
> -
> - enum drm_color_encoding encoding = plane->base.base.color_encoding;
> - enum drm_color_range range = plane->base.base.color_range;
> -
> - for (size_t i = 0; i < frame_format->num_planes; i++)
> - src_pixels[i] = get_packed_src_addr(frame_info, y, i);
> -
> - for (size_t x = 0; x < limit; x++) {
> - int x_pos = get_x_position(frame_info, limit, x);
> -
> - bool shoud_inc = !((x + 1) % frame_format->num_planes);
> -
> - if (drm_rotation_90_or_270(frame_info->rotation)) {
> - for (size_t i = 0; i < frame_format->num_planes; i++) {
> - src_pixels[i] = get_packed_src_addr(frame_info,
> - x + frame_info->rotated.y1, i);
> - if (!i || shoud_inc)
> - src_pixels[i] += frame_format->cpp[i] * y;
> - }
> - }
> -
> - plane->pixel_read(src_pixels, &out_pixels[x_pos], encoding, range);
> -
> - for (size_t i = 0; i < frame_format->num_planes; i++) {
> - if (!i || shoud_inc)
> - src_pixels[i] += frame_format->cpp[i];
> - }
> - }
> + yuv_u8.y = src_y[0];
> + yuv_u8.v = src_u[0];
> + yuv_u8.u = src_v[0];
> +
> + yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
> }
>
> /*
> diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
> index 932736fc3ee9..d818ed246a46 100644
> --- a/drivers/gpu/drm/vkms/vkms_plane.c
> +++ b/drivers/gpu/drm/vkms/vkms_plane.c
> @@ -118,13 +118,20 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
> return;
>
> fmt = fb->format->format;
> + pixel_read_t pixel_read = get_pixel_conversion_function(fmt);
> +
> + if (!pixel_read) {
> + DRM_WARN("The requested pixel format is not supported by VKMS. The new state is "
> + "not applied.\n");
> + return;
> + }
> +
> vkms_plane_state = to_vkms_plane_state(new_state);
> shadow_plane_state = &vkms_plane_state->base;
>
> frame_info = vkms_plane_state->frame_info;
> memcpy(&frame_info->src, &new_state->src, sizeof(struct drm_rect));
> memcpy(&frame_info->dst, &new_state->dst, sizeof(struct drm_rect));
> - memcpy(&frame_info->rotated, &new_state->dst, sizeof(struct drm_rect));
> frame_info->fb = fb;
> memcpy(&frame_info->map, &shadow_plane_state->data, sizeof(frame_info->map));
> drm_framebuffer_get(frame_info->fb);
> @@ -134,10 +141,8 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
> DRM_MODE_REFLECT_X |
> DRM_MODE_REFLECT_Y);
>
> - drm_rect_rotate(&frame_info->rotated, drm_rect_width(&frame_info->rotated),
> - drm_rect_height(&frame_info->rotated), frame_info->rotation);
>
> - vkms_plane_state->pixel_read = get_pixel_conversion_function(fmt);
> + vkms_plane_state->pixel_read = pixel_read;
> }
>
> static int vkms_plane_atomic_check(struct drm_plane *plane,
>
Hi Pekka,
[email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
> On Thu, 01 Feb 2024 18:31:32 +0100
> Louis Chauvet <[email protected]> wrote:
>
> > Change the composition algorithm to iterate over pixels instead of lines.
> > It allows a simpler management of rotation and pixel access for complex formats.
> >
> > This new algorithm allows read_pixel function to have access to x/y
> > coordinates and make it possible to read the correct thing in a block
> > when block_w and block_h are not 1.
> > The iteration pixel-by-pixel in the same method also allows a simpler
> > management of rotation with drm_rect_* helpers. This way it's not needed
> > anymore to have misterious switch-case distributed in multiple places.
>
> Hi,
>
> there was a very good reason to write this code using lines:
> performance. Before lines, it was indeed operating on individual pixels.
>
> Please, include performance measurements before and after this series
> to quantify the impact on the previously already supported pixel
> formats, particularly the 32-bit-per-pixel RGB variants.
>
> VKMS will be used more and more in CI for userspace projects, and
> performance actually matters there.
>
> I'm worrying that this performance degradation here is significant. I
> believe it is possible to keep blending with lines, if you add new line
> getters for reading from rotated, sub-sampled etc. images. That way you
> don't have to regress the most common formats' performance.
While I understand performance is important and should be taken into
account seriously, I cannot understand how broken testing could be
considered better. Fast but inaccurate will always be significantly
less attractive to my eyes.
I am in favor of making this working first, and then improving the code
for faster results. Maybe the line-driven approach can be dedicated to
"simpler" formats where more complex corner cases do not happen. But
for now I don't see the point in comparing performances between broken
and (hopefully) non broken implementations.
Thanks,
Miquèl
On 2/2/24 05:15, Louis Chauvet wrote:
> Le 01/02/24 - 19:07, Maira Canal a écrit :
>> Hi Louis,
>>
>> Thanks for your patches! Could you please rebase them on top of
>> drm-misc-next? It would make it easier for me to review and test the
>> patches.
>>
>> Best Regards,
>> - Maíra
>
> Hi Maíra,
>
> Do you want me to rebase the whole YUV [1] series or should I extract and
> make my two patches independent?
Please, make this two patches independent.
Best Regards,
- Maíra
>
> [1]: https://lore.kernel.org/dri-devel/[email protected]/
>
> Best regards,
> Louis Chauvet
>
>> On 2/1/24 14:31, Louis Chauvet wrote:
>>> This patchset aims to solve issues I found in [1], and at the same time
>>> simplify the composition algorithm.
>>>
>>> I sent more igt-gpu-tools test [2] to cover more things and detect the
>>> issues in [1].
>>>
>>> This patchset is based on [1].
>>>
>>> Patch 1/2: This patch is a no-op, but make the code more readable
>>> regarding the pixel_read functions.
>>>
>>> Patch 2/2: This patch is more complex. My main target was to solve issues
>>> I found in [1], but as it was very complex to do it "in place", I choose
>>> to rework the composition function.
>>> The main two advantages are:
>>> - It's now possible to create conversion function for packed & grouped
>>> pixels. Some pixel formats need absolute x/y position and not only an
>>> offset in the buffer to extract the correct value. This part also solve
>>> the issues I found in [1].
>>> - The rotation management is now way easier to understand, there is no
>>> more switch case in different places and instead of copy/pasting rotation
>>> formula I used drm_rect_* helpers.
>>>
>>> [1]: https://lore.kernel.org/dri-devel/[email protected]/
>>> [2]: https://lore.kernel.org/igt-dev/[email protected]/T/#t
>>>
>>> Signed-off-by: Louis Chauvet <[email protected]>
>>> ---
>>> Louis Chauvet (2):
>>> drm/vkms: Create a type to check a function pointer validity
>>> drm/vkms: Use a simpler composition function
>>>
>>> drivers/gpu/drm/vkms/vkms_composer.c | 97 ++++++++-----
>>> drivers/gpu/drm/vkms/vkms_drv.h | 32 ++++-
>>> drivers/gpu/drm/vkms/vkms_formats.c | 254 ++++++++++++++++++-----------------
>>> drivers/gpu/drm/vkms/vkms_formats.h | 2 +-
>>> drivers/gpu/drm/vkms/vkms_plane.c | 13 +-
>>> 5 files changed, 236 insertions(+), 162 deletions(-)
>>> ---
>>> base-commit: 5d189d57bb335a87ec38ea26fe43a5f3ed31ced7
>>> change-id: 20240201-yuv-1337d90d9576
>>>
>>> Best regards,
Hi Miquel,
On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
>
> > On Thu, 01 Feb 2024 18:31:32 +0100
> > Louis Chauvet <[email protected]> wrote:
> >
> > > Change the composition algorithm to iterate over pixels instead of lines.
> > > It allows a simpler management of rotation and pixel access for complex formats.
> > >
> > > This new algorithm allows read_pixel function to have access to x/y
> > > coordinates and make it possible to read the correct thing in a block
> > > when block_w and block_h are not 1.
> > > The iteration pixel-by-pixel in the same method also allows a simpler
> > > management of rotation with drm_rect_* helpers. This way it's not needed
> > > anymore to have misterious switch-case distributed in multiple places.
> >
> > Hi,
> >
> > there was a very good reason to write this code using lines:
> > performance. Before lines, it was indeed operating on individual pixels.
> >
> > Please, include performance measurements before and after this series
> > to quantify the impact on the previously already supported pixel
> > formats, particularly the 32-bit-per-pixel RGB variants.
> >
> > VKMS will be used more and more in CI for userspace projects, and
> > performance actually matters there.
> >
> > I'm worrying that this performance degradation here is significant. I
> > believe it is possible to keep blending with lines, if you add new line
> > getters for reading from rotated, sub-sampled etc. images. That way you
> > don't have to regress the most common formats' performance.
>
> While I understand performance is important and should be taken into
> account seriously, I cannot understand how broken testing could be
> considered better. Fast but inaccurate will always be significantly
> less attractive to my eyes.
AFAIK, neither the cover letter nor the commit log claimed it was fixing
something broken, just that it was "better" (according to what
criteria?).
If something is truly broken, it must be stated what exactly is so we
can all come up with a solution that will satisfy everyone.
> I am in favor of making this working first, and then improving the code
> for faster results. Maybe the line-driven approach can be dedicated to
> "simpler" formats where more complex corner cases do not happen.
Which ones?
Maxime
Hi Miquel,
On 2/2/24 06:26, Miquel Raynal wrote:
> Hi Pekka,
>
> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
>
>> On Thu, 01 Feb 2024 18:31:32 +0100
>> Louis Chauvet <[email protected]> wrote:
>>
>>> Change the composition algorithm to iterate over pixels instead of lines.
>>> It allows a simpler management of rotation and pixel access for complex formats.
>>>
>>> This new algorithm allows read_pixel function to have access to x/y
>>> coordinates and make it possible to read the correct thing in a block
>>> when block_w and block_h are not 1.
>>> The iteration pixel-by-pixel in the same method also allows a simpler
>>> management of rotation with drm_rect_* helpers. This way it's not needed
>>> anymore to have misterious switch-case distributed in multiple places.
>>
>> Hi,
>>
>> there was a very good reason to write this code using lines:
>> performance. Before lines, it was indeed operating on individual pixels.
>>
>> Please, include performance measurements before and after this series
>> to quantify the impact on the previously already supported pixel
>> formats, particularly the 32-bit-per-pixel RGB variants.
>>
>> VKMS will be used more and more in CI for userspace projects, and
>> performance actually matters there.
>>
>> I'm worrying that this performance degradation here is significant. I
>> believe it is possible to keep blending with lines, if you add new line
>> getters for reading from rotated, sub-sampled etc. images. That way you
>> don't have to regress the most common formats' performance.
>
> While I understand performance is important and should be taken into
> account seriously, I cannot understand how broken testing could be
> considered better. Fast but inaccurate will always be significantly
> less attractive to my eyes.
>
> I am in favor of making this working first, and then improving the code
> for faster results. Maybe the line-driven approach can be dedicated to
> "simpler" formats where more complex corner cases do not happen. But
> for now I don't see the point in comparing performances between broken
> and (hopefully) non broken implementations.
We still haven't landed the YUV series for VKMS. Therefore, the code
available in the repository is not broken.
Performance is crucial for VKMS, as Pekka mentioned. First, because it
will be used more and more in CI. Second, because I wouldn't like to see
IGT tests timing out in VKMS.
Best Regards,
- Maíra
>
> Thanks,
> Miquèl
On Fri, 2 Feb 2024 10:26:01 +0100
Miquel Raynal <[email protected]> wrote:
> Hi Pekka,
>
> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
>
> > On Thu, 01 Feb 2024 18:31:32 +0100
> > Louis Chauvet <[email protected]> wrote:
> >
> > > Change the composition algorithm to iterate over pixels instead of lines.
> > > It allows a simpler management of rotation and pixel access for complex formats.
> > >
> > > This new algorithm allows read_pixel function to have access to x/y
> > > coordinates and make it possible to read the correct thing in a block
> > > when block_w and block_h are not 1.
> > > The iteration pixel-by-pixel in the same method also allows a simpler
> > > management of rotation with drm_rect_* helpers. This way it's not needed
> > > anymore to have misterious switch-case distributed in multiple places
> >
> > Hi,
> >
> > there was a very good reason to write this code using lines:
> > performance. Before lines, it was indeed operating on individual pixels.
> >
> > Please, include performance measurements before and after this series
> > to quantify the impact on the previously already supported pixel
> > formats, particularly the 32-bit-per-pixel RGB variants.
> >
> > VKMS will be used more and more in CI for userspace projects, and
> > performance actually matters there.
> >
> > I'm worrying that this performance degradation here is significant. I
> > believe it is possible to keep blending with lines, if you add new line
> > getters for reading from rotated, sub-sampled etc. images. That way you
> > don't have to regress the most common formats' performance.
>
> While I understand performance is important and should be taken into
> account seriously, I cannot understand how broken testing could be
> considered better. Fast but inaccurate will always be significantly
> less attractive to my eyes.
>
> I am in favor of making this working first, and then improving the code
> for faster results. Maybe the line-driven approach can be dedicated to
> "simpler" formats where more complex corner cases do not happen. But
> for now I don't see the point in comparing performances between broken
> and (hopefully) non broken implementations.
Hi,
usually that is a good idea, but in this case you are changing the
fundamental design of the algorithm. That makes it easy to add new
things (e.g. YUV support) that will be even harder to make faster
later. If you then later improve performance, that is another
re-design. You would be making maintainers review two rewrites instead
of one or less.
I suspect it might be less effort for the author as well to not ditch
the line-based based algorithm as the first step.
Thanks,
pq
On Fri, 2 Feb 2024 13:13:22 +0100
Miquel Raynal <[email protected]> wrote:
> Hello Maxime,
>
> + Arthur
>
> [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
>
> > Hi Miquel,
> >
> > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
> > > [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
> > >
> > > > On Thu, 01 Feb 2024 18:31:32 +0100
> > > > Louis Chauvet <[email protected]> wrote:
> > > >
> > > > > Change the composition algorithm to iterate over pixels instead of lines.
> > > > > It allows a simpler management of rotation and pixel access for complex formats.
> > > > >
> > > > > This new algorithm allows read_pixel function to have access to x/y
> > > > > coordinates and make it possible to read the correct thing in a block
> > > > > when block_w and block_h are not 1.
> > > > > The iteration pixel-by-pixel in the same method also allows a simpler
> > > > > management of rotation with drm_rect_* helpers. This way it's not needed
> > > > > anymore to have misterious switch-case distributed in multiple places.
> > > >
> > > > Hi,
> > > >
> > > > there was a very good reason to write this code using lines:
> > > > performance. Before lines, it was indeed operating on individual pixels.
> > > >
> > > > Please, include performance measurements before and after this series
> > > > to quantify the impact on the previously already supported pixel
> > > > formats, particularly the 32-bit-per-pixel RGB variants.
> > > >
> > > > VKMS will be used more and more in CI for userspace projects, and
> > > > performance actually matters there.
> > > >
> > > > I'm worrying that this performance degradation here is significant. I
> > > > believe it is possible to keep blending with lines, if you add new line
> > > > getters for reading from rotated, sub-sampled etc. images. That way you
> > > > don't have to regress the most common formats' performance.
> > >
> > > While I understand performance is important and should be taken into
> > > account seriously, I cannot understand how broken testing could be
> > > considered better. Fast but inaccurate will always be significantly
> > > less attractive to my eyes.
> >
> > AFAIK, neither the cover letter nor the commit log claimed it was fixing
> > something broken, just that it was "better" (according to what
> > criteria?).
>
> Better is probably too vague and I agree the "fixing" part is not
> clearly explained in the commit log. The cover-letter however states:
>
> > Patch 2/2: This patch is more complex. My main target was to solve issues
> > I found in [1], but as it was very complex to do it "in place", I choose
> > to rework the composition function.
> ...
> > [1]: https://lore.kernel.org/dri-devel/[email protected]/
>
> If you follow this link you will find all the feedback and especially
> the "broken" parts. Just to be clear, writing bugs is totally expected
> and review/testing is supposed to help on this regard. I am not blaming
> the author in any way, just focusing on getting this code in a more
> readable shape and hopefully reinforce the testing procedure.
>
> > If something is truly broken, it must be stated what exactly is so we
> > can all come up with a solution that will satisfy everyone.
>
> Maybe going through the series pointed above will give more context
> but AFAIU: the YUV composition is not totally right (and the tests used
> to validate it need to be more complex as well in order to fail).
>
> Here is a proposal.
>
> Today's RGB implementation is only optimized in the line-by-line case
> when there is no rotation. The logic is bit convoluted and may possibly
> be slightly clarified with a per-format read_line() implementation,
> at a very light performance cost. Such an improvement would definitely
> benefit to the clarity of the code, especially when transformations
> (especially the rotations) come into play because they would be clearly
> handled differently instead of being "hidden" in the optimized logic.
> Performances would not change much as this path is not optimized today
> anyway (the pixel-oriented logic is already used in the rotation case).
>
> Arthur's YUV implementation is indeed well optimized but the added
> complexity probably lead to small mistakes in the logic. The
> per-format read_line() implementation mentioned above could be
> extended to the YUV format as well, which would leverage Arthur's
> proposal by re-using his optimized version. Louis will help on this
> regard. However, for more complex cases such as when there is a
> rotation, it will be easier (and not sub-optimized compared to the RGB
> case) to also fallback to a pixel-oriented processing.
>
> Would this approach make sense?
Hi,
I think it would, if I understand what you mean. Ever since I proposed
a line-by-line algorithm to improve the performance, I was thinking of
per-format read_line() functions that would be selected outside of any
loops. Extending that to support YUV is only natural. I can imagine
rotation complicates things, and I won't oppose that resulting in a
much heavier read_line() implementation used in those cases. They might
perhaps call the original read_line() implementations pixel-by-pixel or
plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
things even further. That way one could compose any
rotation-format-siting combination by chaining function pointers.
I haven't looked at VKMS in a long time, and I am disappointed to find
that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
The reading vfunc should be called with many pixels at a time when the
source FB layout allows it. The whole point of the line-based functions
was that they repeat the innermost loop in every function body to make
the per-pixel overhead as small as possible. The VKMS implementations
benchmarked before and after the original line-based algorithm showed
that calling a function pointer per-pixel is relatively very expensive.
Or maybe it was a switch-case.
Sorry, I didn't realize the optimization had already been lost.
Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
it's not composing anything and the name mislead me.
I think if you inspect the compositing code as of revision
8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
what it was supposed to be.
Thanks,
pq
Hi Pekka,
[email protected] wrote on Fri, 2 Feb 2024 17:49:13 +0200:
> On Fri, 2 Feb 2024 13:13:22 +0100
> Miquel Raynal <[email protected]> wrote:
>
> > Hello Maxime,
> >
> > + Arthur
> >
> > [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
> >
> > > Hi Miquel,
> > >
> > > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
> > > > [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
> > > >
> > > > > On Thu, 01 Feb 2024 18:31:32 +0100
> > > > > Louis Chauvet <[email protected]> wrote:
> > > > >
> > > > > > Change the composition algorithm to iterate over pixels instead of lines.
> > > > > > It allows a simpler management of rotation and pixel access for complex formats.
> > > > > >
> > > > > > This new algorithm allows read_pixel function to have access to x/y
> > > > > > coordinates and make it possible to read the correct thing in a block
> > > > > > when block_w and block_h are not 1.
> > > > > > The iteration pixel-by-pixel in the same method also allows a simpler
> > > > > > management of rotation with drm_rect_* helpers. This way it's not needed
> > > > > > anymore to have misterious switch-case distributed in multiple places.
> > > > >
> > > > > Hi,
> > > > >
> > > > > there was a very good reason to write this code using lines:
> > > > > performance. Before lines, it was indeed operating on individual pixels.
> > > > >
> > > > > Please, include performance measurements before and after this series
> > > > > to quantify the impact on the previously already supported pixel
> > > > > formats, particularly the 32-bit-per-pixel RGB variants.
> > > > >
> > > > > VKMS will be used more and more in CI for userspace projects, and
> > > > > performance actually matters there.
> > > > >
> > > > > I'm worrying that this performance degradation here is significant. I
> > > > > believe it is possible to keep blending with lines, if you add new line
> > > > > getters for reading from rotated, sub-sampled etc. images. That way you
> > > > > don't have to regress the most common formats' performance.
> > > >
> > > > While I understand performance is important and should be taken into
> > > > account seriously, I cannot understand how broken testing could be
> > > > considered better. Fast but inaccurate will always be significantly
> > > > less attractive to my eyes.
> > >
> > > AFAIK, neither the cover letter nor the commit log claimed it was fixing
> > > something broken, just that it was "better" (according to what
> > > criteria?).
> >
> > Better is probably too vague and I agree the "fixing" part is not
> > clearly explained in the commit log. The cover-letter however states:
> >
> > > Patch 2/2: This patch is more complex. My main target was to solve issues
> > > I found in [1], but as it was very complex to do it "in place", I choose
> > > to rework the composition function.
> > ...
> > > [1]: https://lore.kernel.org/dri-devel/[email protected]/
> >
> > If you follow this link you will find all the feedback and especially
> > the "broken" parts. Just to be clear, writing bugs is totally expected
> > and review/testing is supposed to help on this regard. I am not blaming
> > the author in any way, just focusing on getting this code in a more
> > readable shape and hopefully reinforce the testing procedure.
> >
> > > If something is truly broken, it must be stated what exactly is so we
> > > can all come up with a solution that will satisfy everyone.
> >
> > Maybe going through the series pointed above will give more context
> > but AFAIU: the YUV composition is not totally right (and the tests used
> > to validate it need to be more complex as well in order to fail).
> >
> > Here is a proposal.
> >
> > Today's RGB implementation is only optimized in the line-by-line case
> > when there is no rotation. The logic is bit convoluted and may possibly
> > be slightly clarified with a per-format read_line() implementation,
> > at a very light performance cost. Such an improvement would definitely
> > benefit to the clarity of the code, especially when transformations
> > (especially the rotations) come into play because they would be clearly
> > handled differently instead of being "hidden" in the optimized logic.
> > Performances would not change much as this path is not optimized today
> > anyway (the pixel-oriented logic is already used in the rotation case).
> >
> > Arthur's YUV implementation is indeed well optimized but the added
> > complexity probably lead to small mistakes in the logic. The
> > per-format read_line() implementation mentioned above could be
> > extended to the YUV format as well, which would leverage Arthur's
> > proposal by re-using his optimized version. Louis will help on this
> > regard. However, for more complex cases such as when there is a
> > rotation, it will be easier (and not sub-optimized compared to the RGB
> > case) to also fallback to a pixel-oriented processing.
> >
> > Would this approach make sense?
>
> Hi,
>
> I think it would, if I understand what you mean. Ever since I proposed
> a line-by-line algorithm to improve the performance, I was thinking of
> per-format read_line() functions that would be selected outside of any
> loops. Extending that to support YUV is only natural. I can imagine
> rotation complicates things, and I won't oppose that resulting in a
> much heavier read_line() implementation used in those cases. They might
> perhaps call the original read_line() implementations pixel-by-pixel or
> plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
> things even further. That way one could compose any
> rotation-format-siting combination by chaining function pointers.
I'll let Louis also validate but on my side I feel like I totally
agree with your feedback.
> I haven't looked at VKMS in a long time, and I am disappointed to find
> that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> The reading vfunc should be called with many pixels at a time when the
> source FB layout allows it. The whole point of the line-based functions
> was that they repeat the innermost loop in every function body to make
> the per-pixel overhead as small as possible. The VKMS implementations
> benchmarked before and after the original line-based algorithm showed
> that calling a function pointer per-pixel is relatively very expensive.
> Or maybe it was a switch-case.
Indeed, since your initial feedback Louis made a couple of comparisons
and the time penalty is between +5% and +60% depending on the case,
AFAIR.
> Sorry, I didn't realize the optimization had already been lost.
No problem, actually I also lost myself in my first answer as I
initially thought the (mainline) RGB logic was also broken in edge
cases, which it was not, only the YUV logic suffered from some
limitations.
> Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
> it's not composing anything and the name mislead me.
Makes sense.
> I think if you inspect the compositing code as of revision
> 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
> what it was supposed to be.
Excellent, thanks a lot!
Miquèl
On Fri, 2 Feb 2024 17:07:34 +0100
Miquel Raynal <[email protected]> wrote:
> Hi Pekka,
Hi Miquel,
I'm happy to see no hard feelings below. I know I may be blunt at
times.
Another thing coming to my mind is that I suppose there is no
agreed standard benchmark for VKMS performance, is there?
I recall people running selected IGT tests in a loop in the past,
and I worry that the IGT start-up overhead and small plane
dimensions might skew the results.
Would it be possible to have a standardised benchmark specifically
for performance rather than correctness, in IGT or where-ever it
would make sense? Then it would be simple to tell contributors to
run this and report the numbers before and after.
I would propose this kind of KMS layout:
- CRTC size 3841 x 2161
- primary plane, XRGB8888, 3639 x 2161 @ 101,0
- overlay A, XBGR2101010, 3033 x 1777 @ 201,199
- overlay B, ARGB8888, 1507 x 1400 @ 1800,250
The sizes and positions are deliberately odd to try to avoid happy
alignment accidents. The planes are big, which should let the pixel
operations easily dominate performance measurement. There are
different pixel formats, both opaque and semi-transparent. There is
lots of plane overlap. The planes also do not cover the whole CRTC
leaving the background visible a bit.
There should be two FBs per each plane, flipped alternatingly each
frame. Writeback should be active. Run this a number of frames, say,
100, and measure the kernel CPU time taken. It's supposed to take at
least several seconds in total.
I think something like this should be the base benchmark. One can
add more to it, like rotated planes, YUV planes, etc. or switch
settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe
one more overlay that is very tall and thin.
Just an idea, what do you all think?
Thanks,
pq
> [email protected] wrote on Fri, 2 Feb 2024 17:49:13 +0200:
>
> > On Fri, 2 Feb 2024 13:13:22 +0100
> > Miquel Raynal <[email protected]> wrote:
> >
> > > Hello Maxime,
> > >
> > > + Arthur
> > >
> > > [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
> > >
> > > > Hi Miquel,
> > > >
> > > > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
> > > > > [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
> > > > >
> > > > > > On Thu, 01 Feb 2024 18:31:32 +0100
> > > > > > Louis Chauvet <[email protected]> wrote:
> > > > > >
> > > > > > > Change the composition algorithm to iterate over pixels instead of lines.
> > > > > > > It allows a simpler management of rotation and pixel access for complex formats.
> > > > > > >
> > > > > > > This new algorithm allows read_pixel function to have access to x/y
> > > > > > > coordinates and make it possible to read the correct thing in a block
> > > > > > > when block_w and block_h are not 1.
> > > > > > > The iteration pixel-by-pixel in the same method also allows a simpler
> > > > > > > management of rotation with drm_rect_* helpers. This way it's not needed
> > > > > > > anymore to have misterious switch-case distributed in multiple places.
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > there was a very good reason to write this code using lines:
> > > > > > performance. Before lines, it was indeed operating on individual pixels.
> > > > > >
> > > > > > Please, include performance measurements before and after this series
> > > > > > to quantify the impact on the previously already supported pixel
> > > > > > formats, particularly the 32-bit-per-pixel RGB variants.
> > > > > >
> > > > > > VKMS will be used more and more in CI for userspace projects, and
> > > > > > performance actually matters there.
> > > > > >
> > > > > > I'm worrying that this performance degradation here is significant. I
> > > > > > believe it is possible to keep blending with lines, if you add new line
> > > > > > getters for reading from rotated, sub-sampled etc. images. That way you
> > > > > > don't have to regress the most common formats' performance.
> > > > >
> > > > > While I understand performance is important and should be taken into
> > > > > account seriously, I cannot understand how broken testing could be
> > > > > considered better. Fast but inaccurate will always be significantly
> > > > > less attractive to my eyes.
> > > >
> > > > AFAIK, neither the cover letter nor the commit log claimed it was fixing
> > > > something broken, just that it was "better" (according to what
> > > > criteria?).
> > >
> > > Better is probably too vague and I agree the "fixing" part is not
> > > clearly explained in the commit log. The cover-letter however states:
> > >
> > > > Patch 2/2: This patch is more complex. My main target was to solve issues
> > > > I found in [1], but as it was very complex to do it "in place", I choose
> > > > to rework the composition function.
> > > ...
> > > > [1]: https://lore.kernel.org/dri-devel/[email protected]/
> > >
> > > If you follow this link you will find all the feedback and especially
> > > the "broken" parts. Just to be clear, writing bugs is totally expected
> > > and review/testing is supposed to help on this regard. I am not blaming
> > > the author in any way, just focusing on getting this code in a more
> > > readable shape and hopefully reinforce the testing procedure.
> > >
> > > > If something is truly broken, it must be stated what exactly is so we
> > > > can all come up with a solution that will satisfy everyone.
> > >
> > > Maybe going through the series pointed above will give more context
> > > but AFAIU: the YUV composition is not totally right (and the tests used
> > > to validate it need to be more complex as well in order to fail).
> > >
> > > Here is a proposal.
> > >
> > > Today's RGB implementation is only optimized in the line-by-line case
> > > when there is no rotation. The logic is bit convoluted and may possibly
> > > be slightly clarified with a per-format read_line() implementation,
> > > at a very light performance cost. Such an improvement would definitely
> > > benefit to the clarity of the code, especially when transformations
> > > (especially the rotations) come into play because they would be clearly
> > > handled differently instead of being "hidden" in the optimized logic.
> > > Performances would not change much as this path is not optimized today
> > > anyway (the pixel-oriented logic is already used in the rotation case).
> > >
> > > Arthur's YUV implementation is indeed well optimized but the added
> > > complexity probably lead to small mistakes in the logic. The
> > > per-format read_line() implementation mentioned above could be
> > > extended to the YUV format as well, which would leverage Arthur's
> > > proposal by re-using his optimized version. Louis will help on this
> > > regard. However, for more complex cases such as when there is a
> > > rotation, it will be easier (and not sub-optimized compared to the RGB
> > > case) to also fallback to a pixel-oriented processing.
> > >
> > > Would this approach make sense?
> >
> > Hi,
> >
> > I think it would, if I understand what you mean. Ever since I proposed
> > a line-by-line algorithm to improve the performance, I was thinking of
> > per-format read_line() functions that would be selected outside of any
> > loops. Extending that to support YUV is only natural. I can imagine
> > rotation complicates things, and I won't oppose that resulting in a
> > much heavier read_line() implementation used in those cases. They might
> > perhaps call the original read_line() implementations pixel-by-pixel or
> > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
> > things even further. That way one could compose any
> > rotation-format-siting combination by chaining function pointers.
>
> I'll let Louis also validate but on my side I feel like I totally
> agree with your feedback.
>
> > I haven't looked at VKMS in a long time, and I am disappointed to find
> > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> > The reading vfunc should be called with many pixels at a time when the
> > source FB layout allows it. The whole point of the line-based functions
> > was that they repeat the innermost loop in every function body to make
> > the per-pixel overhead as small as possible. The VKMS implementations
> > benchmarked before and after the original line-based algorithm showed
> > that calling a function pointer per-pixel is relatively very expensive.
> > Or maybe it was a switch-case.
>
> Indeed, since your initial feedback Louis made a couple of comparisons
> and the time penalty is between +5% and +60% depending on the case,
> AFAIR.
>
> > Sorry, I didn't realize the optimization had already been lost.
>
> No problem, actually I also lost myself in my first answer as I
> initially thought the (mainline) RGB logic was also broken in edge
> cases, which it was not, only the YUV logic suffered from some
> limitations.
>
> > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
> > it's not composing anything and the name mislead me.
>
> Makes sense.
>
> > I think if you inspect the compositing code as of revision
> > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
> > what it was supposed to be.
>
> Excellent, thanks a lot!
>
> Miqu?l
--
Pekka Paalanen
On 02/02/24 12:49, Pekka Paalanen wrote:
> On Fri, 2 Feb 2024 13:13:22 +0100
> Miquel Raynal <[email protected]> wrote:
>
>> Hello Maxime,
>>
>> + Arthur
>>
>> [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
>>
>>> Hi Miquel,
>>>
>>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
>>>> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
>>>>
>>>>> On Thu, 01 Feb 2024 18:31:32 +0100
>>>>> Louis Chauvet <[email protected]> wrote:
>>>>>
>>>>>> Change the composition algorithm to iterate over pixels instead of lines.
>>>>>> It allows a simpler management of rotation and pixel access for complex formats.
>>>>>>
>>>>>> This new algorithm allows read_pixel function to have access to x/y
>>>>>> coordinates and make it possible to read the correct thing in a block
>>>>>> when block_w and block_h are not 1.
>>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
>>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
>>>>>> anymore to have misterious switch-case distributed in multiple places.
>>>>>
>>>>> Hi,
>>>>>
>>>>> there was a very good reason to write this code using lines:
>>>>> performance. Before lines, it was indeed operating on individual pixels.
>>>>>
>>>>> Please, include performance measurements before and after this series
>>>>> to quantify the impact on the previously already supported pixel
>>>>> formats, particularly the 32-bit-per-pixel RGB variants.
>>>>>
>>>>> VKMS will be used more and more in CI for userspace projects, and
>>>>> performance actually matters there.
>>>>>
>>>>> I'm worrying that this performance degradation here is significant. I
>>>>> believe it is possible to keep blending with lines, if you add new line
>>>>> getters for reading from rotated, sub-sampled etc. images. That way you
>>>>> don't have to regress the most common formats' performance.
>>>>
>>>> While I understand performance is important and should be taken into
>>>> account seriously, I cannot understand how broken testing could be
>>>> considered better. Fast but inaccurate will always be significantly
>>>> less attractive to my eyes.
>>>
>>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
>>> something broken, just that it was "better" (according to what
>>> criteria?).
>>
>> Better is probably too vague and I agree the "fixing" part is not
>> clearly explained in the commit log. The cover-letter however states:
>>
>>> Patch 2/2: This patch is more complex. My main target was to solve issues
>>> I found in [1], but as it was very complex to do it "in place", I choose
>>> to rework the composition function.
>> ...
>>> [1]: https://lore.kernel.org/dri-devel/[email protected]/
>>
>> If you follow this link you will find all the feedback and especially
>> the "broken" parts. Just to be clear, writing bugs is totally expected
>> and review/testing is supposed to help on this regard. I am not blaming
>> the author in any way, just focusing on getting this code in a more
>> readable shape and hopefully reinforce the testing procedure.
>>
>>> If something is truly broken, it must be stated what exactly is so we
>>> can all come up with a solution that will satisfy everyone.
>>
>> Maybe going through the series pointed above will give more context
>> but AFAIU: the YUV composition is not totally right (and the tests used
>> to validate it need to be more complex as well in order to fail).
>>
>> Here is a proposal.
>>
>> Today's RGB implementation is only optimized in the line-by-line case
>> when there is no rotation. The logic is bit convoluted and may possibly
>> be slightly clarified with a per-format read_line() implementation,
>> at a very light performance cost. Such an improvement would definitely
>> benefit to the clarity of the code, especially when transformations
>> (especially the rotations) come into play because they would be clearly
>> handled differently instead of being "hidden" in the optimized logic.
>> Performances would not change much as this path is not optimized today
>> anyway (the pixel-oriented logic is already used in the rotation case).
>>
>> Arthur's YUV implementation is indeed well optimized but the added
>> complexity probably lead to small mistakes in the logic. The
>> per-format read_line() implementation mentioned above could be
>> extended to the YUV format as well, which would leverage Arthur's
>> proposal by re-using his optimized version. Louis will help on this
>> regard. However, for more complex cases such as when there is a
>> rotation, it will be easier (and not sub-optimized compared to the RGB
>> case) to also fallback to a pixel-oriented processing.
>>
>> Would this approach make sense?
>
> Hi,
>
> I think it would, if I understand what you mean. Ever since I proposed
> a line-by-line algorithm to improve the performance, I was thinking of
> per-format read_line() functions that would be selected outside of any
> loops. Extending that to support YUV is only natural. I can imagine
> rotation complicates things, and I won't oppose that resulting in a
> much heavier read_line() implementation used in those cases. They might
> perhaps call the original read_line() implementations pixel-by-pixel or
> plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
> things even further. That way one could compose any
> rotation-format-siting combination by chaining function pointers.
>
> I haven't looked at VKMS in a long time, and I am disappointed to find
> that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> The reading vfunc should be called with many pixels at a time when the
> source FB layout allows it. The whole point of the line-based functions
> was that they repeat the innermost loop in every function body to make
> the per-pixel overhead as small as possible. The VKMS implementations
> benchmarked before and after the original line-based algorithm showed
> that calling a function pointer per-pixel is relatively very expensive.
> Or maybe it was a switch-case.
Hi,
I think I'm the culprit for that, as stated on [1]. My intention with
the suggestion was to remove some code repetition and too facilitate the
rotation support implementation. Going back, I think I was to high on
DRY at the time and didn't worry about optimization, which was a
mistake.
But, I agree with Miquel that the rotation logic is easier to implement
in a pixel-based way. So going pixel-by-pixel only when rotation occurs
would be great.
Best Regards,
~Arthur Grillo
[1]: https://lore.kernel.org/dri-devel/[email protected]/
>
> Sorry, I didn't realize the optimization had already been lost.
>
> Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
> it's not composing anything and the name mislead me.
>
> I think if you inspect the compositing code as of revision
> 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
> what it was supposed to be.
>
>
> Thanks,
> pq
On 01/02/24 14:31, Louis Chauvet wrote:
> Add the pixel_read_t type to check function prototype in structures
> and functions.
> It avoids casting to (void *) and at the same occasion allows the
> compiler to check the type properly.
>
> Signed-off-by: Louis Chauvet <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_drv.h | 17 +++++++++++++++--
> drivers/gpu/drm/vkms/vkms_formats.c | 4 ++--
> drivers/gpu/drm/vkms/vkms_formats.h | 2 +-
> 3 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
> index 51349a0c32d8..cb20bab26cae 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.h
> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
> @@ -48,6 +48,20 @@ struct vkms_writeback_job {
> void (*pixel_write)(u8 *dst_pixels, struct pixel_argb_u16 *in_pixel);
> };
>
> +/**
> + * typedef pixel_read_t - These functions are used to read the pixels in the source frame, convert
> + * them to argb16 and write them to out_pixel.
> + * It assumes that src_pixels point to a valid pixel (not a block, or a block of 1x1 pixel)
> + *
> + * @src_pixels: Source pointer to a pixel
> + * @out_pixel: Pointer where to write the pixel value
> + * @encoding: Color encoding to use (mainly used for YUV formats)
> + * @range: Color range to use (mainly used for YUV formats)
> + */
> +typedef void (*pixel_read_t)(u8 **src_pixels, int y,
> + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding enconding,
> + enum drm_color_range range);
> +
> /**
> * vkms_plane_state - Driver specific plane state
> * @base: base plane state
> @@ -56,8 +70,7 @@ struct vkms_writeback_job {
> struct vkms_plane_state {
> struct drm_shadow_plane_state base;
> struct vkms_frame_info *frame_info;
> - void (*pixel_read)(u8 **src_buffer, struct pixel_argb_u16 *out_pixel,
> - enum drm_color_encoding enconding, enum drm_color_range range);
> + pixel_read_t pixel_read;
Hi,
Please Cc me on the next versions,
You added the argument 'y' to the function but did not change the calls
to it, resulting in a compiler error. I think this argument addition
would be better on patch #2.
Best Regards,
~Arthur Grillo
> };
>
> struct vkms_plane {
> diff --git a/drivers/gpu/drm/vkms/vkms_formats.c b/drivers/gpu/drm/vkms/vkms_formats.c
> index e06bbd7c0a67..c6376db58d38 100644
> --- a/drivers/gpu/drm/vkms/vkms_formats.c
> +++ b/drivers/gpu/drm/vkms/vkms_formats.c
> @@ -390,7 +390,7 @@ void vkms_writeback_row(struct vkms_writeback_job *wb,
> wb->pixel_write(dst_pixels, &in_pixels[x]);
> }
>
> -void *get_pixel_conversion_function(u32 format)
> +pixel_read_t get_pixel_conversion_function(u32 format)
> {
> switch (format) {
> case DRM_FORMAT_ARGB8888:
> @@ -420,7 +420,7 @@ void *get_pixel_conversion_function(u32 format)
> case DRM_FORMAT_YVU444:
> return &planar_yvu_to_argb_u16;
> default:
> - return NULL;
> + return (pixel_read_t)NULL;
> }
> }
>
> diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
> index 0cf835292cec..04e31e126ab1 100644
> --- a/drivers/gpu/drm/vkms/vkms_formats.h
> +++ b/drivers/gpu/drm/vkms/vkms_formats.h
> @@ -5,7 +5,7 @@
>
> #include "vkms_drv.h"
>
> -void *get_pixel_conversion_function(u32 format);
> +pixel_read_t get_pixel_conversion_function(u32 format);
>
> void *get_pixel_write_function(u32 format);
>
>
On Fri, 2 Feb 2024 17:02:20 -0300
Arthur Grillo <[email protected]> wrote:
> On 02/02/24 12:49, Pekka Paalanen wrote:
> > On Fri, 2 Feb 2024 13:13:22 +0100
> > Miquel Raynal <[email protected]> wrote:
> >
> >> Hello Maxime,
> >>
> >> + Arthur
> >>
> >> [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
> >>
> >>> Hi Miquel,
> >>>
> >>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
> >>>> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
> >>>>
> >>>>> On Thu, 01 Feb 2024 18:31:32 +0100
> >>>>> Louis Chauvet <[email protected]> wrote:
> >>>>>
> >>>>>> Change the composition algorithm to iterate over pixels instead of lines.
> >>>>>> It allows a simpler management of rotation and pixel access for complex formats.
> >>>>>>
> >>>>>> This new algorithm allows read_pixel function to have access to x/y
> >>>>>> coordinates and make it possible to read the correct thing in a block
> >>>>>> when block_w and block_h are not 1.
> >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
> >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
> >>>>>> anymore to have misterious switch-case distributed in multiple places.
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> there was a very good reason to write this code using lines:
> >>>>> performance. Before lines, it was indeed operating on individual pixels.
> >>>>>
> >>>>> Please, include performance measurements before and after this series
> >>>>> to quantify the impact on the previously already supported pixel
> >>>>> formats, particularly the 32-bit-per-pixel RGB variants.
> >>>>>
> >>>>> VKMS will be used more and more in CI for userspace projects, and
> >>>>> performance actually matters there.
> >>>>>
> >>>>> I'm worrying that this performance degradation here is significant. I
> >>>>> believe it is possible to keep blending with lines, if you add new line
> >>>>> getters for reading from rotated, sub-sampled etc. images. That way you
> >>>>> don't have to regress the most common formats' performance.
> >>>>
> >>>> While I understand performance is important and should be taken into
> >>>> account seriously, I cannot understand how broken testing could be
> >>>> considered better. Fast but inaccurate will always be significantly
> >>>> less attractive to my eyes.
> >>>
> >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
> >>> something broken, just that it was "better" (according to what
> >>> criteria?).
> >>
> >> Better is probably too vague and I agree the "fixing" part is not
> >> clearly explained in the commit log. The cover-letter however states:
> >>
> >>> Patch 2/2: This patch is more complex. My main target was to solve issues
> >>> I found in [1], but as it was very complex to do it "in place", I choose
> >>> to rework the composition function.
> >> ...
> >>> [1]: https://lore.kernel.org/dri-devel/[email protected]/
> >>
> >> If you follow this link you will find all the feedback and especially
> >> the "broken" parts. Just to be clear, writing bugs is totally expected
> >> and review/testing is supposed to help on this regard. I am not blaming
> >> the author in any way, just focusing on getting this code in a more
> >> readable shape and hopefully reinforce the testing procedure.
> >>
> >>> If something is truly broken, it must be stated what exactly is so we
> >>> can all come up with a solution that will satisfy everyone.
> >>
> >> Maybe going through the series pointed above will give more context
> >> but AFAIU: the YUV composition is not totally right (and the tests used
> >> to validate it need to be more complex as well in order to fail).
> >>
> >> Here is a proposal.
> >>
> >> Today's RGB implementation is only optimized in the line-by-line case
> >> when there is no rotation. The logic is bit convoluted and may possibly
> >> be slightly clarified with a per-format read_line() implementation,
> >> at a very light performance cost. Such an improvement would definitely
> >> benefit to the clarity of the code, especially when transformations
> >> (especially the rotations) come into play because they would be clearly
> >> handled differently instead of being "hidden" in the optimized logic.
> >> Performances would not change much as this path is not optimized today
> >> anyway (the pixel-oriented logic is already used in the rotation case).
> >>
> >> Arthur's YUV implementation is indeed well optimized but the added
> >> complexity probably lead to small mistakes in the logic. The
> >> per-format read_line() implementation mentioned above could be
> >> extended to the YUV format as well, which would leverage Arthur's
> >> proposal by re-using his optimized version. Louis will help on this
> >> regard. However, for more complex cases such as when there is a
> >> rotation, it will be easier (and not sub-optimized compared to the RGB
> >> case) to also fallback to a pixel-oriented processing.
> >>
> >> Would this approach make sense?
> >
> > Hi,
> >
> > I think it would, if I understand what you mean. Ever since I proposed
> > a line-by-line algorithm to improve the performance, I was thinking of
> > per-format read_line() functions that would be selected outside of any
> > loops. Extending that to support YUV is only natural. I can imagine
> > rotation complicates things, and I won't oppose that resulting in a
> > much heavier read_line() implementation used in those cases. They might
> > perhaps call the original read_line() implementations pixel-by-pixel or
> > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
> > things even further. That way one could compose any
> > rotation-format-siting combination by chaining function pointers.
> >
> > I haven't looked at VKMS in a long time, and I am disappointed to find
> > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> > The reading vfunc should be called with many pixels at a time when the
> > source FB layout allows it. The whole point of the line-based functions
> > was that they repeat the innermost loop in every function body to make
> > the per-pixel overhead as small as possible. The VKMS implementations
> > benchmarked before and after the original line-based algorithm showed
> > that calling a function pointer per-pixel is relatively very expensive.
> > Or maybe it was a switch-case.
>
> Hi,
>
> I think I'm the culprit for that, as stated on [1]. My intention with
> the suggestion was to remove some code repetition and too facilitate the
> rotation support implementation. Going back, I think I was to high on
> DRY at the time and didn't worry about optimization, which was a
> mistake.
Hi Arthur,
no worries. We can also blame reviewers for not minding benchmarks. ;-)
> But, I agree with Miquel that the rotation logic is easier to implement
> in a pixel-based way. So going pixel-by-pixel only when rotation occurs
> would be great.
Yes, and I think that can very well be done in the line-based framework
still that existed in the old days before any rotation support was
added. Essentially a plug-in line-getter function that then calls a
format-specific line-getter pixel-by-pixel while applying the rotation.
It would be simple, it would leave unrotated performance unharmed (use
format-specific line-getter directly with lines), but it might be
somewhat less performant for rotated KMS planes. I suspect that might
be a good compromise.
Format-specific line-getters could also be parameterized by
pixel-to-pixel offset in bytes. Then they could directly traverse FB
rows forward and backward, and even FB columns. It may or may not have
a penalty compared to the original line-getters, so it would have to
be benchmarked.
Line-getters working on planar YUV FBs might delegate Y, U, V, or UV/VU
reading to R8 and/or RG88 line or pixel reader functions. They might
also return block-height lines instead of one line at a time. However,
I wouldn't commit to any approach without benchmarking alternatives.
The performance comparison must justify the code complexity, like it
was seen with the initial line-based implementation.
A good benchmark is key, IMO.
Thanks,
pq
>
> Best Regards,
> ~Arthur Grillo
>
> [1]: https://lore.kernel.org/dri-devel/[email protected]/
>
> >
> > Sorry, I didn't realize the optimization had already been lost.
> >
> > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
> > it's not composing anything and the name mislead me.
> >
> > I think if you inspect the compositing code as of revision
> > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
> > what it was supposed to be.
> >
> >
> > Thanks,
> > pq
On Mon, 5 Feb 2024 12:12:04 +0200
Pekka Paalanen <[email protected]> wrote:
> On Fri, 2 Feb 2024 17:02:20 -0300
> Arthur Grillo <[email protected]> wrote:
>
> > On 02/02/24 12:49, Pekka Paalanen wrote:
> > > On Fri, 2 Feb 2024 13:13:22 +0100
> > > Miquel Raynal <[email protected]> wrote:
> > >
> > >> Hello Maxime,
> > >>
> > >> + Arthur
> > >>
> > >> [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
> > >>
> > >>> Hi Miquel,
> > >>>
> > >>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
> > >>>> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
> > >>>>
> > >>>>> On Thu, 01 Feb 2024 18:31:32 +0100
> > >>>>> Louis Chauvet <[email protected]> wrote:
> > >>>>>
> > >>>>>> Change the composition algorithm to iterate over pixels instead of lines.
> > >>>>>> It allows a simpler management of rotation and pixel access for complex formats.
> > >>>>>>
> > >>>>>> This new algorithm allows read_pixel function to have access to x/y
> > >>>>>> coordinates and make it possible to read the correct thing in a block
> > >>>>>> when block_w and block_h are not 1.
> > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
> > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
> > >>>>>> anymore to have misterious switch-case distributed in multiple places.
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> there was a very good reason to write this code using lines:
> > >>>>> performance. Before lines, it was indeed operating on individual pixels.
> > >>>>>
> > >>>>> Please, include performance measurements before and after this series
> > >>>>> to quantify the impact on the previously already supported pixel
> > >>>>> formats, particularly the 32-bit-per-pixel RGB variants.
> > >>>>>
> > >>>>> VKMS will be used more and more in CI for userspace projects, and
> > >>>>> performance actually matters there.
> > >>>>>
> > >>>>> I'm worrying that this performance degradation here is significant. I
> > >>>>> believe it is possible to keep blending with lines, if you add new line
> > >>>>> getters for reading from rotated, sub-sampled etc. images. That way you
> > >>>>> don't have to regress the most common formats' performance.
> > >>>>
> > >>>> While I understand performance is important and should be taken into
> > >>>> account seriously, I cannot understand how broken testing could be
> > >>>> considered better. Fast but inaccurate will always be significantly
> > >>>> less attractive to my eyes.
> > >>>
> > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
> > >>> something broken, just that it was "better" (according to what
> > >>> criteria?).
> > >>
> > >> Better is probably too vague and I agree the "fixing" part is not
> > >> clearly explained in the commit log. The cover-letter however states:
> > >>
> > >>> Patch 2/2: This patch is more complex. My main target was to solve issues
> > >>> I found in [1], but as it was very complex to do it "in place", I choose
> > >>> to rework the composition function.
> > >> ...
> > >>> [1]: https://lore.kernel.org/dri-devel/[email protected]/
> > >>
> > >> If you follow this link you will find all the feedback and especially
> > >> the "broken" parts. Just to be clear, writing bugs is totally expected
> > >> and review/testing is supposed to help on this regard. I am not blaming
> > >> the author in any way, just focusing on getting this code in a more
> > >> readable shape and hopefully reinforce the testing procedure.
> > >>
> > >>> If something is truly broken, it must be stated what exactly is so we
> > >>> can all come up with a solution that will satisfy everyone.
> > >>
> > >> Maybe going through the series pointed above will give more context
> > >> but AFAIU: the YUV composition is not totally right (and the tests used
> > >> to validate it need to be more complex as well in order to fail).
> > >>
> > >> Here is a proposal.
> > >>
> > >> Today's RGB implementation is only optimized in the line-by-line case
> > >> when there is no rotation. The logic is bit convoluted and may possibly
> > >> be slightly clarified with a per-format read_line() implementation,
> > >> at a very light performance cost. Such an improvement would definitely
> > >> benefit to the clarity of the code, especially when transformations
> > >> (especially the rotations) come into play because they would be clearly
> > >> handled differently instead of being "hidden" in the optimized logic.
> > >> Performances would not change much as this path is not optimized today
> > >> anyway (the pixel-oriented logic is already used in the rotation case).
> > >>
> > >> Arthur's YUV implementation is indeed well optimized but the added
> > >> complexity probably lead to small mistakes in the logic. The
> > >> per-format read_line() implementation mentioned above could be
> > >> extended to the YUV format as well, which would leverage Arthur's
> > >> proposal by re-using his optimized version. Louis will help on this
> > >> regard. However, for more complex cases such as when there is a
> > >> rotation, it will be easier (and not sub-optimized compared to the RGB
> > >> case) to also fallback to a pixel-oriented processing.
> > >>
> > >> Would this approach make sense?
> > >
> > > Hi,
> > >
> > > I think it would, if I understand what you mean. Ever since I proposed
> > > a line-by-line algorithm to improve the performance, I was thinking of
> > > per-format read_line() functions that would be selected outside of any
> > > loops. Extending that to support YUV is only natural. I can imagine
> > > rotation complicates things, and I won't oppose that resulting in a
> > > much heavier read_line() implementation used in those cases. They might
> > > perhaps call the original read_line() implementations pixel-by-pixel or
> > > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
> > > things even further. That way one could compose any
> > > rotation-format-siting combination by chaining function pointers.
> > >
> > > I haven't looked at VKMS in a long time, and I am disappointed to find
> > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> > > The reading vfunc should be called with many pixels at a time when the
> > > source FB layout allows it. The whole point of the line-based functions
> > > was that they repeat the innermost loop in every function body to make
> > > the per-pixel overhead as small as possible. The VKMS implementations
> > > benchmarked before and after the original line-based algorithm showed
> > > that calling a function pointer per-pixel is relatively very expensive.
> > > Or maybe it was a switch-case.
> >
> > Hi,
> >
> > I think I'm the culprit for that, as stated on [1]. My intention with
> > the suggestion was to remove some code repetition and too facilitate the
> > rotation support implementation. Going back, I think I was to high on
> > DRY at the time and didn't worry about optimization, which was a
> > mistake.
>
> Hi Arthur,
>
> no worries. We can also blame reviewers for not minding benchmarks. ;-)
>
> > But, I agree with Miquel that the rotation logic is easier to implement
> > in a pixel-based way. So going pixel-by-pixel only when rotation occurs
> > would be great.
>
> Yes, and I think that can very well be done in the line-based framework
> still that existed in the old days before any rotation support was
> added. Essentially a plug-in line-getter function that then calls a
> format-specific line-getter pixel-by-pixel while applying the rotation.
> It would be simple, it would leave unrotated performance unharmed (use
> format-specific line-getter directly with lines), but it might be
> somewhat less performant for rotated KMS planes. I suspect that might
> be a good compromise.
>
> Format-specific line-getters could also be parameterized by
> pixel-to-pixel offset in bytes. Then they could directly traverse FB
> rows forward and backward, and even FB columns. It may or may not have
> a penalty compared to the original line-getters, so it would have to
> be benchmarked.
Oh, actually, since the byte offset depends on format, it might be
better to parametrize by direction and compute the offset in the
format-specific line-getter before the loop.
Thanks,
pq
> Line-getters working on planar YUV FBs might delegate Y, U, V, or UV/VU
> reading to R8 and/or RG88 line or pixel reader functions. They might
> also return block-height lines instead of one line at a time. However,
> I wouldn't commit to any approach without benchmarking alternatives.
> The performance comparison must justify the code complexity, like it
> was seen with the initial line-based implementation.
>
> A good benchmark is key, IMO.
>
>
> Thanks,
> pq
>
> >
> > Best Regards,
> > ~Arthur Grillo
> >
> > [1]: https://lore.kernel.org/dri-devel/[email protected]/
> >
> > >
> > > Sorry, I didn't realize the optimization had already been lost.
> > >
> > > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
> > > it's not composing anything and the name mislead me.
> > >
> > > I think if you inspect the compositing code as of revision
> > > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
> > > what it was supposed to be.
> > >
> > >
> > > Thanks,
> > > pq
>
On 02/02/24 16:45, Pekka Paalanen wrote:
> On Fri, 2 Feb 2024 17:07:34 +0100
> Miquel Raynal <[email protected]> wrote:
>
>> Hi Pekka,
>
> Hi Miquel,
>
> I'm happy to see no hard feelings below. I know I may be blunt at
> times.
>
> Another thing coming to my mind is that I suppose there is no
> agreed standard benchmark for VKMS performance, is there?
>
> I recall people running selected IGT tests in a loop in the past,
> and I worry that the IGT start-up overhead and small plane
> dimensions might skew the results.
>
> Would it be possible to have a standardised benchmark specifically
> for performance rather than correctness, in IGT or where-ever it
> would make sense? Then it would be simple to tell contributors to
> run this and report the numbers before and after.
>
> I would propose this kind of KMS layout:
>
> - CRTC size 3841 x 2161
> - primary plane, XRGB8888, 3639 x 2161 @ 101,0
> - overlay A, XBGR2101010, 3033 x 1777 @ 201,199
> - overlay B, ARGB8888, 1507 x 1400 @ 1800,250
>
> The sizes and positions are deliberately odd to try to avoid happy
> alignment accidents. The planes are big, which should let the pixel
> operations easily dominate performance measurement. There are
> different pixel formats, both opaque and semi-transparent. There is
> lots of plane overlap. The planes also do not cover the whole CRTC
> leaving the background visible a bit.
>
> There should be two FBs per each plane, flipped alternatingly each
> frame. Writeback should be active. Run this a number of frames, say,
> 100, and measure the kernel CPU time taken. It's supposed to take at
> least several seconds in total.
>
> I think something like this should be the base benchmark. One can
> add more to it, like rotated planes, YUV planes, etc. or switch
> settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe
> one more overlay that is very tall and thin.
>
> Just an idea, what do you all think?
Hi Pekka,
I just finished writing this proposal using IGT.
I got pretty interesting results:
The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took
around 13 seconds. While drm-misc/drm-misc-next took 36 seconds.
I'm currently bisecting to be certain that the change to the
pixel-by-pixel is the culprit, but I don't see why it wouldn't be.
I just need to do some final touches on the benchmark code and it
will be ready for revision.
Best Regards,
~Arthur Grillo
>
>
> Thanks,
> pq
>
>> [email protected] wrote on Fri, 2 Feb 2024 17:49:13 +0200:
>>
>>> On Fri, 2 Feb 2024 13:13:22 +0100
>>> Miquel Raynal <[email protected]> wrote:
>>>
>>>> Hello Maxime,
>>>>
>>>> + Arthur
>>>>
>>>> [email protected] wrote on Fri, 2 Feb 2024 10:53:37 +0100:
>>>>
>>>>> Hi Miquel,
>>>>>
>>>>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote:
>>>>>> [email protected] wrote on Fri, 2 Feb 2024 10:55:22 +0200:
>>>>>>
>>>>>>> On Thu, 01 Feb 2024 18:31:32 +0100
>>>>>>> Louis Chauvet <[email protected]> wrote:
>>>>>>>
>>>>>>>> Change the composition algorithm to iterate over pixels instead of lines.
>>>>>>>> It allows a simpler management of rotation and pixel access for complex formats.
>>>>>>>>
>>>>>>>> This new algorithm allows read_pixel function to have access to x/y
>>>>>>>> coordinates and make it possible to read the correct thing in a block
>>>>>>>> when block_w and block_h are not 1.
>>>>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
>>>>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
>>>>>>>> anymore to have misterious switch-case distributed in multiple places.
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> there was a very good reason to write this code using lines:
>>>>>>> performance. Before lines, it was indeed operating on individual pixels.
>>>>>>>
>>>>>>> Please, include performance measurements before and after this series
>>>>>>> to quantify the impact on the previously already supported pixel
>>>>>>> formats, particularly the 32-bit-per-pixel RGB variants.
>>>>>>>
>>>>>>> VKMS will be used more and more in CI for userspace projects, and
>>>>>>> performance actually matters there.
>>>>>>>
>>>>>>> I'm worrying that this performance degradation here is significant. I
>>>>>>> believe it is possible to keep blending with lines, if you add new line
>>>>>>> getters for reading from rotated, sub-sampled etc. images. That way you
>>>>>>> don't have to regress the most common formats' performance.
>>>>>>
>>>>>> While I understand performance is important and should be taken into
>>>>>> account seriously, I cannot understand how broken testing could be
>>>>>> considered better. Fast but inaccurate will always be significantly
>>>>>> less attractive to my eyes.
>>>>>
>>>>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
>>>>> something broken, just that it was "better" (according to what
>>>>> criteria?).
>>>>
>>>> Better is probably too vague and I agree the "fixing" part is not
>>>> clearly explained in the commit log. The cover-letter however states:
>>>>
>>>>> Patch 2/2: This patch is more complex. My main target was to solve issues
>>>>> I found in [1], but as it was very complex to do it "in place", I choose
>>>>> to rework the composition function.
>>>> ...
>>>>> [1]: https://lore.kernel.org/dri-devel/[email protected]/
>>>>
>>>> If you follow this link you will find all the feedback and especially
>>>> the "broken" parts. Just to be clear, writing bugs is totally expected
>>>> and review/testing is supposed to help on this regard. I am not blaming
>>>> the author in any way, just focusing on getting this code in a more
>>>> readable shape and hopefully reinforce the testing procedure.
>>>>
>>>>> If something is truly broken, it must be stated what exactly is so we
>>>>> can all come up with a solution that will satisfy everyone.
>>>>
>>>> Maybe going through the series pointed above will give more context
>>>> but AFAIU: the YUV composition is not totally right (and the tests used
>>>> to validate it need to be more complex as well in order to fail).
>>>>
>>>> Here is a proposal.
>>>>
>>>> Today's RGB implementation is only optimized in the line-by-line case
>>>> when there is no rotation. The logic is bit convoluted and may possibly
>>>> be slightly clarified with a per-format read_line() implementation,
>>>> at a very light performance cost. Such an improvement would definitely
>>>> benefit to the clarity of the code, especially when transformations
>>>> (especially the rotations) come into play because they would be clearly
>>>> handled differently instead of being "hidden" in the optimized logic.
>>>> Performances would not change much as this path is not optimized today
>>>> anyway (the pixel-oriented logic is already used in the rotation case).
>>>>
>>>> Arthur's YUV implementation is indeed well optimized but the added
>>>> complexity probably lead to small mistakes in the logic. The
>>>> per-format read_line() implementation mentioned above could be
>>>> extended to the YUV format as well, which would leverage Arthur's
>>>> proposal by re-using his optimized version. Louis will help on this
>>>> regard. However, for more complex cases such as when there is a
>>>> rotation, it will be easier (and not sub-optimized compared to the RGB
>>>> case) to also fallback to a pixel-oriented processing.
>>>>
>>>> Would this approach make sense?
>>>
>>> Hi,
>>>
>>> I think it would, if I understand what you mean. Ever since I proposed
>>> a line-by-line algorithm to improve the performance, I was thinking of
>>> per-format read_line() functions that would be selected outside of any
>>> loops. Extending that to support YUV is only natural. I can imagine
>>> rotation complicates things, and I won't oppose that resulting in a
>>> much heavier read_line() implementation used in those cases. They might
>>> perhaps call the original read_line() implementations pixel-by-pixel or
>>> plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates
>>> things even further. That way one could compose any
>>> rotation-format-siting combination by chaining function pointers.
>>
>> I'll let Louis also validate but on my side I feel like I totally
>> agree with your feedback.
>>
>>> I haven't looked at VKMS in a long time, and I am disappointed to find
>>> that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
>>> The reading vfunc should be called with many pixels at a time when the
>>> source FB layout allows it. The whole point of the line-based functions
>>> was that they repeat the innermost loop in every function body to make
>>> the per-pixel overhead as small as possible. The VKMS implementations
>>> benchmarked before and after the original line-based algorithm showed
>>> that calling a function pointer per-pixel is relatively very expensive.
>>> Or maybe it was a switch-case.
>>
>> Indeed, since your initial feedback Louis made a couple of comparisons
>> and the time penalty is between +5% and +60% depending on the case,
>> AFAIR.
>>
>>> Sorry, I didn't realize the optimization had already been lost.
>>
>> No problem, actually I also lost myself in my first answer as I
>> initially thought the (mainline) RGB logic was also broken in edge
>> cases, which it was not, only the YUV logic suffered from some
>> limitations.
>>
>>> Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
>>> it's not composing anything and the name mislead me.
>>
>> Makes sense.
>>
>>> I think if you inspect the compositing code as of revision
>>> 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
>>> what it was supposed to be.
>>
>> Excellent, thanks a lot!
>>
>> Miquèl
>
>
On Tue, 6 Feb 2024 14:57:48 -0300
Arthur Grillo <[email protected]> wrote:
> On 02/02/24 16:45, Pekka Paalanen wrote:
..
> > Would it be possible to have a standardised benchmark specifically
> > for performance rather than correctness, in IGT or where-ever it
> > would make sense? Then it would be simple to tell contributors to
> > run this and report the numbers before and after.
> >
> > I would propose this kind of KMS layout:
> >
> > - CRTC size 3841 x 2161
> > - primary plane, XRGB8888, 3639 x 2161 @ 101,0
> > - overlay A, XBGR2101010, 3033 x 1777 @ 201,199
> > - overlay B, ARGB8888, 1507 x 1400 @ 1800,250
> >
> > The sizes and positions are deliberately odd to try to avoid happy
> > alignment accidents. The planes are big, which should let the pixel
> > operations easily dominate performance measurement. There are
> > different pixel formats, both opaque and semi-transparent. There is
> > lots of plane overlap. The planes also do not cover the whole CRTC
> > leaving the background visible a bit.
> >
> > There should be two FBs per each plane, flipped alternatingly each
> > frame. Writeback should be active. Run this a number of frames, say,
> > 100, and measure the kernel CPU time taken. It's supposed to take at
> > least several seconds in total.
> >
> > I think something like this should be the base benchmark. One can
> > add more to it, like rotated planes, YUV planes, etc. or switch
> > settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe
> > one more overlay that is very tall and thin.
> >
> > Just an idea, what do you all think?
>
> Hi Pekka,
>
> I just finished writing this proposal using IGT.
>
> I got pretty interesting results:
>
> The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took
> around 13 seconds. While drm-misc/drm-misc-next took 36 seconds.
>
> I'm currently bisecting to be certain that the change to the
> pixel-by-pixel is the culprit, but I don't see why it wouldn't be.
>
> I just need to do some final touches on the benchmark code and it
> will be ready for revision.
Awesome, thank you very much for doing that!
pq
Hello Pekka, Arthur, Maxime,
> > > >>>>>> Change the composition algorithm to iterate over pixels instead of lines.
> > > >>>>>> It allows a simpler management of rotation and pixel access for complex formats.
> > > >>>>>>
> > > >>>>>> This new algorithm allows read_pixel function to have access to x/y
> > > >>>>>> coordinates and make it possible to read the correct thing in a block
> > > >>>>>> when block_w and block_h are not 1.
> > > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
> > > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
> > > >>>>>> anymore to have misterious switch-case distributed in multiple places.
> > > >>>>>
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> there was a very good reason to write this code using lines:
> > > >>>>> performance. Before lines, it was indeed operating on individual pixels.
> > > >>>>>
> > > >>>>> Please, include performance measurements before and after this series
> > > >>>>> to quantify the impact on the previously already supported pixel
> > > >>>>> formats, particularly the 32-bit-per-pixel RGB variants.
> > > >>>>>
> > > >>>>> VKMS will be used more and more in CI for userspace projects, and
> > > >>>>> performance actually matters there.
> > > >>>>>
> > > >>>>> I'm worrying that this performance degradation here is significant. I
> > > >>>>> believe it is possible to keep blending with lines, if you add new line
> > > >>>>> getters for reading from rotated, sub-sampled etc. images. That way you
> > > >>>>> don't have to regress the most common formats' performance.
I tested, and yes, it's significant for most of the tests. None of them
timed out on my machine, but I agree that I have to improve this. Do you
know which tests are the more "heavy"?
> > > >>>> While I understand performance is important and should be taken into
> > > >>>> account seriously, I cannot understand how broken testing could be
> > > >>>> considered better. Fast but inaccurate will always be significantly
> > > >>>> less attractive to my eyes.
> > > >>>
> > > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
> > > >>> something broken, just that it was "better" (according to what
> > > >>> criteria?).
Sorry Maxime for this little missunderstanding, I will improve the commit
message and cover letter for the v2.
> > > >> Today's RGB implementation is only optimized in the line-by-line case
> > > >> when there is no rotation. The logic is bit convoluted and may possibly
> > > >> be slightly clarified with a per-format read_line() implementation,
> > > >> at a very light performance cost. Such an improvement would definitely
> > > >> benefit to the clarity of the code, especially when transformations
> > > >> (especially the rotations) come into play because they would be clearly
> > > >> handled differently instead of being "hidden" in the optimized logic.
> > > >> Performances would not change much as this path is not optimized today
> > > >> anyway (the pixel-oriented logic is already used in the rotation case).
[...]
> > > > I think it would, if I understand what you mean. Ever since I proposed
> > > > a line-by-line algorithm to improve the performance, I was thinking of
> > > > per-format read_line() functions that would be selected outside of any
> > > > loops.
[...]
> > > > I haven't looked at VKMS in a long time, and I am disappointed to find
> > > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> > > > The reading vfunc should be called with many pixels at a time when the
> > > > source FB layout allows it. The whole point of the line-based functions
> > > > was that they repeat the innermost loop in every function body to make
> > > > the per-pixel overhead as small as possible. The VKMS implementations
> > > > benchmarked before and after the original line-based algorithm showed
> > > > that calling a function pointer per-pixel is relatively very expensive.
> > > > Or maybe it was a switch-case.
[...]
> > > But, I agree with Miquel that the rotation logic is easier to implement
> > > in a pixel-based way. So going pixel-by-pixel only when rotation occurs
> > > would be great.
> >
> > Yes, and I think that can very well be done in the line-based framework
> > still that existed in the old days before any rotation support was
> > added. Essentially a plug-in line-getter function that then calls a
> > format-specific line-getter pixel-by-pixel while applying the rotation.
> > It would be simple, it would leave unrotated performance unharmed (use
> > format-specific line-getter directly with lines), but it might be
> > somewhat less performant for rotated KMS planes. I suspect that might
> > be a good compromise.
> >
> > Format-specific line-getters could also be parameterized by
> > pixel-to-pixel offset in bytes. Then they could directly traverse FB
> > rows forward and backward, and even FB columns. It may or may not have
> > a penalty compared to the original line-getters, so it would have to
> > be benchmarked.
>
> Oh, actually, since the byte offset depends on format, it might be
> better to parametrize by direction and compute the offset in the
> format-specific line-getter before the loop.
>
I'm currently working on this implementation. The algorithm would look
like:
void blend(...) {
for(int y = 0; y < height; y++) {
for(int plane = 0; plane < nb_planes; plane++) {
if(planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) {
[...] /* Small common logic to manage REFLECT_X/Y and translations */
planes[plane].read_line(....);
} else {
[...] /* v1 of my patch, pixel by pixel read */
}
}
}
}
where read_line is:
void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dts[])
- y is the line to read (so the caller need to compute the correct offset)
- x_start/x_stop are the start and stop column, but they may be not
ordered properly (i.e DRM_REFLECT_X will make x_start greater than
x_stop)
- src/dst are source and destination buffers
This way:
- It's simple to read for the general case (usage of drm_rect_* instead of
manually rewriting the logic)
- Each pixel format can be quickly implemented with "pixel-by-pixel"
methods
- The performances should be good if no rotation is implied for some
formats
I also created some helpers for conversions between formats to avoid code
duplication between pixel and line algorithms (and also between argb and
xrgb variants).
The only flaw with this, is that most of the read_line functions will
look like:
void read_line(...) {
int increment = x_start < x_stop ? 1: -1;
while(x_start != x_stop) {
out += 1;
[...] /* color conversion */
x_start += increment;
}
}
But as Pekka explained, it's probably the most efficient way to do it.
Is there a way to save the output of vkms to a folder (something like
"one image per frame")? It's a bit complex to debug graphics without
seeing them.
I have something which (I think) should work, but some tests are failing
and I don't find why in my code (I don't find the reason why the they are
failing and the hexdump I added to debug seems normal).
I think my issue is a simple mistake in the "translation managment" or
maybe just a wrong offset, but I don't see my error in the code. I think a
quick look on the final image would help me a lot.
[...]
Have a nice day,
Louis Chauvet
--
Louis Chauvet, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Hello Pekka, Arthur,
[...]
> > > Would it be possible to have a standardised benchmark specifically
> > > for performance rather than correctness, in IGT or where-ever it
> > > would make sense? Then it would be simple to tell contributors to
> > > run this and report the numbers before and after.
> > >
> > > I would propose this kind of KMS layout:
> > >
> > > - CRTC size 3841 x 2161
> > > - primary plane, XRGB8888, 3639 x 2161 @ 101,0
> > > - overlay A, XBGR2101010, 3033 x 1777 @ 201,199
> > > - overlay B, ARGB8888, 1507 x 1400 @ 1800,250
> > >
> > > The sizes and positions are deliberately odd to try to avoid happy
> > > alignment accidents. The planes are big, which should let the pixel
> > > operations easily dominate performance measurement. There are
> > > different pixel formats, both opaque and semi-transparent. There is
> > > lots of plane overlap. The planes also do not cover the whole CRTC
> > > leaving the background visible a bit.
> > >
> > > There should be two FBs per each plane, flipped alternatingly each
> > > frame. Writeback should be active. Run this a number of frames, say,
> > > 100, and measure the kernel CPU time taken. It's supposed to take at
> > > least several seconds in total.
> > >
> > > I think something like this should be the base benchmark. One can
> > > add more to it, like rotated planes, YUV planes, etc. or switch
> > > settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe
> > > one more overlay that is very tall and thin.
> > >
> > > Just an idea, what do you all think?
> >
> > Hi Pekka,
> >
> > I just finished writing this proposal using IGT.
> >
> > I got pretty interesting results:
> >
> > The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took
> > around 13 seconds. While drm-misc/drm-misc-next took 36 seconds.
> >
> > I'm currently bisecting to be certain that the change to the
> > pixel-by-pixel is the culprit, but I don't see why it wouldn't be.
> >
> > I just need to do some final touches on the benchmark code and it
> > will be ready for revision.
>
> Awesome, thank you very much for doing that!
> pq
I also think it's a good benchmarks for classic configurations. The odd
size is a very nice idea to verify the corner cases of line-by-line
algorithms.
When this is ready, please share the test, so I can check if my patch is
as performant as before.
Thank you for this work.
Have a nice day,
Louis Chauvet
--
Louis Chauvet, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
On 07/02/24 13:03, Louis Chauvet wrote:
> Hello Pekka, Arthur,
>
> [...]
>
>>>> Would it be possible to have a standardised benchmark specifically
>>>> for performance rather than correctness, in IGT or where-ever it
>>>> would make sense? Then it would be simple to tell contributors to
>>>> run this and report the numbers before and after.
>>>>
>>>> I would propose this kind of KMS layout:
>>>>
>>>> - CRTC size 3841 x 2161
>>>> - primary plane, XRGB8888, 3639 x 2161 @ 101,0
>>>> - overlay A, XBGR2101010, 3033 x 1777 @ 201,199
>>>> - overlay B, ARGB8888, 1507 x 1400 @ 1800,250
>>>>
>>>> The sizes and positions are deliberately odd to try to avoid happy
>>>> alignment accidents. The planes are big, which should let the pixel
>>>> operations easily dominate performance measurement. There are
>>>> different pixel formats, both opaque and semi-transparent. There is
>>>> lots of plane overlap. The planes also do not cover the whole CRTC
>>>> leaving the background visible a bit.
>>>>
>>>> There should be two FBs per each plane, flipped alternatingly each
>>>> frame. Writeback should be active. Run this a number of frames, say,
>>>> 100, and measure the kernel CPU time taken. It's supposed to take at
>>>> least several seconds in total.
>>>>
>>>> I think something like this should be the base benchmark. One can
>>>> add more to it, like rotated planes, YUV planes, etc. or switch
>>>> settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe
>>>> one more overlay that is very tall and thin.
>>>>
>>>> Just an idea, what do you all think?
>>>
>>> Hi Pekka,
>>>
>>> I just finished writing this proposal using IGT.
>>>
>>> I got pretty interesting results:
>>>
>>> The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took
>>> around 13 seconds. While drm-misc/drm-misc-next took 36 seconds.
>>>
>>> I'm currently bisecting to be certain that the change to the
>>> pixel-by-pixel is the culprit, but I don't see why it wouldn't be.
>>>
>>> I just need to do some final touches on the benchmark code and it
>>> will be ready for revision.
>>
>> Awesome, thank you very much for doing that!
>> pq
>
> I also think it's a good benchmarks for classic configurations. The odd
> size is a very nice idea to verify the corner cases of line-by-line
> algorithms.
>
> When this is ready, please share the test, so I can check if my patch is
> as performant as before.
>
> Thank you for this work.
>
> Have a nice day,
> Louis Chauvet
>
Just sent the benchmark for revision:
https://lore.kernel.org/r/[email protected]
On Wed, 7 Feb 2024 16:49:56 +0100
Louis Chauvet <[email protected]> wrote:
> Hello Pekka, Arthur, Maxime,
Hi all
> > > > >>>>>> Change the composition algorithm to iterate over pixels instead of lines.
> > > > >>>>>> It allows a simpler management of rotation and pixel access for complex formats.
> > > > >>>>>>
> > > > >>>>>> This new algorithm allows read_pixel function to have access to x/y
> > > > >>>>>> coordinates and make it possible to read the correct thing in a block
> > > > >>>>>> when block_w and block_h are not 1.
> > > > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
> > > > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
> > > > >>>>>> anymore to have misterious switch-case distributed in multiple places.
> > > > >>>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> there was a very good reason to write this code using lines:
> > > > >>>>> performance. Before lines, it was indeed operating on individual pixels.
> > > > >>>>>
> > > > >>>>> Please, include performance measurements before and after this series
> > > > >>>>> to quantify the impact on the previously already supported pixel
> > > > >>>>> formats, particularly the 32-bit-per-pixel RGB variants.
> > > > >>>>>
> > > > >>>>> VKMS will be used more and more in CI for userspace projects, and
> > > > >>>>> performance actually matters there.
> > > > >>>>>
> > > > >>>>> I'm worrying that this performance degradation here is significant. I
> > > > >>>>> believe it is possible to keep blending with lines, if you add new line
> > > > >>>>> getters for reading from rotated, sub-sampled etc. images. That way you
> > > > >>>>> don't have to regress the most common formats' performance.
>
> I tested, and yes, it's significant for most of the tests. None of them
> timed out on my machine, but I agree that I have to improve this. Do you
> know which tests are the more "heavy"?
I don't, but considering that various userspace projects (e.g. Wayland
compositors) want to use VKMS more and more in their own CI, looking
only at IGT is not enough. Every second saved per run is a tiny bit of
data center energy saved, or developers waiting less for results.
I do have some expectations that for each KMS property, Wayland
compositors tend to use the "normal" property value more than any other
value. So if you test different pixel formats, you probably set
rotation to normal, since it's completely orthogonal in userspace. And
then you would test different rotations with just one pixel format.
At least I would personally leave it to IGT to test all the possible
combinations of pixel formats + rotations + odd sizes + odd positions.
Wayland compositor CI wants to test the compositor internals, not VKMS
internals.
> > > > >>>> While I understand performance is important and should be taken into
> > > > >>>> account seriously, I cannot understand how broken testing could be
> > > > >>>> considered better. Fast but inaccurate will always be significantly
> > > > >>>> less attractive to my eyes.
> > > > >>>
> > > > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
> > > > >>> something broken, just that it was "better" (according to what
> > > > >>> criteria?).
>
> Sorry Maxime for this little missunderstanding, I will improve the commit
> message and cover letter for the v2.
>
> > > > >> Today's RGB implementation is only optimized in the line-by-line case
> > > > >> when there is no rotation. The logic is bit convoluted and may possibly
> > > > >> be slightly clarified with a per-format read_line() implementation,
> > > > >> at a very light performance cost. Such an improvement would definitely
> > > > >> benefit to the clarity of the code, especially when transformations
> > > > >> (especially the rotations) come into play because they would be clearly
> > > > >> handled differently instead of being "hidden" in the optimized logic.
> > > > >> Performances would not change much as this path is not optimized today
> > > > >> anyway (the pixel-oriented logic is already used in the rotation case).
>
> [...]
>
> > > > > I think it would, if I understand what you mean. Ever since I proposed
> > > > > a line-by-line algorithm to improve the performance, I was thinking of
> > > > > per-format read_line() functions that would be selected outside of any
> > > > > loops.
>
> [...]
>
> > > > > I haven't looked at VKMS in a long time, and I am disappointed to find
> > > > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
> > > > > The reading vfunc should be called with many pixels at a time when the
> > > > > source FB layout allows it. The whole point of the line-based functions
> > > > > was that they repeat the innermost loop in every function body to make
> > > > > the per-pixel overhead as small as possible. The VKMS implementations
> > > > > benchmarked before and after the original line-based algorithm showed
> > > > > that calling a function pointer per-pixel is relatively very expensive.
> > > > > Or maybe it was a switch-case.
>
> [...]
>
> > > > But, I agree with Miquel that the rotation logic is easier to implement
> > > > in a pixel-based way. So going pixel-by-pixel only when rotation occurs
> > > > would be great.
> > >
> > > Yes, and I think that can very well be done in the line-based framework
> > > still that existed in the old days before any rotation support was
> > > added. Essentially a plug-in line-getter function that then calls a
> > > format-specific line-getter pixel-by-pixel while applying the rotation.
> > > It would be simple, it would leave unrotated performance unharmed (use
> > > format-specific line-getter directly with lines), but it might be
> > > somewhat less performant for rotated KMS planes. I suspect that might
> > > be a good compromise.
> > >
> > > Format-specific line-getters could also be parameterized by
> > > pixel-to-pixel offset in bytes. Then they could directly traverse FB
> > > rows forward and backward, and even FB columns. It may or may not have
> > > a penalty compared to the original line-getters, so it would have to
> > > be benchmarked.
> >
> > Oh, actually, since the byte offset depends on format, it might be
> > better to parametrize by direction and compute the offset in the
> > format-specific line-getter before the loop.
> >
>
> I'm currently working on this implementation. The algorithm would look
> like:
>
> void blend(...) {
> for(int y = 0; y < height; y++) {
> for(int plane = 0; plane < nb_planes; plane++) {
> if(planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) {
I would try to drop the rotation check here completely. Instead, when
choosing the function pointer to call here, outside of *all* loops, you
would check the rotation property. If rotation is a no-op, pick the
read_line function directly. If rotation/reflection is needed, pick a
rotation function that will then call read_line function pixel-by-pixel.
So planes[plane] would have two vfuncs, one with a plain read_line that
assumes normal orientation and can return a line of arbitrary length
from arbitrary x,y position, and another vfunc that this loop here will
call which is either some rotation handling function or just the same
function as the first vfunc.
The two function pointers might well need different signatures, meaning
you need a simple wrapper for the rotation=normal case too.
I believe that could result in cleaner code.
> [...] /* Small common logic to manage REFLECT_X/Y and translations */
> planes[plane].read_line(....);
> } else {
> [...] /* v1 of my patch, pixel by pixel read */
> }
> }
> }
> }
>
> where read_line is:
> void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dts[])
> - y is the line to read (so the caller need to compute the correct offset)
> - x_start/x_stop are the start and stop column, but they may be not
> ordered properly (i.e DRM_REFLECT_X will make x_start greater than
> x_stop)
> - src/dst are source and destination buffers
This sounds ok. An alternative would be something like
enum direction {
RIGHT,
LEFT,
UP,
DOWN,
};
void read_line(frame_info *src, int start_x, int start_y, enum direction dir,
int count_pixels, pixel_argb16 *dst);
Based on dir, before the inner loop this function would compute the
byte offset between the pixels to be read. If the format is multiplanar
YUV, it can compute the offset per plane. And the starting pointers per
pixel plane, of course, and one end pointer for the loop stop condition
maybe from dst.
This might make all the other directions than RIGHT much faster than
calling read_line one pixel at a time to achieve the same.
Would need to benchmark if this is significantly slower than your
suggestion for dir=RIGHT, though. If it's roughly the same, then it
would probably be worth it.
> This way:
> - It's simple to read for the general case (usage of drm_rect_* instead of
> manually rewriting the logic)
> - Each pixel format can be quickly implemented with "pixel-by-pixel"
> methods
> - The performances should be good if no rotation is implied for some
> formats
>
> I also created some helpers for conversions between formats to avoid code
> duplication between pixel and line algorithms (and also between argb and
> xrgb variants).
>
> The only flaw with this, is that most of the read_line functions will
> look like:
>
> void read_line(...) {
> int increment = x_start < x_stop ? 1: -1;
> while(x_start != x_stop) {
> out += 1;
> [...] /* color conversion */
> x_start += increment;
> }
> }
>
> But as Pekka explained, it's probably the most efficient way to do it.
Yes, I expect them to look roughly like that. It's necessary for moving
as much of the setup computations and real function calls out of the
inner-most loop as possible. The middle (over KMS planes) and outer
(over y) loops are less sensitive to wasted cycles on redundant
computations.
> Is there a way to save the output of vkms to a folder (something like
> "one image per frame")? It's a bit complex to debug graphics without
> seeing them.
>
> I have something which (I think) should work, but some tests are failing
> and I don't find why in my code (I don't find the reason why the they are
> failing and the hexdump I added to debug seems normal).
>
> I think my issue is a simple mistake in the "translation managment" or
> maybe just a wrong offset, but I don't see my error in the code. I think a
> quick look on the final image would help me a lot.
I don't know anything about the kernel unit testing frameworks, maybe
they could help?
But if you drive the test through UAPI from userspace, use a writeback
connector to fetch the resulting image. I'm sure IGT has code doing
that somewhere, although many IGT tests rely on CRC instead because CRC
is more widely available in hardware.
Arthur's new benchmark seems to be using writeback, you just need to
make it save to file:
https://lore.kernel.org/all/[email protected]/
Thanks,
pq
On 08/02/24 06:39, Pekka Paalanen wrote:
> On Wed, 7 Feb 2024 16:49:56 +0100
> Louis Chauvet <[email protected]> wrote:
>
>> Hello Pekka, Arthur, Maxime,
>
> Hi all
>
>>>>>>>>>>> Change the composition algorithm to iterate over pixels instead of lines.
>>>>>>>>>>> It allows a simpler management of rotation and pixel access for complex formats.
>>>>>>>>>>>
>>>>>>>>>>> This new algorithm allows read_pixel function to have access to x/y
>>>>>>>>>>> coordinates and make it possible to read the correct thing in a block
>>>>>>>>>>> when block_w and block_h are not 1.
>>>>>>>>>>> The iteration pixel-by-pixel in the same method also allows a simpler
>>>>>>>>>>> management of rotation with drm_rect_* helpers. This way it's not needed
>>>>>>>>>>> anymore to have misterious switch-case distributed in multiple places.
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> there was a very good reason to write this code using lines:
>>>>>>>>>> performance. Before lines, it was indeed operating on individual pixels.
>>>>>>>>>>
>>>>>>>>>> Please, include performance measurements before and after this series
>>>>>>>>>> to quantify the impact on the previously already supported pixel
>>>>>>>>>> formats, particularly the 32-bit-per-pixel RGB variants.
>>>>>>>>>>
>>>>>>>>>> VKMS will be used more and more in CI for userspace projects, and
>>>>>>>>>> performance actually matters there.
>>>>>>>>>>
>>>>>>>>>> I'm worrying that this performance degradation here is significant. I
>>>>>>>>>> believe it is possible to keep blending with lines, if you add new line
>>>>>>>>>> getters for reading from rotated, sub-sampled etc. images. That way you
>>>>>>>>>> don't have to regress the most common formats' performance.
>>
>> I tested, and yes, it's significant for most of the tests. None of them
>> timed out on my machine, but I agree that I have to improve this. Do you
>> know which tests are the more "heavy"?
>
> I don't, but considering that various userspace projects (e.g. Wayland
> compositors) want to use VKMS more and more in their own CI, looking
> only at IGT is not enough. Every second saved per run is a tiny bit of
> data center energy saved, or developers waiting less for results.
>
> I do have some expectations that for each KMS property, Wayland
> compositors tend to use the "normal" property value more than any other
> value. So if you test different pixel formats, you probably set
> rotation to normal, since it's completely orthogonal in userspace. And
> then you would test different rotations with just one pixel format.
>
> At least I would personally leave it to IGT to test all the possible
> combinations of pixel formats + rotations + odd sizes + odd positions.
> Wayland compositor CI wants to test the compositor internals, not VKMS
> internals.
>
>>>>>>>>> While I understand performance is important and should be taken into
>>>>>>>>> account seriously, I cannot understand how broken testing could be
>>>>>>>>> considered better. Fast but inaccurate will always be significantly
>>>>>>>>> less attractive to my eyes.
>>>>>>>>
>>>>>>>> AFAIK, neither the cover letter nor the commit log claimed it was fixing
>>>>>>>> something broken, just that it was "better" (according to what
>>>>>>>> criteria?).
>>
>> Sorry Maxime for this little missunderstanding, I will improve the commit
>> message and cover letter for the v2.
>>
>>>>>>> Today's RGB implementation is only optimized in the line-by-line case
>>>>>>> when there is no rotation. The logic is bit convoluted and may possibly
>>>>>>> be slightly clarified with a per-format read_line() implementation,
>>>>>>> at a very light performance cost. Such an improvement would definitely
>>>>>>> benefit to the clarity of the code, especially when transformations
>>>>>>> (especially the rotations) come into play because they would be clearly
>>>>>>> handled differently instead of being "hidden" in the optimized logic.
>>>>>>> Performances would not change much as this path is not optimized today
>>>>>>> anyway (the pixel-oriented logic is already used in the rotation case).
>>
>> [...]
>>
>>>>>> I think it would, if I understand what you mean. Ever since I proposed
>>>>>> a line-by-line algorithm to improve the performance, I was thinking of
>>>>>> per-format read_line() functions that would be selected outside of any
>>>>>> loops.
>>
>> [...]
>>
>>>>>> I haven't looked at VKMS in a long time, and I am disappointed to find
>>>>>> that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel.
>>>>>> The reading vfunc should be called with many pixels at a time when the
>>>>>> source FB layout allows it. The whole point of the line-based functions
>>>>>> was that they repeat the innermost loop in every function body to make
>>>>>> the per-pixel overhead as small as possible. The VKMS implementations
>>>>>> benchmarked before and after the original line-based algorithm showed
>>>>>> that calling a function pointer per-pixel is relatively very expensive.
>>>>>> Or maybe it was a switch-case.
>>
>> [...]
>>
>>>>> But, I agree with Miquel that the rotation logic is easier to implement
>>>>> in a pixel-based way. So going pixel-by-pixel only when rotation occurs
>>>>> would be great.
>>>>
>>>> Yes, and I think that can very well be done in the line-based framework
>>>> still that existed in the old days before any rotation support was
>>>> added. Essentially a plug-in line-getter function that then calls a
>>>> format-specific line-getter pixel-by-pixel while applying the rotation.
>>>> It would be simple, it would leave unrotated performance unharmed (use
>>>> format-specific line-getter directly with lines), but it might be
>>>> somewhat less performant for rotated KMS planes. I suspect that might
>>>> be a good compromise.
>>>>
>>>> Format-specific line-getters could also be parameterized by
>>>> pixel-to-pixel offset in bytes. Then they could directly traverse FB
>>>> rows forward and backward, and even FB columns. It may or may not have
>>>> a penalty compared to the original line-getters, so it would have to
>>>> be benchmarked.
>>>
>>> Oh, actually, since the byte offset depends on format, it might be
>>> better to parametrize by direction and compute the offset in the
>>> format-specific line-getter before the loop.
>>>
>>
>> I'm currently working on this implementation. The algorithm would look
>> like:
>>
>> void blend(...) {
>> for(int y = 0; y < height; y++) {
>> for(int plane = 0; plane < nb_planes; plane++) {
>> if(planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) {
>
> I would try to drop the rotation check here completely. Instead, when
> choosing the function pointer to call here, outside of *all* loops, you
> would check the rotation property. If rotation is a no-op, pick the
> read_line function directly. If rotation/reflection is needed, pick a
> rotation function that will then call read_line function pixel-by-pixel.
>
> So planes[plane] would have two vfuncs, one with a plain read_line that
> assumes normal orientation and can return a line of arbitrary length
> from arbitrary x,y position, and another vfunc that this loop here will
> call which is either some rotation handling function or just the same
> function as the first vfunc.
>
> The two function pointers might well need different signatures, meaning
> you need a simple wrapper for the rotation=normal case too.
>
> I believe that could result in cleaner code.
>
>> [...] /* Small common logic to manage REFLECT_X/Y and translations */
>> planes[plane].read_line(....);
>> } else {
>> [...] /* v1 of my patch, pixel by pixel read */
>> }
>> }
>> }
>> }
>>
>> where read_line is:
>> void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dts[])
>> - y is the line to read (so the caller need to compute the correct offset)
>> - x_start/x_stop are the start and stop column, but they may be not
>> ordered properly (i.e DRM_REFLECT_X will make x_start greater than
>> x_stop)
>> - src/dst are source and destination buffers
>
> This sounds ok. An alternative would be something like
>
> enum direction {
> RIGHT,
> LEFT,
> UP,
> DOWN,
> };
>
> void read_line(frame_info *src, int start_x, int start_y, enum direction dir,
> int count_pixels, pixel_argb16 *dst);
>
> Based on dir, before the inner loop this function would compute the
> byte offset between the pixels to be read. If the format is multiplanar
> YUV, it can compute the offset per plane. And the starting pointers per
> pixel plane, of course, and one end pointer for the loop stop condition
> maybe from dst.
>
> This might make all the other directions than RIGHT much faster than
> calling read_line one pixel at a time to achieve the same.
>
> Would need to benchmark if this is significantly slower than your
> suggestion for dir=RIGHT, though. If it's roughly the same, then it
> would probably be worth it.
>
>
>> This way:
>> - It's simple to read for the general case (usage of drm_rect_* instead of
>> manually rewriting the logic)
>> - Each pixel format can be quickly implemented with "pixel-by-pixel"
>> methods
>> - The performances should be good if no rotation is implied for some
>> formats
>>
>> I also created some helpers for conversions between formats to avoid code
>> duplication between pixel and line algorithms (and also between argb and
>> xrgb variants).
>>
>> The only flaw with this, is that most of the read_line functions will
>> look like:
>>
>> void read_line(...) {
>> int increment = x_start < x_stop ? 1: -1;
>> while(x_start != x_stop) {
>> out += 1;
>> [...] /* color conversion */
>> x_start += increment;
>> }
>> }
>>
>> But as Pekka explained, it's probably the most efficient way to do it.
>
> Yes, I expect them to look roughly like that. It's necessary for moving
> as much of the setup computations and real function calls out of the
> inner-most loop as possible. The middle (over KMS planes) and outer
> (over y) loops are less sensitive to wasted cycles on redundant
> computations.
>
>> Is there a way to save the output of vkms to a folder (something like
>> "one image per frame")? It's a bit complex to debug graphics without
>> seeing them.
>>
>> I have something which (I think) should work, but some tests are failing
>> and I don't find why in my code (I don't find the reason why the they are
>> failing and the hexdump I added to debug seems normal).
>>
>> I think my issue is a simple mistake in the "translation managment" or
>> maybe just a wrong offset, but I don't see my error in the code. I think a
>> quick look on the final image would help me a lot.
>
> I don't know anything about the kernel unit testing frameworks, maybe
> they could help?
>
> But if you drive the test through UAPI from userspace, use a writeback
> connector to fetch the resulting image. I'm sure IGT has code doing
> that somewhere, although many IGT tests rely on CRC instead because CRC
> is more widely available in hardware.
>
> Arthur's new benchmark seems to be using writeback, you just need to
> make it save to file:
> https://lore.kernel.org/all/[email protected]/
Hi,
I just found that the IGT's test kms_writeback has a flag '--dump' that
does that, maybe you can use it or copy the logic to the benchmark.
Best Regards,
~Arthur Grillo
>
>
> Thanks,
> pq