Subject: [PATCH 0/6] Refactor the vkms to accept new formats

Summary
=======
This series of patches refactor some vkms components in order to introduce
new formats to the planes and writeback connector.

Now in the blend function, the plane's pixels are converted to ARGB16161616
and then blended together.

The CRC is calculated based on the ARGB1616161616 buffer. And if required,
this buffer is copied/converted to the writeback buffer format.

And to handle the pixel conversion, new functions were added to convert
from a specific format to ARGB16161616 (the reciprocal is also true).

Tests
=====
This patch series was tested using the following igt tests:
-t ".*kms_plane.*"
-t ".*kms_writeback.*"
-t ".*kms_cursor_crc*"

New tests passing
-------------------
- pipe-A-cursor-size-change
- pipe-A-cursor-alpha-transparent

Tests and Performance Regressions
-------------------------------------
This pack of tests is failing:
- pipe-A-cursor-*-rapid-movement

This is expected since there are more operations per pixel than before.
And consequently, the compositing is way slower than before.

My micro-profiling shows these ranges to the completion of each
compositing in the .*kms_cursor_crc.* tests:

| Frametime |
|:-------:|:--------:|
| before | after |
| 8~20 ms | 32~56 ms |

Hence, further optimizations will be required.

Writeback test
---------------
During the development of this patch series, I discovered that the
writeback-check-output test wasn't filling the plane correctly.

So, currently, this patch series is failing in this test. But I sent a
patch to igt to fix it[1].

XRGB to ARGB behavior
=================
During the development, I decided to always fill the alpha channel of
the output pixel whenever the conversion from a format without an alpha
channel to ARGB16161616 is necessary. Therefore, I ignore the value
received from the XRGB and overwrite the value with 0xFFFF.

My question is, is this behavior acceptable?

[1] https://lists.freedesktop.org/archives/igt-dev/2021-October/036125.html

Igor Matheus Andrade Torrente (6):
drm: vkms: Replace the deprecated drm_mode_config_init
drm: vkms: Alloc the compose frame using vzalloc
drm: vkms: Replace hardcoded value of `vkms_composer.map` to
DRM_FORMAT_MAX_PLANES
drm: vkms: Add fb information to `vkms_writeback_job`
drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple
formats
drm: vkms: Refactor the plane composer to accept new formats

drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++------------
drivers/gpu/drm/vkms/vkms_drv.c | 5 +-
drivers/gpu/drm/vkms/vkms_drv.h | 12 +-
drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
drivers/gpu/drm/vkms/vkms_writeback.c | 27 ++-
5 files changed, 304 insertions(+), 140 deletions(-)
create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h

--
2.30.2


Subject: [PATCH 1/6] drm: vkms: Replace the deprecated drm_mode_config_init

The `drm_mode_config_init` was deprecated since c3b790e commit, and it's
being replaced by the `drmm_mode_config_init`.

Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
---
drivers/gpu/drm/vkms/vkms_drv.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_drv.c b/drivers/gpu/drm/vkms/vkms_drv.c
index 0ffe5f0e33f7..828868920494 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.c
+++ b/drivers/gpu/drm/vkms/vkms_drv.c
@@ -140,8 +140,11 @@ static const struct drm_mode_config_helper_funcs vkms_mode_config_helpers = {
static int vkms_modeset_init(struct vkms_device *vkmsdev)
{
struct drm_device *dev = &vkmsdev->drm;
+ int ret = drmm_mode_config_init(dev);
+
+ if (ret < 0)
+ return ret;

- drm_mode_config_init(dev);
dev->mode_config.funcs = &vkms_mode_funcs;
dev->mode_config.min_width = XRES_MIN;
dev->mode_config.min_height = YRES_MIN;
--
2.30.2

Subject: [PATCH 2/6] drm: vkms: Alloc the compose frame using vzalloc

Currently, the memory to the composition frame is being allocated using
the kzmalloc. This comes with the limitation of maximum size of one
page size(which in the x86_64 is 4Kb and 4MB for default and hugepage
respectively).

Somes test of igt (e.g. kms_plane@pixel-format) uses more than 4MB when
testing some pixel formats like ARGB16161616.

This problem is addessed by allocating the memory using kvzalloc that
circunvents this limitation.

Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
---
drivers/gpu/drm/vkms/vkms_composer.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 9e8204be9a14..82f79e508f81 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -180,7 +180,7 @@ static int compose_active_planes(void **vaddr_out,
int i;

if (!*vaddr_out) {
- *vaddr_out = kzalloc(gem_obj->size, GFP_KERNEL);
+ *vaddr_out = kvzalloc(gem_obj->size, GFP_KERNEL);
if (!*vaddr_out) {
DRM_ERROR("Cannot allocate memory for output frame.");
return -ENOMEM;
@@ -263,7 +263,7 @@ void vkms_composer_worker(struct work_struct *work)
crtc_state);
if (ret) {
if (ret == -EINVAL && !wb_pending)
- kfree(vaddr_out);
+ kvfree(vaddr_out);
return;
}

@@ -275,7 +275,7 @@ void vkms_composer_worker(struct work_struct *work)
crtc_state->wb_pending = false;
spin_unlock_irq(&out->composer_lock);
} else {
- kfree(vaddr_out);
+ kvfree(vaddr_out);
}

/*
--
2.30.2

Subject: [PATCH 3/6] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES

The `map` vector at `vkms_composer` uses a hardcoded value to define its
size.

If someday the maximum number of planes increases, this hardcoded value
can be a problem.

This value is being replaced with the DRM_FORMAT_MAX_PLANES macro.

Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
---
drivers/gpu/drm/vkms/vkms_drv.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index d48c23d40ce5..64e62993b06f 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -28,7 +28,7 @@ struct vkms_writeback_job {
struct vkms_composer {
struct drm_framebuffer fb;
struct drm_rect src, dst;
- struct dma_buf_map map[4];
+ struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
unsigned int offset;
unsigned int pitch;
unsigned int cpp;
--
2.30.2

Subject: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Currently, the vkms atomic check only goes through the first position of
the `vkms_wb_formats` vector.

This change prepares the atomic_check to check the entire vector.

Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
---
drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
index 5a3e12f105dc..56978f499203 100644
--- a/drivers/gpu/drm/vkms/vkms_writeback.c
+++ b/drivers/gpu/drm/vkms/vkms_writeback.c
@@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
{
struct drm_framebuffer *fb;
const struct drm_display_mode *mode = &crtc_state->mode;
+ bool format_supported = false;
+ int i;

if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
return 0;
@@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
return -EINVAL;
}

- if (fb->format->format != vkms_wb_formats[0]) {
+ for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
+ if (fb->format->format == vkms_wb_formats[i]) {
+ format_supported = true;
+ break;
+ }
+ }
+
+ if (!format_supported) {
DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
&fb->format->format);
return -EINVAL;
--
2.30.2

Subject: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Currently the blend function only accepts XRGB_8888 and ARGB_8888
as a color input.

This patch refactors all the functions related to the plane composition
to overcome this limitation.

Now the blend function receives a format handler to each plane and a
blend function pointer. It will take two ARGB_1616161616 pixels, one
for each handler, and will use the blend function to calculate and
store the final color in the output buffer.

These format handlers will receive the `vkms_composer` and a pair of
coordinates. And they should return the respective pixel in the
ARGB_16161616 format.

The blend function will receive two ARGB_16161616 pixels, x, y, and
the vkms_composer of the output buffer. The method should perform the
blend operation and store output to the format aforementioned
ARGB_16161616.

Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
---
drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
2 files changed, 271 insertions(+), 129 deletions(-)
create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 82f79e508f81..1e7c10c02a52 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -9,18 +9,28 @@
#include <drm/drm_vblank.h>

#include "vkms_drv.h"
-
-static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
- const struct vkms_composer *composer)
-{
- u32 pixel;
- int src_offset = composer->offset + (y * composer->pitch)
- + (x * composer->cpp);
-
- pixel = *(u32 *)&buffer[src_offset];
-
- return pixel;
-}
+#include "vkms_formats.h"
+
+#define get_output_vkms_composer(buffer_pointer, composer) \
+ ((struct vkms_composer) { \
+ .fb = (struct drm_framebuffer) { \
+ .format = &(struct drm_format_info) { \
+ .format = DRM_FORMAT_ARGB16161616, \
+ }, \
+ }, \
+ .map[0].vaddr = (buffer_pointer), \
+ .src = (composer)->src, \
+ .dst = (composer)->dst, \
+ .cpp = sizeof(u64), \
+ .pitch = drm_rect_width(&(composer)->dst) * sizeof(u64) \
+ })
+
+struct vkms_pixel_composition_functions {
+ u64 (*get_src_pixel)(struct vkms_composer *composer, int x, int y);
+ u64 (*get_dst_pixel)(struct vkms_composer *composer, int x, int y);
+ void (*pixel_blend)(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer);
+};

/**
* compute_crc - Compute CRC value on output frame
@@ -31,42 +41,33 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
* returns CRC value computed using crc32 on the visible portion of
* the final framebuffer at vaddr_out
*/
-static uint32_t compute_crc(const u8 *vaddr,
+static uint32_t compute_crc(const __le64 *vaddr,
const struct vkms_composer *composer)
{
- int x, y;
- u32 crc = 0, pixel = 0;
- int x_src = composer->src.x1 >> 16;
- int y_src = composer->src.y1 >> 16;
- int h_src = drm_rect_height(&composer->src) >> 16;
- int w_src = drm_rect_width(&composer->src) >> 16;
-
- for (y = y_src; y < y_src + h_src; ++y) {
- for (x = x_src; x < x_src + w_src; ++x) {
- pixel = get_pixel_from_buffer(x, y, vaddr, composer);
- crc = crc32_le(crc, (void *)&pixel, sizeof(u32));
- }
- }
+ int h = drm_rect_height(&composer->dst);
+ int w = drm_rect_width(&composer->dst);

- return crc;
+ return crc32_le(0, (void *)vaddr, w * h * sizeof(u64));
}

-static u8 blend_channel(u8 src, u8 dst, u8 alpha)
+static __le16 blend_channel(u16 src, u16 dst, u16 alpha)
{
- u32 pre_blend;
- u8 new_color;
+ u64 pre_blend;
+ u16 new_color;

- pre_blend = (src * 255 + dst * (255 - alpha));
+ pre_blend = (src * 0xffff + dst * (0xffff - alpha));

- /* Faster div by 255 */
- new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8);
+ new_color = DIV_ROUND_UP(pre_blend, 0xffff);

- return new_color;
+ return cpu_to_le16(new_color);
}

/**
* alpha_blend - alpha blending equation
- * @argb_src: src pixel on premultiplied alpha mode
+ * @argb_src1: pixel of the source plane on premultiplied alpha mode
+ * @argb_src2: pixel of the destiny planes on premultiplied alpha mode
+ * @x: The x coodinate(width) of the pixel
+ * @y: The y coodinate(heigth) of the pixel
* @argb_dst: dst pixel completely opaque
*
* blend pixels using premultiplied blend formula. The current DRM assumption
@@ -74,50 +75,52 @@ static u8 blend_channel(u8 src, u8 dst, u8 alpha)
* channel values. See more drm_plane_create_blend_mode_property(). Also, this
* formula assumes a completely opaque background.
*/
-static void alpha_blend(const u8 *argb_src, u8 *argb_dst)
+static void alpha_blend(u64 argb_src1, u64 argb_src2, int y, int x,
+ struct vkms_composer *dst_composer)
{
- u8 alpha;
+ __le16 *output_pixel = packed_pixels_addr(dst_composer, y, x);

- alpha = argb_src[3];
- argb_dst[0] = blend_channel(argb_src[0], argb_dst[0], alpha);
- argb_dst[1] = blend_channel(argb_src[1], argb_dst[1], alpha);
- argb_dst[2] = blend_channel(argb_src[2], argb_dst[2], alpha);
-}
+ u16 src1_a = (argb_src1 & (0xffffllu << 48)) >> 48;
+ u16 src1_r = (argb_src1 & (0xffffllu << 32)) >> 32;
+ u16 src1_g = (argb_src1 & (0xffffllu << 16)) >> 16;
+ u16 src1_b = argb_src1 & 0xffffllu;

-/**
- * x_blend - blending equation that ignores the pixel alpha
- *
- * overwrites RGB color value from src pixel to dst pixel.
- */
-static void x_blend(const u8 *xrgb_src, u8 *xrgb_dst)
-{
- memcpy(xrgb_dst, xrgb_src, sizeof(u8) * 3);
+ u16 src2_r = (argb_src2 & (0xffffllu << 32)) >> 32;
+ u16 src2_g = (argb_src2 & (0xffffllu << 16)) >> 16;
+ u16 src2_b = argb_src2 & 0xffffllu;
+
+ output_pixel[0] = blend_channel(src1_b, src2_b, src1_a);
+ output_pixel[1] = blend_channel(src1_g, src2_g, src1_a);
+ output_pixel[2] = blend_channel(src1_r, src2_r, src1_a);
+ output_pixel[3] = 0xffff;
}

/**
- * blend - blend value at vaddr_src with value at vaddr_dst
- * @vaddr_dst: destination address
- * @vaddr_src: source address
- * @dst_composer: destination framebuffer's metadata
* @src_composer: source framebuffer's metadata
- * @pixel_blend: blending equation based on plane format
+ * @dst_composer: destiny framebuffer's metadata
+ * @funcs: A struct containing all the composition functions(get_src_pixel,
+ * get_dst_pixel, and pixel_blend)
*
- * Blend the vaddr_src value with the vaddr_dst value using a pixel blend
- * equation according to the supported plane formats DRM_FORMAT_(A/XRGB8888)
- * and clearing alpha channel to an completely opaque background. This function
- * uses buffer's metadata to locate the new composite values at vaddr_dst.
+ * Using the pixel_blend function passed as parameter, this function blends
+ * all pixels from src planes into a output buffer.
+ * Information of the output buffer is in the dst_composer parameter
+ * and the source plane in the src_composer.
+ * The get_src_pixel will use the src_composer to get the respective pixel,
+ * convert, and return it as ARGB_16161616.
+ * The same is true for the dst_composer and get_dst_pixel respectively.
+ * And finally, the blend function will receive the dst_composer, src,
+ * and dst pixels. Blend, and store thre result in the output using the
+ * dst_composer buffer information.
*
* TODO: completely clear the primary plane (a = 0xff) before starting to blend
* pixel color values
*/
-static void blend(void *vaddr_dst, void *vaddr_src,
+static void blend(struct vkms_composer *src_composer,
struct vkms_composer *dst_composer,
- struct vkms_composer *src_composer,
- void (*pixel_blend)(const u8 *, u8 *))
+ struct vkms_pixel_composition_functions *funcs)
{
int i, j, j_dst, i_dst;
- int offset_src, offset_dst;
- u8 *pixel_dst, *pixel_src;
+ u64 pixel_dst, pixel_src;

int x_src = src_composer->src.x1 >> 16;
int y_src = src_composer->src.y1 >> 16;
@@ -130,80 +133,101 @@ static void blend(void *vaddr_dst, void *vaddr_src,
int y_limit = y_src + h_dst;
int x_limit = x_src + w_dst;

- for (i = y_src, i_dst = y_dst; i < y_limit; ++i) {
- for (j = x_src, j_dst = x_dst; j < x_limit; ++j) {
- offset_dst = dst_composer->offset
- + (i_dst * dst_composer->pitch)
- + (j_dst++ * dst_composer->cpp);
- offset_src = src_composer->offset
- + (i * src_composer->pitch)
- + (j * src_composer->cpp);
-
- pixel_src = (u8 *)(vaddr_src + offset_src);
- pixel_dst = (u8 *)(vaddr_dst + offset_dst);
- pixel_blend(pixel_src, pixel_dst);
- /* clearing alpha channel (0xff)*/
- pixel_dst[3] = 0xff;
+ for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) {
+ for (j = x_src, j_dst = x_dst; j < x_limit; ++j, j_dst++) {
+ pixel_src = funcs->get_src_pixel(src_composer, j, i);
+ pixel_dst = funcs->get_dst_pixel(dst_composer, j_dst, i_dst);
+
+ funcs->pixel_blend(pixel_src, pixel_dst, j_dst, i_dst,
+ dst_composer);
}
- i_dst++;
}
}

-static void compose_plane(struct vkms_composer *primary_composer,
- struct vkms_composer *plane_composer,
- void *vaddr_out)
+static u64 ((*get_pixel_fmt_transform_function(u32 format))
+ (struct vkms_composer *, int, int))
{
- struct drm_framebuffer *fb = &plane_composer->fb;
- void *vaddr;
- void (*pixel_blend)(const u8 *p_src, u8 *p_dst);
+ if (format == DRM_FORMAT_ARGB8888)
+ return &ARGB8888_to_ARGB16161616;
+ else if (format == DRM_FORMAT_ARGB16161616)
+ return &get_ARGB16161616;
+ else
+ return &XRGB8888_to_ARGB16161616;
+}

- if (WARN_ON(dma_buf_map_is_null(&primary_composer->map[0])))
- return;
+static void ((*get_pixel_blend_function(u32 format))
+ (u64, u64, int, int, struct vkms_composer *))
+{
+ if (format == DRM_FORMAT_ARGB8888)
+ return &convert_to_ARGB8888;
+ else if (format == DRM_FORMAT_ARGB16161616)
+ return &convert_to_ARGB16161616;
+ else
+ return &convert_to_XRGB8888;
+}

- vaddr = plane_composer->map[0].vaddr;
+static void compose_plane(struct vkms_composer *src_composer,
+ struct vkms_composer *dst_composer,
+ struct vkms_pixel_composition_functions *funcs)
+{
+ u32 src_format = src_composer->fb.format->format;
+ u32 dst_format = dst_composer->fb.format->format;

- if (fb->format->format == DRM_FORMAT_ARGB8888)
- pixel_blend = &alpha_blend;
- else
- pixel_blend = &x_blend;
+ funcs->get_src_pixel = get_pixel_fmt_transform_function(src_format);
+ funcs->get_dst_pixel = get_pixel_fmt_transform_function(dst_format);

- blend(vaddr_out, vaddr, primary_composer, plane_composer, pixel_blend);
+ blend(src_composer, dst_composer, funcs);
}

-static int compose_active_planes(void **vaddr_out,
- struct vkms_composer *primary_composer,
- struct vkms_crtc_state *crtc_state)
+static __le64 *compose_active_planes(struct vkms_composer *primary_composer,
+ struct vkms_crtc_state *crtc_state)
{
- struct drm_framebuffer *fb = &primary_composer->fb;
- struct drm_gem_object *gem_obj = drm_gem_fb_get_obj(fb, 0);
- const void *vaddr;
+ struct vkms_plane_state **active_planes = crtc_state->active_planes;
+ int h = drm_rect_height(&primary_composer->dst);
+ int w = drm_rect_width(&primary_composer->dst);
+ struct vkms_pixel_composition_functions funcs;
+ struct vkms_composer dst_composer;
+ __le64 *vaddr_out;
int i;

- if (!*vaddr_out) {
- *vaddr_out = kvzalloc(gem_obj->size, GFP_KERNEL);
- if (!*vaddr_out) {
- DRM_ERROR("Cannot allocate memory for output frame.");
- return -ENOMEM;
- }
- }
-
if (WARN_ON(dma_buf_map_is_null(&primary_composer->map[0])))
- return -EINVAL;
+ return NULL;

- vaddr = primary_composer->map[0].vaddr;
+ vaddr_out = kvzalloc(w * h * sizeof(__le64), GFP_KERNEL);
+ if (!vaddr_out) {
+ DRM_ERROR("Cannot allocate memory for output frame.");
+ return NULL;
+ }

- memcpy(*vaddr_out, vaddr, gem_obj->size);
+ dst_composer = get_output_vkms_composer(vaddr_out, primary_composer);
+ funcs.pixel_blend = get_pixel_blend_function(DRM_FORMAT_ARGB16161616);
+ compose_plane(active_planes[0]->composer, &dst_composer, &funcs);

/* If there are other planes besides primary, we consider the active
* planes should be in z-order and compose them associatively:
* ((primary <- overlay) <- cursor)
*/
+ funcs.pixel_blend = alpha_blend;
for (i = 1; i < crtc_state->num_active_planes; i++)
- compose_plane(primary_composer,
- crtc_state->active_planes[i]->composer,
- *vaddr_out);
+ compose_plane(active_planes[i]->composer, &dst_composer, &funcs);

- return 0;
+ return vaddr_out;
+}
+
+static void write_wb_buffer(struct vkms_writeback_job *active_wb,
+ struct vkms_composer *primary_composer,
+ __le64 *vaddr_out)
+{
+ u32 dst_fb_format = active_wb->composer.fb.format->format;
+ struct vkms_pixel_composition_functions funcs;
+ struct vkms_composer src_composer;
+
+ src_composer = get_output_vkms_composer(vaddr_out, primary_composer);
+ funcs.pixel_blend = get_pixel_blend_function(dst_fb_format);
+ active_wb->composer.src = primary_composer->src;
+ active_wb->composer.dst = primary_composer->dst;
+
+ compose_plane(&src_composer, &active_wb->composer, &funcs);
}

/**
@@ -221,14 +245,14 @@ void vkms_composer_worker(struct work_struct *work)
struct vkms_crtc_state,
composer_work);
struct drm_crtc *crtc = crtc_state->base.crtc;
+ struct vkms_writeback_job *active_wb = crtc_state->active_writeback;
struct vkms_output *out = drm_crtc_to_vkms_output(crtc);
struct vkms_composer *primary_composer = NULL;
struct vkms_plane_state *act_plane = NULL;
bool crc_pending, wb_pending;
- void *vaddr_out = NULL;
+ __le64 *vaddr_out = NULL;
u32 crc32 = 0;
u64 frame_start, frame_end;
- int ret;

spin_lock_irq(&out->composer_lock);
frame_start = crtc_state->frame_start;
@@ -256,28 +280,21 @@ void vkms_composer_worker(struct work_struct *work)
if (!primary_composer)
return;

- if (wb_pending)
- vaddr_out = crtc_state->active_writeback->data[0].vaddr;
-
- ret = compose_active_planes(&vaddr_out, primary_composer,
- crtc_state);
- if (ret) {
- if (ret == -EINVAL && !wb_pending)
- kvfree(vaddr_out);
+ vaddr_out = compose_active_planes(primary_composer, crtc_state);
+ if (!vaddr_out)
return;
- }
-
- crc32 = compute_crc(vaddr_out, primary_composer);

if (wb_pending) {
+ write_wb_buffer(active_wb, primary_composer, vaddr_out);
drm_writeback_signal_completion(&out->wb_connector, 0);
spin_lock_irq(&out->composer_lock);
crtc_state->wb_pending = false;
spin_unlock_irq(&out->composer_lock);
- } else {
- kvfree(vaddr_out);
}

+ crc32 = compute_crc(vaddr_out, primary_composer);
+ kvfree(vaddr_out);
+
/*
* The worker can fall behind the vblank hrtimer, make sure we catch up.
*/
diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
new file mode 100644
index 000000000000..60e21adbf68d
--- /dev/null
+++ b/drivers/gpu/drm/vkms/vkms_formats.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef _VKMS_FORMATS_H_
+#define _VKMS_FORMATS_H_
+
+#include <drm/drm_rect.h>
+
+#define pixel_offset(composer, x, y) \
+ ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
+
+/*
+ * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
+ *
+ * @composer: Buffer metadata
+ * @x: The x(width) coordinate of the 2D buffer
+ * @y: The y(Heigth) coordinate of the 2D buffer
+ *
+ * Takes the information stored in the composer, a pair of coordinates, and
+ * returns the address of the first color channel.
+ * This function assumes the channels are packed together, i.e. a color channel
+ * comes immediately after another. And therefore, this function doesn't work
+ * for YUV with chroma subsampling (e.g. YUV420 and NV21).
+ */
+void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
+{
+ int offset = pixel_offset(composer, x, y);
+
+ return (u8 *)composer->map[0].vaddr + offset;
+}
+
+u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
+{
+ u8 *pixel_addr = packed_pixels_addr(composer, x, y);
+
+ /*
+ * Organizes the channels in their respective positions and converts
+ * the 8 bits channel to 16.
+ * The 257 is the "conversion ratio". This number is obtained by the
+ * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
+ * the best color value in a color space with more possibilities.
+ * And a similar idea applies to others RGB color conversions.
+ */
+ return ((u64)pixel_addr[3] * 257) << 48 |
+ ((u64)pixel_addr[2] * 257) << 32 |
+ ((u64)pixel_addr[1] * 257) << 16 |
+ ((u64)pixel_addr[0] * 257);
+}
+
+u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
+{
+ u8 *pixel_addr = packed_pixels_addr(composer, x, y);
+
+ /*
+ * The same as the ARGB8888 but with the alpha channel as the
+ * maximum value as possible.
+ */
+ return 0xffffllu << 48 |
+ ((u64)pixel_addr[2] * 257) << 32 |
+ ((u64)pixel_addr[1] * 257) << 16 |
+ ((u64)pixel_addr[0] * 257);
+}
+
+u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
+{
+ __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
+
+ /*
+ * Because the format byte order is in little-endian and this code
+ * needs to run on big-endian machines too, we need modify
+ * the byte order from little-endian to the CPU native byte order.
+ */
+ return le64_to_cpu(*pixel_addr);
+}
+
+/*
+ * The following functions are used as blend operations. But unlike the
+ * `alpha_blend`, these functions take an ARGB16161616 pixel from the
+ * source, convert it to a specific format, and store it in the destination.
+ *
+ * They are used in the `compose_active_planes` and `write_wb_buffer` to
+ * copy and convert one pixel from/to the output buffer to/from
+ * another buffer (e.g. writeback buffer, primary plane buffer).
+ */
+
+void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer)
+{
+ u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
+
+ /*
+ * This sequence below is important because the format's byte order is
+ * in little-endian. In the case of the ARGB8888 the memory is
+ * organized this way:
+ *
+ * | Addr | = blue channel
+ * | Addr + 1 | = green channel
+ * | Addr + 2 | = Red channel
+ * | Addr + 3 | = Alpha channel
+ */
+ pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
+ pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
+ pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
+ pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
+}
+
+void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer)
+{
+ u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
+
+ pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
+ pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
+ pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
+ pixel_addr[3] = 0xff;
+}
+
+void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer)
+{
+ __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
+
+ *pixel_addr = cpu_to_le64(argb_src1);
+}
+
+#endif /* _VKMS_FORMATS_H_ */
--
2.30.2

Subject: [PATCH 4/6] drm: vkms: Add fb information to `vkms_writeback_job`

This commit is the groundwork to introduce new formats to the planes and
writeback buffer. As part of it, a new buffer metadata field is added to
`vkms_writeback_job`, this metadata is represented by the `vkms_composer`
struct.

This will allow us, in the future, to have different compositing and wb
format types.

Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
---
drivers/gpu/drm/vkms/vkms_drv.h | 10 +++++-----
drivers/gpu/drm/vkms/vkms_writeback.c | 16 +++++++++++++---
2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index 64e62993b06f..d62f8ebd454b 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -20,11 +20,6 @@
#define XRES_MAX 8192
#define YRES_MAX 8192

-struct vkms_writeback_job {
- struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
- struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
-};
-
struct vkms_composer {
struct drm_framebuffer fb;
struct drm_rect src, dst;
@@ -34,6 +29,11 @@ struct vkms_composer {
unsigned int cpp;
};

+struct vkms_writeback_job {
+ struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
+ struct vkms_composer composer;
+};
+
/**
* vkms_plane_state - Driver specific plane state
* @base: base plane state
diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
index 8694227f555f..5a3e12f105dc 100644
--- a/drivers/gpu/drm/vkms/vkms_writeback.c
+++ b/drivers/gpu/drm/vkms/vkms_writeback.c
@@ -75,7 +75,7 @@ static int vkms_wb_prepare_job(struct drm_writeback_connector *wb_connector,
if (!vkmsjob)
return -ENOMEM;

- ret = drm_gem_fb_vmap(job->fb, vkmsjob->map, vkmsjob->data);
+ ret = drm_gem_fb_vmap(job->fb, vkmsjob->composer.map, vkmsjob->data);
if (ret) {
DRM_ERROR("vmap failed: %d\n", ret);
goto err_kfree;
@@ -99,7 +99,7 @@ static void vkms_wb_cleanup_job(struct drm_writeback_connector *connector,
if (!job->fb)
return;

- drm_gem_fb_vunmap(job->fb, vkmsjob->map);
+ drm_gem_fb_vunmap(job->fb, vkmsjob->composer.map);

vkmsdev = drm_device_to_vkms_device(job->fb->dev);
vkms_set_composer(&vkmsdev->output, false);
@@ -116,14 +116,24 @@ static void vkms_wb_atomic_commit(struct drm_connector *conn,
struct drm_writeback_connector *wb_conn = &output->wb_connector;
struct drm_connector_state *conn_state = wb_conn->base.state;
struct vkms_crtc_state *crtc_state = output->composer_state;
+ struct drm_framebuffer *fb = connector_state->writeback_job->fb;
+ struct vkms_writeback_job *active_wb;
+ struct vkms_composer *wb_composer;

if (!conn_state)
return;

vkms_set_composer(&vkmsdev->output, true);

+ active_wb = conn_state->writeback_job->priv;
+ wb_composer = &active_wb->composer;
+
spin_lock_irq(&output->composer_lock);
- crtc_state->active_writeback = conn_state->writeback_job->priv;
+ crtc_state->active_writeback = active_wb;
+ memcpy(&wb_composer->fb, fb, sizeof(struct drm_framebuffer));
+ wb_composer->offset = fb->offsets[0];
+ wb_composer->pitch = fb->pitches[0];
+ wb_composer->cpp = fb->format->cpp[0];
crtc_state->wb_pending = true;
spin_unlock_irq(&output->composer_lock);
drm_writeback_queue_job(wb_conn, connector_state);
--
2.30.2

2021-10-05 22:23:52

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Hi Igor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Igor-Matheus-Andrade-Torrente/Refactor-the-vkms-to-accept-new-formats/20211006-042037
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 02d5e016800d082058b3d3b7c3ede136cdc6ddcb
config: riscv-buildonly-randconfig-r005-20211004 (attached as .config)
compiler: riscv32-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/9cd34ac9858091dc06086b2024e8f5f111657d48
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Igor-Matheus-Andrade-Torrente/Refactor-the-vkms-to-accept-new-formats/20211006-042037
git checkout 9cd34ac9858091dc06086b2024e8f5f111657d48
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=riscv

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from drivers/gpu/drm/vkms/vkms_composer.c:12:
>> drivers/gpu/drm/vkms/vkms_formats.h:24:7: error: no previous prototype for 'packed_pixels_addr' [-Werror=missing-prototypes]
24 | void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
| ^~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/vkms/vkms_formats.h:31:5: error: no previous prototype for 'ARGB8888_to_ARGB16161616' [-Werror=missing-prototypes]
31 | u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
| ^~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/vkms/vkms_formats.h:49:5: error: no previous prototype for 'XRGB8888_to_ARGB16161616' [-Werror=missing-prototypes]
49 | u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
| ^~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/vkms/vkms_formats.h:63:5: error: no previous prototype for 'get_ARGB16161616' [-Werror=missing-prototypes]
63 | u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
| ^~~~~~~~~~~~~~~~
>> drivers/gpu/drm/vkms/vkms_formats.h:85:6: error: no previous prototype for 'convert_to_ARGB8888' [-Werror=missing-prototypes]
85 | void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
| ^~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/vkms/vkms_formats.h:106:6: error: no previous prototype for 'convert_to_XRGB8888' [-Werror=missing-prototypes]
106 | void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
| ^~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/vkms/vkms_formats.h:117:6: error: no previous prototype for 'convert_to_ARGB16161616' [-Werror=missing-prototypes]
117 | void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
| ^~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors


vim +/packed_pixels_addr +24 drivers/gpu/drm/vkms/vkms_formats.h

7
8 #define pixel_offset(composer, x, y) \
9 ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
10
11 /*
12 * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
13 *
14 * @composer: Buffer metadata
15 * @x: The x(width) coordinate of the 2D buffer
16 * @y: The y(Heigth) coordinate of the 2D buffer
17 *
18 * Takes the information stored in the composer, a pair of coordinates, and
19 * returns the address of the first color channel.
20 * This function assumes the channels are packed together, i.e. a color channel
21 * comes immediately after another. And therefore, this function doesn't work
22 * for YUV with chroma subsampling (e.g. YUV420 and NV21).
23 */
> 24 void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
25 {
26 int offset = pixel_offset(composer, x, y);
27
28 return (u8 *)composer->map[0].vaddr + offset;
29 }
30
> 31 u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
32 {
33 u8 *pixel_addr = packed_pixels_addr(composer, x, y);
34
35 /*
36 * Organizes the channels in their respective positions and converts
37 * the 8 bits channel to 16.
38 * The 257 is the "conversion ratio". This number is obtained by the
39 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
40 * the best color value in a color space with more possibilities.
41 * And a similar idea applies to others RGB color conversions.
42 */
43 return ((u64)pixel_addr[3] * 257) << 48 |
44 ((u64)pixel_addr[2] * 257) << 32 |
45 ((u64)pixel_addr[1] * 257) << 16 |
46 ((u64)pixel_addr[0] * 257);
47 }
48
> 49 u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
50 {
51 u8 *pixel_addr = packed_pixels_addr(composer, x, y);
52
53 /*
54 * The same as the ARGB8888 but with the alpha channel as the
55 * maximum value as possible.
56 */
57 return 0xffffllu << 48 |
58 ((u64)pixel_addr[2] * 257) << 32 |
59 ((u64)pixel_addr[1] * 257) << 16 |
60 ((u64)pixel_addr[0] * 257);
61 }
62
> 63 u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
64 {
65 __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
66
67 /*
68 * Because the format byte order is in little-endian and this code
69 * needs to run on big-endian machines too, we need modify
70 * the byte order from little-endian to the CPU native byte order.
71 */
72 return le64_to_cpu(*pixel_addr);
73 }
74
75 /*
76 * The following functions are used as blend operations. But unlike the
77 * `alpha_blend`, these functions take an ARGB16161616 pixel from the
78 * source, convert it to a specific format, and store it in the destination.
79 *
80 * They are used in the `compose_active_planes` and `write_wb_buffer` to
81 * copy and convert one pixel from/to the output buffer to/from
82 * another buffer (e.g. writeback buffer, primary plane buffer).
83 */
84
> 85 void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
86 struct vkms_composer *dst_composer)
87 {
88 u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
89
90 /*
91 * This sequence below is important because the format's byte order is
92 * in little-endian. In the case of the ARGB8888 the memory is
93 * organized this way:
94 *
95 * | Addr | = blue channel
96 * | Addr + 1 | = green channel
97 * | Addr + 2 | = Red channel
98 * | Addr + 3 | = Alpha channel
99 */
100 pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
101 pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
102 pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
103 pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
104 }
105
> 106 void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
107 struct vkms_composer *dst_composer)
108 {
109 u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
110
111 pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
112 pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
113 pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
114 pixel_addr[3] = 0xff;
115 }
116
> 117 void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
118 struct vkms_composer *dst_composer)
119 {
120 __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
121
122 *pixel_addr = cpu_to_le64(argb_src1);
123 }
124

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (8.47 kB)
.config.gz (36.21 kB)
Download all attachments

2021-10-05 23:22:58

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Hi Igor,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Igor-Matheus-Andrade-Torrente/Refactor-the-vkms-to-accept-new-formats/20211006-042037
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 02d5e016800d082058b3d3b7c3ede136cdc6ddcb
config: x86_64-randconfig-a003-20211004 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c0039de2953d15815448b4b3c3bafb45607781e0)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/9cd34ac9858091dc06086b2024e8f5f111657d48
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Igor-Matheus-Andrade-Torrente/Refactor-the-vkms-to-accept-new-formats/20211006-042037
git checkout 9cd34ac9858091dc06086b2024e8f5f111657d48
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from drivers/gpu/drm/vkms/vkms_composer.c:12:
>> drivers/gpu/drm/vkms/vkms_formats.h:24:7: warning: no previous prototype for function 'packed_pixels_addr' [-Wmissing-prototypes]
void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:24:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:31:5: warning: no previous prototype for function 'ARGB8888_to_ARGB16161616' [-Wmissing-prototypes]
u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:31:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:49:5: warning: no previous prototype for function 'XRGB8888_to_ARGB16161616' [-Wmissing-prototypes]
u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:49:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:63:5: warning: no previous prototype for function 'get_ARGB16161616' [-Wmissing-prototypes]
u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:63:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:85:6: warning: no previous prototype for function 'convert_to_ARGB8888' [-Wmissing-prototypes]
void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
drivers/gpu/drm/vkms/vkms_formats.h:85:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:106:6: warning: no previous prototype for function 'convert_to_XRGB8888' [-Wmissing-prototypes]
void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
drivers/gpu/drm/vkms/vkms_formats.h:106:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:117:6: warning: no previous prototype for function 'convert_to_ARGB16161616' [-Wmissing-prototypes]
void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
^
drivers/gpu/drm/vkms/vkms_formats.h:117:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
^
static
7 warnings generated.


vim +/packed_pixels_addr +24 drivers/gpu/drm/vkms/vkms_formats.h

7
8 #define pixel_offset(composer, x, y) \
9 ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
10
11 /*
12 * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
13 *
14 * @composer: Buffer metadata
15 * @x: The x(width) coordinate of the 2D buffer
16 * @y: The y(Heigth) coordinate of the 2D buffer
17 *
18 * Takes the information stored in the composer, a pair of coordinates, and
19 * returns the address of the first color channel.
20 * This function assumes the channels are packed together, i.e. a color channel
21 * comes immediately after another. And therefore, this function doesn't work
22 * for YUV with chroma subsampling (e.g. YUV420 and NV21).
23 */
> 24 void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
25 {
26 int offset = pixel_offset(composer, x, y);
27
28 return (u8 *)composer->map[0].vaddr + offset;
29 }
30
> 31 u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
32 {
33 u8 *pixel_addr = packed_pixels_addr(composer, x, y);
34
35 /*
36 * Organizes the channels in their respective positions and converts
37 * the 8 bits channel to 16.
38 * The 257 is the "conversion ratio". This number is obtained by the
39 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
40 * the best color value in a color space with more possibilities.
41 * And a similar idea applies to others RGB color conversions.
42 */
43 return ((u64)pixel_addr[3] * 257) << 48 |
44 ((u64)pixel_addr[2] * 257) << 32 |
45 ((u64)pixel_addr[1] * 257) << 16 |
46 ((u64)pixel_addr[0] * 257);
47 }
48
> 49 u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
50 {
51 u8 *pixel_addr = packed_pixels_addr(composer, x, y);
52
53 /*
54 * The same as the ARGB8888 but with the alpha channel as the
55 * maximum value as possible.
56 */
57 return 0xffffllu << 48 |
58 ((u64)pixel_addr[2] * 257) << 32 |
59 ((u64)pixel_addr[1] * 257) << 16 |
60 ((u64)pixel_addr[0] * 257);
61 }
62
> 63 u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
64 {
65 __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
66
67 /*
68 * Because the format byte order is in little-endian and this code
69 * needs to run on big-endian machines too, we need modify
70 * the byte order from little-endian to the CPU native byte order.
71 */
72 return le64_to_cpu(*pixel_addr);
73 }
74
75 /*
76 * The following functions are used as blend operations. But unlike the
77 * `alpha_blend`, these functions take an ARGB16161616 pixel from the
78 * source, convert it to a specific format, and store it in the destination.
79 *
80 * They are used in the `compose_active_planes` and `write_wb_buffer` to
81 * copy and convert one pixel from/to the output buffer to/from
82 * another buffer (e.g. writeback buffer, primary plane buffer).
83 */
84
> 85 void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
86 struct vkms_composer *dst_composer)
87 {
88 u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
89
90 /*
91 * This sequence below is important because the format's byte order is
92 * in little-endian. In the case of the ARGB8888 the memory is
93 * organized this way:
94 *
95 * | Addr | = blue channel
96 * | Addr + 1 | = green channel
97 * | Addr + 2 | = Red channel
98 * | Addr + 3 | = Alpha channel
99 */
100 pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
101 pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
102 pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
103 pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
104 }
105
> 106 void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
107 struct vkms_composer *dst_composer)
108 {
109 u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
110
111 pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
112 pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
113 pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
114 pixel_addr[3] = 0xff;
115 }
116
> 117 void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
118 struct vkms_composer *dst_composer)
119 {
120 __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
121
122 *pixel_addr = cpu_to_le64(argb_src1);
123 }
124

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (9.94 kB)
.config.gz (35.86 kB)
Download all attachments

2021-10-05 23:40:44

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Hi Igor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Igor-Matheus-Andrade-Torrente/Refactor-the-vkms-to-accept-new-formats/20211006-042037
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 02d5e016800d082058b3d3b7c3ede136cdc6ddcb
config: i386-buildonly-randconfig-r004-20211004 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c0039de2953d15815448b4b3c3bafb45607781e0)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/9cd34ac9858091dc06086b2024e8f5f111657d48
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Igor-Matheus-Andrade-Torrente/Refactor-the-vkms-to-accept-new-formats/20211006-042037
git checkout 9cd34ac9858091dc06086b2024e8f5f111657d48
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from drivers/gpu/drm/vkms/vkms_composer.c:12:
>> drivers/gpu/drm/vkms/vkms_formats.h:24:7: error: no previous prototype for function 'packed_pixels_addr' [-Werror,-Wmissing-prototypes]
void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:24:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:31:5: error: no previous prototype for function 'ARGB8888_to_ARGB16161616' [-Werror,-Wmissing-prototypes]
u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:31:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:49:5: error: no previous prototype for function 'XRGB8888_to_ARGB16161616' [-Werror,-Wmissing-prototypes]
u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:49:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:63:5: error: no previous prototype for function 'get_ARGB16161616' [-Werror,-Wmissing-prototypes]
u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
drivers/gpu/drm/vkms/vkms_formats.h:63:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:85:6: error: no previous prototype for function 'convert_to_ARGB8888' [-Werror,-Wmissing-prototypes]
void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
drivers/gpu/drm/vkms/vkms_formats.h:85:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:106:6: error: no previous prototype for function 'convert_to_XRGB8888' [-Werror,-Wmissing-prototypes]
void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
drivers/gpu/drm/vkms/vkms_formats.h:106:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
^
static
>> drivers/gpu/drm/vkms/vkms_formats.h:117:6: error: no previous prototype for function 'convert_to_ARGB16161616' [-Werror,-Wmissing-prototypes]
void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
^
drivers/gpu/drm/vkms/vkms_formats.h:117:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
^
static
7 errors generated.


vim +/packed_pixels_addr +24 drivers/gpu/drm/vkms/vkms_formats.h

7
8 #define pixel_offset(composer, x, y) \
9 ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
10
11 /*
12 * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
13 *
14 * @composer: Buffer metadata
15 * @x: The x(width) coordinate of the 2D buffer
16 * @y: The y(Heigth) coordinate of the 2D buffer
17 *
18 * Takes the information stored in the composer, a pair of coordinates, and
19 * returns the address of the first color channel.
20 * This function assumes the channels are packed together, i.e. a color channel
21 * comes immediately after another. And therefore, this function doesn't work
22 * for YUV with chroma subsampling (e.g. YUV420 and NV21).
23 */
> 24 void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
25 {
26 int offset = pixel_offset(composer, x, y);
27
28 return (u8 *)composer->map[0].vaddr + offset;
29 }
30
> 31 u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
32 {
33 u8 *pixel_addr = packed_pixels_addr(composer, x, y);
34
35 /*
36 * Organizes the channels in their respective positions and converts
37 * the 8 bits channel to 16.
38 * The 257 is the "conversion ratio". This number is obtained by the
39 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
40 * the best color value in a color space with more possibilities.
41 * And a similar idea applies to others RGB color conversions.
42 */
43 return ((u64)pixel_addr[3] * 257) << 48 |
44 ((u64)pixel_addr[2] * 257) << 32 |
45 ((u64)pixel_addr[1] * 257) << 16 |
46 ((u64)pixel_addr[0] * 257);
47 }
48
> 49 u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
50 {
51 u8 *pixel_addr = packed_pixels_addr(composer, x, y);
52
53 /*
54 * The same as the ARGB8888 but with the alpha channel as the
55 * maximum value as possible.
56 */
57 return 0xffffllu << 48 |
58 ((u64)pixel_addr[2] * 257) << 32 |
59 ((u64)pixel_addr[1] * 257) << 16 |
60 ((u64)pixel_addr[0] * 257);
61 }
62
> 63 u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
64 {
65 __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
66
67 /*
68 * Because the format byte order is in little-endian and this code
69 * needs to run on big-endian machines too, we need modify
70 * the byte order from little-endian to the CPU native byte order.
71 */
72 return le64_to_cpu(*pixel_addr);
73 }
74
75 /*
76 * The following functions are used as blend operations. But unlike the
77 * `alpha_blend`, these functions take an ARGB16161616 pixel from the
78 * source, convert it to a specific format, and store it in the destination.
79 *
80 * They are used in the `compose_active_planes` and `write_wb_buffer` to
81 * copy and convert one pixel from/to the output buffer to/from
82 * another buffer (e.g. writeback buffer, primary plane buffer).
83 */
84
> 85 void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
86 struct vkms_composer *dst_composer)
87 {
88 u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
89
90 /*
91 * This sequence below is important because the format's byte order is
92 * in little-endian. In the case of the ARGB8888 the memory is
93 * organized this way:
94 *
95 * | Addr | = blue channel
96 * | Addr + 1 | = green channel
97 * | Addr + 2 | = Red channel
98 * | Addr + 3 | = Alpha channel
99 */
100 pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
101 pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
102 pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
103 pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
104 }
105
> 106 void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
107 struct vkms_composer *dst_composer)
108 {
109 u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
110
111 pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
112 pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
113 pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
114 pixel_addr[3] = 0xff;
115 }
116
> 117 void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
118 struct vkms_composer *dst_composer)
119 {
120 __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
121
122 *pixel_addr = cpu_to_le64(argb_src1);
123 }
124

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (9.97 kB)
.config.gz (39.15 kB)
Download all attachments

2021-10-18 07:56:15

by Pekka Paalanen

[permalink] [raw]
Subject: Re: [PATCH 0/6] Refactor the vkms to accept new formats

On Tue, 5 Oct 2021 17:16:31 -0300
Igor Matheus Andrade Torrente <[email protected]> wrote:

> XRGB to ARGB behavior
> =================
> During the development, I decided to always fill the alpha channel of
> the output pixel whenever the conversion from a format without an alpha
> channel to ARGB16161616 is necessary. Therefore, I ignore the value
> received from the XRGB and overwrite the value with 0xFFFF.
>
> My question is, is this behavior acceptable?

Hi,

that is the expected behaviour. X channel values must never affect
anything on screen, hence they must never affect other channels'
values. You are free to completely ignore X channel values, and if your
output buffer has X channel, then you are free to write (or not write,
unless for security reasons) whatever into it.


Thanks,
pq

>
> [1] https://lists.freedesktop.org/archives/igt-dev/2021-October/036125.html
>
> Igor Matheus Andrade Torrente (6):
> drm: vkms: Replace the deprecated drm_mode_config_init
> drm: vkms: Alloc the compose frame using vzalloc
> drm: vkms: Replace hardcoded value of `vkms_composer.map` to
> DRM_FORMAT_MAX_PLANES
> drm: vkms: Add fb information to `vkms_writeback_job`
> drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple
> formats
> drm: vkms: Refactor the plane composer to accept new formats
>
> drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++------------
> drivers/gpu/drm/vkms/vkms_drv.c | 5 +-
> drivers/gpu/drm/vkms/vkms_drv.h | 12 +-
> drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
> drivers/gpu/drm/vkms/vkms_writeback.c | 27 ++-
> 5 files changed, 304 insertions(+), 140 deletions(-)
> create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
>


Attachments:
(No filename) (849.00 B)
OpenPGP digital signature

2021-10-18 10:05:07

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [PATCH 1/6] drm: vkms: Replace the deprecated drm_mode_config_init

Hi

Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
> The `drm_mode_config_init` was deprecated since c3b790e commit, and it's
> being replaced by the `drmm_mode_config_init`.
>
> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_drv.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.c b/drivers/gpu/drm/vkms/vkms_drv.c
> index 0ffe5f0e33f7..828868920494 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.c
> +++ b/drivers/gpu/drm/vkms/vkms_drv.c
> @@ -140,8 +140,11 @@ static const struct drm_mode_config_helper_funcs vkms_mode_config_helpers = {
> static int vkms_modeset_init(struct vkms_device *vkmsdev)
> {
> struct drm_device *dev = &vkmsdev->drm;
> + int ret = drmm_mode_config_init(dev);
> +
> + if (ret < 0)
> + return ret;

The style looks awkward IMHO. Rather use

int ret

ret = drmm_mode_config_init()
if (ret)
return ret;

>
> - drm_mode_config_init(dev);
> dev->mode_config.funcs = &vkms_mode_funcs;
> dev->mode_config.min_width = XRES_MIN;
> dev->mode_config.min_height = YRES_MIN;
>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


Attachments:
OpenPGP_signature (855.00 B)
OpenPGP digital signature

2021-10-18 10:09:11

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [PATCH 3/6] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES

Hi

Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
> The `map` vector at `vkms_composer` uses a hardcoded value to define its
> size.
>
> If someday the maximum number of planes increases, this hardcoded value
> can be a problem.
>
> This value is being replaced with the DRM_FORMAT_MAX_PLANES macro.
>
> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_drv.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
> index d48c23d40ce5..64e62993b06f 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.h
> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
> @@ -28,7 +28,7 @@ struct vkms_writeback_job {
> struct vkms_composer {
> struct drm_framebuffer fb;
> struct drm_rect src, dst;
> - struct dma_buf_map map[4];
> + struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];

I suspect that I simply missed this instance while introducing
DRM_FORMAT_MAX_PLANES. Thanks for fixing.

Acked-by: Thomas Zimmermann <[email protected]>

> unsigned int offset;
> unsigned int pitch;
> unsigned int cpp;
>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


Attachments:
OpenPGP_signature (855.00 B)
OpenPGP digital signature

2021-10-18 10:11:49

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [PATCH 4/6] drm: vkms: Add fb information to `vkms_writeback_job`

Hi

Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
> This commit is the groundwork to introduce new formats to the planes and
> writeback buffer. As part of it, a new buffer metadata field is added to
> `vkms_writeback_job`, this metadata is represented by the `vkms_composer`
> struct.
>
> This will allow us, in the future, to have different compositing and wb
> format types.
>
> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_drv.h | 10 +++++-----
> drivers/gpu/drm/vkms/vkms_writeback.c | 16 +++++++++++++---
> 2 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
> index 64e62993b06f..d62f8ebd454b 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.h
> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
> @@ -20,11 +20,6 @@
> #define XRES_MAX 8192
> #define YRES_MAX 8192
>
> -struct vkms_writeback_job {
> - struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
> - struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
> -};
> -
> struct vkms_composer {
> struct drm_framebuffer fb;
> struct drm_rect src, dst;
> @@ -34,6 +29,11 @@ struct vkms_composer {
> unsigned int cpp;
> };
>
> +struct vkms_writeback_job {
> + struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
> + struct vkms_composer composer;
> +};
> +
> /**
> * vkms_plane_state - Driver specific plane state
> * @base: base plane state
> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
> index 8694227f555f..5a3e12f105dc 100644
> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
> @@ -75,7 +75,7 @@ static int vkms_wb_prepare_job(struct drm_writeback_connector *wb_connector,
> if (!vkmsjob)
> return -ENOMEM;
>
> - ret = drm_gem_fb_vmap(job->fb, vkmsjob->map, vkmsjob->data);
> + ret = drm_gem_fb_vmap(job->fb, vkmsjob->composer.map, vkmsjob->data);
> if (ret) {
> DRM_ERROR("vmap failed: %d\n", ret);
> goto err_kfree;
> @@ -99,7 +99,7 @@ static void vkms_wb_cleanup_job(struct drm_writeback_connector *connector,
> if (!job->fb)
> return;
>
> - drm_gem_fb_vunmap(job->fb, vkmsjob->map);
> + drm_gem_fb_vunmap(job->fb, vkmsjob->composer.map);
>
> vkmsdev = drm_device_to_vkms_device(job->fb->dev);
> vkms_set_composer(&vkmsdev->output, false);
> @@ -116,14 +116,24 @@ static void vkms_wb_atomic_commit(struct drm_connector *conn,
> struct drm_writeback_connector *wb_conn = &output->wb_connector;
> struct drm_connector_state *conn_state = wb_conn->base.state;
> struct vkms_crtc_state *crtc_state = output->composer_state;
> + struct drm_framebuffer *fb = connector_state->writeback_job->fb;
> + struct vkms_writeback_job *active_wb;
> + struct vkms_composer *wb_composer;
>
> if (!conn_state)
> return;
>
> vkms_set_composer(&vkmsdev->output, true);
>
> + active_wb = conn_state->writeback_job->priv;
> + wb_composer = &active_wb->composer;
> +
> spin_lock_irq(&output->composer_lock);
> - crtc_state->active_writeback = conn_state->writeback_job->priv;
> + crtc_state->active_writeback = active_wb;
> + memcpy(&wb_composer->fb, fb, sizeof(struct drm_framebuffer));
> + wb_composer->offset = fb->offsets[0];
> + wb_composer->pitch = fb->pitches[0];
> + wb_composer->cpp = fb->format->cpp[0];

I don't know the exact cleanup requirements of these structures, but
copying fb into composer looks like it breaks reference counting of the
framebuffer's gem objects at least.

This atomic_commit helper would better acquire the correct references
and hand them over to the compositor thread. Why not have a ref-counted
composer or get a reference on the framebuffer instead?

Best regards
Thomas


> crtc_state->wb_pending = true;
> spin_unlock_irq(&output->composer_lock);
> drm_writeback_queue_job(wb_conn, connector_state);
>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


Attachments:
OpenPGP_signature (855.00 B)
OpenPGP digital signature

2021-10-18 10:17:30

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Hi

Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
> Currently, the vkms atomic check only goes through the first position of
> the `vkms_wb_formats` vector.
>
> This change prepares the atomic_check to check the entire vector.
>
> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
> index 5a3e12f105dc..56978f499203 100644
> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
> @@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
> {
> struct drm_framebuffer *fb;
> const struct drm_display_mode *mode = &crtc_state->mode;
> + bool format_supported = false;
> + int i;
>
> if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
> return 0;
> @@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
> return -EINVAL;
> }
>
> - if (fb->format->format != vkms_wb_formats[0]) {
> + for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
> + if (fb->format->format == vkms_wb_formats[i]) {
> + format_supported = true;
> + break;
> + }
> + }

At a minimum, this loop should be in a helper function. But more
generally, I'm surprised that this isn't already covered by the DRM's
atomic helpers.

Best regards
Thomas

> +
> + if (!format_supported) {
> DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
> &fb->format->format);
> return -EINVAL;
>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


Attachments:
OpenPGP_signature (855.00 B)
OpenPGP digital signature

2021-10-18 10:31:09

by Pekka Paalanen

[permalink] [raw]
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

On Tue, 5 Oct 2021 17:16:37 -0300
Igor Matheus Andrade Torrente <[email protected]> wrote:

> Currently the blend function only accepts XRGB_8888 and ARGB_8888
> as a color input.
>
> This patch refactors all the functions related to the plane composition
> to overcome this limitation.
>
> Now the blend function receives a format handler to each plane and a
> blend function pointer. It will take two ARGB_1616161616 pixels, one
> for each handler, and will use the blend function to calculate and
> store the final color in the output buffer.
>
> These format handlers will receive the `vkms_composer` and a pair of
> coordinates. And they should return the respective pixel in the
> ARGB_16161616 format.
>
> The blend function will receive two ARGB_16161616 pixels, x, y, and
> the vkms_composer of the output buffer. The method should perform the
> blend operation and store output to the format aforementioned
> ARGB_16161616.

Hi,

here are some drive-by comments which you are free to take or leave.

To find more efficient ways to do this, you might want research Pixman
library. It's MIT licensed.

IIRC, the elementary operations there operate on pixel lines (pointer +
length), not individual pixels (image, x, y). Input conversion function
reads and converts a whole line a time. Blending function blends two
lines and writes the output into back one of the lines. Output
conversion function takes a line and converts it into destination
buffer. That way the elementary operations do not need to take stride
into account, and blending does not need to deal with varying memory
alignment FWIW. The inner loops involve much less function calls and
state, probably.

Pixman may not do exactly this, but it does something very similar.
Pixman also has multiple different levels of optimisations, which may
not be necessary for VKMS.

It's a trade-off between speed and temporary memory consumed. You need
temporary buffers for two lines, but not more (not a whole image in
full intermediate precision). Further optimisation could determine
whether to process whole image rows as lines, or split a row into
multiple lines to stay within CPU cache size.

Since it seems you are blending multiple planes into a single
destination which is assumed to be opaque, you might also be able to do
the multiple blends without convert-writing and read-converting to/from
the destination between every plane. I think that might be possible to
architect on top of the per-line operations still.

> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
> ---
> drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
> drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
> 2 files changed, 271 insertions(+), 129 deletions(-)
> create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h

...

> +
> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
> +{
> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
> +
> + /*
> + * Organizes the channels in their respective positions and converts
> + * the 8 bits channel to 16.
> + * The 257 is the "conversion ratio". This number is obtained by the
> + * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
> + * the best color value in a color space with more possibilities.

Pixel format, not color space.

> + * And a similar idea applies to others RGB color conversions.
> + */
> + return ((u64)pixel_addr[3] * 257) << 48 |
> + ((u64)pixel_addr[2] * 257) << 32 |
> + ((u64)pixel_addr[1] * 257) << 16 |
> + ((u64)pixel_addr[0] * 257);

I wonder if the compiler is smart enough to choose between "mul 257"
and "(v << 8) | v" operations... but that's probably totally drowning
under the overhead of per (x,y) looping.

> +}
> +
> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
> +{
> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
> +
> + /*
> + * The same as the ARGB8888 but with the alpha channel as the
> + * maximum value as possible.
> + */
> + return 0xffffllu << 48 |
> + ((u64)pixel_addr[2] * 257) << 32 |
> + ((u64)pixel_addr[1] * 257) << 16 |
> + ((u64)pixel_addr[0] * 257);
> +}
> +
> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
> +{
> + __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
> +
> + /*
> + * Because the format byte order is in little-endian and this code
> + * needs to run on big-endian machines too, we need modify
> + * the byte order from little-endian to the CPU native byte order.
> + */
> + return le64_to_cpu(*pixel_addr);
> +}
> +
> +/*
> + * The following functions are used as blend operations. But unlike the
> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
> + * source, convert it to a specific format, and store it in the destination.
> + *
> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
> + * copy and convert one pixel from/to the output buffer to/from
> + * another buffer (e.g. writeback buffer, primary plane buffer).
> + */
> +
> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
> + struct vkms_composer *dst_composer)
> +{
> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
> +
> + /*
> + * This sequence below is important because the format's byte order is
> + * in little-endian. In the case of the ARGB8888 the memory is
> + * organized this way:
> + *
> + * | Addr | = blue channel
> + * | Addr + 1 | = green channel
> + * | Addr + 2 | = Red channel
> + * | Addr + 3 | = Alpha channel
> + */
> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
> + pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);

This could be potentially expensive if the compiler ends up using an
actual div instruction.

Btw. this would be shorter to write as

(argb_src1 >> 16) & 0xffff

etc.

Thanks,
pq

> +}
> +
> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
> + struct vkms_composer *dst_composer)
> +{
> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
> +
> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
> + pixel_addr[3] = 0xff;
> +}
> +
> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
> + struct vkms_composer *dst_composer)
> +{
> + __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
> +
> + *pixel_addr = cpu_to_le64(argb_src1);
> +}
> +
> +#endif /* _VKMS_FORMATS_H_ */


Attachments:
(No filename) (849.00 B)
OpenPGP digital signature
Subject: Re: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Hello,

On 10/18/21 7:14 AM, Thomas Zimmermann wrote:
> Hi
>
> Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
>> Currently, the vkms atomic check only goes through the first position of
>> the `vkms_wb_formats` vector.
>>
>> This change prepares the atomic_check to check the entire vector.
>>
>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>> ---
>>   drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>> index 5a3e12f105dc..56978f499203 100644
>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>> @@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct
>> drm_encoder *encoder,
>>   {
>>       struct drm_framebuffer *fb;
>>       const struct drm_display_mode *mode = &crtc_state->mode;
>> +    bool format_supported = false;
>> +    int i;
>>       if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
>>           return 0;
>> @@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct
>> drm_encoder *encoder,
>>           return -EINVAL;
>>       }
>> -    if (fb->format->format != vkms_wb_formats[0]) {
>> +    for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
>> +        if (fb->format->format == vkms_wb_formats[i]) {
>> +            format_supported = true;
>> +            break;
>> +        }
>> +    }
>
> At a minimum, this loop should be in a helper function. But more
> generally, I'm surprised that this isn't already covered by the DRM's
> atomic helpers.

Ok, I can wrap it in a new function.

AFAIK the DRM doesn't cover it. But I may be wrong...

>
> Best regards
> Thomas
>
>> +
>> +    if (!format_supported) {
>>           DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>                     &fb->format->format);
>>           return -EINVAL;
>>
>

Subject: Re: [PATCH 0/6] Refactor the vkms to accept new formats

Hi pq,

On 10/18/21 4:53 AM, Pekka Paalanen wrote:
> On Tue, 5 Oct 2021 17:16:31 -0300
> Igor Matheus Andrade Torrente <[email protected]> wrote:
>
>> XRGB to ARGB behavior
>> =================
>> During the development, I decided to always fill the alpha channel of
>> the output pixel whenever the conversion from a format without an alpha
>> channel to ARGB16161616 is necessary. Therefore, I ignore the value
>> received from the XRGB and overwrite the value with 0xFFFF.
>>
>> My question is, is this behavior acceptable?
>
> Hi,
>
> that is the expected behaviour. X channel values must never affect
> anything on screen, hence they must never affect other channels'
> values. You are free to completely ignore X channel values, and if your
> output buffer has X channel, then you are free to write (or not write,
> unless for security reasons) whatever into it.
>

Great!! And thanks!!!

>
> Thanks,
> pq
>
>>
>> [1] https://lists.freedesktop.org/archives/igt-dev/2021-October/036125.html
>>
>> Igor Matheus Andrade Torrente (6):
>> drm: vkms: Replace the deprecated drm_mode_config_init
>> drm: vkms: Alloc the compose frame using vzalloc
>> drm: vkms: Replace hardcoded value of `vkms_composer.map` to
>> DRM_FORMAT_MAX_PLANES
>> drm: vkms: Add fb information to `vkms_writeback_job`
>> drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple
>> formats
>> drm: vkms: Refactor the plane composer to accept new formats
>>
>> drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++------------
>> drivers/gpu/drm/vkms/vkms_drv.c | 5 +-
>> drivers/gpu/drm/vkms/vkms_drv.h | 12 +-
>> drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
>> drivers/gpu/drm/vkms/vkms_writeback.c | 27 ++-
>> 5 files changed, 304 insertions(+), 140 deletions(-)
>> create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
>>
>

2021-10-18 18:09:19

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Hi

Am 18.10.21 um 19:41 schrieb Igor Matheus Andrade Torrente:
> Hello,
>
> On 10/18/21 7:14 AM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
>>> Currently, the vkms atomic check only goes through the first position of
>>> the `vkms_wb_formats` vector.
>>>
>>> This change prepares the atomic_check to check the entire vector.
>>>
>>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>>> ---
>>>   drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>> index 5a3e12f105dc..56978f499203 100644
>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>> @@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct
>>> drm_encoder *encoder,
>>>   {
>>>       struct drm_framebuffer *fb;
>>>       const struct drm_display_mode *mode = &crtc_state->mode;
>>> +    bool format_supported = false;
>>> +    int i;
>>>       if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
>>>           return 0;
>>> @@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct
>>> drm_encoder *encoder,
>>>           return -EINVAL;
>>>       }
>>> -    if (fb->format->format != vkms_wb_formats[0]) {
>>> +    for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
>>> +        if (fb->format->format == vkms_wb_formats[i]) {
>>> +            format_supported = true;
>>> +            break;
>>> +        }
>>> +    }
>>
>> At a minimum, this loop should be in a helper function. But more
>> generally, I'm surprised that this isn't already covered by the DRM's
>> atomic helpers.
>
> Ok, I can wrap it in a new function.
>
> AFAIK the DRM doesn't cover it. But I may be wrong...

I couldn't find anything either.

Other drivers do similar format and frambuffer checks. So I guess a
helper could be implemented. All plane's are supposed to call
drm_atomic_helper_check_plane_state() in their atomic_check() code. You
could add a similar helper, say
drm_atomic_helper_check_writeback_encoder_state(), that tests for the
format and maybe other things as well.

Best regards
Thomas

>
>>
>> Best regards
>> Thomas
>>
>>> +
>>> +    if (!format_supported) {
>>>           DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>                     &fb->format->format);
>>>           return -EINVAL;
>>>
>>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


Attachments:
OpenPGP_signature (855.00 B)
OpenPGP digital signature
Subject: Re: [PATCH 1/6] drm: vkms: Replace the deprecated drm_mode_config_init

Hi Thomas,

On 10/18/21 7:02 AM, Thomas Zimmermann wrote:
> Hi
>
> Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
>> The `drm_mode_config_init` was deprecated since c3b790e commit, and it's
>> being replaced by the `drmm_mode_config_init`.
>>
>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>> ---
>>   drivers/gpu/drm/vkms/vkms_drv.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/vkms/vkms_drv.c
>> b/drivers/gpu/drm/vkms/vkms_drv.c
>> index 0ffe5f0e33f7..828868920494 100644
>> --- a/drivers/gpu/drm/vkms/vkms_drv.c
>> +++ b/drivers/gpu/drm/vkms/vkms_drv.c
>> @@ -140,8 +140,11 @@ static const struct drm_mode_config_helper_funcs
>> vkms_mode_config_helpers = {
>>   static int vkms_modeset_init(struct vkms_device *vkmsdev)
>>   {
>>       struct drm_device *dev = &vkmsdev->drm;
>> +    int ret = drmm_mode_config_init(dev);
>> +
>> +    if (ret < 0)
>> +        return ret;
>
> The style looks awkward IMHO. Rather use
I don't think it's awkward. But I don't mind change it anyway.

>
>  int ret
>
>  ret = drmm_mode_config_init()
>  if (ret)
>     return ret;
>
>> -    drm_mode_config_init(dev);
>>       dev->mode_config.funcs = &vkms_mode_funcs;
>>       dev->mode_config.min_width = XRES_MIN;
>>       dev->mode_config.min_height = YRES_MIN;
>>
>

Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Hi Pekka,

On 10/18/21 5:30 AM, Pekka Paalanen wrote:
> On Tue, 5 Oct 2021 17:16:37 -0300
> Igor Matheus Andrade Torrente <[email protected]> wrote:
>
>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
>> as a color input.
>>
>> This patch refactors all the functions related to the plane composition
>> to overcome this limitation.
>>
>> Now the blend function receives a format handler to each plane and a
>> blend function pointer. It will take two ARGB_1616161616 pixels, one
>> for each handler, and will use the blend function to calculate and
>> store the final color in the output buffer.
>>
>> These format handlers will receive the `vkms_composer` and a pair of
>> coordinates. And they should return the respective pixel in the
>> ARGB_16161616 format.
>>
>> The blend function will receive two ARGB_16161616 pixels, x, y, and
>> the vkms_composer of the output buffer. The method should perform the
>> blend operation and store output to the format aforementioned
>> ARGB_16161616.
>
> Hi,
>
> here are some drive-by comments which you are free to take or leave.
>
> To find more efficient ways to do this, you might want research Pixman
> library. It's MIT licensed.
>
> IIRC, the elementary operations there operate on pixel lines (pointer +
> length), not individual pixels (image, x, y). Input conversion function
> reads and converts a whole line a time. Blending function blends two
> lines and writes the output into back one of the lines. Output
> conversion function takes a line and converts it into destination
> buffer. That way the elementary operations do not need to take stride
> into account, and blending does not need to deal with varying memory
> alignment FWIW. The inner loops involve much less function calls and
> state, probably.

I was doing some rudimentary profiling, and I noticed that the code
spends most of the time with the indirect system call overhead and not
the actual computation. This can definitively help with it.

>
> Pixman may not do exactly this, but it does something very similar.
> Pixman also has multiple different levels of optimisations, which may
> not be necessary for VKMS.
>
> It's a trade-off between speed and temporary memory consumed. You need
> temporary buffers for two lines, but not more (not a whole image in
> full intermediate precision). Further optimisation could determine
> whether to process whole image rows as lines, or split a row into
> multiple lines to stay within CPU cache size.
>

Sorry, I didn't understand the idea of the last sentence.

> Since it seems you are blending multiple planes into a single
> destination which is assumed to be opaque, you might also be able to do
> the multiple blends without convert-writing and read-converting to/from
> the destination between every plane. I think that might be possible to
> architect on top of the per-line operations still.

I tried it. But I don't know how to do this without looking like a mess.
Does the pixman perform some similar?

>
>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>> ---
>> drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
>> drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
>> 2 files changed, 271 insertions(+), 129 deletions(-)
>> create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
>
> ...
>
>> +
>> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> + /*
>> + * Organizes the channels in their respective positions and converts
>> + * the 8 bits channel to 16.
>> + * The 257 is the "conversion ratio". This number is obtained by the
>> + * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
>> + * the best color value in a color space with more possibilities.
>
> Pixel format, not color space. >
>> + * And a similar idea applies to others RGB color conversions.
>> + */
>> + return ((u64)pixel_addr[3] * 257) << 48 |
>> + ((u64)pixel_addr[2] * 257) << 32 |
>> + ((u64)pixel_addr[1] * 257) << 16 |
>> + ((u64)pixel_addr[0] * 257);
>
> I wonder if the compiler is smart enough to choose between "mul 257"
> and "(v << 8) | v" operations... but that's probably totally drowning
> under the overhead of per (x,y) looping.

I disassembled the code to check it. And looks like the compiler is
replacing the multiplication with shifts and additions.

ARGB8888_to_ARGB16161616:
0xffffffff816ad660 <+0>: imul 0x12c(%rdi),%edx
0xffffffff816ad667 <+7>: imul 0x130(%rdi),%esi
0xffffffff816ad66e <+14>: add %edx,%esi
0xffffffff816ad670 <+16>: add 0x128(%rdi),%esi
0xffffffff816ad676 <+22>: movslq %esi,%rax
0xffffffff816ad679 <+25>: add 0xe8(%rdi),%rax
0xffffffff816ad680 <+32>: movzbl 0x3(%rax),%ecx
0xffffffff816ad684 <+36>: movzbl 0x2(%rax),%esi
0xffffffff816ad688 <+40>: mov %rcx,%rdx
0xffffffff816ad68b <+43>: shl $0x8,%rdx
0xffffffff816ad68f <+47>: add %rcx,%rdx
0xffffffff816ad692 <+50>: mov %rsi,%rcx
0xffffffff816ad695 <+53>: shl $0x8,%rcx
0xffffffff816ad699 <+57>: shl $0x30,%rdx
0xffffffff816ad69d <+61>: add %rsi,%rcx
0xffffffff816ad6a0 <+64>: movzbl (%rax),%esi
0xffffffff816ad6a3 <+67>: shl $0x20,%rcx
0xffffffff816ad6a7 <+71>: or %rcx,%rdx
0xffffffff816ad6aa <+74>: mov %rsi,%rcx
0xffffffff816ad6ad <+77>: shl $0x8,%rcx
0xffffffff816ad6b1 <+81>: add %rsi,%rcx
0xffffffff816ad6b4 <+84>: or %rcx,%rdx
0xffffffff816ad6b7 <+87>: movzbl 0x1(%rax),%ecx
0xffffffff816ad6bb <+91>: mov %rcx,%rax
0xffffffff816ad6be <+94>: shl $0x8,%rax
0xffffffff816ad6c2 <+98>: add %rcx,%rax
0xffffffff816ad6c5 <+101>: shl $0x10,%rax
0xffffffff816ad6c9 <+105>: or %rdx,%rax
0xffffffff816ad6cc <+108>: ret

>
>> +}
>> +
>> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> + /*
>> + * The same as the ARGB8888 but with the alpha channel as the
>> + * maximum value as possible.
>> + */
>> + return 0xffffllu << 48 |
>> + ((u64)pixel_addr[2] * 257) << 32 |
>> + ((u64)pixel_addr[1] * 257) << 16 |
>> + ((u64)pixel_addr[0] * 257);
>> +}
>> +
>> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> + __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> + /*
>> + * Because the format byte order is in little-endian and this code
>> + * needs to run on big-endian machines too, we need modify
>> + * the byte order from little-endian to the CPU native byte order.
>> + */
>> + return le64_to_cpu(*pixel_addr);
>> +}
>> +
>> +/*
>> + * The following functions are used as blend operations. But unlike the
>> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
>> + * source, convert it to a specific format, and store it in the destination.
>> + *
>> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
>> + * copy and convert one pixel from/to the output buffer to/from
>> + * another buffer (e.g. writeback buffer, primary plane buffer).
>> + */
>> +
>> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> + struct vkms_composer *dst_composer)
>> +{
>> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> + /*
>> + * This sequence below is important because the format's byte order is
>> + * in little-endian. In the case of the ARGB8888 the memory is
>> + * organized this way:
>> + *
>> + * | Addr | = blue channel
>> + * | Addr + 1 | = green channel
>> + * | Addr + 2 | = Red channel
>> + * | Addr + 3 | = Alpha channel
>> + */
>> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> + pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
>
> This could be potentially expensive if the compiler ends up using an
> actual div instruction.
>
Yes, I'm using the DIV_ROUND_UP because I couldn't figure out how the
"Faster div by 255" works to adapt to 16 bits.

> Btw. this would be shorter to write as
>
> (argb_src1 >> 16) & 0xffff
>
> etc.
I will use it in the V2. Thanks.

>
> Thanks,
> pq
>
>> +}
>> +
>> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> + struct vkms_composer *dst_composer)
>> +{
>> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> + pixel_addr[3] = 0xff;
>> +}
>> +
>> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
>> + struct vkms_composer *dst_composer)
>> +{
>> + __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> + *pixel_addr = cpu_to_le64(argb_src1);
>> +}
>> +
>> +#endif /* _VKMS_FORMATS_H_ */
>

Subject: Re: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Hi Thomas,

On 10/18/21 3:06 PM, Thomas Zimmermann wrote:
> Hi
>
> Am 18.10.21 um 19:41 schrieb Igor Matheus Andrade Torrente:
>> Hello,
>>
>> On 10/18/21 7:14 AM, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
>>>> Currently, the vkms atomic check only goes through the first
>>>> position of
>>>> the `vkms_wb_formats` vector.
>>>>
>>>> This change prepares the atomic_check to check the entire vector.
>>>>
>>>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>>>> ---
>>>>   drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
>>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> index 5a3e12f105dc..56978f499203 100644
>>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> @@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct
>>>> drm_encoder *encoder,
>>>>   {
>>>>       struct drm_framebuffer *fb;
>>>>       const struct drm_display_mode *mode = &crtc_state->mode;
>>>> +    bool format_supported = false;
>>>> +    int i;
>>>>       if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
>>>>           return 0;
>>>> @@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct
>>>> drm_encoder *encoder,
>>>>           return -EINVAL;
>>>>       }
>>>> -    if (fb->format->format != vkms_wb_formats[0]) {
>>>> +    for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
>>>> +        if (fb->format->format == vkms_wb_formats[i]) {
>>>> +            format_supported = true;
>>>> +            break;
>>>> +        }
>>>> +    }
>>>
>>> At a minimum, this loop should be in a helper function. But more
>>> generally, I'm surprised that this isn't already covered by the DRM's
>>> atomic helpers.
>>
>> Ok, I can wrap it in a new function.
>>
>> AFAIK the DRM doesn't cover it. But I may be wrong...
>
> I couldn't find anything either.
>
> Other drivers do similar format and frambuffer checks. So I guess a
> helper could be implemented. All plane's are supposed to call
> drm_atomic_helper_check_plane_state() in their atomic_check() code. You
> could add a similar helper, say
> drm_atomic_helper_check_writeback_encoder_state(), that tests for the
> format and maybe other things as well.

Do you think this should be done before or after this patch series?

>
> Best regards
> Thomas
>
>>
>>>
>>> Best regards
>>> Thomas
>>>
>>>> +
>>>> +    if (!format_supported) {
>>>>           DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>>                     &fb->format->format);
>>>>           return -EINVAL;
>>>>
>>>
>

Thanks,
Igor Torrente

2021-10-19 07:22:21

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Hi

Am 18.10.21 um 21:32 schrieb Igor Matheus Andrade Torrente:
> Hi Thomas,
>
> On 10/18/21 3:06 PM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 18.10.21 um 19:41 schrieb Igor Matheus Andrade Torrente:
>>> Hello,
>>>
>>> On 10/18/21 7:14 AM, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
>>>>> Currently, the vkms atomic check only goes through the first
>>>>> position of
>>>>> the `vkms_wb_formats` vector.
>>>>>
>>>>> This change prepares the atomic_check to check the entire vector.
>>>>>
>>>>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>>>>> ---
>>>>>   drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
>>>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> index 5a3e12f105dc..56978f499203 100644
>>>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> @@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct
>>>>> drm_encoder *encoder,
>>>>>   {
>>>>>       struct drm_framebuffer *fb;
>>>>>       const struct drm_display_mode *mode = &crtc_state->mode;
>>>>> +    bool format_supported = false;
>>>>> +    int i;
>>>>>       if (!conn_state->writeback_job ||
>>>>> !conn_state->writeback_job->fb)
>>>>>           return 0;
>>>>> @@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct
>>>>> drm_encoder *encoder,
>>>>>           return -EINVAL;
>>>>>       }
>>>>> -    if (fb->format->format != vkms_wb_formats[0]) {
>>>>> +    for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
>>>>> +        if (fb->format->format == vkms_wb_formats[i]) {
>>>>> +            format_supported = true;
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>
>>>> At a minimum, this loop should be in a helper function. But more
>>>> generally, I'm surprised that this isn't already covered by the
>>>> DRM's atomic helpers.
>>>
>>> Ok, I can wrap it in a new function.
>>>
>>> AFAIK the DRM doesn't cover it. But I may be wrong...
>>
>> I couldn't find anything either.
>>
>> Other drivers do similar format and frambuffer checks. So I guess a
>> helper could be implemented. All plane's are supposed to call
>> drm_atomic_helper_check_plane_state() in their atomic_check() code.
>> You could add a similar helper, say
>> drm_atomic_helper_check_writeback_encoder_state(), that tests for the
>> format and maybe other things as well.
>
> Do you think this should be done before or after this patch series?

Just add it as part of this series and use it for vkms. Other drivers
can adopt it later on. The rcar-du code [1] looks similar to the one in
vkms. Maybe put the common tests in to the new helper. You can extract
the list of supported formats from the property blob, I think.

Best regards
Thomas

[1]
https://elixir.bootlin.com/linux/v5.14.13/source/drivers/gpu/drm/rcar-du/rcar_du_writeback.c#L140

>
>>
>> Best regards
>> Thomas
>>
>>>
>>>>
>>>> Best regards
>>>> Thomas
>>>>
>>>>> +
>>>>> +    if (!format_supported) {
>>>>>           DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>>>                     &fb->format->format);
>>>>>           return -EINVAL;
>>>>>
>>>>
>>
>
> Thanks,
> Igor Torrente

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


Attachments:
OpenPGP_signature (855.00 B)
OpenPGP digital signature

2021-10-19 08:08:36

by Pekka Paalanen

[permalink] [raw]
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

On Mon, 18 Oct 2021 16:26:06 -0300
Igor Matheus Andrade Torrente <[email protected]> wrote:

> Hi Pekka,
>
> On 10/18/21 5:30 AM, Pekka Paalanen wrote:
> > On Tue, 5 Oct 2021 17:16:37 -0300
> > Igor Matheus Andrade Torrente <[email protected]> wrote:
> >
> >> Currently the blend function only accepts XRGB_8888 and ARGB_8888
> >> as a color input.
> >>
> >> This patch refactors all the functions related to the plane composition
> >> to overcome this limitation.
> >>
> >> Now the blend function receives a format handler to each plane and a
> >> blend function pointer. It will take two ARGB_1616161616 pixels, one
> >> for each handler, and will use the blend function to calculate and
> >> store the final color in the output buffer.
> >>
> >> These format handlers will receive the `vkms_composer` and a pair of
> >> coordinates. And they should return the respective pixel in the
> >> ARGB_16161616 format.
> >>
> >> The blend function will receive two ARGB_16161616 pixels, x, y, and
> >> the vkms_composer of the output buffer. The method should perform the
> >> blend operation and store output to the format aforementioned
> >> ARGB_16161616.
> >
> > Hi,
> >
> > here are some drive-by comments which you are free to take or leave.
> >
> > To find more efficient ways to do this, you might want research Pixman
> > library. It's MIT licensed.
> >
> > IIRC, the elementary operations there operate on pixel lines (pointer +
> > length), not individual pixels (image, x, y). Input conversion function
> > reads and converts a whole line a time. Blending function blends two
> > lines and writes the output into back one of the lines. Output
> > conversion function takes a line and converts it into destination
> > buffer. That way the elementary operations do not need to take stride
> > into account, and blending does not need to deal with varying memory
> > alignment FWIW. The inner loops involve much less function calls and
> > state, probably.
>
> I was doing some rudimentary profiling, and I noticed that the code
> spends most of the time with the indirect system call overhead and not
> the actual computation. This can definitively help with it.

Hi,

I suppose you mean function (pointer) call, not system call?

Really good that you have already profiled things. All optimisations
should be guided by profiling, otherwise they are just guesses (and I
got lucky this time I guess).

> > Pixman may not do exactly this, but it does something very similar.
> > Pixman also has multiple different levels of optimisations, which may
> > not be necessary for VKMS.
> >
> > It's a trade-off between speed and temporary memory consumed. You need
> > temporary buffers for two lines, but not more (not a whole image in
> > full intermediate precision). Further optimisation could determine
> > whether to process whole image rows as lines, or split a row into
> > multiple lines to stay within CPU cache size.
> >
>
> Sorry, I didn't understand the idea of the last sentence.

If an image is very wide, a single row could still be relatively large
in size (bytes). If it is large enough that the working set falls out
of a faster CPU cache into a slower CPU cache (or worse yet, into RAM
accesses), performance may suffer and become memory bandwidth limited
rather than ALU rate limited. Theoretically that can be worked around
by limiting the maximum size of a line, and splitting an image row into
multiple lines.

However, this is an optimisation one probably should not do until there
is performance profiling data showing that it actually is useful. The
optimal line size limit depends on the CPU model as well. So it's a bit
far out, but something you could keep in mind just in case.

> > Since it seems you are blending multiple planes into a single
> > destination which is assumed to be opaque, you might also be able to do
> > the multiple blends without convert-writing and read-converting to/from
> > the destination between every plane. I think that might be possible to
> > architect on top of the per-line operations still.
>
> I tried it. But I don't know how to do this without looking like a mess.

Dedicate one of the temporary line buffers for the destination, and
instead of writing it out and loading it back for each input plane,
leave it in place over all planes and write it out just once at the end.

I do expect more state tracking needed. You need to iterate over the
list of planes for each output row, extract only the used part of an
input plane's buffer into the other temporary line buffer, and offset
the destination line buffer and length to match when passing those into
a blending function.

It's not an obvious win but a trade-off, so profiling is again needed.

Btw. the use of temporary line buffers should also help with
implementing scaling. You could do the filtering during reading of the
input buffer. If the filter is not nearest, meaning you need to access
more than one input pixel per pixel-for-blending, there are a few ways
to go about that. You could do the filtering during the input buffer
reading, or you could load two input buffer rows into temporary line
buffers and do filtering as a separate step into yet another line
buffer. As the composition advances from top to bottom, only one of the
input buffer rows will change at a time (during up-scaling) so you can
avoid re-loading a row by swapping the line buffers.

This is getting ahead of things, so don't worry about scaling or
filtering yet. The first thing is to see if you can make the line
buffer approach give you a significant speed-up.

> Does the pixman perform some similar?

No, Pixman composition operation has only three images: source,
mask, and destination. All those it can handle simultaneously, but
since there is no "multi-blending" API, it doesn't need to take care of
more.

IIRC, Pixman also has a form of optimised operations that do blending
and converting to destination in the same pass. The advantage of that
is that blending can work with less precision when it knows what
precision the output will be converted to and it saves some bandwidth
by not needing to write-after-blending and read-for-conversion
intermediate precision values. The disadvantage is that the number of
needed specialised blending functions explodes by the number of
possible destination formats. Pixman indeed makes those specialised
functions optional, falling back to more generic C code. I would hope
that VKMS does not need to go this far in optimisations though.

> >
> >> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
> >> ---
> >> drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
> >> drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
> >> 2 files changed, 271 insertions(+), 129 deletions(-)
> >> create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> >
> > ...
> >
> >> +
> >> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
> >> +{
> >> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
> >> +
> >> + /*
> >> + * Organizes the channels in their respective positions and converts
> >> + * the 8 bits channel to 16.
> >> + * The 257 is the "conversion ratio". This number is obtained by the
> >> + * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
> >> + * the best color value in a color space with more possibilities.
> >
> > Pixel format, not color space. >
> >> + * And a similar idea applies to others RGB color conversions.
> >> + */
> >> + return ((u64)pixel_addr[3] * 257) << 48 |
> >> + ((u64)pixel_addr[2] * 257) << 32 |
> >> + ((u64)pixel_addr[1] * 257) << 16 |
> >> + ((u64)pixel_addr[0] * 257);
> >
> > I wonder if the compiler is smart enough to choose between "mul 257"
> > and "(v << 8) | v" operations... but that's probably totally drowning
> > under the overhead of per (x,y) looping.
>
> I disassembled the code to check it. And looks like the compiler is
> replacing the multiplication with shifts and additions.
>
> ARGB8888_to_ARGB16161616:
> 0xffffffff816ad660 <+0>: imul 0x12c(%rdi),%edx
> 0xffffffff816ad667 <+7>: imul 0x130(%rdi),%esi
> 0xffffffff816ad66e <+14>: add %edx,%esi
> 0xffffffff816ad670 <+16>: add 0x128(%rdi),%esi
> 0xffffffff816ad676 <+22>: movslq %esi,%rax
> 0xffffffff816ad679 <+25>: add 0xe8(%rdi),%rax
> 0xffffffff816ad680 <+32>: movzbl 0x3(%rax),%ecx
> 0xffffffff816ad684 <+36>: movzbl 0x2(%rax),%esi
> 0xffffffff816ad688 <+40>: mov %rcx,%rdx
> 0xffffffff816ad68b <+43>: shl $0x8,%rdx
> 0xffffffff816ad68f <+47>: add %rcx,%rdx
> 0xffffffff816ad692 <+50>: mov %rsi,%rcx
> 0xffffffff816ad695 <+53>: shl $0x8,%rcx
> 0xffffffff816ad699 <+57>: shl $0x30,%rdx
> 0xffffffff816ad69d <+61>: add %rsi,%rcx
> 0xffffffff816ad6a0 <+64>: movzbl (%rax),%esi
> 0xffffffff816ad6a3 <+67>: shl $0x20,%rcx
> 0xffffffff816ad6a7 <+71>: or %rcx,%rdx
> 0xffffffff816ad6aa <+74>: mov %rsi,%rcx
> 0xffffffff816ad6ad <+77>: shl $0x8,%rcx
> 0xffffffff816ad6b1 <+81>: add %rsi,%rcx
> 0xffffffff816ad6b4 <+84>: or %rcx,%rdx
> 0xffffffff816ad6b7 <+87>: movzbl 0x1(%rax),%ecx
> 0xffffffff816ad6bb <+91>: mov %rcx,%rax
> 0xffffffff816ad6be <+94>: shl $0x8,%rax
> 0xffffffff816ad6c2 <+98>: add %rcx,%rax
> 0xffffffff816ad6c5 <+101>: shl $0x10,%rax
> 0xffffffff816ad6c9 <+105>: or %rdx,%rax
> 0xffffffff816ad6cc <+108>: ret

Nice!

> >
> >> +}
> >> +
> >> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
> >> +{
> >> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
> >> +
> >> + /*
> >> + * The same as the ARGB8888 but with the alpha channel as the
> >> + * maximum value as possible.
> >> + */
> >> + return 0xffffllu << 48 |
> >> + ((u64)pixel_addr[2] * 257) << 32 |
> >> + ((u64)pixel_addr[1] * 257) << 16 |
> >> + ((u64)pixel_addr[0] * 257);
> >> +}
> >> +
> >> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
> >> +{
> >> + __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
> >> +
> >> + /*
> >> + * Because the format byte order is in little-endian and this code
> >> + * needs to run on big-endian machines too, we need modify
> >> + * the byte order from little-endian to the CPU native byte order.
> >> + */
> >> + return le64_to_cpu(*pixel_addr);
> >> +}
> >> +
> >> +/*
> >> + * The following functions are used as blend operations. But unlike the
> >> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
> >> + * source, convert it to a specific format, and store it in the destination.
> >> + *
> >> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
> >> + * copy and convert one pixel from/to the output buffer to/from
> >> + * another buffer (e.g. writeback buffer, primary plane buffer).
> >> + */
> >> +
> >> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
> >> + struct vkms_composer *dst_composer)
> >> +{
> >> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
> >> +
> >> + /*
> >> + * This sequence below is important because the format's byte order is
> >> + * in little-endian. In the case of the ARGB8888 the memory is
> >> + * organized this way:
> >> + *
> >> + * | Addr | = blue channel
> >> + * | Addr + 1 | = green channel
> >> + * | Addr + 2 | = Red channel
> >> + * | Addr + 3 | = Alpha channel
> >> + */
> >> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
> >> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
> >> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
> >> + pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
> >
> > This could be potentially expensive if the compiler ends up using an
> > actual div instruction.
> >
> Yes, I'm using the DIV_ROUND_UP because I couldn't figure out how the
> "Faster div by 255" works to adapt to 16 bits.

But does the compiler actually do a div instruction? If not, then no
worries. If it does, then maybe something to look into, *if* this shows
up in profiling.


Thanks,
pq

> > Btw. this would be shorter to write as
> >
> > (argb_src1 >> 16) & 0xffff
> >
> > etc.
> I will use it in the V2. Thanks.
>
> >
> > Thanks,
> > pq
> >
> >> +}
> >> +
> >> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
> >> + struct vkms_composer *dst_composer)
> >> +{
> >> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
> >> +
> >> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
> >> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
> >> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
> >> + pixel_addr[3] = 0xff;
> >> +}
> >> +
> >> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
> >> + struct vkms_composer *dst_composer)
> >> +{
> >> + __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
> >> +
> >> + *pixel_addr = cpu_to_le64(argb_src1);
> >> +}
> >> +
> >> +#endif /* _VKMS_FORMATS_H_ */
> >


Attachments:
(No filename) (849.00 B)
OpenPGP digital signature
Subject: Re: [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats

Hi Thomas,

On 10/19/21 4:17 AM, Thomas Zimmermann wrote:
> Hi
>
> Am 18.10.21 um 21:32 schrieb Igor Matheus Andrade Torrente:
>> Hi Thomas,
>>
>> On 10/18/21 3:06 PM, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> Am 18.10.21 um 19:41 schrieb Igor Matheus Andrade Torrente:
>>>> Hello,
>>>>
>>>> On 10/18/21 7:14 AM, Thomas Zimmermann wrote:
>>>>> Hi
>>>>>
>>>>> Am 05.10.21 um 22:16 schrieb Igor Matheus Andrade Torrente:
>>>>>> Currently, the vkms atomic check only goes through the first
>>>>>> position of
>>>>>> the `vkms_wb_formats` vector.
>>>>>>
>>>>>> This change prepares the atomic_check to check the entire vector.
>>>>>>
>>>>>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>>>>>> ---
>>>>>>   drivers/gpu/drm/vkms/vkms_writeback.c | 11 ++++++++++-
>>>>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>>> index 5a3e12f105dc..56978f499203 100644
>>>>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>>> @@ -30,6 +30,8 @@ static int vkms_wb_encoder_atomic_check(struct
>>>>>> drm_encoder *encoder,
>>>>>>   {
>>>>>>       struct drm_framebuffer *fb;
>>>>>>       const struct drm_display_mode *mode = &crtc_state->mode;
>>>>>> +    bool format_supported = false;
>>>>>> +    int i;
>>>>>>       if (!conn_state->writeback_job ||
>>>>>> !conn_state->writeback_job->fb)
>>>>>>           return 0;
>>>>>> @@ -41,7 +43,14 @@ static int vkms_wb_encoder_atomic_check(struct
>>>>>> drm_encoder *encoder,
>>>>>>           return -EINVAL;
>>>>>>       }
>>>>>> -    if (fb->format->format != vkms_wb_formats[0]) {
>>>>>> +    for (i = 0; i < ARRAY_SIZE(vkms_wb_formats); i++) {
>>>>>> +        if (fb->format->format == vkms_wb_formats[i]) {
>>>>>> +            format_supported = true;
>>>>>> +            break;
>>>>>> +        }
>>>>>> +    }
>>>>>
>>>>> At a minimum, this loop should be in a helper function. But more
>>>>> generally, I'm surprised that this isn't already covered by the
>>>>> DRM's atomic helpers.
>>>>
>>>> Ok, I can wrap it in a new function.
>>>>
>>>> AFAIK the DRM doesn't cover it. But I may be wrong...
>>>
>>> I couldn't find anything either.
>>>
>>> Other drivers do similar format and frambuffer checks. So I guess a
>>> helper could be implemented. All plane's are supposed to call
>>> drm_atomic_helper_check_plane_state() in their atomic_check() code.
>>> You could add a similar helper, say
>>> drm_atomic_helper_check_writeback_encoder_state(), that tests for the
>>> format and maybe other things as well.
>>
>> Do you think this should be done before or after this patch series?
>
> Just add it as part of this series and use it for vkms. Other drivers
> can adopt it later on. The rcar-du code [1] looks similar to the one in
> vkms. Maybe put the common tests in to the new helper. You can extract
> the list of supported formats from the property blob, I think.
>

OK, Thanks!

> Best regards
> Thomas
>
> [1]
> https://elixir.bootlin.com/linux/v5.14.13/source/drivers/gpu/drm/rcar-du/rcar_du_writeback.c#L140
>
>>
>>>
>>> Best regards
>>> Thomas
>>>
>>>>
>>>>>
>>>>> Best regards
>>>>> Thomas
>>>>>
>>>>>> +
>>>>>> +    if (!format_supported) {
>>>>>>           DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>>>>                     &fb->format->format);
>>>>>>           return -EINVAL;
>>>>>>
>>>>>
>>>
>>
>> Thanks,
>> Igor Torrente
>

Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Hi Pekka,

On 10/19/21 5:05 AM, Pekka Paalanen wrote:
> On Mon, 18 Oct 2021 16:26:06 -0300
> Igor Matheus Andrade Torrente <[email protected]> wrote:
>
>> Hi Pekka,
>>
>> On 10/18/21 5:30 AM, Pekka Paalanen wrote:
>>> On Tue, 5 Oct 2021 17:16:37 -0300
>>> Igor Matheus Andrade Torrente <[email protected]> wrote:
>>>
>>>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
>>>> as a color input.
>>>>
>>>> This patch refactors all the functions related to the plane composition
>>>> to overcome this limitation.
>>>>
>>>> Now the blend function receives a format handler to each plane and a
>>>> blend function pointer. It will take two ARGB_1616161616 pixels, one
>>>> for each handler, and will use the blend function to calculate and
>>>> store the final color in the output buffer.
>>>>
>>>> These format handlers will receive the `vkms_composer` and a pair of
>>>> coordinates. And they should return the respective pixel in the
>>>> ARGB_16161616 format.
>>>>
>>>> The blend function will receive two ARGB_16161616 pixels, x, y, and
>>>> the vkms_composer of the output buffer. The method should perform the
>>>> blend operation and store output to the format aforementioned
>>>> ARGB_16161616.
>>>
>>> Hi,
>>>
>>> here are some drive-by comments which you are free to take or leave.
>>>
>>> To find more efficient ways to do this, you might want research
>>> library. It's MIT licensed.
>>>
>>> IIRC, the elementary operations there operate on pixel lines (pointer +
>>> length), not individual pixels (image, x, y). Input conversion function
>>> reads and converts a whole line a time. Blending function blends two
>>> lines and writes the output into back one of the lines. Output
>>> conversion function takes a line and converts it into destination
>>> buffer. That way the elementary operations do not need to take stride
>>> into account, and blending does not need to deal with varying memory
>>> alignment FWIW. The inner loops involve much less function calls and
>>> state, probably.
>>
>> I was doing some rudimentary profiling, and I noticed that the code
>> spends most of the time with the indirect system call overhead and not
>> the actual computation. This can definitively help with it.
>
> Hi,
>
> I suppose you mean function (pointer) call, not system call?

Yes, I misspelled it.

>
> Really good that you have already profiled things. All optimisations
> should be guided by profiling, otherwise they are just guesses (and I
> got lucky this time I guess).
>
>>> Pixman may not do exactly this, but it does something very similar.
>>> Pixman also has multiple different levels of optimisations, which may
>>> not be necessary for VKMS.
>>>
>>> It's a trade-off between speed and temporary memory consumed. You need
>>> temporary buffers for two lines, but not more (not a whole image in
>>> full intermediate precision). Further optimisation could determine
>>> whether to process whole image rows as lines, or split a row into
>>> multiple lines to stay within CPU cache size.
>>>
>>
>> Sorry, I didn't understand the idea of the last sentence.
>
> If an image is very wide, a single row could still be relatively large
> in size (bytes). If it is large enough that the working set falls out
> of a faster CPU cache into a slower CPU cache (or worse yet, into RAM
> accesses), performance may suffer and become memory bandwidth limited
> rather than ALU rate limited. Theoretically that can be worked around
> by limiting the maximum size of a line, and splitting an image row into
> multiple lines.

Got it now, thanks.

>
> However, this is an optimisation one probably should not do until there
> is performance profiling data showing that it actually is useful. The
> optimal line size limit depends on the CPU model as well. So it's a bit
> far out, but something you could keep in mind just in case.

OK. For now I will not deal with this complexity, and if necessary I
will revisit it latter.

>
>>> Since it seems you are blending multiple planes into a single
>>> destination which is assumed to be opaque, you might also be able to do
>>> the multiple blends without convert-writing and read-converting to/from
>>> the destination between every plane. I think that might be possible to
>>> architect on top of the per-line operations still.
>>
>> I tried it. But I don't know how to do this without looking like a mess.

I don't think it changes anything, but I forgot to mention that I tried
it with the blend per pixel and not a per line.

>
> Dedicate one of the temporary line buffers for the destination, and
> instead of writing it out and loading it back for each input plane,
> leave it in place over all planes and write it out just once at the end.
>
> I do expect more state tracking needed. You need to iterate over the
> list of planes for each output row, extract only the used part of an
> input plane's buffer into the other temporary line buffer, and offset
> the destination line buffer and length to match when passing those into
> a blending function.+

It's exactly this state tracking that was a mess when I was trying to
implement something similar. But this is one more thing that probably
I will have to revisit later.

>
> It's not an obvious win but a trade-off, so profiling is again needed.
>
> Btw. the use of temporary line buffers should also help with
> implementing scaling. You could do the filtering during reading of the
> input buffer. If the filter is not nearest, meaning you need to access
> more than one input pixel per pixel-for-blending, there are a few ways
> to go about that. You could do the filtering during the input buffer
> reading, or you could load two input buffer rows into temporary line
> buffers and do filtering as a separate step into yet another line
> buffer. As the composition advances from top to bottom, only one of the
> input buffer rows will change at a time (during up-scaling) so you can
> avoid re-loading a row by swapping the line buffers.
>
> This is getting ahead of things, so don't worry about scaling or
> filtering yet. The first thing is to see if you can make the line
> buffer approach give you a significant speed-up.
>
>> Does the pixman perform some similar?
>
> No, Pixman composition operation has only three images: source,
> mask, and destination. All those it can handle simultaneously, but
> since there is no "multi-blending" API, it doesn't need to take care of
> more.
>
> IIRC, Pixman also has a form of optimised operations that do blending
> and converting to destination in the same pass. The advantage of that
> is that blending can work with less precision when it knows what
> precision the output will be converted to and it saves some bandwidth
> by not needing to write-after-blending and read-for-conversion
> intermediate precision values. The disadvantage is that the number of
> needed specialised blending functions explodes by the number of
> possible destination formats. Pixman indeed makes those specialised
> functions optional, falling back to more generic C code. I would hope
> that VKMS does not need to go this far in optimisations though.

This should be plenty fast indeed. Maybe worth for formats that are
extremely common.

>
>>>
>>>> Signed-off-by: Igor Matheus Andrade Torrente <[email protected]>
>>>> ---
>>>> drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
>>>> drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
>>>> 2 files changed, 271 insertions(+), 129 deletions(-)
>>>> create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
>>>
>>> ...
>>>
>>>> +
>>>> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>>>> +{
>>>> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>>>> +
>>>> + /*
>>>> + * Organizes the channels in their respective positions and converts
>>>> + * the 8 bits channel to 16.
>>>> + * The 257 is the "conversion ratio". This number is obtained by the
>>>> + * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
>>>> + * the best color value in a color space with more possibilities.
>>>
>>> Pixel format, not color space. >
>>>> + * And a similar idea applies to others RGB color conversions.
>>>> + */
>>>> + return ((u64)pixel_addr[3] * 257) << 48 |
>>>> + ((u64)pixel_addr[2] * 257) << 32 |
>>>> + ((u64)pixel_addr[1] * 257) << 16 |
>>>> + ((u64)pixel_addr[0] * 257);
>>>
>>> I wonder if the compiler is smart enough to choose between "mul 257"
>>> and "(v << 8) | v" operations... but that's probably totally drowning
>>> under the overhead of per (x,y) looping.
>>
>> I disassembled the code to check it. And looks like the compiler is
>> replacing the multiplication with shifts and additions.
>>
>> ARGB8888_to_ARGB16161616:
>> 0xffffffff816ad660 <+0>: imul 0x12c(%rdi),%edx
>> 0xffffffff816ad667 <+7>: imul 0x130(%rdi),%esi
>> 0xffffffff816ad66e <+14>: add %edx,%esi
>> 0xffffffff816ad670 <+16>: add 0x128(%rdi),%esi
>> 0xffffffff816ad676 <+22>: movslq %esi,%rax
>> 0xffffffff816ad679 <+25>: add 0xe8(%rdi),%rax
>> 0xffffffff816ad680 <+32>: movzbl 0x3(%rax),%ecx
>> 0xffffffff816ad684 <+36>: movzbl 0x2(%rax),%esi
>> 0xffffffff816ad688 <+40>: mov %rcx,%rdx
>> 0xffffffff816ad68b <+43>: shl $0x8,%rdx
>> 0xffffffff816ad68f <+47>: add %rcx,%rdx
>> 0xffffffff816ad692 <+50>: mov %rsi,%rcx
>> 0xffffffff816ad695 <+53>: shl $0x8,%rcx
>> 0xffffffff816ad699 <+57>: shl $0x30,%rdx
>> 0xffffffff816ad69d <+61>: add %rsi,%rcx
>> 0xffffffff816ad6a0 <+64>: movzbl (%rax),%esi
>> 0xffffffff816ad6a3 <+67>: shl $0x20,%rcx
>> 0xffffffff816ad6a7 <+71>: or %rcx,%rdx
>> 0xffffffff816ad6aa <+74>: mov %rsi,%rcx
>> 0xffffffff816ad6ad <+77>: shl $0x8,%rcx
>> 0xffffffff816ad6b1 <+81>: add %rsi,%rcx
>> 0xffffffff816ad6b4 <+84>: or %rcx,%rdx
>> 0xffffffff816ad6b7 <+87>: movzbl 0x1(%rax),%ecx
>> 0xffffffff816ad6bb <+91>: mov %rcx,%rax
>> 0xffffffff816ad6be <+94>: shl $0x8,%rax
>> 0xffffffff816ad6c2 <+98>: add %rcx,%rax
>> 0xffffffff816ad6c5 <+101>: shl $0x10,%rax
>> 0xffffffff816ad6c9 <+105>: or %rdx,%rax
>> 0xffffffff816ad6cc <+108>: ret
>
> Nice!
>
>>>
>>>> +}
>>>> +
>>>> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>>>> +{
>>>> + u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>>>> +
>>>> + /*
>>>> + * The same as the ARGB8888 but with the alpha channel as the
>>>> + * maximum value as possible.
>>>> + */
>>>> + return 0xffffllu << 48 |
>>>> + ((u64)pixel_addr[2] * 257) << 32 |
>>>> + ((u64)pixel_addr[1] * 257) << 16 |
>>>> + ((u64)pixel_addr[0] * 257);
>>>> +}
>>>> +
>>>> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
>>>> +{
>>>> + __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
>>>> +
>>>> + /*
>>>> + * Because the format byte order is in little-endian and this code
>>>> + * needs to run on big-endian machines too, we need modify
>>>> + * the byte order from little-endian to the CPU native byte order.
>>>> + */
>>>> + return le64_to_cpu(*pixel_addr);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The following functions are used as blend operations. But unlike the
>>>> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
>>>> + * source, convert it to a specific format, and store it in the destination.
>>>> + *
>>>> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
>>>> + * copy and convert one pixel from/to the output buffer to/from
>>>> + * another buffer (e.g. writeback buffer, primary plane buffer).
>>>> + */
>>>> +
>>>> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>>>> + struct vkms_composer *dst_composer)
>>>> +{
>>>> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>>>> +
>>>> + /*This should be plenty fast indeed. Maybe worth for formats that are
extremely common.
>>>> + * This sequence below is important because the format's byte order is
>>>> + * in little-endian. In the case of the ARGB8888 the memory is
>>>> + * organized this way:
>>>> + *
>>>> + * | Addr | = blue channel
>>>> + * | Addr + 1 | = green channel
>>>> + * | Addr + 2 | = Red channel
>>>> + * | Addr + 3 | = Alpha channel
>>>> + */
>>>> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>>>> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>>>> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>>>> + pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
>>>
>>> This could be potentially expensive if the compiler ends up using an
>>> actual div instruction.
>>>
>> Yes, I'm using the DIV_ROUND_UP because I couldn't figure out how the
>> "Faster div by 255" works to adapt to 16 bits.
>
> But does the compiler actually do a div instruction? If not, then no
> worries. If it does, then maybe something to look into, *if* this shows
> up in profiling.

GCC isn't issuing any div instruction here.

convert_to_ARGB8888:
0xffffffff816ad770 <+0>: imul 0x12c(%rcx),%edx
0xffffffff816ad777 <+7>: mov %rcx,%rax
0xffffffff816ad77a <+10>: imul 0x130(%rcx),%esi
0xffffffff816ad781 <+17>: add %edx,%esi
0xffffffff816ad783 <+19>: movzwl %di,%edx
0xffffffff816ad786 <+22>: add 0x128(%rcx),%esi
0xffffffff816ad78c <+28>: add $0x100,%rdx
0xffffffff816ad793 <+35>: movslq %esi,%rcx
0xffffffff816ad796 <+38>: movabs $0xff00ff00ff00ff01,%rsi
0xffffffff816ad7a0 <+48>: add 0xe8(%rax),%rcx
0xffffffff816ad7a7 <+55>: mov %rdx,%rax
0xffffffff816ad7aa <+58>: mul %rsi
0xffffffff816ad7ad <+61>: shr $0x8,%rdx
0xffffffff816ad7b1 <+65>: mov %dl,(%rcx)
0xffffffff816ad7b3 <+67>: mov %rdi,%rdx
0xffffffff816ad7b6 <+70>: shr $0x10,%rdx
0xffffffff816ad7ba <+74>: movzwl %dx,%edx
0xffffffff816ad7bd <+77>: add $0x100,%rdx
0xffffffff816ad7c4 <+84>: mov %rdx,%rax
0xffffffff816ad7c7 <+87>: mul %rsi
0xffffffff816ad7ca <+90>: shr $0x8,%rdx
0xffffffff816ad7ce <+94>: mov %dl,0x1(%rcx)
0xffffffff816ad7d1 <+97>: mov %rdi,%rdx
0xffffffff816ad7d4 <+100>: shr $0x30,%rdi
0xffffffff816ad7d8 <+104>: shr $0x20,%rdx
0xffffffff816ad7dc <+108>: movzwl %dx,%edx
0xffffffff816ad7df <+111>: add $0x100,%rdx
0xffffffff816ad7e6 <+118>: mov %rdx,%rax
0xffffffff816ad7e9 <+121>: mul %rsi
0xffffffff816ad7ec <+124>: shr $0x8,%rdx
0xffffffff816ad7f0 <+128>: mov %dl,0x2(%rcx)
0xffffffff816ad7f3 <+131>: lea 0x100(%rdi),%rdx
0xffffffff816ad7fa <+138>: mov %rdx,%rax
0xffffffff816ad7fd <+141>: mul %rsi
0xffffffff816ad800 <+144>: shr $0x8,%rdx
0xffffffff816ad804 <+148>: mov %dl,0x3(%rcx)
0xffffffff816ad807 <+151>: ret

>
>
> Thanks,
> pq
>
>>> Btw. this would be shorter to write as
>>>
>>> (argb_src1 >> 16) & 0xffff
>>>
>>> etc.
>> I will use it in the V2. Thanks.
>>
>>>
>>> Thanks,
>>> pq
>>>
>>>> +}
>>>> +
>>>> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>>>> + struct vkms_composer *dst_composer)
>>>> +{
>>>> + u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>>>> +
>>>> + pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>>>> + pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>>>> + pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>>>> + pixel_addr[3] = 0xff;
>>>> +}
>>>> +
>>>> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
>>>> + struct vkms_composer *dst_composer)
>>>> +{
>>>> + __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>>>> +
>>>> + *pixel_addr = cpu_to_le64(argb_src1);
>>>> +}
>>>> +
>>>> +#endif /* _VKMS_FORMATS_H_ */
>>>
>

2021-10-20 08:28:08

by Pekka Paalanen

[permalink] [raw]
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

On Tue, 19 Oct 2021 18:10:57 -0300
Igor Matheus Andrade Torrente <[email protected]> wrote:

> Hi Pekka,
>
> On 10/19/21 5:05 AM, Pekka Paalanen wrote:
> > On Mon, 18 Oct 2021 16:26:06 -0300
> > Igor Matheus Andrade Torrente <[email protected]> wrote:
> >
> >> Hi Pekka,
> >>
> >> On 10/18/21 5:30 AM, Pekka Paalanen wrote:
> >>> On Tue, 5 Oct 2021 17:16:37 -0300
> >>> Igor Matheus Andrade Torrente <[email protected]> wrote:
> >>>
> >>>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
> >>>> as a color input.
> >>>>
> >>>> This patch refactors all the functions related to the plane composition
> >>>> to overcome this limitation.
> >>>>
> >>>> Now the blend function receives a format handler to each plane and a
> >>>> blend function pointer. It will take two ARGB_1616161616 pixels, one
> >>>> for each handler, and will use the blend function to calculate and
> >>>> store the final color in the output buffer.
> >>>>
> >>>> These format handlers will receive the `vkms_composer` and a pair of
> >>>> coordinates. And they should return the respective pixel in the
> >>>> ARGB_16161616 format.
> >>>>
> >>>> The blend function will receive two ARGB_16161616 pixels, x, y, and
> >>>> the vkms_composer of the output buffer. The method should perform the
> >>>> blend operation and store output to the format aforementioned
> >>>> ARGB_16161616.
> >>>
> >>> Hi,
> >>>
> >>> here are some drive-by comments which you are free to take or leave.
> >>>
> >>> To find more efficient ways to do this, you might want research
> >>> library. It's MIT licensed.
> >>>
> >>> IIRC, the elementary operations there operate on pixel lines (pointer +
> >>> length), not individual pixels (image, x, y). Input conversion function
> >>> reads and converts a whole line a time. Blending function blends two
> >>> lines and writes the output into back one of the lines. Output
> >>> conversion function takes a line and converts it into destination
> >>> buffer. That way the elementary operations do not need to take stride
> >>> into account, and blending does not need to deal with varying memory
> >>> alignment FWIW. The inner loops involve much less function calls and
> >>> state, probably.
> >>
> >> I was doing some rudimentary profiling, and I noticed that the code
> >> spends most of the time with the indirect system call overhead and not
> >> the actual computation. This can definitively help with it.
> >
> > Hi,
> >
> > I suppose you mean function (pointer) call, not system call?
>
> Yes, I misspelled it.
>
> >
> > Really good that you have already profiled things. All optimisations
> > should be guided by profiling, otherwise they are just guesses (and I
> > got lucky this time I guess).
> >
> >>> Pixman may not do exactly this, but it does something very similar.
> >>> Pixman also has multiple different levels of optimisations, which may
> >>> not be necessary for VKMS.
> >>>
> >>> It's a trade-off between speed and temporary memory consumed. You need
> >>> temporary buffers for two lines, but not more (not a whole image in
> >>> full intermediate precision). Further optimisation could determine
> >>> whether to process whole image rows as lines, or split a row into
> >>> multiple lines to stay within CPU cache size.
> >>>
> >>
> >> Sorry, I didn't understand the idea of the last sentence.
> >
> > If an image is very wide, a single row could still be relatively large
> > in size (bytes). If it is large enough that the working set falls out
> > of a faster CPU cache into a slower CPU cache (or worse yet, into RAM
> > accesses), performance may suffer and become memory bandwidth limited
> > rather than ALU rate limited. Theoretically that can be worked around
> > by limiting the maximum size of a line, and splitting an image row into
> > multiple lines.
>
> Got it now, thanks.
>
> >
> > However, this is an optimisation one probably should not do until there
> > is performance profiling data showing that it actually is useful. The
> > optimal line size limit depends on the CPU model as well. So it's a bit
> > far out, but something you could keep in mind just in case.
>
> OK. For now I will not deal with this complexity, and if necessary I
> will revisit it latter.
>
> >
> >>> Since it seems you are blending multiple planes into a single
> >>> destination which is assumed to be opaque, you might also be able to do
> >>> the multiple blends without convert-writing and read-converting to/from
> >>> the destination between every plane. I think that might be possible to
> >>> architect on top of the per-line operations still.
> >>
> >> I tried it. But I don't know how to do this without looking like a mess.
>
> I don't think it changes anything, but I forgot to mention that I tried
> it with the blend per pixel and not a per line.

Hi,

I believe that can get messy per-pixel, yeah. You kind of want to keep
the per-pixel (inner loop) operations as fast as possible, and that
naturally tends to lead to contorted code as you try to optimise it
(prematurely perhaps).

I would expect per-line code to be cleaner, because there is less state
to be handled in the inner loop, and the outer loops spin rarely enough
that you can afford to write more clear code.

> > Dedicate one of the temporary line buffers for the destination, and
> > instead of writing it out and loading it back for each input plane,
> > leave it in place over all planes and write it out just once at the end.
> >
> > I do expect more state tracking needed. You need to iterate over the
> > list of planes for each output row, extract only the used part of an
> > input plane's buffer into the other temporary line buffer, and offset
> > the destination line buffer and length to match when passing those into
> > a blending function.+
>
> It's exactly this state tracking that was a mess when I was trying to
> implement something similar. But this is one more thing that probably
> I will have to revisit later.

Mind the difference between state tracking in the inner loop vs. the
outer loops. Per-line, inner loop becomes simpler while the outer loops
become slightly more complicated, but it should be a net win, because
the outer loops spin fewer times.

When nesting gets too deep, code becomes a mess, or the number of local
variables in a function grows too far, it usually helps to split things
into more functions. The compiler will inline them for you
automatically when that is beneficial.

Function pointer calls cannot usually be inlined, hence they should be
kept out of the innermost loop.

> > It's not an obvious win but a trade-off, so profiling is again needed.
> >
> > Btw. the use of temporary line buffers should also help with
> > implementing scaling. You could do the filtering during reading of the
> > input buffer. If the filter is not nearest, meaning you need to access
> > more than one input pixel per pixel-for-blending, there are a few ways
> > to go about that. You could do the filtering during the input buffer
> > reading, or you could load two input buffer rows into temporary line
> > buffers and do filtering as a separate step into yet another line
> > buffer. As the composition advances from top to bottom, only one of the
> > input buffer rows will change at a time (during up-scaling) so you can
> > avoid re-loading a row by swapping the line buffers.
> >
> > This is getting ahead of things, so don't worry about scaling or
> > filtering yet. The first thing is to see if you can make the line
> > buffer approach give you a significant speed-up.
> >
> >> Does the pixman perform some similar?
> >
> > No, Pixman composition operation has only three images: source,
> > mask, and destination. All those it can handle simultaneously, but
> > since there is no "multi-blending" API, it doesn't need to take care of
> > more.
> >
> > IIRC, Pixman also has a form of optimised operations that do blending
> > and converting to destination in the same pass. The advantage of that
> > is that blending can work with less precision when it knows what
> > precision the output will be converted to and it saves some bandwidth
> > by not needing to write-after-blending and read-for-conversion
> > intermediate precision values. The disadvantage is that the number of
> > needed specialised blending functions explodes by the number of
> > possible destination formats. Pixman indeed makes those specialised
> > functions optional, falling back to more generic C code. I would hope
> > that VKMS does not need to go this far in optimisations though.
>
> This should be plenty fast indeed. Maybe worth for formats that are
> extremely common.

My feeling says that it would be better to not aim that far at first,
and see how the per-line approach works with separate pluggable
read-conversion, blending, and write-conversion functions.

Besides, combined blend-write operations would probably conflict with
the idea of blending all planes first and then writing destination once.

There are KMS properties for different blending functions, so that's a
reason to make the blending function pluggable as well. Not just for a
potential optimisation to skip the X channel on input completely.

If you want to support background color property, that might be handled
with a special read-conversion function which just fills the temporary
line buffer with the background color.

It is always possible to get faster by writing more specialised
functions that can take advantage of more invariants/assumptions, but
you get a combinatorial explosion of specialisations.


Thanks,
pq


Attachments:
(No filename) (849.00 B)
OpenPGP digital signature