2019-05-30 21:16:53

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 0/7] media: cedrus: Improvements/cleanup

Here is first batch of random Cedrus improvements/cleanups. Only patch 2
has a change which raises a question about H264 controls.

Changes were tested on H3 SoC using modified ffmpeg and Kodi.

Please take a look.

Best regards,
Jernej

Jernej Skrabec (7):
media: cedrus: Disable engine after each slice decoding
media: cedrus: Fix H264 default reference index count
media: cedrus: Fix decoding for some H264 videos
media: cedrus: Remove dst_bufs from context
media: cedrus: Don't set chroma size for scale & rotation
media: cedrus: Add infra for extra buffers connected to capture
buffers
media: cedrus: Improve H264 memory efficiency

drivers/staging/media/sunxi/cedrus/cedrus.h | 12 +-
.../staging/media/sunxi/cedrus/cedrus_h264.c | 115 ++++++++----------
.../staging/media/sunxi/cedrus/cedrus_hw.c | 4 +-
.../staging/media/sunxi/cedrus/cedrus_video.c | 25 ++--
4 files changed, 68 insertions(+), 88 deletions(-)

--
2.21.0


2019-05-30 21:17:01

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 1/7] media: cedrus: Disable engine after each slice decoding

libvdpau-sunxi always disables engine after each decoded slice.
Do same in Cedrus driver.

Presumably this also lowers power consumption which is always nice.

Signed-off-by: Jernej Skrabec <[email protected]>
---
drivers/staging/media/sunxi/cedrus/cedrus_hw.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
index c34aec7c6e40..9c5819def186 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
@@ -123,6 +123,7 @@ static irqreturn_t cedrus_irq(int irq, void *data)

dev->dec_ops[ctx->current_codec]->irq_disable(ctx);
dev->dec_ops[ctx->current_codec]->irq_clear(ctx);
+ cedrus_engine_disable(dev);

src_buf = v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
dst_buf = v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
--
2.21.0

2019-05-30 21:17:13

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 4/7] media: cedrus: Remove dst_bufs from context

This array is just duplicated capture buffer queue. Remove it and adjust
code to look into capture buffer queue instead.

Signed-off-by: Jernej Skrabec <[email protected]>
---
drivers/staging/media/sunxi/cedrus/cedrus.h | 4 +---
.../staging/media/sunxi/cedrus/cedrus_h264.c | 4 ++--
.../staging/media/sunxi/cedrus/cedrus_video.c | 22 -------------------
3 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
index 3f476d0fd981..d8e6777e5e27 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus.h
+++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
@@ -100,8 +100,6 @@ struct cedrus_ctx {
struct v4l2_ctrl_handler hdl;
struct v4l2_ctrl **ctrls;

- struct vb2_buffer *dst_bufs[VIDEO_MAX_FRAME];
-
union {
struct {
void *mv_col_buf;
@@ -187,7 +185,7 @@ static inline dma_addr_t cedrus_dst_buf_addr(struct cedrus_ctx *ctx,
if (index < 0)
return 0;

- buf = ctx->dst_bufs[index];
+ buf = ctx->fh.m2m_ctx->cap_q_ctx.q.bufs[index];
return buf ? cedrus_buf_addr(buf, &ctx->dst_fmt, plane) : 0;
}

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
index d0ee3f90ff46..b2290f98d81a 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
@@ -119,7 +119,7 @@ static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
if (buf_idx < 0)
continue;

- cedrus_buf = vb2_to_cedrus_buffer(ctx->dst_bufs[buf_idx]);
+ cedrus_buf = vb2_to_cedrus_buffer(cap_q->bufs[buf_idx]);
position = cedrus_buf->codec.h264.position;
used_dpbs |= BIT(position);

@@ -194,7 +194,7 @@ static void _cedrus_write_ref_list(struct cedrus_ctx *ctx,
if (buf_idx < 0)
continue;

- ref_buf = to_vb2_v4l2_buffer(ctx->dst_bufs[buf_idx]);
+ ref_buf = to_vb2_v4l2_buffer(cap_q->bufs[buf_idx]);
cedrus_buf = vb2_v4l2_to_cedrus_buffer(ref_buf);
position = cedrus_buf->codec.h264.position;

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
index e2b530b1a956..681dfe3367a6 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
@@ -411,26 +411,6 @@ static void cedrus_queue_cleanup(struct vb2_queue *vq, u32 state)
}
}

-static int cedrus_buf_init(struct vb2_buffer *vb)
-{
- struct vb2_queue *vq = vb->vb2_queue;
- struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
-
- if (!V4L2_TYPE_IS_OUTPUT(vq->type))
- ctx->dst_bufs[vb->index] = vb;
-
- return 0;
-}
-
-static void cedrus_buf_cleanup(struct vb2_buffer *vb)
-{
- struct vb2_queue *vq = vb->vb2_queue;
- struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
-
- if (!V4L2_TYPE_IS_OUTPUT(vq->type))
- ctx->dst_bufs[vb->index] = NULL;
-}
-
static int cedrus_buf_out_validate(struct vb2_buffer *vb)
{
struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);
@@ -517,8 +497,6 @@ static void cedrus_buf_request_complete(struct vb2_buffer *vb)
static struct vb2_ops cedrus_qops = {
.queue_setup = cedrus_queue_setup,
.buf_prepare = cedrus_buf_prepare,
- .buf_init = cedrus_buf_init,
- .buf_cleanup = cedrus_buf_cleanup,
.buf_queue = cedrus_buf_queue,
.buf_out_validate = cedrus_buf_out_validate,
.buf_request_complete = cedrus_buf_request_complete,
--
2.21.0

2019-05-30 21:17:20

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 5/7] media: cedrus: Don't set chroma size for scale & rotation

Scale and rotation are currently not implemented, so it makes no sense to
set chroma size for it.

Signed-off-by: Jernej Skrabec <[email protected]>
---
drivers/staging/media/sunxi/cedrus/cedrus_hw.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
index 9c5819def186..9de20ae47916 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
@@ -79,9 +79,6 @@ void cedrus_dst_format_set(struct cedrus_dev *dev,
reg = VE_PRIMARY_OUT_FMT_NV12;
cedrus_write(dev, VE_PRIMARY_OUT_FMT, reg);

- reg = VE_CHROMA_BUF_LEN_SDRT(chroma_size / 2);
- cedrus_write(dev, VE_CHROMA_BUF_LEN, reg);
-
reg = chroma_size / 2;
cedrus_write(dev, VE_PRIMARY_CHROMA_BUF_LEN, reg);

--
2.21.0

2019-05-30 21:17:22

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 7/7] media: cedrus: Improve H264 memory efficiency

H264 decoder driver preallocated pretty big worst case mv col buffer
pool. It turns out that pool is most of the time much bigger than it
needs to be.

Solution implemented here is to allocate memory only if capture buffer
is actually used and only as much as it is really necessary.

This is also preparation for 4K video decoding support, which will be
implemented later.

Signed-off-by: Jernej Skrabec <[email protected]>
---
drivers/staging/media/sunxi/cedrus/cedrus.h | 4 -
.../staging/media/sunxi/cedrus/cedrus_h264.c | 81 +++++++------------
2 files changed, 28 insertions(+), 57 deletions(-)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
index 16c1bdfd243a..fcbbbef65494 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus.h
+++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
@@ -106,10 +106,6 @@ struct cedrus_ctx {

union {
struct {
- void *mv_col_buf;
- dma_addr_t mv_col_buf_dma;
- ssize_t mv_col_buf_field_size;
- ssize_t mv_col_buf_size;
void *pic_info_buf;
dma_addr_t pic_info_buf_dma;
void *neighbor_info_buf;
diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
index b2290f98d81a..758fd0049e8f 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
@@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev *dev,
cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
}

-static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
- unsigned int position,
+static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
unsigned int field)
{
- dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
-
- /* Adjust for the position */
- addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
+ dma_addr_t addr = buf->extra_buf_dma;

/* Adjust for the field */
- addr += field * ctx->codec.h264.mv_col_buf_field_size;
+ if (field)
+ addr += buf->extra_buf_size / 2;

return addr;
}
@@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
struct cedrus_h264_sram_ref_pic *pic)
{
struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
- unsigned int position = buf->codec.h264.position;

pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
@@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,

pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 0));
pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 1));
- pic->mv_col_top_ptr =
- cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 0));
- pic->mv_col_bot_ptr =
- cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 1));
+ pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 0));
+ pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 1));
}

static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
@@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
output_buf->codec.h264.position = position;

+ if (!output_buf->extra_buf_size) {
+ const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
+ unsigned int field_size;
+
+ field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
+ DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
+ if (!(sps->flags & V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
+ field_size = field_size * 2;
+ if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
+ field_size = field_size * 2;
+
+ output_buf->extra_buf_size = field_size * 2;
+ output_buf->extra_buf =
+ dma_alloc_coherent(dev->dev,
+ output_buf->extra_buf_size,
+ &output_buf->extra_buf_dma,
+ GFP_KERNEL);
+
+ if (!output_buf->extra_buf)
+ output_buf->extra_buf_size = 0;
+ }
+
if (slice->flags & V4L2_H264_SLICE_FLAG_FIELD_PIC)
output_buf->codec.h264.pic_type = CEDRUS_H264_PIC_TYPE_FIELD;
else if (sps->flags & V4L2_H264_SPS_FLAG_MB_ADAPTIVE_FRAME_FIELD)
@@ -476,8 +492,6 @@ static void cedrus_h264_setup(struct cedrus_ctx *ctx,
static int cedrus_h264_start(struct cedrus_ctx *ctx)
{
struct cedrus_dev *dev = ctx->dev;
- unsigned int field_size;
- unsigned int mv_col_size;
int ret;

/*
@@ -509,44 +523,8 @@ static int cedrus_h264_start(struct cedrus_ctx *ctx)
goto err_pic_buf;
}

- field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
- DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
-
- /*
- * FIXME: This is actually conditional to
- * V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE not being set, we
- * might have to rework this if memory efficiency ever is
- * something we need to work on.
- */
- field_size = field_size * 2;
-
- /*
- * FIXME: This is actually conditional to
- * V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY not being set, we might
- * have to rework this if memory efficiency ever is something
- * we need to work on.
- */
- field_size = field_size * 2;
- ctx->codec.h264.mv_col_buf_field_size = field_size;
-
- mv_col_size = field_size * 2 * CEDRUS_H264_FRAME_NUM;
- ctx->codec.h264.mv_col_buf_size = mv_col_size;
- ctx->codec.h264.mv_col_buf = dma_alloc_coherent(dev->dev,
- ctx->codec.h264.mv_col_buf_size,
- &ctx->codec.h264.mv_col_buf_dma,
- GFP_KERNEL);
- if (!ctx->codec.h264.mv_col_buf) {
- ret = -ENOMEM;
- goto err_neighbor_buf;
- }
-
return 0;

-err_neighbor_buf:
- dma_free_coherent(dev->dev, CEDRUS_NEIGHBOR_INFO_BUF_SIZE,
- ctx->codec.h264.neighbor_info_buf,
- ctx->codec.h264.neighbor_info_buf_dma);
-
err_pic_buf:
dma_free_coherent(dev->dev, CEDRUS_PIC_INFO_BUF_SIZE,
ctx->codec.h264.pic_info_buf,
@@ -558,9 +536,6 @@ static void cedrus_h264_stop(struct cedrus_ctx *ctx)
{
struct cedrus_dev *dev = ctx->dev;

- dma_free_coherent(dev->dev, ctx->codec.h264.mv_col_buf_size,
- ctx->codec.h264.mv_col_buf,
- ctx->codec.h264.mv_col_buf_dma);
dma_free_coherent(dev->dev, CEDRUS_NEIGHBOR_INFO_BUF_SIZE,
ctx->codec.h264.neighbor_info_buf,
ctx->codec.h264.neighbor_info_buf_dma);
--
2.21.0

2019-05-30 21:18:12

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 3/7] media: cedrus: Fix decoding for some H264 videos

It seems that for some H264 videos at least one bitstream parsing
trigger must be called in order to be decoded correctly. There is no
explanation why this helps, but it was observed that two sample videos
with this fix are now decoded correctly and there is no regression with
others.

Signed-off-by: Jernej Skrabec <[email protected]>
---
I have two samples which are fixed by this:
http://jernej.libreelec.tv/videos/h264/test.mkv
http://jernej.libreelec.tv/videos/h264/Dredd%20%E2%80%93%20DTS%20Sound%20Check%20DTS-HD%20MA%207.1.m2ts

Although second one also needs support for multi-slice frames, which is not yet implemented here.

.../staging/media/sunxi/cedrus/cedrus_h264.c | 22 ++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
index cc8d17f211a1..d0ee3f90ff46 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
@@ -6,6 +6,7 @@
* Copyright (c) 2018 Bootlin
*/

+#include <linux/delay.h>
#include <linux/types.h>

#include <media/videobuf2-dma-contig.h>
@@ -289,6 +290,20 @@ static void cedrus_write_pred_weight_table(struct cedrus_ctx *ctx,
}
}

+static void cedrus_skip_bits(struct cedrus_dev *dev, int num)
+{
+ for (; num > 32; num -= 32) {
+ cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (32 << 8));
+ while (cedrus_read(dev, VE_H264_STATUS) & (1 << 8))
+ udelay(1);
+ }
+ if (num > 0) {
+ cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (num << 8));
+ while (cedrus_read(dev, VE_H264_STATUS) & (1 << 8))
+ udelay(1);
+ }
+}
+
static void cedrus_set_params(struct cedrus_ctx *ctx,
struct cedrus_run *run)
{
@@ -299,12 +314,11 @@ static void cedrus_set_params(struct cedrus_ctx *ctx,
struct vb2_buffer *src_buf = &run->src->vb2_buf;
struct cedrus_dev *dev = ctx->dev;
dma_addr_t src_buf_addr;
- u32 offset = slice->header_bit_size;
- u32 len = (slice->size * 8) - offset;
+ u32 len = slice->size * 8;
u32 reg;

cedrus_write(dev, VE_H264_VLD_LEN, len);
- cedrus_write(dev, VE_H264_VLD_OFFSET, offset);
+ cedrus_write(dev, VE_H264_VLD_OFFSET, 0);

src_buf_addr = vb2_dma_contig_plane_dma_addr(src_buf, 0);
cedrus_write(dev, VE_H264_VLD_END,
@@ -323,6 +337,8 @@ static void cedrus_set_params(struct cedrus_ctx *ctx,
cedrus_write(dev, VE_H264_TRIGGER_TYPE,
VE_H264_TRIGGER_TYPE_INIT_SWDEC);

+ cedrus_skip_bits(dev, slice->header_bit_size);
+
if (((pps->flags & V4L2_H264_PPS_FLAG_WEIGHTED_PRED) &&
(slice->slice_type == V4L2_H264_SLICE_TYPE_P ||
slice->slice_type == V4L2_H264_SLICE_TYPE_SP)) ||
--
2.21.0

2019-05-30 21:18:58

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 6/7] media: cedrus: Add infra for extra buffers connected to capture buffers

H264 and HEVC engines need additional buffers for each capture buffer.
H264 engine has this currently solved by allocating fixed size pool,
which is not ideal. Most of the time pool size is much bigger than it
needs to be.

Ideally, extra buffer should be allocated at buffer initialization, but
that's not efficient. It's size in H264 depends on flags set in SPS, but
that information is not available in buffer init callback.

Signed-off-by: Jernej Skrabec <[email protected]>
---
drivers/staging/media/sunxi/cedrus/cedrus.h | 4 ++++
.../staging/media/sunxi/cedrus/cedrus_video.c | 19 +++++++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
index d8e6777e5e27..16c1bdfd243a 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus.h
+++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
@@ -81,6 +81,10 @@ struct cedrus_run {
struct cedrus_buffer {
struct v4l2_m2m_buffer m2m_buf;

+ void *extra_buf;
+ dma_addr_t extra_buf_dma;
+ ssize_t extra_buf_size;
+
union {
struct {
unsigned int position;
diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
index 681dfe3367a6..d756e0e69634 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
@@ -411,6 +411,24 @@ static void cedrus_queue_cleanup(struct vb2_queue *vq, u32 state)
}
}

+static void cedrus_buf_cleanup(struct vb2_buffer *vb)
+{
+ struct vb2_queue *vq = vb->vb2_queue;
+
+ if (!V4L2_TYPE_IS_OUTPUT(vq->type)) {
+ struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
+ struct cedrus_buffer *cedrus_buf;
+
+ cedrus_buf = vb2_to_cedrus_buffer(vq->bufs[vb->index]);
+
+ if (cedrus_buf->extra_buf_size)
+ dma_free_coherent(ctx->dev->dev,
+ cedrus_buf->extra_buf_size,
+ cedrus_buf->extra_buf,
+ cedrus_buf->extra_buf_dma);
+ }
+}
+
static int cedrus_buf_out_validate(struct vb2_buffer *vb)
{
struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);
@@ -497,6 +515,7 @@ static void cedrus_buf_request_complete(struct vb2_buffer *vb)
static struct vb2_ops cedrus_qops = {
.queue_setup = cedrus_queue_setup,
.buf_prepare = cedrus_buf_prepare,
+ .buf_cleanup = cedrus_buf_cleanup,
.buf_queue = cedrus_buf_queue,
.buf_out_validate = cedrus_buf_out_validate,
.buf_request_complete = cedrus_buf_request_complete,
--
2.21.0

2019-05-30 21:19:10

by Jernej Skrabec

[permalink] [raw]
Subject: [PATCH 2/7] media: cedrus: Fix H264 default reference index count

Reference index count in VE_H264_PPS reg should come from PPS control.
However, this is not really important because reference index count is
in our case always overridden by that from slice header.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Signed-off-by: Jernej Skrabec <[email protected]>
---
We have to decide if we drop pps->num_ref_idx_l0_default_active_minus1
and pps->num_ref_idx_l1_default_active_minus1 fields or add
num_ref_idx_l0_active_override_flag and num_ref_idx_l0_active_override_flag
to slice control.

Current control doesn't have those two flags, so in Cedrus override flag is
always set and we rely on userspace to set slice->num_ref_idx_l0_active_minus1
and slice->num_ref_idx_l1_active_minus1 to correct values. This means that
values stored in PPS are not needed and always ignored by VPU.

If I understand correctly, algorithm is very simple:

ref_count = PPS->ref_count
if (override_flag)
ref_count = slice->ref_count

It seems that VAAPI provides only final value. In my opinion we should do the
same - get rid of PPS default ref index count fields.

drivers/staging/media/sunxi/cedrus/cedrus_h264.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
index a30bb283f69f..cc8d17f211a1 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
@@ -340,12 +340,8 @@ static void cedrus_set_params(struct cedrus_ctx *ctx,

// picture parameters
reg = 0;
- /*
- * FIXME: the kernel headers are allowing the default value to
- * be passed, but the libva doesn't give us that.
- */
- reg |= (slice->num_ref_idx_l0_active_minus1 & 0x1f) << 10;
- reg |= (slice->num_ref_idx_l1_active_minus1 & 0x1f) << 5;
+ reg |= (pps->num_ref_idx_l0_default_active_minus1 & 0x1f) << 10;
+ reg |= (pps->num_ref_idx_l1_default_active_minus1 & 0x1f) << 5;
reg |= (pps->weighted_bipred_idc & 0x3) << 2;
if (pps->flags & V4L2_H264_PPS_FLAG_ENTROPY_CODING_MODE)
reg |= VE_H264_PPS_ENTROPY_CODING_MODE;
--
2.21.0

2019-06-03 12:20:56

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 1/7] media: cedrus: Disable engine after each slice decoding

Hi,

On Thu, May 30, 2019 at 11:15:10PM +0200, Jernej Skrabec wrote:
> libvdpau-sunxi always disables engine after each decoded slice.
> Do same in Cedrus driver.
>
> Presumably this also lowers power consumption which is always nice.
>
> Signed-off-by: Jernej Skrabec <[email protected]>

Is it fixing anything though?

I indeed saw that cedar did disable it everytime, but I couldn't find
a reason why.

Also, the power management improvement would need to be measured, it
can even create the opposite situation where the device will draw more
current from being woken up than if it had just remained disabled.

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (737.00 B)
signature.asc (235.00 B)
Download all attachments

2019-06-03 12:25:49

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 4/7] media: cedrus: Remove dst_bufs from context

On Thu, May 30, 2019 at 11:15:13PM +0200, Jernej Skrabec wrote:
> This array is just duplicated capture buffer queue. Remove it and adjust
> code to look into capture buffer queue instead.
>
> Signed-off-by: Jernej Skrabec <[email protected]>

Acked-by: Maxime Ripard <[email protected]>

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (409.00 B)
signature.asc (235.00 B)
Download all attachments

2019-06-03 12:27:08

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 6/7] media: cedrus: Add infra for extra buffers connected to capture buffers

Hi,

On Thu, May 30, 2019 at 11:15:15PM +0200, Jernej Skrabec wrote:
> H264 and HEVC engines need additional buffers for each capture buffer.
> H264 engine has this currently solved by allocating fixed size pool,
> which is not ideal. Most of the time pool size is much bigger than it
> needs to be.
>
> Ideally, extra buffer should be allocated at buffer initialization, but
> that's not efficient. It's size in H264 depends on flags set in SPS, but
> that information is not available in buffer init callback.
>
> Signed-off-by: Jernej Skrabec <[email protected]>
> ---
> drivers/staging/media/sunxi/cedrus/cedrus.h | 4 ++++
> .../staging/media/sunxi/cedrus/cedrus_video.c | 19 +++++++++++++++++++
> 2 files changed, 23 insertions(+)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
> index d8e6777e5e27..16c1bdfd243a 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> @@ -81,6 +81,10 @@ struct cedrus_run {
> struct cedrus_buffer {
> struct v4l2_m2m_buffer m2m_buf;
>
> + void *extra_buf;
> + dma_addr_t extra_buf_dma;
> + ssize_t extra_buf_size;
> +
> union {
> struct {
> unsigned int position;
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> index 681dfe3367a6..d756e0e69634 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> @@ -411,6 +411,24 @@ static void cedrus_queue_cleanup(struct vb2_queue *vq, u32 state)
> }
> }
>
> +static void cedrus_buf_cleanup(struct vb2_buffer *vb)
> +{
> + struct vb2_queue *vq = vb->vb2_queue;
> +
> + if (!V4L2_TYPE_IS_OUTPUT(vq->type)) {
> + struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> + struct cedrus_buffer *cedrus_buf;
> +
> + cedrus_buf = vb2_to_cedrus_buffer(vq->bufs[vb->index]);
> +
> + if (cedrus_buf->extra_buf_size)
> + dma_free_coherent(ctx->dev->dev,
> + cedrus_buf->extra_buf_size,
> + cedrus_buf->extra_buf,
> + cedrus_buf->extra_buf_dma);
> + }
> +}
> +

I'm really not a fan of allocating something somewhere, and freeing it
somewhere else. Making sure you don't leak something is hard enough to
not have some non-trivial allocation scheme.

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (2.44 kB)
signature.asc (235.00 B)
Download all attachments

2019-06-03 15:36:17

by Jernej Skrabec

[permalink] [raw]
Subject: Re: [PATCH 2/7] media: cedrus: Fix H264 default reference index count

Dne ponedeljek, 03. junij 2019 ob 13:46:20 CEST je Maxime Ripard napisal(a):
> On Thu, May 30, 2019 at 11:15:11PM +0200, Jernej Skrabec wrote:
> > Reference index count in VE_H264_PPS reg should come from PPS control.
> > However, this is not really important because reference index count is
> > in our case always overridden by that from slice header.
> >
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> >
> > Signed-off-by: Jernej Skrabec <[email protected]>
>
> Acked-by: Maxime Ripard <[email protected]>
>
> > ---
> > We have to decide if we drop pps->num_ref_idx_l0_default_active_minus1
> > and pps->num_ref_idx_l1_default_active_minus1 fields or add
> > num_ref_idx_l0_active_override_flag and
> > num_ref_idx_l0_active_override_flag
> > to slice control.

Actually only one flag is in bitstream valid for both l0 and l1 ref list.

> >
> > Current control doesn't have those two flags, so in Cedrus override flag
> > is
> > always set and we rely on userspace to set
> > slice->num_ref_idx_l0_active_minus1 and
> > slice->num_ref_idx_l1_active_minus1 to correct values. This means that
> > values stored in PPS are not needed and always ignored by VPU.
> >
> > If I understand correctly, algorithm is very simple:
> >
> > ref_count = PPS->ref_count
> > if (override_flag)
> >
> > ref_count = slice->ref_count
> >
> > It seems that VAAPI provides only final value. In my opinion we should do
> > the same - get rid of PPS default ref index count fields.
>
> The rationale was to be as conservative as possible and just expose
> everything that is in the bitstream in those controls to accomodate
> for as many weird hardware as possible.

Ok, so then we should add that override flag, which would align with h264 specs
and you can still do same trick in VAAPI library which it's currently used in
Cedrus driver - always set override flag and fill out only slice reflist count.
At the end it shouldn't matter for proper decoding in any driver.

Best regards,
Jernej



2019-06-03 15:39:12

by Jernej Skrabec

[permalink] [raw]
Subject: Re: [PATCH 3/7] media: cedrus: Fix decoding for some H264 videos

Dne ponedeljek, 03. junij 2019 ob 13:55:36 CEST je Maxime Ripard napisal(a):
> Hi,
>
> On Thu, May 30, 2019 at 11:15:12PM +0200, Jernej Skrabec wrote:
> > It seems that for some H264 videos at least one bitstream parsing
> > trigger must be called in order to be decoded correctly. There is no
> > explanation why this helps, but it was observed that two sample videos
> > with this fix are now decoded correctly and there is no regression with
> > others.
> >
> > Signed-off-by: Jernej Skrabec <[email protected]>
> > ---
> > I have two samples which are fixed by this:
> > http://jernej.libreelec.tv/videos/h264/test.mkv
> > http://jernej.libreelec.tv/videos/h264/Dredd%20%E2%80%93%20DTS%20Sound%20C
> > heck%20DTS-HD%20MA%207.1.m2ts
> >
> > Although second one also needs support for multi-slice frames, which is
> > not yet implemented here.>
> > .../staging/media/sunxi/cedrus/cedrus_h264.c | 22 ++++++++++++++++---
> > 1 file changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c index
> > cc8d17f211a1..d0ee3f90ff46 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > @@ -6,6 +6,7 @@
> >
> > * Copyright (c) 2018 Bootlin
> > */
> >
> > +#include <linux/delay.h>
> >
> > #include <linux/types.h>
> >
> > #include <media/videobuf2-dma-contig.h>
> >
> > @@ -289,6 +290,20 @@ static void cedrus_write_pred_weight_table(struct
> > cedrus_ctx *ctx,>
> > }
> >
> > }
>
> We should have a comment here explaining why that is needed

ok.

>
> > +static void cedrus_skip_bits(struct cedrus_dev *dev, int num)
> > +{
> > + for (; num > 32; num -= 32) {
> > + cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (32 <<
8));
>
> Using defines here would be great

Yes, sorry about that.

>
> > + while (cedrus_read(dev, VE_H264_STATUS) & (1 << 8))
> > + udelay(1);
> > + }
>
> A new line here would be great
>
> > + if (num > 0) {
> > + cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (num <<
8));
> > + while (cedrus_read(dev, VE_H264_STATUS) & (1 << 8))
> > + udelay(1);
> > + }
>
> Can't we make that a bit simpler by not duplicating the loop?
>
> Something like:
>
> int current = 0;
>
> while (current < num) {
> int tmp = min(num - current, 32);
>
> cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (current << 8))
> while (...)
> ...
>
> current += tmp;
> }

Yes, that looks better, I'll change it in next version.

Best regards,
Jernej



2019-06-03 15:51:21

by Jernej Skrabec

[permalink] [raw]
Subject: Re: [PATCH 6/7] media: cedrus: Add infra for extra buffers connected to capture buffers

Dne ponedeljek, 03. junij 2019 ob 14:18:59 CEST je Maxime Ripard napisal(a):
> Hi,
>
> On Thu, May 30, 2019 at 11:15:15PM +0200, Jernej Skrabec wrote:
> > H264 and HEVC engines need additional buffers for each capture buffer.
> > H264 engine has this currently solved by allocating fixed size pool,
> > which is not ideal. Most of the time pool size is much bigger than it
> > needs to be.
> >
> > Ideally, extra buffer should be allocated at buffer initialization, but
> > that's not efficient. It's size in H264 depends on flags set in SPS, but
> > that information is not available in buffer init callback.
> >
> > Signed-off-by: Jernej Skrabec <[email protected]>
> > ---
> >
> > drivers/staging/media/sunxi/cedrus/cedrus.h | 4 ++++
> > .../staging/media/sunxi/cedrus/cedrus_video.c | 19 +++++++++++++++++++
> > 2 files changed, 23 insertions(+)
> >
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > d8e6777e5e27..16c1bdfd243a 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > @@ -81,6 +81,10 @@ struct cedrus_run {
> >
> > struct cedrus_buffer {
> >
> > struct v4l2_m2m_buffer m2m_buf;
> >
> > + void *extra_buf;
> > + dma_addr_t extra_buf_dma;
> > + ssize_t extra_buf_size;
> > +
> >
> > union {
> >
> > struct {
> >
> > unsigned int position;
> >
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > b/drivers/staging/media/sunxi/cedrus/cedrus_video.c index
> > 681dfe3367a6..d756e0e69634 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > @@ -411,6 +411,24 @@ static void cedrus_queue_cleanup(struct vb2_queue
> > *vq, u32 state)>
> > }
> >
> > }
> >
> > +static void cedrus_buf_cleanup(struct vb2_buffer *vb)
> > +{
> > + struct vb2_queue *vq = vb->vb2_queue;
> > +
> > + if (!V4L2_TYPE_IS_OUTPUT(vq->type)) {
> > + struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> > + struct cedrus_buffer *cedrus_buf;
> > +
> > + cedrus_buf = vb2_to_cedrus_buffer(vq->bufs[vb->index]);
> > +
> > + if (cedrus_buf->extra_buf_size)
> > + dma_free_coherent(ctx->dev->dev,
> > + cedrus_buf-
>extra_buf_size,
> > + cedrus_buf-
>extra_buf,
> > + cedrus_buf-
>extra_buf_dma);
> > + }
> > +}
> > +
>
> I'm really not a fan of allocating something somewhere, and freeing it
> somewhere else. Making sure you don't leak something is hard enough to
> not have some non-trivial allocation scheme.

Ok, what about introducing two new optional methods in engine callbacks,
buffer_init and buffer_destroy, which would be called from cedrus_buf_init() and
cedrus_buf_cleanup(), respectively. That way all (de)allocation logic stays
within the same engine.

Best regards,
Jernej



2019-06-03 16:57:34

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 3/7] media: cedrus: Fix decoding for some H264 videos

Hi,

On Thu, May 30, 2019 at 11:15:12PM +0200, Jernej Skrabec wrote:
> It seems that for some H264 videos at least one bitstream parsing
> trigger must be called in order to be decoded correctly. There is no
> explanation why this helps, but it was observed that two sample videos
> with this fix are now decoded correctly and there is no regression with
> others.
>
> Signed-off-by: Jernej Skrabec <[email protected]>
> ---
> I have two samples which are fixed by this:
> http://jernej.libreelec.tv/videos/h264/test.mkv
> http://jernej.libreelec.tv/videos/h264/Dredd%20%E2%80%93%20DTS%20Sound%20Check%20DTS-HD%20MA%207.1.m2ts
>
> Although second one also needs support for multi-slice frames, which is not yet implemented here.
>
> .../staging/media/sunxi/cedrus/cedrus_h264.c | 22 ++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> index cc8d17f211a1..d0ee3f90ff46 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> @@ -6,6 +6,7 @@
> * Copyright (c) 2018 Bootlin
> */
>
> +#include <linux/delay.h>
> #include <linux/types.h>
>
> #include <media/videobuf2-dma-contig.h>
> @@ -289,6 +290,20 @@ static void cedrus_write_pred_weight_table(struct cedrus_ctx *ctx,
> }
> }

We should have a comment here explaining why that is needed

> +static void cedrus_skip_bits(struct cedrus_dev *dev, int num)
> +{
> + for (; num > 32; num -= 32) {
> + cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (32 << 8));

Using defines here would be great

> + while (cedrus_read(dev, VE_H264_STATUS) & (1 << 8))
> + udelay(1);
> + }

A new line here would be great

> + if (num > 0) {
> + cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (num << 8));
> + while (cedrus_read(dev, VE_H264_STATUS) & (1 << 8))
> + udelay(1);
> + }

Can't we make that a bit simpler by not duplicating the loop?

Something like:

int current = 0;

while (current < num) {
int tmp = min(num - current, 32);

cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (current << 8))
while (...)
...

current += tmp;
}

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (2.33 kB)
signature.asc (235.00 B)
Download all attachments

2019-06-03 16:58:04

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 7/7] media: cedrus: Improve H264 memory efficiency

On Thu, May 30, 2019 at 11:15:16PM +0200, Jernej Skrabec wrote:
> H264 decoder driver preallocated pretty big worst case mv col buffer
> pool. It turns out that pool is most of the time much bigger than it
> needs to be.
>
> Solution implemented here is to allocate memory only if capture buffer
> is actually used and only as much as it is really necessary.
>
> This is also preparation for 4K video decoding support, which will be
> implemented later.

What is it doing exactly to prepare for 4k?

> Signed-off-by: Jernej Skrabec <[email protected]>
> ---
> drivers/staging/media/sunxi/cedrus/cedrus.h | 4 -
> .../staging/media/sunxi/cedrus/cedrus_h264.c | 81 +++++++------------
> 2 files changed, 28 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
> index 16c1bdfd243a..fcbbbef65494 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> @@ -106,10 +106,6 @@ struct cedrus_ctx {
>
> union {
> struct {
> - void *mv_col_buf;
> - dma_addr_t mv_col_buf_dma;
> - ssize_t mv_col_buf_field_size;
> - ssize_t mv_col_buf_size;
> void *pic_info_buf;
> dma_addr_t pic_info_buf_dma;
> void *neighbor_info_buf;
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> index b2290f98d81a..758fd0049e8f 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> @@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev *dev,
> cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
> }
>
> -static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
> - unsigned int position,
> +static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
> unsigned int field)
> {
> - dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
> -
> - /* Adjust for the position */
> - addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
> + dma_addr_t addr = buf->extra_buf_dma;
>
> /* Adjust for the field */
> - addr += field * ctx->codec.h264.mv_col_buf_field_size;
> + if (field)
> + addr += buf->extra_buf_size / 2;
>
> return addr;
> }
> @@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> struct cedrus_h264_sram_ref_pic *pic)
> {
> struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
> - unsigned int position = buf->codec.h264.position;
>
> pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
> pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
> @@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
>
> pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 0));
> pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 1));
> - pic->mv_col_top_ptr =
> - cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 0));
> - pic->mv_col_bot_ptr =
> - cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 1));
> + pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 0));
> + pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 1));
> }
>
> static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> @@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
> output_buf->codec.h264.position = position;
>
> + if (!output_buf->extra_buf_size) {
> + const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
> + unsigned int field_size;
> +
> + field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
> + DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
> + if (!(sps->flags & V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
> + field_size = field_size * 2;
> + if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
> + field_size = field_size * 2;
> +
> + output_buf->extra_buf_size = field_size * 2;
> + output_buf->extra_buf =
> + dma_alloc_coherent(dev->dev,
> + output_buf->extra_buf_size,
> + &output_buf->extra_buf_dma,
> + GFP_KERNEL);
> +
> + if (!output_buf->extra_buf)
> + output_buf->extra_buf_size = 0;
> + }
> +

That also means that instead of allocating that buffer exactly once,
you now allocate it for each output buffer?

I guess that it will cleaned up by your previous patch at
buffer_cleanup time, so after it's no longer a reference frame?

What is the average memory usage before, and after that change during
a playback, and what is the runtime cost of doing it multiple times
instead of once?

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (4.74 kB)
signature.asc (235.00 B)
Download all attachments

2019-06-03 17:31:27

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 2/7] media: cedrus: Fix H264 default reference index count

On Thu, May 30, 2019 at 11:15:11PM +0200, Jernej Skrabec wrote:
> Reference index count in VE_H264_PPS reg should come from PPS control.
> However, this is not really important because reference index count is
> in our case always overridden by that from slice header.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> Signed-off-by: Jernej Skrabec <[email protected]>

Acked-by: Maxime Ripard <[email protected]>

> ---
> We have to decide if we drop pps->num_ref_idx_l0_default_active_minus1
> and pps->num_ref_idx_l1_default_active_minus1 fields or add
> num_ref_idx_l0_active_override_flag and num_ref_idx_l0_active_override_flag
> to slice control.
>
> Current control doesn't have those two flags, so in Cedrus override flag is
> always set and we rely on userspace to set slice->num_ref_idx_l0_active_minus1
> and slice->num_ref_idx_l1_active_minus1 to correct values. This means that
> values stored in PPS are not needed and always ignored by VPU.
>
> If I understand correctly, algorithm is very simple:
>
> ref_count = PPS->ref_count
> if (override_flag)
> ref_count = slice->ref_count
>
> It seems that VAAPI provides only final value. In my opinion we should do the
> same - get rid of PPS default ref index count fields.

The rationale was to be as conservative as possible and just expose
everything that is in the bitstream in those controls to accomodate
for as many weird hardware as possible.

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (1.57 kB)
signature.asc (235.00 B)
Download all attachments

2019-06-03 17:49:45

by Jernej Skrabec

[permalink] [raw]
Subject: Re: [PATCH 7/7] media: cedrus: Improve H264 memory efficiency

Dne ponedeljek, 03. junij 2019 ob 14:23:28 CEST je Maxime Ripard napisal(a):
> On Thu, May 30, 2019 at 11:15:16PM +0200, Jernej Skrabec wrote:
> > H264 decoder driver preallocated pretty big worst case mv col buffer
> > pool. It turns out that pool is most of the time much bigger than it
> > needs to be.
> >
> > Solution implemented here is to allocate memory only if capture buffer
> > is actually used and only as much as it is really necessary.
> >
> > This is also preparation for 4K video decoding support, which will be
> > implemented later.
>
> What is it doing exactly to prepare for 4k?

Well, with that change 4K videos can be actually watched with 256 MiB CMA
pool, but I can drop this statement in next version.

I concentrated on 256 MiB CMA pool, because it's maximum memory size supported
by older VPU versions, but I think they don't support 4K decoding. I don't
have them, so I can't test that hypothesis.

>
> > Signed-off-by: Jernej Skrabec <[email protected]>
> > ---
> >
> > drivers/staging/media/sunxi/cedrus/cedrus.h | 4 -
> > .../staging/media/sunxi/cedrus/cedrus_h264.c | 81 +++++++------------
> > 2 files changed, 28 insertions(+), 57 deletions(-)
> >
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > 16c1bdfd243a..fcbbbef65494 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > @@ -106,10 +106,6 @@ struct cedrus_ctx {
> >
> > union {
> >
> > struct {
> >
> > - void *mv_col_buf;
> > - dma_addr_t mv_col_buf_dma;
> > - ssize_t mv_col_buf_field_size;
> > - ssize_t mv_col_buf_size;
> >
> > void *pic_info_buf;
> > dma_addr_t pic_info_buf_dma;
> > void *neighbor_info_buf;
> >
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c index
> > b2290f98d81a..758fd0049e8f 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > @@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev
> > *dev,>
> > cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
> >
> > }
> >
> > -static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
> > - unsigned int
position,
> > +static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
> >
> > unsigned int
field)
> >
> > {
> >
> > - dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
> > -
> > - /* Adjust for the position */
> > - addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
> > + dma_addr_t addr = buf->extra_buf_dma;
> >
> > /* Adjust for the field */
> >
> > - addr += field * ctx->codec.h264.mv_col_buf_field_size;
> > + if (field)
> > + addr += buf->extra_buf_size / 2;
> >
> > return addr;
> >
> > }
> >
> > @@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> >
> > struct cedrus_h264_sram_ref_pic
*pic)
> >
> > {
> >
> > struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
> >
> > - unsigned int position = buf->codec.h264.position;
> >
> > pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
> > pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
> >
> > @@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> >
> > pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt,
0));
> > pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt,
1));
> >
> > - pic->mv_col_top_ptr =
> > - cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position,
0));
> > - pic->mv_col_bot_ptr =
> > - cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position,
1));
> > + pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf,
0));
> > + pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf,
1));
> >
> > }
> >
> > static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> >
> > @@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx
> > *ctx,>
> > output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
> > output_buf->codec.h264.position = position;
> >
> > + if (!output_buf->extra_buf_size) {
> > + const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
> > + unsigned int field_size;
> > +
> > + field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
> > + DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
> > + if (!(sps->flags &
V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
> > + field_size = field_size * 2;
> > + if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
> > + field_size = field_size * 2;
> > +
> > + output_buf->extra_buf_size = field_size * 2;
> > + output_buf->extra_buf =
> > + dma_alloc_coherent(dev->dev,
> > + output_buf-
>extra_buf_size,
> > + &output_buf-
>extra_buf_dma,
> > + GFP_KERNEL);
> > +
> > + if (!output_buf->extra_buf)
> > + output_buf->extra_buf_size = 0;
> > + }
> > +
>
> That also means that instead of allocating that buffer exactly once,
> you now allocate it for each output buffer?

It's not completely the same. I'm allocating multiple times, yes, but much
smaller chunks and only if needed.

Still, this slight overhead happens only when buffer is used for the first time.
When buffer is reused, this MV buffer is also reused.

>
> I guess that it will cleaned up by your previous patch at
> buffer_cleanup time, so after it's no longer a reference frame?

Yes, it will be deallocated in buffer_cleanup, but only after capture buffers
are freed, which usually happens when device file descriptor is closed.

Buffers which holds reference frames are later reused, together with it's MV
buffer, so there's no overhead.

>
> What is the average memory usage before, and after that change during
> a playback, and what is the runtime cost of doing it multiple times
> instead of once?

As I already said, overhead is present only when buffer is used for the first
time, which is not ideal, but allows to calculate minimal buffer size needed
and even doesn't allocate anything if capture buffer is not used at all.

I didn't collect any exact numbers, but with this change I can play H264 and
HEVC (with similar modification) 4K video samples with 256 MiB CMA pool.
Without this change, it's not really possible. You can argue "but what if 4K
video use 16 reference frames", then yes, only solution is to increase CMA
pool, but why reserve extra memory which will never be used?

I've been using this optimization for past ~6 month with no issues noticed. If
you feel better, I can change this to be a bit conservative and allocate MV
buffer when buffer_init is called. This will consume a bit more memory as SPS is
not available at that time (worst case buffer size estimation), but it still
won't allocate MV buffers for unallocated capture frames.

Best regards,
Jernej


2019-06-05 21:10:54

by Paul Kocialkowski

[permalink] [raw]
Subject: Re: [PATCH 5/7] media: cedrus: Don't set chroma size for scale & rotation

Hi,

Le jeudi 30 mai 2019 à 23:15 +0200, Jernej Skrabec a écrit :
> Scale and rotation are currently not implemented, so it makes no sense to
> set chroma size for it.
>
> Signed-off-by: Jernej Skrabec <[email protected]>

Acked-by: Paul Kocialkowski <[email protected]>
Cheers,

Paul

> ---
> drivers/staging/media/sunxi/cedrus/cedrus_hw.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
> index 9c5819def186..9de20ae47916 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
> @@ -79,9 +79,6 @@ void cedrus_dst_format_set(struct cedrus_dev *dev,
> reg = VE_PRIMARY_OUT_FMT_NV12;
> cedrus_write(dev, VE_PRIMARY_OUT_FMT, reg);
>
> - reg = VE_CHROMA_BUF_LEN_SDRT(chroma_size / 2);
> - cedrus_write(dev, VE_CHROMA_BUF_LEN, reg);
> -
> reg = chroma_size / 2;
> cedrus_write(dev, VE_PRIMARY_CHROMA_BUF_LEN, reg);
>

2019-06-05 21:11:18

by Paul Kocialkowski

[permalink] [raw]
Subject: Re: [PATCH 4/7] media: cedrus: Remove dst_bufs from context

Hi,

Le jeudi 30 mai 2019 à 23:15 +0200, Jernej Skrabec a écrit :
> This array is just duplicated capture buffer queue. Remove it and adjust
> code to look into capture buffer queue instead.
>
> Signed-off-by: Jernej Skrabec <[email protected]>

Acked-by: Paul Kocialkowski <[email protected]>

Cheers and thanks,

Paul

> ---
> drivers/staging/media/sunxi/cedrus/cedrus.h | 4 +---
> .../staging/media/sunxi/cedrus/cedrus_h264.c | 4 ++--
> .../staging/media/sunxi/cedrus/cedrus_video.c | 22 -------------------
> 3 files changed, 3 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
> index 3f476d0fd981..d8e6777e5e27 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> @@ -100,8 +100,6 @@ struct cedrus_ctx {
> struct v4l2_ctrl_handler hdl;
> struct v4l2_ctrl **ctrls;
>
> - struct vb2_buffer *dst_bufs[VIDEO_MAX_FRAME];
> -
> union {
> struct {
> void *mv_col_buf;
> @@ -187,7 +185,7 @@ static inline dma_addr_t cedrus_dst_buf_addr(struct cedrus_ctx *ctx,
> if (index < 0)
> return 0;
>
> - buf = ctx->dst_bufs[index];
> + buf = ctx->fh.m2m_ctx->cap_q_ctx.q.bufs[index];
> return buf ? cedrus_buf_addr(buf, &ctx->dst_fmt, plane) : 0;
> }
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> index d0ee3f90ff46..b2290f98d81a 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> @@ -119,7 +119,7 @@ static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> if (buf_idx < 0)
> continue;
>
> - cedrus_buf = vb2_to_cedrus_buffer(ctx->dst_bufs[buf_idx]);
> + cedrus_buf = vb2_to_cedrus_buffer(cap_q->bufs[buf_idx]);
> position = cedrus_buf->codec.h264.position;
> used_dpbs |= BIT(position);
>
> @@ -194,7 +194,7 @@ static void _cedrus_write_ref_list(struct cedrus_ctx *ctx,
> if (buf_idx < 0)
> continue;
>
> - ref_buf = to_vb2_v4l2_buffer(ctx->dst_bufs[buf_idx]);
> + ref_buf = to_vb2_v4l2_buffer(cap_q->bufs[buf_idx]);
> cedrus_buf = vb2_v4l2_to_cedrus_buffer(ref_buf);
> position = cedrus_buf->codec.h264.position;
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> index e2b530b1a956..681dfe3367a6 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> @@ -411,26 +411,6 @@ static void cedrus_queue_cleanup(struct vb2_queue *vq, u32 state)
> }
> }
>
> -static int cedrus_buf_init(struct vb2_buffer *vb)
> -{
> - struct vb2_queue *vq = vb->vb2_queue;
> - struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> -
> - if (!V4L2_TYPE_IS_OUTPUT(vq->type))
> - ctx->dst_bufs[vb->index] = vb;
> -
> - return 0;
> -}
> -
> -static void cedrus_buf_cleanup(struct vb2_buffer *vb)
> -{
> - struct vb2_queue *vq = vb->vb2_queue;
> - struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> -
> - if (!V4L2_TYPE_IS_OUTPUT(vq->type))
> - ctx->dst_bufs[vb->index] = NULL;
> -}
> -
> static int cedrus_buf_out_validate(struct vb2_buffer *vb)
> {
> struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);
> @@ -517,8 +497,6 @@ static void cedrus_buf_request_complete(struct vb2_buffer *vb)
> static struct vb2_ops cedrus_qops = {
> .queue_setup = cedrus_queue_setup,
> .buf_prepare = cedrus_buf_prepare,
> - .buf_init = cedrus_buf_init,
> - .buf_cleanup = cedrus_buf_cleanup,
> .buf_queue = cedrus_buf_queue,
> .buf_out_validate = cedrus_buf_out_validate,
> .buf_request_complete = cedrus_buf_request_complete,

2019-06-05 21:12:00

by Paul Kocialkowski

[permalink] [raw]
Subject: Re: [PATCH 6/7] media: cedrus: Add infra for extra buffers connected to capture buffers

Hi,

Le lundi 03 juin 2019 à 17:48 +0200, Jernej Škrabec a écrit :
> Dne ponedeljek, 03. junij 2019 ob 14:18:59 CEST je Maxime Ripard napisal(a):
> > Hi,
> >
> > On Thu, May 30, 2019 at 11:15:15PM +0200, Jernej Skrabec wrote:
> > > H264 and HEVC engines need additional buffers for each capture buffer.
> > > H264 engine has this currently solved by allocating fixed size pool,
> > > which is not ideal. Most of the time pool size is much bigger than it
> > > needs to be.
> > >
> > > Ideally, extra buffer should be allocated at buffer initialization, but
> > > that's not efficient. It's size in H264 depends on flags set in SPS, but
> > > that information is not available in buffer init callback.
> > >
> > > Signed-off-by: Jernej Skrabec <[email protected]>
> > > ---
> > >
> > > drivers/staging/media/sunxi/cedrus/cedrus.h | 4 ++++
> > > .../staging/media/sunxi/cedrus/cedrus_video.c | 19 +++++++++++++++++++
> > > 2 files changed, 23 insertions(+)
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > > d8e6777e5e27..16c1bdfd243a 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > @@ -81,6 +81,10 @@ struct cedrus_run {
> > >
> > > struct cedrus_buffer {
> > >
> > > struct v4l2_m2m_buffer m2m_buf;
> > >
> > > + void *extra_buf;
> > > + dma_addr_t extra_buf_dma;
> > > + ssize_t extra_buf_size;
> > > +
> > >
> > > union {
> > >
> > > struct {
> > >
> > > unsigned int position;
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > > b/drivers/staging/media/sunxi/cedrus/cedrus_video.c index
> > > 681dfe3367a6..d756e0e69634 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > > @@ -411,6 +411,24 @@ static void cedrus_queue_cleanup(struct vb2_queue
> > > *vq, u32 state)>
> > > }
> > >
> > > }
> > >
> > > +static void cedrus_buf_cleanup(struct vb2_buffer *vb)
> > > +{
> > > + struct vb2_queue *vq = vb->vb2_queue;
> > > +
> > > + if (!V4L2_TYPE_IS_OUTPUT(vq->type)) {
> > > + struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> > > + struct cedrus_buffer *cedrus_buf;
> > > +
> > > + cedrus_buf = vb2_to_cedrus_buffer(vq->bufs[vb->index]);
> > > +
> > > + if (cedrus_buf->extra_buf_size)
> > > + dma_free_coherent(ctx->dev->dev,
> > > + cedrus_buf-
> > extra_buf_size,
> > > + cedrus_buf-
> > extra_buf,
> > > + cedrus_buf-
> > extra_buf_dma);
> > > + }
> > > +}
> > > +
> >
> > I'm really not a fan of allocating something somewhere, and freeing it
> > somewhere else. Making sure you don't leak something is hard enough to
> > not have some non-trivial allocation scheme.
>
> Ok, what about introducing two new optional methods in engine callbacks,
> buffer_init and buffer_destroy, which would be called from cedrus_buf_init() and
> cedrus_buf_cleanup(), respectively. That way all (de)allocation logic stays
> within the same engine.

I'm thinking that we should have v4l2-framework-level per-codec helpers
to provide ops for these kinds of things, since they tend be quite
common across decoders.

Cheers,

Paul

2019-06-05 21:54:56

by Jernej Skrabec

[permalink] [raw]
Subject: Re: [PATCH 6/7] media: cedrus: Add infra for extra buffers connected to capture buffers

Dne sreda, 05. junij 2019 ob 23:10:17 CEST je Paul Kocialkowski napisal(a):
> Hi,
>
> Le lundi 03 juin 2019 à 17:48 +0200, Jernej Škrabec a écrit :
> > Dne ponedeljek, 03. junij 2019 ob 14:18:59 CEST je Maxime Ripard
napisal(a):
> > > Hi,
> > >
> > > On Thu, May 30, 2019 at 11:15:15PM +0200, Jernej Skrabec wrote:
> > > > H264 and HEVC engines need additional buffers for each capture buffer.
> > > > H264 engine has this currently solved by allocating fixed size pool,
> > > > which is not ideal. Most of the time pool size is much bigger than it
> > > > needs to be.
> > > >
> > > > Ideally, extra buffer should be allocated at buffer initialization,
> > > > but
> > > > that's not efficient. It's size in H264 depends on flags set in SPS,
> > > > but
> > > > that information is not available in buffer init callback.
> > > >
> > > > Signed-off-by: Jernej Skrabec <[email protected]>
> > > > ---
> > > >
> > > > drivers/staging/media/sunxi/cedrus/cedrus.h | 4 ++++
> > > > .../staging/media/sunxi/cedrus/cedrus_video.c | 19
> > > > +++++++++++++++++++
> > > > 2 files changed, 23 insertions(+)
> > > >
> > > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > > > d8e6777e5e27..16c1bdfd243a 100644
> > > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > > @@ -81,6 +81,10 @@ struct cedrus_run {
> > > >
> > > > struct cedrus_buffer {
> > > >
> > > > struct v4l2_m2m_buffer m2m_buf;
> > > >
> > > > + void *extra_buf;
> > > > + dma_addr_t extra_buf_dma;
> > > > + ssize_t extra_buf_size;
> > > > +
> > > >
> > > > union {
> > > >
> > > > struct {
> > > >
> > > > unsigned int position;
> > > >
> > > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > > > b/drivers/staging/media/sunxi/cedrus/cedrus_video.c index
> > > > 681dfe3367a6..d756e0e69634 100644
> > > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_video.c
> > > > @@ -411,6 +411,24 @@ static void cedrus_queue_cleanup(struct vb2_queue
> > > > *vq, u32 state)>
> > > >
> > > > }
> > > >
> > > > }
> > > >
> > > > +static void cedrus_buf_cleanup(struct vb2_buffer *vb)
> > > > +{
> > > > + struct vb2_queue *vq = vb->vb2_queue;
> > > > +
> > > > + if (!V4L2_TYPE_IS_OUTPUT(vq->type)) {
> > > > + struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> > > > + struct cedrus_buffer *cedrus_buf;
> > > > +
> > > > + cedrus_buf = vb2_to_cedrus_buffer(vq->bufs[vb->index]);
> > > > +
> > > > + if (cedrus_buf->extra_buf_size)
> > > > + dma_free_coherent(ctx->dev->dev,
> > > > + cedrus_buf-
> > >
> > > extra_buf_size,
> > >
> > > > + cedrus_buf-
> > >
> > > extra_buf,
> > >
> > > > + cedrus_buf-
> > >
> > > extra_buf_dma);
> > >
> > > > + }
> > > > +}
> > > > +
> > >
> > > I'm really not a fan of allocating something somewhere, and freeing it
> > > somewhere else. Making sure you don't leak something is hard enough to
> > > not have some non-trivial allocation scheme.
> >
> > Ok, what about introducing two new optional methods in engine callbacks,
> > buffer_init and buffer_destroy, which would be called from
> > cedrus_buf_init() and cedrus_buf_cleanup(), respectively. That way all
> > (de)allocation logic stays within the same engine.
>
> I'm thinking that we should have v4l2-framework-level per-codec helpers
> to provide ops for these kinds of things, since they tend be quite
> common across decoders.

Isn't .buf_init and .buf_cleanup callbacks provided by struct vb2_ops meant
for exactly that?

Related, but different topic. I managed to fix 10-bit HEVC support on H6, but
when working in 8-bit mode, capture buffers have to be big enough to hold
normal NV12 decoded image plus extra buffer for 2 bits of each pixel. VPU
accepts only offset from destination buffer for this extra buffer instead of full
address. How we will handle that? Override sizeimage when allocating? But
there we don't have information if it's 10-bit video or not. As you can see,
I'm not a fan of overallocating.

I suspect we will have even bigger issues when decoding 10-bit HEVC video in
P010 format, which is the only 10-bit YUV format useable by DRM driver (not
implemented yet). From what I know till now, VPU needs aforementioned 8-bit+2-
bit buffers (for decoding) and another one in which it rearranges samples in
P010 format. But that has to be confirmed first.

Best regards,
Jernej


2019-06-05 21:56:15

by Paul Kocialkowski

[permalink] [raw]
Subject: Re: [PATCH 7/7] media: cedrus: Improve H264 memory efficiency

Hi,

Le lundi 03 juin 2019 à 18:37 +0200, Jernej Škrabec a écrit :
> Dne ponedeljek, 03. junij 2019 ob 14:23:28 CEST je Maxime Ripard napisal(a):
> > On Thu, May 30, 2019 at 11:15:16PM +0200, Jernej Skrabec wrote:
> > > H264 decoder driver preallocated pretty big worst case mv col buffer
> > > pool. It turns out that pool is most of the time much bigger than it
> > > needs to be.
> > >
> > > Solution implemented here is to allocate memory only if capture buffer
> > > is actually used and only as much as it is really necessary.
> > >
> > > This is also preparation for 4K video decoding support, which will be
> > > implemented later.
> >
> > What is it doing exactly to prepare for 4k?
>
> Well, with that change 4K videos can be actually watched with 256 MiB CMA
> pool, but I can drop this statement in next version.
>
> I concentrated on 256 MiB CMA pool, because it's maximum memory size supported
> by older VPU versions, but I think they don't support 4K decoding. I don't
> have them, so I can't test that hypothesis.

I think it's a fair goal to try and optimize the CMA pool usage, maybe
it should be presented as that and I guess it's fine to connect that to
4K decoding if you like :)

Either way, I think we should have per-codec framework callbacks for
these kinds of things.

Cheers,

Paul

> > > Signed-off-by: Jernej Skrabec <[email protected]>
> > > ---
> > >
> > > drivers/staging/media/sunxi/cedrus/cedrus.h | 4 -
> > > .../staging/media/sunxi/cedrus/cedrus_h264.c | 81 +++++++------------
> > > 2 files changed, 28 insertions(+), 57 deletions(-)
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > > 16c1bdfd243a..fcbbbef65494 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > @@ -106,10 +106,6 @@ struct cedrus_ctx {
> > >
> > > union {
> > >
> > > struct {
> > >
> > > - void *mv_col_buf;
> > > - dma_addr_t mv_col_buf_dma;
> > > - ssize_t mv_col_buf_field_size;
> > > - ssize_t mv_col_buf_size;
> > >
> > > void *pic_info_buf;
> > > dma_addr_t pic_info_buf_dma;
> > > void *neighbor_info_buf;
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > > b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c index
> > > b2290f98d81a..758fd0049e8f 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > > @@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev
> > > *dev,>
> > > cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
> > >
> > > }
> > >
> > > -static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
> > > - unsigned int
> position,
> > > +static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
> > >
> > > unsigned int
> field)
> > >
> > > {
> > >
> > > - dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
> > > -
> > > - /* Adjust for the position */
> > > - addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
> > > + dma_addr_t addr = buf->extra_buf_dma;
> > >
> > > /* Adjust for the field */
> > >
> > > - addr += field * ctx->codec.h264.mv_col_buf_field_size;
> > > + if (field)
> > > + addr += buf->extra_buf_size / 2;
> > >
> > > return addr;
> > >
> > > }
> > >
> > > @@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> > >
> > > struct cedrus_h264_sram_ref_pic
> *pic)
> > >
> > > {
> > >
> > > struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
> > >
> > > - unsigned int position = buf->codec.h264.position;
> > >
> > > pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
> > > pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
> > >
> > > @@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> > >
> > > pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt,
> 0));
> > > pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt,
> 1));
> > > - pic->mv_col_top_ptr =
> > > - cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position,
> 0));
> > > - pic->mv_col_bot_ptr =
> > > - cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position,
> 1));
> > > + pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf,
> 0));
> > > + pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf,
> 1));
> > > }
> > >
> > > static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> > >
> > > @@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx
> > > *ctx,>
> > > output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
> > > output_buf->codec.h264.position = position;
> > >
> > > + if (!output_buf->extra_buf_size) {
> > > + const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
> > > + unsigned int field_size;
> > > +
> > > + field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
> > > + DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
> > > + if (!(sps->flags &
> V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
> > > + field_size = field_size * 2;
> > > + if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
> > > + field_size = field_size * 2;
> > > +
> > > + output_buf->extra_buf_size = field_size * 2;
> > > + output_buf->extra_buf =
> > > + dma_alloc_coherent(dev->dev,
> > > + output_buf-
> > extra_buf_size,
> > > + &output_buf-
> > extra_buf_dma,
> > > + GFP_KERNEL);
> > > +
> > > + if (!output_buf->extra_buf)
> > > + output_buf->extra_buf_size = 0;
> > > + }
> > > +
> >
> > That also means that instead of allocating that buffer exactly once,
> > you now allocate it for each output buffer?
>
> It's not completely the same. I'm allocating multiple times, yes, but much
> smaller chunks and only if needed.
>
> Still, this slight overhead happens only when buffer is used for the first time.
> When buffer is reused, this MV buffer is also reused.
>
> > I guess that it will cleaned up by your previous patch at
> > buffer_cleanup time, so after it's no longer a reference frame?
>
> Yes, it will be deallocated in buffer_cleanup, but only after capture buffers
> are freed, which usually happens when device file descriptor is closed.
>
> Buffers which holds reference frames are later reused, together with it's MV
> buffer, so there's no overhead.
>
> > What is the average memory usage before, and after that change during
> > a playback, and what is the runtime cost of doing it multiple times
> > instead of once?
>
> As I already said, overhead is present only when buffer is used for the first
> time, which is not ideal, but allows to calculate minimal buffer size needed
> and even doesn't allocate anything if capture buffer is not used at all.
>
> I didn't collect any exact numbers, but with this change I can play H264 and
> HEVC (with similar modification) 4K video samples with 256 MiB CMA pool.
> Without this change, it's not really possible. You can argue "but what if 4K
> video use 16 reference frames", then yes, only solution is to increase CMA
> pool, but why reserve extra memory which will never be used?
>
> I've been using this optimization for past ~6 month with no issues noticed. If
> you feel better, I can change this to be a bit conservative and allocate MV
> buffer when buffer_init is called. This will consume a bit more memory as SPS is
> not available at that time (worst case buffer size estimation), but it still
> won't allocate MV buffers for unallocated capture frames.
>
> Best regards,
> Jernej
>
>

2019-06-06 11:51:44

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 6/7] media: cedrus: Add infra for extra buffers connected to capture buffers

On Mon, Jun 03, 2019 at 05:48:25PM +0200, Jernej Škrabec wrote:
> Dne ponedeljek, 03. junij 2019 ob 14:18:59 CEST je Maxime Ripard napisal(a):
> > > +static void cedrus_buf_cleanup(struct vb2_buffer *vb)
> > > +{
> > > + struct vb2_queue *vq = vb->vb2_queue;
> > > +
> > > + if (!V4L2_TYPE_IS_OUTPUT(vq->type)) {
> > > + struct cedrus_ctx *ctx = vb2_get_drv_priv(vq);
> > > + struct cedrus_buffer *cedrus_buf;
> > > +
> > > + cedrus_buf = vb2_to_cedrus_buffer(vq->bufs[vb->index]);
> > > +
> > > + if (cedrus_buf->extra_buf_size)
> > > + dma_free_coherent(ctx->dev->dev,
> > > + cedrus_buf-
> >extra_buf_size,
> > > + cedrus_buf-
> >extra_buf,
> > > + cedrus_buf-
> >extra_buf_dma);
> > > + }
> > > +}
> > > +
> >
> > I'm really not a fan of allocating something somewhere, and freeing it
> > somewhere else. Making sure you don't leak something is hard enough to
> > not have some non-trivial allocation scheme.
>
> Ok, what about introducing two new optional methods in engine callbacks,
> buffer_init and buffer_destroy, which would be called from cedrus_buf_init() and
> cedrus_buf_cleanup(), respectively. That way all (de)allocation logic stays
> within the same engine.

Yep, that would work for me.

Thanks!
Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (1.34 kB)
signature.asc (235.00 B)
Download all attachments

2019-06-06 13:03:29

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH 3/7] media: cedrus: Fix decoding for some H264 videos

On Mon, Jun 03, 2019 at 05:37:17PM +0200, Jernej Škrabec wrote:
> Dne ponedeljek, 03. junij 2019 ob 13:55:36 CEST je Maxime Ripard napisal(a):
> > int current = 0;
> >
> > while (current < num) {
> > int tmp = min(num - current, 32);
> >
> > cedrus_write(dev, VE_H264_TRIGGER_TYPE, 0x3 | (current << 8))
^^^^^^^
This should be "tmp << 8" instead of "current << 8" though.


> > while (...)
> > ...
> >
> > current += tmp;
> > }
>

regards,
dan carpenter

2019-08-12 12:15:21

by Hans Verkuil

[permalink] [raw]
Subject: Re: [PATCH 0/7] media: cedrus: Improvements/cleanup

On 5/30/19 11:15 PM, Jernej Skrabec wrote:
> Here is first batch of random Cedrus improvements/cleanups. Only patch 2
> has a change which raises a question about H264 controls.
>
> Changes were tested on H3 SoC using modified ffmpeg and Kodi.
>
> Please take a look.

This has been sitting in patchwork for quite some time. I've updated the
status of the various patches and most needed extra work.

It seems that patches 4/7 and 5/7 are OK. Maxime, can you please confirm
that these two are still valid? They apply cleanly on the latest master
at least, but since they are a bit old I prefer to have confirmation that
it's OK to merge them.

Regards,

Hans

>
> Best regards,
> Jernej
>
> Jernej Skrabec (7):
> media: cedrus: Disable engine after each slice decoding
> media: cedrus: Fix H264 default reference index count
> media: cedrus: Fix decoding for some H264 videos
> media: cedrus: Remove dst_bufs from context
> media: cedrus: Don't set chroma size for scale & rotation
> media: cedrus: Add infra for extra buffers connected to capture
> buffers
> media: cedrus: Improve H264 memory efficiency
>
> drivers/staging/media/sunxi/cedrus/cedrus.h | 12 +-
> .../staging/media/sunxi/cedrus/cedrus_h264.c | 115 ++++++++----------
> .../staging/media/sunxi/cedrus/cedrus_hw.c | 4 +-
> .../staging/media/sunxi/cedrus/cedrus_video.c | 25 ++--
> 4 files changed, 68 insertions(+), 88 deletions(-)
>

2019-08-12 12:22:05

by Jernej Skrabec

[permalink] [raw]
Subject: Re: [PATCH 0/7] media: cedrus: Improvements/cleanup

Dne ponedeljek, 12. avgust 2019 ob 14:12:21 CEST je Hans Verkuil napisal(a):
> On 5/30/19 11:15 PM, Jernej Skrabec wrote:
> > Here is first batch of random Cedrus improvements/cleanups. Only patch 2
> > has a change which raises a question about H264 controls.
> >
> > Changes were tested on H3 SoC using modified ffmpeg and Kodi.
> >
> > Please take a look.
>
> This has been sitting in patchwork for quite some time. I've updated the
> status of the various patches and most needed extra work.
>
> It seems that patches 4/7 and 5/7 are OK. Maxime, can you please confirm
> that these two are still valid? They apply cleanly on the latest master
> at least, but since they are a bit old I prefer to have confirmation that
> it's OK to merge them.

I'm not sure about patch 4, IIRC Boris Brezillon also wants to improve that
area in separate series, but patch 5 should be safe to merge.

Anyway, I didn't post new version because I'm waiting on close-to-be-merged
H264 and HEVC patch series to be actually merged.

Best regards,
Jernej

>
> Regards,
>
> Hans
>
> > Best regards,
> > Jernej
> >
> > Jernej Skrabec (7):
> > media: cedrus: Disable engine after each slice decoding
> > media: cedrus: Fix H264 default reference index count
> > media: cedrus: Fix decoding for some H264 videos
> > media: cedrus: Remove dst_bufs from context
> > media: cedrus: Don't set chroma size for scale & rotation
> > media: cedrus: Add infra for extra buffers connected to capture
> >
> > buffers
> >
> > media: cedrus: Improve H264 memory efficiency
> >
> > drivers/staging/media/sunxi/cedrus/cedrus.h | 12 +-
> > .../staging/media/sunxi/cedrus/cedrus_h264.c | 115 ++++++++----------
> > .../staging/media/sunxi/cedrus/cedrus_hw.c | 4 +-
> > .../staging/media/sunxi/cedrus/cedrus_video.c | 25 ++--
> > 4 files changed, 68 insertions(+), 88 deletions(-)




2019-08-12 13:29:17

by Ezequiel Garcia

[permalink] [raw]
Subject: Re: [PATCH 1/7] media: cedrus: Disable engine after each slice decoding

Hi Jernej,

On Mon, 2019-06-03 at 13:38 +0200, Maxime Ripard wrote:
> Hi,
>
> On Thu, May 30, 2019 at 11:15:10PM +0200, Jernej Skrabec wrote:
> > libvdpau-sunxi always disables engine after each decoded slice.
> > Do same in Cedrus driver.
> >
> > Presumably this also lowers power consumption which is always nice.
> >
> > Signed-off-by: Jernej Skrabec <[email protected]>
>
> Is it fixing anything though?
>
> I indeed saw that cedar did disable it everytime, but I couldn't find
> a reason why.
>
> Also, the power management improvement would need to be measured, it
> can even create the opposite situation where the device will draw more
> current from being woken up than if it had just remained disabled.
>

While reviewing this, I'm noticing that cedrus_engine_disable can
be marked for static storage (with or without this patch).

Regards,
Eze

2019-08-12 13:45:19

by Ezequiel Garcia

[permalink] [raw]
Subject: Re: [PATCH 4/7] media: cedrus: Remove dst_bufs from context

On Thu, 2019-05-30 at 23:15 +0200, Jernej Skrabec wrote:
> This array is just duplicated capture buffer queue. Remove it and adjust
> code to look into capture buffer queue instead.
>
> Signed-off-by: Jernej Skrabec <[email protected]>
> ---
> drivers/staging/media/sunxi/cedrus/cedrus.h | 4 +---
> .../staging/media/sunxi/cedrus/cedrus_h264.c | 4 ++--
> .../staging/media/sunxi/cedrus/cedrus_video.c | 22 -------------------
> 3 files changed, 3 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
> index 3f476d0fd981..d8e6777e5e27 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> @@ -100,8 +100,6 @@ struct cedrus_ctx {
> struct v4l2_ctrl_handler hdl;
> struct v4l2_ctrl **ctrls;
>
> - struct vb2_buffer *dst_bufs[VIDEO_MAX_FRAME];
> -
> union {
> struct {
> void *mv_col_buf;
> @@ -187,7 +185,7 @@ static inline dma_addr_t cedrus_dst_buf_addr(struct cedrus_ctx *ctx,
> if (index < 0)
> return 0;
>
> - buf = ctx->dst_bufs[index];
> + buf = ctx->fh.m2m_ctx->cap_q_ctx.q.bufs[index];

I think you can use v4l2_m2m_get_dst_vq() to access the queue,
and vb2_get_buffer() to access buffers in a vb2 queue.

> return buf ? cedrus_buf_addr(buf, &ctx->dst_fmt, plane) : 0;
> }
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> index d0ee3f90ff46..b2290f98d81a 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> @@ -119,7 +119,7 @@ static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> if (buf_idx < 0)
> continue;
>
> - cedrus_buf = vb2_to_cedrus_buffer(ctx->dst_bufs[buf_idx]);
> + cedrus_buf = vb2_to_cedrus_buffer(cap_q->bufs[buf_idx]);

Ditto about vb2_get_buffer.

> position = cedrus_buf->codec.h264.position;
> used_dpbs |= BIT(position);
>
> @@ -194,7 +194,7 @@ static void _cedrus_write_ref_list(struct cedrus_ctx *ctx,
> if (buf_idx < 0)
> continue;
>
> - ref_buf = to_vb2_v4l2_buffer(ctx->dst_bufs[buf_idx]);
> + ref_buf = to_vb2_v4l2_buffer(cap_q->bufs[buf_idx]);

Ditto about vb2_get_buffer.

With those changes:

Reviewed-by: Ezequiel Garcia <[email protected]>

Thanks,
Ezequiel

2019-08-12 13:56:23

by Maxime Ripard

[permalink] [raw]
Subject: Re: [PATCH 0/7] media: cedrus: Improvements/cleanup

Hi!

On Mon, Aug 12, 2019 at 02:12:21PM +0200, Hans Verkuil wrote:
> On 5/30/19 11:15 PM, Jernej Skrabec wrote:
> > Here is first batch of random Cedrus improvements/cleanups. Only patch 2
> > has a change which raises a question about H264 controls.
> >
> > Changes were tested on H3 SoC using modified ffmpeg and Kodi.
> >
> > Please take a look.
>
> This has been sitting in patchwork for quite some time. I've updated the
> status of the various patches and most needed extra work.
>
> It seems that patches 4/7 and 5/7 are OK. Maxime, can you please confirm
> that these two are still valid? They apply cleanly on the latest master
> at least, but since they are a bit old I prefer to have confirmation that
> it's OK to merge them.

Yes, you can definitely merge those.

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Attachments:
(No filename) (896.00 B)
signature.asc (235.00 B)
Download all attachments