2024-04-09 21:25:17

by Michael Grzeschik

Subject: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize

This patch series improves the size calculation and allocation of the
uvc requests. Using the currently configured frame duration of the
stream, it is possible to calculate the number of requests based on
the interval length.
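
As a rough, purely illustrative example (the numbers are hypothetical,
not taken from the patches): a 30 fps stream has a frame interval of
333333 in 100 ns units; with an isochronous service interval of 125 us
(1250 in the same units) that gives about 333333 / 1250 = 267 requests
per frame, and a 1920x1080 YUY2 frame of 4147200 bytes then only needs
about 4147200 / 267 = 15533 bytes per request.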

Signed-off-by: Michael Grzeschik <[email protected]>
---
Michael Grzeschik (3):
usb: gadget: function: uvc: set req_size once when the vb2 queue is calculated
usb: gadget: uvc: add g_parm and s_parm for frame interval
usb: gadget: uvc: set req_size and n_requests based on the frame interval

drivers/usb/gadget/function/uvc.h | 1 +
drivers/usb/gadget/function/uvc_queue.c | 30 ++++++++++++++-----
drivers/usb/gadget/function/uvc_v4l2.c | 52 +++++++++++++++++++++++++++++++++
drivers/usb/gadget/function/uvc_video.c | 17 ++---------
4 files changed, 79 insertions(+), 21 deletions(-)
---
base-commit: 3295f1b866bfbcabd625511968e8a5c541f9ab32
change-id: 20240403-uvc_request_length_by_interval-a7efd587d963

Best regards,
--
Michael Grzeschik <[email protected]>



2024-04-09 21:25:19

by Michael Grzeschik

Subject: [PATCH 2/3] usb: gadget: uvc: add g_parm and s_parm for frame interval

The uvc gadget driver lacks the information about which frame interval
was set by the host. Add this information by implementing the g_parm
and s_parm callbacks.
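
For illustration, a minimal userspace sketch of how an application
could hand the negotiated frame interval to the gadget driver through
the new callback (the helper name and values are hypothetical, not part
of this patch):

#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* fd is the opened uvc gadget video node (an output device). */
static int set_frame_interval(int fd, unsigned int fps)
{
	struct v4l2_streamparm parm = {
		.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
	};

	/* timeperframe is seconds per frame, here 1/fps. */
	parm.parm.output.timeperframe.numerator = 1;
	parm.parm.output.timeperframe.denominator = fps;

	return ioctl(fd, VIDIOC_S_PARM, &parm);
}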

Signed-off-by: Michael Grzeschik <[email protected]>
---
drivers/usb/gadget/function/uvc.h | 1 +
drivers/usb/gadget/function/uvc_v4l2.c | 52 ++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)

diff --git a/drivers/usb/gadget/function/uvc.h b/drivers/usb/gadget/function/uvc.h
index cb35687b11e7e..d153bd9e35e31 100644
--- a/drivers/usb/gadget/function/uvc.h
+++ b/drivers/usb/gadget/function/uvc.h
@@ -97,6 +97,7 @@ struct uvc_video {
unsigned int width;
unsigned int height;
unsigned int imagesize;
+ unsigned int interval;
struct mutex mutex; /* protects frame parameters */

unsigned int uvc_num_requests;
diff --git a/drivers/usb/gadget/function/uvc_v4l2.c b/drivers/usb/gadget/function/uvc_v4l2.c
index a024aecb76dc3..5b579ec1f5040 100644
--- a/drivers/usb/gadget/function/uvc_v4l2.c
+++ b/drivers/usb/gadget/function/uvc_v4l2.c
@@ -307,6 +307,56 @@ uvc_v4l2_set_format(struct file *file, void *fh, struct v4l2_format *fmt)
return ret;
}

+static int uvc_v4l2_g_parm(struct file *file, void *fh,
+ struct v4l2_streamparm *parm)
+{
+ struct video_device *vdev = video_devdata(file);
+ struct uvc_device *uvc = video_get_drvdata(vdev);
+ struct uvc_video *video = &uvc->video;
+ struct v4l2_fract timeperframe;
+
+ if (parm->type == V4L2_BUF_TYPE_VIDEO_CAPTURE)
+ return -EINVAL;
+
+ /* Return the actual frame period. */
+ timeperframe.numerator = video->interval;
+ timeperframe.denominator = 10000000;
+ v4l2_simplify_fraction(&timeperframe.numerator,
+ &timeperframe.denominator, 8, 333);
+
+ uvcg_dbg(&uvc->func, "Getting frame interval of %u/%u (%u)\n",
+ timeperframe.numerator, timeperframe.denominator,
+ video->interval);
+
+ parm->parm.output.timeperframe = timeperframe;
+ parm->parm.output.capability = V4L2_CAP_TIMEPERFRAME;
+
+ return 0;
+}
+
+static int uvc_v4l2_s_parm(struct file *file, void *fh,
+ struct v4l2_streamparm *parm)
+{
+ struct video_device *vdev = video_devdata(file);
+ struct uvc_device *uvc = video_get_drvdata(vdev);
+ struct uvc_video *video = &uvc->video;
+ struct v4l2_fract timeperframe;
+
+ if (parm->type == V4L2_BUF_TYPE_VIDEO_CAPTURE)
+ return -EINVAL;
+
+ timeperframe = parm->parm.output.timeperframe;
+
+ video->interval = v4l2_fraction_to_interval(timeperframe.numerator,
+ timeperframe.denominator);
+
+ uvcg_dbg(&uvc->func, "Setting frame interval to %u/%u (%u)\n",
+ timeperframe.numerator, timeperframe.denominator,
+ video->interval);
+
+ return 0;
+}
+
static int
uvc_v4l2_enum_frameintervals(struct file *file, void *fh,
struct v4l2_frmivalenum *fival)
@@ -577,6 +627,8 @@ const struct v4l2_ioctl_ops uvc_v4l2_ioctl_ops = {
.vidioc_dqbuf = uvc_v4l2_dqbuf,
.vidioc_streamon = uvc_v4l2_streamon,
.vidioc_streamoff = uvc_v4l2_streamoff,
+ .vidioc_s_parm = uvc_v4l2_s_parm,
+ .vidioc_g_parm = uvc_v4l2_g_parm,
.vidioc_subscribe_event = uvc_v4l2_subscribe_event,
.vidioc_unsubscribe_event = uvc_v4l2_unsubscribe_event,
.vidioc_default = uvc_v4l2_ioctl_default,

--
2.39.2


2024-04-09 21:25:42

by Michael Grzeschik

Subject: [PATCH 3/3] usb: gadget: uvc: set req_size and n_requests based on the frame interval

With the information about the frame interval length it is now possible
to calculate the number of usb requests from the frame duration. Based
on the number of requests and the imagesize we calculate the actual
size per request. This calculation has the benefit that the frame data
is equally distributed over all allocated requests.

We keep the current req_size calculation as a fallback, in case the
interval callbacks did not set the interval property.
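
As a rough worked example (the endpoint values here are hypothetical):
with maxpacket = 1024, maxburst = 2 and mult = 3 the per-request limit
is 1024 * 2 * 3 = 6144 bytes. At 30 fps (frame interval 333333 in
100 ns units) a frame spans about 267 isoc intervals of 125 us, so a
614400 byte VGA YUY2 frame only needs about 2302 bytes per request,
well below that limit, while a frame larger than 267 * 6144 = 1640448
bytes could not be streamed at that rate and queue_setup would return
-EINVAL.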

Signed-off-by: Michael Grzeschik <[email protected]>
---
drivers/usb/gadget/function/uvc_queue.c | 28 +++++++++++++++++++++-------
drivers/usb/gadget/function/uvc_video.c | 2 +-
2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/usb/gadget/function/uvc_queue.c b/drivers/usb/gadget/function/uvc_queue.c
index ce51643fc4639..cd8a9793aa72e 100644
--- a/drivers/usb/gadget/function/uvc_queue.c
+++ b/drivers/usb/gadget/function/uvc_queue.c
@@ -44,7 +44,7 @@ static int uvc_queue_setup(struct vb2_queue *vq,
{
struct uvc_video_queue *queue = vb2_get_drv_priv(vq);
struct uvc_video *video = container_of(queue, struct uvc_video, queue);
- unsigned int req_size;
+ unsigned int req_size, max_req_size;
unsigned int nreq;

if (*nbuffers > UVC_MAX_VIDEO_BUFFERS)
@@ -54,15 +54,29 @@ static int uvc_queue_setup(struct vb2_queue *vq,

sizes[0] = video->imagesize;

- req_size = video->ep->maxpacket
+ nreq = DIV_ROUND_UP(video->interval, video->ep->desc->bInterval * 1250);
+
+ req_size = DIV_ROUND_UP(sizes[0], nreq);
+
+ max_req_size = video->ep->maxpacket
* max_t(unsigned int, video->ep->maxburst, 1)
* (video->ep->mult);

- /* We divide by two, to increase the chance to run
- * into fewer requests for smaller framesizes.
- */
- nreq = DIV_ROUND_UP(DIV_ROUND_UP(sizes[0], 2), req_size);
- nreq = clamp(nreq, 4U, 64U);
+ if (!req_size) {
+ req_size = max_req_size;
+
+ /* We divide by two, to increase the chance to run
+ * into fewer requests for smaller framesizes.
+ */
+ nreq = DIV_ROUND_UP(DIV_ROUND_UP(sizes[0], 2), req_size);
+ nreq = clamp(nreq, 4U, 64U);
+ } else if (req_size > max_req_size) {
+ /* The prepared interval length and expected buffer size
+ * cannot be streamed with the currently configured
+ * isoc bandwidth.
+ */
+ return -EINVAL;
+ }

video->req_size = req_size;
video->uvc_num_requests = nreq;
diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c
index 95bb64e16f3da..d197c46e93fb4 100644
--- a/drivers/usb/gadget/function/uvc_video.c
+++ b/drivers/usb/gadget/function/uvc_video.c
@@ -304,7 +304,7 @@ static int uvcg_video_usb_req_queue(struct uvc_video *video,
*/
if (list_empty(&video->req_free) || ureq->last_buf ||
!(video->req_int_count %
- DIV_ROUND_UP(video->uvc_num_requests, 4))) {
+ clamp(DIV_ROUND_UP(video->uvc_num_requests, 4), 4U, 16U))) {
video->req_int_count = 0;
req->no_interrupt = 0;
} else {

--
2.39.2


2024-04-09 21:25:44

by Michael Grzeschik

Subject: [PATCH 1/3] usb: gadget: function: uvc: set req_size once when the vb2 queue is calculated

The uvc gadget driver calculates the req_size on every
video_enable/alloc_request, based on the fixed configfs parameters
maxpacket, maxburst and mult.

As those parameters cannot be changed once the gadget is started, and
the same calculation is already done early in the
vb2_streamon/queue_setup path, it is safe to remove the extra
calculation and reuse the value from uvc_queue_setup for the
allocation step.

Signed-off-by: Michael Grzeschik <[email protected]>

---
Link to v1: https://lore.kernel.org/r/[email protected]
---
drivers/usb/gadget/function/uvc_queue.c | 2 ++
drivers/usb/gadget/function/uvc_video.c | 15 ++-------------
2 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/usb/gadget/function/uvc_queue.c b/drivers/usb/gadget/function/uvc_queue.c
index 0aa3d7e1f3cc3..ce51643fc4639 100644
--- a/drivers/usb/gadget/function/uvc_queue.c
+++ b/drivers/usb/gadget/function/uvc_queue.c
@@ -63,6 +63,8 @@ static int uvc_queue_setup(struct vb2_queue *vq,
*/
nreq = DIV_ROUND_UP(DIV_ROUND_UP(sizes[0], 2), req_size);
nreq = clamp(nreq, 4U, 64U);
+
+ video->req_size = req_size;
video->uvc_num_requests = nreq;

return 0;
diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c
index d41f5f31dadd5..95bb64e16f3da 100644
--- a/drivers/usb/gadget/function/uvc_video.c
+++ b/drivers/usb/gadget/function/uvc_video.c
@@ -496,7 +496,6 @@ uvc_video_free_requests(struct uvc_video *video)
INIT_LIST_HEAD(&video->ureqs);
INIT_LIST_HEAD(&video->req_free);
INIT_LIST_HEAD(&video->req_ready);
- video->req_size = 0;
return 0;
}

@@ -504,16 +503,9 @@ static int
uvc_video_alloc_requests(struct uvc_video *video)
{
struct uvc_request *ureq;
- unsigned int req_size;
unsigned int i;
int ret = -ENOMEM;

- BUG_ON(video->req_size);
-
- req_size = video->ep->maxpacket
- * max_t(unsigned int, video->ep->maxburst, 1)
- * (video->ep->mult);
-
for (i = 0; i < video->uvc_num_requests; i++) {
ureq = kzalloc(sizeof(struct uvc_request), GFP_KERNEL);
if (ureq == NULL)
@@ -523,7 +515,7 @@ uvc_video_alloc_requests(struct uvc_video *video)

list_add_tail(&ureq->list, &video->ureqs);

- ureq->req_buffer = kmalloc(req_size, GFP_KERNEL);
+ ureq->req_buffer = kmalloc(video->req_size, GFP_KERNEL);
if (ureq->req_buffer == NULL)
goto error;

@@ -541,12 +533,10 @@ uvc_video_alloc_requests(struct uvc_video *video)
list_add_tail(&ureq->req->list, &video->req_free);
/* req_size/PAGE_SIZE + 1 for overruns and + 1 for header */
sg_alloc_table(&ureq->sgt,
- DIV_ROUND_UP(req_size - UVCG_REQUEST_HEADER_LEN,
+ DIV_ROUND_UP(video->req_size - UVCG_REQUEST_HEADER_LEN,
PAGE_SIZE) + 2, GFP_KERNEL);
}

- video->req_size = req_size;
-
return 0;

error:
@@ -699,7 +689,6 @@ uvcg_video_disable(struct uvc_video *video)
INIT_LIST_HEAD(&video->ureqs);
INIT_LIST_HEAD(&video->req_free);
INIT_LIST_HEAD(&video->req_ready);
- video->req_size = 0;
spin_unlock_irqrestore(&video->req_lock, flags);

/*

--
2.39.2


2024-04-21 23:26:08

by Michael Grzeschik

Subject: Re: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize

On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>This patch series is improving the size calculation and allocation
>of the uvc requests. Using the currenlty setup frame duration of the
>stream it is possible to calculate the number of requests based on the
>interval length.

The basic concept here is right. But unfortunately we found out that,
together with patch [1] and the current zero-length request pump
mechanism [2] and [3], this is not working as expected.

The conclusion that we cannot queue more than one frame at once into
the hw led to [1]. The current implementation of zero-length requests,
which are queued while we are waiting for the frame to finish
transferring, will enlarge the frame duration, since every zero-length
request still takes up at least one frame interval of 125 us.

This longer duration of each enqueued frame will therefore decrease
the absolute framerate.
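
To put rough numbers on this (purely illustrative): at 30 fps a frame
spans about 267 microframes of 125 us. If, say, 20 zero-length requests
get interleaved while we wait for the frame to complete, the frame
occupies 287 intervals, i.e. roughly 35.9 ms instead of 33.3 ms, and
the effective rate already drops to about 27.9 fps.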

Therefore, to properly make those patches work, we will have to get rid
of the zero-length pump mechanism again and make sure that the whole
business logic of what is to be queued, and when, is handled only in
the pump worker. It is possible to let the dwc3 udc run dry: as we are
actively waiting for the frame to finish, the last request in the
prepared and started list will stop the current dwc3 stream, and
therefore no underruns will occur with the next ep_queue.

Also, keeping the prepared list and doing the preparation in any case
in the pump worker is still a good approach that we need to keep.

With all these pending patches the whole uvc saga of underruns and
flickering video streams should come to an end™.

I already started on this, but I would be happy to see Avichal and
others review the patches once I consider them ready.

mgr

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/

>Signed-off-by: Michael Grzeschik <[email protected]>
>---
>Michael Grzeschik (3):
> usb: gadget: function: uvc: set req_size once when the vb2 queue is calculated
> usb: gadget: uvc: add g_parm and s_parm for frame interval
> usb: gadget: uvc: set req_size and n_requests based on the frame interval
>
> drivers/usb/gadget/function/uvc.h | 1 +
> drivers/usb/gadget/function/uvc_queue.c | 30 ++++++++++++++-----
> drivers/usb/gadget/function/uvc_v4l2.c | 52 +++++++++++++++++++++++++++++++++
> drivers/usb/gadget/function/uvc_video.c | 17 ++---------
> 4 files changed, 79 insertions(+), 21 deletions(-)
>---
>base-commit: 3295f1b866bfbcabd625511968e8a5c541f9ab32
>change-id: 20240403-uvc_request_length_by_interval-a7efd587d963
>
>Best regards,
>--
>Michael Grzeschik <[email protected]>
>
>

--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |



2024-04-23 00:37:38

by Avichal Rakesh

Subject: Re: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize



On 4/21/24 16:25, Michael Grzeschik wrote:
> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>> This patch series is improving the size calculation and allocation
>> of the uvc requests. Using the currenlty setup frame duration of the
>> stream it is possible to calculate the number of requests based on the
>> interval length.
>
> The basic concept here is right. But unfortunatly we found out that
> together with Patch [1] and the current zero length request pump
> mechanism [2] and [3] this is not working as expected.
>
> The conclusion that we can not queue more than one frame at once into
> the hw led to [1]. The current implementation of zero length reqeusts
> which will be queued while we are waiting for the frame to finish
> transferring will enlarge the frame duration. Since every zero-length
> request is still taking up at least one frame interval of 125 us.

I haven't taken a super close look at your patches, so please feel free
to correct me if I am misunderstanding something.

It looks like the goal of the patches is to determine a better number
and size of usb_requests from the given framerate, such that we send
exactly nreqs requests per frame, where nreqs is the number of requests
that can be sent in one frame interval?

As the logic stands, we need some 0-length requests to be circulating to
ensure that we don't miss ISOC deadlines. The current logic unconditionally
sends half of all allocated requests to be circulated.

With those two things in mind, this means that video_pump can encode
at most half a frame in one go, and then has to wait for complete
callbacks to come in. In such cases, the theoretical worst case for
encode time is
125us * (number of requests needed per frame / 2) + scheduling delays
as after the first half of the frame has been encoded, the video_pump
thread will have to wait 125us for each of the zero length requests to
be returned.
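
To put hypothetical numbers on that worst case: at 30 fps a frame needs
roughly 267 requests, so 125us * 267 / 2 is already about 16.7 ms of
waiting per frame before any scheduling delays are counted, i.e. half
of the 33 ms frame budget.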

The underlying assumption behind the "queue 0-length requests" approach
was that video_pump encodes the frames in as few requests as possible
and that there are spare requests to maintain a pressure on the
ISOC queue without hindering the video_pump thread, and unfortunately
it seems like patch 3/3 is breaking both of them?

Assuming my understanding of your patches is correct, my question
is: Why do we want to spread the frame uniformly over the requests
instead of encoding it in as few requests as possible? Spreading
the frame over more requests artificially increases the encode time
required by video_pump, and AFAICT there is no real benefit to it.

> Therefor to properly make those patches work, we will have to get rid of
> the zero length pump mechanism again and make sure that the whole
> business logic of what to be queued and when will only be done in the
> pump worker. It is possible to let the dwc3 udc run dry, as we are
> actively waiting for the frame to finish, the last request in the
> prepared and started list will stop the current dwc3 stream and therefore
> no underruns will occur with the next ep_queue.

One thing to note here: The reason we moved to queuing 0-length requests
from the complete callback was that even with realtime priority, the
video_pump thread doesn't always meet the ISOC queueing cadence. I think
stopping and starting the stream was briefly discussed in our initial
discussion in
https://lore.kernel.org/all/[email protected]/
and Thinh mentioned that the dwc3 controller does it if it detects an
underrun,
but I am not sure if starting and stopping an ISOC stream is good practice.
Someone better versed in USB protocol can probably confirm, but it seems
somewhat hacky to stop the ISOC stream at the end of the frame and restart
with the next frame.

> With all these pending patches the whole uvc saga of underruns and
> flickering videostreams should come to an end™.

This would indeed be nice!

>
> I already started with this but would be happy to see Avichal and others
> to review the patches when they are ready in my eyes.

Of course!

- Avi.

2024-04-23 14:25:26

by Michael Grzeschik

Subject: Re: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize

Ccing:

Michael Riesch <[email protected]>
Thinh Nguyen <[email protected]>

On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>On 4/21/24 16:25, Michael Grzeschik wrote:
>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>>> This patch series is improving the size calculation and allocation
>>> of the uvc requests. Using the currenlty setup frame duration of the
>>> stream it is possible to calculate the number of requests based on the
>>> interval length.
>>
>> The basic concept here is right. But unfortunatly we found out that
>> together with Patch [1] and the current zero length request pump
>> mechanism [2] and [3] this is not working as expected.
>>
>> The conclusion that we can not queue more than one frame at once into
>> the hw led to [1]. The current implementation of zero length reqeusts
>> which will be queued while we are waiting for the frame to finish
>> transferring will enlarge the frame duration. Since every zero-length
>> request is still taking up at least one frame interval of 125 us.
>
>I haven't taken a super close look at your patches, so please feel free
>to correct me if I am misunderstanding something.
>
>It looks like the goal of the patches is to determine a better number
>and size of usb_requests from the given framerate such that we send exactly
>nreqs requests per frame where nreqs is determined to be the exact number
>of requests that can be sent in one frame interval?

It does not need to be the exact time; actually, it may not be exact.
Scattering the data over all requests would not leave any headroom for
latencies or overhead.

>As the logic stands, we need some 0-length requests to be circulating to
>ensure that we don't miss ISOC deadlines. The current logic unconditionally
>sends half of all allocated requests to be circulated.
>
>With those two things in mind, this means than video_pump can at encode
>at most half a frame in one go, and then has to wait for complete
>callbacks to come in. In such cases, the theoretical worst case for
>encode time is
>125us * (number of requests needed per frame / 2) + scheduling delays
>as after the first half of the frame has been encoded, the video_pump
>thread will have to wait 125us for each of the zero length requests to
>be returned.
>
>The underlying assumption behind the "queue 0-length requests" approach
>was that video_pump encodes the frames in as few requests as possible
>and that there are spare requests to maintain a pressure on the
>ISOC queue without hindering the video_pump thread, and unfortunately
>it seems like patch 3/3 is breaking both of them?

Right.

>Assuming my understanding of your patches is correct, my question
>is: Why do we want to spread the frame uniformly over the requests
>instead of encoding it in as few requests as possible. Spreading
>the frame over more requests artificially increases the encode time
>required by video_pump, and AFAICT there is no real benefit to it?

Thinh gave me the advice that it is better to keep the isoc stream
constantly filled, rather than streaming big amounts of data at the
beginning of a frame interval and then having a lot of spare time
where the bandwidth is completely unused.

In our real-life scenario, streaming big requests had the effect that
the dwc3 core could not keep up with reading the amount of data from
the memory bus, as the bus was already under heavy load. When the
HW was then not able to transfer the requested and actually available
amount of data in the interval, the hw gave us the usual missed
interrupt answer.

Using smaller requests solved the problem here, as it really was
unnecessary to stress the memory and usb bus at the beginning, since
we had enough headroom in the temporal domain.

This then led to the conclusion that the number of requests needed per
image frame interval can be calculated, since we know the usb interval
length.

@Thinh: Correct me if I am saying something wrong here.

>> Therefor to properly make those patches work, we will have to get rid of
>> the zero length pump mechanism again and make sure that the whole
>> business logic of what to be queued and when will only be done in the
>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>> actively waiting for the frame to finish, the last request in the
>> prepared and started list will stop the current dwc3 stream and therefore
>> no underruns will occur with the next ep_queue.
>
>One thing to note here: The reason we moved to queuing 0-length requests
>from complete callback was because even with realtime priority, video_pump
>thread doesn't always meet the ISOC queueing cadence. I think stopping and
>starting the stream was briefly discussed in our initial discussion in
>https://lore.kernel.org/all/[email protected]/
>and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>but I am not sure if starting and stopping an ISOC stream is good practice.

The realtime latency aspect is not an issue anymore if we ensure that
we always keep only one frame in the hw ring buffer. When the pump
worker ensures that it always runs through one full frame, the
scheduler has no chance to break our running dwc3 stream. Since the
pump is running under a while(1) this should be possible.

Also, with the request amount precalculation we can always encode the
whole frame into all available requests and do not have to wait for
requests to become available again.

Together with the latest knowledge about the underlying hw, we only
need to keep one frame in the HW ring buffer. Since we have some
interrupt latency, keeping more frames in the ring buffer would mean
that we are not able to tag the currently streamed frame properly as
erroneous once the dwc3 hw ring buffer is already telling the host
some data about the next frame. And as we already need to wait for the
frame to finish, based on the assumption that only one frame is
enqueued in the ring buffer, the hw will stop the stream and the next
request will start a new stream. So no missed underruns will happen.

So the main fact here is that telling the host some status about a
frame in the past is impossible! Therefore the first request of the
next hw stream needs to be the one that tells the host whether the
previous frame is meant to be drawn or not.

>Someone better versed in USB protocol can probably confirm, but it seems
>somewhat hacky to stop the ISOC stream at the end of the frame and restart
>with the next frame.

All I know is that the HW mechanism that is reading from the trb ring
buffer is started or stopped. I don't know whether the ISOC stream is
really stopped and restarted here, or what that means on the real
wire. And if so, I am unsure whether that is really a problem or not.
Thinh?

Regards,
Michael

--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |



2024-04-23 23:23:38

by Avichal Rakesh

Subject: Re: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize



On 4/23/24 07:25, Michael Grzeschik wrote:
> Ccing:
>
> Michael Riesch <[email protected]>
> Thinh Nguyen <[email protected]>
>
> On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>> On 4/21/24 16:25, Michael Grzeschik wrote:
>>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>>>> This patch series is improving the size calculation and allocation
>>>> of the uvc requests. Using the currenlty setup frame duration of the
>>>> stream it is possible to calculate the number of requests based on the
>>>> interval length.
>>>
>>> The basic concept here is right. But unfortunatly we found out that
>>> together with Patch [1] and the current zero length request pump
>>> mechanism [2] and [3] this is not working as expected.
>>>
>>> The conclusion that we can not queue more than one frame at once into
>>> the hw led to [1]. The current implementation of zero length reqeusts
>>> which will be queued while we are waiting for the frame to finish
>>> transferring will enlarge the frame duration. Since every zero-length
>>> request is still taking up at least one frame interval of 125 us.
>>
>> I haven't taken a super close look at your patches, so please feel free
>> to correct me if I am misunderstanding something.
>>
>> It looks like the goal of the patches is to determine a better number
>> and size of usb_requests from the given framerate such that we send exactly
>> nreqs requests per frame where nreqs is determined to be the exact number
>> of requests that can be sent in one frame interval?
>
> It does not need to be the exact time, actually it may not be exact.
> Scattering the data over all requests would not leave any headroom for
> any latencies or overhead.

IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us,
which gives us the number of requests we can send in exactly one frame interval,
and then sets the size of the request as max framesize / nreq, which means the
frames will be evenly divided up into all available requests (with a little
fuzz factor here and there).

This effectively means that (assuming no other delays) one frame will take
~one frameinterval to be transmitted?
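
(Hypothetical numbers to illustrate: at 30 fps, 333333 * 100 ns divided
into 125 us intervals gives ~267 requests of ~imagesize / 267 bytes
each, so the last request of a frame goes out roughly 33 ms after the
first one.)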

>
>> As the logic stands, we need some 0-length requests to be circulating to
>> ensure that we don't miss ISOC deadlines. The current logic unconditionally
>> sends half of all allocated requests to be circulated.
>>
>> With those two things in mind, this means than video_pump can at encode
>> at most half a frame in one go, and then has to wait for complete
>> callbacks to come in. In such cases, the theoretical worst case for
>> encode time is
>> 125us * (number of requests needed per frame / 2) + scheduling delays
>> as after the first half of the frame has been encoded, the video_pump
>> thread will have to wait 125us for each of the zero length requests to
>> be returned.
>>
>> The underlying assumption behind the "queue 0-length requests" approach
>> was that video_pump encodes the frames in as few requests as possible
>> and that there are spare requests to maintain a pressure on the
>> ISOC queue without hindering the video_pump thread, and unfortunately
>> it seems like patch 3/3 is breaking both of them?
>
> Right.
>
>> Assuming my understanding of your patches is correct, my question
>> is: Why do we want to spread the frame uniformly over the requests
>> instead of encoding it in as few requests as possible. Spreading
>> the frame over more requests artificially increases the encode time
>> required by video_pump, and AFAICT there is no real benefit to it?
>
> Thinh gave me the advise that it is better to use the isoc stream
> constantly filled. Rather then streaming big amounts of data in the
> beginning of an frameinterval and having then a lot of spare time
> where the bandwidth is completely unsused.
>
> In our reallife scenario streaming big requests had the impact, that
> the dwc3 core could not keep up with reading the amount of data
> from the memory bus, as the bus is already under heavy load. When the
> HW was then not able to transfer the requested and actually available
> amount of data in the interval, the hw did give us the usual missed
> interrupt answer.
>
> Using smaller requests solved the problem here, as it really was
> unnecessary to stress the memory and usb bus in the beginning as
> we had enough headroom in the temporal domain.

Ah, I see. This was not a consideration, and it makes sense if the USB
bus is under contention from a few different streams. So the solution
seems to be to spread the frame over as many requests as we can
transmit in one frameinterval?

As an experiment, while we wait for others to respond, could you try
doubling (or 2.5x'ing to be extra safe) the number of requests allocated
by patch 3/3 without changing the request's buffer size?
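
In code terms, roughly something like this in uvc_queue_setup() (an
untested sketch on top of patch 3/3, just to illustrate the idea):

	nreq = DIV_ROUND_UP(video->interval, video->ep->desc->bInterval * 1250);
	/* Keep the per-request size derived from the frame interval... */
	req_size = DIV_ROUND_UP(sizes[0], nreq);
	/* ...but allocate about twice as many requests, so that a full
	 * frame's worth of free requests is available to video_pump while
	 * the previous frame's requests are still in flight.
	 */
	nreq *= 2;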

It won't help with the error reporting but should help with ensuring
that frames are sent out in one frameinterval with little to no
0-length requests between them.

The idea is that video_pump will have enough requests available to fully
encode the frame in one burst, and another frame's worth of request will be
re-added to req_free list for video_pump to fill up in the time that the next
frame comes in.

>
> Which then led to the conclusion that the number of needed requests
> per image frame interval is calculatable since we know the usb
> interval length.
>
> @Thinh: Correct me if I am saying something wrong here.
>
>>> Therefor to properly make those patches work, we will have to get rid of
>>> the zero length pump mechanism again and make sure that the whole
>>> business logic of what to be queued and when will only be done in the
>>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>>> actively waiting for the frame to finish, the last request in the
>>> prepared and started list will stop the current dwc3 stream and therefore
>>> no underruns will occur with the next ep_queue.
>>
>> One thing to note here: The reason we moved to queuing 0-length requests
>> from complete callback was because even with realtime priority, video_pump
>> thread doesn't always meet the ISOC queueing cadence. I think stopping and
>> starting the stream was briefly discussed in our initial discussion in
>> https://lore.kernel.org/all/[email protected]/
>> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>> but I am not sure if starting and stopping an ISOC stream is good practice.
>
> The realtime latency aspect is not an issue anymore if we ensure that we
> always keep only one frame in the hw ring buffer. When the pump worker
> ensure that it will always run through one full frame the scheduler has
> no chance to break our running dwc3 stream. Since the pump is running
> under a while(1) this should be possible.

I'll wait for your patch to see, but are you saying that we should have the
pump worker busy spinning when there are no frames available? Cameras cannot
produce frames fast enough for video_pump to be constantly encoding frames.
IIRC, "encoding" a frame to usb_requests took less than a ms or two on my
device, and frame interval is 33ms for a 30fps stream, so the CPU would be
busy spinning for ~30ms which is an unreasonable time for a CPU to be
idling.

>
> Also with the request amount precalculation we can always encode the
> whole frame into all available requests and don't have to wait for
> requests to be available again.
>
> Together with the latest knowladge about the underlying hw we even need to only
> keep one frame in the HW ring buffer. Since we have some interrupt latency,
> keeping more frames in the ring buffer, would mean that we are not able to tag
> the currently streamed frame properly as errornous if the dwc3 hw ring buffer
> is already telling the host some data about the next frame. And as we already
> need to wait for the end of the frame to finish, based on the assumption that
> only one frame is enqueued in the ring buffer the hw will stop the stream and
> the next requst will start a new stream. So there will no missed underruns be
> happening.
>
> So the main fact here is, that telling the host some status about a
> frame in the past is impossible! Therefor the first request of the next
> hw stream need to be the one that is telling the Host if the previous frame
> is ment to be drawn or not.

This is a fair point, but the timing on this becomes a little difficult if
the frame is sent over the entire frameinterval. If we wait for the entire
frame to be transmitted, then we have 125us between the last request of a
frame being transmitted and the first request of the next frame being
queued. The userspace app producing the frames will have timing variations
larger than 125us, so we cannot rely on a frame being available exactly as
one frame is fully transmitted, or of us being notified of transmission
status by the time the next frame comes in.

>
>> Someone better versed in USB protocol can probably confirm, but it seems
>> somewhat hacky to stop the ISOC stream at the end of the frame and restart
>> with the next frame.
>
> All I know is that the HW mechanism that is reading from the trb ring buffer is
> started or stopped I don't know if really the ISOC stream is stopped and
> restarted here or what that means on the real wire. And if so, I am unsure if
> that is really a problem or not. Thinh?

Oh? That's great! If the controller can keep the ISOC stream from
underrunning without the gadget feeding it 0-length requests, then we
can simplify the gadget side implementation quite a bit!


Regards,
Avi.

2024-04-24 02:29:10

by Thinh Nguyen

Subject: Re: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize

On Tue, Apr 23, 2024, Avichal Rakesh wrote:
>
>
> On 4/23/24 07:25, Michael Grzeschik wrote:
> > Ccing:
> >
> > Michael Riesch <[email protected]>
> > Thinh Nguyen <[email protected]>
> >
> > On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
> >> On 4/21/24 16:25, Michael Grzeschik wrote:
> >>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
> >>>> This patch series is improving the size calculation and allocation
> >>>> of the uvc requests. Using the currenlty setup frame duration of the
> >>>> stream it is possible to calculate the number of requests based on the
> >>>> interval length.
> >>>
> >>> The basic concept here is right. But unfortunatly we found out that
> >>> together with Patch [1] and the current zero length request pump
> >>> mechanism [2] and [3] this is not working as expected.
> >>>
> >>> The conclusion that we can not queue more than one frame at once into
> >>> the hw led to [1]. The current implementation of zero length reqeusts
> >>> which will be queued while we are waiting for the frame to finish
> >>> transferring will enlarge the frame duration. Since every zero-length
> >>> request is still taking up at least one frame interval of 125 us.
> >>
> >> I haven't taken a super close look at your patches, so please feel free
> >> to correct me if I am misunderstanding something.
> >>
> >> It looks like the goal of the patches is to determine a better number
> >> and size of usb_requests from the given framerate such that we send exactly
> >> nreqs requests per frame where nreqs is determined to be the exact number
> >> of requests that can be sent in one frame interval?
> >
> > It does not need to be the exact time, actually it may not be exact.
> > Scattering the data over all requests would not leave any headroom for
> > any latencies or overhead.
>
> IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us,
> which gives us the number of requests we can send in exactly one frame interval,
> and then sets the size of the request as max framesize / nreq, which means the
> frames will be evenly divided up into all available requests (with a little
> fuzz factor here and there).
>
> This effectively means that (assuming no other delays) one frame will take
> ~one frameinterval to be transmitted?
>
> >
> >> As the logic stands, we need some 0-length requests to be circulating to
> >> ensure that we don't miss ISOC deadlines. The current logic unconditionally
> >> sends half of all allocated requests to be circulated.
> >>
> >> With those two things in mind, this means than video_pump can at encode
> >> at most half a frame in one go, and then has to wait for complete
> >> callbacks to come in. In such cases, the theoretical worst case for
> >> encode time is
> >> 125us * (number of requests needed per frame / 2) + scheduling delays
> >> as after the first half of the frame has been encoded, the video_pump
> >> thread will have to wait 125us for each of the zero length requests to
> >> be returned.
> >>
> >> The underlying assumption behind the "queue 0-length requests" approach
> >> was that video_pump encodes the frames in as few requests as possible
> >> and that there are spare requests to maintain a pressure on the
> >> ISOC queue without hindering the video_pump thread, and unfortunately
> >> it seems like patch 3/3 is breaking both of them?
> >
> > Right.
> >
> >> Assuming my understanding of your patches is correct, my question
> >> is: Why do we want to spread the frame uniformly over the requests
> >> instead of encoding it in as few requests as possible. Spreading
> >> the frame over more requests artificially increases the encode time
> >> required by video_pump, and AFAICT there is no real benefit to it?
> >
> > Thinh gave me the advise that it is better to use the isoc stream
> > constantly filled. Rather then streaming big amounts of data in the
> > beginning of an frameinterval and having then a lot of spare time
> > where the bandwidth is completely unsused.
> >
> > In our reallife scenario streaming big requests had the impact, that
> > the dwc3 core could not keep up with reading the amount of data
> > from the memory bus, as the bus is already under heavy load. When the
> > HW was then not able to transfer the requested and actually available
> > amount of data in the interval, the hw did give us the usual missed
> > interrupt answer.
> >
> > Using smaller requests solved the problem here, as it really was
> > unnecessary to stress the memory and usb bus in the beginning as
> > we had enough headroom in the temporal domain.
>
> Ah, I see. This was not a consideration, and it makes sense if USB
> bus is under contention from a few different streams. So the solution
> seems to be to spread the frame of as many requests as we can transmit
> in one frameinterval?
>
> As an experiment, while we wait for others to respond, could you try
> doubling (or 2.5x'ing to be extra safe) the number of requests allocated
> by patch 3/3 without changing the request's buffer size?
>
> It won't help with the error reporting but should help with ensuring
> that frames are sent out in one frameinterval with little to no
> 0-length requests between them.
>
> The idea is that video_pump will have enough requests available to fully
> encode the frame in one burst, and another frame's worth of request will be
> re-added to req_free list for video_pump to fill up in the time that the next
> frame comes in.
>
> >
> > Which then led to the conclusion that the number of needed requests
> > per image frame interval is calculatable since we know the usb
> > interval length.
> >
> > @Thinh: Correct me if I am saying something wrong here.

Right, if you max out the data rate per uframe, there's less opportunity
for the host to schedule everything for that interval (e.g. it is
affected by other endpoint/device traffic, link commands, etc). It also
increases the DMA latency. In many cases, other vendor hosts can't
handle 48KB/uframe for SuperSpeed and 96KB/uframe for SuperSpeed Plus.
So, you'd need to test your platform to find the optimal request size
so it can work for most hosts.
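
(For reference, 48KB per uframe corresponds to the SuperSpeed isoc
maximum of 1024 bytes * 16 packets per burst * 3 bursts per service
interval.)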

> >
> >>> Therefor to properly make those patches work, we will have to get rid of

Sorry if I may have missed the explanation, but why do we need to get
rid of this?

> >>> the zero length pump mechanism again and make sure that the whole
> >>> business logic of what to be queued and when will only be done in the
> >>> pump worker. It is possible to let the dwc3 udc run dry, as we are
> >>> actively waiting for the frame to finish, the last request in the
> >>> prepared and started list will stop the current dwc3 stream and therefore
> >>> no underruns will occur with the next ep_queue.
> >>
> >> One thing to note here: The reason we moved to queuing 0-length requests
> >> from complete callback was because even with realtime priority, video_pump
> >> thread doesn't always meet the ISOC queueing cadence. I think stopping and
> >> starting the stream was briefly discussed in our initial discussion in
> >> https://lore.kernel.org/all/[email protected]/
> >> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
> >> but I am not sure if starting and stopping an ISOC stream is good practice.

There's a workaround specific to UVC in dwc3 to "guess" when an underrun
happens. It's not foolproof. dwc3 should not need to do that.

Isoc data is periodic and continuous. We should not expect this
unconventional re-synchronization.

> >
> > The realtime latency aspect is not an issue anymore if we ensure that we
> > always keep only one frame in the hw ring buffer. When the pump worker
> > ensure that it will always run through one full frame the scheduler has
> > no chance to break our running dwc3 stream. Since the pump is running
> > under a while(1) this should be possible.
>
> I'll wait for your patch to see, but are you saying that we should have the
> pump worker busy spinning when there are no frames available? Cameras cannot
> produce frames fast enough for video_pump to be constantly encoding frames.
> IIRC, "encoding" a frame to usb_requests took less than a ms or two on my
> device, and frame interval is 33ms for a 30fps stream, so the CPU would be
> busy spinning for ~30ms which is an unreasonable time for a CPU to be
> idling.
>
> >
> > Also with the request amount precalculation we can always encode the
> > whole frame into all available requests and don't have to wait for
> > requests to be available again.
> >
> > Together with the latest knowladge about the underlying hw we even need to only
> > keep one frame in the HW ring buffer. Since we have some interrupt latency,
> > keeping more frames in the ring buffer, would mean that we are not able to tag
> > the currently streamed frame properly as errornous if the dwc3 hw ring buffer
> > is already telling the host some data about the next frame. And as we already
> > need to wait for the end of the frame to finish, based on the assumption that
> > only one frame is enqueued in the ring buffer the hw will stop the stream and
> > the next requst will start a new stream. So there will no missed underruns be
> > happening.
> >
> > So the main fact here is, that telling the host some status about a
> > frame in the past is impossible! Therefor the first request of the next
> > hw stream need to be the one that is telling the Host if the previous frame
> > is ment to be drawn or not.
>
> This is a fair point, but the timing on this becomes a little difficult if
> the frame is sent over the entire frameinterval. If we wait for the entire
> frame to be transmitted, then we have 125us between the last request of a
> frame being transmitted and the first request of the next frame being
> queued. The userspace app producing the frames will have timing variations
> larger than 125us, so we cannot rely on a frame being available exactly as
> one frame is fully transmitted, or of us being notified of transmission
> status by the time the next frame comes in.
>
> >
> >> Someone better versed in USB protocol can probably confirm, but it seems
> >> somewhat hacky to stop the ISOC stream at the end of the frame and restart
> >> with the next frame.
> >
> > All I know is that the HW mechanism that is reading from the trb ring buffer is
> > started or stopped I don't know if really the ISOC stream is stopped and
> > restarted here or what that means on the real wire. And if so, I am unsure if
> > that is really a problem or not. Thinh?

For an isoc IN endpoint, if the host requests data while there's no TRB
prepared, the controller will respond with 0-length data. When we stop
and start again, we reschedule the prepared isoc data to go out on a
new interval.

>
> Oh? That's great! If the controller can keep the ISOC stream from underruning
> without the gadget feeding it 0-length requests, then we can simplify the
> gadget side implementation quite a bit!
>

I'm not entirely clear on why we cannot use 0-length requests now. I
hope we can resolve this from the UVC function driver, as that seems to
be the proper place to handle this issue.

Thanks,
Thinh