by Jens Axboe

Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On 11/7/22 6:25 AM, Willem de Bruijn wrote:
> On Sat, Nov 5, 2022 at 2:46 PM Jens Axboe <[email protected]> wrote:
>>
>> On 11/5/22 12:05 PM, Willem de Bruijn wrote:
>>> On Sat, Nov 5, 2022 at 1:39 PM Jens Axboe <[email protected]> wrote:
>>>>
>>>>>> FWIW, when adding nsec resolution I initially opted for an init-based
>>>>>> approach, passing a new flag to epoll_create1. Feedback then was that
>>>>>> it was odd to have one syscall affect the behavior of another. The
>>>>>> final version just added a new epoll_pwait2 with timespec.
>>>>>
>>>>> I'm fine with just doing a pure syscall variant too, it was my original
>>>>> plan. Only changed it to allow for easier experimentation and adoption,
>>>>> and based on the fact that most use cases would likely use a fixed value
>>>>> per context anyway.
>>>>>
>>>>> I think it'd be a shame to drop the ctl, unless there's strong arguments
>>>>> against it. I'm quite happy to add a syscall variant too, that's not a
>>>>> big deal and would be a minor addition. Patch 6 should probably cut out
>>>>> the ctl addition and leave that for a patch 7, and then a patch 8 for
>>>>> adding a syscall.
>>>> I split the ctl patch out from the core change, and then took a look at
>>>> doing a syscall variant too. But there are a few complications there...
>>>> It would seem to make the most sense to build this on top of the newest
>>>> epoll wait syscall, epoll_pwait2(). But we're already at the max number
>>>> of arguments there...
>>>>
>>>> Arguably pwait2 should've been converted to use some kind of versioned
>>>> struct instead. I'm going to take a stab at pwait3 with that kind of
>>>> interface.
>>>
>>> Don't convert to a syscall approach based solely on my feedback. It
>>> would be good to hear from others.
>>
>> It's not just based on your feedback, if you read the original cover
>> letter, then that is the question that is posed in terms of API - ctl to
>> modify it, new syscall, or both? So figured I should at least try and
>> see what the syscall would look like.
>>
>>> At a high level, I'm somewhat uncomfortable merging two syscalls for
>>> behavior that already works, just to save half the syscall overhead.
>>> There is no shortage of calls that may make some sense for a workload
>>> to merge. Is the quoted 6-7% cpu cycle reduction due to saving one
>>> SYSENTER/SYSEXIT (as the high resolution timer wake-up will be the
>>> same), or am I missing something more fundamental?
>>
>> No, it's not really related to saving a single syscall, and you'd
>> potentially save more than just one as well. If we look at the two
>> extremes of applications, one will be low load and you're handling
>> probably just 1 event per loop. Not really interesting. At the other
>> end, you're fully loaded, and by the time you check for events, you have
>> 'maxevents' (or close to) available. That obviously reduces system
>> calls, but more importantly, it also allows the application to get some
>> batching effects from processing these events.
>>
>> In the medium range, there's enough processing to react pretty quickly
>> to events coming in, and you then end up doing just 1 event (or close to
>> that). To overcome that, we have some applications that detect this
>> medium range and do an artificial sleep before calling epoll_wait().
>> That was a nice efficiency win for them. But we can do this a lot more
>> efficiently in the kernel. That was the idea behind this, and the
>> initial results from TAO (which does that sleep hack) proved it to be
>> more than worthwhile. Syscall reduction is one thing, improved batching
>> another, and just as important are the sleep+wakeup reductions.
>
> Thanks for the context.
>
> So this is akin to interrupt moderation in network interfaces. Would
> it make sense to wait for timeout or nr of events, whichever comes
> first, similar to rx_usecs/rx_frames, instead of an unconditional
> sleep at the start?

There's no unconditional sleep at the start with my patches, not sure
where you are getting that from. You already have 'nr of events', that's
the maxevents being passed in. If nr_available >= maxevents, then no
sleep will take place. We did debate doing a minevents kind of thing as
well, but the time based metric is more usable.
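
A minimal user-space sketch of the wake-up rule described above, not the
kernel code. The struct, its field names, and should_return() are
hypothetical, and the handling of the overall timeout is an assumption
about how min_wait interacts with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one epoll_wait() call; not a kernel structure. */
struct wait_state {
    int      nr_available;  /* events ready right now */
    int      maxevents;     /* caller's maxevents argument */
    uint64_t elapsed_us;    /* time since the call was entered */
    uint64_t min_wait_us;   /* 0 means min_wait is disabled */
    uint64_t timeout_us;    /* overall timeout of the call */
};

/* Decide whether the waiter should return to user space now. */
static bool should_return(const struct wait_state *s)
{
    assert(s->maxevents > 0);

    /* A full batch is ready: never sleep, regardless of min_wait. */
    if (s->nr_available >= s->maxevents)
        return true;

    /* The overall timeout has expired: return whatever is there. */
    if (s->elapsed_us >= s->timeout_us)
        return true;

    /* Partial batch: hold it back only until min_wait has passed. */
    if (s->nr_available > 0 && s->elapsed_us >= s->min_wait_us)
        return true;

    return false;
}
```

With min_wait_us set to 0 the last check reduces to "return as soon as
anything is ready", which is the existing epoll_wait() behavior.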

--
Jens Axboe

2022-11-07 21:58:18

On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> > Hi Jens,
> > NICs and storage controllers have interrupt mitigation/coalescing
> > mechanisms that are similar.
>
> Yep
>
> > NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> > (counter) value. When a completion occurs, the device waits until the
> > timeout or until the completion counter value is reached.
> >
> > If I've read the code correctly, min_wait is computed at the beginning
> > of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> > completion.
> >
> > It makes me wonder which approach is more useful for applications. With
> > the Aggregation Time approach applications can control how much extra
> > latency is added. What do you think about that approach?
>
> We only tested the current approach, which is time noted from entry, not
> from when the first event arrives. I suspect the nvme approach is better
> suited to the hw side, the epoll timeout helps ensure that we batch
> within xx usec rather than xx usec + whatever the delay until the first
> one arrives. Which is why it's handled that way currently. That gives
> you a fixed batch latency.

min_wait is fine when the goal is just maximizing throughput without any
latency targets.

The min_wait approach makes it hard to set a useful upper bound on
latency because unlucky requests that complete early experience much
more latency than requests that complete later.
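
The two timing models under discussion can be compared with a small
sketch. It assumes fewer than maxevents events arrive and that the
overall timeout is not reached; the function names are hypothetical and
all times are in usec relative to epoll_wait() entry.

```c
#include <assert.h>
#include <stdint.h>

/* Window measured from entry (the min_wait approach). */
static uint64_t return_time_from_entry(uint64_t first_event_us,
                                       uint64_t window_us)
{
    /* Return once the window has passed and an event exists. */
    return first_event_us > window_us ? first_event_us : window_us;
}

/* Window measured from the first event (NVMe Aggregation Time style). */
static uint64_t return_time_from_first_event(uint64_t first_event_us,
                                             uint64_t window_us)
{
    return first_event_us + window_us;
}

/* Extra delay an event sees between arriving and being returned. */
static uint64_t added_latency(uint64_t event_us, uint64_t return_us)
{
    assert(return_us >= event_us);
    return return_us - event_us;
}
```

With a 200 usec window, an event arriving 50 usec after entry is held
for 150 usec in the first model and 200 usec in the second. An event
arriving at 190 usec is held for 10 usec versus 200 usec. In the first
model the added latency varies with arrival time but never exceeds the
window; in the second it is constant per batch, but the batch returns
later relative to entry.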

Stefan

2022-11-08 16:42:01

by Jens Axboe

Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>>>> Hi Jens,
>>>>> NICs and storage controllers have interrupt mitigation/coalescing
>>>>> mechanisms that are similar.
>>>>
>>>> Yep
>>>>
>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>>>> (counter) value. When a completion occurs, the device waits until the
>>>>> timeout or until the completion counter value is reached.
>>>>>
>>>>> If I've read the code correctly, min_wait is computed at the beginning
>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>>>> completion.
>>>>>
>>>>> It makes me wonder which approach is more useful for applications. With
>>>>> the Aggregation Time approach applications can control how much extra
>>>>> latency is added. What do you think about that approach?
>>>>
>>>> We only tested the current approach, which is time noted from entry, not
>>>> from when the first event arrives. I suspect the nvme approach is better
>>>> suited to the hw side, the epoll timeout helps ensure that we batch
>>>> within xx usec rather than xx usec + whatever the delay until the first
>>>> one arrives. Which is why it's handled that way currently. That gives
>>>> you a fixed batch latency.
>>>
>>> min_wait is fine when the goal is just maximizing throughput without any
>>> latency targets.
>>
>> That's not true at all, I think you're in different time scales than
>> this would be used for.
>>
>>> The min_wait approach makes it hard to set a useful upper bound on
>>> latency because unlucky requests that complete early experience much
>>> more latency than requests that complete later.
>>
>> As mentioned in the cover letter or the main patch, this is most useful
>> for the medium load kind of scenarios. For high load, the min_wait time
>> ends up not mattering because you will hit maxevents first anyway. For
>> the testing that we did, the target was 200-300 usec, and 200 usec was
>> used for the actual test. Depending on what kind of traffic the
>> server is serving, that's usually not much of a concern. From your
>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
>> don't think those would make sense. If your rate of arrival is low
>> enough that min_wait needs to be high to make a difference, then the
>> load is low enough anyway that it doesn't matter. Hence I'd argue that
>> it is indeed NOT hard to set a useful upper bound on latency, because
>> that is very much what min_wait is.
>>
>> I'm happy to argue merits of one approach over another, but keep in mind
>> that this particular approach was not pulled out of thin air AND it has
>> actually been tested and verified successfully on a production workload.
>> This isn't a hypothetical benchmark kind of setup.
>
> Fair enough. I just wanted to make sure the syscall interface that gets
> merged is as useful as possible.

That is indeed the main discussion as far as I'm concerned - syscall,
ctl, or both? At this point I'm inclined to just push forward with the
ctl addition. A new syscall can always be added, and if we do, then it'd
be nice to make one that will work going forward so we don't have to
keep adding epoll_wait variants...
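
One way a forward-compatible syscall could take its arguments is a
struct plus an explicit size, as openat2() and clone3() do. The sketch
below is a user-space illustration of that pattern only. The struct,
its fields, and copy_versioned_args() are hypothetical and are not taken
from any posted patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument block; field names are illustrative only. */
struct wait_args_v1 {
    uint64_t timeout_ns;
    uint64_t min_wait_ns;
};

/*
 * Copy a caller-supplied struct of usize bytes into the known layout.
 * Shorter structs (older callers) are zero-extended. Longer structs
 * (newer callers) are accepted only if the unknown tail is all zero.
 */
static int copy_versioned_args(struct wait_args_v1 *dst, const void *src,
                               size_t usize)
{
    const unsigned char *p = src;
    size_t ksize = sizeof(*dst);
    size_t n = usize < ksize ? usize : ksize;

    memset(dst, 0, ksize);
    memcpy(dst, p, n);

    /* Reject fields this version does not understand. */
    for (size_t i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    return 0;
}
```

New fields can then be appended to the struct later without adding
another epoll_wait variant, since old and new callers are told apart by
the size they pass.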

--
Jens Axboe

2022-11-08 16:42:45

by Stefan Hajnoczi

Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> > On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>> Hi Jens,
> >>> NICs and storage controllers have interrupt mitigation/coalescing
> >>> mechanisms that are similar.
> >>
> >> Yep
> >>
> >>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>> (counter) value. When a completion occurs, the device waits until the
> >>> timeout or until the completion counter value is reached.
> >>>
> >>> If I've read the code correctly, min_wait is computed at the beginning
> >>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>> completion.
> >>>
> >>> It makes me wonder which approach is more useful for applications. With
> >>> the Aggregation Time approach applications can control how much extra
> >>> latency is added. What do you think about that approach?
> >>
> >> We only tested the current approach, which is time noted from entry, not
> >> from when the first event arrives. I suspect the nvme approach is better
> >> suited to the hw side, the epoll timeout helps ensure that we batch
> >> within xx usec rather than xx usec + whatever the delay until the first
> >> one arrives. Which is why it's handled that way currently. That gives
> >> you a fixed batch latency.
> >
> > min_wait is fine when the goal is just maximizing throughput without any
> > latency targets.
>
> That's not true at all, I think you're in different time scales than
> this would be used for.
>
> > The min_wait approach makes it hard to set a useful upper bound on
> > latency because unlucky requests that complete early experience much
> > more latency than requests that complete later.
>
> As mentioned in the cover letter or the main patch, this is most useful
> for the medium load kind of scenarios. For high load, the min_wait time
> ends up not mattering because you will hit maxevents first anyway. For
> the testing that we did, the target was 200-300 usec, and 200 usec was
> used for the actual test. Depending on what kind of traffic the
> server is serving, that's usually not much of a concern. From your
> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> don't think those would make sense. If your rate of arrival is low
> enough that min_wait needs to be high to make a difference, then the
> load is low enough anyway that it doesn't matter. Hence I'd argue that
> it is indeed NOT hard to set a useful upper bound on latency, because
> that is very much what min_wait is.
>
> I'm happy to argue merits of one approach over another, but keep in mind
> that this particular approach was not pulled out of thin air AND it has
> actually been tested and verified successfully on a production workload.
> This isn't a hypothetical benchmark kind of setup.

Fair enough. I just wanted to make sure the syscall interface that gets
merged is as useful as possible.

Thanks,
Stefan

2022-11-08 17:56:23

Subject: Re: [PATCH 6/6] eventpoll: add support for min-wait

On 12/1/22 11:39 AM, Soheil Hassas Yeganeh wrote:
> On Thu, Dec 1, 2022 at 1:00 PM Jens Axboe <[email protected]> wrote:
>>
>>>>> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>>>> ewq.timed_out = true;
>>>>> }
>>>>>
>>>>> + /*
>>>>> + * If min_wait is set for this epoll instance, note the min_wait
>>>>> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
>>>>> + * the state bit for whether or not min_wait is enabled.
>>>>> + */
>>>>> + if (ep->min_wait_ts) {
>>>>
>>>> Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
>>>> AFAICT, the code we run here is completely wasted if timeout is 0.
>>>
>>> Yep certainly, I can gate it on both of those conditions.
>> Looking at this for a respin, I think it should be gated on
>> !ewq.timed_out? timed_out == true is the path that it's wasted on
>> anyway.
>
> Ah, yes, that's a good point. The check should be !ewq.timed_out.

The just posted v4 has the check (and the right one :-))
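
For reference, the agreed gating can be sketched in user space as
follows. This is an illustration, not the v4 code. The two structs are
stand-ins for the kernel ones, arm_min_wait() is a hypothetical helper,
and computing the deadline as now plus the per-instance value is an
assumption. Only the !timed_out gate and the low "enabled" bit come from
the discussion and the quoted comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel structures; only the relevant fields. */
struct ep_model {
    uint64_t min_wait_ts;   /* per-instance min_wait, 0 if unset */
};

struct ewq_model {
    bool     timed_out;     /* true when the caller passed a zero timeout */
    uint64_t min_wait_ts;   /* deadline; lowest bit set means "enabled" */
};

/* Arm min_wait for this call; returns true if it was armed. */
static bool arm_min_wait(const struct ep_model *ep, struct ewq_model *ewq,
                         uint64_t now)
{
    /* A zero timeout never sleeps, so setting up min_wait is wasted. */
    if (ewq->timed_out || !ep->min_wait_ts)
        return false;

    /* Record the deadline and force the state bit on. */
    ewq->min_wait_ts = (now + ep->min_wait_ts) | 1;
    return true;
}
```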

--
Jens Axboe

2022-12-01 19:59:24

by Soheil Hassas Yeganeh

Subject: Re: [PATCH 6/6] eventpoll: add support for min-wait

On Thu, Dec 1, 2022 at 1:00 PM Jens Axboe <[email protected]> wrote:
>
> >>> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>> ewq.timed_out = true;
> >>> }
> >>>
> >>> + /*
> >>> + * If min_wait is set for this epoll instance, note the min_wait
> >>> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
> >>> + * the state bit for whether or not min_wait is enabled.
> >>> + */
> >>> + if (ep->min_wait_ts) {
> >>
> >> Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
> >> AFAICT, the code we run here is completely wasted if timeout is 0.
> >
> > Yep certainly, I can gate it on both of those conditions.
> Looking at this for a respin, I think it should be gated on
> !ewq.timed_out? timed_out == true is the path that it's wasted on
> anyway.

Ah, yes, that's a good point. The check should be !ewq.timed_out.

Thanks,
Soheil

> --
> Jens Axboe
>