Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <Y2lw4Qc1uI+Ep+2C@fedora> <4281b354-d67d-2883-d966-a7816ed4f811@kernel.dk>
 <Y2phEZKYuSmPL5B5@fedora> <93fa2da5-c81a-d7f8-115c-511ed14dcdbb@kernel.dk>
In-Reply-To: <93fa2da5-c81a-d7f8-115c-511ed14dcdbb@kernel.dk>
From:   Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date:   Thu, 10 Nov 2022 11:13:38 +0100
Message-ID: <CA+FuTSe=09sAafHnLLMdc0EJrcP0+xcKCqD+rfMtdfQdSQYBDw@mail.gmail.com>
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait
To:     Jens Axboe <axboe@kernel.dk>
Cc:     Stefan Hajnoczi <stefanha@redhat.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Tue, Nov 8, 2022 at 3:09 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> > On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>> Hi Jens,
> >>> NICs and storage controllers have interrupt mitigation/coalescing
> >>> mechanisms that are similar.
> >>
> >> Yep
> >>
> >>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>> (counter) value. When a completion occurs, the device waits until the
> >>> timeout or until the completion counter value is reached.
> >>>
> >>> If I've read the code correctly, min_wait is computed at the beginning
> >>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>> completion.
> >>>
> >>> It makes me wonder which approach is more useful for applications. With
> >>> the Aggregation Time approach applications can control how much extra
> >>> latency is added. What do you think about that approach?
> >>
> >> We only tested the current approach, which is time noted from entry, not
> >> from when the first event arrives. I suspect the nvme approach is better
> >> suited to the hw side, the epoll timeout helps ensure that we batch
> >> within xx usec rather than xx usec + whatever the delay until the first
> >> one arrives. Which is why it's handled that way currently. That gives
> >> you a fixed batch latency.
> >
> > min_wait is fine when the goal is just maximizing throughput without any
> > latency targets.
>
> That's not true at all, I think you're in different time scales than
> this would be used for.
>
> > The min_wait approach makes it hard to set a useful upper bound on
> > latency because unlucky requests that complete early experience much
> > more latency than requests that complete later.
>
> As mentioned in the cover letter or the main patch, this is most useful
> for the medium load kind of scenarios. For high load, the min_wait time
> ends up not mattering because you will hit maxevents first anyway. For
> the testing that we did, the target was 2-300 usec, and 200 usec was
> used for the actual test. Depending on what the kind of traffic the
> server is serving, that's usually not much of a concern. From your
> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> don't think those would make sense. If your rate of arrival is low
> enough that min_wait needs to be high to make a difference, then the
> load is low enough anyway that it doesn't matter. Hence I'd argue that
> it is indeed NOT hard to set a useful upper bound on latency, because
> that is very much what min_wait is.
>
> I'm happy to argue merits of one approach over another, but keep in mind
> that this particular approach was not pulled out of thin air AND it has
> actually been tested and verified successfully on a production workload.
> This isn't a hypothetical benchmark kind of setup.

Following up on the interrupt mitigation analogy: this also reminds me
somewhat of SO_RCVLOWAT, which sets a lower bound on the amount of
received data before waking up a single thread.
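
For reference, a minimal sketch of setting that low-water mark from
userspace. The helper names set_rcvlowat() and get_rcvlowat() are made
up for illustration; only setsockopt()/getsockopt() with SOL_SOCKET and
SO_RCVLOWAT are the real interface. How strictly poll/epoll honor the
mark depends on the protocol and kernel version, so treat this as a
sketch rather than a statement about any particular socket type.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: set the receive low-water mark on fd.
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
int set_rcvlowat(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, sizeof(bytes));
}

/*
 * Hypothetical helper: read the low-water mark back.
 * Returns the current value, or -1 on error.
 */
int get_rcvlowat(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, &len) < 0)
        return -1;
    return bytes;
}
```

A caller would do something like set_rcvlowat(fd, 4096) on a stream
socket before blocking in recv(), so that the reader is not woken for
every small segment.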

Would it be more useful to define a minevents event count, rather than
a min_wait timeout? That might give the same preferred batch size,
without adding latency when unnecessary or having to infer a reasonable
bound from the expected event rate. It would still be bounded by the
max timeout.
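
To make the comparison concrete, here is a rough userspace model, not
kernel code. Everything in it is an assumption for illustration: the
function names, the microsecond units relative to epoll_wait() entry,
min_wait_us <= timeout_us, and in particular the min_wait behavior once
the minimum wait has expired with nothing pending (wait for the first
event, bounded by the timeout). arrivals[] is assumed sorted ascending.

```c
#include <stddef.h>

/*
 * Hypothetical model of the min_wait policy: time at which the waiter
 * returns, given event arrival times measured from entry.
 */
long wake_min_wait(const long *arrivals, size_t n, size_t maxevents,
                   long min_wait_us, long timeout_us)
{
    /* maxevents reached before min_wait expires: return right away */
    if (maxevents > 0 && n >= maxevents &&
        arrivals[maxevents - 1] <= min_wait_us)
        return arrivals[maxevents - 1];

    /* at least one event pending when min_wait expires */
    if (n > 0 && arrivals[0] <= min_wait_us)
        return min_wait_us;

    /* nothing pending yet: wait for the first event, up to the timeout */
    if (n > 0 && arrivals[0] <= timeout_us)
        return arrivals[0];

    return timeout_us;
}

/*
 * Hypothetical model of a minevents policy: return as soon as
 * minevents events have arrived, bounded by the max timeout.
 */
long wake_minevents(const long *arrivals, size_t n, size_t minevents,
                    long timeout_us)
{
    if (minevents == 0)
        return 0; /* no batching requested */

    if (n >= minevents && arrivals[minevents - 1] <= timeout_us)
        return arrivals[minevents - 1];

    return timeout_us;
}
```

With arrivals at 10, 20, 30 and 250 usec, maxevents = 64, min_wait =
200 and timeout = 1000, wake_min_wait() returns 200, while
wake_minevents() with minevents = 3 returns 30. With minevents = 8 the
batch never fills and it falls back to the 1000 usec timeout, which is
the case where the expected event rate still matters.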