Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
MIME-Version: 1.0
References: <153736009982.24033.13696245431713246950.stgit@localhost.localdomain>
 <CANn89iJ1UnwLFv5+AwXgeb1BUhYg8UJYTJLtiipavJee+2SWxQ@mail.gmail.com> <2fdf2bd7-1cc4-a1e1-15c2-e2badfcd4d59@virtuozzo.com>
In-Reply-To: <2fdf2bd7-1cc4-a1e1-15c2-e2badfcd4d59@virtuozzo.com>
From:   Eric Dumazet <edumazet@google.com>
Date:   Wed, 19 Sep 2018 08:49:48 -0700
Message-ID: <CANn89iK8X5cW3=YnNRrKo=BVCFJkJ0D22YY_eJFLyGCX+5SxsQ@mail.gmail.com>
Subject: Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
To:     Kirill Tkhai <ktkhai@virtuozzo.com>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        David Miller <davem@davemloft.net>,
        Daniel Borkmann <daniel@iogearbox.net>, tom@quantonium.net,
        netdev <netdev@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>
Content-Type: multipart/mixed; boundary="000000000000f0f2fe05763b5b29"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

--000000000000f0f2fe05763b5b29
Content-Type: text/plain; charset="UTF-8"

On Wed, Sep 19, 2018 at 8:41 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 19.09.2018 17:55, Eric Dumazet wrote:
> > On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> Many workloads operate in polling mode. The application
> >> checks for incoming packets from time to time, but it also
> >> has work to do when there are no packets. This RFC
> >> tries to develop an idea to queue RPS packets on an idle
> >> CPU in the L3 domain of the consumer, so backlog
> >> processing of the packets and the application can execute
> >> in parallel.
> >>
> >> We require this when the network card does not
> >> have enough RX queues to cover all online CPUs (this seems
> >> to be the case for most cards), and get_rps_cpu() actually
> >> chooses a remote cpu, and an SMP interrupt is sent. Here we
> >> may try our best to find an idle CPU near the consumer's CPU.
> >> Note that when a consumer works in poll mode and
> >> does not wait for incoming packets, its CPU will not be
> >> idle, while the CPU of a sleeping consumer may be idle. So,
> >> non-polling consumers will still be able to have skbs
> >> handled on their own CPU.
> >>
> >> When the network card has many queues, the device
> >> interrupts will come on the consumer's CPU, and this patch
> >> won't try to find an idle cpu for them.
> >>
> >> I've tried a simple netperf test for this:
> >> netserver -p 1234
> >> netperf -L 127.0.0.1 -p 1234 -l 100
> >>
> >> Before:
> >>  87380  16384  16384    100.00   60323.56
> >>  87380  16384  16384    100.00   60388.46
> >>  87380  16384  16384    100.00   60217.68
> >>  87380  16384  16384    100.00   57995.41
> >>  87380  16384  16384    100.00   60659.00
> >>
> >> After:
> >>  87380  16384  16384    100.00   64569.09
> >>  87380  16384  16384    100.00   64569.25
> >>  87380  16384  16384    100.00   64691.63
> >>  87380  16384  16384    100.00   64930.14
> >>  87380  16384  16384    100.00   62670.15
> >>
> >> The best runs differ by +7%,
> >> the worst runs by +8%.
> >>
> >> What do you think about moving in this direction?
> >
> > Hi Kirill
> >
> > In my experience, the scheduler has a poor view of softirq processing
> > happening on various cpus.
> > A cpu spending 90% of its cycles processing IRQs might be considered 'idle'.
>
> Yes, when softirq runs on top of irq_exit(), the cpu is not
> considered busy. But after MAX_SOFTIRQ_TIME (=2ms), ksoftirqd is
> woken up to execute the work in process context, and the processor is
> considered !idle. 2ms is 2 timer ticks in case of HZ=1000. So, we
> don't restart softirq once it has been executing for more than 2ms.
>

That's the theory, but reality is very different, unfortunately.

If RFS/RPS is set up properly, we really do not hit the MAX_SOFTIRQ_TIME
condition, except maybe in some synthetic benchmarks.

> Similarly, a single net_rx_action() can't execute for longer
> than 2ms.
>
> Having 90% load in softirq (called on top of irq_exit()) should be
> a very unlikely situation, where there are too many interrupts and the
> related softirq calls do only a small amount of work for each of them.
> I think it would be a problem even in the plain napi case, since it would
> not work as expected.
>
> But anyway. You worry that while handling the next portion of skbs,
> we find that the previous portion of skbs has already woken ksoftirqd, and
> we don't see this cpu as idle? Yeah, then we'll try to change cpu,
> and this is not what we want. We want to continue using the cpu where
> the previous portion was handled. Hm, I can't answer right away, but
> certainly this may be handled somehow in a more creative way.
>
> > So please run a real workload (it is _very_ uncommon for anyone to set
> > up RPS on the lo interface!)
> >
> > Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.
>
> Yeah, it's just a simulation of a single-irq nic. I'll try it on
> more realistic hardware.

My other concern is that your results might be tied to a particular
version of the process scheduler, platform, workload...

One month later, a small change in the process scheduler could give
very different results.

This is why I believe this new feature must be controllable via a new
tunable (like RPS/RFS, which are controllable per rx queue).
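
For reference, this is roughly how the existing per-rx-queue RPS knob is
driven today: a hex CPU bitmap is written to
/sys/class/net/<dev>/queues/rx-<n>/rps_cpus. The sketch below is only an
illustration of that existing interface, not a proposal for the new
tunable; the helper names rps_cpus_mask() and write_rps_cpus() are
hypothetical, and the comma-separated 32-bit grouping is what I would
assume is safest for masks wider than 32 CPUs.

```python
def rps_cpus_mask(cpus):
    """Build the hex bitmap string for a list of CPU numbers.

    Bits above 31 are emitted as comma-separated 32-bit groups,
    most significant group first (assumed format for wide masks).
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu

    groups = []
    while mask > 0xFFFFFFFF:
        # Lower groups are zero-padded to 8 hex digits.
        groups.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
    groups.append(format(mask, "x"))
    return ",".join(reversed(groups))


def write_rps_cpus(dev, queue, cpus):
    """Write the mask for one rx queue (needs root; not called here)."""
    path = f"/sys/class/net/{dev}/queues/rx-{queue}/rps_cpus"
    with open(path, "w") as f:
        f.write(rps_cpus_mask(cpus))
```

For example, rps_cpus_mask([0, 1, 2, 3]) gives "f", and
write_rps_cpus("eth0", 0, [0, 1, 2, 3]) would enable RPS on those four
cpus for rx-0 of a device assumed to be named eth0.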

>
> How do you execute such tests? I don't see an appropriate netperf
> parameter. Does this mean just starting 400 copies of netperf? How
> do you aggregate their results in this case?

Yeah, there are various 'super_netperf' scripts available on the net
(they are almost trivial to write anyway).

(I am attaching one of them.)
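
To give an idea of what such a wrapper does, here is a minimal sketch in
Python: start N copies of netperf in parallel, take one number from each
output, and print the sum. It is not a transcription of any particular
script. It assumes netperf is in PATH, that -P 0 suppresses the banner
and headers, and that the first non-empty output line then carries the
metric (throughput or transaction rate) in its last column; please check
this against your netperf version. The function names are hypothetical.

```python
import subprocess
import sys


def parse_result(output):
    """Return the last field of the first non-empty line as a float."""
    for line in output.splitlines():
        fields = line.split()
        if fields:
            return float(fields[-1])
    raise ValueError("no result line in netperf output")


def aggregate(outputs):
    """Sum the per-instance results."""
    return sum(parse_result(out) for out in outputs)


def run_many(count, netperf_args):
    """Start `count` netperf instances concurrently and collect stdout."""
    cmd = ["netperf", "-P", "0", *netperf_args]
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(count)
    ]
    return [p.communicate()[0] for p in procs]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py <count> <netperf options...>
    total = aggregate(run_many(int(sys.argv[1]), sys.argv[2:]))
    print(f"{total:.2f}")
```

It would be run as something like
"python3 script.py 400 -H <server> -t TCP_RR -l 30", the single number
printed at the end being the aggregate over all instances.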

Thanks.
>
> > Thanks.
> >
> > PS: The idea of playing with L3 domains is interesting. I have
> > personally tried various strategies in the past, but none of them
> > demonstrated a clear win.
>
> Thanks,
> Kirill

--000000000000f0f2fe05763b5b29
Content-Type: application/octet-stream; name=super_netperf
Content-Disposition: attachment; filename=super_netperf
Content-Transfer-Encoding: base64
Content-ID: <f_jm9bomkg0>
X-Attachment-Id: f_jm9bomkg0

IyEvYmluL2Jhc2gKCnJ1bl9uZXRwZXJmKCkgewoJbG9vcHM9JDEKCXNoaWZ0Cglmb3IgKChpPTA7
IGk8bG9vcHM7IGkrKykpOyBkbwoJCS4vbmV0cGVyZiAtcyAyICRAIHwgYXdrICcvTWluL3sKCQkJ
aWYgKCFvbmNlKSB7CgkJCQlwcmludDsKCQkJCW9uY2U9MTsKCQkJfQoJCX0KCQl7CgkJCWlmIChO
UiA9PSA2KQoJCQkJc2F2ZSA9ICRORgoJCQllbHNlIGlmIChOUj09NykgewoJCQkJaWYgKE5GID4g
MCkKCQkJCQlwcmludCAkTkYKCQkJCWVsc2UKCQkJCQlwcmludCBzYXZlCgkJCX0gZWxzZSBpZiAo
TlI9PTExKSB7CgkJCQlwcmludCAkMAoJCQl9CgkJfScgJgoJZG9uZQoJd2FpdAoJcmV0dXJuIDAK
fQoKcnVuX25ldHBlcmYgJEAgfCBhd2sgJ3tpZiAoTkY9PTcpIHtwcmludCAkMDsgbmV4dH19IHtz
dW0gKz0gJDF9IEVORCB7cHJpbnRmICIlN3VcbiIsc3VtfScK
--000000000000f0f2fe05763b5b29--