Hello,
thanks to the fix from Steffen Klassert I could now run 4.14.69 + his patch
and 4.18.7 + his patch without oopsing immediately.
But I found that those kernels perform very badly. They perform so badly that
they are unusable for our router with about 3000 ipsec tunnels (tunnel mode
network <-> network).
With 4.9 (and all other kernels I used in the last 10 years with much less
potent hardware) I never had a comparable performance issue with networking.
4.18.7 is better than 4.14.69 but still remains unusable for us.
Even with very little traffic all 8 cores are working at 100% in ksoftirqd. As
soon as there is real traffic the network gets rather unusable.
Latency of packets goes up from 0.1ms-1ms to 100ms-500ms (4.14) or to
15ms-90ms (4.18).
Throughput also suffers a lot.
I have a simple test I run after every upgrade. This test basically copies
large files with scp to 60 different remote locations (involving ipsec),
limited to 1GBit/s combined, and in parallel I ping from different networks
over this router to machines in other networks of this router (no ipsec
tunnels involved).
With 4.9 and earlier copying needs about 2 minutes and the pings all remain
under 2ms roundtrip.
With 4.14 copying these files takes more than one hour. The roundtrip time of
the ping is > 1 second.
With 4.18 this is much better: copying takes around 6 minutes and ping
roundtrip is between 30ms and 180ms. But even that is much worse than 4.9.
I think this dramatic loss in performance is due to the removal of the flow
cache. I propose to revert that for 4.14. I also propose to revert it for the
next longterm kernel if no other solution is found that brings back 4.9
performance (at least about the same order of magnitude).
Maybe it should generally be reverted until a solution to the problem exists.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Wolfgang Walter <[email protected]> wrote:
> thanks to the fix from Steffen Klassert I could now run 4.14.69 + his patch
> and 4.18.7 + his patch without oopsing immediately.
>
> But I found that those kernels perform very badly. They perform so badly
> that they are unusable for our router with about 3000 ipsec tunnels (tunnel
> mode network <-> network).
Can you do a 'perf record -a -g sleep 5' with 4.18 and provide the 'perf
report' result?
It would be good to see where those cycles are spent.
On Thursday, 13 September 2018 at 15:58:44, Florian Westphal wrote:
> Wolfgang Walter <[email protected]> wrote:
> > thanks to the fix from Steffen Klassert I could now run 4.14.69 + his
> > patch
> > and 4.18.7 + his patch without oopsing immediately.
> >
> > But I found that those kernels perform very badly. They perform so badly
> > that they are unusable for our router with about 3000 ipsec tunnels
> > (tunnel mode network <-> network).
>
> Can you do a 'perf record -a -g sleep 5' with 4.18 and provide the 'perf
> report' result?
>
> It would be good to see where those cycles are spent.
I'll try that, but this isn't that easy as the router image does not contain
perf. I also have to do that on our production router. I'll try to do it
tomorrow evening.
What I can say is that it depends mainly on the number of policy rules and SAs.
Regards
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Wolfgang Walter <[email protected]> wrote:
> I'll try that, but this isn't that easy as the router image does not
> contain perf. I also have to do that on our production router. I'll try to
> do it tomorrow evening.
No need if it's too difficult.
> What I can say is that it depends mainly on the number of policy rules and SAs.
That's already a good hint; I guess we're hitting long hash chains in
xfrm_policy_lookup_bytype().
From: Florian Westphal <[email protected]>
Date: Thu, 13 Sep 2018 18:38:48 +0200
> Wolfgang Walter <[email protected]> wrote:
>> What I can say is that it depends mainly on the number of policy rules and SAs.
>
> That's already a good hint; I guess we're hitting long hash chains in
> xfrm_policy_lookup_bytype().
I don't really see how recent changes can influence that.
And the bydst hashes have been dynamically sized for a very long time.
I might have missed something...
David Miller <[email protected]> wrote:
> From: Florian Westphal <[email protected]>
> Date: Thu, 13 Sep 2018 18:38:48 +0200
>
> > Wolfgang Walter <[email protected]> wrote:
> >> What I can say is that it depends mainly on the number of policy rules and SAs.
> >
> > That's already a good hint; I guess we're hitting long hash chains in
> > xfrm_policy_lookup_bytype().
>
> I don't really see how recent changes can influence that.
I don't think there is a recent change that did this.
Walter says < 4.14 is ok, so this is likely related to flow cache removal.
F.e. it looks like all prefixed policies end up in a linked list
(net->xfrm.policy_inexact) and are not even in a hash table.
I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
but can't figure out how to configure that away from the
'no hashing for prefixed policies' default or why we even have
policy_inexact in the first place :/
I'll look at this again tomorrow.
From: Florian Westphal <[email protected]>
Date: Thu, 13 Sep 2018 23:03:25 +0200
> I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> but can't figure out how to configure that away from the
> 'no hashing for prefixed policies' default or why we even have
> policy_inexact in the first place :/
>
> I'll look at this again tomorrow.
The inexact list exists to handle prefixed input keys.
At the time that I wrote all of the control plane hash table
stuff, configurations I could find consisted of:
1) Entries with non-prefixed keys, which are easy to hash.
The number of entries could be large (e.g. cell phone
network)
2) A very small number of prefixed policies.
So: hashing when possible, falling back to the linked list
for prefixed stuff.
Beforehand we only had the linked list :-)
On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> David Miller <[email protected]> wrote:
> > From: Florian Westphal <[email protected]>
> > Date: Thu, 13 Sep 2018 18:38:48 +0200
> >
> > > Wolfgang Walter <[email protected]> wrote:
> > >> What I can say is that it depends mainly on the number of policy rules and SAs.
> > >
> > > That's already a good hint; I guess we're hitting long hash chains in
> > > xfrm_policy_lookup_bytype().
> >
> > I don't really see how recent changes can influence that.
>
> I don't think there is a recent change that did this.
>
> Walter says < 4.14 is ok, so this is likely related to flow cache removal.
>
> F.e. it looks like all prefixed policies end up in a linked list
> (net->xfrm.policy_inexact) and are not even in a hash table.
>
> I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> but can't figure out how to configure that away from the
> 'no hashing for prefixed policies' default or why we even have
> policy_inexact in the first place :/
The hash threshold can be configured like this:
ip x p set hthresh4 0 0
This sets the hash threshold to local /0 and remote /0 netmasks.
With this configuration, all policies should go to the hashtable.
This might help to balance the hash chains better.
Default hash thresholds are local /32 and remote /32 netmasks, so
all prefixed policies go to the inexact list.
To view the configuration:
ip -s -s x p count
Steffen Klassert <[email protected]> wrote:
> On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > David Miller <[email protected]> wrote:
> > > From: Florian Westphal <[email protected]>
> > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > >
> > > > Wolfgang Walter <[email protected]> wrote:
> > > >> What I can say is that it depends mainly on the number of policy rules and SAs.
> > > >
> > > > That's already a good hint; I guess we're hitting long hash chains in
> > > > xfrm_policy_lookup_bytype().
> > >
> > > I don't really see how recent changes can influence that.
> >
> > I don't think there is a recent change that did this.
> >
> > Walter says < 4.14 is ok, so this is likely related to flow cache removal.
> >
> > F.e. it looks like all prefixed policies end up in a linked list
> > (net->xfrm.policy_inexact) and are not even in a hash table.
> >
> > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > but can't figure out how to configure that away from the
> > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in the first place :/
>
> The hash threshold can be configured like this:
>
> ip x p set hthresh4 0 0
>
> This sets the hash threshold to local /0 and remote /0 netmasks.
> With this configuration, all policies should go to the hashtable.
Yes, but won't they all be hashed to the same bucket?
[ jhash(addr & 0, addr & 0) ] ?
> Default hash thresholds are local /32 and remote /32 netmasks, so
> all prefixed policies go to the inexact list.
Yes.
Wolfgang, before having to work on getting perf into your router image
can you perhaps share a bit of info about the policies you're using?
How many are there? Are they prefixed or not ("10.1.2.1")?
On Fri, Sep 14, 2018 at 07:54:37AM +0200, Florian Westphal wrote:
> Steffen Klassert <[email protected]> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <[email protected]> wrote:
> > > > From: Florian Westphal <[email protected]>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > >
> > > > > Wolfgang Walter <[email protected]> wrote:
> > > > >> What I can say is that it depends mainly on the number of policy rules and SAs.
> > > > >
> > > > > That's already a good hint; I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > >
> > > > I don't really see how recent changes can influence that.
> > >
> > > I don't think there is a recent change that did this.
> > >
> > > Walter says < 4.14 is ok, so this is likely related to flow cache removal.
> > >
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > >
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in the first place :/
> >
> > The hash threshold can be configured like this:
> >
> > ip x p set hthresh4 0 0
> >
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
>
> Yes, but won't they all be hashed to the same bucket?
>
> [ jhash(addr & 0, addr & 0) ] ?
Hm, yes. Maybe something between /0 and /32 makes more sense.
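As a toy illustration of the collapse (this is just a stand-in mixing
function over thresh-masked addresses, not the kernel's actual jhash-based
code, and the networks are made up):

#include <stdint.h>
#include <stdio.h>

/* Mask an IPv4 address to the hthresh prefix length.  With a /0
 * threshold every address masks to 0, so all policies produce the
 * same hash input. */
static uint32_t mask_plen(uint32_t addr, int plen)
{
	return plen ? addr & (~0u << (32 - plen)) : 0;
}

static uint32_t toy_bucket(uint32_t daddr, uint32_t saddr, int plen,
			   uint32_t hmask)
{
	uint32_t h = mask_plen(daddr, plen) ^ (mask_plen(saddr, plen) << 1);

	h ^= h >> 16;				/* fold the network bits down */
	return (h * 2654435761u) & hmask;	/* multiplicative mix */
}

int main(void)
{
	/* three /24 dst nets in three different /16s, one src net */
	uint32_t nets[] = { 0x0a940d00, 0x0a030d00, 0x0a7f0d00 };
	int i;

	for (i = 0; i < 3; i++)
		printf("/16 thresh: bucket %u, /0 thresh: bucket %u\n",
		       (unsigned)toy_bucket(nets[i], 0x0a000000, 16, 1023),
		       (unsigned)toy_bucket(nets[i], 0x0a000000, 0, 1023));
	return 0;
}

With a /16 threshold the three nets land in three different buckets; with
/0 everything maps to bucket 0.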
On Fri, Sep 14, 2018 at 08:01, Steffen Klassert
<[email protected]> wrote:
> > > The hash threshold can be configured like this:
> > >
> > > ip x p set hthresh4 0 0
> > >
> > > This sets the hash threshold to local /0 and remote /0 netmasks.
> > > With this configuration, all policies should go to the hashtable.
> >
> > Yes, but won't they all be hashed to the same bucket?
> >
> > [ jhash(addr & 0, addr & 0) ] ?
>
> Hm, yes. Maybe something between /0 and /32 makes more sense.
Indeed, hash thresholds not only determine which policies will be
hashed, but also the number of bits of the local and remote address
that will be used to calculate the hash key. Big thresholds mean
potentially fewer hashed policies, but better distribution in the hash
table, and vice versa.
A good trade-off must be found depending on the prefix lengths used in
your policies.
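For example, if most policies use prefixes between /16 and /29, a
middle-ground setting (the values here are only an illustration and have to
be tuned to the actual ruleset) would be:

ip x p set hthresh4 16 16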
Best regards,
Christophe
On Friday, 14 September 2018 at 07:54:37, Florian Westphal wrote:
> Steffen Klassert <[email protected]> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <[email protected]> wrote:
> > > > From: Florian Westphal <[email protected]>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > >
> > > > > Wolfgang Walter <[email protected]> wrote:
> > > > >> What I can say is that it depends mainly on the number of policy
> > > > >> rules and SAs.
> > > > >
> > > > > That's already a good hint; I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > >
> > > > I don't really see how recent changes can influence that.
> > >
> > > I don't think there is a recent change that did this.
> > >
> > > Walter says < 4.14 is ok, so this is likely related to flow cache
> > > removal.
> > >
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > >
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in the first place :/
> >
> > The hash threshold can be configured like this:
> >
> > ip x p set hthresh4 0 0
> >
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
>
> Yes, but won't they all be hashed to the same bucket?
>
> [ jhash(addr & 0, addr & 0) ] ?
>
> > Default hash thresholds are local /32 and remote /32 netmasks, so
> > all prefixed policies go to the inexact list.
>
> Yes.
>
> Wolfgang, before having to work on getting perf into your router image
> can you perhaps share a bit of info about the policies you're using?
>
> How many are there? Are they prefixed or not ("10.1.2.1")?
All rules are tunnel rules. That is, they are rules like the following (in
strongswan notation):
conn A-to-B
left=111.111.111.111
leftsubnet=10.148.32.0/24
leftsigkey=....
right=111.111.111.222
rightsubnet=10.148.13.224/29
rightsigkey=....
esp=aes128ctr-sha1-ecp256-esn!
ike=aes128ctr-sha1-ecp256!
mobike=no
type=tunnel
....
(... other options not important here).
leftsubnet and rightsubnet may have any prefix length from /30 to /16 here (we
do not yet use ipv6 but will do so next year).
We have about 3000 of them.
strongswan installs IN, FWD and OUT rules for that in the kernel security
policy database with automatically generated priorities (and SAs are generated
when strongswan actually establishes a tunnel).
Also, some of the rules overlap in range, which means ordering is important.
With IKEv2 this may happen automatically for SAs even if you avoid it in your
rule set, as IKEv2 allows narrowing.
In policies you most often get this if you want to exempt a certain network
or host. We have about 70 of them at the moment.
We do not use other possible selectors besides src-addr-range and dst-addr-
range (you could additionally select by protocol (icmp, udp, tcp) and by src-
and dst-port-range). So theoretically you could have a ruleset containing a
rule that exempts all connections to dst port 22 for several networks, or
applies different encryption options, and so on.
A rule determines what has to be done with the packet (sending or receiving)
from an ipsec point of view: allow it without ipsec transformation, block it
completely, or require a certain ipsec transformation (use this or that
encryption scheme, use header compression, use transport or tunnel mode, ...).
So for any packet the kernel sends, it has to look up whether there are SAs
which match and from these choose the one with the highest priority (which is
the one with the lowest priority field). If there is none, it has to look up
whether there is a matching policy, again choosing the one with the highest
priority (and then let the IKE daemon actually establish an SA). For tunnel
mode it actually has to do this twice, I think, as the tunnel packet again
passes through ipsec.
For every packet it receives which is not an ipsec packet, it has to do a
lookup in the policy database to see if it should have been one (or if it is
allowed or blocked). If no rule is found, it is allowed without encryption. We
have 29,000 allow rules. I deactivated them for the tests with 4.14 and 4.18
as they make things horrible. They are automatically generated from our
declarative network description and we actually don't need them, as they do
not overlap with the remote networks tunneled via ipsec. They did not impose
any burden on 4.9 and earlier.
We sometimes need them (say, if 10.10.0.0/16 is remote but 10.10.1.0 is
local).
So this is basically the multidimensional packet classification problem: from
a set of m-dimensional blocks, find the one with the highest priority which
contains a certain point.
The dimensions here are src-addr-range, dst-addr-range, protocol, src-port-
range, and dst-port-range.
If your rule is itself a point you may hash it (and you can only do this if it
is certain that there is no other non-point rule with higher priority matching
this point rule, since there is no rule that a more specific rule beats a less
specific one (that would be ill-defined)).
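To illustrate (with toy structures I made up for this mail, not the kernel's
actual code) what a lookup without hashing or caching amounts to: walk every
rule and keep the matching one with the lowest priority value:

#include <stdint.h>
#include <stddef.h>

/* Toy policy with only the two selector dimensions we use. */
struct toy_policy {
	uint32_t saddr, smask;	/* src selector */
	uint32_t daddr, dmask;	/* dst selector */
	uint32_t priority;	/* lower value = higher priority */
	struct toy_policy *next;
};

/* O(n) over the whole list for every single packet. */
static struct toy_policy *toy_lookup(struct toy_policy *head,
				     uint32_t saddr, uint32_t daddr)
{
	struct toy_policy *p, *best = NULL;

	for (p = head; p; p = p->next) {
		if ((saddr & p->smask) != p->saddr)
			continue;
		if ((daddr & p->dmask) != p->daddr)
			continue;
		if (!best || p->priority < best->priority)
			best = p;
	}
	return best;
}

int main(void)
{
	/* 10.0.0.0/16 -> 10.1.0.0/24, plus a more specific rule with a
	 * lower priority value */
	struct toy_policy p2 = { 0x0a000000, 0xffff0000,
				 0x0a010000, 0xffffff00, 200, NULL };
	struct toy_policy p1 = { 0x0a000100, 0xffffff00,
				 0x0a010000, 0xffffff00, 100, &p2 };

	/* 10.0.1.5 -> 10.1.0.9 matches both; p1 wins on priority */
	return toy_lookup(&p1, 0x0a000105, 0x0a010009) != &p1;
}

With 3000 tunnels (and 29,000 allow rules) this walk happens for every
packet.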
Here is an example of how strongswan allows you to use all of the above
selectors in your rules. For example, you could write for leftsubnet:
leftsubnet=10.0.0.1[tcp/http],10.0.0.2[6/80]
leftsubnet=fec1::1[udp],10.0.0.0/16[/53].
leftsubnet=fec1::1[udp/%any],10.0.0.0/16[%any/53]
leftsubnet=fec1::1[udp/%any],10.0.0.0/16[%any/1024-32000]
So ipsec with a large policy database and without the xfrm flow cache is
comparable to a large netfilter ruleset (with only one chain) without
conntrack.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Hello,
On Friday, 14 September 2018 at 07:54:37, Florian Westphal wrote:
> Steffen Klassert <[email protected]> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <[email protected]> wrote:
> > > > From: Florian Westphal <[email protected]>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > >
> > > > > Wolfgang Walter <[email protected]> wrote:
> > > > >> What I can say is that it depends mainly on the number of policy
> > > > >> rules and SAs.
> > > > >
> > > > > That's already a good hint; I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > >
> > > > I don't really see how recent changes can influence that.
> > >
> > > I don't think there is a recent change that did this.
> > >
> > > Walter says < 4.14 is ok, so this is likely related to flow cache
> > > removal.
> > >
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > >
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in the first place :/
> >
> > The hash threshold can be configured like this:
> >
> > ip x p set hthresh4 0 0
> >
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
>
> Yes, but won't they all be hashed to the same bucket?
>
> [ jhash(addr & 0, addr & 0) ] ?
>
> > Default hash thresholds are local /32 and remote /32 netmasks, so
> > all prefixed policies go to the inexact list.
>
> Yes.
>
> Wolfgang, before having to work on getting perf into your router image
> can you perhaps share a bit of info about the policies you're using?
>
> How many are there? Are they prefixed or not ("10.1.2.1")?
Since my last reply to this message I didn't get an answer: is there any
progress on fixing this performance regression that I missed?
Or are we stuck here with longterm kernel 4.9 for a long time?
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Wolfgang Walter <[email protected]> wrote:
> Since my last reply to this message I didn't get an answer: is there any
> progress on fixing this performance regression that I missed?
Did you test/experiment with the hthresh config option?
> Or are we stuck here with longterm kernel 4.9 for a long time?
I'm experimenting with per-dst inexact lists in an rbtree but
this will take time.
On Tuesday, 2 October 2018 at 16:56:16, Florian Westphal wrote:
> Wolfgang Walter <[email protected]> wrote:
> > Since my last reply to this message I didn't get an answer: is there any
> > progress on fixing this performance regression that I missed?
>
> Did you test/experiment with the hthresh config option?
I did. It did not improve the situation.
I suppose that is because our masks range from /16 to /30, and we especially
have, for example, /16 <=> /8 and vice versa.
When forwarding, every policy A => B also implies that you add a policy B =>
A.
I'm not familiar with when the policy database is consulted, but I think it
now has to be consulted for every unencrypted packet, and for those all rules
have to be checked. And unencrypted traffic is a large part of the traffic on
that router.
That is: for unencrypted traffic, neither the buckets of the hash nor the
inexact list may be large.
>
> > Or are we stuck here with longterm kernel 4.9 for a long time?
>
> I'm experimenting with per-dst inexact lists in an rbtree but
> this will take time.
Hmm, I doubt that this is worth the effort. And it is certainly not easy to
do correctly, as it would still have to obey the original order of the rules
(their priority).
You may have a lot of rules of the form, say:
10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
10.0.0.0/16 <=> 10.1.0.8/29 encrypt ....
....
And things like that.
Also, you get something like this:
10.0.1.0/24 <=> 10.0.2.0/29 allow
10.0.0.0/16 <=> 10.0.2.0/24 encrypt
0.0.0.0 <=> 10.0.2.0/16 block
And people may use source port and/or destination port or protocol
(tcp/udp/icmp) to further tailor their ruleset.
Here is the approach HiPAC took for packet classification:
https://pdfs.semanticscholar.org/a0bb/9d31e2499fb659c9e0d9544072d2f3c25079.pdf
https://pdfs.semanticscholar.org/0dea/8ee87f596f200de2722cbe9480610dd1a0db.pdf
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Wolfgang Walter <[email protected]> wrote:
> On Tuesday, 2 October 2018 at 16:56:16, Florian Westphal wrote:
> > I'm experimenting with per-dst inexact lists in an rbtree but
> > this will take time.
>
> Hmm, I doubt that this is worth the effort. And it is certainly not easy
Well, I'm not going to send a revert of the flowcache removal.
I'm willing to experiment with alternatives to a full iteration of the
inexact list but that's it.
> to do correctly, as it would still have to obey the original order of the
> rules (their priority).
Except that neither the priority nor the order in which it was added
matters if the selector doesn't match.
I see no reason why we can't have inexact lists done per dst<->src pairs.
> You may have a lot of rules of the form, say:
>
> 10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
> 10.0.0.0/16 <=> 10.1.0.8/29 encrypt ....
Sure.
> Also, you get something like this:
>
> 10.0.1.0/24 <=> 10.0.2.0/29 allow
> 10.0.0.0/16 <=> 10.0.2.0/24 encrypt
> 0.0.0.0 <=> 10.0.2.0/16 block
>
> And people may use source port and/or destination port or protocol
> (tcp/udp/icmp) to further tailor their ruleset.
Yes. 0.0.0.0/0 handling will require some extra consideration.
So far I have not seen a show-stopper, however.
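Rough sketch of the direction I mean (illustrative only, with made-up names;
a real version would key an rbtree on the dst subnet instead of walking the
groups linearly):

#include <stdint.h>
#include <stddef.h>

struct toy_policy {
	uint32_t saddr, smask, priority;	/* lower value wins */
	struct toy_policy *next;
};

/* All inexact policies sharing one dst subnet hang off one group. */
struct toy_group {
	uint32_t daddr, dmask;
	struct toy_policy *policies;
	struct toy_group *next;			/* rbtree in a real version */
};

static struct toy_policy *toy_lookup(struct toy_group *groups,
				     uint32_t saddr, uint32_t daddr)
{
	struct toy_policy *p, *best = NULL;
	struct toy_group *g;

	for (g = groups; g; g = g->next) {
		if ((daddr & g->dmask) != g->daddr)
			continue;	/* skip this group's policies entirely */
		for (p = g->policies; p; p = p->next)
			if ((saddr & p->smask) == p->saddr &&
			    (!best || p->priority < best->priority))
				best = p;
	}
	return best;
}

int main(void)
{
	struct toy_policy pol = { 0x0a000000, 0xffff0000, 100, NULL };
	struct toy_group grp = { 0x0a010000, 0xffffff00, &pol, NULL };

	/* 10.0.1.5 -> 10.1.0.9 only touches groups whose dst matches */
	return toy_lookup(&grp, 0x0a000105, 0x0a010009) != &pol;
}

Overlapping dst subnets still mean several groups can match and all of them
have to be visited, but groups whose dst doesn't contain the packet are
skipped without touching their policies.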
On Tuesday, 2 October 2018 at 23:35:36, Florian Westphal wrote:
> Wolfgang Walter <[email protected]> wrote:
> On Tuesday, 2 October 2018 at 16:56:16, Florian Westphal wrote:
> > > I'm experimenting with per-dst inexact lists in an rbtree but
> > > this will take time.
> >
> > Hmm, I doubt that this is worth the effort. And it is certainly not easy
>
> Well, I'm not going to send a revert of the flowcache removal.
>
> > I'm willing to experiment with alternatives to a full iteration of the
> > inexact list but that's it.
If this brings performance back to pre-removal levels, I'm fine with that.
I'm even fine if it is slower by a factor of 2.
I think this is a serious regression, and there is no workaround, and
therefore it cannot stay like that.
So I still hope that reverting is an option if no acceptable solution can be
found.
>
> > to do correctly, as it would still have to obey the original order of
> > the rules (their priority).
>
> Except that neither the priority nor the order in which it was added
> matters if the selector doesn't match.
To match a packet one has to find all matching rules and choose the one with
the lowest priority value.
"Indexing" by dst will not help much if you have a ruleset where a lot of
rules share a dst. You also have to replicate rules with dsts that are a
prefix of another dst, as they may have a higher priority even if they are
less specific.
Every such entry may again have such an "indexing" by dst. Only then would
this be efficient.
>
> I see no reason why we can't have inexact lists done per dst<->src pairs.
>
> > You may have a lot of rules of the form say
> >
> > 10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
> > 10.0.0.0/16 <=> 10.1.0.8/29 encrypt ....
>
<=> means (in the forwarding case) that the rule set contains the inverted
rule (at least if you use it in the usual way). So
10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
means
10.0.0.0/16 => 10.1.0.0/29
10.1.0.0/29 => 10.0.0.0/16
> Sure.
>
> > Also, you get something like this:
> >
> > 10.0.1.0/24 <=> 10.0.2.0/29 allow
> > 10.0.0.0/16 <=> 10.0.2.0/24 encrypt
> > 0.0.0.0 <=> 10.0.2.0/16 block
> >
> > And people may use source port and/or destination port or protocol
> > (tcp/udp/icmp) to further tailor their ruleset.
>
> Yes. 0.0.0.0/0 handling will require some extra consideration.
>
There may also be rulesets like
10.0.1.0/24 => 10.1.0.0/29 encrypt X
10.0.0.0/16 => 10.1.0.0/29 encrypt Y
Or
10.0.0.0/16 * => 10.1.0.0/24 80 encrypt Y
10.0.1.0/24 * => 10.1.0.0/17 * encrypt X
10.0.0.0/16 * => 10.1.0.0/20 * encrypt Z
> So far I have not seen a show-stopper, however.
I wonder why there is no such thing for netfilter or for the rules list in
routing; nf does not have such a thing either. This is the reason why I think
that this is not that easy, and that for longterm kernel 4.14 the best
solution will be a revert anyway.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Hello,
there is now a new 4.19 which still has the big performance regression when
many ipsec tunnels are configured (throughput and latency get worse by 10 to
50 times), which makes any kernel > 4.9 unusable for our routers.
I still don't understand why a revert of the flow cache removal, at least for
the longterm kernels, is such a bad option (maybe as a compile-time option),
especially as there is no workaround available.
We have been using linux in this scenario for more than 10 years, so I'm
really rather unhappy, not to say desperate, that we will be stuck with 4.9
for the unforeseeable future.
We would have detected and reported this performance regression much earlier
if another bug in ipsec had not prevented us from running 4.14 and later until
the end of August 2018 (see "kernels > v4.12 oops/crash with ipsec-traffic:
bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and
remove the operation of dst_free()").
On Thursday, 4 October 2018 at 15:57:52, Wolfgang Walter wrote:
> On Tuesday, 2 October 2018 at 23:35:36, Florian Westphal wrote:
[snip]
> > Well, I'm not going to send a revert of the flowcache removal.
> >
> >
> > I'm willing to experiment with alternatives to a full iteration of the
> > inexact list but that's it.
>
> If this brings performance back to pre-removal levels, I'm fine with that.
> I'm even fine if it is slower by a factor of 2.
>
> I think this is a serious regression, and there is no workaround, and
> therefore it cannot stay like that.
>
> So I still hope that reverting is an option if no acceptable solution can be
> found.
>
> > > to do correctly, as it would still have to obey the original order of
> > > the rules (their priority).
> >
> > Except that neither the priority nor the order in which it was added
> > matters if the selector doesn't match.
>
> To match a packet one has to find all matching rules and choose the one
> with the lowest priority value.
>
> "Indexing" by dst will not help much if you have a ruleset where a lot of
> rules share a dst. You also have to replicate rules with dsts that are a
> prefix of another dst, as they may have a higher priority even if they are
> less specific.
>
> Every such entry may again have such an "indexing" by dst. Only then would
> this be efficient.
>
[snip]
>
> I wonder why there is no such thing for netfilter or for the rules list in
> routing; nf does not have such a thing either. This is the reason why I
> think that this is not that easy, and that for longterm kernel 4.14 the
> best solution will be a revert anyway.
>
> Regards,
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
From: Wolfgang Walter <[email protected]>
Date: Thu, 25 Oct 2018 11:38:19 +0200
> there is now a new 4.19 which still has the big performance regression when
> many ipsec tunnels are configured (throughput and latency get worse by 10 to
> 50 times), which makes any kernel > 4.9 unusable for our routers.
>
> I still don't understand why a revert of the flow cache removal, at least
> for the longterm kernels, is such a bad option (maybe as a compile-time
> option), especially as there is no workaround available.
You do know that the flow cache is DDoS targetable, right?
That's why we removed it, we did not make the change lightly.
Adding a DDoS vector back into the kernel is not an option sorry.
Please work diligently with Florian and others to try and find ways to
soften the performance hit.
Thank you.
David Miller <[email protected]> wrote:
> Please work diligently with Florian and others to try and find ways to
> soften the performance hit.
I will send a patch series that pre-sorts inexact policies into rbtrees
at insert time as soon as net-next opens up again.
Wolfgang Walter <[email protected]> wrote:
> there is now a new 4.19 which still has the big performance regression when
> many ipsec tunnels are configured (throughput and latency get worse by 10 to
> 50 times), which makes any kernel > 4.9 unusable for our routers.
https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=xfrm_pol_18
This is mostly untested; if you want to test it anyway and find bugs, please
feel free to report them to me.
Improvements to the test script in patch #1 are welcome as well (it's what
I've been using so far to test this).
On Thursday, 25 October 2018 at 10:34:50, David Miller wrote:
> From: Wolfgang Walter <[email protected]>
> Date: Thu, 25 Oct 2018 11:38:19 +0200
>
> > there is now a new 4.19 which still has the big performance regression
> > when many ipsec tunnels are configured (throughput and latency get worse
> > by 10 to 50 times), which makes any kernel > 4.9 unusable for our routers.
> >
> > I still don't understand why a revert of the flow cache removal, at
> > least for the longterm kernels, is such a bad option (maybe as a
> > compile-time option), especially as there is no workaround available.
>
> You do know that the flow cache is DDoS targetable, right?
>
> That's why we removed it, we did not make the change lightly.
Though this is true, we now simply have a permanent DDoS situation. The
removal of the flow cache leads to a situation where, with enough ipsec
tunnels, you are now always as badly off as you would previously have been
under a DDoS attack.
This is not comparable to the routing cache situation, where a fast, well-
tested solution already existed (for routes in a table; if you use a lot of
rules for policy routing this may be a different story).
Further, I don't think that DoS would have been that strong an argument for
the removal of the routing cache if routing performance had dropped by 10
times or more.
Also, the routing cache was even a problem with legitimate traffic, so I never
had a problem with the moderate performance regression it caused here.
>
> Adding a DDoS vector back into the kernel is not an option sorry.
In our use case, all kernels >= 4.14 are as bad as if they were under attack.
They are completely unusable and I even can't
>
> Please work diligently with Florian and others to try and find ways to
> soften the performance hit.
>
I proposed to revert this for the longterm kernels only, depending on a
compile-time option which explicitly has to be switched on. Then we could
start using 4.19. People not using ipsec, or who use it with only < 100 rules,
would still live without the flow cache.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts