2023-12-04 15:23:04

by Lev Pantiukhin

Subject: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

Maglev Hashing Stateless
========================

Introduction
------------

This patch to the Linux kernel makes the following changes to IPVS:

1. Adds a new service type (IP_VS_SVC_F_STATELESS) whose scheduler
decides whether a connection entry needs to be created;
2. Adds a new mhs (Maglev Hashing Stateless) scheduler, based on the mh
scheduler, that implements a new algorithm (more details below);
3. Adds scheduling for ACK packets;
4. Adds sorting of the destination list (more details below).

This approach shows a significant reduction in CPU usage, even with 10%
of the endpoints constantly flapping. It also makes the L4 balancer less
vulnerable to DDoS activity.

Description of the New Algorithm
--------------------------------

This patch provides a modified version of the Maglev consistent hashing
scheduling algorithm (scheduler mh). It simultaneously uses two hash
tables instead of one. One of them is for old destinations, and the other
(the candidate table) is for new ones. A hash key corresponds to two
destinations, and if both hash tables point to the same destination, then
the hash key is called stable; otherwise, it is called unstable. A new
connection entry is created only in the event of an unstable hash key;
otherwise, the packet goes through stateless processing. If the hash key
is unstable:

* In the case of a SYN packet, it will pick up the destination from the
newer (candidate) hash table;
* In the case of an ACK packet, it will use the old hash table.

Upon changing the set of destinations, mhs populates a new candidate hash
table and starts a timer equal to the TCP session timeout. When the
timer expires, the candidate hash table entries are merged into the old
hash table, and the corresponding hash keys become stable again. If the
destinations change again before the timer expires, mhs overwrites the
candidate hash table without resetting the timer. If the set of
destinations is unchanged, the connection tracking table stays empty.

IPVS stores destinations in an unordered way, so the same destination set
may generate different hash tables. To guarantee deterministic generation
of the Maglev hash table, sorting of the destination list was added. This
is important in the case of destination flaps, which return the candidate
hash table to its original state. This patch implements sorting via
simple insertion, which is linear per insertion; this could be optimized
later.

Signed-off-by: Lev Pantiukhin <[email protected]>
---
include/net/ip_vs.h | 6 +
include/uapi/linux/ip_vs.h | 1 +
net/netfilter/ipvs/Kconfig | 9 +
net/netfilter/ipvs/Makefile | 1 +
net/netfilter/ipvs/ip_vs_core.c | 34 +-
net/netfilter/ipvs/ip_vs_ctl.c | 54 +-
net/netfilter/ipvs/ip_vs_mhs.c | 740 +++++++++++++++++++++++++++
net/netfilter/ipvs/ip_vs_proto_tcp.c | 18 +-
8 files changed, 851 insertions(+), 12 deletions(-)
create mode 100644 net/netfilter/ipvs/ip_vs_mhs.c

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index ff406ef4fd4a..c3f0488bdd6a 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -778,6 +778,12 @@ struct ip_vs_scheduler {
struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc,
const struct sk_buff *skb,
struct ip_vs_iphdr *iph);
+
+ /* select two servers from the given service and choose one */
+ struct ip_vs_dest* (*schedule_sl)(struct ip_vs_service *svc,
+ const struct sk_buff *skb,
+ struct ip_vs_iphdr *iph,
+ bool *need_state);
};

/* The persistence engine object */
diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h
index 1ed234e7f251..cc205c1c796c 100644
--- a/include/uapi/linux/ip_vs.h
+++ b/include/uapi/linux/ip_vs.h
@@ -24,6 +24,7 @@
#define IP_VS_SVC_F_SCHED1 0x0008 /* scheduler flag 1 */
#define IP_VS_SVC_F_SCHED2 0x0010 /* scheduler flag 2 */
#define IP_VS_SVC_F_SCHED3 0x0020 /* scheduler flag 3 */
+#define IP_VS_SVC_F_STATELESS 0x0040 /* stateless scheduling */

#define IP_VS_SVC_F_SCHED_SH_FALLBACK IP_VS_SVC_F_SCHED1 /* SH fallback */
#define IP_VS_SVC_F_SCHED_SH_PORT IP_VS_SVC_F_SCHED2 /* SH use port */
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 2a3017b9c001..886b75c48551 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -246,6 +246,15 @@ config IP_VS_MH
If you want to compile it in kernel, say Y. To compile it as a
module, choose M here. If unsure, say N.

+config IP_VS_MHS
+ tristate "stateless maglev hashing scheduling"
+ help
+ The stateless Maglev hashing scheduler (mhs) is a modified version
+ of the Maglev consistent hashing scheduler (IP_VS_MH). It
+ simultaneously uses two hash tables instead of one: one for old
+ destinations and the other for new ones.
+
config IP_VS_SED
tristate "shortest expected delay scheduling"
help
diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile
index bb5d8125c82a..ffe9977397e0 100644
--- a/net/netfilter/ipvs/Makefile
+++ b/net/netfilter/ipvs/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o
obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
obj-$(CONFIG_IP_VS_MH) += ip_vs_mh.o
+obj-$(CONFIG_IP_VS_MHS) += ip_vs_mhs.o
obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
obj-$(CONFIG_IP_VS_TWOS) += ip_vs_twos.o
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index a2c16b501087..6aaf762c0a1d 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -449,6 +449,7 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
__be16 _ports[2], *pptr, cport, vport;
const void *caddr, *vaddr;
unsigned int flags;
+ bool need_state;

*ignored = 1;
/*
@@ -525,7 +526,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
if (sched) {
/* read svc->sched_data after svc->scheduler */
smp_rmb();
- dest = sched->schedule(svc, skb, iph);
+ /* use a distinct handler for stateless services */
+ if (svc->flags & IP_VS_SVC_F_STATELESS)
+ dest = sched->schedule_sl(svc, skb, iph, &need_state);
+ else
+ dest = sched->schedule(svc, skb, iph);
} else {
dest = NULL;
}
@@ -534,9 +539,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
return NULL;
}

- flags = (svc->flags & IP_VS_SVC_F_ONEPACKET
- && iph->protocol == IPPROTO_UDP) ?
- IP_VS_CONN_F_ONE_PACKET : 0;
+ /* Set IP_VS_CONN_F_ONE_PACKET so that no state is kept */
+ flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET &&
+ iph->protocol == IPPROTO_UDP) ||
+ (svc->flags & IP_VS_SVC_F_STATELESS && !need_state))
+ ? IP_VS_CONN_F_ONE_PACKET : 0;

/*
* Create a connection entry.
@@ -563,7 +570,10 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport),
cp->flags, refcount_read(&cp->refcnt));

- ip_vs_conn_stats(cp, svc);
+ if (!(svc->flags & IP_VS_SVC_F_STATELESS) ||
+     need_state) {
+ ip_vs_conn_stats(cp, svc);
+ }
return cp;
}

@@ -1915,6 +1925,7 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state
int ret, pkts;
struct sock *sk;
int af = state->pf;
+ struct ip_vs_service *svc;

/* Already marked as IPVS request or reply? */
if (skb->ipvs_property)
@@ -1990,6 +2001,19 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state
cp = INDIRECT_CALL_1(pp->conn_in_get, ip_vs_conn_in_get_proto,
ipvs, af, skb, &iph);

+ /* Don't use an expired connection in the stateless service case;
+ * otherwise reuse could keep connection entries alive indefinitely
+ */
+ if (cp && cp->dest) {
+ svc = rcu_dereference(cp->dest->svc);
+
+ if ((svc->flags & IP_VS_SVC_F_STATELESS) &&
+ !(timer_pending(&cp->timer) && time_after(cp->timer.expires, jiffies))) {
+ __ip_vs_conn_put(cp);
+ cp = NULL;
+ }
+ }
+
if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) {
int conn_reuse_mode = sysctl_conn_reuse_mode(ipvs);
bool old_ct = false, resched = false;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..fda321edbd9c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -960,6 +960,43 @@ void ip_vs_stats_free(struct ip_vs_stats *stats)
}
}

+static int __ip_vs_mh_compare_dests(struct list_head *a, struct list_head *b)
+{
+ struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, n_list);
+ struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, n_list);
+ unsigned int i = 0;
+ __be32 diff;
+
+ switch (dest_a->af) {
+ case AF_INET:
+ return (int)(dest_a->addr.ip - dest_b->addr.ip);
+
+ case AF_INET6:
+ for (; i < ARRAY_SIZE(dest_a->addr.ip6); i++) {
+ diff = dest_a->addr.ip6[i] - dest_b->addr.ip6[i];
+ if (diff)
+ return (int)diff;
+ }
+ }
+
+ return 0;
+}
+
+static struct list_head *
+__ip_vs_find_insertion_place(struct list_head *new, struct list_head *head)
+{
+ struct list_head *p = head;
+ int ret;
+
+ while ((p = p->next) != head) {
+ ret = __ip_vs_mh_compare_dests(new, p);
+ if (ret < 0)
+ break;
+ }
+
+ return p->prev;
+}
+
/*
* Update a destination in the given service
*/
@@ -1038,7 +1075,10 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
spin_unlock_bh(&dest->dst_lock);

if (add) {
- list_add_rcu(&dest->n_list, &svc->destinations);
+ /* keep the dests list sorted */
+ list_add_rcu(&dest->n_list,
+ __ip_vs_find_insertion_place(&dest->n_list,
+ &svc->destinations));
svc->num_dests++;
sched = rcu_dereference_protected(svc->scheduler, 1);
if (sched && sched->add_dest)
@@ -1276,7 +1316,9 @@ static void __ip_vs_unlink_dest(struct ip_vs_service *svc,
struct ip_vs_dest *dest,
int svcupd)
{
- dest->flags &= ~IP_VS_DEST_F_AVAILABLE;
+ /* dests must stay available from the trash for stateless services */
+ if (!(svc->flags & IP_VS_SVC_F_STATELESS))
+ dest->flags &= ~IP_VS_DEST_F_AVAILABLE;

/*
* Remove it from the d-linked destination list.
@@ -1440,6 +1482,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
svc->port = u->port;
svc->fwmark = u->fwmark;
svc->flags = u->flags & ~IP_VS_SVC_F_HASHED;
+ if (!strcmp(u->sched_name, "mhs")) {
+ svc->flags |= IP_VS_SVC_F_STATELESS;
+ svc->flags &= ~IP_VS_SVC_F_PERSISTENT;
+ }
svc->timeout = u->timeout * HZ;
svc->netmask = u->netmask;
svc->ipvs = ipvs;
@@ -1578,6 +1624,10 @@ ip_vs_edit_service(struct ip_vs_service *svc, struct ip_vs_service_user_kern *u)
* Set the flags and timeout value
*/
svc->flags = u->flags | IP_VS_SVC_F_HASHED;
+ if (!strcmp(u->sched_name, "mhs")) {
+ svc->flags |= IP_VS_SVC_F_STATELESS;
+ svc->flags &= ~IP_VS_SVC_F_PERSISTENT;
+ }
svc->timeout = u->timeout * HZ;
svc->netmask = u->netmask;

diff --git a/net/netfilter/ipvs/ip_vs_mhs.c b/net/netfilter/ipvs/ip_vs_mhs.c
new file mode 100644
index 000000000000..ab19ac0f5b02
--- /dev/null
+++ b/net/netfilter/ipvs/ip_vs_mhs.c
@@ -0,0 +1,740 @@
+// SPDX-License-Identifier: GPL-2.0
+/* IPVS: Stateless Maglev Hashing scheduling module
+ *
+ * Authors: Lev Pantiukhin <[email protected]>
+ *
+ */
+
+/* The mh algorithm assigns a preference list of all the lookup
+ * table positions to each destination and populates the table with
+ * the most-preferred position of each destination. A destination is
+ * then selected by looking up the table with a hash key of the
+ * source IP address.
+ * The mhs algorithm is a modified, stateless version of the mh
+ * algorithm. It uses two lookup tables and chooses one of two
+ * destinations.
+ *
+ * The mh algorithm is detailed in:
+ * [3.4 Consistent Hashing]
+ * https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/ip.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+
+#include <net/ip_vs.h>
+
+#include <linux/siphash.h>
+#include <linux/bitops.h>
+#include <linux/gcd.h>
+
+#include <linux/list_sort.h>
+
+#define IP_VS_SVC_F_SCHED_MH_FALLBACK IP_VS_SVC_F_SCHED1 /* MH fallback */
+#define IP_VS_SVC_F_SCHED_MH_PORT IP_VS_SVC_F_SCHED2 /* MH use port */
+
+struct ip_vs_mhs_lookup {
+ struct ip_vs_dest __rcu *dest; /* real server (cache) */
+};
+
+struct ip_vs_mhs_dest_setup {
+ unsigned int offset; /* starting offset */
+ unsigned int skip; /* skip */
+ unsigned int perm; /* next_offset */
+ int turns; /* weight / gcd() and rshift */
+};
+
+/* Available prime numbers for MH table */
+static int primes[] = {251, 509, 1021, 2039, 4093,
+ 8191, 16381, 32749, 65521, 131071};
+
+/* For IPVS MH entry hash table */
+#ifndef CONFIG_IP_VS_MH_TAB_INDEX
+#define CONFIG_IP_VS_MH_TAB_INDEX 12
+#endif
+#define IP_VS_MH_TAB_BITS (CONFIG_IP_VS_MH_TAB_INDEX / 2)
+#define IP_VS_MH_TAB_INDEX (CONFIG_IP_VS_MH_TAB_INDEX - 8)
+#define IP_VS_MH_TAB_SIZE primes[IP_VS_MH_TAB_INDEX]
+
+struct ip_vs_mhs_state {
+ struct rcu_head rcu_head;
+ struct ip_vs_mhs_lookup *lookup;
+ struct ip_vs_mhs_dest_setup *dest_setup;
+ hsiphash_key_t hash1, hash2;
+ int gcd;
+ int rshift;
+};
+
+struct ip_vs_mhs_two_states {
+ struct ip_vs_mhs_state *first;
+ struct ip_vs_mhs_state *second;
+ ktime_t *timestamps;
+ ktime_t unstable_timeout;
+};
+
+struct ip_vs_mhs_two_dests {
+ struct ip_vs_dest *dest;
+ struct ip_vs_dest *new_dest;
+ bool unstable;
+};
+
+static inline bool
+ip_vs_mhs_is_new_conn(const struct sk_buff *skb, struct ip_vs_iphdr *iph)
+{
+ switch (iph->protocol) {
+ case IPPROTO_TCP: {
+ struct tcphdr _tcph, *th;
+
+ th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph);
+ if (!th)
+ return false;
+ return th->syn;
+ }
+ default:
+ return false;
+ }
+}
+
+static inline void
+generate_hash_secret(hsiphash_key_t *hash1, hsiphash_key_t *hash2)
+{
+ hash1->key[0] = 2654435761UL;
+ hash1->key[1] = 2654435761UL;
+
+ hash2->key[0] = 2654446892UL;
+ hash2->key[1] = 2654446892UL;
+}
+
+/* Returns hash value for IPVS MH entry */
+static inline unsigned int
+ip_vs_mhs_hashkey(int af, const union nf_inet_addr *addr, __be16 port,
+ hsiphash_key_t *key, unsigned int offset)
+{
+ unsigned int v;
+ __be32 addr_fold = addr->ip;
+
+#ifdef CONFIG_IP_VS_IPV6
+ if (af == AF_INET6)
+ addr_fold = addr->ip6[0] ^ addr->ip6[1] ^
+ addr->ip6[2] ^ addr->ip6[3];
+#endif
+ v = (offset + ntohs(port) + ntohl(addr_fold));
+ return hsiphash(&v, sizeof(v), key);
+}
+
+/* Reset all the hash buckets of the specified table. */
+static void ip_vs_mhs_reset(struct ip_vs_mhs_state *s)
+{
+ int i;
+ struct ip_vs_mhs_lookup *l;
+ struct ip_vs_dest *dest;
+
+ l = &s->lookup[0];
+ for (i = 0; i < IP_VS_MH_TAB_SIZE; i++) {
+ dest = rcu_dereference_protected(l->dest, 1);
+ if (dest) {
+ ip_vs_dest_put(dest);
+ RCU_INIT_POINTER(l->dest, NULL);
+ }
+ l++;
+ }
+}
+
+/* Update timestamps with new lookup table */
+static void
+ip_vs_mhs_update_timestamps(struct ip_vs_mhs_two_states *states)
+{
+ unsigned int offset = 0;
+
+ while (offset < IP_VS_MH_TAB_SIZE) {
+ if (states->first->lookup[offset].dest ==
+ states->second->lookup[offset].dest) {
+ if (states->timestamps[offset]) {
+ /* stabilization */
+ states->timestamps[offset] = (ktime_t)0;
+ }
+ } else {
+ if (!states->timestamps[offset]) {
+ /* destabilization */
+ states->timestamps[offset] = ktime_get();
+ }
+ }
+ ++offset;
+ }
+}
+
+static int
+ip_vs_mhs_permutate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc)
+{
+ struct list_head *p;
+ struct ip_vs_mhs_dest_setup *ds;
+ struct ip_vs_dest *dest;
+ int lw;
+
+ /* If gcd is smaller than 1, the number of dests or
+ * all dest weights are zero, so skip the
+ * permutation for the dests.
+ */
+ if (s->gcd < 1)
+ return 0;
+
+ /* Set dest_setup for the dests permutation */
+ p = &svc->destinations;
+ ds = &s->dest_setup[0];
+ while ((p = p->next) != &svc->destinations) {
+ dest = list_entry(p, struct ip_vs_dest, n_list);
+
+ ds->offset = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port,
+ &s->hash1, 0) %
+ IP_VS_MH_TAB_SIZE;
+ ds->skip = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port,
+ &s->hash2, 0) %
+ (IP_VS_MH_TAB_SIZE - 1) + 1;
+ ds->perm = ds->offset;
+
+ lw = atomic_read(&dest->weight);
+ ds->turns = ((lw / s->gcd) >> s->rshift) ?: (lw != 0);
+ ds++;
+ }
+ return 0;
+}
+
+static int
+ip_vs_mhs_populate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc)
+{
+ int n, c, dt_count;
+ unsigned long *table;
+ struct list_head *p;
+ struct ip_vs_mhs_dest_setup *ds;
+ struct ip_vs_dest *dest, *new_dest;
+
+ /* If gcd is smaller than 1, the number of dests or
+ * all dest last_weights are zero, so skip the
+ * population for the dests and reset the lookup table.
+ */
+ if (s->gcd < 1) {
+ ip_vs_mhs_reset(s);
+ return 0;
+ }
+
+ table = kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE), sizeof(unsigned long),
+ GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+
+ p = &svc->destinations;
+ n = 0;
+ dt_count = 0;
+ while (n < IP_VS_MH_TAB_SIZE) {
+ if (p == &svc->destinations)
+ p = p->next;
+
+ ds = &s->dest_setup[0];
+ while (p != &svc->destinations) {
+ /* Ignore added server with zero weight */
+ if (ds->turns < 1) {
+ p = p->next;
+ ds++;
+ continue;
+ }
+
+ c = ds->perm;
+ while (test_bit(c, table)) {
+ /* Add skip, mod IP_VS_MH_TAB_SIZE */
+ ds->perm += ds->skip;
+ if (ds->perm >= IP_VS_MH_TAB_SIZE)
+ ds->perm -= IP_VS_MH_TAB_SIZE;
+ c = ds->perm;
+ }
+
+ __set_bit(c, table);
+
+ dest = rcu_dereference_protected(s->lookup[c].dest, 1);
+ new_dest = list_entry(p, struct ip_vs_dest, n_list);
+ if (dest != new_dest) {
+ if (dest)
+ ip_vs_dest_put(dest);
+ ip_vs_dest_hold(new_dest);
+ RCU_INIT_POINTER(s->lookup[c].dest, new_dest);
+ }
+
+ if (++n == IP_VS_MH_TAB_SIZE)
+ goto out;
+
+ if (++dt_count >= ds->turns) {
+ dt_count = 0;
+ p = p->next;
+ ds++;
+ }
+ }
+ }
+
+out:
+ kfree(table);
+ return 0;
+}
+
+/* Assign all the hash buckets of the specified table with the service. */
+static int
+ip_vs_mhs_reassign(struct ip_vs_mhs_state *s, struct ip_vs_service *svc)
+{
+ int ret;
+
+ if (svc->num_dests > IP_VS_MH_TAB_SIZE)
+ return -EINVAL;
+
+ if (svc->num_dests >= 1) {
+ s->dest_setup = kcalloc(svc->num_dests,
+ sizeof(struct ip_vs_mhs_dest_setup),
+ GFP_KERNEL);
+ if (!s->dest_setup)
+ return -ENOMEM;
+ }
+
+ ip_vs_mhs_permutate(s, svc);
+
+ ret = ip_vs_mhs_populate(s, svc);
+ if (ret < 0)
+ goto out;
+
+ IP_VS_DBG_BUF(6, "MHS: %s(): reassign lookup table of %s:%u\n",
+ __func__,
+ IP_VS_DBG_ADDR(svc->af, &svc->addr),
+ ntohs(svc->port));
+
+out:
+ if (svc->num_dests >= 1) {
+ kfree(s->dest_setup);
+ s->dest_setup = NULL;
+ }
+ return ret;
+}
+
+static int
+ip_vs_mhs_gcd_weight(struct ip_vs_service *svc)
+{
+ struct ip_vs_dest *dest;
+ int weight;
+ int g = 0;
+
+ list_for_each_entry(dest, &svc->destinations, n_list) {
+ weight = atomic_read(&dest->weight);
+ if (weight > 0) {
+ if (g > 0)
+ g = gcd(weight, g);
+ else
+ g = weight;
+ }
+ }
+ return g;
+}
+
+/* To avoid assigning huge weight for the MH table,
+ * calculate shift value with gcd.
+ */
+static int
+ip_vs_mhs_shift_weight(struct ip_vs_service *svc, int gcd)
+{
+ struct ip_vs_dest *dest;
+ int new_weight, weight = 0;
+ int mw, shift;
+
+ /* If gcd is smaller than 1, the number of dests or
+ * all dest weights are zero, so return
+ * a shift value of zero.
+ */
+ if (gcd < 1)
+ return 0;
+
+ list_for_each_entry(dest, &svc->destinations, n_list) {
+ new_weight = atomic_read(&dest->weight);
+ if (new_weight > weight)
+ weight = new_weight;
+ }
+
+ /* gcd is at least 1 here, so the maximum weight
+ * and the division below are well-defined
+ */
+ mw = weight / gcd;
+
+ /* shift = occupied bits of weight/gcd - MH highest bits */
+ shift = fls(mw) - IP_VS_MH_TAB_BITS;
+ return (shift >= 0) ? shift : 0;
+}
+
+static ktime_t
+ip_vs_mhs_get_unstable_timeout(struct ip_vs_service *svc)
+{
+ struct ip_vs_proto_data *pd;
+ u64 tcp_to, tcp_fin_to;
+
+ pd = ip_vs_proto_data_get(svc->ipvs, IPPROTO_TCP);
+ tcp_to = pd->timeout_table[IP_VS_TCP_S_ESTABLISHED];
+ tcp_fin_to = pd->timeout_table[IP_VS_TCP_S_FIN_WAIT];
+ return ns_to_ktime(jiffies64_to_nsecs(max(tcp_to, tcp_fin_to)));
+}
+
+static void
+ip_vs_mhs_state_free(struct rcu_head *head)
+{
+ struct ip_vs_mhs_state *s;
+
+ s = container_of(head, struct ip_vs_mhs_state, rcu_head);
+ kfree(s->lookup);
+ kfree(s);
+}
+
+static int
+ip_vs_mhs_init_svc(struct ip_vs_service *svc)
+{
+ struct ip_vs_mhs_state *s0, *s1;
+ struct ip_vs_mhs_two_states *states;
+ ktime_t *tss;
+ int ret;
+
+ /* Allocate timestamps */
+ tss = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(ktime_t), GFP_KERNEL);
+ if (!tss)
+ return -ENOMEM;
+
+ /* Allocate the first MH table for this service */
+ s0 = kzalloc(sizeof(*s0), GFP_KERNEL);
+ if (!s0) {
+ kfree(tss);
+ return -ENOMEM;
+ }
+
+ s0->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup),
+ GFP_KERNEL);
+ if (!s0->lookup) {
+ kfree(tss);
+ kfree(s0);
+ return -ENOMEM;
+ }
+
+ generate_hash_secret(&s0->hash1, &s0->hash2);
+ s0->gcd = ip_vs_mhs_gcd_weight(svc);
+ s0->rshift = ip_vs_mhs_shift_weight(svc, s0->gcd);
+
+ IP_VS_DBG(6,
+ "MHS: %s(): The first lookup table (memory=%zdbytes) allocated\n",
+ __func__,
+ sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
+
+ /* Assign the first lookup table with current dests */
+ ret = ip_vs_mhs_reassign(s0, svc);
+ if (ret < 0) {
+ kfree(tss);
+ ip_vs_mhs_reset(s0);
+ ip_vs_mhs_state_free(&s0->rcu_head);
+ return ret;
+ }
+
+ /* Allocate the second MH table for this service */
+ s1 = kzalloc(sizeof(*s1), GFP_KERNEL);
+ if (!s1) {
+ kfree(tss);
+ ip_vs_mhs_reset(s0);
+ ip_vs_mhs_state_free(&s0->rcu_head);
+ return -ENOMEM;
+ }
+ s1->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup),
+ GFP_KERNEL);
+ if (!s1->lookup) {
+ kfree(tss);
+ ip_vs_mhs_reset(s0);
+ ip_vs_mhs_state_free(&s0->rcu_head);
+ kfree(s1);
+ return -ENOMEM;
+ }
+
+ s1->hash1 = s0->hash1;
+ s1->hash2 = s0->hash2;
+ s1->gcd = s0->gcd;
+ s1->rshift = s0->rshift;
+
+ IP_VS_DBG(6,
+ "MHS: %s(): The second lookup table (memory=%zdbytes) allocated\n",
+ __func__,
+ sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
+
+ /* Assign the second lookup table with current dests */
+ ret = ip_vs_mhs_reassign(s1, svc);
+ if (ret < 0) {
+ kfree(tss);
+ ip_vs_mhs_reset(s0);
+ ip_vs_mhs_state_free(&s0->rcu_head);
+ ip_vs_mhs_reset(s1);
+ ip_vs_mhs_state_free(&s1->rcu_head);
+ return ret;
+ }
+
+ /* Allocate, initialize and attach states */
+ states = kcalloc(1, sizeof(struct ip_vs_mhs_two_states), GFP_KERNEL);
+ if (!states) {
+ kfree(tss);
+ ip_vs_mhs_reset(s0);
+ ip_vs_mhs_state_free(&s0->rcu_head);
+ ip_vs_mhs_reset(s1);
+ ip_vs_mhs_state_free(&s1->rcu_head);
+ return -ENOMEM;
+ }
+
+ states->first = s0;
+ states->second = s1;
+ states->timestamps = tss;
+ states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc);
+ svc->sched_data = states;
+ return 0;
+}
+
+static void
+ip_vs_mhs_done_svc(struct ip_vs_service *svc)
+{
+ struct ip_vs_mhs_two_states *states = svc->sched_data;
+
+ kfree(states->timestamps);
+
+ /* Got to clean up the first lookup entry here */
+ ip_vs_mhs_reset(states->first);
+
+ call_rcu(&states->first->rcu_head, ip_vs_mhs_state_free);
+ IP_VS_DBG(6,
+ "MHS: The first MH lookup table (memory=%zdbytes) released\n",
+ sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
+
+ /* Got to clean up the second lookup entry here */
+ ip_vs_mhs_reset(states->second);
+
+ call_rcu(&states->second->rcu_head, ip_vs_mhs_state_free);
+ IP_VS_DBG(6,
+ "MHS: The second MH lookup table (memory=%zdbytes) released\n",
+ sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
+
+ kfree(states);
+}
+
+static int
+ip_vs_mhs_dest_changed(struct ip_vs_service *svc,
+ struct ip_vs_dest *dest)
+{
+ struct ip_vs_mhs_two_states *states = svc->sched_data;
+ struct ip_vs_mhs_state *s1 = states->second;
+ int ret;
+
+ s1->gcd = ip_vs_mhs_gcd_weight(svc);
+ s1->rshift = ip_vs_mhs_shift_weight(svc, s1->gcd);
+
+ /* Assign the lookup table with the updated service */
+ ret = ip_vs_mhs_reassign(s1, svc);
+
+ ip_vs_mhs_update_timestamps(states);
+ states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc);
+ IP_VS_DBG(6,
+ "MHS: %s: set unstable timeout: %llu\n",
+ __func__,
+ ktime_divns(states->unstable_timeout,
+ NSEC_PER_SEC));
+ return ret;
+}
+
+/* Helper function to get port number */
+static inline __be16
+ip_vs_mhs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph)
+{
+ __be16 _ports[2], *ports;
+
+ /* At this point we know that we have a valid packet of some kind.
+ * Because ICMP packets are only guaranteed to have the first 8
+ * bytes, let's just grab the ports. Fortunately they're in the
+ * same position for all three of the protocols we care about.
+ */
+ switch (iph->protocol) {
+ case IPPROTO_TCP:
+ case IPPROTO_UDP:
+ case IPPROTO_SCTP:
+ ports = skb_header_pointer(skb, iph->len, sizeof(_ports),
+ &_ports);
+ if (unlikely(!ports))
+ return 0;
+
+ if (likely(!ip_vs_iph_inverse(iph)))
+ return ports[0];
+ else
+ return ports[1];
+ default:
+ return 0;
+ }
+}
+
+/* Get ip_vs_dest associated with supplied parameters. */
+static inline void
+ip_vs_mhs_get(struct ip_vs_service *svc,
+ struct ip_vs_mhs_two_states *states,
+ struct ip_vs_mhs_two_dests *dests,
+ const union nf_inet_addr *addr,
+ __be16 port)
+{
+ unsigned int hash;
+ ktime_t timestamp;
+
+ hash = ip_vs_mhs_hashkey(svc->af, addr, port, &states->first->hash1,
+ 0) % IP_VS_MH_TAB_SIZE;
+ dests->dest = rcu_dereference(states->first->lookup[hash].dest);
+ dests->new_dest = rcu_dereference(states->second->lookup[hash].dest);
+ timestamp = states->timestamps[hash];
+
+ /* only unstable hash keys have a non-zero timestamp */
+ if (timestamp > 0) {
+ /* unstable */
+ if (timestamp + states->unstable_timeout > ktime_get()) {
+ /* timer didn't expire */
+ dests->unstable = true;
+ return;
+ }
+ /* unstable -> stable */
+ if (dests->dest)
+ ip_vs_dest_put(dests->dest);
+ if (dests->new_dest)
+ ip_vs_dest_hold(dests->new_dest);
+ dests->dest = dests->new_dest;
+ RCU_INIT_POINTER(states->first->lookup[hash].dest,
+ dests->new_dest);
+ states->timestamps[hash] = (ktime_t)0;
+ }
+ /* stable */
+ dests->unstable = false;
+}
+
+/* Stateless Maglev Hashing scheduling */
+static struct ip_vs_dest *
+ip_vs_mhs_schedule(struct ip_vs_service *svc,
+ const struct sk_buff *skb,
+ struct ip_vs_iphdr *iph,
+ bool *need_state)
+{
+ struct ip_vs_mhs_two_dests dests;
+ struct ip_vs_dest *final_dest = NULL;
+ struct ip_vs_mhs_two_states *states = svc->sched_data;
+ __be16 port = 0;
+ const union nf_inet_addr *hash_addr;
+
+ *need_state = false;
+ hash_addr = ip_vs_iph_inverse(iph) ? &iph->daddr : &iph->saddr;
+
+ if (svc->flags & IP_VS_SVC_F_SCHED_MH_PORT)
+ port = ip_vs_mhs_get_port(skb, iph);
+
+ ip_vs_mhs_get(svc, states, &dests, hash_addr, port);
+ IP_VS_DBG_BUF(6,
+ "MHS: %s(): source IP address %s:%u --> server %s and %s\n",
+ __func__,
+ IP_VS_DBG_ADDR(svc->af, hash_addr),
+ ntohs(port),
+ dests.dest
+ ? IP_VS_DBG_ADDR(dests.dest->af, &dests.dest->addr)
+ : "NULL",
+ dests.new_dest
+ ? IP_VS_DBG_ADDR(dests.new_dest->af,
+ &dests.new_dest->addr)
+ : "NULL");
+
+ if (!dests.dest && !dests.new_dest) {
+ /* Both dests are NULL */
+ return NULL;
+ }
+
+ if (!(dests.dest && dests.new_dest)) {
+ /* dest is NULL or new_dest is NULL,
+ * so send all packets to the single available dest
+ * and create state
+ */
+ if (dests.new_dest) {
+ /* dest is NULL */
+ final_dest = dests.new_dest;
+ } else {
+ /* new_dest is NULL */
+ final_dest = dests.dest;
+ }
+ *need_state = true;
+ IP_VS_DBG(6,
+ "MHS: %s(): One dest, need_state=%s\n",
+ __func__,
+ *need_state ? "true" : "false");
+ } else if (dests.unstable) {
+ /* unstable */
+ if (iph->protocol == IPPROTO_TCP) {
+ /* TCP */
+ *need_state = true;
+ if (ip_vs_mhs_is_new_conn(skb, iph)) {
+ /* SYN packet */
+ final_dest = dests.new_dest;
+ IP_VS_DBG(6,
+ "MHS: %s(): Unstable, need_state=%s, SYN packet\n",
+ __func__,
+ *need_state ? "true" : "false");
+ } else {
+ /* Not SYN packet */
+ final_dest = dests.dest;
+ IP_VS_DBG(6,
+ "MHS: %s(): Unstable, need_state=%s, not SYN packet\n",
+ __func__,
+ *need_state ? "true" : "false");
+ }
+ } else if (iph->protocol == IPPROTO_UDP) {
+ /* UDP */
+ final_dest = dests.new_dest;
+ IP_VS_DBG(6,
+ "MHS: %s(): Unstable, need_state=%s, UDP packet\n",
+ __func__,
+ *need_state ? "true" : "false");
+ }
+ } else {
+ /* stable */
+ final_dest = dests.dest;
+ IP_VS_DBG(6,
+ "MHS: %s(): Stable, need_state=%s\n",
+ __func__,
+ *need_state ? "true" : "false");
+ }
+ return final_dest;
+}
+
+/* IPVS MHS Scheduler structure */
+static struct ip_vs_scheduler ip_vs_mhs_scheduler = {
+ .name = "mhs",
+ .refcnt = ATOMIC_INIT(0),
+ .module = THIS_MODULE,
+ .n_list = LIST_HEAD_INIT(ip_vs_mhs_scheduler.n_list),
+ .init_service = ip_vs_mhs_init_svc,
+ .done_service = ip_vs_mhs_done_svc,
+ .add_dest = ip_vs_mhs_dest_changed,
+ .del_dest = ip_vs_mhs_dest_changed,
+ .upd_dest = ip_vs_mhs_dest_changed,
+ .schedule_sl = ip_vs_mhs_schedule,
+};
+
+static int __init
+ip_vs_mhs_init(void)
+{
+ return register_ip_vs_scheduler(&ip_vs_mhs_scheduler);
+}
+
+static void __exit
+ip_vs_mhs_cleanup(void)
+{
+ unregister_ip_vs_scheduler(&ip_vs_mhs_scheduler);
+ rcu_barrier();
+}
+
+module_init(ip_vs_mhs_init);
+module_exit(ip_vs_mhs_cleanup);
+MODULE_DESCRIPTION("Stateless Maglev hashing IPVS scheduler");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Lev Pantiukhin <[email protected]>");
diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c b/net/netfilter/ipvs/ip_vs_proto_tcp.c
index 7da51390cea6..31a8c1bfc863 100644
--- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
@@ -38,7 +38,7 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
struct ip_vs_iphdr *iph)
{
struct ip_vs_service *svc;
- struct tcphdr _tcph, *th;
+ struct tcphdr _tcph, *th = NULL;
__be16 _ports[2], *ports = NULL;

/* In the event of icmp, we're only guaranteed to have the first 8
@@ -47,11 +47,8 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
*/
if (likely(!ip_vs_iph_icmp(iph))) {
th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph);
- if (th) {
- if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn))
- return 1;
+ if (th)
ports = &th->source;
- }
} else {
ports = skb_header_pointer(
skb, iph->len, sizeof(_ports), &_ports);
@@ -74,6 +71,17 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
if (svc) {
int ignored;

+ if (th) {
+ /* Unless sloppy_tcp or IP_VS_SVC_F_STATELESS
+ * is set, schedule only SYN packets without
+ * the RST flag.
+ */
+ if (!sysctl_sloppy_tcp(ipvs) &&
+ !(svc->flags & IP_VS_SVC_F_STATELESS) &&
+ (!th->syn || th->rst))
+ return 1;
+ }
+
if (ip_vs_todrop(ipvs)) {
/*
* It seems that we are very loaded.
--
2.17.1


2023-12-05 20:05:03

by Julian Anastasov

Subject: Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler


Hello,

On Mon, 4 Dec 2023, Lev Pantiukhin wrote:

> Maglev Hashing Stateless
> ========================
>
> Introduction
> ------------
>
> This patch to Linux kernel provides the following changes to IPVS:
>
> 1. Adds a new type (IP_VS_SVC_F_STATELESS) of scheduler that computes the
> need for connection entry addition;

I see the intention to avoid keeping connections.
IPVS still creates connection struct for every packet for the
IP_VS_CONN_F_ONE_PACKET mode but I'm not sure if this is faster than
keeping conns in hash table. You probably have stats for this.

> 2. Adds a new mhs (Maglev Hashing Stateless) scheduler based on the mh
> scheduler that implements a new algorithm (more details below);
> 3. Adds scheduling for ACK packets;
> 4. Adds destination sorting (more details below).
>
> This approach shows a significant reduction in CPU usage, even in the
> case of 10% of endpoints constantly flapping. It also makes the L4

It is crucial what strategy is used to deactivate dests.
MH with setting weight to 0 should not change the lookup table.
But add/remove always lead to problems.

> balancer less vulnerable to DDoS activity.
>
> The Description of a New Algorithm
> ----------------------------------
>
> This patch provides a modified version of the Maglev consistent hashing
> scheduling algorithm (scheduler mh). It simultaneously uses two hash
> tables instead of one. One of them is for old destinations, and the other
> (the candidate table) is for new ones. A hash key corresponds to two
> destinations, and if both hash tables point to the same destination, then
> the hash key is called stable; otherwise, it is called unstable. A new
> connection entry is created only in the event of an unstable hash key;
> otherwise, the packet goes through stateless processing. If the hash key
> is unstable:
>
> * In the case of a SYN packet, it will pick up the destination from the
> newer (candidate) hash table;
> * In the case of an ACK packet, it will use the old hash table.
>
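
	Just to confirm my reading of the lookup rule above, a tiny
userspace sketch (integer ids stand in for dests, names are mine):

```c
#include <stdbool.h>

/* Stub of one lookup slot pair: the old table's dest and the
 * candidate table's dest for the same hash key.
 */
struct mhs_slot {
	int old_dest;
	int new_dest;
};

/* Per-packet rule as I read it: while the key is unstable (the two
 * tables disagree), SYNs follow the candidate table and everything
 * else stays on the old table.
 */
static int mhs_pick_dest(const struct mhs_slot *s, bool is_syn)
{
	bool unstable = s->old_dest != s->new_dest;

	if (!unstable)
		return s->old_dest;
	return is_syn ? s->new_dest : s->old_dest;
}
```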
> Upon changing the set of destinations, mhs populates a new candidate hash
> table and initializes a timer equal to the TCP session timeout. When the
> timer expires, the candidate hash table value is merged into the old hash
> table, and the corresponding hash key again becomes stable. If there are
> changes in the destinations before the timer expires, mhs overwrites the
> candidate hash table without the timer reset. If the set of destinations
> is unchanged, the connection tracking table will be empty.
>
> IPVS stores destinations in an unordered way, so the same destination set
> may generate different hash tables. To guarantee proper generation of the
> Maglev hash table, the sorting of the destination list was added. This is
> important in the case of destination flaps, which return the candidate
> hash table to its original state. This patch implements sorting via
> simple insertion with linear complexity. However, this complexity may be
> simplified.
>
> Signed-off-by: Lev Pantiukhin <[email protected]>
> ---
> include/net/ip_vs.h | 6 +
> include/uapi/linux/ip_vs.h | 1 +
> net/netfilter/ipvs/Kconfig | 9 +
> net/netfilter/ipvs/Makefile | 1 +
> net/netfilter/ipvs/ip_vs_core.c | 34 +-
> net/netfilter/ipvs/ip_vs_ctl.c | 54 +-
> net/netfilter/ipvs/ip_vs_mhs.c | 740 +++++++++++++++++++++++++++
> net/netfilter/ipvs/ip_vs_proto_tcp.c | 18 +-
> 8 files changed, 851 insertions(+), 12 deletions(-)
> create mode 100644 net/netfilter/ipvs/ip_vs_mhs.c
>

> diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h
> index 1ed234e7f251..cc205c1c796c 100644
> --- a/include/uapi/linux/ip_vs.h
> +++ b/include/uapi/linux/ip_vs.h
> @@ -24,6 +24,7 @@
> #define IP_VS_SVC_F_SCHED1 0x0008 /* scheduler flag 1 */
> #define IP_VS_SVC_F_SCHED2 0x0010 /* scheduler flag 2 */
> #define IP_VS_SVC_F_SCHED3 0x0020 /* scheduler flag 3 */
> +#define IP_VS_SVC_F_STATELESS 0x0040 /* stateless scheduling */
>
> #define IP_VS_SVC_F_SCHED_SH_FALLBACK IP_VS_SVC_F_SCHED1 /* SH fallback */
> #define IP_VS_SVC_F_SCHED_SH_PORT IP_VS_SVC_F_SCHED2 /* SH use port */
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index 2a3017b9c001..886b75c48551 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -246,6 +246,15 @@ config IP_VS_MH
> If you want to compile it in kernel, say Y. To compile it as a
> module, choose M here. If unsure, say N.
>
> +config IP_VS_MHS
> + tristate "stateless maglev hashing scheduling"
> + help
> + The usual Maglev consistent hashing scheduling algorithm provides
> + Google's Maglev hashing algorithm as an IPVS scheduler.
> + This is a modified version of the Maglev consistent hashing algorithm.
> + It simultaneously uses two hash tables instead of one.
> + One of them is for old destinations, and the other is for new ones.

Looks like MHS implicitly uses the CONFIG_IP_VS_MH_TAB_INDEX
configuration. Maybe we should note that here.

> +
> config IP_VS_SED
> tristate "shortest expected delay scheduling"
> help
> diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile
> index bb5d8125c82a..ffe9977397e0 100644
> --- a/net/netfilter/ipvs/Makefile
> +++ b/net/netfilter/ipvs/Makefile
> @@ -34,6 +34,7 @@ obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o
> obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
> obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
> obj-$(CONFIG_IP_VS_MH) += ip_vs_mh.o
> +obj-$(CONFIG_IP_VS_MHS) += ip_vs_mhs.o
> obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
> obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
> obj-$(CONFIG_IP_VS_TWOS) += ip_vs_twos.o
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index a2c16b501087..6aaf762c0a1d 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -449,6 +449,7 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
> __be16 _ports[2], *pptr, cport, vport;
> const void *caddr, *vaddr;
> unsigned int flags;
> + bool need_state;
>
> *ignored = 1;
> /*
> @@ -525,7 +526,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
> if (sched) {
> /* read svc->sched_data after svc->scheduler */
> smp_rmb();
> - dest = sched->schedule(svc, skb, iph);
> + /* we use distinct handler for stateless service */
> + if (svc->flags & IP_VS_SVC_F_STATELESS)

Sometimes the scheduler can be changed for a svc, so we should
consider whether this should be a per-scheduler flag somewhere in
struct ip_vs_scheduler, or simply check for a present schedule_sl.
But in the end it should probably stay a svc flag, as you use it now.
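
	If we go the schedule_sl-presence route, the dispatch could look
like this (userspace sketch with stub types, not the real
struct ip_vs_scheduler):

```c
#include <stddef.h>

/* Minimal stand-in for struct ip_vs_scheduler: schedule_sl stays
 * NULL for stateful schedulers.
 */
struct sched_stub {
	int (*schedule)(void);
	int (*schedule_sl)(void);
};

static int stateful_pick(void)  { return 1; }
static int stateless_pick(void) { return 2; }

/* Dispatch on the presence of schedule_sl instead of a svc flag */
static int do_schedule(const struct sched_stub *s)
{
	if (s->schedule_sl)
		return s->schedule_sl();
	return s->schedule();
}
```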

> + dest = sched->schedule_sl(svc, skb, iph, &need_state);
> + else
> + dest = sched->schedule(svc, skb, iph);
> } else {
> dest = NULL;
> }
> @@ -534,9 +539,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
> return NULL;
> }
>
> - flags = (svc->flags & IP_VS_SVC_F_ONEPACKET
> - && iph->protocol == IPPROTO_UDP) ?
> - IP_VS_CONN_F_ONE_PACKET : 0;
> + /* We use IP_VS_SVC_F_ONEPACKET flag to create no state */
> + flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET &&
> + iph->protocol == IPPROTO_UDP) ||
> + (svc->flags & IP_VS_SVC_F_STATELESS && !need_state))
> + ? IP_VS_CONN_F_ONE_PACKET : 0;
>
> /*
> * Create a connection entry.
> @@ -563,7 +570,10 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
> IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport),
> cp->flags, refcount_read(&cp->refcnt));
>
> - ip_vs_conn_stats(cp, svc);
> + if (!(svc->flags & IP_VS_SVC_F_STATELESS) ||
> + (svc->flags & IP_VS_SVC_F_STATELESS && need_state)) {
> + ip_vs_conn_stats(cp, svc);

So, here we do not know if it is a new connection...
Then let's check IP_VS_HDR_NEW_CONN via a new function
ip_vs_iph_new_conn; we should create it like ip_vs_iph_inverse and
ip_vs_iph_icmp. See below.
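
	For illustration, the helper could have this shape (sketch:
IP_VS_HDR_NEW_CONN is a proposed flag bit, and only the needed field
of struct ip_vs_iphdr is stubbed here):

```c
#include <stdbool.h>

#define IP_VS_HDR_NEW_CONN 4	/* proposed bit, next to INVERSE/ICMP */

/* Only the field the helper needs from struct ip_vs_iphdr */
struct iphdr_stub {
	unsigned int hdr_flags;
};

/* Mirrors the shape of ip_vs_iph_inverse()/ip_vs_iph_icmp() */
static inline bool ip_vs_iph_new_conn(const struct iphdr_stub *iph)
{
	return !!(iph->hdr_flags & IP_VS_HDR_NEW_CONN);
}
```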

> + }
> return cp;
> }
>
> @@ -1915,6 +1925,7 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state
> int ret, pkts;
> struct sock *sk;
> int af = state->pf;
> + struct ip_vs_service *svc;
>
> /* Already marked as IPVS request or reply? */
> if (skb->ipvs_property)
> @@ -1990,6 +2001,19 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state
> cp = INDIRECT_CALL_1(pp->conn_in_get, ip_vs_conn_in_get_proto,
> ipvs, af, skb, &iph);
>
> + /* Don't use an expired connection in the stateless service case;
> + * otherwise reuse can maintain the number of connection entries
> + */
> + if (cp && cp->dest) {
> + svc = rcu_dereference(cp->dest->svc);
> +
> + if ((svc->flags & IP_VS_SVC_F_STATELESS) &&
> + !(timer_pending(&cp->timer) && time_after(cp->timer.expires, jiffies))) {
> + __ip_vs_conn_put(cp);
> + cp = NULL;
> + }
> + }

Do we need special treatment here? Is it possible to see
connections that do not expire? At least, the connection advances its
timer on every packet, so it should be impossible to see an expired
timer here.

> +
> if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) {
> int conn_reuse_mode = sysctl_conn_reuse_mode(ipvs);
> bool old_ct = false, resched = false;
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..fda321edbd9c 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -960,6 +960,43 @@ void ip_vs_stats_free(struct ip_vs_stats *stats)
> }
> }
>
> +static int __ip_vs_mh_compare_dests(struct list_head *a, struct list_head *b)
> +{
> + struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, n_list);
> + struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, n_list);
> + unsigned int i = 0;
> + __be32 diff;
> +
> + switch (dest_a->af) {
> + case AF_INET:
> + return (int)(dest_a->addr.ip - dest_b->addr.ip);
> +
> + case AF_INET6:
> + for (; i < ARRAY_SIZE(dest_a->addr.ip6); i++) {
> + diff = dest_a->addr.ip6[i] - dest_b->addr.ip6[i];
> + if (diff)
> + return (int)diff;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static struct list_head *
> +__ip_vs_find_insertion_place(struct list_head *new, struct list_head *head)
> +{
> + struct list_head *p = head;
> + int ret;
> +
> + while ((p = p->next) != head) {
> + ret = __ip_vs_mh_compare_dests(new, p);
> + if (ret < 0)
> + break;
> + }
> +
> + return p->prev;
> +}
> +
> /*
> * Update a destination in the given service
> */
> @@ -1038,7 +1075,10 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
> spin_unlock_bh(&dest->dst_lock);
>
> if (add) {
> - list_add_rcu(&dest->n_list, &svc->destinations);
> + /* sorting of dests list */
> + list_add_rcu(&dest->n_list,
> + __ip_vs_find_insertion_place(&dest->n_list,
> + &svc->destinations));

About the sorting of dests: there is no guarantee that sorting
prevents hash mismatch on reconfiguration. In MH, ip_vs_mh_permutate()
independently calculates a primary offset for every dest (ds->perm),
and later ip_vs_mh_populate() walks all dests in the order they were
added (probably reverse order). Every dest gets a chance to occupy
primary slots in the table based on its weight. As the hash functions
often collide, the next dests in the list have less chance to occupy
their primary slots.

So, the admin's strategy should be for newly added dests to be
considered last in the list. If the list is sorted, this even
complicates the addition of new servers, because if they are inserted
in the middle of the list they will disturb the hashing for the dests
that follow them.

In any case, adding/deleting a dest is considered a disturbing
operation for MH, but MH allowed a weight to be safely changed to 0
without reordering the lookup table, thanks to last_weight.

In short, with or without sorting, it is enough to add the dests
in the same order to duplicate the lookup table on reconfiguration.
Sorting helps only if we add dests by hand in a different order.
Or maybe I'm wrong?

> svc->num_dests++;
> sched = rcu_dereference_protected(svc->scheduler, 1);
> if (sched && sched->add_dest)
> @@ -1276,7 +1316,9 @@ static void __ip_vs_unlink_dest(struct ip_vs_service *svc,
> struct ip_vs_dest *dest,
> int svcupd)
> {
> - dest->flags &= ~IP_VS_DEST_F_AVAILABLE;
> + /* dest must be available from trash for stateless service */
> + if (!(svc->flags & IP_VS_SVC_F_STATELESS))
> + dest->flags &= ~IP_VS_DEST_F_AVAILABLE;

Not nice, see below

>
> /*
> * Remove it from the d-linked destination list.
> @@ -1440,6 +1482,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
> svc->port = u->port;
> svc->fwmark = u->fwmark;
> svc->flags = u->flags & ~IP_VS_SVC_F_HASHED;
> + if (!strcmp(u->sched_name, "mhs")) {
> + svc->flags |= IP_VS_SVC_F_STATELESS;
> + svc->flags &= ~IP_VS_SVC_F_PERSISTENT;
> + }

Should be part of ip_vs_mhs_init_svc; we can return -EINVAL
there if IP_VS_SVC_F_PERSISTENT is set. Or, to avoid stateless mode
in such a case, with all consequences:

if (!(svc->flags & IP_VS_SVC_F_PERSISTENT))
	svc->flags |= IP_VS_SVC_F_STATELESS;

Can we work in a different mode if we can not set
IP_VS_SVC_F_STATELESS due to some flags?

But in any case ip_vs_mhs_done_svc() should clear
IP_VS_SVC_F_STATELESS, because ip_vs_edit_service() can change
the scheduler.
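
	In other words, roughly this (userspace model of the two
callbacks, with illustrative flag values and names):

```c
#define F_PERSISTENT	0x0001
#define F_STATELESS	0x0040

struct svc_stub {
	unsigned int flags;
};

/* init_service: refuse persistence, then claim stateless mode */
static int mhs_init_svc_flags(struct svc_stub *svc)
{
	if (svc->flags & F_PERSISTENT)
		return -1;	/* -EINVAL in the kernel */
	svc->flags |= F_STATELESS;
	return 0;
}

/* done_service: drop the flag so a later scheduler change starts clean */
static void mhs_done_svc_flags(struct svc_stub *svc)
{
	svc->flags &= ~F_STATELESS;
}
```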

> svc->timeout = u->timeout * HZ;
> svc->netmask = u->netmask;
> svc->ipvs = ipvs;
> @@ -1578,6 +1624,10 @@ ip_vs_edit_service(struct ip_vs_service *svc, struct ip_vs_service_user_kern *u)
> * Set the flags and timeout value
> */
> svc->flags = u->flags | IP_VS_SVC_F_HASHED;
> + if (!strcmp(u->sched_name, "mhs")) {
> + svc->flags |= IP_VS_SVC_F_STATELESS;
> + svc->flags &= ~IP_VS_SVC_F_PERSISTENT;
> + }

Will be done in ip_vs_mhs_init_svc

> svc->timeout = u->timeout * HZ;
> svc->netmask = u->netmask;
>
> diff --git a/net/netfilter/ipvs/ip_vs_mhs.c b/net/netfilter/ipvs/ip_vs_mhs.c
> new file mode 100644
> index 000000000000..ab19ac0f5b02
> --- /dev/null
> +++ b/net/netfilter/ipvs/ip_vs_mhs.c
> @@ -0,0 +1,740 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* IPVS: Stateless Maglev Hashing scheduling module
> + *
> + * Authors: Lev Pantiukhin <[email protected]>
> + *
> + */
> +
> +/* The mh algorithm is to assign a preference list of all the lookup
> + * table positions to each destination and populate the table with
> + * the most-preferred position of destinations. Then it is to select
> + * the destination with the hash key of the source IP address by
> + * looking up the lookup table.
> + * The mhs algorithm is a modified stateless version of the mh algorithm.
> + * It uses two lookup tables and chooses one of two destinations.
> + *
> + * The mh algorithm is detailed in:
> + * [3.4 Consistent Hashing]
> +https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
> + *
> + */
> +
> +#define KMSG_COMPONENT "IPVS"
> +#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
> +
> +#include <linux/ip.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/skbuff.h>
> +
> +#include <net/ip_vs.h>
> +
> +#include <linux/siphash.h>
> +#include <linux/bitops.h>
> +#include <linux/gcd.h>
> +
> +#include <linux/list_sort.h>
> +
> +#define IP_VS_SVC_F_SCHED_MH_FALLBACK IP_VS_SVC_F_SCHED1 /* MH fallback */
> +#define IP_VS_SVC_F_SCHED_MH_PORT IP_VS_SVC_F_SCHED2 /* MH use port */
> +
> +struct ip_vs_mhs_lookup {
> + struct ip_vs_dest __rcu *dest; /* real server (cache) */
> +};
> +
> +struct ip_vs_mhs_dest_setup {
> + unsigned int offset; /* starting offset */
> + unsigned int skip; /* skip */
> + unsigned int perm; /* next_offset */
> + int turns; /* weight / gcd() and rshift */
> +};
> +
> +/* Available prime numbers for MH table */
> +static int primes[] = {251, 509, 1021, 2039, 4093,
> + 8191, 16381, 32749, 65521, 131071};
> +
> +/* For IPVS MH entry hash table */
> +#ifndef CONFIG_IP_VS_MH_TAB_INDEX
> +#define CONFIG_IP_VS_MH_TAB_INDEX 12
> +#endif
> +#define IP_VS_MH_TAB_BITS (CONFIG_IP_VS_MH_TAB_INDEX / 2)
> +#define IP_VS_MH_TAB_INDEX (CONFIG_IP_VS_MH_TAB_INDEX - 8)
> +#define IP_VS_MH_TAB_SIZE primes[IP_VS_MH_TAB_INDEX]
> +
> +struct ip_vs_mhs_state {
> + struct rcu_head rcu_head;
> + struct ip_vs_mhs_lookup *lookup;
> + struct ip_vs_mhs_dest_setup *dest_setup;
> + hsiphash_key_t hash1, hash2;
> + int gcd;
> + int rshift;
> +};
> +
> +struct ip_vs_mhs_two_states {
> + struct ip_vs_mhs_state *first;
> + struct ip_vs_mhs_state *second;
> + ktime_t *timestamps;
> + ktime_t unstable_timeout;
> +};
> +
> +struct ip_vs_mhs_two_dests {
> + struct ip_vs_dest *dest;
> + struct ip_vs_dest *new_dest;
> + bool unstable;
> +};
> +
> +static inline bool
> +ip_vs_mhs_is_new_conn(const struct sk_buff *skb, struct ip_vs_iphdr *iph)
> +{
> + switch (iph->protocol) {
> + case IPPROTO_TCP: {
> + struct tcphdr _tcph, *th;
> +
> + th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph);
> + if (!th)
> + return false;
> + return th->syn;
> + }
> + default:
> + return false;
> + }
> +}
> +
> +static inline void
> +generate_hash_secret(hsiphash_key_t *hash1, hsiphash_key_t *hash2)
> +{
> + hash1->key[0] = 2654435761UL;
> + hash1->key[1] = 2654435761UL;
> +
> + hash2->key[0] = 2654446892UL;
> + hash2->key[1] = 2654446892UL;
> +}
> +
> +/* Returns hash value for IPVS MH entry */
> +static inline unsigned int
> +ip_vs_mhs_hashkey(int af, const union nf_inet_addr *addr, __be16 port,
> + hsiphash_key_t *key, unsigned int offset)
> +{
> + unsigned int v;
> + __be32 addr_fold = addr->ip;
> +
> +#ifdef CONFIG_IP_VS_IPV6
> + if (af == AF_INET6)
> + addr_fold = addr->ip6[0] ^ addr->ip6[1] ^
> + addr->ip6[2] ^ addr->ip6[3];
> +#endif
> + v = (offset + ntohs(port) + ntohl(addr_fold));
> + return hsiphash(&v, sizeof(v), key);
> +}
> +
> +/* Reset all the hash buckets of the specified table. */
> +static void ip_vs_mhs_reset(struct ip_vs_mhs_state *s)
> +{
> + int i;
> + struct ip_vs_mhs_lookup *l;
> + struct ip_vs_dest *dest;
> +
> + l = &s->lookup[0];
> + for (i = 0; i < IP_VS_MH_TAB_SIZE; i++) {
> + dest = rcu_dereference_protected(l->dest, 1);
> + if (dest) {
> + ip_vs_dest_put(dest);
> + RCU_INIT_POINTER(l->dest, NULL);
> + }
> + l++;
> + }
> +}
> +
> +/* Update timestamps with new lookup table */
> +static void
> +ip_vs_mhs_update_timestamps(struct ip_vs_mhs_two_states *states)
> +{
> + unsigned int offset = 0;
> +
> + while (offset < IP_VS_MH_TAB_SIZE) {
> + if (states->first->lookup[offset].dest ==
> + states->second->lookup[offset].dest) {
> + if (states->timestamps[offset]) {
> + /* stabilization */
> + states->timestamps[offset] = (ktime_t)0;
> + }
> + } else {
> + if (!states->timestamps[offset]) {
> + /* destabilization */
> + states->timestamps[offset] = ktime_get();
> + }
> + }
> + ++offset;

Can't we use jiffies? At least call ktime_get() only once?
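
	i.e. sample the clock once per walk and reuse it for every
destabilized slot; a userspace model (table shrunk, int64 standing in
for ktime_t):

```c
#include <stdint.h>

#define TAB_SIZE 8

struct tables_stub {
	int first[TAB_SIZE];		/* old table, dest ids */
	int second[TAB_SIZE];		/* candidate table */
	int64_t timestamps[TAB_SIZE];	/* 0 = stable slot */
};

/* One clock read for the whole walk instead of one per unstable slot */
static void update_timestamps(struct tables_stub *t, int64_t now)
{
	for (unsigned int i = 0; i < TAB_SIZE; i++) {
		if (t->first[i] == t->second[i])
			t->timestamps[i] = 0;		/* stabilization */
		else if (!t->timestamps[i])
			t->timestamps[i] = now;		/* destabilization */
	}
}
```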

> + }
> +}
> +
> +static int
> +ip_vs_mhs_permutate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc)
> +{
> + struct list_head *p;
> + struct ip_vs_mhs_dest_setup *ds;
> + struct ip_vs_dest *dest;
> + int lw;
> +
> + /* If gcd is smaller than 1, the number of dests or
> + * all weight of dests are zero. So, skip
> + * permutation for the dests.
> + */
> + if (s->gcd < 1)
> + return 0;
> +
> + /* Set dest_setup for the dests permutation */
> + p = &svc->destinations;
> + ds = &s->dest_setup[0];
> + while ((p = p->next) != &svc->destinations) {
> + dest = list_entry(p, struct ip_vs_dest, n_list);
> +
> + ds->offset = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port,
> + &s->hash1, 0) %
> + IP_VS_MH_TAB_SIZE;
> + ds->skip = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port,
> + &s->hash2, 0) %
> + (IP_VS_MH_TAB_SIZE - 1) + 1;
> + ds->perm = ds->offset;
> +
> + lw = atomic_read(&dest->weight);
> + ds->turns = ((lw / s->gcd) >> s->rshift) ?: (lw != 0);
> + ds++;
> + }
> + return 0;
> +}
> +
> +static int
> +ip_vs_mhs_populate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc)
> +{
> + int n, c, dt_count;
> + unsigned long *table;
> + struct list_head *p;
> + struct ip_vs_mhs_dest_setup *ds;
> + struct ip_vs_dest *dest, *new_dest;
> +
> + /* If gcd is smaller than 1, the number of dests or
> + * all last_weight of dests are zero. So, skip
> + * the population for the dests and reset lookup table.
> + */
> + if (s->gcd < 1) {
> + ip_vs_mhs_reset(s);
> + return 0;
> + }
> +
> + table = kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE), sizeof(unsigned long),
> + GFP_KERNEL);

MH uses bitmap_zalloc for this...

> + if (!table)
> + return -ENOMEM;
> +
> + p = &svc->destinations;
> + n = 0;
> + dt_count = 0;
> + while (n < IP_VS_MH_TAB_SIZE) {
> + if (p == &svc->destinations)
> + p = p->next;
> +
> + ds = &s->dest_setup[0];
> + while (p != &svc->destinations) {
> + /* Ignore added server with zero weight */
> + if (ds->turns < 1) {
> + p = p->next;
> + ds++;
> + continue;
> + }
> +
> + c = ds->perm;
> + while (test_bit(c, table)) {
> + /* Add skip, mod s->tab_size */

IP_VS_MH_TAB_SIZE, not s->tab_size

> + ds->perm += ds->skip;
> + if (ds->perm >= IP_VS_MH_TAB_SIZE)
> + ds->perm -= IP_VS_MH_TAB_SIZE;
> + c = ds->perm;
> + }
> +
> + __set_bit(c, table);
> +
> + dest = rcu_dereference_protected(s->lookup[c].dest, 1);
> + new_dest = list_entry(p, struct ip_vs_dest, n_list);
> + if (dest != new_dest) {
> + if (dest)
> + ip_vs_dest_put(dest);
> + ip_vs_dest_hold(new_dest);
> + RCU_INIT_POINTER(s->lookup[c].dest, new_dest);
> + }
> +
> + if (++n == IP_VS_MH_TAB_SIZE)
> + goto out;
> +
> + if (++dt_count >= ds->turns) {
> + dt_count = 0;
> + p = p->next;
> + ds++;
> + }
> + }
> + }
> +
> +out:
> + kfree(table);

bitmap_free

> + return 0;
> +}
> +
> +/* Assign all the hash buckets of the specified table with the service. */
> +static int
> +ip_vs_mhs_reassign(struct ip_vs_mhs_state *s, struct ip_vs_service *svc)
> +{
> + int ret;
> +
> + if (svc->num_dests > IP_VS_MH_TAB_SIZE)
> + return -EINVAL;
> +
> + if (svc->num_dests >= 1) {
> + s->dest_setup = kcalloc(svc->num_dests,
> + sizeof(struct ip_vs_mhs_dest_setup),
> + GFP_KERNEL);
> + if (!s->dest_setup)
> + return -ENOMEM;
> + }
> +
> + ip_vs_mhs_permutate(s, svc);
> +
> + ret = ip_vs_mhs_populate(s, svc);
> + if (ret < 0)
> + goto out;
> +
> + IP_VS_DBG_BUF(6, "MHS: %s(): reassign lookup table of %s:%u\n",
> + __func__,
> + IP_VS_DBG_ADDR(svc->af, &svc->addr),
> + ntohs(svc->port));
> +
> +out:
> + if (svc->num_dests >= 1) {
> + kfree(s->dest_setup);
> + s->dest_setup = NULL;
> + }
> + return ret;
> +}
> +
> +static int
> +ip_vs_mhs_gcd_weight(struct ip_vs_service *svc)
> +{
> + struct ip_vs_dest *dest;
> + int weight;
> + int g = 0;
> +
> + list_for_each_entry(dest, &svc->destinations, n_list) {
> + weight = atomic_read(&dest->weight);
> + if (weight > 0) {
> + if (g > 0)
> + g = gcd(weight, g);
> + else
> + g = weight;
> + }
> + }
> + return g;
> +}
> +
> +/* To avoid assigning huge weight for the MH table,
> + * calculate shift value with gcd.
> + */
> +static int
> +ip_vs_mhs_shift_weight(struct ip_vs_service *svc, int gcd)
> +{
> + struct ip_vs_dest *dest;
> + int new_weight, weight = 0;
> + int mw, shift;
> +
> + /* If gcd is smaller than 1, the number of dests or
> + * all weight of dests are zero. So, return
> + * shift value as zero.
> + */
> + if (gcd < 1)
> + return 0;
> +
> + list_for_each_entry(dest, &svc->destinations, n_list) {
> + new_weight = atomic_read(&dest->weight);
> + if (new_weight > weight)
> + weight = new_weight;
> + }
> +
> + /* Because gcd is greater than zero,
> + * the maximum weight and gcd are always greater than zero
> + */
> + mw = weight / gcd;
> +
> + /* shift = occupied bits of weight/gcd - MH highest bits */
> + shift = fls(mw) - IP_VS_MH_TAB_BITS;
> + return (shift >= 0) ? shift : 0;
> +}
> +
> +static ktime_t
> +ip_vs_mhs_get_unstable_timeout(struct ip_vs_service *svc)
> +{
> + struct ip_vs_proto_data *pd;
> + u64 tcp_to, tcp_fin_to;
> +
> + pd = ip_vs_proto_data_get(svc->ipvs, IPPROTO_TCP);
> + tcp_to = pd->timeout_table[IP_VS_TCP_S_ESTABLISHED];
> + tcp_fin_to = pd->timeout_table[IP_VS_TCP_S_FIN_WAIT];
> + return ns_to_ktime(jiffies64_to_nsecs(max(tcp_to, tcp_fin_to)));
> +}
> +
> +static void
> +ip_vs_mhs_state_free(struct rcu_head *head)
> +{
> + struct ip_vs_mhs_state *s;
> +
> + s = container_of(head, struct ip_vs_mhs_state, rcu_head);
> + kfree(s->lookup);
> + kfree(s);
> +}
> +
> +static int
> +ip_vs_mhs_init_svc(struct ip_vs_service *svc)
> +{
> + struct ip_vs_mhs_state *s0, *s1;
> + struct ip_vs_mhs_two_states *states;
> + ktime_t *tss;
> + int ret;

The scheduler is assigned to a virtual service in 2 cases:

- common case: a new service is created, no dests

- rare case: the scheduler is changed for an existing service with
dests present in svc->destinations

See when ip_vs_bind_scheduler() is called.

So, when ip_vs_mhs_init_svc() is called, for the common case,
we will build an empty states->first table. As a result, we will
initially start with an unstable period of 15 mins. But it is hard to
tell when all initial dests are added if we want to avoid that.

> +
> + /* Allocate timestamps */
> + tss = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(ktime_t), GFP_KERNEL);
> + if (!tss)
> + return -ENOMEM;
> +
> + /* Allocate the first MH table for this service */
> + s0 = kzalloc(sizeof(*s0), GFP_KERNEL);
> + if (!s0) {
> + kfree(tss);
> + return -ENOMEM;
> + }
> +
> + s0->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup),
> + GFP_KERNEL);
> + if (!s0->lookup) {
> + kfree(tss);
> + kfree(s0);
> + return -ENOMEM;
> + }
> +
> + generate_hash_secret(&s0->hash1, &s0->hash2);
> + s0->gcd = ip_vs_mhs_gcd_weight(svc);
> + s0->rshift = ip_vs_mhs_shift_weight(svc, s0->gcd);
> +
> + IP_VS_DBG(6,
> + "MHS: %s(): The first lookup table (memory=%zdbytes) allocated\n",
> + __func__,
> + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
> +
> + /* Assign the first lookup table with current dests */
> + ret = ip_vs_mhs_reassign(s0, svc);
> + if (ret < 0) {
> + kfree(tss);
> + ip_vs_mhs_reset(s0);
> + ip_vs_mhs_state_free(&s0->rcu_head);
> + return ret;
> + }
> +
> + /* Allocate the second MH table for this service */
> + s1 = kzalloc(sizeof(*s1), GFP_KERNEL);
> + if (!s1) {
> + kfree(tss);
> + ip_vs_mhs_reset(s0);
> + ip_vs_mhs_state_free(&s0->rcu_head);
> + return -ENOMEM;
> + }
> + s1->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup),
> + GFP_KERNEL);
> + if (!s1->lookup) {
> + kfree(tss);
> + ip_vs_mhs_reset(s0);
> + ip_vs_mhs_state_free(&s0->rcu_head);
> + kfree(s1);
> + return -ENOMEM;
> + }
> +
> + s1->hash1 = s0->hash1;
> + s1->hash2 = s0->hash2;
> + s1->gcd = s0->gcd;
> + s1->rshift = s0->rshift;
> +
> + IP_VS_DBG(6,
> + "MHS: %s(): The second lookup table (memory=%zdbytes) allocated\n",
> + __func__,
> + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
> +
> + /* Assign the second lookup table with current dests */
> + ret = ip_vs_mhs_reassign(s1, svc);
> + if (ret < 0) {
> + kfree(tss);
> + ip_vs_mhs_reset(s0);
> + ip_vs_mhs_state_free(&s0->rcu_head);
> + ip_vs_mhs_reset(s1);
> + ip_vs_mhs_state_free(&s1->rcu_head);

Too many things to release; a common release point
would look less risky.
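
	Something like a single unwind path, kernel-style (userspace
sketch with illustrative sizes and names):

```c
#include <stdlib.h>

/* One allocation ladder, one unwind path: each failure label frees
 * exactly what was already allocated above it.
 */
static int init_three(void **tss, void **s0, void **s1)
{
	*tss = calloc(16, sizeof(long));
	if (!*tss)
		goto err;

	*s0 = calloc(1, 64);
	if (!*s0)
		goto err_tss;

	*s1 = calloc(1, 64);
	if (!*s1)
		goto err_s0;

	return 0;

err_s0:
	free(*s0);
err_tss:
	free(*tss);
err:
	return -1;	/* -ENOMEM in the kernel */
}
```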

> + return ret;
> + }
> +
> + /* Allocate, initialize and attach states */
> + states = kcalloc(1, sizeof(struct ip_vs_mhs_two_states), GFP_KERNEL);
> + if (!states) {
> + kfree(tss);
> + ip_vs_mhs_reset(s0);
> + ip_vs_mhs_state_free(&s0->rcu_head);
> + ip_vs_mhs_reset(s1);
> + ip_vs_mhs_state_free(&s1->rcu_head);
> + return -ENOMEM;
> + }
> +
> + states->first = s0;
> + states->second = s1;
> + states->timestamps = tss;
> + states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc);
> + svc->sched_data = states;
> + return 0;
> +}
> +
> +static void
> +ip_vs_mhs_done_svc(struct ip_vs_service *svc)
> +{
> + struct ip_vs_mhs_two_states *states = svc->sched_data;
> +
> + kfree(states->timestamps);

Freeing in done_svc is not RCU safe. You can call
ip_vs_mhs_reset, but an RCU callback should free 'states'.
And we can not run many RCU callbacks in parallel because their
execution order is not guaranteed. So, a single call_rcu for states
should be used, where we free the first/second states and also the
timestamps, and finally 'states' itself.
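
	i.e. one callback that owns the whole teardown; modeled in
userspace (a free counter is added only so the teardown is checkable;
in the kernel this body would be the single call_rcu callback, with
rcu_head living in the two_states struct itself):

```c
#include <stdlib.h>

struct state_stub {
	int *lookup;
};

struct two_states_stub {
	struct state_stub *first, *second;
	long *timestamps;
};

static int freed_objects;

/* Single release point: every allocation made at init time is
 * released here, in one place, after the grace period.
 */
static void two_states_free(struct two_states_stub *st)
{
	free(st->first->lookup);
	free(st->first);
	free(st->second->lookup);
	free(st->second);
	free(st->timestamps);
	free(st);
	freed_objects = 6;
}
```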

> +
> + /* Got to clean up the first lookup entry here */
> + ip_vs_mhs_reset(states->first);
> +
> + call_rcu(&states->first->rcu_head, ip_vs_mhs_state_free);
> + IP_VS_DBG(6,
> + "MHS: The first MH lookup table (memory=%zdbytes) released\n",
> + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
> +
> + /* Got to clean up the second lookup entry here */
> + ip_vs_mhs_reset(states->second);
> +
> + call_rcu(&states->second->rcu_head, ip_vs_mhs_state_free);
> + IP_VS_DBG(6,
> + "MHS: The second MH lookup table (memory=%zdbytes) released\n",
> + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE);
> +
> + kfree(states);
> +}
> +
> +static int
> +ip_vs_mhs_dest_changed(struct ip_vs_service *svc,
> + struct ip_vs_dest *dest)
> +{
> + struct ip_vs_mhs_two_states *states = svc->sched_data;
> + struct ip_vs_mhs_state *s1 = states->second;
> + int ret;
> +
> + s1->gcd = ip_vs_mhs_gcd_weight(svc);
> + s1->rshift = ip_vs_mhs_shift_weight(svc, s1->gcd);
> +
> + /* Assign the lookup table with the updated service */
> + ret = ip_vs_mhs_reassign(s1, svc);
> +
> + ip_vs_mhs_update_timestamps(states);
> + states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc);
> + IP_VS_DBG(6,
> + "MHS: %s: set unstable timeout: %llu",
> + __func__,
> + ktime_divns(states->unstable_timeout,
> + NSEC_PER_SEC));
> + return ret;
> +}
> +
> +/* Helper function to get port number */
> +static inline __be16
> +ip_vs_mhs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph)
> +{
> + __be16 _ports[2], *ports;
> +
> + /* At this point we know that we have a valid packet of some kind.
> + * Because ICMP packets are only guaranteed to have the first 8
> + * bytes, let's just grab the ports. Fortunately they're in the
> + * same position for all three of the protocols we care about.
> + */
> + switch (iph->protocol) {
> + case IPPROTO_TCP:
> + case IPPROTO_UDP:
> + case IPPROTO_SCTP:
> + ports = skb_header_pointer(skb, iph->len, sizeof(_ports),
> + &_ports);
> + if (unlikely(!ports))
> + return 0;
> +
> + if (likely(!ip_vs_iph_inverse(iph)))
> + return ports[0];
> + else
> + return ports[1];
> + default:
> + return 0;
> + }
> +}
> +
> +/* Get ip_vs_dest associated with supplied parameters. */
> +static inline void
> +ip_vs_mhs_get(struct ip_vs_service *svc,
> + struct ip_vs_mhs_two_states *states,
> + struct ip_vs_mhs_two_dests *dests,
> + const union nf_inet_addr *addr,
> + __be16 port)
> +{
> + unsigned int hash;
> + ktime_t timestamp;
> +
> + hash = ip_vs_mhs_hashkey(svc->af, addr, port, &states->first->hash1,
> + 0) % IP_VS_MH_TAB_SIZE;
> + dests->dest = rcu_dereference(states->first->lookup[hash].dest);
> + dests->new_dest = rcu_dereference(states->second->lookup[hash].dest);
> + timestamp = states->timestamps[hash];
> +
> + /* only unstable hashes have non-zero value */
> + if (timestamp > 0) {
> + /* unstable */
> + if (timestamp + states->unstable_timeout > ktime_get()) {
> + /* timer didn't expire */
> + dests->unstable = true;
> + return;
> + }
> + /* unstable -> stable */
> + if (dests->dest)
> + ip_vs_dest_put(dests->dest);
> + if (dests->new_dest)
> + ip_vs_dest_hold(dests->new_dest);
> + dests->dest = dests->new_dest;
> + RCU_INIT_POINTER(states->first->lookup[hash].dest,
> + dests->new_dest);
> + states->timestamps[hash] = (ktime_t)0;

These operations are not SMP safe: many readers may try to
switch to the stable state at the same time. Maybe some xchg operation
on timestamps[] can help, but it also races with reconfiguration,
i.e. ip_vs_mhs_update_timestamps(), ip_vs_mhs_populate(), etc.
As it is a rare condition, spin_lock_bh(&state->lock) will help instead.
You should revalidate states->timestamps[hash] under the lock.
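
	i.e. recheck under the lock, double-checked style; a pthread
sketch (a mutex standing in for spin_lock_bh, plain longs for ktime_t):

```c
#include <pthread.h>

#define TAB 8

static pthread_mutex_t tab_lock = PTHREAD_MUTEX_INITIALIZER;
static long stamps[TAB];	/* 0 = stable slot */

/* Lockless fast path; the rare promotion path revalidates the
 * timestamp under the lock so only one CPU performs the switch.
 */
static int promote_if_expired(unsigned int hash, long now, long timeout)
{
	int promoted = 0;

	if (!stamps[hash] || stamps[hash] + timeout > now)
		return 0;	/* stable, or timer still running */

	pthread_mutex_lock(&tab_lock);
	if (stamps[hash] && stamps[hash] + timeout <= now) {
		/* ...switch lookup[hash] to the candidate dest here... */
		stamps[hash] = 0;
		promoted = 1;
	}
	pthread_mutex_unlock(&tab_lock);
	return promoted;
}
```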

> + }
> + /* stable */
> + dests->unstable = false;
> +}
> +
> +/* Stateless Maglev Hashing scheduling */
> +static struct ip_vs_dest *
> +ip_vs_mhs_schedule(struct ip_vs_service *svc,
> + const struct sk_buff *skb,
> + struct ip_vs_iphdr *iph,
> + bool *need_state)
> +{
> + struct ip_vs_mhs_two_dests dests;
> + struct ip_vs_dest *final_dest = NULL;
> + struct ip_vs_mhs_two_states *states = svc->sched_data;
> + __be16 port = 0;
> + const union nf_inet_addr *hash_addr;
> +
> + *need_state = false;
> + hash_addr = ip_vs_iph_inverse(iph) ? &iph->daddr : &iph->saddr;
> +
> + if (svc->flags & IP_VS_SVC_F_SCHED_MH_PORT)
> + port = ip_vs_mhs_get_port(skb, iph);
> +
> + ip_vs_mhs_get(svc, states, &dests, hash_addr, port);
> + IP_VS_DBG_BUF(6,
> + "MHS: %s(): source IP address %s:%u --> server %s and %s\n",
> + __func__,
> + IP_VS_DBG_ADDR(svc->af, hash_addr),
> + ntohs(port),
> + dests.dest
> + ? IP_VS_DBG_ADDR(dests.dest->af, &dests.dest->addr)
> + : "NULL",
> + dests.new_dest
> + ? IP_VS_DBG_ADDR(dests.new_dest->af,
> + &dests.new_dest->addr)
> + : "NULL");
> +
> + if (!dests.dest && !dests.new_dest) {
> + /* Both dests are NULL */
> + return NULL;
> + }
> +
> + if (!(dests.dest && dests.new_dest)) {
> + /* dest is NULL or new_dest is NULL,
> + * so we send all packets to singular available dest
> + * and create state
> + */
> + if (dests.new_dest) {
> + /* dest is NULL */
> + final_dest = dests.new_dest;
> + } else {
> + /* new_dest is NULL */
> + final_dest = dests.dest;

In two cases we return dests.dest without checking for
IP_VS_DEST_F_AVAILABLE; moreover, you keep the flag set after the dest
is removed, which is not nice. If we do not want to fall back, in this
case we should return NULL, e.g. for ACK. Any traffic should stop if
!IP_VS_DEST_F_AVAILABLE, and if weight=0 only established connections
should work. As for IP_VS_DEST_F_OVERLOAD, if used, it should lead to
allocating a connection to a fallback server, something not suitable
for every scheduler.

> + }
> + *need_state = true;
> + IP_VS_DBG(6,
> + "MHS: %s(): One dest, need_state=%s\n",
> + __func__,
> + *need_state ? "true" : "false");
> + } else if (dests.unstable) {
> + /* unstable */
> + if (iph->protocol == IPPROTO_TCP) {
> + /* TCP */
> + *need_state = true;

Looks like we can use iph.hdr_flags & IP_VS_HDR_NEW_CONN instead
of ip_vs_mhs_is_new_conn. IP_VS_HDR_NEW_CONN can be set where we
call is_new_conn in ip_vs_in_hook:

if (!iph.fragoffs && is_new_conn(skb, &iph))
iph.hdr_flags |= IP_VS_HDR_NEW_CONN;
if (iph.hdr_flags & IP_VS_HDR_NEW_CONN && cp) {

> + if (ip_vs_mhs_is_new_conn(skb, iph)) {
> + /* SYN packet */
> + final_dest = dests.new_dest;
> + IP_VS_DBG(6,
> + "MHS: %s(): Unstable, need_state=%s, SYN packet\n",
> + __func__,
> + *need_state ? "true" : "false");
> + } else {
> + /* Not SYN packet */
> + final_dest = dests.dest;
> + IP_VS_DBG(6,
> + "MHS: %s(): Unstable, need_state=%s, not SYN packet\n",
> + __func__,
> + *need_state ? "true" : "false");
> + }
> + } else if (iph->protocol == IPPROTO_UDP) {
> + /* UDP */
> + final_dest = dests.new_dest;
> + IP_VS_DBG(6,
> + "MHS: %s(): Unstable, need_state=%s, UDP packet\n",
> + __func__,
> + *need_state ? "true" : "false");
> + }
> + } else {
> + /* stable */
> + final_dest = dests.dest;
> + IP_VS_DBG(6,
> + "MHS: %s(): Stable, need_state=%s\n",
> + __func__,
> + *need_state ? "true" : "false");
> + }
> + return final_dest;
> +}
> +
> +/* IPVS MHS Scheduler structure */
> +static struct ip_vs_scheduler ip_vs_mhs_scheduler = {
> + .name = "mhs",
> + .refcnt = ATOMIC_INIT(0),
> + .module = THIS_MODULE,
> + .n_list = LIST_HEAD_INIT(ip_vs_mhs_scheduler.n_list),
> + .init_service = ip_vs_mhs_init_svc,
> + .done_service = ip_vs_mhs_done_svc,
> + .add_dest = ip_vs_mhs_dest_changed,
> + .del_dest = ip_vs_mhs_dest_changed,
> + .upd_dest = ip_vs_mhs_dest_changed,
> + .schedule_sl = ip_vs_mhs_schedule,
> +};
> +
> +static int __init
> +ip_vs_mhs_init(void)
> +{
> + return register_ip_vs_scheduler(&ip_vs_mhs_scheduler);
> +}
> +
> +static void __exit
> +ip_vs_mhs_cleanup(void)
> +{
> + unregister_ip_vs_scheduler(&ip_vs_mhs_scheduler);
> + rcu_barrier();
> +}
> +
> +module_init(ip_vs_mhs_init);
> +module_exit(ip_vs_mhs_cleanup);
> +MODULE_DESCRIPTION("Stateless Maglev hashing ipvs scheduler");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Lev Pantiukhin <[email protected]>");
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index 7da51390cea6..31a8c1bfc863 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -38,7 +38,7 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
> struct ip_vs_iphdr *iph)
> {
> struct ip_vs_service *svc;
> - struct tcphdr _tcph, *th;
> + struct tcphdr _tcph, *th = NULL;
> __be16 _ports[2], *ports = NULL;
>
> /* In the event of icmp, we're only guaranteed to have the first 8
> @@ -47,11 +47,8 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
> */
> if (likely(!ip_vs_iph_icmp(iph))) {
> th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph);
> - if (th) {
> - if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn))
> - return 1;
> + if (th)
> ports = &th->source;
> - }
> } else {
> ports = skb_header_pointer(
> skb, iph->len, sizeof(_ports), &_ports);
> @@ -74,6 +71,17 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
> if (svc) {
> int ignored;
>
> + if (th) {
> + /* Unless sloppy_tcp or IP_VS_SVC_F_STATELESS is set,
> + * only SYN packets without the RST flag
> + * are scheduled.
> + */
> + if (!sysctl_sloppy_tcp(ipvs) &&
> + !(svc->flags & IP_VS_SVC_F_STATELESS) &&
> + (!th->syn || th->rst))
> + return 1;
> + }

Probably the same can be done for sctp_conn_schedule().

> +
> if (ip_vs_todrop(ipvs)) {
> /*
> * It seems that we are very loaded.
> --
> 2.17.1

Regards

--
Julian Anastasov <[email protected]>

2023-12-06 10:53:44

by kernel test robot

Subject: Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

Hi Lev,

kernel test robot noticed the following build warnings:

[auto build test WARNING on horms-ipvs-next/master]
[also build test WARNING on horms-ipvs/master linus/master v6.7-rc4 next-20231206]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Lev-Pantiukhin/ipvs-add-a-stateless-type-of-service-and-a-stateless-Maglev-hashing-scheduler/20231204-232344
base: https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git master
patch link: https://lore.kernel.org/r/20231204152020.472247-1-kndrvt%40yandex-team.ru
patch subject: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler
config: microblaze-randconfig-r123-20231206 (https://download.01.org/0day-ci/archive/20231206/[email protected]/config)
compiler: microblaze-linux-gcc (GCC) 13.2.0
reproduce: (https://download.01.org/0day-ci/archive/20231206/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> net/netfilter/ipvs/ip_vs_ctl.c:972:42: sparse: sparse: restricted __be32 degrades to integer
net/netfilter/ipvs/ip_vs_ctl.c:972:60: sparse: sparse: restricted __be32 degrades to integer
net/netfilter/ipvs/ip_vs_ctl.c:976:48: sparse: sparse: restricted __be32 degrades to integer
net/netfilter/ipvs/ip_vs_ctl.c:976:70: sparse: sparse: restricted __be32 degrades to integer
>> net/netfilter/ipvs/ip_vs_ctl.c:976:30: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __be32 [usertype] diff @@ got unsigned int @@
net/netfilter/ipvs/ip_vs_ctl.c:976:30: sparse: expected restricted __be32 [usertype] diff
net/netfilter/ipvs/ip_vs_ctl.c:976:30: sparse: got unsigned int
>> net/netfilter/ipvs/ip_vs_ctl.c:978:41: sparse: sparse: cast from restricted __be32
net/netfilter/ipvs/ip_vs_ctl.c:1532:27: sparse: sparse: dereference of noderef expression

vim +972 net/netfilter/ipvs/ip_vs_ctl.c

962
963 static int __ip_vs_mh_compare_dests(struct list_head *a, struct list_head *b)
964 {
965 struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, n_list);
966 struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, n_list);
967 unsigned int i = 0;
968 __be32 diff;
969
970 switch (dest_a->af) {
971 case AF_INET:
> 972 return (int)(dest_a->addr.ip - dest_b->addr.ip);
973
974 case AF_INET6:
975 for (; i < ARRAY_SIZE(dest_a->addr.ip6); i++) {
> 976 diff = dest_a->addr.ip6[i] - dest_b->addr.ip6[i];
977 if (diff)
> 978 return (int)diff;
979 }
980 }
981
982 return 0;
983 }
984

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-12-06 12:34:16

by Julian Anastasov

Subject: Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler


Hello,

On Mon, 4 Dec 2023, Lev Pantiukhin wrote:

> +#define IP_VS_SVC_F_STATELESS 0x0040 /* stateless scheduling */

I have another idea for the traffic that does not
need per-client state. We need some per-dest cp to forward the packet.
If we replace the cp->caddr usage with iph->saddr/daddr usage we can try
it. cp->caddr is used at the following places:

- tcp_snat_handler (iph->daddr), tcp_dnat_handler (iph->saddr): iph is
already provided. tcp_snat_handler requires IP_VS_SVC_F_STATELESS
to be set for a service with a present vaddr, i.e. non-fwmark based.
So, NAT+svc->fwmark is another restriction for IP_VS_SVC_F_STATELESS,
because we do not know what VIP to use as saddr for outgoing traffic.

- ip_vs_nfct_expect_related
- we should investigate any problems when IP_VS_CONN_F_NFCT
is set; probably we cannot work with NFCT?

- ip_vs_conn_drop_conntrack

- FTP:
- sets IP_VS_CONN_F_NFCT, uses cp->app

Maybe IP_VS_CONN_F_NFCT should be a restriction for the
IP_VS_SVC_F_STATELESS mode? cp->app for sure, because we keep the TCP
seq/ack state for the app in cp->in_seq/out_seq.

We can keep some dest->cp_route or another name that will
hold our cp for such connections. The idea is to not allocate cp for
every packet but to reuse this saved cp. It has all needed info to
forward skb to real server. The first packet will create it, save
it with some locking into dest and next packets will reuse it.

Probably, it should be a ONE_PACKET entry (not hashed in the table) but
it can have a running timer, if needed. One refcnt for attaching to the
dest, a new temporary refcnt for every packet. But in this mode
__ip_vs_conn_put_timer uses a 0-second timer, so we have to handle it
somehow. It should be released when the dest is removed and on edit_dest
if needed.

There are other problems to solve, such as set_tcp_state()
changing dest->activeconns and dest->inactconns. They are also used
in ip_vs_bind_dest() and ip_vs_unbind_dest(). As we do not keep the
previous connection state, and as a conn can start in the established
state, we should avoid touching these counters. For UDP, ONE_PACKET has
no such problem with states, but for TCP/SCTP we should take care.

Regards

--
Julian Anastasov <[email protected]>

2023-12-07 05:33:58

by Dan Carpenter

Subject: Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

Hi Lev,

kernel test robot noticed the following build warnings:


url: https://github.com/intel-lab-lkp/linux/commits/Lev-Pantiukhin/ipvs-add-a-stateless-type-of-service-and-a-stateless-Maglev-hashing-scheduler/20231204-232344
base: https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git master
patch link: https://lore.kernel.org/r/20231204152020.472247-1-kndrvt%40yandex-team.ru
patch subject: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler
config: i386-randconfig-141-20231207 (https://download.01.org/0day-ci/archive/20231207/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce: (https://download.01.org/0day-ci/archive/20231207/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Reported-by: Dan Carpenter <[email protected]>
| Closes: https://lore.kernel.org/r/[email protected]/

New smatch warnings:
net/netfilter/ipvs/ip_vs_core.c:545 ip_vs_schedule() error: uninitialized symbol 'need_state'.

vim +/need_state +545 net/netfilter/ipvs/ip_vs_core.c

^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 440 struct ip_vs_conn *
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 441 ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
d4383f04d145cc net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26 442 struct ip_vs_proto_data *pd, int *ignored,
d4383f04d145cc net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26 443 struct ip_vs_iphdr *iph)
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 444 {
9330419d9aa4f9 net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2011-01-03 445 struct ip_vs_protocol *pp = pd->pp;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 446 struct ip_vs_conn *cp = NULL;
ceec4c38168184 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2013-03-22 447 struct ip_vs_scheduler *sched;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 448 struct ip_vs_dest *dest;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 449 __be16 _ports[2], *pptr, cport, vport;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 450 const void *caddr, *vaddr;
3575792e005dc9 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-09-17 451 unsigned int flags;
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 452 bool need_state;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 453
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 454 *ignored = 1;
2f74713d1436b7 net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26 455 /*
2f74713d1436b7 net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26 456 * IPv6 frags, only the first hit here.
2f74713d1436b7 net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26 457 */
6b3d933000cbe5 net/netfilter/ipvs/ip_vs_core.c Gao Feng 2017-11-13 458 pptr = frag_safe_skb_hp(skb, iph->len, sizeof(_ports), _ports);
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 459 if (pptr == NULL)
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 460 return NULL;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 461
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 462 if (likely(!ip_vs_iph_inverse(iph))) {
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 463 cport = pptr[0];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 464 caddr = &iph->saddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 465 vport = pptr[1];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 466 vaddr = &iph->daddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 467 } else {
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 468 cport = pptr[1];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 469 caddr = &iph->daddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 470 vport = pptr[0];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 471 vaddr = &iph->saddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 472 }
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 473
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 474 /*
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 475 * FTPDATA needs this check when using local real server.
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 476 * Never schedule Active FTPDATA connections from real server.
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 477 * For LVS-NAT they must be already created. For other methods
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 478 * with persistence the connection is created on SYN+ACK.
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 479 */
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 480 if (cport == FTPDATA) {
b0e010c527de74 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 481 IP_VS_DBG_PKT(12, svc->af, pp, skb, iph->off,
0d79641a96d612 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 482 "Not scheduling FTPDATA");
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 483 return NULL;
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 484 }
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 485
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 486 /*
a5959d53d6048a net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 487 * Do not schedule replies from local real server.
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 488 */
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 489 if ((!skb->dev || skb->dev->flags & IFF_LOOPBACK)) {
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 490 iph->hdr_flags ^= IP_VS_HDR_INVERSE;
6ecd754883daff net/netfilter/ipvs/ip_vs_core.c Matteo Croce 2019-01-19 491 cp = INDIRECT_CALL_1(pp->conn_in_get,
6ecd754883daff net/netfilter/ipvs/ip_vs_core.c Matteo Croce 2019-01-19 492 ip_vs_conn_in_get_proto, svc->ipvs,
6ecd754883daff net/netfilter/ipvs/ip_vs_core.c Matteo Croce 2019-01-19 493 svc->af, skb, iph);
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 494 iph->hdr_flags ^= IP_VS_HDR_INVERSE;
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 495
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 496 if (cp) {
b0e010c527de74 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 497 IP_VS_DBG_PKT(12, svc->af, pp, skb, iph->off,
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 498 "Not scheduling reply for existing"
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 499 " connection");
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 500 __ip_vs_conn_put(cp);
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 501 return NULL;
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 502 }
802c41adcf3be6 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 503 }
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 504
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 505 /*
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 506 * Persistent service
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 507 */
a5959d53d6048a net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 508 if (svc->flags & IP_VS_SVC_F_PERSISTENT)
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 509 return ip_vs_sched_persist(svc, skb, cport, vport, ignored,
d4383f04d145cc net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26 510 iph);
a5959d53d6048a net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 511
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2010-10-17 512 *ignored = 0;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 513
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 514 /*
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 515 * Non-persistent service
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 516 */
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 517 if (!svc->fwmark && vport != svc->port) {
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 518 if (!svc->port)
1e3e238e9c4bf9 net/netfilter/ipvs/ip_vs_core.c Hannes Eder 2009-08-02 519 pr_err("Schedule: port zero only supported "
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 520 "in persistent services, "
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 521 "check your ipvs configuration\n");
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 522 return NULL;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 523 }
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 524
ceec4c38168184 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2013-03-22 525 sched = rcu_dereference(svc->scheduler);
05f00505a89acd net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2015-06-29 526 if (sched) {
05f00505a89acd net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2015-06-29 527 /* read svc->sched_data after svc->scheduler */
05f00505a89acd net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2015-06-29 528 smp_rmb();
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 529 /* we use distinct handler for stateless service */
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 530 if (svc->flags & IP_VS_SVC_F_STATELESS)
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 531 dest = sched->schedule_sl(svc, skb, iph, &need_state);
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 532 else
bba54de5bdd107 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2013-06-16 533 dest = sched->schedule(svc, skb, iph);

need_state not initialized on this path

05f00505a89acd net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2015-06-29 534 } else {
05f00505a89acd net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2015-06-29 535 dest = NULL;
05f00505a89acd net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2015-06-29 536 }
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 537 if (dest == NULL) {
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 538 IP_VS_DBG(1, "Schedule: no dest found.\n");
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 539 return NULL;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 540 }
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 541
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 542 /* We use IP_VS_SVC_F_ONEPACKET flag to create no state */
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 543 flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET &&
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 544 iph->protocol == IPPROTO_UDP) ||
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 @545 (svc->flags & IP_VS_SVC_F_STATELESS && !need_state))
^^^^^^^^^^

b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 546 ? IP_VS_CONN_F_ONE_PACKET : 0;
26ec037f9841e4 net/netfilter/ipvs/ip_vs_core.c Nick Chalk 2010-06-22 547
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 548 /*
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 549 * Create a connection entry.
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 550 */
f11017ec2d1859 net/netfilter/ipvs/ip_vs_core.c Simon Horman 2010-08-22 551 {
f11017ec2d1859 net/netfilter/ipvs/ip_vs_core.c Simon Horman 2010-08-22 552 struct ip_vs_conn_param p;
6e67e586e7289c net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2011-01-03 553
3109d2f2d1fe06 net/netfilter/ipvs/ip_vs_core.c Eric W. Biederman 2015-09-21 554 ip_vs_conn_fill_param(svc->ipvs, svc->af, iph->protocol,
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 555 caddr, cport, vaddr, vport, &p);
ba38528aae6ee2 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2014-09-09 556 cp = ip_vs_conn_new(&p, dest->af, &dest->addr,
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell 2015-08-26 557 dest->port ? dest->port : vport,
0e051e683ba4ac net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 558 flags, dest, skb->mark);
a5959d53d6048a net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 559 if (!cp) {
a5959d53d6048a net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 560 *ignored = -1;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 561 return NULL;
f11017ec2d1859 net/netfilter/ipvs/ip_vs_core.c Simon Horman 2010-08-22 562 }
a5959d53d6048a net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom 2010-11-19 563 }
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 564
cd17f9ed099ed2 net/ipv4/ipvs/ip_vs_core.c Julius Volz 2008-09-02 565 IP_VS_DBG_BUF(6, "Schedule fwd:%c c:%s:%u v:%s:%u "
cd17f9ed099ed2 net/ipv4/ipvs/ip_vs_core.c Julius Volz 2008-09-02 566 "d:%s:%u conn->flags:%X conn->refcnt:%d\n",
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 567 ip_vs_fwd_tag(cp),
f18ae7206eaebf net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2014-09-09 568 IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport),
f18ae7206eaebf net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2014-09-09 569 IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport),
f18ae7206eaebf net/netfilter/ipvs/ip_vs_core.c Julian Anastasov 2014-09-09 570 IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport),
b54ab92b84b616 net/netfilter/ipvs/ip_vs_core.c Reshetova, Elena 2017-03-16 571 cp->flags, refcount_read(&cp->refcnt));
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 572
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 573 if (!(svc->flags & IP_VS_SVC_F_STATELESS) ||
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 574 (svc->flags & IP_VS_SVC_F_STATELESS && need_state)) {
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 575 ip_vs_conn_stats(cp, svc);
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04 576 }
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 577 return cp;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c Linus Torvalds 2005-04-16 578 }
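The warning flagged above comes from the declaration at line 452: when the service is not IP_VS_SVC_F_STATELESS, schedule() is called, need_state is never written, and yet it is read at line 545. A minimal fix for a future revision would be to give the flag a defined default, e.g.:

```diff
-	bool need_state;
+	bool need_state = false;
```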

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki