Received: by 2002:a05:7412:b10a:b0:f3:1519:9f41 with SMTP id az10csp2780695rdb; Mon, 4 Dec 2023 07:23:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IEbIemObGDPOVCa0NzUgtfrettXwUoNjxDVweLqE0YaOicFYQUYWUud4KtW2IrxmdYSegko X-Received: by 2002:a17:902:7613:b0:1d0:b367:b3d4 with SMTP id k19-20020a170902761300b001d0b367b3d4mr721948pll.79.1701703383568; Mon, 04 Dec 2023 07:23:03 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701703383; cv=none; d=google.com; s=arc-20160816; b=F1lrurMJ25rhQthRohJIiFQqYq67tB8cYt+ZMto6tNw7PDcBszv93fMEqAEdok+1MH 56D9T93xF6PjENIg6lv8KGW6TZXmJHJxhPkIJCFaSJJvS6U3WL7bqSB+moxqVAB0UFj3 qZu5BeGNujr0kecht/SWfITSKWrKmFaLBFNDrUZNwz8x5U3pzWTUUuRjsQpZh7mONytV 8JjJBPp6S/lwsujbUmV/H8SZgKN6dq/kO9yLNM5AFZmE9/B5Qxbx2C+RIjS/ikJ8Rk34 0Od1Ul+hqiFVpBrg/pwMfSxlc180GJQWIMbGzAWI/TquMWuUzKh/a6ymZQdA0BYaNUVN i51Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:to:content-transfer-encoding:mime-version :message-id:date:subject:cc:from:dkim-signature; bh=IBcNs+ZlWGVhNZIdrbbSpZJavOkAvPGHqUVtzp/fagQ=; fh=llErUfQSFN1Mg2KjSNnryaTe14oQnKCVL/f+2m9b7oY=; b=EWLq/86Duh97GNuuHzaotBjH4pddMK2eGXpSoVfgAEc4wPull+oxh3NuWfHTROqiHh C+3BRJ0lJq+WKCUUCJOs2BDK0qozXFjP/A5QWnQvgNleZUjTedNigXYGqSJcloGoYrC7 PK5Q5n3poCjaQbuvximFByOYHUZTS7Gmze6QWRCA6bZBAYGg5KhmXvgNeUQuTmepYq/F W/joCk1nc9aO97Z6HfXmf1cK0ySBMt6lKvfFrGtP/w/4eFmD6Im9R6h3CNc42+7SqEAx SFQf6gXRoPRFMsF4OcXWoZ8yI395DxAlloDvd69GeBkYmqwTe9n9hfjIG/Eg1Yhi5A2d 1BZw== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@yandex-team.ru header.s=default header.b=eN4HyfPq; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=yandex-team.ru Return-Path: Received: from snail.vger.email (snail.vger.email. [2620:137:e000::3:7]) by mx.google.com with ESMTPS id u12-20020a17090341cc00b001c754f13381si320035ple.455.2023.12.04.07.23.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Dec 2023 07:23:03 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=fail header.i=@yandex-team.ru header.s=default header.b=eN4HyfPq; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=yandex-team.ru Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 8BC088069F17; Mon, 4 Dec 2023 07:23:01 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234424AbjLDPWu (ORCPT + 99 others); Mon, 4 Dec 2023 10:22:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39948 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234336AbjLDPWt (ORCPT ); Mon, 4 Dec 2023 10:22:49 -0500 X-Greylist: delayed 73 seconds by postgrey-1.37 at lindbergh.monkeyblade.net; Mon, 04 Dec 2023 07:22:49 PST Received: from forwardcorp1a.mail.yandex.net (forwardcorp1a.mail.yandex.net [178.154.239.72]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 15401B0; Mon, 4 Dec 2023 07:22:48 -0800 (PST) Received: from mail-nwsmtp-smtp-corp-main-83.vla.yp-c.yandex.net (mail-nwsmtp-smtp-corp-main-83.vla.yp-c.yandex.net [IPv6:2a02:6b8:c18:d11:0:640:6943:0]) by forwardcorp1a.mail.yandex.net (Yandex) with ESMTP id 8631C61530; Mon, 4 Dec 2023 18:21:27 +0300 (MSK) Received: from qavm-56efe012.qemu (kndrvt-dev.vla.yp-c.yandex.net [2a02:6b8:c0d:2682:0:696:4f00:0]) by mail-nwsmtp-smtp-corp-main-83.vla.yp-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id 1LcJwZ1Id8c0-1NoFKRS7; Mon, 04 Dec 2023 18:21:26 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.ru; s=default; t=1701703286; bh=IBcNs+ZlWGVhNZIdrbbSpZJavOkAvPGHqUVtzp/fagQ=; h=Message-Id:Date:Cc:Subject:To:From; b=eN4HyfPqHqBqvmcO7BZ0lvgBbXPEIV7QyIJoF3Lelbs1cnadkrGBVvZKgjMqCrKLd r6sOEZYwwPRn/QkLoS+hIYdCc3jh1PeAol2v6dHFZtHGJlB3siZulqpOCplso435uy dyyjOiNvVAL36uSc68bfbclzhcXlSnIZRCGKE+Gs= Authentication-Results: mail-nwsmtp-smtp-corp-main-83.vla.yp-c.yandex.net; dkim=pass header.i=@yandex-team.ru From: Lev Pantiukhin Cc: mitradir@yandex-team.ru, Lev Pantiukhin , Simon Horman , Julian Anastasov , "David S. Miller" , David Ahern , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Pablo Neira Ayuso , Jozsef Kadlecsik , Florian Westphal , linux-kernel@vger.kernel.org, netdev@vger.kernel.org, lvs-devel@vger.kernel.org, netfilter-devel@vger.kernel.org, coreteam@netfilter.org Subject: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler Date: Mon, 4 Dec 2023 18:20:17 +0300 Message-Id: <20231204152020.472247-1-kndrvt@yandex-team.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,HK_RANDOM_ENVFROM, HK_RANDOM_FROM,RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net To: unlisted-recipients:; (no To-header on input) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Mon, 04 Dec 2023 07:23:01 -0800 (PST) Maglev Hashing Stateless ======================== Introduction ------------ This patch to Linux kernel provides the following changes to IPVS: 1. Adds a new type (IP_VS_SVC_F_STATELESS) of scheduler that computes the need for connection entry addition; 2. Adds a new mhs (Maglev Hashing Stateless) scheduler based on the mh scheduler that implements a new algorithm (more details below); 3. Adds scheduling for ACK packets; 4. Adds destination sorting (more details below). This approach shows a significant reduction in CPU usage, even in the case of 10% of endpoints constantly flapping. It also makes the L4 balancer less vulnerable to DDoS activity. The Description of a New Algorithm ---------------------------------- This patch provides a modified version of the Maglev consistent hashing scheduling algorithm (scheduler mh). It simultaneously uses two hash tables instead of one. One of them is for old destinations, and the other (the candidate table) is for new ones. A hash key corresponds to two destinations, and if both hash tables point to the same destination, then the hash key is called stable; otherwise, it is called unstable. A new connection entry is created only in the event of an unstable hash key; otherwise, the packet goes through stateless processing. If the hash key is unstable: * In the case of a SYN packet, it will pick up the destination from the newer (candidate) hash table; * In the case of an ACK packet, it will use the old hash table. Upon changing the set of destinations, mhs populates a new candidate hash table and initializes a timer equal to the TCP session timeout. When the timer expires, the candidate hash table value is merged into the old hash table, and the corresponding hash key again becomes stable. If there are changes in the destinations before the timer expires, mhs overwrites the candidate hash table without the timer reset. If the set of destinations is unchanged, the connection tracking table will be empty. IPVS stores destinations in an unordered way, so the same destination set may generate different hash tables. To guarantee proper generation of the Maglev hash table, the sorting of the destination list was added. This is important in the case of destination flaps, which return the candidate hash table to its original state. This patch implements sorting via simple insertion with linear complexity. However, this complexity may be simplified. Signed-off-by: Lev Pantiukhin --- include/net/ip_vs.h | 6 + include/uapi/linux/ip_vs.h | 1 + net/netfilter/ipvs/Kconfig | 9 + net/netfilter/ipvs/Makefile | 1 + net/netfilter/ipvs/ip_vs_core.c | 34 +- net/netfilter/ipvs/ip_vs_ctl.c | 54 +- net/netfilter/ipvs/ip_vs_mhs.c | 740 +++++++++++++++++++++++++++ net/netfilter/ipvs/ip_vs_proto_tcp.c | 18 +- 8 files changed, 851 insertions(+), 12 deletions(-) create mode 100644 net/netfilter/ipvs/ip_vs_mhs.c diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h index ff406ef4fd4a..c3f0488bdd6a 100644 --- a/include/net/ip_vs.h +++ b/include/net/ip_vs.h @@ -778,6 +778,12 @@ struct ip_vs_scheduler { struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc, const struct sk_buff *skb, struct ip_vs_iphdr *iph); + + /* selecting 2 servers from the given service and choose one */ + struct ip_vs_dest* (*schedule_sl)(struct ip_vs_service *svc, + const struct sk_buff *skb, + struct ip_vs_iphdr *iph, + bool *need_state); }; /* The persistence engine object */ diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h index 1ed234e7f251..cc205c1c796c 100644 --- a/include/uapi/linux/ip_vs.h +++ b/include/uapi/linux/ip_vs.h @@ -24,6 +24,7 @@ #define IP_VS_SVC_F_SCHED1 0x0008 /* scheduler flag 1 */ #define IP_VS_SVC_F_SCHED2 0x0010 /* scheduler flag 2 */ #define IP_VS_SVC_F_SCHED3 0x0020 /* scheduler flag 3 */ +#define IP_VS_SVC_F_STATELESS 0x0040 /* stateless scheduling */ #define IP_VS_SVC_F_SCHED_SH_FALLBACK IP_VS_SVC_F_SCHED1 /* SH fallback */ #define IP_VS_SVC_F_SCHED_SH_PORT IP_VS_SVC_F_SCHED2 /* SH use port */ diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig index 2a3017b9c001..886b75c48551 100644 --- a/net/netfilter/ipvs/Kconfig +++ b/net/netfilter/ipvs/Kconfig @@ -246,6 +246,15 @@ config IP_VS_MH If you want to compile it in kernel, say Y. To compile it as a module, choose M here. If unsure, say N. +config IP_VS_MHS + tristate "stateless maglev hashing scheduling" + help + The usual Maglev consistent hashing scheduling algorithm provides + Google's Maglev hashing algorithm as an IPVS scheduler. + This is a modified version of maglev consistent hashing scheduling algorithm. + It simultaneously uses two hash tables instead of one. + One of them is for old destinations, and the other is for new ones. + config IP_VS_SED tristate "shortest expected delay scheduling" help diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile index bb5d8125c82a..ffe9977397e0 100644 --- a/net/netfilter/ipvs/Makefile +++ b/net/netfilter/ipvs/Makefile @@ -34,6 +34,7 @@ obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o obj-$(CONFIG_IP_VS_MH) += ip_vs_mh.o +obj-$(CONFIG_IP_VS_MHS) += ip_vs_mhs.o obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o obj-$(CONFIG_IP_VS_TWOS) += ip_vs_twos.o diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c index a2c16b501087..6aaf762c0a1d 100644 --- a/net/netfilter/ipvs/ip_vs_core.c +++ b/net/netfilter/ipvs/ip_vs_core.c @@ -449,6 +449,7 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, __be16 _ports[2], *pptr, cport, vport; const void *caddr, *vaddr; unsigned int flags; + bool need_state; *ignored = 1; /* @@ -525,7 +526,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, if (sched) { /* read svc->sched_data after svc->scheduler */ smp_rmb(); - dest = sched->schedule(svc, skb, iph); + /* we use distinct handler for stateless service */ + if (svc->flags & IP_VS_SVC_F_STATELESS) + dest = sched->schedule_sl(svc, skb, iph, &need_state); + else + dest = sched->schedule(svc, skb, iph); } else { dest = NULL; } @@ -534,9 +539,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, return NULL; } - flags = (svc->flags & IP_VS_SVC_F_ONEPACKET - && iph->protocol == IPPROTO_UDP) ? - IP_VS_CONN_F_ONE_PACKET : 0; + /* We use IP_VS_SVC_F_ONEPACKET flag to create no state */ + flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET && + iph->protocol == IPPROTO_UDP) || + (svc->flags & IP_VS_SVC_F_STATELESS && !need_state)) + ? IP_VS_CONN_F_ONE_PACKET : 0; /* * Create a connection entry. @@ -563,7 +570,10 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport), cp->flags, refcount_read(&cp->refcnt)); - ip_vs_conn_stats(cp, svc); + if (!(svc->flags & IP_VS_SVC_F_STATELESS) || + (svc->flags & IP_VS_SVC_F_STATELESS && need_state)) { + ip_vs_conn_stats(cp, svc); + } return cp; } @@ -1915,6 +1925,7 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state int ret, pkts; struct sock *sk; int af = state->pf; + struct ip_vs_service *svc; /* Already marked as IPVS request or reply? */ if (skb->ipvs_property) @@ -1990,6 +2001,19 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state cp = INDIRECT_CALL_1(pp->conn_in_get, ip_vs_conn_in_get_proto, ipvs, af, skb, &iph); + /* Don't use expired connection in stateless service case; + * otherwise reuse can maintain the number connection entries + */ + if (cp && cp->dest) { + svc = rcu_dereference(cp->dest->svc); + + if ((svc->flags & IP_VS_SVC_F_STATELESS) && + !(timer_pending(&cp->timer) && time_after(cp->timer.expires, jiffies))) { + __ip_vs_conn_put(cp); + cp = NULL; + } + } + if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) { int conn_reuse_mode = sysctl_conn_reuse_mode(ipvs); bool old_ct = false, resched = false; diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c index 143a341bbc0a..fda321edbd9c 100644 --- a/net/netfilter/ipvs/ip_vs_ctl.c +++ b/net/netfilter/ipvs/ip_vs_ctl.c @@ -960,6 +960,43 @@ void ip_vs_stats_free(struct ip_vs_stats *stats) } } +static int __ip_vs_mh_compare_dests(struct list_head *a, struct list_head *b) +{ + struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, n_list); + struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, n_list); + unsigned int i = 0; + __be32 diff; + + switch (dest_a->af) { + case AF_INET: + return (int)(dest_a->addr.ip - dest_b->addr.ip); + + case AF_INET6: + for (; i < ARRAY_SIZE(dest_a->addr.ip6); i++) { + diff = dest_a->addr.ip6[i] - dest_b->addr.ip6[i]; + if (diff) + return (int)diff; + } + } + + return 0; +} + +static struct list_head * +__ip_vs_find_insertion_place(struct list_head *new, struct list_head *head) +{ + struct list_head *p = head; + int ret; + + while ((p = p->next) != head) { + ret = __ip_vs_mh_compare_dests(new, p); + if (ret < 0) + break; + } + + return p->prev; +} + /* * Update a destination in the given service */ @@ -1038,7 +1075,10 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest, spin_unlock_bh(&dest->dst_lock); if (add) { - list_add_rcu(&dest->n_list, &svc->destinations); + /* sorting of dests list */ + list_add_rcu(&dest->n_list, + __ip_vs_find_insertion_place(&dest->n_list, + &svc->destinations)); svc->num_dests++; sched = rcu_dereference_protected(svc->scheduler, 1); if (sched && sched->add_dest) @@ -1276,7 +1316,9 @@ static void __ip_vs_unlink_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest, int svcupd) { - dest->flags &= ~IP_VS_DEST_F_AVAILABLE; + /* dest must be available from trash for stateless service */ + if (!(svc->flags & IP_VS_SVC_F_STATELESS)) + dest->flags &= ~IP_VS_DEST_F_AVAILABLE; /* * Remove it from the d-linked destination list. @@ -1440,6 +1482,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u, svc->port = u->port; svc->fwmark = u->fwmark; svc->flags = u->flags & ~IP_VS_SVC_F_HASHED; + if (!strcmp(u->sched_name, "mhs")) { + svc->flags |= IP_VS_SVC_F_STATELESS; + svc->flags &= ~IP_VS_SVC_F_PERSISTENT; + } svc->timeout = u->timeout * HZ; svc->netmask = u->netmask; svc->ipvs = ipvs; @@ -1578,6 +1624,10 @@ ip_vs_edit_service(struct ip_vs_service *svc, struct ip_vs_service_user_kern *u) * Set the flags and timeout value */ svc->flags = u->flags | IP_VS_SVC_F_HASHED; + if (!strcmp(u->sched_name, "mhs")) { + svc->flags |= IP_VS_SVC_F_STATELESS; + svc->flags &= ~IP_VS_SVC_F_PERSISTENT; + } svc->timeout = u->timeout * HZ; svc->netmask = u->netmask; diff --git a/net/netfilter/ipvs/ip_vs_mhs.c b/net/netfilter/ipvs/ip_vs_mhs.c new file mode 100644 index 000000000000..ab19ac0f5b02 --- /dev/null +++ b/net/netfilter/ipvs/ip_vs_mhs.c @@ -0,0 +1,740 @@ +// SPDX-License-Identifier: GPL-2.0 +/* IPVS: Stateless Maglev Hashing scheduling module + * + * Authors: Lev Pantiukhin + * + */ + +/* The mh algorithm is to assign a preference list of all the lookup + * table positions to each destination and populate the table with + * the most-preferred position of destinations. Then it is to select + * destination with the hash key of source IP address through looking + * up a the lookup table. + * The mhs algorithm is modificated stateless version of mh algorithm. + * It uses 2 look up tables and chooses one of 2 destinations. + * + * The mh algorithm is detailed in: + * [3.4 Consistent Hasing] +https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf + * + */ + +#define KMSG_COMPONENT "IPVS" +#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt + +#include +#include +#include +#include +#include + +#include + +#include +#include +#include + +#include + +#define IP_VS_SVC_F_SCHED_MH_FALLBACK IP_VS_SVC_F_SCHED1 /* MH fallback */ +#define IP_VS_SVC_F_SCHED_MH_PORT IP_VS_SVC_F_SCHED2 /* MH use port */ + +struct ip_vs_mhs_lookup { + struct ip_vs_dest __rcu *dest; /* real server (cache) */ +}; + +struct ip_vs_mhs_dest_setup { + unsigned int offset; /* starting offset */ + unsigned int skip; /* skip */ + unsigned int perm; /* next_offset */ + int turns; /* weight / gcd() and rshift */ +}; + +/* Available prime numbers for MH table */ +static int primes[] = {251, 509, 1021, 2039, 4093, + 8191, 16381, 32749, 65521, 131071}; + +/* For IPVS MH entry hash table */ +#ifndef CONFIG_IP_VS_MH_TAB_INDEX +#define CONFIG_IP_VS_MH_TAB_INDEX 12 +#endif +#define IP_VS_MH_TAB_BITS (CONFIG_IP_VS_MH_TAB_INDEX / 2) +#define IP_VS_MH_TAB_INDEX (CONFIG_IP_VS_MH_TAB_INDEX - 8) +#define IP_VS_MH_TAB_SIZE primes[IP_VS_MH_TAB_INDEX] + +struct ip_vs_mhs_state { + struct rcu_head rcu_head; + struct ip_vs_mhs_lookup *lookup; + struct ip_vs_mhs_dest_setup *dest_setup; + hsiphash_key_t hash1, hash2; + int gcd; + int rshift; +}; + +struct ip_vs_mhs_two_states { + struct ip_vs_mhs_state *first; + struct ip_vs_mhs_state *second; + ktime_t *timestamps; + ktime_t unstable_timeout; +}; + +struct ip_vs_mhs_two_dests { + struct ip_vs_dest *dest; + struct ip_vs_dest *new_dest; + bool unstable; +}; + +static inline bool +ip_vs_mhs_is_new_conn(const struct sk_buff *skb, struct ip_vs_iphdr *iph) +{ + switch (iph->protocol) { + case IPPROTO_TCP: { + struct tcphdr _tcph, *th; + + th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph); + if (!th) + return false; + return th->syn; + } + default: + return false; + } +} + +static inline void +generate_hash_secret(hsiphash_key_t *hash1, hsiphash_key_t *hash2) +{ + hash1->key[0] = 2654435761UL; + hash1->key[1] = 2654435761UL; + + hash2->key[0] = 2654446892UL; + hash2->key[1] = 2654446892UL; +} + +/* Returns hash value for IPVS MH entry */ +static inline unsigned int +ip_vs_mhs_hashkey(int af, const union nf_inet_addr *addr, __be16 port, + hsiphash_key_t *key, unsigned int offset) +{ + unsigned int v; + __be32 addr_fold = addr->ip; + +#ifdef CONFIG_IP_VS_IPV6 + if (af == AF_INET6) + addr_fold = addr->ip6[0] ^ addr->ip6[1] ^ + addr->ip6[2] ^ addr->ip6[3]; +#endif + v = (offset + ntohs(port) + ntohl(addr_fold)); + return hsiphash(&v, sizeof(v), key); +} + +/* Reset all the hash buckets of the specified table. */ +static void ip_vs_mhs_reset(struct ip_vs_mhs_state *s) +{ + int i; + struct ip_vs_mhs_lookup *l; + struct ip_vs_dest *dest; + + l = &s->lookup[0]; + for (i = 0; i < IP_VS_MH_TAB_SIZE; i++) { + dest = rcu_dereference_protected(l->dest, 1); + if (dest) { + ip_vs_dest_put(dest); + RCU_INIT_POINTER(l->dest, NULL); + } + l++; + } +} + +/* Update timestamps with new lookup table */ +static void +ip_vs_mhs_update_timestamps(struct ip_vs_mhs_two_states *states) +{ + unsigned int offset = 0; + + while (offset < IP_VS_MH_TAB_SIZE) { + if (states->first->lookup[offset].dest == + states->second->lookup[offset].dest) { + if (states->timestamps[offset]) { + /* stabilization */ + states->timestamps[offset] = (ktime_t)0; + } + } else { + if (!states->timestamps[offset]) { + /* destabilization */ + states->timestamps[offset] = ktime_get(); + } + } + ++offset; + } +} + +static int +ip_vs_mhs_permutate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc) +{ + struct list_head *p; + struct ip_vs_mhs_dest_setup *ds; + struct ip_vs_dest *dest; + int lw; + + /* If gcd is smaller then 1, number of dests or + * all weight of dests are zero. So, skip + * permutation for the dests. + */ + if (s->gcd < 1) + return 0; + + /* Set dest_setup for the dests permutation */ + p = &svc->destinations; + ds = &s->dest_setup[0]; + while ((p = p->next) != &svc->destinations) { + dest = list_entry(p, struct ip_vs_dest, n_list); + + ds->offset = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port, + &s->hash1, 0) % + IP_VS_MH_TAB_SIZE; + ds->skip = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port, + &s->hash2, 0) % + (IP_VS_MH_TAB_SIZE - 1) + 1; + ds->perm = ds->offset; + + lw = atomic_read(&dest->weight); + ds->turns = ((lw / s->gcd) >> s->rshift) ?: (lw != 0); + ds++; + } + return 0; +} + +static int +ip_vs_mhs_populate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc) +{ + int n, c, dt_count; + unsigned long *table; + struct list_head *p; + struct ip_vs_mhs_dest_setup *ds; + struct ip_vs_dest *dest, *new_dest; + + /* If gcd is smaller then 1, number of dests or + * all last_weight of dests are zero. So, skip + * the population for the dests and reset lookup table. + */ + if (s->gcd < 1) { + ip_vs_mhs_reset(s); + return 0; + } + + table = kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE), sizeof(unsigned long), + GFP_KERNEL); + if (!table) + return -ENOMEM; + + p = &svc->destinations; + n = 0; + dt_count = 0; + while (n < IP_VS_MH_TAB_SIZE) { + if (p == &svc->destinations) + p = p->next; + + ds = &s->dest_setup[0]; + while (p != &svc->destinations) { + /* Ignore added server with zero weight */ + if (ds->turns < 1) { + p = p->next; + ds++; + continue; + } + + c = ds->perm; + while (test_bit(c, table)) { + /* Add skip, mod s->tab_size */ + ds->perm += ds->skip; + if (ds->perm >= IP_VS_MH_TAB_SIZE) + ds->perm -= IP_VS_MH_TAB_SIZE; + c = ds->perm; + } + + __set_bit(c, table); + + dest = rcu_dereference_protected(s->lookup[c].dest, 1); + new_dest = list_entry(p, struct ip_vs_dest, n_list); + if (dest != new_dest) { + if (dest) + ip_vs_dest_put(dest); + ip_vs_dest_hold(new_dest); + RCU_INIT_POINTER(s->lookup[c].dest, new_dest); + } + + if (++n == IP_VS_MH_TAB_SIZE) + goto out; + + if (++dt_count >= ds->turns) { + dt_count = 0; + p = p->next; + ds++; + } + } + } + +out: + kfree(table); + return 0; +} + +/* Assign all the hash buckets of the specified table with the service. */ +static int +ip_vs_mhs_reassign(struct ip_vs_mhs_state *s, struct ip_vs_service *svc) +{ + int ret; + + if (svc->num_dests > IP_VS_MH_TAB_SIZE) + return -EINVAL; + + if (svc->num_dests >= 1) { + s->dest_setup = kcalloc(svc->num_dests, + sizeof(struct ip_vs_mhs_dest_setup), + GFP_KERNEL); + if (!s->dest_setup) + return -ENOMEM; + } + + ip_vs_mhs_permutate(s, svc); + + ret = ip_vs_mhs_populate(s, svc); + if (ret < 0) + goto out; + + IP_VS_DBG_BUF(6, "MHS: %s(): reassign lookup table of %s:%u\n", + __func__, + IP_VS_DBG_ADDR(svc->af, &svc->addr), + ntohs(svc->port)); + +out: + if (svc->num_dests >= 1) { + kfree(s->dest_setup); + s->dest_setup = NULL; + } + return ret; +} + +static int +ip_vs_mhs_gcd_weight(struct ip_vs_service *svc) +{ + struct ip_vs_dest *dest; + int weight; + int g = 0; + + list_for_each_entry(dest, &svc->destinations, n_list) { + weight = atomic_read(&dest->weight); + if (weight > 0) { + if (g > 0) + g = gcd(weight, g); + else + g = weight; + } + } + return g; +} + +/* To avoid assigning huge weight for the MH table, + * calculate shift value with gcd. + */ +static int +ip_vs_mhs_shift_weight(struct ip_vs_service *svc, int gcd) +{ + struct ip_vs_dest *dest; + int new_weight, weight = 0; + int mw, shift; + + /* If gcd is smaller then 1, number of dests or + * all weight of dests are zero. So, return + * shift value as zero. + */ + if (gcd < 1) + return 0; + + list_for_each_entry(dest, &svc->destinations, n_list) { + new_weight = atomic_read(&dest->weight); + if (new_weight > weight) + weight = new_weight; + } + + /* Because gcd is greater than zero, + * the maximum weight and gcd are always greater than zero + */ + mw = weight / gcd; + + /* shift = occupied bits of weight/gcd - MH highest bits */ + shift = fls(mw) - IP_VS_MH_TAB_BITS; + return (shift >= 0) ? shift : 0; +} + +static ktime_t +ip_vs_mhs_get_unstable_timeout(struct ip_vs_service *svc) +{ + struct ip_vs_proto_data *pd; + u64 tcp_to, tcp_fin_to; + + pd = ip_vs_proto_data_get(svc->ipvs, IPPROTO_TCP); + tcp_to = pd->timeout_table[IP_VS_TCP_S_ESTABLISHED]; + tcp_fin_to = pd->timeout_table[IP_VS_TCP_S_FIN_WAIT]; + return ns_to_ktime(jiffies64_to_nsecs(max(tcp_to, tcp_fin_to))); +} + +static void +ip_vs_mhs_state_free(struct rcu_head *head) +{ + struct ip_vs_mhs_state *s; + + s = container_of(head, struct ip_vs_mhs_state, rcu_head); + kfree(s->lookup); + kfree(s); +} + +static int +ip_vs_mhs_init_svc(struct ip_vs_service *svc) +{ + struct ip_vs_mhs_state *s0, *s1; + struct ip_vs_mhs_two_states *states; + ktime_t *tss; + int ret; + + /* Allocate timestamps */ + tss = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(ktime_t), GFP_KERNEL); + if (!tss) + return -ENOMEM; + + /* Allocate the first MH table for this service */ + s0 = kzalloc(sizeof(*s0), GFP_KERNEL); + if (!s0) { + kfree(tss); + return -ENOMEM; + } + + s0->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup), + GFP_KERNEL); + if (!s0->lookup) { + kfree(tss); + kfree(s0); + return -ENOMEM; + } + + generate_hash_secret(&s0->hash1, &s0->hash2); + s0->gcd = ip_vs_mhs_gcd_weight(svc); + s0->rshift = ip_vs_mhs_shift_weight(svc, s0->gcd); + + IP_VS_DBG(6, + "MHS: %s(): The first lookup table (memory=%zdbytes) allocated\n", + __func__, + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); + + /* Assign the first lookup table with current dests */ + ret = ip_vs_mhs_reassign(s0, svc); + if (ret < 0) { + kfree(tss); + ip_vs_mhs_reset(s0); + ip_vs_mhs_state_free(&s0->rcu_head); + return ret; + } + + /* Allocate the second MH table for this service */ + s1 = kzalloc(sizeof(*s1), GFP_KERNEL); + if (!s1) { + kfree(tss); + ip_vs_mhs_reset(s0); + ip_vs_mhs_state_free(&s0->rcu_head); + return -ENOMEM; + } + s1->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup), + GFP_KERNEL); + if (!s1->lookup) { + kfree(tss); + ip_vs_mhs_reset(s0); + ip_vs_mhs_state_free(&s0->rcu_head); + kfree(s1); + return -ENOMEM; + } + + s1->hash1 = s0->hash1; + s1->hash2 = s0->hash2; + s1->gcd = s0->gcd; + s1->rshift = s0->rshift; + + IP_VS_DBG(6, + "MHS: %s(): The second lookup table (memory=%zdbytes) allocated\n", + __func__, + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); + + /* Assign the second lookup table with current dests */ + ret = ip_vs_mhs_reassign(s1, svc); + if (ret < 0) { + kfree(tss); + ip_vs_mhs_reset(s0); + ip_vs_mhs_state_free(&s0->rcu_head); + ip_vs_mhs_reset(s1); + ip_vs_mhs_state_free(&s1->rcu_head); + return ret; + } + + /* Allocate, initialize and attach states */ + states = kcalloc(1, sizeof(struct ip_vs_mhs_two_states), GFP_KERNEL); + if (!states) { + kfree(tss); + ip_vs_mhs_reset(s0); + ip_vs_mhs_state_free(&s0->rcu_head); + ip_vs_mhs_reset(s1); + ip_vs_mhs_state_free(&s1->rcu_head); + return -ENOMEM; + } + + states->first = s0; + states->second = s1; + states->timestamps = tss; + states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc); + svc->sched_data = states; + return 0; +} + +static void +ip_vs_mhs_done_svc(struct ip_vs_service *svc) +{ + struct ip_vs_mhs_two_states *states = svc->sched_data; + + kfree(states->timestamps); + + /* Got to clean up the first lookup entry here */ + ip_vs_mhs_reset(states->first); + + call_rcu(&states->first->rcu_head, ip_vs_mhs_state_free); + IP_VS_DBG(6, + "MHS: The first MH lookup table (memory=%zdbytes) released\n", + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); + + /* Got to clean up the second lookup entry here */ + ip_vs_mhs_reset(states->second); + + call_rcu(&states->second->rcu_head, ip_vs_mhs_state_free); + IP_VS_DBG(6, + "MHS: The second MH lookup table (memory=%zdbytes) released\n", + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); + + kfree(states); +} + +static int +ip_vs_mhs_dest_changed(struct ip_vs_service *svc, + struct ip_vs_dest *dest) +{ + struct ip_vs_mhs_two_states *states = svc->sched_data; + struct ip_vs_mhs_state *s1 = states->second; + int ret; + + s1->gcd = ip_vs_mhs_gcd_weight(svc); + s1->rshift = ip_vs_mhs_shift_weight(svc, s1->gcd); + + /* Assign the lookup table with the updated service */ + ret = ip_vs_mhs_reassign(s1, svc); + + ip_vs_mhs_update_timestamps(states); + states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc); + IP_VS_DBG(6, + "MHS: %s: set unstable timeout: %llu", + __func__, + ktime_divns(states->unstable_timeout, + NSEC_PER_SEC)); + return ret; +} + +/* Helper function to get port number */ +static inline __be16 +ip_vs_mhs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph) +{ + __be16 _ports[2], *ports; + + /* At this point we know that we have a valid packet of some kind. + * Because ICMP packets are only guaranteed to have the first 8 + * bytes, let's just grab the ports. Fortunately they're in the + * same position for all three of the protocols we care about. + */ + switch (iph->protocol) { + case IPPROTO_TCP: + case IPPROTO_UDP: + case IPPROTO_SCTP: + ports = skb_header_pointer(skb, iph->len, sizeof(_ports), + &_ports); + if (unlikely(!ports)) + return 0; + + if (likely(!ip_vs_iph_inverse(iph))) + return ports[0]; + else + return ports[1]; + default: + return 0; + } +} + +/* Get ip_vs_dest associated with supplied parameters. */ +static inline void +ip_vs_mhs_get(struct ip_vs_service *svc, + struct ip_vs_mhs_two_states *states, + struct ip_vs_mhs_two_dests *dests, + const union nf_inet_addr *addr, + __be16 port) +{ + unsigned int hash; + ktime_t timestamp; + + hash = ip_vs_mhs_hashkey(svc->af, addr, port, &states->first->hash1, + 0) % IP_VS_MH_TAB_SIZE; + dests->dest = rcu_dereference(states->first->lookup[hash].dest); + dests->new_dest = rcu_dereference(states->second->lookup[hash].dest); + timestamp = states->timestamps[hash]; + + /* only unstable hashes have non-zero value */ + if (timestamp > 0) { + /* unstable */ + if (timestamp + states->unstable_timeout > ktime_get()) { + /* timer didn't expire */ + dests->unstable = true; + return; + } + /* unstable -> stable */ + if (dests->dest) + ip_vs_dest_put(dests->dest); + if (dests->new_dest) + ip_vs_dest_hold(dests->new_dest); + dests->dest = dests->new_dest; + RCU_INIT_POINTER(states->first->lookup[hash].dest, + dests->new_dest); + states->timestamps[hash] = (ktime_t)0; + } + /* stable */ + dests->unstable = false; +} + +/* Stateless Maglev Hashing scheduling */ +static struct ip_vs_dest * +ip_vs_mhs_schedule(struct ip_vs_service *svc, + const struct sk_buff *skb, + struct ip_vs_iphdr *iph, + bool *need_state) +{ + struct ip_vs_mhs_two_dests dests; + struct ip_vs_dest *final_dest = NULL; + struct ip_vs_mhs_two_states *states = svc->sched_data; + __be16 port = 0; + const union nf_inet_addr *hash_addr; + + *need_state = false; + hash_addr = ip_vs_iph_inverse(iph) ? &iph->daddr : &iph->saddr; + + if (svc->flags & IP_VS_SVC_F_SCHED_MH_PORT) + port = ip_vs_mhs_get_port(skb, iph); + + ip_vs_mhs_get(svc, states, &dests, hash_addr, port); + IP_VS_DBG_BUF(6, + "MHS: %s(): source IP address %s:%u --> server %s and %s\n", + __func__, + IP_VS_DBG_ADDR(svc->af, hash_addr), + ntohs(port), + dests.dest + ? IP_VS_DBG_ADDR(dests.dest->af, &dests.dest->addr) + : "NULL", + dests.new_dest + ? IP_VS_DBG_ADDR(dests.new_dest->af, + &dests.new_dest->addr) + : "NULL"); + + if (!dests.dest && !dests.new_dest) { + /* Both dests is NULL */ + return NULL; + } + + if (!(dests.dest && dests.new_dest)) { + /* dest is NULL or new_dest is NULL, + * so we send all packets to singular available dest + * and create state + */ + if (dests.new_dest) { + /* dest is NULL */ + final_dest = dests.new_dest; + } else { + /* new_dest is NULL */ + final_dest = dests.dest; + } + *need_state = true; + IP_VS_DBG(6, + "MHS: %s(): One dest, need_state=%s\n", + __func__, + *need_state ? "true" : "false"); + } else if (dests.unstable) { + /* unstable */ + if (iph->protocol == IPPROTO_TCP) { + /* TCP */ + *need_state = true; + if (ip_vs_mhs_is_new_conn(skb, iph)) { + /* SYN packet */ + final_dest = dests.new_dest; + IP_VS_DBG(6, + "MHS: %s(): Unstable, need_state=%s, SYN packet\n", + __func__, + *need_state ? "true" : "false"); + } else { + /* Not SYN packet */ + final_dest = dests.dest; + IP_VS_DBG(6, + "MHS: %s(): Unstable, need_state=%s, not SYN packet\n", + __func__, + *need_state ? "true" : "false"); + } + } else if (iph->protocol == IPPROTO_UDP) { + /* UDP */ + final_dest = dests.new_dest; + IP_VS_DBG(6, + "MHS: %s(): Unstable, need_state=%s, UDP packet\n", + __func__, + *need_state ? "true" : "false"); + } + } else { + /* stable */ + final_dest = dests.dest; + IP_VS_DBG(6, + "MHS: %s(): Stable, need_state=%s\n", + __func__, + *need_state ? "true" : "false"); + } + return final_dest; +} + +/* IPVS MHS Scheduler structure */ +static struct ip_vs_scheduler ip_vs_mhs_scheduler = { + .name = "mhs", + .refcnt = ATOMIC_INIT(0), + .module = THIS_MODULE, + .n_list = LIST_HEAD_INIT(ip_vs_mhs_scheduler.n_list), + .init_service = ip_vs_mhs_init_svc, + .done_service = ip_vs_mhs_done_svc, + .add_dest = ip_vs_mhs_dest_changed, + .del_dest = ip_vs_mhs_dest_changed, + .upd_dest = ip_vs_mhs_dest_changed, + .schedule_sl = ip_vs_mhs_schedule, +}; + +static int __init +ip_vs_mhs_init(void) +{ + return register_ip_vs_scheduler(&ip_vs_mhs_scheduler); +} + +static void __exit +ip_vs_mhs_cleanup(void) +{ + unregister_ip_vs_scheduler(&ip_vs_mhs_scheduler); + rcu_barrier(); +} + +module_init(ip_vs_mhs_init); +module_exit(ip_vs_mhs_cleanup); +MODULE_DESCRIPTION("Stateless Maglev hashing ipvs scheduler"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Lev Pantiukhin "); diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c b/net/netfilter/ipvs/ip_vs_proto_tcp.c index 7da51390cea6..31a8c1bfc863 100644 --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c @@ -38,7 +38,7 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, struct ip_vs_iphdr *iph) { struct ip_vs_service *svc; - struct tcphdr _tcph, *th; + struct tcphdr _tcph, *th = NULL; __be16 _ports[2], *ports = NULL; /* In the event of icmp, we're only guaranteed to have the first 8 @@ -47,11 +47,8 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, */ if (likely(!ip_vs_iph_icmp(iph))) { th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph); - if (th) { - if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn)) - return 1; + if (th) ports = &th->source; - } } else { ports = skb_header_pointer( skb, iph->len, sizeof(_ports), &_ports); @@ -74,6 +71,17 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, if (svc) { int ignored; + if (th) { + /* If sloppy_tcp or IP_VS_SVC_F_STATELESS is true, + * all SYN packets are scheduled except packets + * with set RST flag. + */ + if (!sysctl_sloppy_tcp(ipvs) && + !(svc->flags & IP_VS_SVC_F_STATELESS) && + (!th->syn || th->rst)) + return 1; + } + if (ip_vs_todrop(ipvs)) { /* * It seems that we are very loaded. -- 2.17.1