Subject: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]

From: Abhijeet Rastogi <[email protected]>

Current range [8, 20] is set purely due to historical reasons
because at the time, ~1M (2^20) was considered sufficient.

Previous change regarding this limit is here.

Link: https://lore.kernel.org/all/86eabeb9dd62aebf1e2533926fdd13fed48bab1f.1631289960.git.aclaudi@redhat.com/T/#u

Signed-off-by: Abhijeet Rastogi <[email protected]>
---
The conversation for this started at:

https://www.spinics.net/lists/netfilter/msg60995.html

The upper limit for algo is any bit size less than 32, so this
change will allow us to set bit size > 20. Today, it is common to have
RAM available to handle greater than 2^20 connections per-host.

Distros like RHEL already have higher limits set.
---
net/netfilter/ipvs/Kconfig | 4 ++--
net/netfilter/ipvs/ip_vs_conn.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 271da8447b29..3e3371f8c0f9 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -44,7 +44,7 @@ config IP_VS_DEBUG

config IP_VS_TAB_BITS
int "IPVS connection table size (the Nth power of 2)"
- range 8 20
+ range 8 31
default 12
help
The IPVS connection hash table uses the chaining scheme to handle
@@ -54,7 +54,7 @@ config IP_VS_TAB_BITS

Note the table size must be power of 2. The table size will be the
value of 2 to the your input number power. The number to choose is
- from 8 to 20, the default number is 12, which means the table size
+ from 8 to 31, the default number is 12, which means the table size
is 4096. Don't input the number too small, otherwise you will lose
performance on it. You can adapt the table size yourself, according
to your virtual server application. It is good to set the table size
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 13534e02346c..bc0fe1a698d4 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1484,8 +1484,8 @@ int __init ip_vs_conn_init(void)
int idx;

/* Compute size and mask */
- if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 20) {
- pr_info("conn_tab_bits not in [8, 20]. Using default value\n");
+ if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 31) {
+ pr_info("conn_tab_bits not in [8, 31]. Using default value\n");
ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
}
ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;

---
base-commit: 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
change-id: 20230412-increase_ipvs_conn_tab_bits-4322c90da216

Best regards,
--
Abhijeet Rastogi <[email protected]>


2023-04-13 08:13:01

by Simon Horman

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]

On Wed, Apr 12, 2023 at 01:49:08PM -0700, Abhijeet Rastogi via B4 Relay wrote:
> From: Abhijeet Rastogi <[email protected]>
>
> Current range [8, 20] is set purely due to historical reasons
> because at the time, ~1M (2^20) was considered sufficient.
>
> Previous change regarding this limit is here.
>
> Link: https://lore.kernel.org/all/86eabeb9dd62aebf1e2533926fdd13fed48bab1f.1631289960.git.aclaudi@redhat.com/T/#u
>
> Signed-off-by: Abhijeet Rastogi <[email protected]>
> ---

Hi Abhijeet,

> The conversation for this started at:
>
> https://www.spinics.net/lists/netfilter/msg60995.html

'The 20 bit (1m entries) ceiling exists since the original merge of ipvs
in 2003, so likely this was just considered "big enough" back then.'

Yes, that matches my recollection.

There were probably also concerns about the viability of making
larger allocations at the time on the kinds of systems where
IPVS would be deployed.

On the allocation theme, I do note that 2^31 does lead to a substantial
vmalloc allocation regardless of actual usage. Probably it would be best
to move IPVS to use rhashtable(). But that is obviously a much more
invasive change.

In any case, I think this patch is an improvement on the current situation.

Acked-by: Simon Horman <[email protected]>

>
> The upper limit for algo is any bit size less than 32, so this
> change will allow us to set bit size > 20. Today, it is common to have
> RAM available to handle greater than 2^20 connections per-host.
>
> Distros like RHEL already have higher limits set.

...

2023-04-13 10:48:01

by Andrea Claudi

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]

On Wed, Apr 12, 2023 at 01:49:08PM -0700, Abhijeet Rastogi via B4 Relay wrote:
> From: Abhijeet Rastogi <[email protected]>
>
> Current range [8, 20] is set purely due to historical reasons
> because at the time, ~1M (2^20) was considered sufficient.
>
> Previous change regarding this limit is here.
>
> Link: https://lore.kernel.org/all/86eabeb9dd62aebf1e2533926fdd13fed48bab1f.1631289960.git.aclaudi@redhat.com/T/#u
>
> Signed-off-by: Abhijeet Rastogi <[email protected]>
> ---
> The conversation for this started at:
>
> https://www.spinics.net/lists/netfilter/msg60995.html
>
> The upper limit for algo is any bit size less than 32, so this
> change will allow us to set bit size > 20. Today, it is common to have
> RAM available to handle greater than 2^20 connections per-host.
>
> Distros like RHEL already have higher limits set.

Hi Abhijeet,
for the record, RHEL ships with CONFIG_IP_VS_TAB_BITS set to 12 as
default.

> ---
> net/netfilter/ipvs/Kconfig | 4 ++--
> net/netfilter/ipvs/ip_vs_conn.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index 271da8447b29..3e3371f8c0f9 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -44,7 +44,7 @@ config IP_VS_DEBUG
>
> config IP_VS_TAB_BITS
> int "IPVS connection table size (the Nth power of 2)"
> - range 8 20
> + range 8 31
> default 12
> help
> The IPVS connection hash table uses the chaining scheme to handle
> @@ -54,7 +54,7 @@ config IP_VS_TAB_BITS
>
> Note the table size must be power of 2. The table size will be the
> value of 2 to the your input number power. The number to choose is
> - from 8 to 20, the default number is 12, which means the table size
> + from 8 to 31, the default number is 12, which means the table size
> is 4096. Don't input the number too small, otherwise you will lose
> performance on it. You can adapt the table size yourself, according
> to your virtual server application. It is good to set the table size
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 13534e02346c..bc0fe1a698d4 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1484,8 +1484,8 @@ int __init ip_vs_conn_init(void)
> int idx;
>
> /* Compute size and mask */
> - if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 20) {
> - pr_info("conn_tab_bits not in [8, 20]. Using default value\n");
> + if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 31) {
> + pr_info("conn_tab_bits not in [8, 31]. Using default value\n");
> ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
> }
> ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
>
> ---
> base-commit: 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
> change-id: 20230412-increase_ipvs_conn_tab_bits-4322c90da216
>
> Best regards,
> --
> Abhijeet Rastogi <[email protected]>
>

Looks good to me.

Reviewed-by: Andrea Claudi <[email protected]>

2023-04-13 12:15:57

by Julian Anastasov

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]


Hello,

On Wed, 12 Apr 2023, Abhijeet Rastogi via B4 Relay wrote:

> From: Abhijeet Rastogi <[email protected]>
>
> Current range [8, 20] is set purely due to historical reasons
> because at the time, ~1M (2^20) was considered sufficient.
>
> Previous change regarding this limit is here.
>
> Link: https://lore.kernel.org/all/86eabeb9dd62aebf1e2533926fdd13fed48bab1f.1631289960.git.aclaudi@redhat.com/T/#u
>
> Signed-off-by: Abhijeet Rastogi <[email protected]>
> ---
> The conversation for this started at:
>
> https://www.spinics.net/lists/netfilter/msg60995.html
>
> The upper limit for algo is any bit size less than 32, so this
> change will allow us to set bit size > 20. Today, it is common to have
> RAM available to handle greater than 2^20 connections per-host.

This is not a limit of number of connections. I prefer
not to allow value above 24 without adding checks for the
available memory, this more concern for 32-bit. Blindly
allocating 2^20 (1048576 pointers which is 8MB) should not
cause OOM but selecting large value and then using this
kernel on boxes with less memory is dangerous.

> Distros like RHEL already have higher limits set.
> ---
> net/netfilter/ipvs/Kconfig | 4 ++--
> net/netfilter/ipvs/ip_vs_conn.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index 271da8447b29..3e3371f8c0f9 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -44,7 +44,7 @@ config IP_VS_DEBUG
>
> config IP_VS_TAB_BITS
> int "IPVS connection table size (the Nth power of 2)"
> - range 8 20
> + range 8 31
> default 12
> help
> The IPVS connection hash table uses the chaining scheme to handle
> @@ -54,7 +54,7 @@ config IP_VS_TAB_BITS
>
> Note the table size must be power of 2. The table size will be the
> value of 2 to the your input number power. The number to choose is
> - from 8 to 20, the default number is 12, which means the table size
> + from 8 to 31, the default number is 12, which means the table size
> is 4096. Don't input the number too small, otherwise you will lose
> performance on it. You can adapt the table size yourself, according
> to your virtual server application. It is good to set the table size
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 13534e02346c..bc0fe1a698d4 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1484,8 +1484,8 @@ int __init ip_vs_conn_init(void)
> int idx;
>
> /* Compute size and mask */
> - if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 20) {
> - pr_info("conn_tab_bits not in [8, 20]. Using default value\n");
> + if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 31) {
> + pr_info("conn_tab_bits not in [8, 31]. Using default value\n");
> ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
> }
> ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
>
> ---
> base-commit: 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
> change-id: 20230412-increase_ipvs_conn_tab_bits-4322c90da216
>
> Best regards,
> --
> Abhijeet Rastogi <[email protected]>

Regards

--
Julian Anastasov <[email protected]>

2023-04-14 02:04:59

by Abhijeet Rastogi

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]

Hi Simon, Andrea and Julian,

I really appreciate you taking the time to respond to my patch. Some follow up
questions that I'll appreciate a response for.

@Simon Horman
>In any case, I think this patch is an improvement on the current situation.

+1 to this. I wanted to add that, we're not changing the defaults
here, the default still stays at 2^12. If a kernel user changes the
default, they probably already know what the limitations are, so I
personally don't think it is a big concern.

@Andrea Claudi
>for the record, RHEL ships with CONFIG_IP_VS_TAB_BITS set to 12 as
default.

Sorry, I should have been clearer. RHEL ships with the same default,
yes, but it doesn't have the range check, at least, on the version I'm
using right now (3.10.0-1160.62.1.el7.x86_64).

On this version, I'm able to load with bit size 30, 31 gives me error
regarding allocating memory (64GB host) and anything beyond 31 is
mysteriously switched to a lower number. The following dmesg on my
host confirms that the bitsize 30 worked, which is not possible
without a patch on the current kernel version.

"[Fri Apr 14 01:14:51 2023] IPVS: Connection hash table configured (size=1073741
824, memory=16777216Kbytes)"

@Julian Anastasov,
>This is not a limit of number of connections. I prefer
not to allow value above 24 without adding checks for the
available memory,

Interesting that you brought up that number 24, that is exactly what
we use in production today. One IPVS node is able to handle spikes of
10M active connections without issues. This patch idea originated as
my company is migrating from the ancient RHEL version to a somewhat
newer CentOS (5.* kernel) and noticed that we were unable to load the
ip_vs kernel module with anything greater than 20 bits. Another
motivation for kernel upgrade is utilizing maglev to reduce table size
but that's out of context in this discussion.

My request is, can we increase the range from 20 to something larger?
If 31 seems a bit excessive, maybe, we can settle for something like
[8,30] or even lower. With conn_tab_bits=30, it allocates 16GB at
initialization time, it is not entirely absurd by today's standards.

I can revise my patch to a lower range as you guys see fit.

--
Cheers,
Abhijeet (https://abhi.host)

2023-04-14 14:02:20

by Julian Anastasov

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]


Hello,

On Thu, 13 Apr 2023, Abhijeet Rastogi wrote:

> Hi Simon, Andrea and Julian,
>
> I really appreciate you taking the time to respond to my patch. Some follow up
> questions that I'll appreciate a response for.
>
> @Simon Horman
> >In any case, I think this patch is an improvement on the current situation.
>
> +1 to this. I wanted to add that, we're not changing the defaults
> here, the default still stays at 2^12. If a kernel user changes the
> default, they probably already know what the limitations are, so I
> personally don't think it is a big concern.
>
> @Andrea Claudi
> >for the record, RHEL ships with CONFIG_IP_VS_TAB_BITS set to 12 as
> default.
>
> Sorry, I should have been clearer. RHEL ships with the same default,
> yes, but it doesn't have the range check, at least, on the version I'm
> using right now (3.10.0-1160.62.1.el7.x86_64).
>
> On this version, I'm able to load with bit size 30, 31 gives me error
> regarding allocating memory (64GB host) and anything beyond 31 is
> mysteriously switched to a lower number. The following dmesg on my
> host confirms that the bitsize 30 worked, which is not possible
> without a patch on the current kernel version.
>
> "[Fri Apr 14 01:14:51 2023] IPVS: Connection hash table configured (size=1073741
> 824, memory=16777216Kbytes)"
>
> @Julian Anastasov,
> >This is not a limit of number of connections. I prefer
> not to allow value above 24 without adding checks for the
> available memory,
>
> Interesting that you brought up that number 24, that is exactly what
> we use in production today. One IPVS node is able to handle spikes of
> 10M active connections without issues. This patch idea originated as
> my company is migrating from the ancient RHEL version to a somewhat
> newer CentOS (5.* kernel) and noticed that we were unable to load the
> ip_vs kernel module with anything greater than 20 bits. Another
> motivation for kernel upgrade is utilizing maglev to reduce table size
> but that's out of context in this discussion.
>
> My request is, can we increase the range from 20 to something larger?
> If 31 seems a bit excessive, maybe, we can settle for something like
> [8,30] or even lower. With conn_tab_bits=30, it allocates 16GB at
> initialization time, it is not entirely absurd by today's standards.
>
> I can revise my patch to a lower range as you guys see fit.

Some 32-bit platforms have a 120MB limit for
vmalloc. 24-bit table on 32-bit box will allocate 64MB.

One way to solve the problem is to use in Kconfig:

range 8 20 if !64BIT
range 8 27 if 64BIT

Why 30 and above do not work? Because we store the
size, mask in 'int' which is 32 bits. But also some places do not
allow allocations above INT_MAX, for example, kvmalloc_node().
So, even 28 may not work for 8-byte array items on 64-bit.

It would be good to check if the provided
value does not exceed some real limits. Here is an example
that assumes IPVS will allocate up to 1/8 of the memory,
8 conns average in a hash row. Such checks should not
exceed the small vmalloc area for 32-bit boxes and also
kvmalloc allows vmalloc with huge pages. This idea is
entirely untested/compiled. These checks apply some
sane thresholds. If you need something above, you are
probably allocating more than needed.

/* This will match the Kconfig range: */
int min = 8;
#if __BITS_PER_LONG > 32
int max = 27;
#else
int max = 20;
#endif

We can safely use 27 in Kconfig even for 32-bit
due to the below checks, they will clamp it to lower value.

/* Order of the available memory */
int max_avail = order_base_2(totalram_pages()) + PAGE_SHIFT;

We can remove this 'if' check:
if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 20) {
pr_info("conn_tab_bits not in [8, 20]. Using default value\n");
ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
}

max_avail -= 3; /* ~8 in hash row */
max_avail -= 3; /* IPVS up to 1/8 of mem */
/* The hash table links allocated memory for IPVS conns */
max_avail -= order_base_2(sizeof(struct ip_vs_conn));
/* Range should not exceed the available memory */
max = clamp(max, min, max_avail);
/* Clamp configured value silently */
ip_vs_conn_tab_bits = clamp_val(ip_vs_conn_tab_bits, min, max);
ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
ip_vs_conn_tab_mask = ip_vs_conn_tab_size - 1;

/* Switch to kvmalloc */
ip_vs_conn_tab = kvmalloc_array(ip_vs_conn_tab_size,
sizeof(*ip_vs_conn_tab), GFP_KERNEL);

and use everywhere kvfree(ip_vs_conn_tab);

For 64GB box the calcs should be:

max_avail = 36 - 3 - 3 - 9 => ip_vs_conn_tab_bits = 21
Allocated hash table: (2^21)*8=16MB
Allocated for IPVS conns (8 cols per row): (2^21)*8*(400..512)=6..8GB
which is ~1/8 of 64GB. All memory will be allocated with
~64 conns per row. May be the above calcs can be changed
to ~4 cols and 1/2 mem to use 128MB (24 bits instead of 21)
for our example: 36 - 2 - 1 - 9 => 24.

Possible problems if using large table that is
not loaded with enough conns:

- walking the table will cost more cycles, for example,
ip_vs_random_dropentry() wants to walk part of the table
every second. Even normal netns cleanup has to walk it.

- cat /proc/net/ip_vs_conn will be slower

Regards

--
Julian Anastasov <[email protected]>

2023-04-18 12:12:29

by Andrea Claudi

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]

On Thu, Apr 13, 2023 at 06:58:06PM -0700, Abhijeet Rastogi wrote:
> Hi Simon, Andrea and Julian,
>
> I really appreciate you taking the time to respond to my patch. Some follow up
> questions that I'll appreciate a response for.
>
> @Simon Horman
> >In any case, I think this patch is an improvement on the current situation.
>
> +1 to this. I wanted to add that, we're not changing the defaults
> here, the default still stays at 2^12. If a kernel user changes the
> default, they probably already know what the limitations are, so I
> personally don't think it is a big concern.
>
> @Andrea Claudi
> >for the record, RHEL ships with CONFIG_IP_VS_TAB_BITS set to 12 as
> default.
>
> Sorry, I should have been clearer. RHEL ships with the same default,
> yes, but it doesn't have the range check, at least, on the version I'm
> using right now (3.10.0-1160.62.1.el7.x86_64).
>
> On this version, I'm able to load with bit size 30, 31 gives me error
> regarding allocating memory (64GB host) and anything beyond 31 is
> mysteriously switched to a lower number. The following dmesg on my
> host confirms that the bitsize 30 worked, which is not possible
> without a patch on the current kernel version.
>
> "[Fri Apr 14 01:14:51 2023] IPVS: Connection hash table configured (size=1073741
> 824, memory=16777216Kbytes)"

I see. This makes sense to me as RHEL 7 does not include the range
check, while RHEL 8 and RHEL 9 both includes it.

The reason why any number beyond 31 results in a lower number is to be
searched in gcc implementation. IIRC shifting an int by more than 31 or
less than 0 results in an undefined behaviour, according to the C
standard.

>
> @Julian Anastasov,
> >This is not a limit of number of connections. I prefer
> not to allow value above 24 without adding checks for the
> available memory,
>
> Interesting that you brought up that number 24, that is exactly what
> we use in production today. One IPVS node is able to handle spikes of
> 10M active connections without issues. This patch idea originated as
> my company is migrating from the ancient RHEL version to a somewhat
> newer CentOS (5.* kernel) and noticed that we were unable to load the
> ip_vs kernel module with anything greater than 20 bits. Another
> motivation for kernel upgrade is utilizing maglev to reduce table size
> but that's out of context in this discussion.
>
> My request is, can we increase the range from 20 to something larger?
> If 31 seems a bit excessive, maybe, we can settle for something like
> [8,30] or even lower. With conn_tab_bits=30, it allocates 16GB at
> initialization time, it is not entirely absurd by today's standards.
>
> I can revise my patch to a lower range as you guys see fit.
>
> --
> Cheers,
> Abhijeet (https://abhi.host)
>

2023-05-14 21:12:19

by Abhijeet Rastogi

[permalink] [raw]
Subject: Re: [PATCH] ipvs: change ip_vs_conn_tab_bits range to [8,31]

> One way to solve the problem is to use in Kconfig:
>
> range 8 20 if !64BIT
> range 8 27 if 64BIT

Thanks @Julian Anastasov. I appreciate the detailed response around
why these limits exist. Personally, I won't be able to own the task of
making these checks more intelligent, but for now, I wonder if it
would be okay to accept the range increase to 27.

I am sending a v2 patch to set a higher limit to 27.

Thanks,
Abhijeet