2023-02-12 06:01:56

by Wang Yugui

[permalink] [raw]
Subject: question about the performance impact of sec=krb5

Hi,

question about the performance of sec=krb5.

https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
Performance impact of krb5:
Average IOPS decreased by 53%
Average throughput decreased by 53%
Average latency increased by 3.2 ms

and then in 'man 5 nfs'
sec=krb5 provides cryptographic proof of a user's identity in each RPC request.

Is there an option with better performance that checks krb5 only at mount.nfs4 time,
not on each file access?

Best Regards
Wang Yugui ([email protected])
2023/02/12




2023-02-12 17:47:40

by Chuck Lever

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5



> On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
>
> Hi,
>
> question about the performance of sec=krb5.
>
> https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> Performance impact of krb5:
> Average IOPS decreased by 53%
> Average throughput decreased by 53%
> Average latency increased by 3.2 ms

Looking at the numbers in this article... they don't
seem quite right. Here are the others:

> Performance impact of krb5i:
> • Average IOPS decreased by 55%
> • Average throughput decreased by 55%
> • Average latency increased by 0.6 ms
> Performance impact of krb5p:
> • Average IOPS decreased by 77%
> • Average throughput decreased by 77%
> • Average latency increased by 1.6 ms

I would expect krb5p to be the worst in terms of
latency. And I would like to see round-trip numbers
reported: what part of the increase in latency is
due to server versus client processing?

This is also remarkable:

> When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.


So, does this mean that nconnect makes the GSS sequence
window problem worse, or that when a window underrun
occurs it has broader impact because multiple connections
are affected?

Seems like maybe nconnect should set up a unique GSS
context for each xprt. It would be helpful to file a bug.


> and then in 'man 5 nfs'
> sec=krb5 provides cryptographic proof of a user's identity in each RPC request.

Kerberos has performance impacts due to the crypto-
graphic operations that are performed on even small
fixed-sized sections of each RPC message, when using
sec=krb5 (no 'i' or 'p').


> Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> not on each file access?

If you mount with NFSv4 and sec=sys from a Linux NFS
client that has a keytab, the client will attempt to
use krb5i for lease management operations (such as
EXCHANGE_ID) but it will continue to use sec=sys for
user authentication. That's not terribly secure.

A better answer would be to make Kerberos faster.
I've done some recent work on improving the overhead
of using message digest algorithms with GSS-API, but
haven't done any specific measurement. I'm sure
there's more room for optimization.

Even better would be to use a transport layer security
service. Amazon has EFS and Oracle Cloud has something
similar, but we're working on a standard approach that
uses TLSv1.3.


--
Chuck Lever



2023-02-12 22:46:00

by Wang Yugui

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

Hi,

>
>
> > On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
> >
> > Hi,
> >
> > question about the performance of sec=krb5.
> >
> > https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> > Performance impact of krb5:
> > Average IOPS decreased by 53%
> > Average throughput decreased by 53%
> > Average latency increased by 3.2 ms
>
> Looking at the numbers in this article... they don't
> seem quite right. Here are the others:
>
> > Performance impact of krb5i:
> > • Average IOPS decreased by 55%
> > • Average throughput decreased by 55%
> > • Average latency increased by 0.6 ms
> > Performance impact of krb5p:
> > • Average IOPS decreased by 77%
> > • Average throughput decreased by 77%
> > • Average latency increased by 1.6 ms
>
> I would expect krb5p to be the worst in terms of
> latency. And I would like to see round-trip numbers
> reported: what part of the increase in latency is
> due to server versus client processing?
>
> This is also remarkable:
>
> > When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
>
>
> So, does this mean that nconnect makes the GSS sequence
> window problem worse, or that when a window underrun
> occurs it has broader impact because multiple connections
> are affected?
>
> Seems like maybe nconnect should set up a unique GSS
> context for each xprt. It would be helpful to file a bug.
>
>
> > and then in 'man 5 nfs'
> > sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
>
> Kerberos has performance impacts due to the crypto-
> graphic operations that are performed on even small
> fixed-sized sections of each RPC message, when using
> sec=krb5 (no 'i' or 'p').
>
>
> > Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> > not on each file access?
>
> If you mount with NFSv4 and sec=sys from a Linux NFS
> client that has a keytab, the client will attempt to
> use krb5i for lease management operations (such as
> EXCHANGE_ID) but it will continue to use sec=sys for
> user authentication. That's not terribly secure.

I noticed this behavior in the following case:
- the NFS client has joined the Windows AD (so it has a keytab)
- the Windows AD server is shut down.
Then 'mount.nfs4 -o sec=sys' takes about 3 min,
because there is a 60s timeout, retried 3 times, inside.
But doesn't 'sec=sys' avoid the need for any krb5 operations?

Maybe we could have another krb5 mode, such as 'krb5l':
- the NFS client must have a keytab.
- krb5 is used only at mount.nfs4 time.
Wouldn't that be more secure than the IP address check in /etc/exports?

Best Regards
Wang Yugui ([email protected])
2023/02/13


>
> A better answer would be to make Kerberos faster.
> I've done some recent work on improving the overhead
> of using message digest algorithms with GSS-API, but
> haven't done any specific measurement. I'm sure
> there's more room for optimization.
>
> Even better would be to use a transport layer security
> service. Amazon has EFS and Oracle Cloud has something
> similar, but we're working on a standard approach that
> uses TLSv1.3.
>
>
> --
> Chuck Lever
>
>
>



2023-02-13 01:07:26

by Chuck Lever

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5



> On Feb 12, 2023, at 5:45 PM, Wang Yugui <[email protected]> wrote:
>
> Hi,
>
>>
>>
>>> On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> question about the performance of sec=krb5.
>>>
>>> https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
>>> Performance impact of krb5:
>>> Average IOPS decreased by 53%
>>> Average throughput decreased by 53%
>>> Average latency increased by 3.2 ms
>>
>> Looking at the numbers in this article... they don't
>> seem quite right. Here are the others:
>>
>>> Performance impact of krb5i:
>>> • Average IOPS decreased by 55%
>>> • Average throughput decreased by 55%
>>> • Average latency increased by 0.6 ms
>>> Performance impact of krb5p:
>>> • Average IOPS decreased by 77%
>>> • Average throughput decreased by 77%
>>> • Average latency increased by 1.6 ms
>>
>> I would expect krb5p to be the worst in terms of
>> latency. And I would like to see round-trip numbers
>> reported: what part of the increase in latency is
>> due to server versus client processing?
>>
>> This is also remarkable:
>>
>>> When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
>>
>>
>> So, does this mean that nconnect makes the GSS sequence
>> window problem worse, or that when a window underrun
>> occurs it has broader impact because multiple connections
>> are affected?
>>
>> Seems like maybe nconnect should set up a unique GSS
>> context for each xprt. It would be helpful to file a bug.
>>
>>
>>> and then in 'man 5 nfs'
>>> sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
>>
>> Kerberos has performance impacts due to the crypto-
>> graphic operations that are performed on even small
>> fixed-sized sections of each RPC message, when using
>> sec=krb5 (no 'i' or 'p').
>>
>>
>>> Is there an option with better performance that checks krb5 only at mount.nfs4 time,
>>> not on each file access?
>>
>> If you mount with NFSv4 and sec=sys from a Linux NFS
>> client that has a keytab, the client will attempt to
>> use krb5i for lease management operations (such as
>> EXCHANGE_ID) but it will continue to use sec=sys for
>> user authentication. That's not terribly secure.
>
> I noticed this behavior in the following case:
> - the NFS client has joined the Windows AD (so it has a keytab)
> - the Windows AD server is shut down.
> Then 'mount.nfs4 -o sec=sys' takes about 3 min,
> because there is a 60s timeout, retried 3 times, inside.
> But doesn't 'sec=sys' avoid the need for any krb5 operations?

I would expect some bad behavior in this case: the
client is using Kerberos while part of the network
service infrastructure is not available to it. It's
going to hang.

If you don't want sec=sys to hang, then either don't
take the AD offline, don't put a keytab on the client,
or don't use NFSv4.


> Maybe we could have another krb5 mode, such as 'krb5l':
> - the NFS client must have a keytab.
> - krb5 is used only at mount.nfs4 time.

It's not that simple.

All mounts of that server on that client share the
same lease, whether they are sec=sys or sec=krb5*. The
krb5 mounts must use krb5 for lease management; the
sec=sys mounts may use it, but don't have to.

What's more, when the client reboots, it needs to re-
identify itself to the server using the same credential,
no matter which order the mounts are re-established --
sys first or krb5 first.

Or, more generally speaking, when a keytab is present,
even if the client has only sec=sys mounts at this moment,
it might establish a sec=krb5 mount at any time in the
future. For instance, consider the case where only sec=sys
mounts reside in /etc/fstab that get mounted at boot time,
but there are sec=krb5 mounts in an automounter map that
get pulled in when a user accesses them.

In other words, it's not a per-mount setting, and it has
to be the same principal and security flavor after every
client reboot. We picked an appropriate level of security
for lease management that meets these requirements. The
only choice is to use Kerberos if there is even the
possibility that sec=krb5* can be used.

It might be surprising behavior, until you realize this
is kind of the only way it can work with a single lease
per client. Plus it encourages better security.


> Wouldn't that be more secure than the IP address check in /etc/exports?

Well, it would provide some degree of peer authentication
based on whatever principal is available on the client
(a host service principal or some user that wants to
provide a password for this purpose).

But then user I/O requests would use AUTH_SYS, which is
trivial to alter while the RPC messages transit an open
network. That's what I meant by not terribly secure. But
better than all AUTH_SYS, sure.


> Best Regards
> Wang Yugui ([email protected])
> 2023/02/13
>
>
>>
>> A better answer would be to make Kerberos faster.
>> I've done some recent work on improving the overhead
>> of using message digest algorithms with GSS-API, but
>> haven't done any specific measurement. I'm sure
>> there's more room for optimization.
>>
>> Even better would be to use a transport layer security
>> service. Amazon has EFS and Oracle Cloud has something
>> similar, but we're working on a standard approach that
>> uses TLSv1.3.
>>
>>
>> --
>> Chuck Lever

--
Chuck Lever




2023-02-13 04:30:41

by Rick Macklem

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

On Sun, Feb 12, 2023 at 9:47 AM Chuck Lever III <[email protected]> wrote:
>
>
>
>
>
> > On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
> >
> > Hi,
> >
> > question about the performance of sec=krb5.
> >
> > https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> > Performance impact of krb5:
> > Average IOPS decreased by 53%
> > Average throughput decreased by 53%
> > Average latency increased by 3.2 ms
>
> Looking at the numbers in this article... they don't
> seem quite right. Here are the others:
>
> > Performance impact of krb5i:
> > • Average IOPS decreased by 55%
> > • Average throughput decreased by 55%
> > • Average latency increased by 0.6 ms
> > Performance impact of krb5p:
> > • Average IOPS decreased by 77%
> > • Average throughput decreased by 77%
> > • Average latency increased by 1.6 ms
>
> I would expect krb5p to be the worst in terms of
> latency. And I would like to see round-trip numbers
> reported: what part of the increase in latency is
> due to server versus client processing?
>
> This is also remarkable:
>
> > When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
>
>
> So, does this mean that nconnect makes the GSS sequence
> window problem worse, or that when a window underrun
> occurs it has broader impact because multiple connections
> are affected?
>
> Seems like maybe nconnect should set up a unique GSS
> context for each xprt. It would be helpful to file a bug.
>
Here's a snippet from RFC 2203:
In a successful response, the seq_window field is set to the sequence
window length supported by the server for this context. This window
specifies the maximum number of client requests that may be
outstanding for this context. The server will accept "seq_window"
requests at a time, and these may be out of order. The client may
use this number to determine the number of threads that can
simultaneously send requests on this context.

It would be interesting to know what window size NetApp filers specify
in the reply when context initialization completes.
A simple fix might be to get NetApp to increase the window, since they
have observed the problem.
FreeBSD servers use 128. I have no idea what other servers use.
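
For illustration, here is a minimal sketch of the kind of sequence-window
check RFC 2203 describes. The names and layout are hypothetical, not the
actual FreeBSD or Linux sunrpc code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define GSS_SEQ_WIN 128  /* window size the server advertises */

    struct gss_ctx_state {
        uint32_t seq_max;           /* highest sequence number accepted */
        bool     seen[GSS_SEQ_WIN]; /* replay cache, indexed by seq % win */
    };

    /*
     * Return true if the request may be processed, false if it must be
     * dropped (below the window, or a replay). Out-of-order arrival
     * inside the window is fine; that is the point of the window.
     */
    static bool gss_seq_check(struct gss_ctx_state *ctx, uint32_t seq)
    {
        if (seq > ctx->seq_max) {
            if (seq - ctx->seq_max >= GSS_SEQ_WIN) {
                memset(ctx->seen, 0, sizeof(ctx->seen));
            } else {
                uint32_t i;
                /* slide the window forward: clear the slots we skip */
                for (i = ctx->seq_max + 1; i <= seq; i++)
                    ctx->seen[i % GSS_SEQ_WIN] = false;
            }
            ctx->seq_max = seq;
        } else if (ctx->seq_max - seq >= GSS_SEQ_WIN) {
            return false;   /* fell below the window: drop the request */
        } else if (ctx->seen[seq % GSS_SEQ_WIN]) {
            return false;   /* replay within the window */
        }
        ctx->seen[seq % GSS_SEQ_WIN] = true;
        return true;
    }

Note that RFC 2203 has the server silently drop requests that fall below
the window; tearing down the whole context, as the article describes,
appears to be a stronger reaction than the RFC itself requires.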

rick

>
> > and then in 'man 5 nfs'
> > sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
>
> Kerberos has performance impacts due to the crypto-
> graphic operations that are performed on even small
> fixed-sized sections of each RPC message, when using
> sec=krb5 (no 'i' or 'p').
>
>
> > Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> > not on each file access?
>
> If you mount with NFSv4 and sec=sys from a Linux NFS
> client that has a keytab, the client will attempt to
> use krb5i for lease management operations (such as
> EXCHANGE_ID) but it will continue to use sec=sys for
> user authentication. That's not terribly secure.
>
> A better answer would be to make Kerberos faster.
> I've done some recent work on improving the overhead
> of using message digest algorithms with GSS-API, but
> haven't done any specific measurement. I'm sure
> there's more room for optimization.
>
> Even better would be to use a transport layer security
> service. Amazon has EFS and Oracle Cloud has something
> similar, but we're working on a standard approach that
> uses TLSv1.3.
>
>
> --
> Chuck Lever
>
>
>

2023-02-13 14:55:29

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

On Sun, Feb 12, 2023 at 1:08 PM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
> >
> > Hi,
> >
> > question about the performance of sec=krb5.
> >
> > https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> > Performance impact of krb5:
> > Average IOPS decreased by 53%
> > Average throughput decreased by 53%
> > Average latency increased by 3.2 ms
>
> Looking at the numbers in this article... they don't
> seem quite right. Here are the others:
>
> > Performance impact of krb5i:
> > • Average IOPS decreased by 55%
> > • Average throughput decreased by 55%
> > • Average latency increased by 0.6 ms
> > Performance impact of krb5p:
> > • Average IOPS decreased by 77%
> > • Average throughput decreased by 77%
> > • Average latency increased by 1.6 ms
>
> I would expect krb5p to be the worst in terms of
> latency. And I would like to see round-trip numbers
> reported: what part of the increase in latency is
> due to server versus client processing?
>
> This is also remarkable:
>
> > When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
>
>
> So, does this mean that nconnect makes the GSS sequence
> window problem worse, or that when a window underrun
> occurs it has broader impact because multiple connections
> are affected?

Yes, nconnect makes the GSS sequence window problem worse (it is very
typical to generate more RPCs than the GSS window size, with no
ability to control the order in which they are sent), and yes, all
connections are affected. ONTAP, like Linux, uses a GSS window size of
128, but we've experimented with increasing it to larger values and it
would still cause issues.

> Seems like maybe nconnect should set up a unique GSS
> context for each xprt. It would be helpful to file a bug.

At the time when I saw the issue and asked about it (though can't find
a reference now) I got the impression that having multiple contexts
for the same rpc client was not going to be acceptable.



>
>
> > and then in 'man 5 nfs'
> > sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
>
> Kerberos has performance impacts due to the crypto-
> graphic operations that are performed on even small
> fixed-sized sections of each RPC message, when using
> sec=krb5 (no 'i' or 'p').
>
>
> > Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> > not on each file access?
>
> If you mount with NFSv4 and sec=sys from a Linux NFS
> client that has a keytab, the client will attempt to
> use krb5i for lease management operations (such as
> EXCHANGE_ID) but it will continue to use sec=sys for
> user authentication. That's not terribly secure.
>
> A better answer would be to make Kerberos faster.
> I've done some recent work on improving the overhead
> of using message digest algorithms with GSS-API, but
> haven't done any specific measurement. I'm sure
> there's more room for optimization.
>
> Even better would be to use a transport layer security
> service. Amazon has EFS and Oracle Cloud has something
> similar, but we're working on a standard approach that
> uses TLSv1.3.
>
>
> --
> Chuck Lever
>
>
>

2023-02-13 15:19:07

by Rick Macklem

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

On Mon, Feb 13, 2023 at 6:55 AM Olga Kornievskaia <[email protected]> wrote:
>
>
>
> On Sun, Feb 12, 2023 at 1:08 PM Chuck Lever III <[email protected]> wrote:
> >
> >
> >
> > > On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > question about the performance of sec=krb5.
> > >
> > > https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> > > Performance impact of krb5:
> > > Average IOPS decreased by 53%
> > > Average throughput decreased by 53%
> > > Average latency increased by 3.2 ms
> >
> > Looking at the numbers in this article... they don't
> > seem quite right. Here are the others:
> >
> > > Performance impact of krb5i:
> > > • Average IOPS decreased by 55%
> > > • Average throughput decreased by 55%
> > > • Average latency increased by 0.6 ms
> > > Performance impact of krb5p:
> > > • Average IOPS decreased by 77%
> > > • Average throughput decreased by 77%
> > > • Average latency increased by 1.6 ms
> >
> > I would expect krb5p to be the worst in terms of
> > latency. And I would like to see round-trip numbers
> > reported: what part of the increase in latency is
> > due to server versus client processing?
> >
> > This is also remarkable:
> >
> > > When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
> >
> >
> > So, does this mean that nconnect makes the GSS sequence
> > window problem worse, or that when a window underrun
> > occurs it has broader impact because multiple connections
> > are affected?
>
> Yes, nconnect makes the GSS sequence window problem worse (it is very
> typical to generate more RPCs than the GSS window size, with no
> ability to control the order in which they are sent), and yes, all
> connections are affected. ONTAP, like Linux, uses a GSS window size of
> 128, but we've experimented with increasing it to larger values and it
> would still cause issues.
>
> > Seems like maybe nconnect should set up a unique GSS
> > context for each xprt. It would be helpful to file a bug.
>
> At the time when I saw the issue and asked about it (though can't find
> a reference now) I got the impression that having multiple contexts
> for the same rpc client was not going to be acceptable.
>
I suspect there might be awkward corner cases if there are multiple
contexts for a given user principal.
For example:
- If the group database changed at about the same time as the
context was established, you might get two contexts for a user
that map to different sets of groups on the server.
- If a user renewed a TGT at about the time the contexts were being
created on the client, I think they could end up with different expiry
times.

These are just off the top of my head, but I suspect there are issues
when you create multiple contexts for a given user?

rick

>
>
> >
> >
> > > and then in 'man 5 nfs'
> > > sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
> >
> > Kerberos has performance impacts due to the crypto-
> > graphic operations that are performed on even small
> > fixed-sized sections of each RPC message, when using
> > sec=krb5 (no 'i' or 'p').
> >
> >
> > > Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> > > not on each file access?
> >
> > If you mount with NFSv4 and sec=sys from a Linux NFS
> > client that has a keytab, the client will attempt to
> > use krb5i for lease management operations (such as
> > EXCHANGE_ID) but it will continue to use sec=sys for
> > user authentication. That's not terribly secure.
> >
> > A better answer would be to make Kerberos faster.
> > I've done some recent work on improving the overhead
> > of using message digest algorithms with GSS-API, but
> > haven't done any specific measurement. I'm sure
> > there's more room for optimization.
> >
> > Even better would be to use a transport layer security
> > service. Amazon has EFS and Oracle Cloud has something
> > similar, but we're working on a standard approach that
> > uses TLSv1.3.
> >
> >
> > --
> > Chuck Lever
> >
> >
> >

2023-02-13 15:38:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5



> On Feb 13, 2023, at 09:55, Olga Kornievskaia <[email protected]> wrote:
>
> On Sun, Feb 12, 2023 at 1:08 PM Chuck Lever III <[email protected]> wrote:
>>
>>
>>
>>> On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> question about the performance of sec=krb5.
>>>
>>> https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
>>> Performance impact of krb5:
>>> Average IOPS decreased by 53%
>>> Average throughput decreased by 53%
>>> Average latency increased by 3.2 ms
>>
>> Looking at the numbers in this article... they don't
>> seem quite right. Here are the others:
>>
>>> Performance impact of krb5i:
>>> • Average IOPS decreased by 55%
>>> • Average throughput decreased by 55%
>>> • Average latency increased by 0.6 ms
>>> Performance impact of krb5p:
>>> • Average IOPS decreased by 77%
>>> • Average throughput decreased by 77%
>>> • Average latency increased by 1.6 ms
>>
>> I would expect krb5p to be the worst in terms of
>> latency. And I would like to see round-trip numbers
>> reported: what part of the increase in latency is
>> due to server versus client processing?
>>
>> This is also remarkable:
>>
>>> When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
>>
>>
>> So, does this mean that nconnect makes the GSS sequence
>> window problem worse, or that when a window underrun
>> occurs it has broader impact because multiple connections
>> are affected?
>
> Yes, nconnect makes the GSS sequence window problem worse (it is very
> typical to generate more RPCs than the GSS window size, with no
> ability to control the order in which they are sent), and yes, all
> connections are affected. ONTAP, like Linux, uses a GSS window size of
> 128, but we've experimented with increasing it to larger values and it
> would still cause issues.
>
>> Seems like maybe nconnect should set up a unique GSS
>> context for each xprt. It would be helpful to file a bug.
>
> At the time when I saw the issue and asked about it (though can't find
> a reference now) I got the impression that having multiple contexts
> for the same rpc client was not going to be acceptable.
>

We have discussed this earlier on this mailing list. To me, the two issues are separate.
- It would be nice to enforce the GSS window on the client, and to throttle further RPC calls from using a context once the window is full (see the sketch below).
- It might also be nice to allow for multiple contexts on the client and to have them assigned on a per-xprt basis so that the number of slots scales with the number of connections.

Note though, that window issues do tend to be mitigated by the NFSv4.x (x>0) sessions. It would make sense for server vendors to ensure that they match the GSS window size to the max number of session slots.
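
To make the throttling idea concrete, here is a minimal sketch, with
hypothetical names (this is not the actual sunrpc code): a counting
semaphore sized to the server's advertised seq_window bounds how many
requests can hold a live sequence number at once.

    #include <semaphore.h>
    #include <stdint.h>

    struct gss_cl_ctx {
        sem_t    window;    /* initialised to the server's seq_window */
        uint32_t next_seq;  /* next GSS sequence number to assign */
    };

    /* Call before assigning a sequence number to an outgoing request. */
    static uint32_t gss_seq_acquire(struct gss_cl_ctx *ctx)
    {
        sem_wait(&ctx->window);  /* blocks while the window is full */
        return __atomic_fetch_add(&ctx->next_seq, 1, __ATOMIC_SEQ_CST);
    }

    /* Call when the reply arrives, or the request is abandoned. */
    static void gss_seq_release(struct gss_cl_ctx *ctx)
    {
        sem_post(&ctx->window);
    }

This bounds the number of outstanding sequence numbers, but it does not
by itself guarantee that a delayed request is transmitted before the
window slides past it, which is the ordering problem described above.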

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-02-13 17:45:44

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

On Mon, Feb 13, 2023 at 10:38 AM Trond Myklebust
<[email protected]> wrote:
>
>
>
> > On Feb 13, 2023, at 09:55, Olga Kornievskaia <[email protected]> wrote:
> >
> > On Sun, Feb 12, 2023 at 1:08 PM Chuck Lever III <[email protected]> wrote:
> >>
> >>
> >>
> >>> On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> question about the performance of sec=krb5.
> >>>
> >>> https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> >>> Performance impact of krb5:
> >>> Average IOPS decreased by 53%
> >>> Average throughput decreased by 53%
> >>> Average latency increased by 3.2 ms
> >>
> >> Looking at the numbers in this article... they don't
> >> seem quite right. Here are the others:
> >>
> >>> Performance impact of krb5i:
> >>> • Average IOPS decreased by 55%
> >>> • Average throughput decreased by 55%
> >>> • Average latency increased by 0.6 ms
> >>> Performance impact of krb5p:
> >>> • Average IOPS decreased by 77%
> >>> • Average throughput decreased by 77%
> >>> • Average latency increased by 1.6 ms
> >>
> >> I would expect krb5p to be the worst in terms of
> >> latency. And I would like to see round-trip numbers
> >> reported: what part of the increase in latency is
> >> due to server versus client processing?
> >>
> >> This is also remarkable:
> >>
> >>> When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
> >>
> >>
> >> So, does this mean that nconnect makes the GSS sequence
> >> window problem worse, or that when a window underrun
> >> occurs it has broader impact because multiple connections
> >> are affected?
> >
> > Yes, nconnect makes the GSS sequence window problem worse (it is very
> > typical to generate more RPCs than the GSS window size, with no
> > ability to control the order in which they are sent), and yes, all
> > connections are affected. ONTAP, like Linux, uses a GSS window size of
> > 128, but we've experimented with increasing it to larger values and it
> > would still cause issues.
> >
> >> Seems like maybe nconnect should set up a unique GSS
> >> context for each xprt. It would be helpful to file a bug.
> >
> > At the time when I saw the issue and asked about it (though can't find
> > a reference now) I got the impression that having multiple contexts
> > for the same rpc client was not going to be acceptable.
> >
>
> We have discussed this earlier on this mailing list. To me, the two issues are separate.
> - It would be nice to enforce the GSS window on the client, and to throttle further RPC calls from using a context once the window is full.
> - It might also be nice to allow for multiple contexts on the client and to have them assigned on a per-xprt basis so that the number of slots scales with the number of connections.
>
> Note though, that window issues do tend to be mitigated by the NFSv4.x (x>0) sessions. It would make sense for server vendors to ensure that they match the GSS window size to the max number of session slots.

Matching max session slots to the GSS window size doesn't help, but
perhaps my understanding of the flow is wrong. Typically all these runs
are done with the client's default session slot count, which is only 64
slots (the server's session slot count is higher). The session slot
assignment happens after the GSS sequence assignment. So we have a bunch
of requests that have been given GSS sequence numbers beyond the window,
and then they go wait for slot assignment, but by the time they are sent
they are already outside the sequence window.

>
> _________________________________
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>

2023-02-13 17:52:44

by Charles Hedrick

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

Those numbers seem implausible.

I just tried my standard quick NFS test on the same file system with sec=sys and sec=krb5. It untars an archive containing 80,000 files, of a size typical for our users.

krb5: 1:38
sys: 1:29

I did the test only once. Since the server is in use, it should really be tried multiple times.

krb5i and krb5p have to work on all the contents. I haven't looked at the protocol details, but krb5 with no suffix should only have to work on headers. A 3.2 msec increase in latency would be a disaster, which we would certainly have noticed. (Almost all of our NFS activity uses krb5.)

It is particularly implausible that latency would increase by 3.2 msec for krb5, 0.6 msec for krb5i and 1.6 msec for krb5p. krb5 encrypts only security info. krb5p encrypts everything. Perhaps they mean 0.32 msec? We'd even notice that, but at least it would be consistent with krb5i and krb5p.
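
For reference, these flavors map onto the three GSS services defined in
RFC 2203. A rough sketch of the on-the-wire distinction (the comments are
my summary, not authoritative):

    /* rpc_gss_service_t values from RFC 2203 */
    enum rpc_gss_service {
        RPC_GSS_SVC_NONE      = 1, /* krb5:  MIC over the RPC header only;
                                      payload sent in the clear */
        RPC_GSS_SVC_INTEGRITY = 2, /* krb5i: checksum over header and data */
        RPC_GSS_SVC_PRIVACY   = 3, /* krb5p: header checksummed, data
                                      encrypted */
    };

So the per-byte cryptographic cost should rise from krb5 to krb5i to
krb5p, which is what makes the article's latency ordering look wrong.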



From: Wang Yugui <[email protected]>
Sent: Sunday, February 12, 2023 1:01 AM
To: [email protected] <[email protected]>
Subject: question about the performance impact of sec=krb5
Hi,

question about the performance of sec=krb5.

https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
Performance impact of krb5:
Average IOPS decreased by 53%
Average throughput decreased by 53%
Average latency increased by 3.2 ms

and then in 'man 5 nfs'
sec=krb5 provides cryptographic proof of a user's identity in each RPC request.

Is there an option with better performance that checks krb5 only at mount.nfs4 time,
not on each file access?

Best Regards
Wang Yugui ([email protected])
2023/02/12


2023-02-13 18:36:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5



> On Feb 13, 2023, at 12:45, Olga Kornievskaia <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 10:38 AM Trond Myklebust
> <[email protected]> wrote:
>>
>>
>>
>>> On Feb 13, 2023, at 09:55, Olga Kornievskaia <[email protected]> wrote:
>>>
>>> On Sun, Feb 12, 2023 at 1:08 PM Chuck Lever III <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On Feb 12, 2023, at 1:01 AM, Wang Yugui <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> question about the performance of sec=krb5.
>>>>>
>>>>> https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
>>>>> Performance impact of krb5:
>>>>> Average IOPS decreased by 53%
>>>>> Average throughput decreased by 53%
>>>>> Average latency increased by 3.2 ms
>>>>
>>>> Looking at the numbers in this article... they don't
>>>> seem quite right. Here are the others:
>>>>
>>>>> Performance impact of krb5i:
>>>>> • Average IOPS decreased by 55%
>>>>> • Average throughput decreased by 55%
>>>>> • Average latency increased by 0.6 ms
>>>>> Performance impact of krb5p:
>>>>> • Average IOPS decreased by 77%
>>>>> • Average throughput decreased by 77%
>>>>> • Average latency increased by 1.6 ms
>>>>
>>>> I would expect krb5p to be the worst in terms of
>>>> latency. And I would like to see round-trip numbers
>>>> reported: what part of the increase in latency is
>>>> due to server versus client processing?
>>>>
>>>> This is also remarkable:
>>>>
>>>>> When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent with in the now-discarded context are no longer valid, thus requiring the messages to be sent again. Larger number of packets in an nconnect setup cause frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.
>>>>
>>>>
>>>> So, does this mean that nconnect makes the GSS sequence
>>>> window problem worse, or that when a window underrun
>>>> occurs it has broader impact because multiple connections
>>>> are affected?
>>>
>>> Yes, nconnect makes the GSS sequence window problem worse (it is very
>>> typical to generate more RPCs than the GSS window size, with no
>>> ability to control the order in which they are sent), and yes, all
>>> connections are affected. ONTAP, like Linux, uses a GSS window size of
>>> 128, but we've experimented with increasing it to larger values and it
>>> would still cause issues.
>>>
>>>> Seems like maybe nconnect should set up a unique GSS
>>>> context for each xprt. It would be helpful to file a bug.
>>>
>>> At the time when I saw the issue and asked about it (though can't find
>>> a reference now) I got the impression that having multiple contexts
>>> for the same rpc client was not going to be acceptable.
>>>
>>
>> We have discussed this earlier on this mailing list. To me, the two issues are separate.
>> - It would be nice to enforce the GSS window on the client, and to throttle further RPC calls from using a context once the window is full.
>> - It might also be nice to allow for multiple contexts on the client and to have them assigned on a per-xprt basis so that the number of slots scales with the number of connections.
>>
>> Note though, that window issues do tend to be mitigated by the NFSv4.x (x>0) sessions. It would make sense for server vendors to ensure that they match the GSS window size to the max number of session slots.
>
> Matching max session slots to the GSS window size doesn't help, but
> perhaps my understanding of the flow is wrong. Typically all these runs
> are done with the client's default session slot count, which is only 64
> slots (the server's session slot count is higher). The session slot
> assignment happens after the GSS sequence assignment. So we have a bunch
> of requests that have been given GSS sequence numbers beyond the window,
> and then they go wait for slot assignment, but by the time they are sent
> they are already outside the sequence window.
>

The NFSv4.x session slot is normally assigned before we kick off the RPC state machine in ‘call_start()’. So if you are limited to 64 session slots, then that will prevent you from exceeding the GSS 128-entry window.
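
As a toy illustration of that ordering (invented names, not the kernel
code): a request only gets a GSS sequence number once it holds a session
slot, so the spread of in-flight sequence numbers is bounded by the slot
count and can never reach the 128-entry window:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SESSION_SLOTS 64
    #define GSS_SEQ_WIN   128

    int main(void)
    {
        uint32_t next_seq = 0;   /* assigned only after a slot is held */
        uint32_t oldest   = 0;   /* oldest sequence number in flight */

        for (int i = 0; i < 1000000; i++) {
            if (next_seq - oldest < SESSION_SLOTS)
                next_seq++;      /* slot free: send, assign a sequence */
            else
                oldest++;        /* all slots busy: a reply retires one */
            /* spread is bounded by the slot count, half the window */
            assert(next_seq - oldest <= SESSION_SLOTS);
            assert(next_seq - oldest < GSS_SEQ_WIN);
        }
        printf("spread never exceeded %d (window is %d)\n",
               SESSION_SLOTS, GSS_SEQ_WIN);
        return 0;
    }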

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-02-13 18:54:37

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

On Mon, Feb 13, 2023 at 1:02 PM Charles Hedrick <[email protected]> wrote:
>
> Those numbers seem implausible.
>
> I just tried my standard quick NFS test on the same file system with sec=sys and sec=krb5. It untars an archive containing 80,000 files, of a size typical for our users.
>
> krb5: 1:38
> sys: 1:29
>
> I did the test only once. Since the server is in use, it should really be tried multiple times.
>
> krb5i and krb5p have to work on all the contents. I haven't looked at the protocol details, but krb5 with no suffix should only have to work on headers. A 3.2 msec increase in latency would be a disaster, which we would certainly have noticed. (Almost all of our NFS activity uses krb5.)
>
> It is particularly implausible that latency would increase by 3.2 msec for krb5, 0.6 msec for krb5i and 1.6 msec for krb5p. krb5 encrypts only security info. krb5p encrypts everything. Perhaps they mean 0.32 msec? We'd even notice that, but at least it would be consistent with krb5i and krb5p.

Actually they really did mean 3.2. Here's another reference that
produces similar numbers:
https://www.netapp.com/media/19384-tr-4616.pdf . Why krb5 gets
higher latency than the rest is bizarre and should have been looked at
before publication.

> From: Wang Yugui <[email protected]>
> Sent: Sunday, February 12, 2023 1:01 AM
> To: [email protected] <[email protected]>
> Subject: question about the performance impact of sec=krb5
>
> Hi,
>
> question about the performance of sec=krb5.
>
> https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> Performance impact of krb5:
> Average IOPS decreased by 53%
> Average throughput decreased by 53%
> Average latency increased by 3.2 ms
>
> and then in 'man 5 nfs'
> sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
>
> Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> not on each file access?
>
> Best Regards
> Wang Yugui ([email protected])
> 2023/02/12
>

2023-02-14 14:53:21

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

On Mon, Feb 13, 2023 at 1:53 PM Olga Kornievskaia <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 1:02 PM Charles Hedrick <[email protected]> wrote:
> >
> > Those numbers seem implausible.
> >
> > I just tried my standard quick NFS test on the same file system with sec=sys and sec=krb5. It untars an archive containing 80,000 files, of a size typical for our users.
> >
> > krb5: 1:38
> > sys: 1:29
> >
> > I did the test only once. Since the server is in use, it should really be tried multiple times.
> >
> > krb5i and krb5p have to work on all the contents. I haven't looked at the protocol details, but krb5 with no suffix should only have to work on headers. A 3.2 msec increase in latency would be a disaster, which we would certainly have noticed. (Almost all of our NFS activity uses krb5.)
> >
> > It is particularly implausible that latency would increase by 3.2 msec for krb5, 0.6 msec for krb5i and 1.6 msec for krb5p. krb5 encrypts only security info. krb5p encrypts everything. Perhaps they mean 0.32 msec? We'd even notice that, but at least it would be consistent with krb5i and krb5p.
>
> Actually they really did mean 3.2. Here's another reference that
> produces similar numbers :
> https://www.netapp.com/media/19384-tr-4616.pdf . Why krb5 perf gets
> higher latency then the rest is bizarre and should have been looked at
> before publication.

Never mind. After some investigation, it turns out the public report
has a typo; it should have been 0.2 ms. Hopefully they'll fix it.

> > From: Wang Yugui <[email protected]>
> > Sent: Sunday, February 12, 2023 1:01 AM
> > To: [email protected] <[email protected]>
> > Subject: question about the performance impact of sec=krb5
> >
> > Hi,
> >
> > question about the performance of sec=krb5.
> >
> > https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> > Performance impact of krb5:
> > Average IOPS decreased by 53%
> > Average throughput decreased by 53%
> > Average latency increased by 3.2 ms
> >
> > and then in 'man 5 nfs'
> > sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
> >
> > Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> > not on each file access?
> >
> > Best Regards
> > Wang Yugui ([email protected])
> > 2023/02/12
> >

2023-02-14 15:25:08

by Charles Hedrick

[permalink] [raw]
Subject: Re: question about the performance impact of sec=krb5

That's better, but that still leaves a 53% reduction in performance. We don't see that, and it's hard to understand where it would come from. Even 0.2 msec looks big: our total RTT for a read is a bit over 0.3 msec for data that comes from cache, which is 95% of it. We're using a generic Dell server with Ubuntu 22.04, i.e. kernel 5.15.

I do see a big slowdown from krb5p, but that's not so surprising.

krb5p: 3:25
krb5: 1:38
sys: 1:29

I guess it's possible that their server is so much more efficient that the sec=sys performance is a lot faster, and so the difference is greater.

From: Olga Kornievskaia <[email protected]>
Sent: Tuesday, February 14, 2023 9:53 AM
To: Charles Hedrick <[email protected]>
Cc: Wang Yugui <[email protected]>; [email protected] <[email protected]>
Subject: Re: question about the performance impact of sec=krb5
On Mon, Feb 13, 2023 at 1:53 PM Olga Kornievskaia <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 1:02 PM Charles Hedrick <[email protected]> wrote:
> >
> > Those numbers seem implausible.
> >
> > I just tried my standard quick NFS test on the same file system with sec=sys and sec=krb5. It untars an archive containing 80,000 files, of a size typical for our users.
> >
> > krb5: 1:38
> > sys: 1:29
> >
> > I did the test only once. Since the server is in use, it should really be tried multiple times.
> >
> > krb5i and krb5p have to work on all the contents. I haven't looked at the protocol details, but krb5 with no suffix should only have to work on headers. A 3.2 msec increase in latency would be a disaster, which we would certainly have noticed. (Almost all of our NFS activity uses krb5.)
> >
> > It is particularly implausible that latency would increase by 3.2 msec for krb5, 0.6 msec for krb5i and 1.6 msec for krb5p. krb5 encrypts only security info. krb5p encrypts everything. Perhaps they mean 0.32 msec? We'd even notice that, but at least it would be consistent with krb5i and krb5p.
>
> Actually they really did mean 3.2. Here's another reference that
> produces similar numbers:
> https://www.netapp.com/media/19384-tr-4616.pdf . Why krb5 gets
> higher latency than the rest is bizarre and should have been looked at
> before publication.

Never mind. After some investigation, it turns out the public report
has a typo; it should have been 0.2 ms. Hopefully they'll fix it.

> > From: Wang Yugui <[email protected]>
> > Sent: Sunday, February 12, 2023 1:01 AM
> > To: [email protected] <[email protected]>
> > Subject: question about the performance impact of sec=krb5
> >
> > Hi,
> >
> > question about the performance of sec=krb5.
> >
> > https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
> > Performance impact of krb5:
> > Average IOPS decreased by 53%
> > Average throughput decreased by 53%
> > Average latency increased by 3.2 ms
> >
> > and then in 'man 5 nfs'
> > sec=krb5 provides cryptographic proof of a user's identity in each RPC request.
> >
> > Is there an option with better performance that checks krb5 only at mount.nfs4 time,
> > not on each file access?
> >
> > Best Regards
> > Wang Yugui ([email protected])
> > 2023/02/12
> >