2014-10-26 21:47:30

by Chuck Lever III

[permalink] [raw]
Subject: NFSv4 idmap misbehavior

Hi David-

I?m looking into some odd NFS idmapper behavior, and I?ve bisected
the problem to this commit (merged in 3.13):

> commit b2a4df200d570b2c33a57e1ebfa5896e4bc81b69
> Author: David Howells <[email protected]>
> Date: Tue Sep 24 10:35:18 2013 +0100
>
> KEYS: Expand the capacity of a keyring
>
> Expand the capacity of a keyring to be able to hold a lot more keys by using
> the previously added associative array implementation. Currently the maximum
> capacity is:
>
> (PAGE_SIZE - sizeof(header)) / sizeof(struct key *)
>
> which, on a 64-bit system, is a little more 500. However, since this is being
> used for the NFS uid mapper, we need more than that. The new implementation
> gives us effectively unlimited capacity.
>
> With some alterations, the keyutils testsuite runs successfully to completion
> after this patch is applied. The alterations are because (a) keyrings that
> are simply added to no longer appear ordered and (b) some of the errors have
> changed a bit.
>
> Signed-off-by: David Howells <[email protected]>


The problem occurs when running on RHEL6 with an upstream kernel
against a Solaris 11 update 2 server. I am able to reproduce this
with 3.17. Notably I?m not able to reproduce this with a newer
user space (tried with F19), nor with a Linux NFS server.

To reproduce it, I?ll start a long-running cthon04 test on an
NFS mount.

[cel@dali cthon04-x86_64]$ ./server -a -N10

Just after starting the test, in another window:

# grep id_ /proc/keys
03515452 I--Q--- 1 9m 3b010000 0 0 id_resolv gid:[email protected]: 4
1c6ebd43 I--Q--- 1 perm 1f3f0000 0 65534 keyring _uid_ses.0: 1
249f8fdb I--Q--- 1 9m 3b010000 0 0 id_resolv gid:[email protected]: 2
2da69ca9 I--Q--- 1 perm 3f3f0000 0 0 keyring .id_resolver_child_1: 4
37df5ceb I--Q--- 1 9m 3b010000 0 0 id_resolv uid:[email protected]: 5
38810f75 I------ 1 perm 1f030000 0 0 keyring .id_resolver: 1
3e7df923 I--Q--- 1 9m 3b010000 0 0 id_resolv uid:[email protected]: 2

After the test has been running for ten minutes, the id_resolv
keys expire, and id_legacy keys appear. Before the above commit,
the id_resolv keys would simply be refreshed and operation
would continue normally.

# grep id_ /proc/keys
00f0a664 I--Q-N- 1 42s 3b010000 0 0 id_legacy uid:[email protected]
03515452 I--Q--- 1 expd 3b010000 0 0 id_resolv gid:[email protected]: 4
0efeaada I--Q-N- 1 53s 3b010000 0 0 id_legacy uid:[email protected]
12d6cd15 I--Q-N- 1 42s 3b010000 0 0 id_legacy gid:[email protected]
1c6ebd43 I--Q--- 1 perm 1f3f0000 0 65534 keyring _uid_ses.0: 1
249f8fdb I--Q--- 1 expd 3b010000 0 0 id_resolv gid:[email protected]: 2
2da69ca9 I--Q--- 1 perm 3f3f0000 0 0 keyring .id_resolver_child_1: 4
2e7150a3 I--Q-N- 1 53s 3b010000 0 0 id_legacy gid:[email protected]
37df5ceb I--Q--- 1 expd 3b010000 0 0 id_resolv uid:[email protected]: 5
38810f75 I------ 1 perm 1f030000 0 0 keyring .id_resolver: 5
3e7df923 I--Q--- 1 expd 3b010000 0 0 id_resolv uid:[email protected]: 2

Subsequently cthon04 fails when it tries to start another pass:

** CHILD pass 1 results: 64/64 pass, 0/0 warn, 0/0 fail (pass/total).
Congratulations, you passed the locking tests!
... Pass 8 ...

rm: cannot remove `/mnt/monet/dali.test': Operation not permitted
Starting BASIC tests: test directory /mnt/monet/dali.test (arg: -t)
mkdir: cannot create directory `/mnt/monet/dali.test': File exists

./test1: File and directory creation test
rm: cannot remove `/mnt/monet/dali.test': Operation not permitted
./test1: (/home/cel/src/cthon04-x86_64/basic) can't remove old test directory /mnt/monet/dali.test
basic tests failed
Tests failed, leaving /mnt/monet mounted
[cel@dali cthon04-x86_64]$

And ID mapping on the test mount is broken. ?dali.test? is the
test directory, but all other files on that mount have bogus
ownership.

[cel@dali cthon04-x86_64]$ ls -l /mnt/monet
total 38995
drwxr-xr-x 2 4294967294 4294967294 4098 Oct 15 22:59 310
-rw------- 1 4294967294 4294967294 10485760 Oct 15 23:00 aio-testfile
-rw-r--r-- 1 4294967294 4294967294 0 Oct 15 22:38 client.out
drwxr-xr-x 12 4294967294 4294967294 12 Oct 15 11:47 clients
drwxrwxrwx 2 4294967294 4294967294 2 Oct 26 17:16 dali.test
drwxr-xr-x 3 4294967294 4294967294 3 Oct 15 22:54 dbench
-rw------- 1 4294967294 4294967294 0 Oct 15 22:53 file
. . .

Restarting the tests or removing the test directory by hand
results in ?Operation not permitted."

After several minutes, all expired id_ keys are purged:

1c6ebd43 I--Q--- 1 perm 1f3f0000 0 65534 keyring _uid_ses.0: 1
2da69ca9 I--Q--- 1 perm 3f3f0000 0 0 keyring .id_resolver_child_1: empty
38810f75 I------ 1 perm 1f030000 0 0 keyring .id_resolver: 1

And cthon04 is able to run again.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2014-10-27 16:06:47

by Chuck Lever III

[permalink] [raw]
Subject: Re: NFSv4 idmap misbehavior


On Oct 27, 2014, at 10:44 AM, Chuck Lever <[email protected]> wrote:

> Hi David-
>
> On Oct 27, 2014, at 6:32 AM, David Howells <[email protected]> wrote:
>
>> Chuck Lever <[email protected]> wrote:
>>
>>> After the test has been running for ten minutes, the id_resolv
>>> keys expire, and id_legacy keys appear. Before the above commit,
>>> the id_resolv keys would simply be refreshed and operation
>>> would continue normally.
>>>
>>> # grep id_ /proc/keys
>>> 00f0a664 I--Q-N- 1 42s 3b010000 0 0 id_legacy uid:[email protected]
>>
>> The 'N' flag indicates that the key is negatively instantiated - ie. the
>> upcall to instantiate them failed in some way. Keys so marked will cause key
>> lookups to return ENOKEY (or maybe some other error if keyctl_reject() was
>> used).
>>
>> Failure can be due to the key being properly negatively instantiated
>> (keyctl_negate() or keyctl_reject()) or the userspace side of instantiation
>> exiting or otherwise dying before it managed to instantiate the key.
>>
>> Is there anything to actually process the id_legacy upcall? It appears
>> they're all negatively instantiated.
>
> rpc.idmapd isn?t running on this system, so no. Legacy idmapping is
> normally disabled on EL6 update 4 AFAICT.
>
> There is an /etc/request-key.d/id_resolver.conf file, and
> /usr/sbin/nfsidmap is there. On another system I previously tried
> using only rpc.idmapd, and it didn?t resolve the sudden ownership
> change issue.
>
> There should be no reason to switch to using the legacy ID mapper
> upcall if the key resolution infrastructure is working. Before the
> ?KEYS: Expand the capacity . . . ? commit, id_legacy keys do not
> appear in /proc/keys.
>
> That leads me to believe something that commit did is preventing
> expired id_resolv keys from getting renewed.
>
>> Can you stick "#define __KDEBUG" at the top of security/keys/request_key.c,
>> rebuild and look in dmesg?
>
> Thanks, I?ll give that a try and report back.

I enabled __KDEBUG. There are no new messages in /var/log/messages
when the reproducer fails.


>> Also, /sbin/request-key logs to syslog if it can't find a match in its config
>> file.
>
> I didn?t see any relevant messages in /var/log/messages, but I may
> have overlooked something.
>
> More recent kernels change the key resolver interface to require
> key_revoke instead of key_invalidate. I?ve seen /var/log/messages
> complaints about that on other EL6 systems running kernels newer
> than 3.16.

--
Chuck Lever




2014-10-28 10:12:42

by Carl Hetherington

[permalink] [raw]
Subject: Re: NFSv4 idmap misbehavior


Hi Chuck,

[snip]

> After the test has been running for ten minutes, the id_resolv
> keys expire, and id_legacy keys appear. Before the above commit,
> the id_resolv keys would simply be refreshed and operation
> would continue normally.

This sounds a lot like a bug I have seen, our investigations into which
are summarised here:

https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1124250

There's an inexpertly written (by me) patch in that bug report (which I
posted to this list recently, too) which (in a nutshell) makes the
idmapper code call up to userspace to get a new key when the current ones
have expired.

I don't know whether that's the right approach, but I'm sure someone here
more qualified can comment.

Best regards,
Carl

2014-10-27 14:44:13

by Chuck Lever III

[permalink] [raw]
Subject: Re: NFSv4 idmap misbehavior

Hi David-

On Oct 27, 2014, at 6:32 AM, David Howells <[email protected]> wrote:

> Chuck Lever <[email protected]> wrote:
>
>> After the test has been running for ten minutes, the id_resolv
>> keys expire, and id_legacy keys appear. Before the above commit,
>> the id_resolv keys would simply be refreshed and operation
>> would continue normally.
>>
>> # grep id_ /proc/keys
>> 00f0a664 I--Q-N- 1 42s 3b010000 0 0 id_legacy uid:[email protected]
>
> The 'N' flag indicates that the key is negatively instantiated - ie. the
> upcall to instantiate them failed in some way. Keys so marked will cause key
> lookups to return ENOKEY (or maybe some other error if keyctl_reject() was
> used).
>
> Failure can be due to the key being properly negatively instantiated
> (keyctl_negate() or keyctl_reject()) or the userspace side of instantiation
> exiting or otherwise dying before it managed to instantiate the key.
>
> Is there anything to actually process the id_legacy upcall? It appears
> they're all negatively instantiated.

rpc.idmapd isn?t running on this system, so no. Legacy idmapping is
normally disabled on EL6 update 4 AFAICT.

There is an /etc/request-key.d/id_resolver.conf file, and
/usr/sbin/nfsidmap is there. On another system I previously tried
using only rpc.idmapd, and it didn?t resolve the sudden ownership
change issue.

There should be no reason to switch to using the legacy ID mapper
upcall if the key resolution infrastructure is working. Before the
?KEYS: Expand the capacity . . . ? commit, id_legacy keys do not
appear in /proc/keys.

That leads me to believe something that commit did is preventing
expired id_resolv keys from getting renewed.

> Can you stick "#define __KDEBUG" at the top of security/keys/request_key.c,
> rebuild and look in dmesg?

Thanks, I?ll give that a try and report back.

> Also, /sbin/request-key logs to syslog if it can't find a match in its config
> file.

I didn?t see any relevant messages in /var/log/messages, but I may
have overlooked something.

More recent kernels change the key resolver interface to require
key_revoke instead of key_invalidate. I?ve seen /var/log/messages
complaints about that on other EL6 systems running kernels newer
than 3.16.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2014-10-27 10:33:07

by David Howells

[permalink] [raw]
Subject: Re: NFSv4 idmap misbehavior

Chuck Lever <[email protected]> wrote:

> After the test has been running for ten minutes, the id_resolv
> keys expire, and id_legacy keys appear. Before the above commit,
> the id_resolv keys would simply be refreshed and operation
> would continue normally.
>
> # grep id_ /proc/keys
> 00f0a664 I--Q-N- 1 42s 3b010000 0 0 id_legacy uid:[email protected]

The 'N' flag indicates that the key is negatively instantiated - ie. the
upcall to instantiate them failed in some way. Keys so marked will cause key
lookups to return ENOKEY (or maybe some other error if keyctl_reject() was
used).

Failure can be due to the key being properly negatively instantiated
(keyctl_negate() or keyctl_reject()) or the userspace side of instantiation
exiting or otherwise dying before it managed to instantiate the key.

Is there anything to actually process the id_legacy upcall? It appears
they're all negatively instantiated.

Can you stick "#define __KDEBUG" at the top of security/keys/request_key.c,
rebuild and look in dmesg?

Also, /sbin/request-key logs to syslog if it can't find a match in its config
file.

David

2014-10-28 14:16:38

by Chuck Lever III

[permalink] [raw]
Subject: Re: NFSv4 idmap misbehavior

Hey Carl-

On Oct 28, 2014, at 6:12 AM, Carl Hetherington <[email protected]> wrote:

> Hi Chuck,
>
> [snip]
>
>> After the test has been running for ten minutes, the id_resolv
>> keys expire, and id_legacy keys appear. Before the above commit,
>> the id_resolv keys would simply be refreshed and operation
>> would continue normally.
>
> This sounds a lot like a bug I have seen, our investigations into which
> are summarised here:
>
> https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1124250

Looks like a lot of folks are having trouble, and there are two or
three sets of independent symptoms reported in the comments. Thanks
for bringing it to this forum.

I might have a different issue from the bug report, though:

+ 1124250 might be a server-side problem, and my problem is
definitely on the NFS client

+ My system works fine with kernels up to 3.12

+ I can reproduce with just one user, so probably not a key
quota issue

+ I?ve got keyutils installed already

> There's an inexpertly written (by me) patch in that bug report (which I
> posted to this list recently, too) which (in a nutshell) makes the
> idmapper code call up to userspace to get a new key when the current ones
> have expired.

David should have a look at this.

However, your patch changes the behavior of renewing legacy ID
resolver keys AFAICT? On my system, at least, we shouldn?t even
be getting this far.

I?ll poke around more.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com