2023-03-09 07:24:23

by Aram Akhavan

Subject: nfs-idmapd startup race

Hi all,

I've been debugging an NFS server issue where ID mapping was not
happening correctly unless I restarted nfs-kernel-server and re-exported
shares shortly after reboot. The main symptom is this set of log
entries from nfs-idmapd.service:

Mar 08 22:45:59 343guiltyspark.nub.lan systemd[1]: Starting NFSv4 ID-name mapping service...
Mar 08 22:45:59 343guiltyspark.nub.lan rpc.idmapd[620]: libnfsidmap: Unable to determine the NFSv4 domain; Using 'localdomain' as the NFSv4 domain which means UIDs will be mapped to the 'Nobody-User' user defined in /etc/idmapd.conf
Mar 08 22:45:59 343guiltyspark.nub.lan rpc.idmapd[620]: rpc.idmapd: libnfsidmap: Unable to determine the NFSv4 domain; Using 'localdomain' as the NFSv4 domain which means UIDs will be mapped to the 'Nobody-User' user defined in /etc/idmapd.conf
Mar 08 22:45:59 343guiltyspark.nub.lan rpc.idmapd[620]: rpc.idmapd: libnfsidmap: using (default) domain: localdomain
Mar 08 22:45:59 343guiltyspark.nub.lan rpc.idmapd[620]: rpc.idmapd: libnfsidmap: Realms list: 'LOCALDOMAIN'
Mar 08 22:45:59 343guiltyspark.nub.lan rpc.idmapd[620]: rpc.idmapd: libnfsidmap: loaded plugin /lib/x86_64-linux-gnu/libnfsidmap/nsswitch.so for method nsswitch

I wrote a little test program to mimic libnfsidmap's domain_from_dns()
function, which causes the above message:

#include <errno.h>
#include <netdb.h>   /* declares gethostbyname(), hstrerror(), h_errno */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct hostent *he;
    char hname[64];

    /* Step 1: get the local hostname. */
    if (gethostname(hname, sizeof(hname))) {
        printf("gethostname error: %d\n", errno);
        return 1;   /* hname is not valid, so skip the lookup */
    }
    printf("gethostname: '%s'\n", hname);

    /* Step 2: resolve that hostname; this is the lookup that fails at boot. */
    if ((he = gethostbyname(hname)) == NULL) {
        printf("gethostbyname error: '%s'\n", hstrerror(h_errno));
        return 1;
    }
    printf("gethostbyname h_name: '%s'\n", he->h_name);
    return 0;
}

and added it as an ExecStartPre= to the systemd service. The output is:

gethostname: '343guiltyspark.nub.lan'
gethostbyname error: 'Host name lookup failure'

It seems DNS resolution isn't working yet when the service is started,
so I added Wants=network-online.target (and After=) to the systemd
service. It still fails. But if I then add a "sleep 1" to the
ExecStartPre, everything starts up correctly.
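
A bounded retry is one less arbitrary alternative to a fixed sleep: poll
until the lookup succeeds, then give up after a limit. The following is
only a minimal sketch under stated assumptions. wait_for_resolve() is a
hypothetical helper, not part of nfs-utils or libnfsidmap, and it uses
getaddrinfo() rather than the gethostbyname() call tested above. The
attempt count and delay are arbitrary parameters chosen by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper (an assumption, not an nfs-utils function): try to
 * resolve `name` up to `attempts` times, pausing `delay_ms` milliseconds
 * between failed attempts. Returns 0 as soon as one lookup succeeds and
 * -1 if every attempt fails (or if attempts <= 0). */
int wait_for_resolve(const char *name, int attempts, long delay_ms)
{
    struct addrinfo hints = {0};
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 results */
    hints.ai_flags = AI_CANONNAME;   /* ask for the canonical name too */

    for (int i = 0; i < attempts; i++) {
        struct addrinfo *res = NULL;

        if (getaddrinfo(name, NULL, &hints, &res) == 0) {
            freeaddrinfo(res);
            return 0;
        }
        /* Sleep only if another attempt will follow. */
        if (i + 1 < attempts) {
            struct timespec ts = { delay_ms / 1000,
                                   (delay_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
        }
    }
    return -1;
}
```

A small main() could pass the gethostname() result to this helper and run
as the ExecStartPre= step, so the service waits only as long as resolution
actually takes. It is still a workaround and does not explain the race.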

Obviously there are many solutions, including the above and setting the
domain manually in /etc/idmapd.conf. But on principle I'd like to solve
the underlying race condition and help others avoid the same issue.

I'm hoping someone can answer my open questions:

1. Why does libnfsidmap use gethostname() and gethostbyname() (i.e. why
does it need a DNS lookup on the hostname)?

2. nfs-server.service already has a dependency on network-online.target,
but nfs-idmapd.service does not (and it starts first). Since ID mapping
can depend on DNS resolution (and seems to out of the box), why not add
the dependency to the latter as well?

3. Since network-online.target doesn't completely solve the issue,
any ideas on how to fix the startup race without something haphazard
like a "sleep"?

Thanks,

Aram