2022-02-17 13:58:24

by Richard Weinberger

[permalink] [raw]
Subject: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

This is the second iteration of the NFS re-export improvement series for nfs-utils.
While the kernel side didn't change at all and is still small,
the userspace side saw many more changes.
Please note that this is still an RFC, there is room for improvement.

The core idea is adding a new export option: reexport=
Using reexport= it is possible to mark an export entry in the exports file
explicitly as an NFS re-export and to select a strategy for how unique
identifiers should be provided.
"remote-devfsid" is the strategy I proposed in my first patch;
I understand that this one is dangerous, but I still find it useful in some
situations.
"auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
backend to keep track of generated ids.
For a more detailed description see patch "exports: Implement new export option reexport=".
I chose SQLite because nfs-utils already uses it, and with SQL, ids can be
nicely generated and maintained. It will also scale to large setups where the
number of subvolumes is high.
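For illustration, the smallest-free-id allocation from patch 1 can be modeled with Python's built-in sqlite3 module. This is only a sketch: the paths are made up, and the authoritative schema and query live in support/reexport/reexport.c (which additionally uses RETURNING and handles the SQLITE_CONSTRAINT race between writers).

```python
import sqlite3

# Minimal model of the fsidnums table from patch 1 (paths are examples).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fsidnums (num INTEGER PRIMARY KEY"
            " CHECK (num > 0 AND num < 4294967296), path TEXT UNIQUE)")

def alloc_fsid(path):
    """Claim the smallest free fsid number for path."""
    # Left-join fsidnums to itself with an offset of 1 to find a gap;
    # the UNION branch handles the empty-table case (first id is 1).
    con.execute(
        "INSERT INTO fsidnums VALUES ("
        "(SELECT ids1.num + 1 FROM fsidnums AS ids1"
        " LEFT JOIN fsidnums AS ids2 ON ids2.num = ids1.num + 1"
        " WHERE ids2.num IS NULL"
        " UNION SELECT 1 WHERE NOT EXISTS"
        " (SELECT NULL FROM fsidnums WHERE num = 1)"
        " LIMIT 1), ?)", (path,))
    return con.execute("SELECT num FROM fsidnums WHERE path = ?",
                       (path,)).fetchone()[0]

print(alloc_fsid("/srv/nfs/a"))  # 1
print(alloc_fsid("/srv/nfs/b"))  # 2
```

Inserting the same path twice violates the UNIQUE constraint, which is exactly the race the C code catches via SQLITE_CONSTRAINT and resolves by re-reading the row.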

Besides id generation, this series also addresses the reboot problem.
If the re-exporting NFS server reboots, uncovered NFS subvolumes are not yet
mounted and file handles become stale.
Now mountd/exportd keeps track of uncovered subvolumes and makes sure they get
uncovered while nfsd starts.
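The startup pass can be sketched as follows (a sketch with hypothetical names; the real implementation is reexpdb_uncover_subvolumes() plus the orphan check in this series, and it uses statfs(2)): every subvolume path recorded in the database is simply stat'd, which makes the kernel trigger the NFS automount again before clients present old file handles.

```python
import os
import sqlite3

def uncover_subvolumes(dbpath):
    """Stat every recorded subvolume so crossmnt automounts are re-triggered.

    Sketch only: os.statvfs() stands in for statfs(2) here.
    """
    con = sqlite3.connect(dbpath)
    uncovered = []
    for (path,) in con.execute("SELECT path FROM subvolumes"):
        try:
            os.statvfs(path)  # side effect: kernel mounts the subvolume
            uncovered.append(path)
        except OSError:
            pass  # not reachable; real code decides orphaned vs. pending
    con.close()
    return uncovered
```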

The whole set of features is currently opt-in via --enable-reexport.
I'm also not sure about the rearrangement of the reexport code;
currently it is a helper library.

Please let me know whether you like this approach.
If so I'd tidy it up and submit it as non-RFC.

TODOs/Open questions:
- When re-exporting, fs.nfs.nfs_mountpoint_timeout should be set to 0
to keep subvolumes from vanishing.
Is this something exportfs should do automatically when it sees an export entry with a reexport= option?
- exportd saw only minimal testing so far; I wasn't aware of it yet. :-S
- Currently there is no way to release the shared memory which contains the database lock.
I guess it could be released via exportfs -f, which is the very last exec in nfs-server.service.
- Add a tool to import/export entries from the reexport database which obeys the shared lock.
- When doing v4->v4 or v3->v4 re-exports, the very first read access to a file blocks for a few seconds until
the client does a retransmit.
v3->v3 works fine. More investigation needed.
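If exportfs were to apply the timeout itself, it would amount to writing the procfs file behind that sysctl. A hypothetical sketch (whether exportfs should own this is exactly the open question above):

```python
def set_nfs_mountpoint_timeout(seconds,
        procfile="/proc/sys/fs/nfs/nfs_mountpoint_timeout"):
    """Write the sysctl so uncovered subvolumes do not expire and vanish.

    Hypothetical helper, equivalent to
    `sysctl fs.nfs.nfs_mountpoint_timeout=0`; requires root on a real system.
    """
    with open(procfile, "w") as f:
        f.write("%d\n" % seconds)
```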

Looking forward to your feedback!

Thanks,
//richard

Richard Weinberger (6):
Implement reexport helper library
exports: Implement new export option reexport=
export: Implement logic behind reexport=
export: Record mounted volumes
nfsd: statfs() every known subvolume upon start
export: Garbage collect orphaned subvolumes upon start

configure.ac | 12 +
support/Makefile.am | 4 +
support/export/Makefile.am | 2 +
support/export/cache.c | 241 +++++++++++++++++-
support/export/export.h | 3 +
support/include/nfslib.h | 1 +
support/nfs/Makefile.am | 1 +
support/nfs/exports.c | 73 ++++++
support/reexport/Makefile.am | 6 +
support/reexport/reexport.c | 477 +++++++++++++++++++++++++++++++++++
support/reexport/reexport.h | 53 ++++
utils/exportd/Makefile.am | 8 +-
utils/exportd/exportd.c | 17 ++
utils/exportfs/Makefile.am | 4 +
utils/mount/Makefile.am | 6 +
utils/mountd/Makefile.am | 6 +
utils/mountd/mountd.c | 1 +
utils/mountd/svc_run.c | 18 ++
utils/nfsd/Makefile.am | 6 +
utils/nfsd/nfsd.c | 10 +
20 files changed, 934 insertions(+), 15 deletions(-)
create mode 100644 support/reexport/Makefile.am
create mode 100644 support/reexport/reexport.c
create mode 100644 support/reexport/reexport.h

--
2.31.1


2022-02-17 16:28:27

by Richard Weinberger

[permalink] [raw]
Subject: [RFC PATCH 6/6] export: Garbage collect orphaned subvolumes upon start

Make sure the database contains no orphaned subvolumes.
We have to be careful: a subvolume whose parent export is merely
not mounted yet must not be dropped.

Signed-off-by: Richard Weinberger <[email protected]>
---
support/export/cache.c | 97 +++++++++++++++++++++++++++++++++++++++++
support/export/export.h | 3 ++
utils/exportd/exportd.c | 17 +++++++-
utils/mountd/mountd.c | 1 +
utils/mountd/svc_run.c | 18 ++++++++
5 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/support/export/cache.c b/support/export/cache.c
index b5763b1d..94a0d79a 100644
--- a/support/export/cache.c
+++ b/support/export/cache.c
@@ -1181,6 +1181,103 @@ lookup_export(char *dom, char *path, struct addrinfo *ai)
return found;
}

+static char *get_export_path(char *path)
+{
+ int i;
+ nfs_export *exp;
+ nfs_export *found = NULL;
+
+ for (i = 0; i < MCL_MAXTYPES; i++) {
+ for (exp = exportlist[i].p_head; exp; exp = exp->m_next) {
+ if (!path_matches(exp, path))
+ continue;
+
+ if (!found) {
+ found = exp;
+ continue;
+ }
+
+ /* Always prefer non-V4ROOT exports */
+ if (exp->m_export.e_flags & NFSEXP_V4ROOT)
+ continue;
+ if (found->m_export.e_flags & NFSEXP_V4ROOT) {
+ found = exp;
+ continue;
+ }
+
+ /* If one is a CROSSMOUNT, then prefer the longest path */
+ if (((found->m_export.e_flags & NFSEXP_CROSSMOUNT) ||
+ (exp->m_export.e_flags & NFSEXP_CROSSMOUNT)) &&
+ strlen(found->m_export.e_path) !=
+ strlen(exp->m_export.e_path)) {
+
+ if (strlen(exp->m_export.e_path) >
+ strlen(found->m_export.e_path)) {
+ found = exp;
+ }
+ continue;
+ }
+ }
+ }
+
+ if (found)
+ return found->m_export.e_path;
+ else
+ return NULL;
+}
+
+int export_subvol_orphaned(char *path)
+{
+ struct statfs st, stp;
+ char *path_parent;
+ int ret;
+
+ path_parent = get_export_path(path);
+ if (!path_parent)
+ /*
+ * Path has no parent in export list.
+ * Must be orphaned.
+ */
+ return 1;
+
+ ret = statfs(path_parent, &stp);
+ if (ret == -1)
+ /*
+ * Parent path is not statfs'able. Maybe not yet mounted?
+ * Can't be sure, don't treat path as orphaned.
+ */
+ return 0;
+
+ if (strcmp(path_parent, path) == 0)
+ /*
+ * This is not a subvolume, it is listed in exports.
+ * No need to keep track of it.
+ */
+ return 1;
+
+ if (stp.f_type != 0x6969)
+ /*
+ * Parent is not a NFS mount. Maybe not yet mounted?
+ * Can't be sure either.
+ */
+ return 0;
+
+ ret = statfs(path, &st);
+ if (ret == -1) {
+ if (errno == ENOENT)
+ /*
+ * Parent is a NFS mount but path is gone.
+ * Must be orphaned.
+ */
+ return 1;
+ }
+
+ /*
+ * For all remaining cases we can't be sure either.
+ */
+ return 0;
+}
+
#ifdef HAVE_JUNCTION_SUPPORT

#include <libxml/parser.h>
diff --git a/support/export/export.h b/support/export/export.h
index 8d5a0d30..45dd3da4 100644
--- a/support/export/export.h
+++ b/support/export/export.h
@@ -38,4 +38,7 @@ static inline bool is_ipaddr_client(char *dom)
{
return dom[0] == '$';
}
+
+int export_subvol_orphaned(char *path);
+
#endif /* EXPORT__H */
diff --git a/utils/exportd/exportd.c b/utils/exportd/exportd.c
index 4ddfed35..6dc51a32 100644
--- a/utils/exportd/exportd.c
+++ b/utils/exportd/exportd.c
@@ -208,6 +208,12 @@ read_exportd_conf(char *progname, char **argv)
default_ttl = ttl;
}

+static void subvol_cb(char *path)
+{
+ if (export_subvol_orphaned(path))
+ reexpdb_drop_subvolume_unlocked(path);
+}
+
int
main(int argc, char **argv)
{
@@ -297,7 +303,16 @@ main(int argc, char **argv)
/* Open files now to avoid sharing descriptors among forked processes */
cache_open();
v4clients_init();
- reexpdb_init();
+ if (reexpdb_init() != 0) {
+ xlog(L_ERROR, "%s: Failed to init reexport database", __func__);
+ exit(1);
+ }
+
+ /*
+ * Load exports into memory and garbage collect orphaned subvolumes.
+ */
+ auth_reload();
+ reexpdb_uncover_subvolumes(subvol_cb);

/* Process incoming upcalls */
cache_process_loop();
diff --git a/utils/mountd/mountd.c b/utils/mountd/mountd.c
index bcf749fa..8555d746 100644
--- a/utils/mountd/mountd.c
+++ b/utils/mountd/mountd.c
@@ -32,6 +32,7 @@
#include "nfsd_path.h"
#include "nfslib.h"
#include "export.h"
+#include "reexport.h"

extern void my_svc_run(void);

diff --git a/utils/mountd/svc_run.c b/utils/mountd/svc_run.c
index 167b9757..9a891ff0 100644
--- a/utils/mountd/svc_run.c
+++ b/utils/mountd/svc_run.c
@@ -57,6 +57,7 @@
#include <rpc/rpc_com.h>
#endif
#include "export.h"
+#include "reexport.h"

void my_svc_run(void);

@@ -87,6 +88,12 @@ my_svc_getreqset (fd_set *readfds)

#endif

+static void subvol_cb(char *path)
+{
+ if (export_subvol_orphaned(path))
+ reexpdb_drop_subvolume_unlocked(path);
+}
+
/*
* The heart of the server. A crib from libc for the most part...
*/
@@ -96,6 +103,17 @@ my_svc_run(void)
fd_set readfds;
int selret;

+ if (reexpdb_init() != 0) {
+ xlog(L_ERROR, "%s: Failed to init reexport database", __func__);
+ return;
+ }
+
+ /*
+ * Load exports into memory and garbage collect orphaned subvolumes.
+ */
+ auth_reload();
+ reexpdb_uncover_subvolumes(subvol_cb);
+
for (;;) {

readfds = svc_fdset;
--
2.31.1

2022-02-17 18:09:21

by Richard Weinberger

[permalink] [raw]
Subject: [RFC PATCH 1/6] Implement reexport helper library

This internal library contains code that will be used by various
tools within the nfs-utils package to deal better with NFS re-export,
especially cross mounts.

Signed-off-by: Richard Weinberger <[email protected]>
---
configure.ac | 12 +
support/Makefile.am | 4 +
support/reexport/Makefile.am | 6 +
support/reexport/reexport.c | 477 +++++++++++++++++++++++++++++++++++
support/reexport/reexport.h | 53 ++++
5 files changed, 552 insertions(+)
create mode 100644 support/reexport/Makefile.am
create mode 100644 support/reexport/reexport.c
create mode 100644 support/reexport/reexport.h

diff --git a/configure.ac b/configure.ac
index 93626d62..86bf8ba9 100644
--- a/configure.ac
+++ b/configure.ac
@@ -274,6 +274,17 @@ AC_ARG_ENABLE(nfsv4server,
fi
AM_CONDITIONAL(CONFIG_NFSV4SERVER, [test "$enable_nfsv4server" = "yes" ])

+AC_ARG_ENABLE(reexport,
+ [AC_HELP_STRING([--enable-reexport],
+ [enable support for re-exporting NFS mounts @<:@default=no@:>@])],
+ enable_reexport=$enableval,
+ enable_reexport="no")
+ if test "$enable_reexport" = yes; then
+ AC_DEFINE(HAVE_REEXPORT_SUPPORT, 1,
+ [Define this if you want NFS re-export support compiled in])
+ fi
+ AM_CONDITIONAL(CONFIG_REEXPORT, [test "$enable_reexport" = "yes" ])
+
dnl Check for TI-RPC library and headers
AC_LIBTIRPC

@@ -730,6 +741,7 @@ AC_CONFIG_FILES([
support/nsm/Makefile
support/nfsidmap/Makefile
support/nfsidmap/libnfsidmap.pc
+ support/reexport/Makefile
tools/Makefile
tools/locktest/Makefile
tools/nlmtest/Makefile
diff --git a/support/Makefile.am b/support/Makefile.am
index c962d4d4..986e9b5f 100644
--- a/support/Makefile.am
+++ b/support/Makefile.am
@@ -10,6 +10,10 @@ if CONFIG_JUNCTION
OPTDIRS += junction
endif

+if CONFIG_REEXPORT
+OPTDIRS += reexport
+endif
+
SUBDIRS = export include misc nfs nsm $(OPTDIRS)

MAINTAINERCLEANFILES = Makefile.in
diff --git a/support/reexport/Makefile.am b/support/reexport/Makefile.am
new file mode 100644
index 00000000..9d544a8f
--- /dev/null
+++ b/support/reexport/Makefile.am
@@ -0,0 +1,6 @@
+## Process this file with automake to produce Makefile.in
+
+noinst_LIBRARIES = libreexport.a
+libreexport_a_SOURCES = reexport.c
+
+MAINTAINERCLEANFILES = Makefile.in
diff --git a/support/reexport/reexport.c b/support/reexport/reexport.c
new file mode 100644
index 00000000..551ec278
--- /dev/null
+++ b/support/reexport/reexport.c
@@ -0,0 +1,477 @@
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sqlite3.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/vfs.h>
+#include <unistd.h>
+
+#include "nfslib.h"
+#include "reexport.h"
+#include "xlog.h"
+
+#define REEXPDB_SHM_NAME "/nfs_reexport_db_lock"
+#define REEXPDB_SHM_SZ 4096
+#define REEXPDB_INIT_LOCK NFS_STATEDIR "/reexpdb_init.lock"
+#define REEXPDB_DBFILE NFS_STATEDIR "/reexpdb.sqlite3"
+
+static const char initdb_sql[] = "CREATE TABLE IF NOT EXISTS fsidnums (num INTEGER PRIMARY KEY CHECK (num > 0 AND num < 4294967296), path TEXT UNIQUE); CREATE TABLE IF NOT EXISTS subvolumes (path TEXT PRIMARY KEY); CREATE INDEX IF NOT EXISTS idx_ids_path ON fsidnums (path);";
+/*
+ * This query is a little tricky. We use SQL to find and claim the smallest free fsid number.
+ * To find a free fsid, the fsidnums table is left joined to itself with an offset of 1.
+ * Everything after the UNION statement is to handle the corner case where fsidnums
+ * is empty. In this case we want 1 as first fsid number.
+ */
+static const char new_fsidnum_by_path_sql[] = "INSERT INTO fsidnums VALUES ((SELECT ids1.num + 1 FROM fsidnums AS ids1 LEFT JOIN fsidnums AS ids2 ON ids2.num = ids1.num + 1 WHERE ids2.num IS NULL UNION SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM fsidnums WHERE num = 1) LIMIT 1), ?1) RETURNING num;";
+static const char fsidnum_by_path_sql[] = "SELECT num FROM fsidnums WHERE path = ?1;";
+static const char add_crossed_volume_sql[] = "REPLACE INTO subvolumes VALUES(?1);";
+static const char drop_crossed_volume_sql[] = "DELETE FROM subvolumes WHERE path = ?1;";
+static const char get_crossed_volumes_sql[] = "SELECT path from subvolumes;";
+
+static sqlite3 *db;
+static pthread_rwlock_t *reexpdb_rwlock;
+static int init_done;
+
+static void reexpdb_wrlock(void)
+{
+ assert(pthread_rwlock_wrlock(reexpdb_rwlock) == 0);
+}
+
+static void reexpdb_rdlock(void)
+{
+ assert(pthread_rwlock_rdlock(reexpdb_rwlock) == 0);
+}
+
+static void reexpdb_unlock(void)
+{
+ assert(pthread_rwlock_unlock(reexpdb_rwlock) == 0);
+}
+
+static int init_shm_lock(void)
+{
+ int lockfd = -1, shmfd = -1;
+ int initlock = 0;
+ int ret = 0;
+
+ assert(sizeof(*reexpdb_rwlock) <= REEXPDB_SHM_SZ);
+
+ lockfd = open(REEXPDB_INIT_LOCK, O_RDWR | O_CREAT, 0600);
+ if (lockfd == -1) {
+ ret = -1;
+ xlog(L_FATAL, "Unable to open %s: %m", REEXPDB_INIT_LOCK);
+
+ goto out;
+ }
+
+ ret = flock(lockfd, LOCK_EX);
+ if (ret == -1) {
+ ret = -1;
+ xlog(L_FATAL, "Unable to lock %s: %m", REEXPDB_INIT_LOCK);
+
+ goto out_close;
+ }
+
+ shmfd = shm_open(REEXPDB_SHM_NAME, O_RDWR, 0600);
+ if (shmfd == -1 && errno == ENOENT) {
+ shmfd = shm_open(REEXPDB_SHM_NAME, O_RDWR | O_CREAT, 0600);
+ if (shmfd == -1) {
+ ret = -1;
+ xlog(L_FATAL, "Unable to create shared memory: %m");
+ goto out_unflock;
+ }
+
+ ret = ftruncate(shmfd, REEXPDB_SHM_SZ);
+ if (ret == -1) {
+ ret = -1;
+ xlog(L_FATAL, "Unable to ftruncate shared memory: %m");
+ goto out_unflock;
+ }
+
+ initlock = 1;
+ } else if (shmfd == -1) {
+ ret = -1;
+ xlog(L_FATAL, "Unable to open shared memory: %m");
+ goto out_unflock;
+ }
+
+ reexpdb_rwlock = mmap(NULL, REEXPDB_SHM_SZ, PROT_READ | PROT_WRITE, MAP_SHARED, shmfd, 0);
+ close(shmfd);
+ if (reexpdb_rwlock == (void *)-1) {
+ xlog(L_FATAL, "Unable to mmap shared memory: %m");
+ ret = -1;
+ goto out_unflock;
+ }
+
+ if (initlock) {
+ pthread_rwlockattr_t attr;
+
+ ret = pthread_rwlockattr_init(&attr);
+ if (ret != 0) {
+ xlog(L_FATAL, "Unable to pthread_rwlockattr_init: %m");
+ ret = -1;
+ goto out_unflock;
+ }
+
+ ret = pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
+ if (ret != 0) {
+ xlog(L_FATAL, "Unable to set PTHREAD_PROCESS_SHARED: %m");
+ ret = -1;
+ goto out_unflock;
+ }
+
+ ret = pthread_rwlock_init(reexpdb_rwlock, &attr);
+ if (ret != 0) {
+ xlog(L_FATAL, "Unable to pthread_rwlock_init: %m");
+ ret = -1;
+ goto out_unflock;
+ }
+ }
+
+ ret = 0;
+
+out_unflock:
+ flock(lockfd, LOCK_UN);
+out_close:
+ close(lockfd);
+out:
+ return ret;
+}
+
+/*
+ * reexpdb_init - Initialize reexport database
+ *
+ * Setup shared lock (database is concurrently used by multiple processes),
+ * if needed create tables and create database handle.
+ * It is okay to call this function multiple times per process.
+ */
+int reexpdb_init(void)
+{
+ char *sqlerr;
+ int ret;
+
+ if (init_done)
+ return 0;
+
+ ret = init_shm_lock();
+ if (ret)
+ return -1;
+
+ ret = sqlite3_open_v2(REEXPDB_DBFILE, &db, SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_ERROR, "Unable to open reexport database: %s", sqlite3_errstr(ret));
+ return -1;
+ }
+
+ reexpdb_wrlock();
+ ret = sqlite3_exec(db, initdb_sql, NULL, NULL, &sqlerr);
+ reexpdb_unlock();
+ if (ret != SQLITE_OK) {
+ xlog(L_ERROR, "Unable to init reexport database: %s", sqlite3_errstr(ret));
+ sqlite3_free(sqlerr);
+ sqlite3_close_v2(db);
+ ret = -1;
+ } else {
+ init_done = 1;
+ ret = 0;
+ }
+
+ return ret;
+}
+
+/*
+ * reexpdb_destroy - Undo reexpdb_init().
+ *
+ * The shared lock remains. We cannot know which other
+ * processes still use the database.
+ */
+void reexpdb_destroy(void)
+{
+ if (!init_done)
+ return;
+
+ sqlite3_close_v2(db);
+ munmap((void *)reexpdb_rwlock, REEXPDB_SHM_SZ);
+ reexpdb_rwlock = NULL;
+}
+
+static int get_fsidnum_by_path(char *path, uint32_t *fsidnum)
+{
+ sqlite3_stmt *stmt = NULL;
+ int found = 0;
+ int ret;
+
+ ret = sqlite3_prepare_v2(db, fsidnum_by_path_sql, sizeof(fsidnum_by_path_sql), &stmt, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to prepare SQL query: %s", sqlite3_errstr(ret));
+ goto out;
+ }
+
+ ret = sqlite3_bind_text(stmt, 1, path, -1, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to bind \"%s\" SQL query: %s", __func__, sqlite3_errstr(ret));
+ goto out;
+ }
+
+ reexpdb_rdlock();
+ ret = sqlite3_step(stmt);
+ if (ret == SQLITE_ROW) {
+ *fsidnum = sqlite3_column_int(stmt, 0);
+ found = 1;
+ } else if (ret == SQLITE_DONE) {
+ /* No hit */
+ found = 0;
+ } else {
+ xlog(L_WARNING, "Error while looking up \"%s\" in database: %s", path, sqlite3_errstr(ret));
+ }
+ reexpdb_unlock();
+
+out:
+ sqlite3_finalize(stmt);
+ return found;
+}
+
+static int new_fsidnum_by_path(char *path, uint32_t *fsidnum)
+{
+ sqlite3_stmt *stmt = NULL;
+ int found = 0, check = 0;
+ int ret;
+
+ ret = sqlite3_prepare_v2(db, new_fsidnum_by_path_sql, sizeof(new_fsidnum_by_path_sql), &stmt, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to prepare SQL query: %s", sqlite3_errstr(ret));
+ goto out;
+ }
+
+ ret = sqlite3_bind_text(stmt, 1, path, -1, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to bind \"%s\" SQL query: %s", path, sqlite3_errstr(ret));
+ goto out;
+ }
+
+ reexpdb_wrlock();
+ ret = sqlite3_step(stmt);
+ if (ret == SQLITE_ROW) {
+ *fsidnum = sqlite3_column_int(stmt, 0);
+ found = 1;
+ } else if (ret == SQLITE_CONSTRAINT) {
+ /* Maybe we lost the race against another writer and the path is now present. */
+ check = 1;
+ } else {
+ xlog(L_WARNING, "Error while looking up \"%s\" in database: %s", path, sqlite3_errstr(ret));
+ }
+ reexpdb_unlock();
+
+out:
+ sqlite3_finalize(stmt);
+
+ if (check) {
+ found = get_fsidnum_by_path(path, fsidnum);
+ if (!found)
+ xlog(L_WARNING, "SQLITE_CONSTRAINT error while inserting \"%s\" in database", path);
+ }
+
+ return found;
+}
+
+int reexpdb_fsidnum_by_path(char *path, uint32_t *fsidnum, int may_create)
+{
+ int found;
+
+ found = get_fsidnum_by_path(path, fsidnum);
+
+ if (!found && may_create)
+ found = new_fsidnum_by_path(path, fsidnum);
+
+ return found;
+}
+
+int reexpdb_apply_reexport_settings(struct exportent *ep, char *flname, int flline)
+{
+ int ret = 0;
+
+ switch (ep->e_reexport) {
+ case REEXP_REMOTE_DEVFSID:
+ if (!ep->e_fsid && !ep->e_uuid) {
+ xlog(L_ERROR, "%s:%i: Selected 'reexport=' mode needs either a numerical or UUID 'fsid='\n",
+ flname, flline);
+ ret = -1;
+ }
+ break;
+ case REEXP_AUTO_FSIDNUM:
+ case REEXP_PREDEFINED_FSIDNUM: {
+ uint32_t fsidnum;
+ int found;
+
+ if (ep->e_uuid)
+ break;
+
+ if (reexpdb_init() != 0) {
+ ret = -1;
+
+ break;
+ }
+
+ found = reexpdb_fsidnum_by_path(ep->e_path, &fsidnum, 0);
+ if (!found) {
+ if (ep->e_reexport == REEXP_AUTO_FSIDNUM) {
+ found = reexpdb_fsidnum_by_path(ep->e_path, &fsidnum, 1);
+ if (!found) {
+ xlog(L_ERROR, "%s:%i: Unable to generate fsid for %s",
+ flname, flline, ep->e_path);
+ ret = -1;
+
+ break;
+ }
+ } else {
+ if (!ep->e_fsid) {
+ xlog(L_ERROR, "%s:%i: Selected 'reexport=' mode requires either a UUID 'fsid=' or a numerical 'fsid=' or a reexport db entry %d",
+ flname, flline, ep->e_fsid);
+ ret = -1;
+ }
+
+ break;
+ }
+ }
+
+ if (ep->e_fsid) {
+ if (ep->e_fsid != fsidnum) {
+ xlog(L_ERROR, "%s:%i: Selected 'reexport=' mode requires configured numerical 'fsid=' to agree with reexport db entry",
+ flname, flline);
+ ret = -1;
+ }
+
+ break;
+ }
+
+ ep->e_fsid = fsidnum;
+
+ break;
+ }
+ }
+
+ return ret;
+}
+
+int reexpdb_add_subvolume(char *path)
+{
+ sqlite3_stmt *stmt = NULL;
+ int ret;
+
+ reexpdb_wrlock();
+ ret = sqlite3_prepare_v2(db, add_crossed_volume_sql, sizeof(add_crossed_volume_sql), &stmt, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to prepare SQL query: %s", sqlite3_errstr(ret));
+ ret = -1;
+ goto out;
+ }
+
+ ret = sqlite3_bind_text(stmt, 1, path, -1, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to bind \"%s\" SQL query: %s", __func__, sqlite3_errstr(ret));
+ ret = -1;
+ goto out;
+ }
+
+ ret = sqlite3_step(stmt);
+ if (ret != SQLITE_DONE) {
+ xlog(L_WARNING, "Error while adding \"%s\" to database: %s", path, sqlite3_errstr(ret));
+ ret = -1;
+ } else {
+ ret = 0;
+ }
+
+out:
+ reexpdb_unlock();
+ sqlite3_finalize(stmt);
+ return ret;
+}
+
+int reexpdb_drop_subvolume_unlocked(char *path)
+{
+ sqlite3_stmt *stmt = NULL;
+ int ret;
+
+ ret = sqlite3_prepare_v2(db, drop_crossed_volume_sql, sizeof(drop_crossed_volume_sql), &stmt, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to prepare SQL query: %s", sqlite3_errstr(ret));
+ ret = -1;
+ goto out;
+ }
+
+ ret = sqlite3_bind_text(stmt, 1, path, -1, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to bind \"%s\" SQL query: %s", __func__, sqlite3_errstr(ret));
+ ret = -1;
+ goto out;
+ }
+
+ ret = sqlite3_step(stmt);
+ if (ret != SQLITE_DONE) {
+ xlog(L_WARNING, "Error while deleting \"%s\" from database: %s", path, sqlite3_errstr(ret));
+ ret = -1;
+ } else {
+ ret = 0;
+ }
+
+out:
+ sqlite3_finalize(stmt);
+ return ret;
+}
+
+
+int reexpdb_uncover_subvolumes(void (*cb)(char *path))
+{
+ sqlite3_stmt *stmt = NULL;
+ struct statfs st;
+ const unsigned char *path;
+ int ret;
+
+ if (cb)
+ reexpdb_wrlock();
+ else
+ reexpdb_rdlock();
+
+ ret = sqlite3_prepare_v2(db, get_crossed_volumes_sql, sizeof(get_crossed_volumes_sql), &stmt, NULL);
+ if (ret != SQLITE_OK) {
+ xlog(L_WARNING, "Unable to prepare SQL query: %s", sqlite3_errstr(ret));
+ ret = -1;
+ goto out;
+ }
+
+ for (;;) {
+ ret = sqlite3_step(stmt);
+ if (ret != SQLITE_ROW)
+ break;
+
+ path = sqlite3_column_text(stmt, 0);
+ if (cb)
+ cb((char *)path);
+ else
+ statfs((char *)path, &st);
+ }
+
+ if (ret != SQLITE_DONE) {
+ xlog(L_WARNING, "Error while reading all subvolumes: %s", sqlite3_errstr(ret));
+ ret = -1;
+ goto out_unlock;
+ }
+
+ ret = 0;
+
+out_unlock:
+ reexpdb_unlock();
+ sqlite3_finalize(stmt);
+out:
+ return ret;
+}
diff --git a/support/reexport/reexport.h b/support/reexport/reexport.h
new file mode 100644
index 00000000..46ec8a96
--- /dev/null
+++ b/support/reexport/reexport.h
@@ -0,0 +1,53 @@
+#ifndef REEXPORT_H
+#define REEXPORT_H
+
+enum {
+ REEXP_NONE = 0,
+ REEXP_AUTO_FSIDNUM,
+ REEXP_PREDEFINED_FSIDNUM,
+ REEXP_REMOTE_DEVFSID,
+};
+
+#ifdef HAVE_REEXPORT_SUPPORT
+int reexpdb_init(void);
+void reexpdb_destroy(void);
+int reexpdb_fsidnum_by_path(char *path, uint32_t *fsidnum, int may_create);
+int reexpdb_apply_reexport_settings(struct exportent *ep, char *flname, int flline);
+int reexpdb_add_subvolume(char *path);
+int reexpdb_uncover_subvolumes(void (*cb)(char *path));
+int reexpdb_drop_subvolume_unlocked(char *path);
+#else
+static inline int reexpdb_init(void) { return 0; }
+static inline void reexpdb_destroy(void) {}
+static inline int reexpdb_fsidnum_by_path(char *path, uint32_t *fsidnum, int may_create)
+{
+ (void)path;
+ (void)may_create;
+ *fsidnum = 0;
+ return 0;
+}
+static inline int reexpdb_apply_reexport_settings(struct exportent *ep, char *flname, int flline)
+{
+ (void)ep;
+ (void)flname;
+ (void)flline;
+ return 0;
+}
+static inline int reexpdb_add_subvolume(char *path)
+{
+ (void)path;
+ return 0;
+}
+static inline int reexpdb_uncover_subvolumes(void (*cb)(char *path))
+{
+ (void)cb;
+ return 0;
+}
+static inline int reexpdb_drop_subvolume_unlocked(char *path)
+{
+ (void)path;
+ return 0;
+}
+#endif /* HAVE_REEXPORT_SUPPORT */
+
+#endif /* REEXPORT_H */
--
2.31.1

2022-02-17 19:34:10

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

On Thu, Feb 17, 2022 at 06:27:15PM +0100, Richard Weinberger wrote:
> > Von: "bfields" <[email protected]>
> > Thanks, I'll try to take a look.
> >
> > Before upstreaming, I would like us to pick just one. These kind of
> > options tend to complicate testing and documentation and debugging.
> >
> > For an RFC, though, I think it makes sense, so I'm fine with keeping
> > "reexport=" while we're still exploring the different options. And,
> > hey, maybe we end up adding more than one after we've upstreamed the
> > first one.
>
> Which one do you prefer?
> "predefined-fsidnum" should be the safest one to start.

I don't know! I'll have to do some more reading and think about it.

> > Setting the timeout to 0 doesn't help with re-export server reboots.
> > After a reboot is another case where we could end up in a situation
> > where a client hands us a filehandle for a filesystem that isn't mounted
> > yet.
> >
> > I think you want to keep a path with each entry in the database. When
> > mountd gets a request for a filesystem it hasn't seen before, it stats
> > that path, which should trigger the automounts.
>
> I have implemented that already. This feature is part of this series. :-)

Oh, good, got it. It'll take me some time to catch up.

--b.

2022-02-17 19:47:34

by Richard Weinberger

[permalink] [raw]
Subject: [RFC PATCH 4/6] export: Record mounted volumes

As soon as a client mounts a volume, record it in the database
to be able to uncover NFS subvolumes after a reboot.

Signed-off-by: Richard Weinberger <[email protected]>
---
support/export/cache.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/support/export/cache.c b/support/export/cache.c
index 6039745e..b5763b1d 100644
--- a/support/export/cache.c
+++ b/support/export/cache.c
@@ -967,8 +967,10 @@ static void nfsd_fh(int f)
* line.
*/
qword_addint(&bp, &blen, 0x7fffffff);
- if (found)
+ if (found) {
+ reexpdb_add_subvolume(found_path);
qword_add(&bp, &blen, found_path);
+ }
qword_addeol(&bp, &blen);
if (blen <= 0 || cache_write(f, buf, bp - buf) != bp - buf)
xlog(L_ERROR, "nfsd_fh: error writing reply");
--
2.31.1

2022-02-17 20:43:35

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

----- Original Message -----
> Von: "bfields" <[email protected]>
>> The reason why setting the timeout to 0 is still needed is because
>> when mountd uncovers a subvolume but no client uses a filehandle for it,
>> it is not locked and can be unmounted later.
>> Only when nfsd sees a matching filehandle the reference counter will
>> be increased and umounting is no longer possible.
>
> I understand that. But, then if a client comes in with a matching
> filehandle, mountd should be able to look up the filesystem and trigger
> a new mount, right?

This does not match what I saw in my experiments. The handle matched only
when the subvolume was mounted before.
But now I understand your point; in theory it should work.
I'll look into this.

Thanks,
//richard

2022-02-17 23:02:58

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

On Thu, Feb 17, 2022 at 02:15:25PM +0100, Richard Weinberger wrote:
> This is the second iteration of the NFS re-export improvement series for nfs-utils.
> While the kernel side didn't change at all and is still small,
> the userspace side saw much more changes.
> Please note that this is still an RFC, there is room for improvement.
>
> The core idea is adding a new export option: reexport=
> Using reexport= it is possible to mark an export entry in the exports file
> explicitly as NFS re-export and select a strategy how unique identifiers
> should be provided.
> "remote-devfsid" is the strategy I have proposed in my first patch,
> I understand that this one is dangerous. But I still find it useful in some
> situations.
> "auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
> backend to keep track of generated ids.
> For a more detailed description see patch "exports: Implement new export option reexport=".

Thanks, I'll try to take a look.

Before upstreaming, I would like us to pick just one. These kind of
options tend to complicate testing and documentation and debugging.

For an RFC, though, I think it makes sense, so I'm fine with keeping
"reexport=" while we're still exploring the different options. And,
hey, maybe we end up adding more than one after we've upstreamed the
first one.

> I choose SQLite because nfs-utils already uses it and using SQL ids can nicely
> generated and maintained. It will also scale for large setups where the amount
> of subvolumes is high.
>
> Beside of id generation this series also addresses the reboot problem.
> If the re-exporting NFS server reboots, uncovered NFS subvolumes are not yet
> mounted and file handles become stale.
> Now mountd/exportd keeps track of uncovered subvolumes and makes sure they get
> uncovered while nfsd starts.
>
> The whole set of features is currently opt-in via --enable-reexport.
> I'm also not sure about the rearrangement of the reexport code,
> currently it is a helper library.
>
> Please let me know whether you like this approach.
> If so I'd tidy it up and submit it as non-RFC.
>
> TODOs/Open questions:
> - When re-exporting, fs.nfs.nfs_mountpoint_timeout should be set to 0
> to make subvolumes not vanish.
> Is this something exportfs should do automatically when it sees an export entry with a reexport= option?

Setting the timeout to 0 doesn't help with re-export server reboots.
After a reboot is another case where we could end up in a situation
where a client hands us a filehandle for a filesystem that isn't mounted
yet.

I think you want to keep a path with each entry in the database. When
mountd gets a request for a filesystem it hasn't seen before, it stats
that path, which should trigger the automounts.

And it'd be good to have a test case with a client (Linux client or
pynfs) that, say, opens a file several mounts deep, then reboots the
reexport server, then tries to, say, write to the file descriptor after
the reboot. (Or maybe there's a way to force the mounts to expire as a
shortcut instead of doing a full reboot.)

> - exportd saw only minimal testing so far, I wasn't aware of it yet. :-S
> - Currently there is no way to release the shared memory which contains the database lock.
> I guess it could be released via exportfs -f, which is the very last exec in nfs-server.service
> - Add a tool to import/export entries from the reexport database which obeys the shared lock.
> - When doing v4->v4 or v3->v4 re-exports very first read access to a file block a few seconds until
> the client does a retransmit.
> v3->v3 works fine. More investigation needed.

Might want to strace mountd and look at the communication over the
/proc/fs/nfsd/*/channel files, maybe mountd is failing to respond to an
upcall.

--b.


2022-02-17 23:33:38

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

On Thu, Feb 17, 2022 at 09:15:38PM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "bfields" <[email protected]>
> >> Which one do you prefer?
> >> "predefined-fsidnum" should be the safest one to start.
> >
> > I don't know! I'll have to do some more reading and think about it.
>
> No need to worry, take your time.
>
> >> > Setting the timeout to 0 doesn't help with re-export server reboots.
> >> > After a reboot is another case where we could end up in a situation
> >> > where a client hands us a filehandle for a filesystem that isn't mounted
> >> > yet.
> >> >
> >> > I think you want to keep a path with each entry in the database. When
> >> > mountd gets a request for a filesystem it hasn't seen before, it stats
> >> > that path, which should trigger the automounts.
> >>
> >> I have implemented that already. This feature is part of this series. :-)
> >
> > Oh, good, got it. It'll take me some time to catch up.
>
> The reason why setting the timeout to 0 is still needed is that
> when mountd uncovers a subvolume but no client uses a filehandle on it,
> it is not locked and can be unmounted later.
> Only when nfsd sees a matching filehandle is the reference counter
> increased, and unmounting is no longer possible.

I understand that. But, then if a client comes in with a matching
filehandle, mountd should be able to look up the filesystem and trigger
a new mount, right?

I can imagine that setting the timeout to 0 might be an optimization,
but I'm not seeing why it's required for correct behavior.

--b.

2022-02-17 23:34:34

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

----- Original Message -----
> From: "bfields" <[email protected]>
>> Which one do you prefer?
>> "predefined-fsidnum" should be the safest one to start.
>
> I don't know! I'll have to do some more reading and think about it.

No need to worry, take your time.

>> > Setting the timeout to 0 doesn't help with re-export server reboots.
>> > After a reboot is another case where we could end up in a situation
>> > where a client hands us a filehandle for a filesystem that isn't mounted
>> > yet.
>> >
>> > I think you want to keep a path with each entry in the database. When
>> > mountd gets a request for a filesystem it hasn't seen before, it stats
>> > that path, which should trigger the automounts.
>>
>> I have implemented that already. This feature is part of this series. :-)
>
> Oh, good, got it. It'll take me some time to catch up.

The reason why setting the timeout to 0 is still needed is that
when mountd uncovers a subvolume but no client uses a filehandle on it,
it is not locked and can be unmounted later.
Only when nfsd sees a matching filehandle is the reference counter
increased, and unmounting is no longer possible.

Thanks,
//richard

2022-02-17 23:53:59

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

Bruce,

----- Original Message -----
> From: "bfields" <[email protected]>
> To: "richard" <[email protected]>
> CC: "linux-nfs" <[email protected]>, "david" <[email protected]>, "luis turcitu"
> <[email protected]>, "david young" <[email protected]>, "david oberhollenzer"
> <[email protected]>, "trond myklebust" <[email protected]>, "anna schumaker"
> <[email protected]>, "chris chilvers" <[email protected]>
> Sent: Thursday, February 17, 2022 17:33:32
> Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

> On Thu, Feb 17, 2022 at 02:15:25PM +0100, Richard Weinberger wrote:
>> This is the second iteration of the NFS re-export improvement series for
>> nfs-utils.
>> While the kernel side didn't change at all and is still small,
>> the userspace side saw many more changes.
>> Please note that this is still an RFC, there is room for improvement.
>>
>> The core idea is adding a new export option: reexport=
>> Using reexport= it is possible to mark an export entry in the exports file
>> explicitly as an NFS re-export and select a strategy for how unique identifiers
>> should be provided.
>> "remote-devfsid" is the strategy I have proposed in my first patch,
>> I understand that this one is dangerous. But I still find it useful in some
>> situations.
>> "auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
>> backend to keep track of generated ids.
>> For a more detailed description see patch "exports: Implement new export option
>> reexport=".
>
> Thanks, I'll try to take a look.
>
> Before upstreaming, I would like us to pick just one. These kinds of
> options tend to complicate testing and documentation and debugging.
>
> For an RFC, though, I think it makes sense, so I'm fine with keeping
> "reexport=" while we're still exploring the different options. And,
> hey, maybe we end up adding more than one after we've upstreamed the
> first one.

Which one do you prefer?
"predefined-fsidnum" should be the safest one to start.

>> - When re-exporting, fs.nfs.nfs_mountpoint_timeout should be set to 0
>> to make subvolumes not vanish.
>> Is this something exportfs should do automatically when it sees an export entry
>> with a reexport= option?
>
> Setting the timeout to 0 doesn't help with re-export server reboots.
> After a reboot is another case where we could end up in a situation
> where a client hands us a filehandle for a filesystem that isn't mounted
> yet.
>
> I think you want to keep a path with each entry in the database. When
> mountd gets a request for a filesystem it hasn't seen before, it stats
> that path, which should trigger the automounts.

I have implemented that already. This feature is part of this series. :-)

> And it'd be good to have a test case with a client (Linux client or
> pynfs) that, say, opens a file several mounts deep, then reboots the
> reexport server, then tries to, say, write to the file descriptor after
> the reboot. (Or maybe there's a way to force the mounts to expire as a
> shortcut instead of doing a full reboot.)

Okay, I'll test that.

>> - exportd saw only minimal testing so far, I wasn't aware of it yet. :-S
>> - Currently there is no way to release the shared memory which contains the
>> database lock.
>> I guess it could be released via exportfs -f, which is the very last exec in
>> nfs-server.service
>> - Add a tool to import/export entries from the reexport database which obeys the
>> shared lock.
>> - When doing v4->v4 or v3->v4 re-exports the very first read access to a file
>> blocks for a few seconds until
>> the client does a retransmit.
>> v3->v3 works fine. More investigation needed.
>
> Might want to strace mountd and look at the communication over the
> /proc/fs/nfsd/*/channel files, maybe mountd is failing to respond to an
> upcall.

Ahh. The upcall might be the issue. Thanks for the pointer.

Thanks,
//richard

2022-02-18 00:04:01

by Richard Weinberger

[permalink] [raw]
Subject: [RFC PATCH 2/6] exports: Implement new export option reexport=

When re-exporting an NFS volume it is mandatory to specify
either a UUID or numerical fsid= option because nfsd is unable
to derive an identifier on its own.

For NFS cross mounts this becomes a problem because nfsd also
needs an identifier for every crossed mount.
A common workaround is stating every single subvolume in the
exports list too.
But this defeats the purpose of the crossmnt option and is tedious.

This is where the reexport= option tries to help.
It offers various strategies to automatically derive an identifier
for NFS volumes and subvolumes.
Each has its pros and cons.

Currently three modes are implemented:

1. auto-fsidnum
In this mode mountd/exportd will create a new numerical fsid
for an NFS volume and its subvolumes. The numbers are stored in a database
such that the server will always use the same fsid.
The entry in the exports file is allowed to omit the fsid= setting,
but stating a UUID is still possible, if needed.

This mode has the obvious downside that load balancing is not
possible since multiple re-exporting NFS servers would generate
different ids.

2. predefined-fsidnum
This mode works just like auto-fsidnum but does not generate ids
for you. It helps in the load balancing case. A system administrator
has to manually maintain the database and install it on all re-exporting
NFS servers. If you have a massive amount of subvolumes this mode
will help because you don't have to bloat the exports list.
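
For illustration, pre-seeding such a database could look like the sketch below. The `fsidnums` schema is taken from patch 1/6; the helper name, the database path under /var/lib/nfs, and the example fsid/path values are assumptions.

```python
import sqlite3

def seed_fsid(db_path, fsid, export_path):
    """Record a predefined fsid number for an export path, creating the
    fsidnums table (layout as in patch 1/6) if it does not exist yet."""
    con = sqlite3.connect(db_path)
    with con:  # commit on success, roll back on error
        con.execute(
            "CREATE TABLE IF NOT EXISTS fsidnums "
            "(num INTEGER PRIMARY KEY CHECK (num > 0 AND num < 4294967296), "
            "path TEXT UNIQUE)")
        con.execute("INSERT INTO fsidnums VALUES (?, ?)", (fsid, export_path))
    con.close()

# hypothetical usage on each re-exporting server:
# seed_fsid("/var/lib/nfs/reexpdb.sqlite3", 101, "/srv/reexport/vol1")
```

Installing the same database on every re-exporting server is what keeps the filehandles consistent for load balancing.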

3. remote-devfsid
If this mode is selected mountd/exportd will derive a UUID from the
re-exported NFS volume's fsid (rfc7530 section-5.8.1.9).
No further local state is needed on the re-exporting server.
The export list entry still needs a fsid= setting because while
parsing the exports file the NFS mounts might not be there yet.
This mode is dangerous; use it only if you're absolutely sure that the
NFS server you're re-exporting has a stable fsid. Chances are good
that it can change.
Since a UUID is derived, re-exporting from NFSv3 to NFSv3 is not
possible. The file handle space is too small.
NFSv3 to NFSv4 works, though.
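
For illustration, an /etc/exports entry using the new option might look like this (the path and client pattern are made up):

```
# /etc/exports on the re-exporting server (hypothetical paths)
# fsids for this tree and its crossed mounts are generated and
# persisted automatically in the SQLite database
/srv/nfs_reexport  *(rw,crossmnt,reexport=auto-fsidnum)
```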

Signed-off-by: Richard Weinberger <[email protected]>
---
support/include/nfslib.h | 1 +
support/nfs/Makefile.am | 1 +
support/nfs/exports.c | 73 ++++++++++++++++++++++++++++++++++++++
utils/exportfs/Makefile.am | 4 +++
utils/mount/Makefile.am | 6 ++++
5 files changed, 85 insertions(+)

diff --git a/support/include/nfslib.h b/support/include/nfslib.h
index 6faba71b..0465a1ff 100644
--- a/support/include/nfslib.h
+++ b/support/include/nfslib.h
@@ -85,6 +85,7 @@ struct exportent {
struct sec_entry e_secinfo[SECFLAVOR_COUNT+1];
unsigned int e_ttl;
char * e_realpath;
+ int e_reexport;
};

struct rmtabent {
diff --git a/support/nfs/Makefile.am b/support/nfs/Makefile.am
index 67e3a8e1..c4357e7d 100644
--- a/support/nfs/Makefile.am
+++ b/support/nfs/Makefile.am
@@ -9,6 +9,7 @@ libnfs_la_SOURCES = exports.c rmtab.c xio.c rpcmisc.c rpcdispatch.c \
svc_socket.c cacheio.c closeall.c nfs_mntent.c \
svc_create.c atomicio.c strlcat.c strlcpy.c
libnfs_la_LIBADD = libnfsconf.la
+libnfs_la_CPPFLAGS = -I$(top_srcdir)/support/reexport

libnfsconf_la_SOURCES = conffile.c xlog.c

diff --git a/support/nfs/exports.c b/support/nfs/exports.c
index 2c8f0752..13129d68 100644
--- a/support/nfs/exports.c
+++ b/support/nfs/exports.c
@@ -31,6 +31,7 @@
#include "xlog.h"
#include "xio.h"
#include "pseudoflavors.h"
+#include "reexport.h"

#define EXPORT_DEFAULT_FLAGS \
(NFSEXP_READONLY|NFSEXP_ROOTSQUASH|NFSEXP_GATHERED_WRITES|NFSEXP_NOSUBTREECHECK)
@@ -103,6 +104,7 @@ static void init_exportent (struct exportent *ee, int fromkernel)
ee->e_nsqgids = 0;
ee->e_uuid = NULL;
ee->e_ttl = default_ttl;
+ ee->e_reexport = REEXP_NONE;
}

struct exportent *
@@ -302,6 +304,26 @@ putexportent(struct exportent *ep)
}
if (ep->e_uuid)
fprintf(fp, "fsid=%s,", ep->e_uuid);
+
+ if (ep->e_reexport) {
+ fprintf(fp, "reexport=");
+ switch (ep->e_reexport) {
+ case REEXP_AUTO_FSIDNUM:
+ fprintf(fp, "auto-fsidnum");
+ break;
+ case REEXP_PREDEFINED_FSIDNUM:
+ fprintf(fp, "predefined-fsidnum");
+ break;
+ case REEXP_REMOTE_DEVFSID:
+ fprintf(fp, "remote-devfsid");
+ break;
+ default:
+ xlog(L_ERROR, "unknown reexport method %i", ep->e_reexport);
+ fprintf(fp, "none");
+ }
+ fprintf(fp, ",");
+ }
+
if (ep->e_mountpoint)
fprintf(fp, "mountpoint%s%s,",
ep->e_mountpoint[0]?"=":"", ep->e_mountpoint);
@@ -538,6 +560,7 @@ parseopts(char *cp, struct exportent *ep, int warn, int *had_subtree_opt_ptr)
char *flname = efname?efname:"command line";
int flline = efp?efp->x_line:0;
unsigned int active = 0;
+ int saw_reexport = 0;

squids = ep->e_squids; nsquids = ep->e_nsquids;
sqgids = ep->e_sqgids; nsqgids = ep->e_nsqgids;
@@ -644,6 +667,13 @@ bad_option:
}
} else if (strncmp(opt, "fsid=", 5) == 0) {
char *oe;
+
+ if (saw_reexport) {
+ xlog(L_ERROR, "%s:%d: 'fsid=' has to come before 'reexport=': %s\n",
+ flname, flline, opt);
+ goto bad_option;
+ }
+
if (strcmp(opt+5, "root") == 0) {
ep->e_fsid = 0;
setflags(NFSEXP_FSID, active, ep);
@@ -688,6 +718,49 @@ bad_option:
active = parse_flavors(opt+4, ep);
if (!active)
goto bad_option;
+ } else if (strncmp(opt, "reexport=", 9) == 0) {
+#ifdef HAVE_REEXPORT_SUPPORT
+ char *strategy = strchr(opt, '=');
+
+ if (!strategy) {
+ xlog(L_ERROR, "%s:%d: bad option %s\n",
+ flname, flline, opt);
+ goto bad_option;
+ }
+ strategy++;
+
+ if (saw_reexport) {
+ xlog(L_ERROR, "%s:%d: only one 'reexport=' is allowed: %s\n",
+ flname, flline, opt);
+ goto bad_option;
+ }
+
+ if (strcmp(strategy, "auto-fsidnum") == 0) {
+ ep->e_reexport = REEXP_AUTO_FSIDNUM;
+ } else if (strcmp(strategy, "predefined-fsidnum") == 0) {
+ ep->e_reexport = REEXP_PREDEFINED_FSIDNUM;
+ } else if (strcmp(strategy, "remote-devfsid") == 0) {
+ ep->e_reexport = REEXP_REMOTE_DEVFSID;
+ } else if (strcmp(strategy, "none") == 0) {
+ ep->e_reexport = REEXP_NONE;
+ } else {
+ xlog(L_ERROR, "%s:%d: bad option %s\n",
+ flname, flline, strategy);
+ goto bad_option;
+ }
+
+ if (reexpdb_apply_reexport_settings(ep, flname, flline) != 0)
+ goto bad_option;
+
+ if (ep->e_fsid)
+ setflags(NFSEXP_FSID, active, ep);
+
+ saw_reexport = 1;
+#else
+ xlog(L_ERROR, "%s:%d: 'reexport=' not available, rebuild with --enable-reexport\n",
+ flname, flline);
+ goto bad_option;
+#endif
} else {
xlog(L_ERROR, "%s:%d: unknown keyword \"%s\"\n",
flname, flline, opt);
diff --git a/utils/exportfs/Makefile.am b/utils/exportfs/Makefile.am
index 96524c72..9eabef14 100644
--- a/utils/exportfs/Makefile.am
+++ b/utils/exportfs/Makefile.am
@@ -12,4 +12,8 @@ exportfs_LDADD = ../../support/export/libexport.a \
../../support/misc/libmisc.a \
$(LIBWRAP) $(LIBNSL) $(LIBPTHREAD)

+if CONFIG_REEXPORT
+exportfs_LDADD += ../../support/reexport/libreexport.a $(LIBSQLITE) -lrt
+endif
+
MAINTAINERCLEANFILES = Makefile.in
diff --git a/utils/mount/Makefile.am b/utils/mount/Makefile.am
index 3101f7ab..f4d5b182 100644
--- a/utils/mount/Makefile.am
+++ b/utils/mount/Makefile.am
@@ -32,6 +32,12 @@ mount_nfs_LDADD = ../../support/nfs/libnfs.la \
../../support/misc/libmisc.a \
$(LIBTIRPC)

+if CONFIG_REEXPORT
+mount_nfs_LDADD += ../../support/reexport/libreexport.a \
+ $(LIBSQLITE) -lrt $(LIBPTHREAD)
+endif
+
+
mount_nfs_SOURCES = $(mount_common)

if CONFIG_LIBMOUNT
--
2.31.1

2022-03-07 10:04:26

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

Bruce,

----- Original Message -----
> From: "bfields" <[email protected]>

> On Thu, Feb 17, 2022 at 02:15:25PM +0100, Richard Weinberger wrote:
>> This is the second iteration of the NFS re-export improvement series for
>> nfs-utils.
>> While the kernel side didn't change at all and is still small,
>> the userspace side saw many more changes.
>> Please note that this is still an RFC, there is room for improvement.
>>
>> The core idea is adding a new export option: reexport=
>> Using reexport= it is possible to mark an export entry in the exports file
>> explicitly as an NFS re-export and select a strategy for how unique identifiers
>> should be provided.
>> "remote-devfsid" is the strategy I have proposed in my first patch,
>> I understand that this one is dangerous. But I still find it useful in some
>> situations.
>> "auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
>> backend to keep track of generated ids.
>> For a more detailed description see patch "exports: Implement new export option
>> reexport=".
>
> Thanks, I'll try to take a look.

Did you find some cycles to think about which approach you prefer?

Thanks,
//richard

2022-03-08 04:41:21

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

On Mon, Mar 07, 2022 at 10:25:52AM +0100, Richard Weinberger wrote:
> Did you find some cycles to think about which approach you prefer?

I haven't. I'll try to take a look in the next couple days. Thanks for
the reminder.

--b.

2022-03-08 23:28:40

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] exports: Implement new export option reexport=

On Thu, Feb 17, 2022 at 02:15:27PM +0100, Richard Weinberger wrote:
> When re-exporting an NFS volume it is mandatory to specify
> either a UUID or numerical fsid= option because nfsd is unable
> to derive an identifier on its own.
>
> For NFS cross mounts this becomes a problem because nfsd also
> needs an identifier for every crossed mount.
> A common workaround is stating every single subvolume in the
> exports list too.
> But this defeats the purpose of the crossmnt option and is tedious.
>
> This is where the reexport= option tries to help.
> It offers various strategies to automatically derive an identifier
> for NFS volumes and subvolumes.
> Each has its pros and cons.
>
> Currently three modes are implemented:
>
> 1. auto-fsidnum
> In this mode mountd/exportd will create a new numerical fsid
> for an NFS volume and its subvolumes. The numbers are stored in a database
> such that the server will always use the same fsid.
> The entry in the exports file is allowed to omit the fsid= setting,
> but stating a UUID is still possible, if needed.
>
> This mode has the obvious downside that load balancing is not
> possible since multiple re-exporting NFS servers would generate
> different ids.

This is the one I think it makes sense to concentrate on first. Ideally
it should Just Work without requiring any configuration.

And then eventually my hope is that we could replace sqlite by a
distributed database to get filehandles that are consistent across
multiple servers.

>
> 2. predefined-fsidnum
> This mode works just like auto-fsidnum but does not generate ids
> for you. It helps in the load balancing case. A system administrator
> has to manually maintain the database and install it on all re-exporting
> NFS servers. If you have a massive amount of subvolumes this mode
> will help because you don't have to bloat the exports list.

OK, I can see that being sort of useful but it'd be nice if we could
start with something more automatic.

> 3. remote-devfsid
> If this mode is selected mountd/exportd will derive a UUID from the
> re-exported NFS volume's fsid (rfc7530 section-5.8.1.9).

How does the server take a filehandle with a UUID in it and map that
UUID back to the original fsid?

> No further local state is needed on the re-exporting server.
> The export list entry still needs a fsid= setting because while
> parsing the exports file the NFS mounts might not be there yet.

I don't understand that bit.

> This mode is dangerous; use it only if you're absolutely sure that the
> NFS server you're re-exporting has a stable fsid. Chances are good
> that it can change.

The fsid should be stable.

The case I'm worried about is the case where we're reexporting exports
from multiple servers. Then there's nothing preventing the two servers
from accidentally picking the same fsid to represent different exports.

--b.

> Since a UUID is derived, re-exporting from NFSv3 to NFSv3 is not
> possible. The file handle space is too small.
> NFSv3 to NFSv4 works, though.

2022-03-08 23:33:20

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Implement reexport helper library

On Thu, Feb 17, 2022 at 02:15:26PM +0100, Richard Weinberger wrote:
> +#define REEXPDB_SHM_NAME "/nfs_reexport_db_lock"
> +#define REEXPDB_SHM_SZ 4096
> +#define REEXPDB_INIT_LOCK NFS_STATEDIR "/reexpdb_init.lock"
> +#define REEXPDB_DBFILE NFS_STATEDIR "/reexpdb.sqlite3"

I don't know much about sqlite--why do we need to do our own file
locking? If we do need to do it ourselves, could we lock the database
file instead of using a separate lock file?

> +static const char initdb_sql[] = "CREATE TABLE IF NOT EXISTS fsidnums (num INTEGER PRIMARY KEY CHECK (num > 0 AND num < 4294967296), path TEXT UNIQUE); CREATE TABLE IF NOT EXISTS subvolumes (path TEXT PRIMARY KEY); CREATE INDEX IF NOT EXISTS idx_ids_path ON fsidnums (path);";

I'd personally find it easier to read if these were defined in the place
where they're used. (And, honestly, if this is just used once, maybe
the definition is unnecessary.)

What are the two tables used for? Naively I'd've thought the
"subvolumes" table was redundant.

> +/*
> + * This query is a little tricky. We use SQL to find and claim the smallest free fsid number.

Yes, that is a little tricky. Is it necessary? My SQL is rusty, but
the database should be able to pick a unique value for us, shouldn't it?

> + * To find a free fsid the fsidnums is left joined to itself but with an offset of 1.
> + * Everything after the UNION statement is to handle the corner case where fsidnums
> + * is empty. In this case we want 1 as first fsid number.
> + */
> +static const char new_fsidnum_by_path_sql[] = "INSERT INTO fsidnums VALUES ((SELECT ids1.num + 1 FROM fsidnums AS ids1 LEFT JOIN fsidnums AS ids2 ON ids2.num = ids1.num + 1 WHERE ids2.num IS NULL UNION SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM fsidnums WHERE num = 1) LIMIT 1), ?1) RETURNING num;";
> +static const char fsidnum_by_path_sql[] = "SELECT num FROM fsidnums WHERE path = ?1;";
> +static const char add_crossed_volume_sql[] = "REPLACE INTO subvolumes VALUES(?1);";
> +static const char drop_crossed_volume_sql[] = "DELETE FROM subvolumes WHERE path = ?1;";
> +static const char get_crossed_volumes_sql[] = "SELECT path from subvolumes;";
...
> +/*
> + * reexpdb_init - Initialize reexport database
> + *
> + * Setup shared lock (database is concurrently used by multiple processes),

So, this should all work when rpc.mountd is run with --num_threads > 1?

--b.

2022-03-09 12:08:21

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Implement reexport helper library

Bruce,

----- Original Message -----
> From: "bfields" <[email protected]>
> On Thu, Feb 17, 2022 at 02:15:26PM +0100, Richard Weinberger wrote:
>> +#define REEXPDB_SHM_NAME "/nfs_reexport_db_lock"
>> +#define REEXPDB_SHM_SZ 4096
>> +#define REEXPDB_INIT_LOCK NFS_STATEDIR "/reexpdb_init.lock"
>> +#define REEXPDB_DBFILE NFS_STATEDIR "/reexpdb.sqlite3"
>
> I don't know much about sqlite--why do we need to do our own file
> locking? If we do need to do it ourself, could we lock the database
> file instead instead of using a separate lock file?

Concurrent access to the database is synchronized using a shared rwlock (on shared memory).
reexpdb_init.lock is used to make sure that creating and initializing the shared memory/lock
happens once.

>> +static const char initdb_sql[] = "CREATE TABLE IF NOT EXISTS fsidnums (num
>> INTEGER PRIMARY KEY CHECK (num > 0 AND num < 4294967296), path TEXT UNIQUE);
>> CREATE TABLE IF NOT EXISTS subvolumes (path TEXT PRIMARY KEY); CREATE INDEX IF
>> NOT EXISTS idx_ids_path ON fsidnums (path);";
>
> I'd personally find it easier to read if these were defined in the place
> where they're used. (And, honestly, if this is just used once, maybe
> the definition is unnecessary.)

Ok.

> What are the two tables used for? Naively I'd've thought the
> "subvolumes" table was redundant.

fsidnums is used to store generated and predefined fsid numbers.
It is only used in reexport modes auto-fsidnum and predefined-fsidnum.

subvolumes contains a list of subvolumes which are likely in use by
a client. Upon start, all these paths will get touched such that they can
be exported.

>> +/*
>> + * This query is a little tricky. We use SQL to find and claim the smallest
>> free fsid number.
>
> Yes, that is a little tricky. Is it necessary? My SQL is rusty, but
> the database should be able to pick a unique value for us, shouldn't it?

SQLite can generate a unique value, but we cannot select the range.
It will give a value between 0 and 2^64.
We need an id between 1 and 2^32.
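
The smallest-free-id trick can be demonstrated in isolation. The sketch below uses the same fsidnums schema, but splits the lookup and the INSERT into two statements inside one transaction (with an explicit ORDER BY) instead of the patch's single INSERT ... RETURNING, so it also runs on SQLite builds without RETURNING support; the function name is made up.

```python
import sqlite3

SMALLEST_FREE_SQL = (
    # left-join the table to itself shifted by one: a row whose num+1
    # has no matching row marks a gap; the UNION branch offers 1 as a
    # candidate when the table is empty or 1 itself is free
    "SELECT ids1.num + 1 FROM fsidnums AS ids1 "
    "LEFT JOIN fsidnums AS ids2 ON ids2.num = ids1.num + 1 "
    "WHERE ids2.num IS NULL "
    "UNION SELECT 1 WHERE NOT EXISTS "
    "(SELECT NULL FROM fsidnums WHERE num = 1) "
    "ORDER BY 1 LIMIT 1")

def claim_fsid(con, path):
    """Find the smallest unused fsid number and claim it for path."""
    with con:  # one transaction for the lookup plus the insert
        num = con.execute(SMALLEST_FREE_SQL).fetchone()[0]
        con.execute("INSERT INTO fsidnums VALUES (?, ?)", (num, path))
    return num

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fsidnums (num INTEGER PRIMARY KEY "
            "CHECK (num > 0 AND num < 4294967296), path TEXT UNIQUE)")
print(claim_fsid(con, "/a"))  # 1 on an empty table
con.execute("INSERT INTO fsidnums VALUES (3, '/c')")
print(claim_fsid(con, "/b"))  # 2: the gap before 3 is reused
```

Note the two-statement form opens a small window between lookup and insert that the patch's single INSERT ... RETURNING avoids; the UNIQUE constraint would still catch a collision.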

>> + * To find a free fsid the fsidnums is left joined to itself but with an offset
>> of 1.
>> + * Everything after the UNION statement is to handle the corner case where
>> fsidnums
>> + * is empty. In this case we want 1 as first fsid number.
>> + */
>> +static const char new_fsidnum_by_path_sql[] = "INSERT INTO fsidnums VALUES
>> ((SELECT ids1.num + 1 FROM fsidnums AS ids1 LEFT JOIN fsidnums AS ids2 ON
>> ids2.num = ids1.num + 1 WHERE ids2.num IS NULL UNION SELECT 1 WHERE NOT EXISTS
>> (SELECT NULL FROM fsidnums WHERE num = 1) LIMIT 1), ?1) RETURNING num;";
>> +static const char fsidnum_by_path_sql[] = "SELECT num FROM fsidnums WHERE path
>> = ?1;";
>> +static const char add_crossed_volume_sql[] = "REPLACE INTO subvolumes
>> VALUES(?1);";
>> +static const char drop_crossed_volume_sql[] = "DELETE FROM subvolumes WHERE
>> path = ?1;";
>> +static const char get_crossed_volumes_sql[] = "SELECT path from subvolumes;";
> ...
>> +/*
>> + * reexpdb_init - Initialize reexport database
>> + *
>> + * Setup shared lock (database is concurrently used by multiple processes),
>
> So, this should all work when rpc.mountd is run with --num_threads > 1?

Yes, that's why we need the shared rwlock.

Thanks,
//richard

2022-03-09 12:19:19

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] exports: Implement new export option reexport=

Bruce,

----- Original Message -----
> From: "bfields" <[email protected]>
>> 1. auto-fsidnum
>> In this mode mountd/exportd will create a new numerical fsid
>> for an NFS volume and its subvolumes. The numbers are stored in a database
>> such that the server will always use the same fsid.
>> The entry in the exports file is allowed to omit the fsid= setting,
>> but stating a UUID is still possible, if needed.
>>
>> This mode has the obvious downside that load balancing is not
>> possible since multiple re-exporting NFS servers would generate
>> different ids.
>
> This is the one I think it makes sense to concentrate on first. Ideally
> it should Just Work without requiring any configuration.

Agreed.

> And then eventually my hope is that we could replace sqlite by a
> distributed database to get filehandles that are consistent across
> multiple servers.

Sure. I see at least two options here:

a. Allow multiple SQL backends in nfs-utils. SQLite by default, but also remote MariaDB
or Postgres...

b. Placing the SQLite database on a shared file system that is capable of file locks.
That way we can use SQLite as-is. We just need to handle the SQLITE_LOCKED case in the code.
Luckily, writes happen seldom, so this shouldn't be a big deal.

>>
>> 2. predefined-fsidnum
>> This mode works just like auto-fsidnum but does not generate ids
>> for you. It helps in the load balancing case. A system administrator
>> has to manually maintain the database and install it on all re-exporting
>> NFS servers. If you have a massive amount of subvolumes this mode
>> will help because you don't have to bloat the exports list.
>
> OK, I can see that being sort of useful but it'd be nice if we could
> start with something more automatic.
>
>> 3. remote-devfsid
>> If this mode is selected mountd/exportd will derive a UUID from the
>> re-exported NFS volume's fsid (rfc7530 section-5.8.1.9).
>
> How does the server take a filehandle with a UUID in it and map that
> UUID back to the original fsid?

knfsd does not need the original fsid. All it sees is the UUID.
If it needs to know which export belongs to a UUID it asks mountd.
In mountd the regular UUID lookup is used then.

>> No further local state is needed on the re-exporting server.
>> The export list entry still needs a fsid= setting because while
>> parsing the exports file the NFS mounts might not be there yet.
>
> I don't understand that bit.

I tried to explain that with this mode we don't need to store UUIDs or
fsids on disk.

>> This mode is dangerous; use it only if you're absolutely sure that the
>> NFS server you're re-exporting has a stable fsid. Chances are good
>> that it can change.
>
> The fsid should be stable.

Didn't you explain to me last time that it is not?
By fsid I mean:
https://datatracker.ietf.org/doc/html/rfc7530#section-5.8.1.9
https://datatracker.ietf.org/doc/html/rfc7530#section-2.2.5

So after a reboot the very same filesystem could be on different
disks and the major/minor tuple is different. (If the server uses disk ids
as is).

> The case I'm worried about is the case where we're reexporting exports
> from multiple servers. Then there's nothing preventing the two servers
> from accidentally picking the same fsid to represent different exports.

That's a good point. Since /proc/fs/nfsfs/volumes shows all that information
we can add sanity checks to mountd.

Thanks,
//richard

2022-03-09 16:13:22

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Implement reexport helper library

Bruce,

----- Original Message -----
> From: "bfields" <[email protected]>
>> Concurrent access to the database is synchronized using a shared rwlock (on
>> shared memory).
>> reexpdb_init.lock is used to make sure that creating and initializing the shared
>> memory/lock
>> happens once.
>
> Could you point me to sqlite documentation that explains why the user
> would need to do their own locking?

https://www.sqlite.org/rescode.html#busy

> I assumed sqlite would do any necessary locking for you. It seems like
> a core function for a database.

Well, SQLite does locking but no queuing.
So, as soon as somebody is writing the database it is locked and all other
reads/writes will fail with either SQLITE_BUSY or SQLITE_LOCKED.
It is up to the user how to react to that.

That's why I chose to use a shared rwlock where a task can *wait* upon
conflicting access.

Maybe there is a better way to do it, dunno.

>> > What are the two tables used for? Naively I'd've thought the
>> > "subvolumes" table was redundant.
>>
>> fsidnums is used to store generated and predefined fsid numbers.
>> It is only used in reexport modes auto-fsidnum and predefined-fsidnum.
>>
>> subvolumes contains a list of subvolumes which a are likely in use by
>> a client. Up start all these paths will get touched such that they can
>> be exported.
>
> The fsidnums also contains that same list of paths, right? So I don't
> understand why we need both.

In the current design generated fsidnums will stay forever while the paths
in subvolumes can get cleaned.

> Also, if we're depending on touching all the paths on startup, something
> is wrong.

I think we talked about that already and agreed that it should work without
touching. So far I didn't have a chance to investigate this.

> What we want to do is touch the path when we get an upcall for the given
> fsid. That way we don't have to assume, for example, that the system
> will never expire mounts that haven't been used recently.
>
>> >> +/*
>> >> + * This query is a little tricky. We use SQL to find and claim the smallest
>> >> free fsid number.
>> >
>> > Yes, that is a little tricky. Is it necessary? My SQL is rusty, but
>> > the database should be able to pick a unique value for us, shouldn't it?
>>
>> SQLite can generate a unique value, but we cannot select the range.
>> It will give a value between 0 and 2^64.
>> We need an id between 1 and 2^32.
>
> Surely that CHECK constraint doesn't somehow cause sqlite to generate
> non-unique primary keys? At worst I'd think it would cause INSERTs to
> fail if the ordinary primary-key-choosing algorithm chooses something
> over 2^32.

The CHECK is just a paranoid check. My SQL INSERT generates ids starting with 1.
Sure, if you run it 2^32 times, it will fail due to the CHECK.

Thanks,
//richard

2022-03-09 16:13:26

by J. Bruce Fields

Subject: Re: [RFC PATCH 1/6] Implement reexport helper library

On Wed, Mar 09, 2022 at 04:02:11PM +0100, Richard Weinberger wrote:
> Bruce,
>
> ----- Original Message -----
> > From: "bfields" <[email protected]>
> >> Concurrent access to the database is synchronized using a shared rwlock (on
> >> shared memory).
> >> reexpdb_init.lock is used to make sure that creating and initializing the shared
> >> memory/lock
> >> happens once.
> >
> > Could you point me to sqlite documentation that explains why the user
> > would need to do their own locking?
>
> https://www.sqlite.org/rescode.html#busy
>
> > I assumed sqlite would do any necessary locking for you. It seems like
> > a core function for a database.
>
> Well, SQLite does locking but no queuing.
> So, as soon as somebody is writing the database it is locked and all other
> reads/writes will fail with either SQLITE_BUSY or SQLITE_LOCKED.
> It is up to the user how to react to that.
>
> That's why I chose to use a shared rwlock where a task can *wait* upon
> conflicting access.
>
> Maybe there is a better way to do it, dunno.

Oh, got it, thanks for the explanation.

Assuming writes are rare, maybe a dumb retry loop would be adequate.
Sounds like that's what we'd need anyway if we were to share the
database between cooperating re-export servers. (Would we have a
performance problem in that case, if several reexport servers start at
once and all start trying to populate the shared database? I don't
know.)

Anyway, it's a judgement call, fair enough. Might be worth a brief
comment, at least.

> >> > What are the two tables used for? Naively I'd've thought the
> >> > "subvolumes" table was redundant.
> >>
> >> fsidnums is used to store generated and predefined fsid numbers.
> >> It is only used in reexport modes auto-fsidnum and predefined-fsidnum.
> >>
> >> subvolumes contains a list of subvolumes which are likely in use by
> >> a client. Upon start all these paths will get touched such that they can
> >> be exported.
> >
> > The fsidnums also contains that same list of paths, right? So I don't
> > understand why we need both.
>
> In the current design generated fsidnums will stay forever while the paths
> in subvolumes can get cleaned.
>
> > Also, if we're depending on touching all the paths on startup, something
> > is wrong.
>
> I think we talked about that already and agreed that it should work without
> touching. So far I didn't have a chance to investigate this.

OK. Do you think you could look into that, and strip this down to the
one auto-fsidnum case, and then continue the discussion? I think that'd
clarify things.

As I say, I wouldn't necessarily be opposed to later adding a reexport=
option back in, but for now I'd first like to see if we can find the
simplest patches that will solve the problem in one good-enough way.

> > What we want to do is touch the path when we get an upcall for the given
> > fsid. That way we don't have to assume, for example, that the system
> > will never expire mounts that haven't been used recently.
> >
> >> >> +/*
> >> >> + * This query is a little tricky. We use SQL to find and claim the smallest
> >> >> free fsid number.
> >> >
> >> > Yes, that is a little tricky. Is it necessary? My SQL is rusty, but
> >> > the database should be able to pick a unique value for us, shouldn't it?
> >>
> >> SQLite can generate a unique value, but we cannot select the range.
> >> It will give a value between 0 and 2^64.
> >> We need an id between 1 and 2^32.
> >
> > Surely that CHECK constraint doesn't somehow cause sqlite to generate
> > non-unique primary keys? At worst I'd think it would cause INSERTs to
> > fail if the ordinary primary-key-choosing algorithm chooses something
> > over 2^32.
>
> The CHECK is just a paranoid check. My SQL INSERT generates ids starting with 1.
> Sure, if you run it 2^32 times, it will fail due to the CHECK.

OK.

--b.

2022-03-09 16:21:55

by J. Bruce Fields

Subject: Re: [RFC PATCH 1/6] Implement reexport helper library

On Wed, Mar 09, 2022 at 10:43:34AM +0100, Richard Weinberger wrote:
> Bruce,
>
> ----- Original Message -----
> > From: "bfields" <[email protected]>
> > On Thu, Feb 17, 2022 at 02:15:26PM +0100, Richard Weinberger wrote:
> >> +#define REEXPDB_SHM_NAME "/nfs_reexport_db_lock"
> >> +#define REEXPDB_SHM_SZ 4096
> >> +#define REEXPDB_INIT_LOCK NFS_STATEDIR "/reexpdb_init.lock"
> >> +#define REEXPDB_DBFILE NFS_STATEDIR "/reexpdb.sqlite3"
> >
> > I don't know much about sqlite--why do we need to do our own file
> > locking? If we do need to do it ourselves, could we lock the database
> > file instead of using a separate lock file?
>
> Concurrent access to the database is synchronized using a shared rwlock (on shared memory).
> reexpdb_init.lock is used to make sure that creating and initializing the shared memory/lock
> happens once.

Could you point me to sqlite documentation that explains why the user
would need to do their own locking?

I assumed sqlite would do any necessary locking for you. It seems like
a core function for a database.

> > What are the two tables used for? Naively I'd've thought the
> > "subvolumes" table was redundant.
>
> fsidnums is used to store generated and predefined fsid numbers.
> It is only used in reexport modes auto-fsidnum and predefined-fsidnum.
>
> subvolumes contains a list of subvolumes which are likely in use by
> a client. Upon start all these paths will get touched such that they can
> be exported.

The fsidnums also contains that same list of paths, right? So I don't
understand why we need both.

Also, if we're depending on touching all the paths on startup, something
is wrong.

What we want to do is touch the path when we get an upcall for the given
fsid. That way we don't have to assume, for example, that the system
will never expire mounts that haven't been used recently.

> >> +/*
> >> + * This query is a little tricky. We use SQL to find and claim the smallest
> >> free fsid number.
> >
> > Yes, that is a little tricky. Is it necessary? My SQL is rusty, but
> > the database should be able to pick a unique value for us, shouldn't it?
>
> SQLite can generate a unique value, but we cannot select the range.
> It will give a value between 0 and 2^64.
> We need an id between 1 and 2^32.

Surely that CHECK constraint doesn't somehow cause sqlite to generate
non-unique primary keys? At worst I'd think it would cause INSERTs to
fail if the ordinary primary-key-choosing algorithm chooses something
over 2^32.

--b.

2022-04-20 00:44:46

by Steve Dickson

Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports



On 3/7/22 5:29 PM, bfields wrote:
> On Mon, Mar 07, 2022 at 10:25:52AM +0100, Richard Weinberger wrote:
>> Did you find some cycles to think about which approach you prefer?
>
> I haven't. I'll try to take a look in the next couple days. Thanks for
> the reminder.

Is there any movement on this? It would be good to have something
for the upcoming Bakeathon... next week.

steved.

2022-04-21 18:54:15

by Richard Weinberger

Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

----- Original Message -----
> On 3/7/22 5:29 PM, bfields wrote:
>> On Mon, Mar 07, 2022 at 10:25:52AM +0100, Richard Weinberger wrote:
>>> Did you find some cycles to think about which approach you prefer?
>>
>> I haven't. I'll try to take a look in the next couple days. Thanks for
>> the reminder.
>
> Is there any movement on this? It would be good to have something
> for the upcoming Bakeathon... next week.

I'm in the middle of preparing the next version of the series.

Thanks,
//richard