2009-08-05 14:46:19

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

Provide a new implementation of statd that supports IPv6. The new
statd implementation resides under

utils/new-statd/

The contents of this directory are built if --enable-tirpc is set
on the ./configure command line, and sqlite3 is available on the
build system. Otherwise, the legacy version of statd, which still
resides under utils/statd/, is built.

The goals of this re-write are:

o Support IPv6 networking

Support interoperation with TI-RPC-based NSM implementations.
Transport Independent RPC, or TI-RPC, provides IPv6 network support
for Linux's NSM implementation.

To support TI-RPC, open code to construct RPC requests in socket
buffers and then schedule them has been replaced with standard
library calls.

o Support notification via TCP

As a secondary benefit of using TI-RPC library calls, reboot
notifications and NLM callbacks can now be sent via connection-
oriented transport protocols.

Note that lockd does not (yet) tell statd what transport protocol
to use when sending reboot notifications. statd/sm-notify will
continue to use UDP for the time being.

o Use an embedded database for storing on-disk callback data

This whole exercise is for the purpose of crash robustness. There
are well-known deficiencies with simple create/rename/unlink
disk storage schemes during system crashes. Replace the current
flat-file monitor list mechanism which uses sync(2) with sqlite3,
which uses fsync(3).

o Share code between sm-notify and statd

Statd and sm-notify access the same set of on-disk data. These
separate programs now share the same code and implementation, with
access to on-disk data serialized by sqlite3. The two remain
separate executables to allow other system facilities to send
reboot notifications without poking statd.

o Reduce impact of DNS outages

The heuristics used by SM_NOTIFY to figure out which remote peer
has rebooted are heavily dependent on DNS. If the DNS service is
slow or hangs, that will make the NSM listener unresponsive.
Incoming SM_NOTIFY requests are now handled in a sidecar process
to reduce the impact of DNS outages on the NSM service listener.

o Proper my_name support

The current version of statd uses gethostname(3) to generate the
mon_name argument of SM_NOTIFY. This value can change across a
reboot. The new version of statd records lockd's my_name, passed
by every SM_MON request, and uses that when sending SM_NOTIFY.

This can be useful for multi-homed and DHCP configured hosts.

o Send SM_NOTIFY more aggressively

It has been recommended that statd/sm-notify send SM_NOTIFY
more aggressively (for example, to the entire list returned by
getaddrinfo(3)). Since SM_NOTIFY's reply is NULL, there's no
way to tell whether the remote peer recognized the mon_name we
sent. More study is required, but this implementation attempts
to send an SM_NOTIFY request to each address returned by
getaddrinfo(3).

This re-implementation paves the way for a number of future
improvements. However, it does not immediately address:

o lockd/statd start-up serialization issues

Sending reboot notifications, starting statd and lockd, and opening
the lockd grace period are still determined independently in user
space and the kernel.

o Binding mon_names to caller IP addresses

By default, lockd continues to send IP addresses as the mon_name
argument of the SM_MON procedure. This provides a better guarantee
of being able to contact remote peers during a reboot, but means
statd must continue to use heuristics to match incoming SM_NOTIFY
requests with peers on the monitor list.

o Distinct logic for NFS client- and server-side

Client-side and server-side monitoring requirements are different.
Statd continues to use the same logic for both NFS client and
server, as the NSMv1 protocol does not provide any indication
that a mon_name is for a client or server peer.

Signed-off-by: Chuck Lever <[email protected]>
---

utils/new-statd/file.c | 779 ++++++++++++++++++++++++++++++++++++++++++++
utils/new-statd/hostname.c | 520 +++++++++++++++++++++++++++++
utils/new-statd/nlmcall.c | 525 ++++++++++++++++++++++++++++++
3 files changed, 1824 insertions(+), 0 deletions(-)
create mode 100644 utils/new-statd/file.c
create mode 100644 utils/new-statd/hostname.c
create mode 100644 utils/new-statd/nlmcall.c

diff --git a/utils/new-statd/file.c b/utils/new-statd/file.c
new file mode 100644
index 0000000..db54bf7
--- /dev/null
+++ b/utils/new-statd/file.c
@@ -0,0 +1,779 @@
+/*
+ * Copyright 2009 Oracle. All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * Callback information is stored in an sqlite database that provides
+ * transactional semantics, to preserve state in the event of a system
+ * crash, and can easily be updated to store additional data elements
+ * without need for versioning
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <sys/types.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/ipc.h>
+#include <sys/sem.h>
+
+#include <stdint.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <grp.h>
+#include <arpa/inet.h>
+
+#include "statd.h"
+
+#ifdef NFS_STATEDIR
+#define STATD_DEFAULT_STATEDIR NFS_STATEDIR "/statd"
+#else /* !defined(NFS_STATEDIR) */
+#define STATD_DEFAULT_STATEDIR "/var/lib/nfs/statd"
+#endif /* !defined(NFS_STATEDIR) */
+
+#define STATD_DATABASE_FILE "statdb"
+
+static char statd_db_dirname[PATH_MAX];
+static char statd_db_filename[PATH_MAX];
+
+/**
+ * statd_open_db - open a database connection to statd's sqlite3 database
+ * @flags: flags passed to sqlite3_open_v2
+ *
+ * Returns an open database connection; otherwise NULL is returned if
+ * an error occurred.
+ */
+sqlite3 *
+statd_open_db(int flags)
+{
+ sqlite3 *db;
+ int rc;
+
+ rc = sqlite3_open_v2(statd_db_filename, &db, flags, NULL);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to open database in %s: %s",
+ statd_db_filename, sqlite3_errmsg(db));
+ xlog(L_ERROR, "Check that the state directory pathname "
+ "is correct and has proper permissions");
+ (void)sqlite3_close(db);
+ return NULL;
+ }
+
+ /*
+ * Retry SQLITE_BUSY for 100 msec before returning an error.
+ */
+ (void)sqlite3_busy_timeout(db, 100);
+
+ return db;
+}
+
+/**
+ * statd_close_db - close an open sqlite3 database connection
+ * @db: open sqlite3 database connection descriptor
+ *
+ */
+void
+statd_close_db(sqlite3 *db)
+{
+ if (sqlite3_close(db) != SQLITE_OK)
+ xlog(L_ERROR, "Failed to close database: %s",
+ sqlite3_errmsg(db));
+}
+
+/**
+ * statd_prepare_stmt - prepare a C string SQL statement
+ * @db: open database connection
+ * @stmt: OUT: pointer to prepared SQL statement object
+ * @sql: C string containing SQL statement to prepare
+ *
+ * Returns TRUE if the statement was prepared successfully; otherwise
+ * FALSE is returned and an error is logged.
+ */
+bool_t
+statd_prepare_stmt(sqlite3 *db, sqlite3_stmt **stmt, const char *sql)
+{
+ int rc;
+
+ rc = sqlite3_prepare_v2(db, sql, -1, stmt, NULL);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to compile SQL: %s",
+ sqlite3_errmsg(db));
+ xlog(L_ERROR, "SQL: %s", sql);
+ return FALSE;
+ }
+ return TRUE;
+}
+
+/**
+ * statd_finalize_stmt - release a prepared SQL statement
+ * @stmt: sqlite3_stmt to release
+ *
+ */
+void
+statd_finalize_stmt(sqlite3_stmt *stmt)
+{
+ sqlite3 *db = sqlite3_db_handle(stmt);
+ int rc;
+
+ rc = sqlite3_finalize(stmt);
+ switch(rc) {
+ case SQLITE_OK:
+ case SQLITE_ABORT:
+ case SQLITE_CONSTRAINT:
+ break;
+ default:
+ xlog(L_ERROR, "Failed to finalize SQL statement: %s",
+ sqlite3_errmsg(db));
+ }
+}
+
+/**
+ * statd_begin_transaction - start a sqlite3 transaction
+ * @db: open database connection
+ *
+ * Returns TRUE if the transaction was started; otherwise FALSE is
+ * returned if sqlite3 reported an error.
+ */
+bool_t
+statd_begin_transaction(sqlite3 *db)
+{
+ char *err_msg;
+ int rc;
+
+ err_msg = NULL;
+ rc = sqlite3_exec(db, "BEGIN IMMEDIATE TRANSACTION;", NULL, 0, &err_msg);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to start transaction: %s", err_msg);
+ sqlite3_free(err_msg);
+ return FALSE;
+ }
+
+ xlog(D_CALL, "Transaction started");
+ return TRUE;
+}
+
+/**
+ * statd_end_transaction - commit a sqlite3 transaction
+ * @db: open database connection
+ *
+ */
+void
+statd_end_transaction(sqlite3 *db)
+{
+ char *err_msg;
+ int rc;
+
+ err_msg = NULL;
+ rc = sqlite3_exec(db, "COMMIT TRANSACTION;", NULL, 0, &err_msg);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to commit transaction: %s", err_msg);
+ sqlite3_free(err_msg);
+ return;
+ }
+
+ xlog(D_CALL, "Transaction committed");
+}
+
+/**
+ * statd_rollback_transaction - rollback an active sqlite3 transaction
+ * @db: open database connection
+ *
+ */
+void
+statd_rollback_transaction(sqlite3 *db)
+{
+ char *err_msg;
+ int rc;
+
+ err_msg = NULL;
+ rc = sqlite3_exec(db, "ROLLBACK TRANSACTION;", NULL, 0, &err_msg);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to roll back transaction: %s", err_msg);
+ sqlite3_free(err_msg);
+ return;
+ }
+
+ xlog(D_CALL, "Transaction rolled back");
+}
+
+/**
+ * statd_check_pathname - set up pathname
+ * @parentdir: C string containing pathname to on-disk state
+ *
+ * Returns TRUE if pathname was valid; otherwise FALSE is returned.
+ */
+bool_t
+statd_check_pathname(const char *parentdir)
+{
+ struct stat st;
+ int len;
+
+ if (parentdir != NULL) {
+ if (strlen(parentdir) >= PATH_MAX) {
+ xlog(L_ERROR, "Invalid directory name");
+ return FALSE;
+ }
+ } else
+ parentdir = STATD_DEFAULT_STATEDIR;
+
+ if (lstat(parentdir, &st) == -1) {
+ xlog(L_ERROR, "Failed to stat %s: %m", parentdir);
+ return FALSE;
+ }
+
+ if (chdir(parentdir) == -1) {
+ xlog(L_ERROR, "Failed to change working directory to %s: %m",
+ parentdir);
+ return FALSE;
+ }
+
+ strncpy(statd_db_dirname, parentdir, sizeof(statd_db_dirname));
+
+ len = snprintf(statd_db_filename, sizeof(statd_db_filename),
+ "%s/%s", parentdir, STATD_DATABASE_FILE);
+ if (__error_check(len, sizeof(statd_db_filename))) {
+ xlog(L_ERROR, "Illegal directory name");
+ return FALSE;
+ }
+
+ return TRUE;
+}
+
+/*
+ * Returns TRUE if info table creation was successful, or if
+ * info table already exists; otherwise FALSE is returned if
+ * any error occurs.
+ */
+static bool_t
+statd_create_info_table(sqlite3 *db)
+{
+ sqlite3_stmt *stmt;
+ bool_t result;
+ int rc;
+
+ result = FALSE;
+
+ rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_INFO_TABLENAME
+ " (name TEXT PRIMARY KEY,"
+ " value INTEGER);",
+ -1, &stmt, NULL);
+ switch (rc) {
+ case SQLITE_OK:
+ rc = sqlite3_step(stmt);
+ if (rc != SQLITE_DONE) {
+ xlog(L_ERROR, "Failed to create table '"
+ STATD_INFO_TABLENAME "': %s",
+ sqlite3_errmsg(db));
+ break;
+ }
+
+ result = TRUE;
+ xlog(D_GENERAL, "Table '" STATD_INFO_TABLENAME
+ "' created successfully");
+ break;
+ case SQLITE_ERROR:
+ result = TRUE;
+ xlog(D_GENERAL, "Table '" STATD_INFO_TABLENAME "' already exists");
+ goto out;
+ default:
+ xlog(L_ERROR, "Failed to compile SQL: %s",
+ sqlite3_errmsg(db));
+ xlog(L_ERROR, "SQL: %s");
+ goto out;
+ }
+
+ statd_finalize_stmt(stmt);
+
+out:
+ return result;
+}
+
+/*
+ * Returns TRUE if the monitor table exists, or was created; otherwise
+ * FALSE is returned if an error occurs.
+ *
+ * This could be done in svc.c. Since sm-notify links file.o, that
+ * would mean it would have to link svc.c as well.
+ */
+static bool_t
+statd_create_monitor_table(sqlite3 *db)
+{
+ sqlite3_stmt *stmt;
+ bool_t result;
+ int rc;
+
+ result = FALSE;
+
+ rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_MONITOR_TABLENAME
+ " (priv BLOB,"
+ " mon_name TEXT NOT NULL,"
+ " my_name TEXT NOT NULL,"
+ " program INTEGER,"
+ " version INTEGER,"
+ " procedure INTEGER,"
+ " protocol TEXT NOT NULL,"
+ " state INTEGER,"
+ " UNIQUE(mon_name,my_name));",
+ -1, &stmt, NULL);
+ switch (rc) {
+ case SQLITE_OK:
+ rc = sqlite3_step(stmt);
+ if (rc != SQLITE_DONE) {
+ xlog(L_ERROR, "Failed to create table '"
+ STATD_MONITOR_TABLENAME "': %s",
+ sqlite3_errmsg(db));
+ break;
+ }
+
+ result = 1;
+ xlog(D_GENERAL, "Table '" STATD_MONITOR_TABLENAME
+ "' created successfully");
+ break;
+ case SQLITE_ERROR:
+ result = TRUE;
+ xlog(D_GENERAL, "Table '" STATD_MONITOR_TABLENAME
+ "' already exists");
+ goto out;
+ default:
+ xlog(L_ERROR, "Failed to compile SQL: %s",
+ sqlite3_errmsg(db));
+ xlog(L_ERROR, "SQL: %s");
+ goto out;
+ }
+
+ statd_finalize_stmt(stmt);
+
+out:
+ return result;
+}
+
+/*
+ * Ensure database file exists. This should create the file
+ * with the proper ownership. Also ensure tables are created.
+ *
+ * XXX: Is it enough to set journal_mode and synchronous just once?
+ */
+static bool_t
+statd_init_database(void)
+{
+ bool_t result;
+ char *err_msg;
+ sqlite3 *db;
+ int rc;
+
+ xlog(D_CALL, "Initializing database");
+
+ result = FALSE;
+ db = statd_open_db(SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
+ if (db == NULL)
+ goto out;
+
+ /*
+ * Don't delete the journal file after each transaction.
+ * This provides better performance and crash robustness.
+ */
+ err_msg = NULL;
+ rc = sqlite3_exec(db, "PRAGMA journal_mode=TRUNCATE;", NULL, 0, &err_msg);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to enable persistent journaling: %s",
+ err_msg);
+ sqlite3_free(err_msg);
+ goto out_close;
+ }
+
+ /*
+ * Protect against db corruption during crashes.
+ */
+ err_msg = NULL;
+ rc = sqlite3_exec(db, "PRAGMA synchronous=FULL;", NULL, 0, &err_msg);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to enable full synchronous mode: %s",
+ err_msg);
+ sqlite3_free(err_msg);
+ goto out_close;
+ }
+
+ if (!statd_create_info_table(db))
+ goto out_close;
+
+ if (!statd_create_monitor_table(db))
+ goto out_close;
+
+ result = TRUE;
+
+out_close:
+ statd_close_db(db);
+
+out:
+ return result;
+}
+
+/*
+ * Clear all capabilities but CAP_NET_BIND_SERVICE. This permits
+ * statd to acquire privileged source ports, but all other
+ * capabilities are disallowed.
+ */
+static bool_t
+statd_clear_capabilities(void)
+{
+ bool_t result;
+ cap_t caps;
+
+ result = FALSE;
+
+ caps = cap_from_text("cap_net_bind_service=ep");
+ if (caps == NULL) {
+ xlog(L_ERROR, "Failed to allocate working storage: %m");
+ return result;
+ }
+
+ if (cap_set_proc(caps) == -1) {
+ xlog(L_ERROR, "Failed to set capability flags: %m");
+ goto out_free;
+ }
+
+ result = TRUE;
+
+out_free:
+ (void)cap_free(caps);
+ return result;
+}
+
+/**
+ * statd_drop_privileges - drop root privileges
+ * @keep_bind: caller sets this to keep CAP_NET_BIND_SERVICE
+ *
+ * Returns TRUE if successful, or FALSE if some error occurred.
+ *
+ * Set our effective UID and GID to that of our on-disk database.
+ */
+bool_t
+statd_drop_privileges(const int keep_bind)
+{
+ struct stat st;
+
+ /*
+ * XXX: If we can't stat dirname, or if dirname is owned by
+ * root, we should use "statduser" instead, which is set up
+ * by configure.ac
+ *
+ * Nothing in nfs-utils seems to use statduser, though.
+ */
+ if (lstat(statd_db_dirname, &st) == -1) {
+ xlog(L_ERROR, "Failed to stat %s: %m", statd_db_dirname);
+ return FALSE;
+ }
+
+ if (st.st_uid == 0) {
+ xlog(L_WARNING,
+ "Running as root. chown %s to choose different user",
+ statd_db_dirname);
+ return TRUE;
+ }
+
+ /*
+ * Don't clear capabilities when dropping root.
+ */
+ if (keep_bind && prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1) {
+ xlog(L_ERROR, "prctl(PR_SET_KEEPCAPS) failed: %m");
+ return FALSE;
+ }
+
+ if (setgroups(0, NULL) == -1) {
+ xlog(L_ERROR, "Failed to drop supplementary groups: %m");
+ return FALSE;
+ }
+
+ /*
+ * ORDER
+ *
+ * setgid(2) first, as setuid(2) may remove privileges needed
+ * to set the group id.
+ */
+ if (setgid(st.st_gid) == -1 || setuid(st.st_uid) == -1) {
+ xlog(L_ERROR, "Failed to drop privileges: %m");
+ return FALSE;
+ }
+ xlog(D_GENERAL, "Effective UID, GID: %u, %u", st.st_uid, st.st_gid);
+
+ if (keep_bind && !statd_clear_capabilities())
+ return FALSE;
+
+ /*
+ * ORDER
+ *
+ * If the on-disk database doesn't exist yet, it is created
+ * here, after lowering our privileges. This ensures that
+ * statd can continue to access the database after privileges
+ * are dropped.
+ */
+ return statd_init_database();
+}
+
+/**
+ * statd_system_rebooted - check if we've run since the last reboot
+ *
+ * Returns TRUE if the system has rebooted since we last ran, or if
+ * we can't tell if we rebooted. Otherwise FALSE is returned.
+ */
+bool_t
+statd_system_rebooted(void)
+{
+ key_t key;
+
+ key = ftok(statd_db_filename, 'r'); /* 'r' as in aRbitrary */
+
+ if (semget(key, 1, IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR) == -1) {
+ if (errno == EEXIST) {
+ xlog(D_GENERAL, "Already ran since last reboot");
+ return FALSE;
+ } else
+ xlog(D_GENERAL, "Failed to find shared semaphor: %m");
+ }
+
+ xlog(D_GENERAL, "First run since last reboot");
+ return TRUE;
+}
+
+/*
+ * Returns TRUE if successful, or FALSE if an error occurred.
+ */
+static bool_t
+statd_insert_info(sqlite3 *db, const char *name, const int64_t value)
+{
+ sqlite3_stmt *stmt;
+ bool_t result;
+ int rc;
+
+ result = FALSE;
+
+ if (!statd_prepare_stmt(db, &stmt, "INSERT OR REPLACE INTO "
+ STATD_INFO_TABLENAME " (name,value) VALUES(?,?);"))
+ goto out;
+
+ rc = sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to bind name %s: %s",
+ name, sqlite3_errmsg(db));
+ goto out_finalize;
+ }
+
+ rc = sqlite3_bind_int64(stmt, 2, (sqlite3_int64)value);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to bind value: %s",
+ sqlite3_errmsg(db));
+ goto out_finalize;
+ }
+
+ rc = sqlite3_step(stmt);
+ if (rc != SQLITE_DONE) {
+ xlog(L_ERROR, "Failed to write NSM state number: %s",
+ sqlite3_errmsg(db));
+ goto out_finalize;
+ }
+
+ result = TRUE;
+
+out_finalize:
+ statd_finalize_stmt(stmt);
+
+out:
+ return result;
+}
+
+/*
+ * Returns TRUE if successful, or FALSE if an error occurred.
+ */
+static bool_t
+statd_get_info(sqlite3 *db, const char *name, int64_t *value)
+{
+ sqlite3_stmt *stmt;
+ bool_t result;
+ int rc;
+
+ result = FALSE;
+
+ if (!statd_prepare_stmt(db, &stmt,
+ "SELECT * FROM " STATD_INFO_TABLENAME " WHERE name=?;"))
+ goto out;
+
+ rc = sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
+ if (rc != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to bind name %s: %s",
+ name, sqlite3_errmsg(db));
+ goto out_finalize;
+ }
+
+ rc = sqlite3_step(stmt);
+ switch (rc) {
+ case SQLITE_ROW:
+ *value = (int64_t)sqlite3_column_int64(stmt, 1);
+ result = TRUE;
+ break;
+ case SQLITE_DONE:
+ xlog(D_GENERAL, "Table '" STATD_INFO_TABLENAME
+ "' did not contain a row for '%s'", name);
+ break;
+ default:
+ xlog(L_ERROR, "Failed to read value of '%s': %s",
+ name, sqlite3_errmsg(db));
+ }
+
+out_finalize:
+ statd_finalize_stmt(stmt);
+
+out:
+ return result;
+}
+
+/**
+ * statd_get_nsm_state - retrieve on-disk NSM state number
+ *
+ * Returns an odd NSM state number, or a starting NSM state number
+ * of 1 if some error occurred.
+ */
+int
+statd_get_nsm_state(void)
+{
+ int64_t state;
+ sqlite3 *db;
+
+ state = 1;
+ db = statd_open_db(SQLITE_OPEN_READONLY);
+ if (db == NULL)
+ goto out;
+
+ if (!statd_get_info(db, "state", &state))
+ goto out_close;
+
+ if (!(state & 1)) {
+ xlog(L_WARNING, "Database contained an even state number");
+ state++;
+ }
+
+out_close:
+ statd_close_db(db);
+
+out:
+ return (int)state;
+}
+
+/**
+ * statd_update_nsm_state - unconditionally update NSM state number
+ * @db: open database connection
+ *
+ * Returns TRUE if the on-disk NSM state number was updated successfully;
+ * otherwise FALSE.
+ */
+bool_t
+statd_update_nsm_state(sqlite3 *db)
+{
+ int64_t state;
+
+ if (!statd_get_info(db, "state", &state))
+ state = 1;
+ else
+ state += 2;
+
+ if (!statd_insert_info(db, "state", state))
+ return FALSE;
+
+ xlog(D_GENERAL, "Bumped NSM state, new value: %d", state);
+ return TRUE;
+}
+
+/**
+ * statd_dump_monitor_list - Display the monitor list
+ *
+ * For debugging.
+ */
+void
+statd_dump_monitor_list(void)
+{
+ int result, err, i, nrow, ncolumn;
+ char **resultp;
+ char *err_msg;
+ sqlite3 *db;
+
+ result = 0;
+
+ db = statd_open_db(SQLITE_OPEN_READONLY);
+ if (db == NULL)
+ return;
+
+ err_msg = NULL;
+ err = sqlite3_get_table(db, "SELECT"
+ " mon_name,my_name,state FROM " STATD_MONITOR_TABLENAME ";",
+ &resultp, &nrow, &ncolumn, &err_msg);
+ if (err != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to get table '"
+ STATD_MONITOR_TABLENAME "': %s", err_msg);
+ sqlite3_free(err_msg);
+ goto out_close;
+ }
+ if (ncolumn == 0) {
+ xlog(D_GENERAL, "The monitor list currently is empty");
+ goto out_free_table;
+ }
+ if (ncolumn != 3) {
+ xlog(L_ERROR, "Incorrect column count %d in SELECT result",
+ ncolumn);
+ goto out_free_table;
+ }
+
+ xlog(D_GENERAL, "Monitor list (mon_name, my_name, state)");
+ for (i = ncolumn; i < (nrow + 1) * ncolumn ; i += ncolumn)
+ xlog(D_GENERAL, "%s, %s, %d",
+ resultp[i], resultp[i + 1], resultp[i + 2]);
+
+out_free_table:
+ sqlite3_free_table(resultp);
+
+out_close:
+ statd_close_db(db);
+}
+
+/**
+ * statd_print_capabilities - dump process capability set to log
+ *
+ * For debugging.
+ */
+void
+statd_print_capabilities(void)
+{
+ cap_t caps;
+
+ caps = cap_get_proc();
+ if (caps == NULL) {
+ xlog(L_ERROR, "Failed to get process capaabilities: %m");
+ return;
+ }
+
+ xlog(L_NOTICE, "Capabilities: %s", cap_to_text(caps, NULL));
+
+ (void)cap_free(caps);
+}
diff --git a/utils/new-statd/hostname.c b/utils/new-statd/hostname.c
new file mode 100644
index 0000000..a9f1557
--- /dev/null
+++ b/utils/new-statd/hostname.c
@@ -0,0 +1,520 @@
+/*
+ * Copyright 2009 Oracle. All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * mon_name matching and verification functions.
+ *
+ * Multi-homed remotes usually send only a single SM_NOTIFY when they
+ * reboot, even though the local system may be accessing the remote
+ * by different names.
+ *
+ * statd maps such remotes to the same callback database entry by
+ * performing a series of steps to generate a canonical hostname
+ * from the mon_name passed to it by lockd, which it then uses as
+ * a key to index its database of NLM callback information. The
+ * main assumption is that the names of the remotes are not volatile
+ * across local host reboots.
+ *
+ * There are some risks to this strategy:
+ *
+ * 1. The external DNS database can change over time so that
+ * this canonicalization process generates a different string,
+ * preventing a clean match
+ *
+ * 2. Local DNS resolution can produce different results than the
+ * hostname the remote used to contact us, preventing a clean
+ * match.
+ *
+ * 3. Inconsistent forward and reverse DNS maps may prevent a clean
+ * match.
+ *
+ * 4. When DNS is slow or unavailable, statd, and thus mounting
+ * and reboot recovery, hangs.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <ctype.h>
+#include <string.h>
+#include <stdlib.h>
+#include <netdb.h>
+#include <errno.h>
+
+#include <netinet/in.h>
+#include <arpa/inet.h>
+
+#include <rpc/rpc.h>
+#include <rpc/svc.h>
+
+#include "statd.h"
+
+#define IN_LOOPBACK(a) ((((long int)(a)) & 0xff000000) == 0x7f000000)
+
+static bool_t
+statd_report_nonlocal(const struct sockaddr *sap)
+{
+ const struct sockaddr_in *sin = (const struct sockaddr_in *)sap;
+ const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sap;
+ char buf[INET6_ADDRSTRLEN];
+
+ switch (sap->sa_family) {
+ case AF_INET:
+ if (inet_ntop(AF_INET, &sin->sin_addr.s_addr,
+ buf, sizeof(buf)) == NULL)
+ break;
+ xlog(L_WARNING, "Dropping non-local call from %s:%u",
+ buf, ntohs(sin->sin_port));
+ goto out;
+ case AF_INET6:
+ if (inet_ntop(AF_INET6, &sin6->sin6_addr,
+ buf, sizeof(buf)) == NULL)
+ break;
+ xlog(L_WARNING, "Dropping non-local call from %s:%u",
+ buf, ntohs(sin6->sin6_port));
+ goto out;
+ }
+
+ xlog(L_WARNING, "Received call from unrecognized address");
+
+out:
+ return FALSE;
+}
+
+static bool_t
+statd_is_loopback4(const struct sockaddr_in *sin)
+{
+ return IN_LOOPBACK(ntohl(sin->sin_addr.s_addr)) ? TRUE : FALSE;
+}
+
+static bool_t
+statd_is_loopback6(const struct sockaddr_in6 *sin6)
+{
+ return IN6_IS_ADDR_LOOPBACK(&sin6->sin6_addr) ? TRUE : FALSE;
+}
+
+static bool_t
+statd_is_loopback46(const struct sockaddr_in6 *sin6)
+{
+ if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr))
+ return IN_LOOPBACK(ntohl(sin6->sin6_addr.s6_addr32[3])) ?
+ TRUE : FALSE;
+ return statd_is_loopback6(sin6);
+}
+
+static bool_t
+statd_localhost4_caller_check(const struct sockaddr *sap)
+{
+ const struct sockaddr_in *sin = (struct sockaddr_in *)sap;
+
+ if (statd_is_loopback4(sin) && ntohs(sin->sin_port) < IPPORT_RESERVED)
+ return TRUE;
+ return statd_report_nonlocal(sap);
+}
+
+static bool_t
+statd_localhost6_caller_check(const struct sockaddr *sap)
+{
+ const struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
+
+ if (statd_is_loopback6(sin6) && ntohs(sin6->sin6_port) < IPPORT_RESERVED)
+ return TRUE;
+ return statd_report_nonlocal(sap);
+}
+
+/**
+ * statd_localhost_caller - ensure RPC caller is from loopback
+ * @rqstp: RPC request metadata
+ *
+ * Return TRUE if caller is localhost; otherwise return FALSE.
+ *
+ * Reject requests from non-loopback addresses in order
+ * to prevent attack described in CERT CA-99.05. Also ensure
+ * that these calls are from a privileged port.
+ */
+bool_t
+statd_localhost_caller(const struct svc_req *rqstp)
+{
+ const struct netbuf *nbuf = svc_getrpccaller(rqstp->rq_xprt);
+ const struct sockaddr *sap = (struct sockaddr *)nbuf->buf;
+
+ switch (sap->sa_family) {
+ case AF_INET:
+ return statd_localhost4_caller_check(sap);
+ case AF_INET6:
+ return statd_localhost6_caller_check(sap);
+ }
+ return statd_report_nonlocal(sap);
+}
+
+/*
+ * Look up the hostname; report exceptional errors. Caller must
+ * call freeaddrinfo(3) if a valid addrinfo is returned.
+ */
+static struct addrinfo *
+statd_get_addrinfo(const char *hostname, const struct addrinfo* gai_hint)
+{
+ struct addrinfo *gai_results;
+ int error;
+
+ error = getaddrinfo(hostname, NULL, gai_hint, &gai_results);
+ switch (error) {
+ case 0:
+ xlog(D_CALL, "%s: getaddrinfo('%s') successful",
+ __func__, hostname);
+ return gai_results;
+ case EAI_NONAME:
+ xlog(D_CALL, "%s: No hostname '%s' was found",
+ __func__, hostname);
+ break;
+ case EAI_SYSTEM:
+ xlog(L_ERROR, "Unexpected system error %m while "
+ "resolving host %s", hostname);
+ break;
+ default:
+ xlog(L_ERROR, "Unexpected getaddrinfo error %s while "
+ "resolving host %s", gai_strerror(error), hostname);
+ }
+
+ return NULL;
+}
+
+/**
+ * statd_forward_lookup - hostname or presentation address to sockaddr list
+ * @hostname: C string containing hostname or presentation address
+ * @protocol: IPPROTO_ number hint
+ *
+ * Returns addrinfo list, or NULL if an error occurs. Caller must free
+ * addrinfo list via freeaddrinfo(3).
+ *
+ * We must be able to send an SM_NOTIFY later, so ensure that the
+ * provided mon_name either is an IP address, or can be resolved by DNS.
+ * Verify that the name does not contain illegal characters. See CERT
+ * CA-96.09 for details.
+ *
+ * AI_ADDRCONFIG should prevent us from monitoring a host that we can't
+ * reach. If IPv6 is not enabled on this system, then we don't want to
+ * monitor remote hosts that have only IPv6 addresses.
+ */
+struct addrinfo *
+statd_forward_lookup(const char *hostname, const int protocol)
+{
+ struct addrinfo gai_hint = {
+#ifdef IPV6_SUPPORTED
+ .ai_family = AF_UNSPEC,
+ .ai_flags = AI_CANONNAME | AI_ADDRCONFIG,
+#else /* !IPV6_SUPPORTED */
+ .ai_family = AF_INET,
+ .ai_flags = AI_CANONNAME,
+#endif /* !IPV6_SUPPORTED */
+ };
+
+ gai_hint.ai_protocol = protocol;
+ return statd_get_addrinfo(hostname, &gai_hint);
+}
+
+/*
+ * This module is built only if the local system has TI-RPC
+ * and sqlite3, so we are fairly sure it has a recent enough
+ * glibc that getnameinfo(3) is available.
+ *
+ * Returns a pointer to a C string containing a hostname (never
+ * a presentation address), or NULL if the resolution failed.
+ * Caller must free the string.
+ */
+static char *
+statd_reverse_lookup(const struct sockaddr *sap, const socklen_t salen,
+ const char *hostname)
+{
+ char buf[NI_MAXHOST];
+ int error;
+
+ error = getnameinfo(sap, salen, buf, sizeof(buf), NULL, 0, NI_NAMEREQD);
+ switch (error) {
+ case 0:
+ xlog(D_CALL, "Reverse lookup found hostname '%s'", buf);
+ return strdup(buf);
+ case EAI_NONAME:
+ xlog(D_CALL, "Reverse lookup found no name");
+ break;
+ case EAI_SYSTEM:
+ xlog(L_ERROR, "Unexpected system error %m while "
+ "resolving host %s", hostname);
+ break;
+ default:
+ xlog(L_ERROR, "Unexpected getnameinfo(3) error %s while "
+ "resolving host %s", gai_strerror(error), hostname);
+ break;
+ }
+
+ return NULL;
+}
+
+/**
+ * statd_get_address_list - Generate list of addresses for @hostname
+ * @hostname: C string containing hostname or presentation address
+ * @gai_hint: hint passed to getaddrinfo(3)
+ *
+ * If @hostname is a presentation address, first try to reverse map it
+ * to a canonical hostname. This allows us to generate a list of
+ * equivalent addresses to try. If the address in @hostname doesn't
+ * have a reverse map, just convert the address to an addrinfo.
+ *
+ * Returns addrinfo list, or NULL if an error occurs. Caller must free
+ * addrinfo list via freeaddrinfo(3).
+ */
+struct addrinfo *
+statd_get_address_list(const char *hostname, const struct addrinfo *gai_hint)
+{
+ struct addrinfo reverse_hint, *gai_results;
+ char *name;
+ int error;
+
+ /* AI_NUMERICHOST means this one doesn't go on the wire */
+ reverse_hint = *gai_hint;
+ reverse_hint.ai_flags |= AI_NUMERICHOST;
+
+ error = getaddrinfo(hostname, NULL, &reverse_hint, &gai_results);
+ switch (error) {
+ case 0:
+ xlog(D_CALL, "Hostname '%s' is a presentation format address",
+ hostname);
+ name = statd_reverse_lookup(gai_results->ai_addr,
+ gai_results->ai_addrlen,
+ hostname);
+ if (name == NULL)
+ return gai_results;
+
+ freeaddrinfo(gai_results);
+ gai_results = statd_get_addrinfo(name, gai_hint);
+ free(name);
+ return gai_results;
+ case EAI_NONAME:
+ xlog(D_CALL, "Hostname '%s' is not a presentation format address",
+ hostname);
+ return statd_get_addrinfo(hostname, gai_hint);
+ case EAI_SYSTEM:
+ xlog(L_ERROR, "unexpected system error %m while "
+ "resolving hostname '%s'", hostname);
+ break;
+ default:
+ xlog(L_ERROR, "unexpected getaddrinfo(3) error %s while "
+ "resolving hostname '%s'", gai_strerror(error), hostname);
+ }
+
+ return NULL;
+}
+
+#ifdef IPV6_SUPPORTED
+static bool_t
+statd_compare_sockaddrs(const struct sockaddr *sa1,
+ const struct sockaddr *sa2)
+{
+ const struct sockaddr_in6 *sin1 = (const struct sockaddr_in6 *)sa1;
+ const struct sockaddr_in6 *sin2 = (const struct sockaddr_in6 *)sa2;
+
+ if (statd_is_loopback46(sin1) && statd_is_loopback46(sin2))
+ return TRUE;
+
+ if ((IN6_IS_ADDR_LINKLOCAL(&sin1->sin6_addr) &&
+ IN6_IS_ADDR_LINKLOCAL(&sin2->sin6_addr)) ||
+ (IN6_IS_ADDR_SITELOCAL(&sin1->sin6_addr) &&
+ IN6_IS_ADDR_SITELOCAL(&sin2->sin6_addr)))
+ if (sin1->sin6_scope_id != sin2->sin6_scope_id)
+ return FALSE;
+
+ return IN6_ARE_ADDR_EQUAL(&sin1->sin6_addr, &sin2->sin6_addr) ?
+ TRUE : FALSE;
+}
+
+#else /* !IPV6_SUPPORTED */
+static bool_t
+statd_compare_sockaddrs(const struct sockaddr *sa1,
+ const struct sockaddr *sa2)
+{
+ const struct sockaddr_in *sin1 = (const struct sockaddr_in *)sa1;
+ const struct sockaddr_in *sin2 = (const struct sockaddr_in *)sa2;
+
+ if (statd_is_loopback4(sin1) && statd_is_loopback4(sin2))
+ return TRUE;
+ return sin1->sin_addr.s_addr == sin2->sin_addr.s_addr ?
+ TRUE : FALSE;
+}
+#endif /* !IPV6_SUPPORTED */
+
+static bool_t
+statd_compare_addrlists(struct addrinfo *ai1, struct addrinfo *ai2)
+{
+ xlog(D_CALL, "%s", __func__);
+
+ for (; ai1 != NULL; ai1 = ai1->ai_next) {
+ struct addrinfo *ai;
+ for (ai = ai2; ai != NULL; ai = ai->ai_next)
+ if (statd_compare_sockaddrs(ai1->ai_addr, ai->ai_addr))
+ return TRUE;
+ }
+ return FALSE;
+}
+
+/*
+ * If IPV6 is enabled, we always get back a list of only AF_INET6
+ * addresses (including mapped v4 addresses) from getaddrinfo(3).
+ * Thus we can always use IP6_ARE_ADDR_EQUAL to compare these addresses.
+ */
+static bool_t
+statd_matchhostname_dns(const char *hostname1, const char *hostname2)
+{
+ static const struct addrinfo gai_hint = {
+#ifdef IPV6_SUPPORTED
+ .ai_family = AF_INET6,
+ .ai_flags = AI_CANONNAME | AI_V4MAPPED,
+#else /* !IPV6_SUPPORTED */
+ .ai_family = AF_INET,
+ .ai_flags = AI_CANONNAME,
+#endif /* !IPV6_SUPPORTED */
+ .ai_protocol = IPPROTO_UDP,
+ };
+ struct addrinfo *gai_results1, *gai_results2;
+ bool_t result;
+
+ result = FALSE;
+
+ /*
+ * Acquire a list of associated IP addresses for each hostname.
+ */
+ gai_results1 = statd_get_address_list(hostname1, &gai_hint);
+ if (gai_results1 == NULL)
+ return result;
+ gai_results2 = statd_get_address_list(hostname2, &gai_hint);
+ if (gai_results2 == NULL) {
+ freeaddrinfo(gai_results1);
+ return result;
+ }
+
+ /*
+ * If the canonical hostnames match, or one address in each
+ * list matches, then we have a positive result.
+ */
+ if ((strcasecmp(gai_results1->ai_canonname,
+ gai_results2->ai_canonname) != 0) ||
+ !statd_compare_addrlists(gai_results1, gai_results2))
+ goto out;
+
+ result = TRUE;
+
+out:
+ freeaddrinfo(gai_results2);
+ freeaddrinfo(gai_results1);
+ return result;
+}
+
+/**
+ * statd_match_hostname - check if two hostnames are equivalent
+ * @hostname1: C string containing hostname
+ * @hostname2: C string containing hostname
+ *
+ * Returns TRUE if the hostnames are the same, the hostnames resolve
+ * to the same canonical name, or the hostnames resolve to at least
+ * one address that is the same. FALSE is returned if the hostnames
+ * do not match in any of these ways, if either hostname contains
+ * illegal characters, or if an error occurs.
+ */
+bool_t
+statd_match_hostname(const char *hostname1, const char *hostname2)
+{
+ bool_t result;
+
+ xlog(D_CALL, "Comparing '%s' and '%s'", hostname1, hostname2);
+
+ if (strcasecmp(hostname1, hostname2) == 0) {
+ xlog(D_GENERAL, "Exact match between '%s' and '%s'",
+ hostname1, hostname2);
+ return TRUE;
+ }
+
+ result = statd_matchhostname_dns(hostname1, hostname2);
+ if (result == TRUE)
+ xlog(D_GENERAL, "DNS match between '%s' and '%s'",
+ hostname1, hostname2);
+ else
+ xlog(D_GENERAL, "Failed to match '%s' and '%s':",
+ hostname1, hostname2);
+ return result;
+}
+
+/**
+ * statd_match_address - check if hostname and IP address are equivalent
+ * @hostname: C string containing hostname
+ * @sap: pointer to socket address
+ *
+ * Returns TRUE if the hostname resolves to @sap. FALSE is returned if
+ * the hostname does not resolve to address, if @hostname contains
+ * illegal characters, or if an error occurs.
+ */
+bool_t
+statd_match_address(const char *hostname, const struct sockaddr *sap)
+{
+ struct addrinfo gai_hint = {
+ .ai_protocol = IPPROTO_UDP,
+ };
+ struct addrinfo *gai_results, *ai;
+ char buf[INET6_ADDRSTRLEN];
+
+ if (opt_debug) {
+ const struct sockaddr_in *sin = (const struct sockaddr_in *)sap;
+ const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sap;
+
+ switch (sap->sa_family) {
+ case AF_INET:
+ (void)inet_ntop(AF_INET, &sin->sin_addr.s_addr,
+ buf, sizeof(buf));
+ break;
+ case AF_INET6:
+ (void)inet_ntop(AF_INET6, &sin6->sin6_addr,
+ buf, sizeof(buf));
+ break;
+ default:
+ xlog(L_NOTICE, "%s: Invalid address family", __func__);
+ return FALSE;
+ }
+
+ xlog(D_CALL, "Comparing '%s' and %s", hostname, buf);
+ }
+
+ gai_hint.ai_family = sap->sa_family;
+ gai_results = statd_get_address_list(hostname, &gai_hint);
+ if (gai_results == NULL)
+ goto out_fail;
+
+ for (ai = gai_results; ai != NULL; ai = ai->ai_next)
+ if (statd_compare_sockaddrs(ai->ai_addr, sap)) {
+ xlog(D_GENERAL, "DNS match between '%s' and %s",
+ hostname, buf);
+ return TRUE;
+ }
+
+ freeaddrinfo(gai_results);
+
+out_fail:
+ xlog(D_GENERAL, "Failed to match '%s' and %s", hostname, buf);
+ return FALSE;
+}
diff --git a/utils/new-statd/nlmcall.c b/utils/new-statd/nlmcall.c
new file mode 100644
index 0000000..3a7b3b0
--- /dev/null
+++ b/utils/new-statd/nlmcall.c
@@ -0,0 +1,525 @@
+/*
+ * Copyright 2009 Oracle. All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * Manage NLM callbacks.
+ *
+ * There are some complications.
+ *
+ * 1. Standard RPC library client calls that expect a reply are always
+ * synchronous with the remote server, and use system calls that
+ * cause the RPC caller process to block.
+ *
+ * However, statd cannot block, because its NSM service is
+ * implemented as a single thread. Blocking would cause statd to
+ * become unresponsive.
+ *
+ * To allow the use of standard RPC client library calls to
+ * implement our NLM callback, callbacks are done in a separate
+ * process. The statd service process hands NLM callback requests
+ * to this process via a packet socket.
+ *
+ * 2. Statd drops its root privileges before creating listener
+ * sockets to minimize exposure to network attacks. Linux lockd
+ * requires, however, that NLM callbacks are sent from a
+ * privileged port, in order to prevent users from sending local
+ * reboot notification spoofs.
+ *
+ * After starting up, then, the statd process forks the NLM
+ * callback child process before dropping its root privileges. The
+ * child NLM callback process retains its root privileges so it can
+ * create sockets with privileged source ports as needed.
+ *
+ * Each transport socket used for NLM callbacks is destroyed after
+ * the callback finishes to minimize exposure to RPC reply spoofing.
+ *
+ * 3. A UDP socket is used to send an NLM callback. This prevents
+ * the callback process from leaving privileged ports in TIME_WAIT
+ * after each callback request is finished.
+ *
+ * 4. The callback mechanism must tolerate the temporary absence of
+ * lockd and/or rpcbind. The NLM callback implementation uses
+ * a long timeout and a large number of retries to improve the
+ * chances that the local lockd will see the request.
+ *
+ */
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <sys/resource.h>
+#include <sys/time.h>
+
+#include <time.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <netdb.h>
+
+#include "statd.h"
+#include "nfsrpc.h"
+
+#if SIZEOF_SOCKLEN_T - 0 == 0
+#define socklen_t int
+#endif
+
+struct statd_callback {
+ int state;
+ struct sockaddr_storage address;
+ char mon_name[SM_MAXSTRLEN];
+};
+
+static int statd_sv[2];
+
+/*
+ * Show the priv in hexadecimal. For debugging only.
+ */
+#define STATD_SIZEOF_HEXBYTE sizeof("ff")
+#define STATD_PRIVBUF_LEN (SM_PRIV_SIZE * STATD_SIZEOF_HEXBYTE)
+static const char *
+statd_show_priv(const char *priv)
+{
+ static char buf[STATD_PRIVBUF_LEN];
+ unsigned int i;
+ char *p;
+
+ memset(buf, 0, sizeof(buf));
+ p = &buf[0];
+ for (i = 0; i < SM_PRIV_SIZE; i++)
+ p += sprintf(p, "%02x", 0xff & priv[i]);
+
+ return (const char *)buf;
+}
+
+/*
+ * Synchronously create a socket and RPC client, perform an rpcbind,
+ * send the callback, then shutdown the client and destroy the socket.
+ *
+ * Lockd requires a privileged port for these calls, as a simple form
+ * of authentication. A UDP transport is used to prevent tying up a
+ * pair of privileged ports for 120 seconds in TIME_WAIT (one for the
+ * rpcbind and one for the call itself).
+ *
+ * Note that some recent kernels do not create a UDP listener for
+ * lockd. In that case we try the client creation with TCP.
+ *
+ * Even though the reply is void, we leave our RPC client available
+ * to receive it. The reply tells us the kernel actually saw and
+ * executed the request, thus we don't need to retransmit it.
+ *
+ * Returns TRUE if the call succeeded, or FALSE if caller should try again.
+ */
+static bool_t
+statd_send_one_nlm_callback(const struct mon *monp, const int state)
+{
+ const struct my_id *my_idp = &monp->mon_id.my_id;
+ const rpcprog_t program = (rpcprog_t)my_idp->my_prog;
+ const rpcvers_t version = (rpcvers_t)my_idp->my_vers;
+ const rpcproc_t procedure = (rpcproc_t)my_idp->my_proc;
+ struct timeval timeout = { STATD_NLM_CALLBACK_TIMEOUT, 0 };
+ struct nlm_reboot new_status = {
+ .mon_name = monp->mon_id.mon_name,
+ .state = (int)state,
+ };
+ enum clnt_stat status;
+ CLIENT *clnt;
+
+ rpc_createerr.cf_stat = 0;
+ clnt = clnt_create_timed("localhost", program, version,
+ "udp", &timeout);
+ /* Sometimes lockd starts only a TCP listener */
+ if (clnt == NULL && rpc_createerr.cf_stat == RPC_PROGNOTREGISTERED) {
+ rpc_createerr.cf_stat = 0;
+ clnt = clnt_create_timed("localhost", program,
+ version, "tcp", &timeout);
+ }
+ if (clnt == NULL) {
+ xlog(L_ERROR, "Failed to create RPC client for NLM callback: %s",
+ clnt_sperrno(rpc_createerr.cf_stat));
+ return FALSE;
+ }
+
+ memcpy(new_status.priv, monp->priv, sizeof(new_status.priv));
+ xlog(D_GENERAL, "Arguments for callback: %s, %u",
+ new_status.mon_name, new_status.state);
+
+ status = clnt_call(clnt, procedure,
+ (xdrproc_t)xdr_nlm_reboot, (caddr_t)&new_status,
+ (xdrproc_t)xdr_void, NULL, timeout);
+ switch (status) {
+ case RPC_SUCCESS:
+ xlog(D_GENERAL, "NLM callback succeeded");
+ break;
+ default:
+ xlog(D_GENERAL, "NLM callback failed: %s", clnt_sperrno(status));
+ break;
+ }
+
+ clnt_destroy(clnt);
+ return status == RPC_SUCCESS ? TRUE : FALSE;
+}
+
+static bool_t
+statd_send_nlm_callback(const struct mon *monp, const int state)
+{
+ unsigned int retries;
+
+ xlog(D_CALL, "Sending callback for %s/%s (%u, %u, %u) priv: %s",
+ monp->mon_id.mon_name, monp->mon_id.my_id.my_name,
+ monp->mon_id.my_id.my_prog,
+ monp->mon_id.my_id.my_vers,
+ monp->mon_id.my_id.my_proc,
+ statd_show_priv(monp->priv));
+
+ retries = STATD_NLM_CALLBACK_RETRIES;
+ while (retries--) {
+ if (statd_send_one_nlm_callback(monp, state))
+ return TRUE;
+ sleep(2);
+ }
+
+ return FALSE;
+}
+
+/*
+ * Extract the NLM callback arguments from our database,
+ * and send the the RPC.
+ *
+ * Returns TRUE if all went to plan, or FALSE if some error
+ * occurred.
+ */
+static bool_t
+statd_post_nlm_callback(sqlite3 *db, char *mon_name,
+ char *my_name, const int state)
+{
+ sqlite3_stmt *stmt;
+ struct mon mon;
+ bool_t result;
+ int rc, tmp;
+ char *sql;
+
+ memset(&mon, 0, sizeof(mon));
+ result = FALSE;
+
+ sql = sqlite3_mprintf("SELECT priv,program,version,procedure FROM "
+ STATD_MONITOR_TABLENAME
+ " WHERE mon_name='%q' and my_name='%q';",
+ mon_name, my_name);
+ if (sql == NULL) {
+ xlog(L_ERROR, "Failed to generate SQL command in %s", __func__);
+ return result;
+ }
+
+ rc = statd_prepare_stmt(db, &stmt, sql);
+ sqlite3_free(sql);
+ if (!rc)
+ goto out;
+
+ rc = sqlite3_step(stmt);
+ switch (rc) {
+ case SQLITE_ROW:
+ tmp = sqlite3_column_count(stmt);
+ if (tmp != 4) {
+ xlog(L_ERROR, "Incorrect column count %d "
+ "in SELECT result", tmp);
+ break;
+ }
+
+ tmp = sqlite3_column_bytes(stmt, 0);
+ if (tmp != SM_PRIV_SIZE) {
+ xlog(L_ERROR, "Incorrect priv cookie size %d "
+ "in SELECT result", tmp);
+ //break;
+ }
+ memcpy(mon.priv, sqlite3_column_blob(stmt, 0), sizeof(mon.priv));
+
+ mon.mon_id.mon_name = mon_name;
+ mon.mon_id.my_id.my_name = my_name;
+ mon.mon_id.my_id.my_prog = sqlite3_column_int(stmt, 1);
+ mon.mon_id.my_id.my_vers = sqlite3_column_int(stmt, 2);
+ mon.mon_id.my_id.my_proc = sqlite3_column_int(stmt, 3);
+
+ result = statd_send_nlm_callback(&mon, state);
+ break;
+ case SQLITE_DONE:
+ xlog(L_ERROR, "The monitor list does not contain an entry "
+ "for '%s'", mon_name);
+ break;
+ default:
+ xlog(L_ERROR, "Failed to find row for '%s' in table '"
+ STATD_MONITOR_TABLENAME "': %s",
+ mon_name, sqlite3_errmsg(db));
+ }
+
+ statd_finalize_stmt(stmt);
+
+out:
+ return result;
+}
+
+/*
+ * Remember the new NSM state for this mon_name/my_name.
+ */
+static bool_t
+statd_set_mon_state(sqlite3 *db, const char *mon_name,
+ const char *my_name, const int state)
+{
+ char *sql, *err_msg;
+ bool_t result;
+
+ result = FALSE;
+
+ sql = sqlite3_mprintf("UPDATE " STATD_MONITOR_TABLENAME " SET state=%d"
+ " WHERE mon_name='%q' and my_name='%q';",
+ state, mon_name, my_name);
+ if (sql == NULL) {
+ xlog(L_ERROR, "Failed to generate SQL command in %s", __func__);
+ return result;
+ }
+
+ err_msg = NULL;
+ if (sqlite3_exec(db, sql, NULL, 0, &err_msg) != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to update state for mon_name '%s': %s",
+ mon_name, err_msg);
+ xlog(L_ERROR, "SQL: %s", sql);
+ sqlite3_free(err_msg);
+ goto out_free;
+ }
+
+ result = TRUE;
+ xlog(D_CALL, "Set state to %d for mon_name '%s'", state, mon_name);
+
+out_free:
+ sqlite3_free(sql);
+ return result;
+}
+
+/*
+ * Since it's not likely we will get an exact match on the mon_name
+ * (if the kernel sends us IP addresses to monitor instead of real
+ * hostnames, or the NFS server's actual hostname does not match
+ * the name sent to us by the local NFS client), we must use a
+ * heuristic to match incoming SM_NOTIFY arguments to rows in our
+ * monitor list.
+ */
+static void
+statd_match_monitored_host(const char *mon_name, const int state,
+ const struct sockaddr *sap)
+{
+ int err, i, nrow, ncolumn;
+ char **resultp;
+ char *err_msg;
+ bool_t result;
+ sqlite3 *db;
+
+ result = FALSE;
+
+ db = statd_open_db(SQLITE_OPEN_READWRITE);
+ if (db == NULL)
+ goto out;
+
+ err_msg = NULL;
+ err = sqlite3_get_table(db, "SELECT"
+ " mon_name,my_name,state"
+ " FROM " STATD_MONITOR_TABLENAME ";",
+ &resultp, &nrow, &ncolumn, &err_msg);
+ if (err != SQLITE_OK) {
+ xlog(L_ERROR, "Failed to get table '"
+ STATD_MONITOR_TABLENAME "': %s", err_msg);
+ sqlite3_free(err_msg);
+ goto out_close;
+ }
+ if (ncolumn == 0) {
+ xlog(D_GENERAL, "The monitor list is empty");
+ goto out_free_table;
+ }
+ if (ncolumn != 3) {
+ xlog(L_ERROR, "Incorrect column count %d in SELECT result",
+ ncolumn);
+ goto out_free_table;
+ }
+
+ for (i = ncolumn; i < (nrow + 1) * ncolumn ; i += ncolumn) {
+ /*
+ * Does the row's mon_name or hostname "match" the incoming
+ * mon_name? Does the row's hostname "match" the caller's
+ * IP address?
+ */
+ if (!statd_match_hostname(resultp[i], mon_name) &&
+ !statd_match_address(resultp[i], sap))
+ continue;
+
+ /*
+ * If we've been sent this state number update
+ * already, don't send callback.
+ */
+ if (state == atoi(resultp[i + 2])) {
+ xlog(D_GENERAL, "Already seen state %d from %s",
+ state, mon_name);
+ continue;
+ }
+
+ if (statd_post_nlm_callback(db, resultp[i], resultp[i + 1], state)) {
+ statd_set_mon_state(db, resultp[i], resultp[i + 1],
+ state);
+ result = TRUE;
+ }
+ }
+
+out_free_table:
+ sqlite3_free_table(resultp);
+
+out_close:
+ statd_close_db(db);
+
+out:
+ if (result == FALSE)
+ xlog(D_GENERAL, "SM_NOTIFY for %s call ignored", mon_name);
+}
+
+/*
+ * The statd child process that manages NLM callbacks.
+ *
+ * Read NLM callback requests from our socket, match them against
+ * our monitor list, and execute them.
+ */
+static void
+statd_nlm_callback_process(void)
+{
+ (void)setpriority(PRIO_PROCESS, 0, 19);
+
+ xlog(D_CALL, "NLM callback process %u running", getpid());
+
+ if (!statd_drop_privileges(1))
+ return;
+
+ for (;;) {
+ static struct statd_callback buf;
+
+ if (read(statd_sv[1], &buf, sizeof(buf)) < 1)
+ break;
+
+ xlog(D_CALL, "Received queued NLM callback request");
+
+ statd_match_monitored_host(buf.mon_name, buf.state,
+ (struct sockaddr *)&buf.address);
+ }
+
+ xlog(D_CALL, "Shutting down NLM callback process");
+}
+
+/**
+ * statd_queue_nlm_callback - queue an NLM callback
+ * @mon_name: C string containing mon_name argument of SM_NOTIFY request
+ * @state: NSM state number argument of SM_NOTIFY request
+ * @sap: pointer to IP address of caller
+ *
+ * Called by the statd RPC listener process when a remote has sent
+ * us an SM_NOTIFY for a host we're monitoring.
+ *
+ * Tests show a packet socket can queue over 3000 packets, independent
+ * of packet size.
+ */
+void
+statd_queue_nlm_callback(const char *mon_name, const int state,
+ const struct sockaddr *sap)
+{
+ static struct statd_callback args;
+ ssize_t len;
+
+ switch (sap->sa_family) {
+ case AF_INET:
+ memcpy(&args.address, sap, sizeof(struct sockaddr_in));
+ break;
+ case AF_INET6:
+ memcpy(&args.address, sap, sizeof(struct sockaddr_in6));
+ break;
+ default:
+ xlog(L_ERROR, "Unrecognized SM_NOTIFY caller address");
+ return;
+ }
+ args.state = state;
+ strncpy(args.mon_name, mon_name, sizeof(args.mon_name));
+
+ len = write(statd_sv[0], &args, sizeof(args));
+ if (__exact_error_check(len, sizeof(args))) {
+ if (errno == EAGAIN)
+ xlog(L_ERROR, "NLM callback process not responding");
+ else
+ xlog(L_ERROR, "NLM callback socket write failed: %m");
+ return;
+ }
+
+ xlog(D_CALL, "Queued callback request for %s", mon_name);
+}
+
+/**
+ * statd_init_nlm_callback - set up callback process
+ *
+ * Returns TRUE if all went to plan; otherwise returns FALSE.
+ */
+bool_t
+statd_init_nlm_callback(void)
+{
+ const struct sigaction statd_sigchld_action = {
+ .sa_handler = SIG_DFL,
+ .sa_flags = SA_NOCLDSTOP | SA_NOCLDWAIT,
+ };
+ bool_t result;
+
+ result = FALSE;
+
+ if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, statd_sv) < 0) {
+ xlog(L_ERROR, "Failed to create NLM callback sockets: %m");
+ return result;
+ }
+
+ /*
+ * Make the write end non-blocking so the NSM service doesn't
+ * stall if the child end stops responding.
+ */
+ if (fcntl(statd_sv[0], F_SETFL, O_NONBLOCK) == -1)
+ xlog(L_ERROR, "Failed to make NLM callback socket nonblocking: %m");
+ else {
+ switch (fork()) {
+ case -1:
+ xlog(L_ERROR, "Failed to fork NLM callback process: %m");
+ close(statd_sv[0]);
+ break;
+ case 0:
+ /* Child. */
+ close(statd_sv[0]);
+ statd_nlm_callback_process();
+ break;
+ default:
+ /* Parent. */
+ (void)sigaction(SIGCHLD, &statd_sigchld_action, NULL);
+ usleep(10000); /* Let child run */
+ result = TRUE;
+ }
+ }
+
+ close(statd_sv[1]);
+ return result;
+}



2009-08-05 18:26:40

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)


On Aug 5, 2009, at 2:15 PM, J. Bruce Fields wrote:

> On Wed, Aug 05, 2009 at 02:05:44PM -0400, Chuck Lever wrote:
>> On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
>>> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>>>> Provide a new implementation of statd that supports IPv6. The new
>>>> statd implementation resides under
>>>>
>>>> utils/new-statd/
>>>>
>>>> The contents of this directory are built if --enable-tirpc is set
>>>> on the ./configure command line, and sqlite3 is available on the
>>>> build system. Otherwise, the legacy version of statd, which still
>>>> resides under utils/statd/, is built.
>>>>
>>>> The goals of this re-write are:
>>>>
>>>> o Support IPv6 networking
>>>>
>>>> Support interoperation with TI-RPC-based NSM implementations.
>>>> Transport Independent RPC, or TI-RPC, provides IPv6 network
>>>> support
>>>> for Linux's NSM implementation.
>>>>
>>>> To support TI-RPC, open code to construct RPC requests in socket
>>>> buffers and then schedule them has been replaced with standard
>>>> library calls.
>>>>
>>>> o Support notification via TCP
>>>>
>>>> As a secondary benefit of using TI-RPC library calls, reboot
>>>> notifications and NLM callbacks can now be sent via connection-
>>>> oriented transport protocols.
>>>>
>>>> Note that lockd does not (yet) tell statd what transport protocol
>>>> to use when sending reboot notifications. statd/sm-notify will
>>>> continue to use UDP for the time being.
>>>>
>>>> o Use an embedded database for storing on-disk callback data
>>>>
>>>> This whole exercise is for the purpose of crash robustness. There
>>>> are well-known deficiencies with simple create/rename/unlink
>>>> disk storage schemes during system crashes. Replace the current
>>>> flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>>> which uses fsync(3).
>>>
>>> If someone wants to move around that data, is it still simple to do
>>> that? (Where is it kept on the filesystem?)
>>>
>>> (I'm thinking of someone that shares it for high-availabity, as in:
>>>
>>> http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>>>
>>> Or maybe somebody that just needs to move their /var partition to a
>>> different disk one day.)
>>
>> Statd's monitor lists and state number are stored in a single regular
>> file, /var/lib/nfs/statd/statdb by default. This file can be easily
>> backed up, or used on other systems, if desired. I would recommend
>> ensuring the NSM state number is reset in the latter case, which
>> can be
>> done with the sqlite3 command.
>>
>> I've had some dialog with Lon Hohberger about clustering
>> requirements. I
>> think we are looking at crafting a separate utility that uses
>> sqlite3 C
>> function calls to extract data that's interesting to the clustering
>> implementation. Again, this could even be scripted with bash and the
>> sqlite3 command, but perhaps a C program is more maintainable.
>
> OK, good.
>
> And for the simplest cases, it should still be enough to just copy
> /var/lib/nfs/, right?

I don't see why that wouldn't work, as long statd/sm-notify aren't
updating the database at that moment. For safety I think there is an
sqlite3 backup mechanism for database files that respects the
library's locking semantics.

sqlite3 doesn't do anything special under the covers. It uses only
POSIX file access and locking calls, as far as I know. So I think
hosting /var on most well-behaved clustering file systems won't have
any problem with this arrangement.

One (admittedly minor) reason I did this is so we have some sample
code to try for other NFS-related daemons that need to store
information in /var robustly, potentially in clustered environments.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-08-05 18:06:18

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>> Provide a new implementation of statd that supports IPv6. The new
>> statd implementation resides under
>>
>> utils/new-statd/
>>
>> The contents of this directory are built if --enable-tirpc is set
>> on the ./configure command line, and sqlite3 is available on the
>> build system. Otherwise, the legacy version of statd, which still
>> resides under utils/statd/, is built.
>>
>> The goals of this re-write are:
>>
>> o Support IPv6 networking
>>
>> Support interoperation with TI-RPC-based NSM implementations.
>> Transport Independent RPC, or TI-RPC, provides IPv6 network support
>> for Linux's NSM implementation.
>>
>> To support TI-RPC, open code to construct RPC requests in socket
>> buffers and then schedule them has been replaced with standard
>> library calls.
>>
>> o Support notification via TCP
>>
>> As a secondary benefit of using TI-RPC library calls, reboot
>> notifications and NLM callbacks can now be sent via connection-
>> oriented transport protocols.
>>
>> Note that lockd does not (yet) tell statd what transport protocol
>> to use when sending reboot notifications. statd/sm-notify will
>> continue to use UDP for the time being.
>>
>> o Use an embedded database for storing on-disk callback data
>>
>> This whole exercise is for the purpose of crash robustness. There
>> are well-known deficiencies with simple create/rename/unlink
>> disk storage schemes during system crashes. Replace the current
>> flat-file monitor list mechanism which uses sync(2) with sqlite3,
>> which uses fsync(3).
>
> If someone wants to move around that data, is it still simple to do
> that? (Where is it kept on the filesystem?)
>
> (I'm thinking of someone that shares it for high-availabity, as in:
>
> http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>
> Or maybe somebody that just needs to move their /var partition to a
> different disk one day.)

Statd's monitor lists and state number are stored in a single regular
file, /var/lib/nfs/statd/statdb by default. This file can be easily
backed up, or used on other systems, if desired. I would recommend
ensuring the NSM state number is reset in the latter case, which can
be done with the sqlite3 command.

I've had some dialog with Lon Hohberger about clustering
requirements. I think we are looking at crafting a separate utility
that uses sqlite3 C function calls to extract data that's interesting
to the clustering implementation. Again, this could even be scripted
with bash and the sqlite3 command, but perhaps a C program is more
maintainable.

>> o Share code between sm-notify and statd
>>
>> Statd and sm-notify access the same set of on-disk data. These
>> separate programs now share the same code and implementation, with
>> access to on-disk data serialized by sqlite3. The two remain
>> separate executables to allow other system facilities to send
>> reboot notifications without poking statd.
>>
>> o Reduce impact of DNS outages
>>
>> The heuristics used by SM_NOTIFY to figure out which remote peer
>> has rebooted are heavily dependent on DNS. If the DNS service is
>> slow or hangs, that will make the NSM listener unresponsive.
>> Incoming SM_NOTIFY requests are now handled in a sidecar process
>> to reduce the impact of DNS outages on the NSM service listener.
>>
>> o Proper my_name support
>>
>> The current version of statd uses gethostname(3) to generate the
>> mon_name argument of SM_NOTIFY. This value can change across a
>> reboot. The new version of statd records lockd's my_name, passed
>> by every SM_MON request, and uses that when sending SM_NOTIFY.
>>
>> This can be useful for multi-homed and DHCP configured hosts.
>>
>> o Send SM_NOTIFY more aggressively
>>
>> It has been recommended that statd/sm-notify send SM_NOTIFY
>> more aggressively (for example, to the entire list returned by
>> getaddrinfo(3)). Since SM_NOTIFY's reply is NULL, there's no
>> way to tell whether the remote peer recognized the mon_name we
>> sent. More study is required, but this implementation attempts
>> to send an SM_NOTIFY request to each address returned by
>> getaddrinfo(3).
>>
>> This re-implementation paves the way for a number of future
>> improvements. However, it does not immediately address:
>>
>> o lockd/statd start-up serialization issues
>>
>> Sending reboot notifications, starting statd and lockd, and opening
>> the lockd grace period are still determined independently in user
>> space and the kernel.
>>
>> o Binding mon_names to caller IP addresses
>>
>> By default, lockd continues to send IP addresses as the mon_name
>> argument of the SM_MON procedure. This provides a better guarantee
>> of being able to contact remote peers during a reboot, but means
>> statd must continue to use heuristics to match incoming SM_NOTIFY
>> requests with peers on the monitor list.
>>
>> o Distinct logic for NFS client- and server-side
>>
>> Client-side and server-side monitoring requirements are different.
>> Statd continues to use the same logic for both NFS client and
>> server, as the NSMv1 protocol does not provide any indication
>> that a mon_name is for a client or server peer.
>
> Note we probably don't need to be limited by the protocol here, only
> by
> kernel backwards-compatibility requirements, as long as this is just
> kernel<->statd communication and not something that goes across the
> wire
> to other statd implementations.

Agreed.

It would be possible to export the kernel's NSM host cache via sysfs,
for instance. An SM_MON upcall could cause statd to look in /sys for
more information like whether the remote peer is a client or server,
and what transport protocol and what network address the caller used
to contact the local host. This kind of scheme would work well for
both old kernels running the new statd, and new kernels running the
old statd.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-08-05 17:48:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
> Provide a new implementation of statd that supports IPv6. The new
> statd implementation resides under
>
> utils/new-statd/
>
> The contents of this directory are built if --enable-tirpc is set
> on the ./configure command line, and sqlite3 is available on the
> build system. Otherwise, the legacy version of statd, which still
> resides under utils/statd/, is built.
>
> The goals of this re-write are:
>
> o Support IPv6 networking
>
> Support interoperation with TI-RPC-based NSM implementations.
> Transport Independent RPC, or TI-RPC, provides IPv6 network support
> for Linux's NSM implementation.
>
> To support TI-RPC, open code to construct RPC requests in socket
> buffers and then schedule them has been replaced with standard
> library calls.
>
> o Support notification via TCP
>
> As a secondary benefit of using TI-RPC library calls, reboot
> notifications and NLM callbacks can now be sent via connection-
> oriented transport protocols.
>
> Note that lockd does not (yet) tell statd what transport protocol
> to use when sending reboot notifications. statd/sm-notify will
> continue to use UDP for the time being.
>
> o Use an embedded database for storing on-disk callback data
>
> This whole exercise is for the purpose of crash robustness. There
> are well-known deficiencies with simple create/rename/unlink
> disk storage schemes during system crashes. Replace the current
> flat-file monitor list mechanism which uses sync(2) with sqlite3,
> which uses fsync(3).

If someone wants to move around that data, is it still simple to do
that? (Where is it kept on the filesystem?)

(I'm thinking of someone that shares it for high-availabity, as in:

http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3

Or maybe somebody that just needs to move their /var partition to a
different disk one day.)

> o Share code between sm-notify and statd
>
> Statd and sm-notify access the same set of on-disk data. These
> separate programs now share the same code and implementation, with
> access to on-disk data serialized by sqlite3. The two remain
> separate executables to allow other system facilities to send
> reboot notifications without poking statd.
>
> o Reduce impact of DNS outages
>
> The heuristics used by SM_NOTIFY to figure out which remote peer
> has rebooted are heavily dependent on DNS. If the DNS service is
> slow or hangs, that will make the NSM listener unresponsive.
> Incoming SM_NOTIFY requests are now handled in a sidecar process
> to reduce the impact of DNS outages on the NSM service listener.
>
> o Proper my_name support
>
> The current version of statd uses gethostname(3) to generate the
> mon_name argument of SM_NOTIFY. This value can change across a
> reboot. The new version of statd records lockd's my_name, passed
> by every SM_MON request, and uses that when sending SM_NOTIFY.
>
> This can be useful for multi-homed and DHCP configured hosts.
>
> o Send SM_NOTIFY more aggressively
>
> It has been recommended that statd/sm-notify send SM_NOTIFY
> more aggressively (for example, to the entire list returned by
> getaddrinfo(3)). Since SM_NOTIFY's reply is NULL, there's no
> way to tell whether the remote peer recognized the mon_name we
> sent. More study is required, but this implementation attempts
> to send an SM_NOTIFY request to each address returned by
> getaddrinfo(3).
>
> This re-implementation paves the way for a number of future
> improvements. However, it does not immediately address:
>
> o lockd/statd start-up serialization issues
>
> Sending reboot notifications, starting statd and lockd, and opening
> the lockd grace period are still determined independently in user
> space and the kernel.
>
> o Binding mon_names to caller IP addresses
>
> By default, lockd continues to send IP addresses as the mon_name
> argument of the SM_MON procedure. This provides a better guarantee
> of being able to contact remote peers during a reboot, but means
> statd must continue to use heuristics to match incoming SM_NOTIFY
> requests with peers on the monitor list.
>
> o Distinct logic for NFS client- and server-side
>
> Client-side and server-side monitoring requirements are different.
> Statd continues to use the same logic for both NFS client and
> server, as the NSMv1 protocol does not provide any indication
> that a mon_name is for a client or server peer.

Note we probably don't need to be limited by the protocol here, only by
kernel backwards-compatibility requirements, as long as this is just
kernel<->statd communication and not something that goes across the wire
to other statd implementations.

--b.

2009-08-05 18:15:47

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, Aug 05, 2009 at 02:05:44PM -0400, Chuck Lever wrote:
> On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
>> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>>> Provide a new implementation of statd that supports IPv6. The new
>>> statd implementation resides under
>>>
>>> utils/new-statd/
>>>
>>> The contents of this directory are built if --enable-tirpc is set
>>> on the ./configure command line, and sqlite3 is available on the
>>> build system. Otherwise, the legacy version of statd, which still
>>> resides under utils/statd/, is built.
>>>
>>> The goals of this re-write are:
>>>
>>> o Support IPv6 networking
>>>
>>> Support interoperation with TI-RPC-based NSM implementations.
>>> Transport Independent RPC, or TI-RPC, provides IPv6 network support
>>> for Linux's NSM implementation.
>>>
>>> To support TI-RPC, open code to construct RPC requests in socket
>>> buffers and then schedule them has been replaced with standard
>>> library calls.
>>>
>>> o Support notification via TCP
>>>
>>> As a secondary benefit of using TI-RPC library calls, reboot
>>> notifications and NLM callbacks can now be sent via connection-
>>> oriented transport protocols.
>>>
>>> Note that lockd does not (yet) tell statd what transport protocol
>>> to use when sending reboot notifications. statd/sm-notify will
>>> continue to use UDP for the time being.
>>>
>>> o Use an embedded database for storing on-disk callback data
>>>
>>> This whole exercise is for the purpose of crash robustness. There
>>> are well-known deficiencies with simple create/rename/unlink
>>> disk storage schemes during system crashes. Replace the current
>>> flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>> which uses fsync(3).
>>
>> If someone wants to move around that data, is it still simple to do
>> that? (Where is it kept on the filesystem?)
>>
>> (I'm thinking of someone that shares it for high-availabity, as in:
>>
>> http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>>
>> Or maybe somebody that just needs to move their /var partition to a
>> different disk one day.)
>
> Statd's monitor lists and state number are stored in a single regular
> file, /var/lib/nfs/statd/statdb by default. This file can be easily
> backed up, or used on other systems, if desired. I would recommend
> ensuring the NSM state number is reset in the latter case, which can be
> done with the sqlite3 command.
>
> I've had some dialog with Lon Hohberger about clustering requirements. I
> think we are looking at crafting a separate utility that uses sqlite3 C
> function calls to extract data that's interesting to the clustering
> implementation. Again, this could even be scripted with bash and the
> sqlite3 command, but perhaps a C program is more maintainable.

OK, good.

And for the simplest cases, it should still be enough to just copy
/var/lib/nfs/, right?

--b.

2009-08-05 21:22:44

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:

> sqlite3 doesn't do anything special under the covers. It uses only
> POSIX file access and locking calls, as far as I know. So I think
> hosting /var on most well-behaved clustering file systems won't have
> any problem with this arrangement.

So we're basically introducing a dependency on a completely new library
that will have to be added to boot partitions/nfsroot/etc, and we have
no real reason for doing it other than because we want to move from
using sync() to fsync()?

Sounds like a NACK to me...

Trond


2009-08-05 22:24:35

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>> sqlite3 doesn't do anything special under the covers. It uses only
>> POSIX file access and locking calls, as far as I know. So I think
>> hosting /var on most well-behaved clustering file systems won't have
>> any problem with this arrangement.
>
> So we're basically introducing a dependency on a completely new
> library
> that will have to be added to boot partitions/nfsroot/etc, and we have
> no real reason for doing it other than because we want to move from
> using sync() to fsync()?
>
> Sounds like a NACK to me...

Which library are you talking about, libsqlite3 or libtirpc? Because
NEITHER of those is in /lib.

In any event, it's not just sync(2) that is a problem. sync(2) by
itself is a boot performance problem, but it's the combination of
rename and sync that is known to be especially unreliable during
system crashes. Statd, being a crash monitor, shouldn't depend on
rename/sync to maintain persistent data in the face of system
instability. I'd call that a real reason to use something more robust.

Can we try to be a little more constructive, please? Asking the list
(which includes distributors, who actually have to worry about such
things) whether this would be a problem is significantly less abrasive
then just saying "NACK" outright.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-08-05 23:30:12

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> >> sqlite3 doesn't do anything special under the covers. It uses only
> >> POSIX file access and locking calls, as far as I know. So I think
> >> hosting /var on most well-behaved clustering file systems won't have
> >> any problem with this arrangement.
> >
> > So we're basically introducing a dependency on a completely new
> > library
> > that will have to be added to boot partitions/nfsroot/etc, and we have
> > no real reason for doing it other than because we want to move from
> > using sync() to fsync()?
> >
> > Sounds like a NACK to me...
>
> Which library are you talking about, libsqlite3 or libtirpc? Because
> NEITHER of those is in /lib.

libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
established.

> In any event, it's not just sync(2) that is a problem. sync(2) by
> itself is a boot performance problem, but it's the combination of
> rename and sync that is known to be especially unreliable during
> system crashes. Statd, being a crash monitor, shouldn't depend on
> rename/sync to maintain persistent data in the face of system
> instability. I'd call that a real reason to use something more robust.

What are you talking about? Is this about the truncate + rename issue
leaving empty files upon a crash?
That issue is solved trivially by doing an fsync() before you rename the
file. That entire discussion was about whether or not existing
applications should be _required_ to do this kind of POSIX pedantry,
when previously they could get away without it.

IOW: that issue alone does not justify replacing the current simple file
based scheme.

> Can we try to be a little more constructive, please? Asking the list
> (which includes distributors, who actually have to worry about such
> things) whether this would be a problem is significantly less abrasive
> then just saying "NACK" outright.

It would be constructive if you could actually _justify_ these
backward-incompatible changes instead of hand waving, and accusing
others of being obstructionist.

Trond


2009-09-09 18:29:56

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 05 Aug 2009 19:30:04 -0400
Trond Myklebust <[email protected]> wrote:

> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > >> sqlite3 doesn't do anything special under the covers. It uses only
> > >> POSIX file access and locking calls, as far as I know. So I think
> > >> hosting /var on most well-behaved clustering file systems won't have
> > >> any problem with this arrangement.
> > >
> > > So we're basically introducing a dependency on a completely new
> > > library
> > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > no real reason for doing it other than because we want to move from
> > > using sync() to fsync()?
> > >
> > > Sounds like a NACK to me...
> >
> > Which library are you talking about, libsqlite3 or libtirpc? Because
> > NEITHER of those is in /lib.
>
> libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
> established.
>

Sorry to revive this so late, but I think we need to come to some
sort of resolution here. The only missing piece for client side IPv6
support is statd...

I'm not sure I understand the objection to using libsqlite3 here. We
certainly could roll our own routines to handle data storage, but why
would we want to do so? sqlite3 is quite good at what it does. Why
wouldn't we want to use it?

> > In any event, it's not just sync(2) that is a problem. sync(2) by
> > itself is a boot performance problem, but it's the combination of
> > rename and sync that is known to be especially unreliable during
> > system crashes. Statd, being a crash monitor, shouldn't depend on
> > rename/sync to maintain persistent data in the face of system
> > instability. I'd call that a real reason to use something more robust.
>
> What are you talking about? Is this about the truncate + rename issue
> leaving empty files upon a crash?
> That issue is solved trivially by doing an fsync() before you rename the
> file. That entire discussion was about whether or not existing
> applications should be _required_ to do this kind of POSIX pedantry,
> when previously they could get away without it.
>
> IOW: that issue alone does not justify replacing the current simple file
> based scheme.
>

There are other reasons, not to use the simple file-based scheme too...

Internationalized domain names will be easier to deal with via sqlite3,
for instance.

Certainly we could code this up ourselves, but what's the benefit to
doing that when we have a perfectly good data storage engine available?

--
Jeff Layton <[email protected]>

2009-09-10 20:40:21

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>>>> either
>>>> byte in a double-byte character can be '/' or '\0'. That means the
>>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>>>> and
>>>> the system's locale has to be configured correctly. If we decide
>>>> not to
>>>> depend on the file system to support UTF16 filenames, then statd
>>>> has
>>>> to
>>>> be intelligent enough to figure out how to deal with converting
>>>> UTF16
>>>> hostnames before storing them as filenames. Then, we have to teach
>>>> matchhostname() and friends how to deal with double-byte character
>>>> strings...
>>>
>>> Googling around.... Is this accurate?:
>>>
>>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>
>>> That makes it sound like domain names are staying ascii, and they're
>>> just adding something on top to allow encoding unicode using ascii,
>>> which may optionally be used by applications.
>>
>> There is a mechanism that provides an ASCII-ized version of domain
>> names
>> that may contain non-ASCII characters, expressly for applications
>> that
>> need to perform DNS queries but can't be easily converted to handle
>> double-byte character strings. This can be adapted for statd,
>> though I'm
>> not sure if the converted ASCII version of such names specifically
>> exclude '/'.
>>
>> Internationalized domain names themselves are still expressed in
>> UTF16,
>> as far as I understand it.
>
> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears
> to
> me that protocols (at the very least, any preexisting protocols) are
> all
> expected to use the ascii representation on the wire, and that the
> translation to unicode is meant by use for applications.
>
> So in our case we'd continue to expect ascii domain names on the wire,
> and I believe that's also what we should store in any database. But
> if
> someone were to write a gui administrative interface to that data, for
> example, they might choose to use idna for display.

That's a reasonable and specific objection to my claim that our
current host record storage format is inadequate to support IDNA.
I've also confirmed that ToAscii with the UseSTD3ASCIIRules flag set
is not supposed to generate a domain label string with a '/' in it.
My remaining concern here is that we could possibly see hostnames that
are too long to be stored in directory entries of some file systems,
especially considering that the ASCII-fied Unicode names will be
longer than typical ASCII names we normally encounter today.

What about multi-homed host support? The same mon_name can be used
with more than one my_name, for multi-homed hosts. Using the current
on-disk scheme, statd turns that SM_MON request into a no-op. So
additional records for the same hostname can't be stored, or we have
to resort to adding multiple lines in the same file. This is possible
to do with just POSIX file system calls, but it does add complexity to
manage several lines in each hostname file without increasing the risk
of corruption if a file update (especially the deletion of one record
in the middle) is interrupted.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-09 19:42:14

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
> The old statd still exists in nfs-utils. The new statd is an entirely
> separate component. Distributions can continue to use the old statd
> as long as they want. This is a red herring.

Bullshit. If they are adding IPv6 support, then they will have to
upgrade at some point.

> > Simplicity is another reason. WTF do we need a full SQL database, when
> > all we want to do is store 2 pieces of data (a hostname and a cookie)?
> > It isn't as if this has been a major problem for us previously.
>
> Because we are not storing just a hostname and a cookie. We are
> storing several different data items for each host, and we need to
> search over the records, and provide uniqueness constraints, and
> handle data conversion (for binary data like the cookie, for string
> data like the hostname, and for integers, like the prog/vers/proc
> tuple). We need to store them durably on persistent storage to have
> some protection against crashes. These are all things that an
> embedded database can do well, and that we therefore don't have to
> code ourselves.

Speaking of red herrings. Why are we adding all this crap?

This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM from
scratch, just add minimal support for IPv6.

> >>>> In any event, it's not just sync(2) that is a problem. sync(2) by
> >>>> itself is a boot performance problem, but it's the combination of
> >>>> rename and sync that is known to be especially unreliable during
> >>>> system crashes. Statd, being a crash monitor, shouldn't depend on
> >>>> rename/sync to maintain persistent data in the face of system
> >>>> instability. I'd call that a real reason to use something more
> >>>> robust.
> >>>
> >>> What are you talking about? Is this about the truncate + rename
> >>> issue
> >>> leaving empty files upon a crash?
> >>> That issue is solved trivially by doing an fsync() before you
> >>> rename the
> >>> file. That entire discussion was about whether or not existing
> >>> applications should be _required_ to do this kind of POSIX pedantry,
> >>> when previously they could get away without it.
> >>>
> >>> IOW: that issue alone does not justify replacing the current
> >>> simple file
> >>> based scheme.
> >>>
> >>
> >> There are other reasons, not to use the simple file-based scheme
> >> too...
> >>
> >> Internationalized domain names will be easier to deal with via
> >> sqlite3,
> >> for instance.
> >
> > Please explain...
>
> IPv6 is used in Asia, where they almost certainly need to use non-
> ASCII characters in their hostnames. Internationalized domain names
> are stored in double-wide character sets. To provide reliable support
> for IDNs in statd, we will have to guarantee somehow that we can store
> an IDN as a file name (if we want to stay with the current scheme), no
> matter what file system is used for /var.

So, what's stopping us? These are POSIX filesystems. They can store any
filename as long as it doesn't contain '/' or '\0'.

> What's more, multi-homed host support will need to store multiple
> records for the same hostname. The mon_name is the same, but my_name
> is different, for each of these records. So we could do that by
> adding more than one line in each hostname file, but it's also a
> simple matter to set this up in SQL.
>
> When we want to have statd remember things like multiple addresses for
> the same hostname, or whether the remote is a client or server, we
> will need to make more adjustments to the files.
>
> As we get more and more new requirements, why lock ourselves into the
> current on-disk format? Using statd means we can store new fields and
> new records without any backwards-compatibility issues. It's all
> handled by the database code. So, we can think about the high level
> problem of getting statd to behave correctly rather than worry about
> the details of exactly how we are going to get the next data item
> stored in our current files in a backward compatible way.

Again. This is a legacy filesystem. Why are we adding requirements?

> >> Certainly we could code this up ourselves, but what's the benefit to
> >> doing that when we have a perfectly good data storage engine
> >> available?
> >
> > Why change something that works???? Rewriting from scratch is _NOT_
> > the
> > Linux way, and has usually bitten us hard when we've done it.
>
> Because we are adding a bunch of new feature requirements.
> Internationalized domain names, multi-homed host support, IPv6 and TI-
> RPC, fast boot times, keeping better track of remote host addresses,
> keeping track of which remotes are clients and which are servers, and
> support for sending notifications via TCP all require significant
> modifications to this code base.
>
> At some point you have to look at the code you have, and decide it's
> simply not going to be adequate, going forward.
>
> > The 2.6.19 rewrite of the kernel mount code springs to mind...
>
> One can just as easily argue that we've been bitten hard precisely
> because we've let things rot, or because we have inadequate testing
> for these components.
>
> Another red herring, and especially annoying because you've known I
> was rewriting statd for months. Only now, when I'm done, do you say
> "rewriting is not the Linux way."

I have _NEVER_ agreed to a rewrite of the storage formats. You sprang
this crap on me a month ago, and I made my feelings quite clear then.



2009-09-09 19:18:06

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
>> On Wed, 05 Aug 2009 19:30:04 -0400
>> Trond Myklebust <[email protected]> wrote:
>>
>>> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
>>>> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
>>>>> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>>>>>> sqlite3 doesn't do anything special under the covers. It uses
>>>>>> only
>>>>>> POSIX file access and locking calls, as far as I know. So I
>>>>>> think
>>>>>> hosting /var on most well-behaved clustering file systems won't
>>>>>> have
>>>>>> any problem with this arrangement.
>>>>>
>>>>> So we're basically introducing a dependency on a completely new
>>>>> library
>>>>> that will have to be added to boot partitions/nfsroot/etc, and
>>>>> we have
>>>>> no real reason for doing it other than because we want to move
>>>>> from
>>>>> using sync() to fsync()?
>>>>>
>>>>> Sounds like a NACK to me...
>>>>
>>>> Which library are you talking about, libsqlite3 or libtirpc?
>>>> Because
>>>> NEITHER of those is in /lib.
>>>
>>> libsqlite is the problem. Unlike libtirpc, it's utility has yet to
>>> be
>>> established.
>>>
>>
>> Sorry to revive this so late, but I think we need to come to some
>> sort of resolution here. The only missing piece for client side IPv6
>> support is statd...
>>
>> I'm not sure I understand the objection to using libsqlite3 here. We
>> certainly could roll our own routines to handle data storage, but why
>> would we want to do so? sqlite3 is quite good at what it does. Why
>> wouldn't we want to use it?
>
> Backwards compatibility is one major reason. statd already exists, and
> is in use out there. I shouldn't be forced to reboot all my clients
> when
> I upgrade the nfs-utils package on my server.

The old statd still exists in nfs-utils. The new statd is an entirely
separate component. Distributions can continue to use the old statd
as long as they want. This is a red herring.

> Simplicity is another reason. WTF do we need a full SQL database, when
> all we want to do is store 2 pieces of data (a hostname and a cookie)?
> It isn't as if this has been a major problem for us previously.

Because we are not storing just a hostname and a cookie. We are
storing several different data items for each host, and we need to
search over the records, and provide uniqueness constraints, and
handle data conversion (for binary data like the cookie, for string
data like the hostname, and for integers, like the prog/vers/proc
tuple). We need to store them durably on persistent storage to have
some protection against crashes. These are all things that an
embedded database can do well, and that we therefore don't have to
code ourselves.

Simplicity is in the eye of the beholder. I can easily counterargue
that it's simpler to rely on a library than to duplicate the
functionality in statd. If we think it is appropriate to use glibc
for managing a memory heap rather than open coding it, and libtirpc
for handling RPC calls rather than open coding it, then why shouldn't
we use another library for managing our host records on disk, rather
than open coding our data conversion and record searching logic?

Again, this is a red herring.

>>>> In any event, it's not just sync(2) that is a problem. sync(2) by
>>>> itself is a boot performance problem, but it's the combination of
>>>> rename and sync that is known to be especially unreliable during
>>>> system crashes. Statd, being a crash monitor, shouldn't depend on
>>>> rename/sync to maintain persistent data in the face of system
>>>> instability. I'd call that a real reason to use something more
>>>> robust.
>>>
>>> What are you talking about? Is this about the truncate + rename
>>> issue
>>> leaving empty files upon a crash?
>>> That issue is solved trivially by doing an fsync() before you
>>> rename the
>>> file. That entire discussion was about whether or not existing
>>> applications should be _required_ to do this kind of POSIX pedantry,
>>> when previously they could get away without it.
>>>
>>> IOW: that issue alone does not justify replacing the current
>>> simple file
>>> based scheme.
>>>
>>
>> There are other reasons, not to use the simple file-based scheme
>> too...
>>
>> Internationalized domain names will be easier to deal with via
>> sqlite3,
>> for instance.
>
> Please explain...

IPv6 is used in Asia, where they almost certainly need to use non-
ASCII characters in their hostnames. Internationalized domain names
are stored in double-wide character sets. To provide reliable support
for IDNs in statd, we will have to guarantee somehow that we can store
an IDN as a file name (if we want to stay with the current scheme), no
matter what file system is used for /var.

What's more, multi-homed host support will need to store multiple
records for the same hostname. The mon_name is the same, but my_name
is different, for each of these records. So we could do that by
adding more than one line in each hostname file, but it's also a
simple matter to set this up in SQL.

When we want to have statd remember things like multiple addresses for
the same hostname, or whether the remote is a client or server, we
will need to make more adjustments to the files.

As we get more and more new requirements, why lock ourselves into the
current on-disk format? Using statd means we can store new fields and
new records without any backwards-compatibility issues. It's all
handled by the database code. So, we can think about the high level
problem of getting statd to behave correctly rather than worry about
the details of exactly how we are going to get the next data item
stored in our current files in a backward compatible way.

>> Certainly we could code this up ourselves, but what's the benefit to
>> doing that when we have a perfectly good data storage engine
>> available?
>
> Why change something that works???? Rewriting from scratch is _NOT_
> the
> Linux way, and has usually bitten us hard when we've done it.

Because we are adding a bunch of new feature requirements.
Internationalized domain names, multi-homed host support, IPv6 and TI-
RPC, fast boot times, keeping better track of remote host addresses,
keeping track of which remotes are clients and which are servers, and
support for sending notifications via TCP all require significant
modifications to this code base.

At some point you have to look at the code you have, and decide it's
simply not going to be adequate, going forward.

> The 2.6.19 rewrite of the kernel mount code springs to mind...

One can just as easily argue that we've been bitten hard precisely
because we've let things rot, or because we have inadequate testing
for these components.

Another red herring, and especially annoying because you've known I
was rewriting statd for months. Only now, when I'm done, do you say
"rewriting is not the Linux way."

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-09 23:15:42

by Steve Dickson

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)



On 09/09/2009 06:18 PM, Chuck Lever wrote:
> On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
>> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>>> The old statd still exists in nfs-utils. The new statd is an entirely
>>> separate component. Distributions can continue to use the old statd
>>> as long as they want. This is a red herring.
>>
>> Bullshit. If they are adding IPv6 support, then they will have to
>> upgrade at some point.
>
> I don't see a problem with a distribution upgrade using old statd and a
> fresh install using new statd. You have to install a lot of new
> components to get NFS/IPv6 support.
What new components that are not already being installed??

> It's not like the only thing that needs to change is statd.
> People will install a new distribution to get IPv6 support.
> With so many simple ways to install from scratch, the days of someone
> upgrading just a few pieces of an old system to get a new feature,
> especially one as extensive as NFS/IPv6, are long gone.
I'm not sure how people could only install bits and pieces of
nfs-utils... Even 'make install' in the git tree installs everything...

>
> And you have never clearly answered why it wouldn't be enough to add a
> little code to convert the current on-disk format to sqlite3 when
> upgrading to the new statd, if upgradability is truly an important
> requirement. Possibly this is because it eliminates the only real
> technical objection you have to using sqlite3 here.
The issue I would have with using sqlite3 is it would add yet another
requirement on nfs-utils... I really don't know how big sqlite3 and/or
sqlite3-devel (possibly needed for builds) packages are but it just
one more thing will be need for nfs-utils to function...

>
>>>> Simplicity is another reason. WTF do we need a full SQL database, when
>>>> all we want to do is store 2 pieces of data (a hostname and a cookie)?
>>>> It isn't as if this has been a major problem for us previously.
>>>
>>> Because we are not storing just a hostname and a cookie. We are
>>> storing several different data items for each host, and we need to
>>> search over the records, and provide uniqueness constraints, and
>>> handle data conversion (for binary data like the cookie, for string
>>> data like the hostname, and for integers, like the prog/vers/proc
>>> tuple). We need to store them durably on persistent storage to have
>>> some protection against crashes. These are all things that an
>>> embedded database can do well, and that we therefore don't have to
>>> code ourselves.
>>
>> Speaking of red herrings. Why are we adding all this crap?
>>
>> This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM from
>> scratch, just add minimal support for IPv6.
>
> You and Bruce brought up a number of work items related to statd,
> including having distinct statd behavior for remotes who are clients and
> remotes who are servers. Tom Talpey suggested we needed to send
> multiple SM_NOTIFY requests to each host, and use TCP to do it when
> possible, and you even specifically encouraged me to read his
> connectathon presentation on this. If Asian countries are driving the
> IPv6 requirement, why wouldn't they want IDN support as well?
> Interoperable NFS/IPv6 support requires TI-RPC. Plus, NFS/IPv6
> practically requires multi-homed NLM/NSM support -- see Alex's RFC draft
> for details on that.
So a database is needed to accomplish all this?

>
> Let me also point out that old statd is already broken in a number of
> ways, and I certainly haven't heard a lot of complaints about it. Our
> client NLM has sent "0" as our NSM state number for years, for example.
> Thus I hardly think there is a lot of risk in making changes here. It
> can only get better.
>
I can agree with you here...

>>> IPv6 is used in Asia, where they almost certainly need to use non-
>>> ASCII characters in their hostnames. Internationalized domain names
>>> are stored in double-wide character sets. To provide reliable support
>>> for IDNs in statd, we will have to guarantee somehow that we can store
>>> an IDN as a file name (if we want to stay with the current scheme), no
>>> matter what file system is used for /var.
>>
>> So, what's stopping us? These are POSIX filesystems. They can store any
>> filename as long as it doesn't contain '/' or '\0'.
>
> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
> byte in a double-byte character can be '/' or '\0'. That means the
> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
> the system's locale has to be configured correctly. If we decide not to
> depend on the file system to support UTF16 filenames, then statd has to
> be intelligent enough to figure out how to deal with converting UTF16
> hostnames before storing them as filenames. Then, we have to teach
> matchhostname() and friends how to deal with double-byte character
> strings...
Has this been a problem in the past? How are other implementations
dealing with this? Have they gone to use a db as well?

>
> Or we just tell sqlite3 that this is a double-byte character string, and
> let it handle the collation and on-disk storage details for us.
>
> The point is, this is yet another detail we have to either worry about
> and open code in statd, or we can simply rely on what's already provided
> in sqlite3. No one, repeat NO ONE, is arguing that you can't implement
> these features without sqlite3. My argument is that we quickly bury a
> whole bunch of details if we use sqlite3, and can then focus on larger
> issues. That's the prime goal of software layering with libraries.
What kind of performance hit will there be (if any)? The nice thing
about a file is you only have to read it once in to a cache verses
doing a number of queries... or can one also cache queries?

>
> We can open code any or all of statd. In fact the current statd open
> codes RPC request creation in socket buffers rather than using glibc's
> RPC API, and I think we agree that is not an optimal solution. The
> question is: should we duplicate code and bugs by open coding statd's
> RPC and data storage? Or should we pretend to be modern software
> engineers, and use widely used and known good code that other people
> have written already to handle these details?
I'm all for using moving forward with "modern software" but, as
a common theme with me, I'm always worried about becoming
needlessly complicated or over engineering... which might be
the case with having statd use a db...

steved.

2009-09-15 02:45:34

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 14, 2009, at 3:08 AM, Neil Brown wrote:
> On Thursday September 10, [email protected] wrote:
>> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:

> But you will leave one day. How can you best make sure that you leave
> something that others can maintain????

By writing code that is self-explanatory, providing lots of comments
in the code, adding to the git log (as you suggested) and writing
expansive man pages that describe the interfaces in as clear a manner
as possible. The review process is also part of that effort.

There is also the possibility of mentoring others, as FreeBSD does,
and providing extensive written documentation and specifications in
wikis. Agile methodologies suggest that rewriting as a regular
practice is a good way for a team to retain familiarity with a code
base. Having a full test suite that can be used to verify the
behavior of new or existing code is also a way to codify requirements
and create an institutional memory of regressions, as well as to
insulate users from regressions in new code.

>> My point is that many of the items I mentioned above are expressly
>> designed to allow quicker, less risky change, precisely to decrease
>> the amount of time and effort to get new features into our code. Yet
>> we turn our back on all of them in favor of an antique "don't touch
>> that!" policy. "Don't touch that!" is not a reasonable argument
>> against replacing components that need to be replaced.
>
> The only "Don't touch that" which I am aware of relates to interfaces,
> particularly with established code.
> In the case of statd, the files in sm/ and sm.bak/ are a well
> established interface. Exactly how much is dependant on it is hard to
> say. Not much formal code I expect but maybe some obscure scripts and
> lots of sysadmin knowledge.

There is no documentation I'm aware of of statd's on-disk format as a
formal interface. I have had some recent conversations with Lon about
this, to handle any dependencies his clustering scripts may have, and
he didn't throw up any flags. He told me that all we needed was to
provide a mechanism to access this data from a shell script, which we
would have in 'sqlite3' the executable.

So this is a new requirement (to me, anyway). If these files
constitute a formal interface, how can statd be modified to store
additional data or new data types in these files? Am I allowed to put
IPv6 presentation addresses in these files in place of IPv4
addresses? Am I allowed to add new fields? Not rhetorical
questions... really... how should I go about doing this and testing
the result? You seem to be suggesting that the sm/* files can't be
used for the kind of features we want to add.

> Can you run them both in parallel?? i.e. have a database with all the
> data, but also store it in the files (if the hostname can be
> represented in ASCII)... It is hard to guess how easy that would be
> and how worthwhile it would be. And it doesn't answer the question of
> whether sqlite is stable enough.

Is it even a good thing to freeze the sm/* files as a formal
interface, or should we go about providing a real documented
programming interface for this, and migrate to it? There is a real
risk to maintaining undocumented interfaces like this, and that is
that we can't make any change to this code without a significant
possibility of breaking something.

>>> I think that the switch from portmap to rpcbind was a bad idea,
>>> and I think that a wholesale replacement of statd is probably a
>>> bad idea too. It might seem like the easiest way to get something
>>> useful working, but you'll probably be paying the price for years as
>>> little regression turn up because you didn't completely understand
>>> the original statd (and face it, who does?)
>>
>> Yes, but _why_ is it a bad idea? All I hear is "this is a bad idea"
>> and "you could do it some other way" but these are qualitative, not
>> quantitative arguments. They are religious statements, not specific
>> technical criticisms.
>
> It is a bad idea because it doesn't have the legacy of testing and
> refinement. Almost as soon as we started using it bugs were found -
> or at least differences in behaviour to portmap (something about the
> privilege level required to register a binding I think).
>
> Now I admit that no one put their hand up to add IPv6 support to
> portmap, arguably it could have been a worse idea to stay with portmap
> as it meant no IPv6. But changing was still a bad idea.
>
> Had we (had the man power to) incrementally enhance portmap we would
> have had a much more reviewable process, and a bisectable result which
> would allow regression to be isolated more directly.

What we have now is an inherited body of code (with its own history of
incremental improvement) that is shared with many other operating
systems, which improves our ability to interoperate with them, and
includes bug fixes that have been made to it over the years.

I think we would have had some bugs and regressions pursuing either
path. There are well-understood ways to manage these risks, either way.

But this is a sidebar.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-10 15:01:47

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 7:15 PM, Steve Dickson wrote:
> On 09/09/2009 06:18 PM, Chuck Lever wrote:
>> On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
>>> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>>>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>>>> The old statd still exists in nfs-utils. The new statd is an
>>>> entirely
>>>> separate component. Distributions can continue to use the old
>>>> statd
>>>> as long as they want. This is a red herring.
>>>
>>> Bullshit. If they are adding IPv6 support, then they will have to
>>> upgrade at some point.
>>
>> I don't see a problem with a distribution upgrade using old statd
>> and a
>> fresh install using new statd. You have to install a lot of new
>> components to get NFS/IPv6 support.
> What new components that are not already being installed??

You need a kernel that can do NFS/IPv6, you need to install rpcbind
and libtirpc, you need the new mount command, you need all the user
space network pieces to manage IPv6, you need to consider firewall and
address distribution on your local network, and you need statd and
mountd/exportfs to get NFS/IPv6 support.

Configuring a system for IPv6 support can also be nontrivial, and not
something people will do on a whim.

I didn't mean to imply that some of these components are not already
installed. My point is that the required changes for NFS/IPv6 are
wide spread, and that most people would opt for installing a new OS on
their systems to get these features, rather than upgrade all of these
items piecemeal.

>> And you have never clearly answered why it wouldn't be enough to
>> add a
>> little code to convert the current on-disk format to sqlite3 when
>> upgrading to the new statd, if upgradability is truly an important
>> requirement. Possibly this is because it eliminates the only real
>> technical objection you have to using sqlite3 here.
> The issue I would have with using sqlite3 is it would add yet another
> requirement on nfs-utils... I really don't know how big sqlite3 and/or
> sqlite3-devel (possibly needed for builds) packages are but it just
> one more thing will be need for nfs-utils to function...

sqlite3.org provides a single source file version of sqlite3 that is
licensed and designed explicitly for folks to include in their own
code, without the need for linking a library. You can even disable a
number of build time options to reduce object size.

This means that the libsqlite3 and libsqlite3-devel packages would not
be required on either the build system or the end system, and it
eliminates the issue of whether libsqlite3.so can be moved to /lib.

>>>>> Simplicity is another reason. WTF do we need a full SQL
>>>>> database, when
>>>>> all we want to do is store 2 pieces of data (a hostname and a
>>>>> cookie)?
>>>>> It isn't as if this has been a major problem for us previously.
>>>>
>>>> Because we are not storing just a hostname and a cookie. We are
>>>> storing several different data items for each host, and we need to
>>>> search over the records, and provide uniqueness constraints, and
>>>> handle data conversion (for binary data like the cookie, for string
>>>> data like the hostname, and for integers, like the prog/vers/proc
>>>> tuple). We need to store them durably on persistent storage to
>>>> have
>>>> some protection against crashes. These are all things that an
>>>> embedded database can do well, and that we therefore don't have to
>>>> code ourselves.
>>>
>>> Speaking of red herrings. Why are we adding all this crap?
>>>
>>> This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM
>>> from
>>> scratch, just add minimal support for IPv6.
>>
>> You and Bruce brought up a number of work items related to statd,
>> including having distinct statd behavior for remotes who are
>> clients and
>> remotes who are servers. Tom Talpey suggested we needed to send
>> multiple SM_NOTIFY requests to each host, and use TCP to do it when
>> possible, and you even specifically encouraged me to read his
>> connectathon presentation on this. If Asian countries are driving
>> the
>> IPv6 requirement, why wouldn't they want IDN support as well?
>> Interoperable NFS/IPv6 support requires TI-RPC. Plus, NFS/IPv6
>> practically requires multi-homed NLM/NSM support -- see Alex's RFC
>> draft
>> for details on that.
> So a database is needed to accomplish all this?

No, a database is not specifically required.

However, libsqlite3 is a library that contains all of the elements --
durable on-disk storage, proper data conversion for binary blobs,
single- and double-width character strings, integers, the ability to
constrain record uniqueness, the ability to add new data items easily
to each record, and a facility for collating and searching the host
records.

sqlite3 is an embedded database, meaning the implementation is
purposely smaller than a full SQL database, and is designed explicitly
to have zero database administration requirements. sqlite3 is
designed for managing data for long-running network daemons, and it is
widely used for that purpose.

If there is some other pre-existing code that can do this, I'm open to
considering it.

>> Let me also point out that old statd is already broken in a number of
>> ways, and I certainly haven't heard a lot of complaints about it.
>> Our
>> client NLM has sent "0" as our NSM state number for years, for
>> example.
>> Thus I hardly think there is a lot of risk in making changes here.
>> It
>> can only get better.
>>
> I can agree with you here...
>
>>>> IPv6 is used in Asia, where they almost certainly need to use non-
>>>> ASCII characters in their hostnames. Internationalized domain
>>>> names
>>>> are stored in double-wide character sets. To provide reliable
>>>> support
>>>> for IDNs in statd, we will have to guarantee somehow that we can
>>>> store
>>>> an IDN as a file name (if we want to stay with the current
>>>> scheme), no
>>>> matter what file system is used for /var.
>>>
>>> So, what's stopping us? These are POSIX filesystems. They can
>>> store any
>>> filename as long as it doesn't contain '/' or '\0'.
>>
>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>> either
>> byte in a double-byte character can be '/' or '\0'. That means the
>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>> and
>> the system's locale has to be configured correctly. If we decide
>> not to
>> depend on the file system to support UTF16 filenames, then statd
>> has to
>> be intelligent enough to figure out how to deal with converting UTF16
>> hostnames before storing them as filenames. Then, we have to teach
>> matchhostname() and friends how to deal with double-byte character
>> strings...
> Has this been a problem in the past? How are other implementations
> dealing with this? Have they gone to use a db as well?

No, IDNs are recent, but it is reasonable to think that
internationalized domain names is a feature that would appeal to the
same folks who are driving the IPv6 requirement. This is not a hard
requirement, but it is one reason why statd's current on-disk format
is not adequate.

Yes, I understand that there are some statd implementations that use a
database rather than flat files. statd is nothing if not exactly a
mechanism for storing structured data across system crashes. That's
exactly what databases are for.

>> Or we just tell sqlite3 that this is a double-byte character
>> string, and
>> let it handle the collation and on-disk storage details for us.
>>
>> The point is, this is yet another detail we have to either worry
>> about
>> and open code in statd, or we can simply rely on what's already
>> provided
>> in sqlite3. No one, repeat NO ONE, is arguing that you can't
>> implement
>> these features without sqlite3. My argument is that we quickly
>> bury a
>> whole bunch of details if we use sqlite3, and can then focus on
>> larger
>> issues. That's the prime goal of software layering with libraries.
> What kind of performance hit will there be (if any)? The nice thing
> about a file is you only have to read it once in to a cache verses
> doing a number of queries... or can one also cache queries?

sqlite3's performance for the statd application would actually be
better than what we have today.

Naturally the database is cached in memory, making queries as fast as
memory reads. The better performance comes with record insertion and
deletion. Today statd does a file create and then an O_SYNC write to
that file. This requires synchronous metadata updates to the file
system to create the new file and create a new directory entry for
it. If the directory becomes large, creating a new directory entry
becomes even slower. Likewise for record deletion, multiple
synchronous metadata updates are required to remove the directory
entry and the file containing the host record.

With sqlite3 (or any database style solution) record insertion and
deletion can usually be handled with a single O_SYNC write to the
database file.

You could argue that using sqlite3 means more CPU and memory
consumption. Perhaps, but that's a less onerous resource requirement
than synchronous disk activity, in my view.

>> We can open code any or all of statd. In fact the current statd open
>> codes RPC request creation in socket buffers rather than using
>> glibc's
>> RPC API, and I think we agree that is not an optimal solution. The
>> question is: should we duplicate code and bugs by open coding statd's
>> RPC and data storage? Or should we pretend to be modern software
>> engineers, and use widely used and known good code that other people
>> have written already to handle these details?
> I'm all for using moving forward with "modern software" but, as
> a common theme with me, I'm always worried about becoming
> needlessly complicated or over engineering... which might be
> the case with having statd use a db...

Consider what would happen if we open coded all of the details of on-
disk storage and record searching into statd itself. I think
something like sqlite3 is a better and less complex solution than open
coding because all these details are moved out of statd into a pre-
existing library, thus making statd itself architecturally simpler,
and therefore easier to understand and maintain.

The one weakness here is the dependence on SQL. That makes the statd
code uglier and more complex than I would like, and is something I
want to address.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-14 07:07:39

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thursday September 10, [email protected] wrote:
> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> > On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:
> >> The idea that "the Linux way" is the best and only way is ridiculous
> >> on its face, anyway. I mean, what do you expect when we have no
> >> requirements and specification process, no formal testing, C coding
> >> style conventions based on 20-year old coding practices, a hit-or-
> >> miss
> >> review process that relies more on reviewers' personal preferences
> >> than any kind of standards, no static code analysis tools, no defect
> >> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
> >> combative mailing list environment, and parts of our knowledge base
> >> and team history are lost every time a developer leaves (in this
> >> case,
> >> Olaf and Neil)? It's no wonder we never change anything unless
> >> absolutely necessary!
> >
> > And yet is largely works!
> > I could summarise a lot of your points by observing that the community
> > values people over process. I really think that is the right place to
> > put value, because people are richer and more flexible than process.
>
> Agreed, but there are risks to that approach as well, which are
> largely ignored by the Linux community. The last point in my list is
> probably the biggest risk: when people leave, we are stuck with
> decades-old code that no-one understands. Cf: statd.
>

Much of my understanding of statd is embedded in the code in
nfs-utils, and in the change log (which is much better since we
changed to git).
By discarding all that and creating a new project you risk losing that
legacy.

There can certainly be a case of discarding and starting again. I did
that with the support tools for raid (raidtools is no-more, mdadm
reigns :-). But this is not a decision to be taken lightly and is not
without it's costs. The fact that raidtools was by that time
essentially an unmaintained orphan made some of those costs
unavoidable in my case.

You obviously could do the same thing - you don't need anyone's
permission: create a new project called "statd" and make it whatever
you want and hope that distributors will pick it up. If your statd
supports IPv6, and the nfs-utils one does not, then there is a good
chance that distros will pick it up as people like to see that big
tick next to "IPv6 support".

Maintaining and developing such a thing long enough to establish a
self-sustaining community would be a big effort. You would need to
compare that with the effort of taking the incremental approach and
revising the code in nfs-utils and getting those revisions accepted.
That would probably have a higher development cost, but a lower
maintenance cost. It is very hard to know up front which cost is
lower.

But you will leave one day. How can you best make sure that you leave
something that others can maintain????


> My point is that many of the items I mentioned above are expressly
> designed to allow quicker, less risky change, precisely to decrease
> the amount of time and effort to get new features into our code. Yet
> we turn our back on all of them in favor of an antique "don't touch
> that!" policy. "Don't touch that!" is not a reasonable argument
> against replacing components that need to be replaced.

The only "Don't touch that" which I am aware of relates to interfaces,
particularly with established code.
In the case of statd, the files in sm/ and sm.bak/ are a well
established interface. Exactly how much is dependant on it is hard to
say. Not much formal code I expect but maybe some obscure scripts and
lots of sysadmin knowledge.

Can you run them both in parallel?? i.e. have a database with all the
data, but also store it in the files (if the hostname can be
represented in ASCII)... It is hard to guess how easy that would be
and how worthwhile it would be. And it doesn't answer the question of
whether sqlite is stable enough.

>
> > I agree that combative mailing lists are a problem, but even there, I
> > believe most of the aggression is more perceived than real, and that
> > a graceful, humble, polite attitude can have a positive-feedback
> > effect
> > too.
>
> Years ago I believed that, but I have seen much evidence to the
> contrary in this community. More often such an attitude is entirely
> ignored, or treated as an invitation for abuse, especially by people
> who have no interest in politeness. This kind of approach has no
> effect on the leaders in the Linux community, who set an example of
> extreme rudeness and belligerence.

That last comment is interesting. At the 2007 kernel summit (the last
one I was at) the topic of mailing list etiquette was discussed and
there seemed to be agreement that we, the leaders (I guess the kernel
summit attendees are the closest we have to group leadership) have a
role in damping down the fire, not building it up. My feeling is that
most high profile people do quite well, but maybe I am too forgiving.

>
> I've made an effort to stop arguing small points, and to make
> observations and not argue. I still get e-mail full of "crap" this
> and "bullshit" that and "NACK!" with little explanation.

I certainly agree that sort of response is best not sent. And I must
confess to a recent experience (on a different list) where I gave up
due to similar behaviour. That is partly why I decided to join in
this discussion (though I'm not sure if I'm being helpful yet).

>
> > Yes, there are lots of practices that might improve things that we
> > don't
> > have standardised. But one practice we do have that has proven very
> > effective is incremental refinement. It can be hard to understand
> > what
> > order to make changes until after you have made them, but once you
> > understand what you want to do, going back and doing it in logical
> > order really is very effective. It makes it easier for others to
> > review, it makes it easy for you to review yourself. It means
> > less controversial bits can be included quickly leaving room for the
> > more controversial bits to be discussed in isolation.
>
> I am a fan of incremental refinement, and I use that approach as often
> as I can. There are some things that incremental refinement cannot
> do, however.
>
> > I think that the switch from portmap to rpcbind was a bad idea,
> > and I think that a wholesale replacement of statd is probably a
> > bad idea too. It might seem like the easiest way to get something
> > useful working, but you'll probably be paying the price for years as
> > little regression turn up because you didn't completely understand
> > the original statd (and face it, who does?)
>
> Yes, but _why_ is it a bad idea? All I hear is "this is a bad idea"
> and "you could do it some other way" but these are qualitative, not
> quantitative arguments. They are religious statements, not specific
> technical criticisms.

It is a bad idea because it doesn't have the legacy of testing and
refinement. Almost as soon as we started using it bugs were found -
or at least differences in behaviour to portmap (something about the
privilege level required to register a binding I think).

Now I admit that no one put their hand up to add IPv6 support to
portmap, arguably it could have been a worse idea to stay with portmap
as it meant no IPv6. But changing was still a bad idea.

Had we (had the man power to) incrementally enhance portmap we would
have had a much more reviewable process, and a bisectable result which
would allow regression to be isolated more directly.

>
> This leaves me with the impression that folks are responding out of a
> fear of the unknown, and not out of a considered technical opinion.
> If sqlite3 is outside of people's comfort zone, that's OK. Please
> let's be honest about it instead of slinging mud and throwing up a
> bunch of generic arguments that no one can rebut.
>
> > As for the use of sql-lite ... I must admit that I wouldn't choose
> > it. Maybe it is a good idea. If it is, you probably need to merge
> > that change early with a clear argument and tools to make it
> > manageable
> > (e.g. a developer will want to tool to be able to look inside the
> > database easily and make changes, without having to know sql).
> > It is much easier to discuss one thing at a time on these
> > combative mailing lists ;-)
>
> That's a fine and constructive comment, thanks.
>
> There is already a tool for managing the data in the database:
> 'sqlite3' the executable, which can be used in shell scripts. There
> are also sqlite3 libraries for Python and Perl and C. This is really
> not very different from POSIX file system calls and using 'cat'. SQL
> is not difficult to learn, and a one or two page recipe document is
> easy to provide. I certainly have not used any advanced features of
> SQL to implement new statd.
>
> Given the complexity of the change, it makes it much easier to argue
> against sqlite3, however, if it is separated from the set of changes
> that motivate its use. Bruce, for example, has stated to me
> specifically that he prefers having such changes and their
> motivational requirements included in the same patch. I regularly
> code for several different maintainers, and each one has his own
> preferences, often contradicting other maintainers. I don't think
> regular maintainers have any idea how confusing and challenging this is.
>

:-)

I think there has to be room for balance here. Certainly it is best
to use new functionality as soon as it is added so the use case is
clear. But I can imagine that converting from a files database to a
sqlite database would be several patches in itself. And then if you
want to start adding 'search' or 'utf-16' functionality, that would be
more patches still. Combining all that in to one big patch just to
keep the change with the motivation seems unlikely to be a win from
anyone's perspective.
Certainly having a real big changelog comment early on that
explains the value of sqlite would be essential, and that patch would
not be merged until the subsequent ones were reviewed.


> Note, however, I am not married to the specifics of sqlite3. What I
> am attached to is the ideas that the current system is inadequate for
> the kinds of features we want to add to statd, and that statd should
> not worry about the details of data storage, or search and management
> of the host records, because we have other tools that are better at
> this. If there is another solution that provides a durable and
> flexible way to store and search host records and offload the details
> to pre-existing code, I'm open. However, sqlite3 is the most widely
> used embedded database on the planet, and is eminently suitable for
> this task.
>
> > And we do have static code analysis tools. Both 'gcc' and 'sparse'
> > fit that description.
>
> Yes, gcc can provide some static analysis, if the correct options are
> specified, and care is taken to eliminate the noise of false
> positives. There is a prevailing attitude, however, that this is a
> worthless endeavor. Witness the amount of noise that comes out when
> you build a Red Hat kernel or the nfs-utils package. Witness also the
> sarcasm of Linus who repeatedly chides folks for not running sparse
> regularly.
>
> Additionally, gcc is not the best tool for this job, given the often
> oblique way it calls out errors and warnings. There are purify,
> fortify, and splint, just to name three, that are standard analysis
> tools we don't even consider.
>
> My comment goes more to the point that static analysis is not
> considered of any value.

You might be right....

I see there be a "law of diminishing returns" here.
sparse (aka "make C=1") has almost never shown me anything
interesting. So while there is a (small) cost in running it, there is
almost no perceived value.

For my own C projects I always compile with "-Wall -Werror" to
remove the cost of running it and to artificially increase the value.

I just tried "make C=1" on a couple of bits of kernel code and it did
provide some vaguely interesting things that I'll probably fix.
But there was a lot of noise like:
warning: potentially expensive pointer subtraction
which I don't think I want to fix, but don't know how to silence the
warning for just that case. It would be nice of "C=1" was a default
and either the warning, or sparse, were fixed. That might increase
the perception that this sort of thing was of value.

Some time ago Greg Banks went on a pursuit of warnings in nfs-utils
and got rid of all of them -- except those generated by rpcgen.
Have more been introduced?

NeilBrown

2009-09-10 14:09:39

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:
>> The idea that "the Linux way" is the best and only way is ridiculous
>> on its face, anyway. I mean, what do you expect when we have no
>> requirements and specification process, no formal testing, C coding
>> style conventions based on 20-year old coding practices, a hit-or-
>> miss
>> review process that relies more on reviewers' personal preferences
>> than any kind of standards, no static code analysis tools, no defect
>> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
>> combative mailing list environment, and parts of our knowledge base
>> and team history are lost every time a developer leaves (in this
>> case,
>> Olaf and Neil)? It's no wonder we never change anything unless
>> absolutely necessary!
>
> And yet is largely works!
> I could summarise a lot of your points by observing that the community
> values people over process. I really think that is the right place to
> put value, because people are richer and more flexible than process.

Agreed, but there are risks to that approach as well, which are
largely ignored by the Linux community. The last point in my list is
probably the biggest risk: when people leave, we are stuck with
decades-old code that no-one understands. Cf: statd.

My point is that many of the items I mentioned above are expressly
designed to allow quicker, less risky change, precisely to decrease
the amount of time and effort to get new features into our code. Yet
we turn our back on all of them in favor of an antique "don't touch
that!" policy. "Don't touch that!" is not a reasonable argument
against replacing components that need to be replaced.

> I agree that combative mailing lists are a problem, but even there, I
> believe most of the aggression is more perceived than real, and that
> a graceful, humble, polite attitude can have a positive-feedback
> effect
> too.

Years ago I believed that, but I have seen much evidence to the
contrary in this community. More often such an attitude is entirely
ignored, or treated as an invitation for abuse, especially by people
who have no interest in politeness. This kind of approach has no
effect on the leaders in the Linux community, who set an example of
extreme rudeness and belligerence.

I've made an effort to stop arguing small points, and to make
observations and not argue. I still get e-mail full of "crap" this
and "bullshit" that and "NACK!" with little explanation.

> Yes, there are lots of practices that might improve things that we
> don't
> have standardised. But one practice we do have that has proven very
> effective is incremental refinement. It can be hard to understand
> what
> order to make changes until after you have made them, but once you
> understand what you want to do, going back and doing it in logical
> order really is very effective. It makes it easier for others to
> review, it makes it easy for you to review yourself. It means
> less controversial bits can be included quickly leaving room for the
> more controversial bits to be discussed in isolation.

I am a fan of incremental refinement, and I use that approach as often
as I can. There are some things that incremental refinement cannot
do, however.

> I think that the switch from portmap to rpcbind was a bad idea,
> and I think that a wholesale replacement of statd is probably a
> bad idea too. It might seem like the easiest way to get something
> useful working, but you'll probably be paying the price for years as
> little regression turn up because you didn't completely understand
> the original statd (and face it, who does?)

Yes, but _why_ is it a bad idea? All I hear is "this is a bad idea"
and "you could do it some other way" but these are qualitative, not
quantitative arguments. They are religious statements, not specific
technical criticisms.

This leaves me with the impression that folks are responding out of a
fear of the unknown, and not out of a considered technical opinion.
If sqlite3 is outside of people's comfort zone, that's OK. Please
let's be honest about it instead of slinging mud and throwing up a
bunch of generic arguments that no one can rebut.

> As for the use of sql-lite ... I must admit that I wouldn't choose
> it. Maybe it is a good idea. If it is, you probably need to merge
> that change early with a clear argument and tools to make it
> manageable
> (e.g. a developer will want to tool to be able to look inside the
> database easily and make changes, without having to know sql).
> It is much easier to discuss one thing at a time on these
> combative mailing lists ;-)

That's a fine and constructive comment, thanks.

There is already a tool for managing the data in the database:
'sqlite3' the executable, which can be used in shell scripts. There
are also sqlite3 libraries for Python and Perl and C. This is really
not very different from POSIX file system calls and using 'cat'. SQL
is not difficult to learn, and a one or two page recipe document is
easy to provide. I certainly have not used any advanced features of
SQL to implement new statd.

Given the complexity of the change, it makes it much easier to argue
against sqlite3, however, if it is separated from the set of changes
that motivate its use. Bruce, for example, has stated to me
specifically that he prefers having such changes and their
motivational requirements included in the same patch. I regularly
code for several different maintainers, and each one has his own
preferences, often contradicting other maintainers. I don't think
regular maintainers have any idea how confusing and challenging this is.

Note, however, I am not married to the specifics of sqlite3. What I
am attached to is the ideas that the current system is inadequate for
the kinds of features we want to add to statd, and that statd should
not worry about the details of data storage, or search and management
of the host records, because we have other tools that are better at
this. If there is another solution that provides a durable and
flexible way to store and search host records and offload the details
to pre-existing code, I'm open. However, sqlite3 is the most widely
used embedded database on the planet, and is eminently suitable for
this task.

> And we do have static code analysis tools. Both 'gcc' and 'sparse'
> fit that description.

Yes, gcc can provide some static analysis, if the correct options are
specified, and care is taken to eliminate the noise of false
positives. There is a prevailing attitude, however, that this is a
worthless endeavor. Witness the amount of noise that comes out when
you build a Red Hat kernel or the nfs-utils package. Witness also the
sarcasm of Linus who repeatedly chides folks for not running sparse
regularly.

Additionally, gcc is not the best tool for this job, given the often
oblique way it calls out errors and warnings. There are purify,
fortify, and splint, just to name three, that are standard analysis
tools we don't even consider.

My comment goes more to the point that static analysis is not
considered of any value.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-14 13:54:35

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, 2009-09-10 at 10:09 -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> > I agree that combative mailing lists are a problem, but even there, I
> > believe most of the aggression is more perceived than real, and that
> > a graceful, humble, polite attitude can have a positive-feedback
> > effect
> > too.
>
> Years ago I believed that, but I have seen much evidence to the
> contrary in this community. More often such an attitude is entirely
> ignored, or treated as an invitation for abuse, especially by people
> who have no interest in politeness. This kind of approach has no
> effect on the leaders in the Linux community, who set an example of
> extreme rudeness and belligerence.
>
> I've made an effort to stop arguing small points, and to make
> observations and not argue. I still get e-mail full of "crap" this
> and "bullshit" that and "NACK!" with little explanation.

As you said above, you've been part of the community for years. It is
not as if you haven't learned by now that a review might turn up issues
that may give rise to a NACK, and that you need to be open to changing
your code should this happen.
If you need more information about what needs to be changed, then you
know to ask.

That hasn't been your approach in this case, though, and the responses
you got were a direct consequence of that approach.
You tried reversing the burden of proof as to why we should change an
established interface instead of supplying adequate evidence justifying
that change.
When problems were pointed out to you (e.g. backward compatibility) your
response was to deny they existed instead of proposing a change to your
code.
Finally, you tried changing the thread into a discussion about mean rude
people obstructing you and failing to give you adequate guidance.



The problems with the code remain, and you will need to change it in
order to make it acceptable. The question I haven't seen you asking, and
that you should have be asking from the very start is "what would be the
minimal set of changes?".

I'm quite willing to discuss that with you.

Trond


2009-09-10 08:44:10

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:

> The idea that "the Linux way" is the best and only way is ridiculous
> on its face, anyway. I mean, what do you expect when we have no
> requirements and specification process, no formal testing, C coding
> style conventions based on 20-year old coding practices, a hit-or-miss
> review process that relies more on reviewers' personal preferences
> than any kind of standards, no static code analysis tools, no defect
> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
> combative mailing list environment, and parts of our knowledge base
> and team history are lost every time a developer leaves (in this case,
> Olaf and Neil)? It's no wonder we never change anything unless
> absolutely necessary!

And yet is largely works!
I could summarise a lot of your points by observing that the community
values people over process. I really think that is the right place to
put value, because people are richer and more flexible than process.

I agree that combative mailing lists are a problem, but even there, I
believe most of the aggression is more perceived than real, and that
a graceful, humble, polite attitude can have a positive-feedback effect
too.

Yes, there are lots of practices that might improve things that we don't
have standardised. But one practice we do have that has proven very
effective is incremental refinement. It can be hard to understand what
order to make changes until after you have made them, but once you
understand what you want to do, going back and doing it in logical
order really is very effective. It makes it easier for others to
review, it makes it easy for you to review yourself. It means
less controversial bits can be included quickly leaving room for the
more controversial bits to be discussed in isolation.

I think that the switch from portmap to rpcbind was a bad idea,
and I think that a wholesale replacement of statd is probably a
bad idea too. It might seem like the easiest way to get something
useful working, but you'll probably be paying the price for years as
little regression turn up because you didn't completely understand
the original statd (and face it, who does?)

As for the use of sql-lite ... I must admit that I wouldn't choose
it. Maybe it is a good idea. If it is, you probably need to merge
that change early with a clear argument and tools to make it manageable
(e.g. a developer will want to tool to be able to look inside the
database easily and make changes, without having to know sql).
It is much easier to discuss one thing at a time on these
combative mailing lists ;-)

NeilBrown

P.S.
And we do have static code analysis tools. Both 'gcc' and 'sparse'
fit that description.



2009-09-15 01:30:13

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 14, 2009, at 9:54 AM, Trond Myklebust wrote:
> On Thu, 2009-09-10 at 10:09 -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
>>> I agree that combative mailing lists are a problem, but even
>>> there, I
>>> believe most of the aggression is more perceived than real, and that
>>> a graceful, humble, polite attitude can have a positive-feedback
>>> effect
>>> too.
>>
>> Years ago I believed that, but I have seen much evidence to the
>> contrary in this community. More often such an attitude is entirely
>> ignored, or treated as an invitation for abuse, especially by people
>> who have no interest in politeness. This kind of approach has no
>> effect on the leaders in the Linux community, who set an example of
>> extreme rudeness and belligerence.
>>
>> I've made an effort to stop arguing small points, and to make
>> observations and not argue. I still get e-mail full of "crap" this
>> and "bullshit" that and "NACK!" with little explanation.
>
> As you said above, you've been part of the community for years. It is
> not as if you haven't learned by now that a review might turn up
> issues
> that may give rise to a NACK, and that you need to be open to changing
> your code should this happen.

The tone of your vetoes is unnecessarily aggressive, and often the
comments are entirely negative. A mention of the pieces of new statd
that you liked, for example, would have been welcome, and even useful
for this conversation.

Naturally, you are free to disagree with me as much as you like. But
I wish you could be more constructive about it. This is not a
contest... we're supposed to be working together to improve the code
base.

> If you need more information about what needs to be changed, then you
> know to ask.
>
> That hasn't been your approach in this case, though, and the responses
> you got were a direct consequence of that approach.
> You tried reversing the burden of proof as to why we should change an
> established interface instead of supplying adequate evidence
> justifying
> that change.

I think it's reasonable to ask for evidence on both sides.

> When problems were pointed out to you (e.g. backward compatibility)
> your
> response was to deny they existed instead of proposing a change to
> your
> code.

The problem you described seemed to stem from a basic disagreement
about how statd with IPv6 would be deployed -- an issue that neither
of us (being upstream developers and not distributors) can resolve.
Your objection still doesn't make sense to me. But see below.

> Finally, you tried changing the thread into a discussion about mean
> rude
> people obstructing you and failing to give you adequate guidance.

I was responding to Neil's comments, not changing the subject.

In any event, since all of the NFS maintainers have now passed their
judgement, it's clear that I will have to withdraw new statd, and
proceed with a re-write that uses the existing on-disk format. I've
never claimed that sqlite3 is a _requirement_ to solve these issues,
but only that some changes would be necessary to the on-disk format,
and that a database seemed an appropriate improvement.

No maintainers agree with that, so I will rework it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-09 22:18:42

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>> The old statd still exists in nfs-utils. The new statd is an
>> entirely
>> separate component. Distributions can continue to use the old statd
>> as long as they want. This is a red herring.
>
> Bullshit. If they are adding IPv6 support, then they will have to
> upgrade at some point.

I don't see a problem with a distribution upgrade using old statd and
a fresh install using new statd. You have to install a lot of new
components to get NFS/IPv6 support. It's not like the only thing that
needs to change is statd. People will install a new distribution to
get IPv6 support. With so many simple ways to install from scratch,
the days of someone upgrading just a few pieces of an old system to
get a new feature, especially one as extensive as NFS/IPv6, are long
gone.

I don't hear a lot of distributors objecting to this idea.

And you have never clearly answered why it wouldn't be enough to add a
little code to convert the current on-disk format to sqlite3 when
upgrading to the new statd, if upgradability is truly an important
requirement. Possibly this is because it eliminates the only real
technical objection you have to using sqlite3 here.

>>> Simplicity is another reason. WTF do we need a full SQL database,
>>> when
>>> all we want to do is store 2 pieces of data (a hostname and a
>>> cookie)?
>>> It isn't as if this has been a major problem for us previously.
>>
>> Because we are not storing just a hostname and a cookie. We are
>> storing several different data items for each host, and we need to
>> search over the records, and provide uniqueness constraints, and
>> handle data conversion (for binary data like the cookie, for string
>> data like the hostname, and for integers, like the prog/vers/proc
>> tuple). We need to store them durably on persistent storage to have
>> some protection against crashes. These are all things that an
>> embedded database can do well, and that we therefore don't have to
>> code ourselves.
>
> Speaking of red herrings. Why are we adding all this crap?
>
> This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM
> from
> scratch, just add minimal support for IPv6.

You and Bruce brought up a number of work items related to statd,
including having distinct statd behavior for remotes who are clients
and remotes who are servers. Tom Talpey suggested we needed to send
multiple SM_NOTIFY requests to each host, and use TCP to do it when
possible, and you even specifically encouraged me to read his
connectathon presentation on this. If Asian countries are driving the
IPv6 requirement, why wouldn't they want IDN support as well?
Interoperable NFS/IPv6 support requires TI-RPC. Plus, NFS/IPv6
practically requires multi-homed NLM/NSM support -- see Alex's RFC
draft for details on that.

Which would you like me to drop?

Let me also point out that old statd is already broken in a number of
ways, and I certainly haven't heard a lot of complaints about it. Our
client NLM has sent "0" as our NSM state number for years, for
example. Thus I hardly think there is a lot of risk in making changes
here. It can only get better.

>>>>>> In any event, it's not just sync(2) that is a problem. sync(2)
>>>>>> by
>>>>>> itself is a boot performance problem, but it's the combination of
>>>>>> rename and sync that is known to be especially unreliable during
>>>>>> system crashes. Statd, being a crash monitor, shouldn't depend
>>>>>> on
>>>>>> rename/sync to maintain persistent data in the face of system
>>>>>> instability. I'd call that a real reason to use something more
>>>>>> robust.
>>>>>
>>>>> What are you talking about? Is this about the truncate + rename
>>>>> issue
>>>>> leaving empty files upon a crash?
>>>>> That issue is solved trivially by doing an fsync() before you
>>>>> rename the
>>>>> file. That entire discussion was about whether or not existing
>>>>> applications should be _required_ to do this kind of POSIX
>>>>> pedantry,
>>>>> when previously they could get away without it.
>>>>>
>>>>> IOW: that issue alone does not justify replacing the current
>>>>> simple file
>>>>> based scheme.
>>>>>
>>>>
>>>> There are other reasons, not to use the simple file-based scheme
>>>> too...
>>>>
>>>> Internationalized domain names will be easier to deal with via
>>>> sqlite3,
>>>> for instance.
>>>
>>> Please explain...
>>
>> IPv6 is used in Asia, where they almost certainly need to use non-
>> ASCII characters in their hostnames. Internationalized domain names
>> are stored in double-wide character sets. To provide reliable
>> support
>> for IDNs in statd, we will have to guarantee somehow that we can
>> store
>> an IDN as a file name (if we want to stay with the current scheme),
>> no
>> matter what file system is used for /var.
>
> So, what's stopping us? These are POSIX filesystems. They can store
> any
> filename as long as it doesn't contain '/' or '\0'.

IDNs are UTF16. /var therefore has to support UTF16 filenames; either
byte in a double-byte character can be '/' or '\0'. That means the
underlying fs implementation has to support UTF16 (FAT32 anyone?), and
the system's locale has to be configured correctly. If we decide not
to depend on the file system to support UTF16 filenames, then statd
has to be intelligent enough to figure out how to deal with converting
UTF16 hostnames before storing them as filenames. Then, we have to
teach matchhostname() and friends how to deal with double-byte
character strings...

Or we just tell sqlite3 that this is a double-byte character string,
and let it handle the collation and on-disk storage details for us.

The point is, this is yet another detail we have to either worry about
and open code in statd, or we can simply rely on what's already
provided in sqlite3. No one, repeat NO ONE, is arguing that you can't
implement these features without sqlite3. My argument is that we
quickly bury a whole bunch of details if we use sqlite3, and can then
focus on larger issues. That's the prime goal of software layering
with libraries.

We can open code any or all of statd. In fact the current statd open
codes RPC request creation in socket buffers rather than using glibc's
RPC API, and I think we agree that is not an optimal solution. The
question is: should we duplicate code and bugs by open coding statd's
RPC and data storage? Or should we pretend to be modern software
engineers, and use widely used and known good code that other people
have written already to handle these details?

>> What's more, multi-homed host support will need to store multiple
>> records for the same hostname. The mon_name is the same, but my_name
>> is different, for each of these records. So we could do that by
>> adding more than one line in each hostname file, but it's also a
>> simple matter to set this up in SQL.
>>
>> When we want to have statd remember things like multiple addresses
>> for
>> the same hostname, or whether the remote is a client or server, we
>> will need to make more adjustments to the files.
>>
>> As we get more and more new requirements, why lock ourselves into the
>> current on-disk format? Using statd means we can store new fields
>> and
>> new records without any backwards-compatibility issues. It's all
>> handled by the database code. So, we can think about the high level
>> problem of getting statd to behave correctly rather than worry about
>> the details of exactly how we are going to get the next data item
>> stored in our current files in a backward compatible way.
>
> Again. This is a legacy filesystem. Why are we adding requirements?

Maybe you should ask the people who are requesting NFS/IPv6 from Red
Hat and other distributors. Or ask yourself why we would add an
engine to allow NFSv2 to shift authentication flavors without
remounting, since NFSv2 is a legacy filesystem. Or ask why you think
we should add support to statd to recognize the difference between
remote clients and servers, if this is a legacy filesystem.

I don't object to any of those work items, but I do have trouble with
you dropping in and saying "why diddle a legacy file system" when
clearly that is not a show stopper in these other cases.

>>>> Certainly we could code this up ourselves, but what's the benefit
>>>> to
>>>> doing that when we have a perfectly good data storage engine
>>>> available?
>>>
>>> Why change something that works???? Rewriting from scratch is _NOT_
>>> the
>>> Linux way, and has usually bitten us hard when we've done it.
>>
>> Because we are adding a bunch of new feature requirements.
>> Internationalized domain names, multi-homed host support, IPv6 and
>> TI-
>> RPC, fast boot times, keeping better track of remote host addresses,
>> keeping track of which remotes are clients and which are servers, and
>> support for sending notifications via TCP all require significant
>> modifications to this code base.
>>
>> At some point you have to look at the code you have, and decide it's
>> simply not going to be adequate, going forward.
>>
>>> The 2.6.19 rewrite of the kernel mount code springs to mind...
>>
>> One can just as easily argue that we've been bitten hard precisely
>> because we've let things rot, or because we have inadequate testing
>> for these components.
>>
>> Another red herring, and especially annoying because you've known I
>> was rewriting statd for months. Only now, when I'm done, do you say
>> "rewriting is not the Linux way."
>
> I have _NEVER_ agreed to a rewrite of the storage formats. You sprang
> this crap on me a month ago, and I made my feelings quite clear then.

"Rewriting is not the Linux way" is not the same as saying you don't
want to change storage formats. Don't change the subject.

The idea that "the Linux way" is the best and only way is ridiculous
on its face, anyway. I mean, what do you expect when we have no
requirements and specification process, no formal testing, C coding
style conventions based on 20-year old coding practices, a hit-or-miss
review process that relies more on reviewers' personal preferences
than any kind of standards, no static code analysis tools, no defect
metrics or bug meta-analysis tools, kernel debuggers are verboten, a
combative mailing list environment, and parts of our knowledge base
and team history are lost every time a developer leaves (in this case,
Olaf and Neil)? It's no wonder we never change anything unless
absolutely necessary!

You told me to implement IPv6 support in statd. Now you are spitting
on what I worked out without any guidance from you because you're too
busy working on IETF standards and NFSv4.1 to bother discussing
"legacy" code, other than to say "ewe!". Just how should I react to
that, pray tell?

Clearly you do not want to admit that even "minimal" IPv6 support is a
significant effort, especially given how far behind the Linux RPC and
NLM/NSM implementations are. Yelling at me, throwing a bunch of
generic objections up, and calling my code "crap" is not going to make
the problem any simpler.

Cooperating with me to get your way and giving specific and
constructive criticism will go a lot farther than your aggressive and
disrespectful attitude and obnoxious tone. I will happily sit down
and discuss this with you in a rational tone, but I will no longer
tolerate your unwarranted harassment on a public mailing list.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-10 21:26:46

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 4:49 PM, J. Bruce Fields wrote:
> On Thu, Sep 10, 2009 at 04:39:51PM -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
>>> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>>>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>>>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>>>>>> either
>>>>>> byte in a double-byte character can be '/' or '\0'. That means
>>>>>> the
>>>>>> underlying fs implementation has to support UTF16 (FAT32
>>>>>> anyone?),
>>>>>> and
>>>>>> the system's locale has to be configured correctly. If we decide
>>>>>> not to
>>>>>> depend on the file system to support UTF16 filenames, then statd
>>>>>> has
>>>>>> to
>>>>>> be intelligent enough to figure out how to deal with converting
>>>>>> UTF16
>>>>>> hostnames before storing them as filenames. Then, we have to
>>>>>> teach
>>>>>> matchhostname() and friends how to deal with double-byte
>>>>>> character
>>>>>> strings...
>>>>>
>>>>> Googling around.... Is this accurate?:
>>>>>
>>>>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>>>
>>>>> That makes it sound like domain names are staying ascii, and
>>>>> they're
>>>>> just adding something on top to allow encoding unicode using
>>>>> ascii,
>>>>> which may optionally be used by applications.
>>>>
>>>> There is a mechanism that provides an ASCII-ized version of domain
>>>> names
>>>> that may contain non-ASCII characters, expressly for applications
>>>> that
>>>> need to perform DNS queries but can't be easily converted to handle
>>>> double-byte character strings. This can be adapted for statd,
>>>> though I'm
>>>> not sure if the converted ASCII version of such names specifically
>>>> exclude '/'.
>>>>
>>>> Internationalized domain names themselves are still expressed in
>>>> UTF16,
>>>> as far as I understand it.
>>>
>>> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears
>>> to
>>> me that protocols (at the very least, any preexisting protocols) are
>>> all
>>> expected to use the ascii representation on the wire, and that the
>>> translation to unicode is meant by use for applications.
>>>
>>> So in our case we'd continue to expect ascii domain names on the
>>> wire,
>>> and I believe that's also what we should store in any database. But
>>> if
>>> someone were to write a gui administrative interface to that data,
>>> for
>>> example, they might choose to use idna for display.
>>
>> That's a reasonable and specific objection to my claim that our
>> current
>> host record storage format is inadequate to support IDNA. I've also
>> confirmed that ToAscii with the UseSTD3ASCIIRules flag set is not
>> supposed to generate a domain label string with a '/' in it. My
>> remaining concern here is that we could possibly see hostnames that
>> are
>> too long to be stored in directory entries of some file systems,
>> especially considering that the ASCII-fied Unicode names will be
>> longer
>> than typical ASCII names we normally encounter today.
>
> Googling around some more.... the normal limits for dns appear to be
> 63
> bytes per component, and 255 for the whole string, and those limits
> are
> still in force on the output of that mapping. I suspect this isn't a
> huge deal.

I bring this up because NI_MAXHOST, declared in /usr/include/netdb.h,
is 1025, not 255. You are probably correct, practically speaking.

>> What about multi-homed host support? The same mon_name can be used
>> with
>> more than one my_name, for multi-homed hosts. Using the current on-
>> disk
>> scheme, statd turns that SM_MON request into a no-op.
>
> I don't know what we're supposed to do in that case. You want to
> store
> them all so you can send notifies to them all on reboot?

Something like that. Basically I think we want to send SM_NOTIFY to
the monitored host _from_ every registered my_name we have for that
mon_name. The kernel will probably use a separate nsm_host for each
one of these, so statd should probably keep track of each of the
cookies as well.

Remembering which my_names have been used is important because the
remote often uses the sender's name or IP address to identify which
monitored host has rebooted.

We could throw up our hands and just keep track of all the my_names
that were used during the last reboot, and notify each mon_name from
all of those. That doesn't help with remembering the cookies, though,
and makes sending all notifications at reboot take longer.

>> So additional
>> records for the same hostname can't be stored, or we have to resort
>> to
>> adding multiple lines in the same file. This is possible to do
>> with just
>> POSIX file system calls, but it does add complexity to manage several
>> lines in each hostname file without increasing the risk of
>> corruption if
>> a file update (especially the deletion of one record in the middle)
>> is
>> interrupted.
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-09 19:14:04

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 09 Sep 2009 14:39:59 -0400
Trond Myklebust <[email protected]> wrote:

> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
> > On Wed, 05 Aug 2009 19:30:04 -0400
> > Trond Myklebust <[email protected]> wrote:
> >
> > > On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > > > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > > > >> sqlite3 doesn't do anything special under the covers. It uses only
> > > > >> POSIX file access and locking calls, as far as I know. So I think
> > > > >> hosting /var on most well-behaved clustering file systems won't have
> > > > >> any problem with this arrangement.
> > > > >
> > > > > So we're basically introducing a dependency on a completely new
> > > > > library
> > > > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > > > no real reason for doing it other than because we want to move from
> > > > > using sync() to fsync()?
> > > > >
> > > > > Sounds like a NACK to me...
> > > >
> > > > Which library are you talking about, libsqlite3 or libtirpc? Because
> > > > NEITHER of those is in /lib.
> > >
> > > libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
> > > established.
> > >
> >
> > Sorry to revive this so late, but I think we need to come to some
> > sort of resolution here. The only missing piece for client side IPv6
> > support is statd...
> >
> > I'm not sure I understand the objection to using libsqlite3 here. We
> > certainly could roll our own routines to handle data storage, but why
> > would we want to do so? sqlite3 is quite good at what it does. Why
> > wouldn't we want to use it?
>
> Backwards compatibility is one major reason. statd already exists, and
> is in use out there. I shouldn't be forced to reboot all my clients when
> I upgrade the nfs-utils package on my server.
>

We could roll a conversion utility for this if it would help. nfs-utils
upgrades usually mean restarting statd anyway:

shut down old-statd
convert flat file db to sqlite
start up new-statd

...so I don't think we necessarily need to reboot the clients for this.
It should (in theory) be possible to do the reverse even.

> Simplicity is another reason. WTF do we need a full SQL database, when
> all we want to do is store 2 pieces of data (a hostname and a cookie)?
> It isn't as if this has been a major problem for us previously.
>

Been a little while since I took my initial look at new-statd, but I
see that Chuck is has this (just a for-instance):

rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_MONITOR_TABLENAME
" (priv BLOB,"
" mon_name TEXT NOT NULL,"
" my_name TEXT NOT NULL,"
" program INTEGER,"
" version INTEGER,"
" procedure INTEGER,"
" protocol TEXT NOT NULL,"
" state INTEGER,"
" UNIQUE(mon_name,my_name));",
-1, &stmt, NULL);


He's tracking some other info there too. Is this all necessary? Maybe
not now, but having a storage engine that can cope with tracking extra
info will make it easier to handle things like multihomed clients and
servers correctly (something that the existing statd is not very good
at at the moment).

> > > > In any event, it's not just sync(2) that is a problem. sync(2) by
> > > > itself is a boot performance problem, but it's the combination of
> > > > rename and sync that is known to be especially unreliable during
> > > > system crashes. Statd, being a crash monitor, shouldn't depend on
> > > > rename/sync to maintain persistent data in the face of system
> > > > instability. I'd call that a real reason to use something more robust.
> > >
> > > What are you talking about? Is this about the truncate + rename issue
> > > leaving empty files upon a crash?
> > > That issue is solved trivially by doing an fsync() before you rename the
> > > file. That entire discussion was about whether or not existing
> > > applications should be _required_ to do this kind of POSIX pedantry,
> > > when previously they could get away without it.
> > >
> > > IOW: that issue alone does not justify replacing the current simple file
> > > based scheme.
> > >
> >
> > There are other reasons, not to use the simple file-based scheme too...
> >
> > Internationalized domain names will be easier to deal with via sqlite3,
> > for instance.
>
> Please explain...
>

Well, we currently store statd info in flat files named with the
hostname. With an internationalized domain name, we may have a
multibyte character in that name. We could try to store that as an
ASCII or UTF8 name, but we'd have to roll conversion routines for it.
Why bother when we have a storage engine that does that work for us?

> > Certainly we could code this up ourselves, but what's the benefit to
> > doing that when we have a perfectly good data storage engine available?
>
> Why change something that works???? Rewriting from scratch is _NOT_ the
> Linux way, and has usually bitten us hard when we've done it.
>
> The 2.6.19 rewrite of the kernel mount code springs to mind...
>

A good point. Given who I work for, I *really* hate regressions since
they tend to mean a lot of my time tends to get eaten up.

At some point however it becomes very difficult to patch up old code.
old-statd in particular hasn't seen the attention that it probably
should have.

I trust that Chuck will be willing to fix problems that come along. The
new code seems to be on par with the old in complexity so I don't think
we'd be taking on a great maintenance burden with it even if Chuck
isn't available.

Could we add IPv6 support as a patchset to the existing statd instead?
Sure. That patch would be smaller than Chuck's rewrite, but I still
think there are advantages to considering a major overhaul here.

--
Jeff Layton <[email protected]>

2009-09-10 20:50:02

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, Sep 10, 2009 at 04:39:51PM -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
>> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>>>>> either
>>>>> byte in a double-byte character can be '/' or '\0'. That means the
>>>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>>>>> and
>>>>> the system's locale has to be configured correctly. If we decide
>>>>> not to
>>>>> depend on the file system to support UTF16 filenames, then statd
>>>>> has
>>>>> to
>>>>> be intelligent enough to figure out how to deal with converting
>>>>> UTF16
>>>>> hostnames before storing them as filenames. Then, we have to teach
>>>>> matchhostname() and friends how to deal with double-byte character
>>>>> strings...
>>>>
>>>> Googling around.... Is this accurate?:
>>>>
>>>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>>
>>>> That makes it sound like domain names are staying ascii, and they're
>>>> just adding something on top to allow encoding unicode using ascii,
>>>> which may optionally be used by applications.
>>>
>>> There is a mechanism that provides an ASCII-ized version of domain
>>> names
>>> that may contain non-ASCII characters, expressly for applications
>>> that
>>> need to perform DNS queries but can't be easily converted to handle
>>> double-byte character strings. This can be adapted for statd,
>>> though I'm
>>> not sure if the converted ASCII version of such names specifically
>>> exclude '/'.
>>>
>>> Internationalized domain names themselves are still expressed in
>>> UTF16,
>>> as far as I understand it.
>>
>> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears
>> to
>> me that protocols (at the very least, any preexisting protocols) are
>> all
>> expected to use the ascii representation on the wire, and that the
>> translation to unicode is meant by use for applications.
>>
>> So in our case we'd continue to expect ascii domain names on the wire,
>> and I believe that's also what we should store in any database. But
>> if
>> someone were to write a gui administrative interface to that data, for
>> example, they might choose to use idna for display.
>
> That's a reasonable and specific objection to my claim that our current
> host record storage format is inadequate to support IDNA. I've also
> confirmed that ToAscii with the UseSTD3ASCIIRules flag set is not
> supposed to generate a domain label string with a '/' in it. My
> remaining concern here is that we could possibly see hostnames that are
> too long to be stored in directory entries of some file systems,
> especially considering that the ASCII-fied Unicode names will be longer
> than typical ASCII names we normally encounter today.

Googling around some more.... the normal limits for dns appear to be 63
bytes per component, and 255 for the whole string, and those limits are
still in force on the output of that mapping. I suspect this isn't a
huge deal.

> What about multi-homed host support? The same mon_name can be used with
> more than one my_name, for multi-homed hosts. Using the current on-disk
> scheme, statd turns that SM_MON request into a no-op.

I don't know what we're supposed to do in that case. You want to store
them all so you can send notifies to them all on reboot?

--b.

> So additional
> records for the same hostname can't be stored, or we have to resort to
> adding multiple lines in the same file. This is possible to do with just
> POSIX file system calls, but it does add complexity to manage several
> lines in each hostname file without increasing the risk of corruption if
> a file update (especially the deletion of one record in the middle) is
> interrupted.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>

2009-09-14 15:48:52

by Steve Dickson

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)



On 09/14/2009 03:08 AM, Neil Brown wrote:
>
> Some time ago Greg Banks went on a pursuit of warnings in nfs-utils
> and got rid of all of them -- except those generated by rpcgen.
> Have more been introduced?
The only other warnings I'm struggling with are the
warning: dereferencing pointer 'local_addr' does break strict-aliasing rules

in sm-notify.c, which seem to be a bunch of noise... But I will like
to get ride of them... Other than that, there are no warnings that
I know of...

steved.


2009-09-14 15:55:52

by Steve Dickson

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

> The problems with the code remain, and you will need to change it in
> order to make it acceptable. The question I haven't seen you asking, and
> that you should have be asking from the very start is "what would be the
> minimal set of changes?".

I too would like to see what are minimal change set would be... Just
to see how much a db is really needed... If at all..

Now with that said... I've taken on the task of how to deal with
pNFS exports and it might make sense that a db is need to deal with
that complexity... So I'm not totally against the idea of introducing
a db to nfs-utils I just think we need to do it for the right reason....

steved.


2009-09-09 18:40:07

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
> On Wed, 05 Aug 2009 19:30:04 -0400
> Trond Myklebust <[email protected]> wrote:
>
> > On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > > >> sqlite3 doesn't do anything special under the covers. It uses only
> > > >> POSIX file access and locking calls, as far as I know. So I think
> > > >> hosting /var on most well-behaved clustering file systems won't have
> > > >> any problem with this arrangement.
> > > >
> > > > So we're basically introducing a dependency on a completely new
> > > > library
> > > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > > no real reason for doing it other than because we want to move from
> > > > using sync() to fsync()?
> > > >
> > > > Sounds like a NACK to me...
> > >
> > > Which library are you talking about, libsqlite3 or libtirpc? Because
> > > NEITHER of those is in /lib.
> >
> > libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
> > established.
> >
>
> Sorry to revive this so late, but I think we need to come to some
> sort of resolution here. The only missing piece for client side IPv6
> support is statd...
>
> I'm not sure I understand the objection to using libsqlite3 here. We
> certainly could roll our own routines to handle data storage, but why
> would we want to do so? sqlite3 is quite good at what it does. Why
> wouldn't we want to use it?

Backwards compatibility is one major reason. statd already exists, and
is in use out there. I shouldn't be forced to reboot all my clients when
I upgrade the nfs-utils package on my server.

Simplicity is another reason. WTF do we need a full SQL database, when
all we want to do is store 2 pieces of data (a hostname and a cookie)?
It isn't as if this has been a major problem for us previously.

> > > In any event, it's not just sync(2) that is a problem. sync(2) by
> > > itself is a boot performance problem, but it's the combination of
> > > rename and sync that is known to be especially unreliable during
> > > system crashes. Statd, being a crash monitor, shouldn't depend on
> > > rename/sync to maintain persistent data in the face of system
> > > instability. I'd call that a real reason to use something more robust.
> >
> > What are you talking about? Is this about the truncate + rename issue
> > leaving empty files upon a crash?
> > That issue is solved trivially by doing an fsync() before you rename the
> > file. That entire discussion was about whether or not existing
> > applications should be _required_ to do this kind of POSIX pedantry,
> > when previously they could get away without it.
> >
> > IOW: that issue alone does not justify replacing the current simple file
> > based scheme.
> >
>
> There are other reasons, not to use the simple file-based scheme too...
>
> Internationalized domain names will be easier to deal with via sqlite3,
> for instance.

Please explain...

> Certainly we could code this up ourselves, but what's the benefit to
> doing that when we have a perfectly good data storage engine available?

Why change something that works???? Rewriting from scratch is _NOT_ the
Linux way, and has usually bitten us hard when we've done it.

The 2.6.19 rewrite of the kernel mount code springs to mind...

Trond


2009-09-09 19:19:31

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)


On Sep 9, 2009, at 3:13 PM, Jeff Layton wrote:

> On Wed, 09 Sep 2009 14:39:59 -0400
> Trond Myklebust <[email protected]> wrote:
>
>> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
>>> On Wed, 05 Aug 2009 19:30:04 -0400
>>> Trond Myklebust <[email protected]> wrote:
>>>
>>>> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
>>>>> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
>>>>>> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>>>>>>> sqlite3 doesn't do anything special under the covers. It uses
>>>>>>> only
>>>>>>> POSIX file access and locking calls, as far as I know. So I
>>>>>>> think
>>>>>>> hosting /var on most well-behaved clustering file systems
>>>>>>> won't have
>>>>>>> any problem with this arrangement.
>>>>>>
>>>>>> So we're basically introducing a dependency on a completely new
>>>>>> library
>>>>>> that will have to be added to boot partitions/nfsroot/etc, and
>>>>>> we have
>>>>>> no real reason for doing it other than because we want to move
>>>>>> from
>>>>>> using sync() to fsync()?
>>>>>>
>>>>>> Sounds like a NACK to me...
>>>>>
>>>>> Which library are you talking about, libsqlite3 or libtirpc?
>>>>> Because
>>>>> NEITHER of those is in /lib.
>>>>
>>>> libsqlite is the problem. Unlike libtirpc, it's utility has yet
>>>> to be
>>>> established.
>>>>
>>>
>>> Sorry to revive this so late, but I think we need to come to some
>>> sort of resolution here. The only missing piece for client side IPv6
>>> support is statd...
>>>
>>> I'm not sure I understand the objection to using libsqlite3 here. We
>>> certainly could roll our own routines to handle data storage, but
>>> why
>>> would we want to do so? sqlite3 is quite good at what it does. Why
>>> wouldn't we want to use it?
>>
>> Backwards compatibility is one major reason. statd already exists,
>> and
>> is in use out there. I shouldn't be forced to reboot all my clients
>> when
>> I upgrade the nfs-utils package on my server.
>>
>
> We could roll a conversion utility for this if it would help. nfs-
> utils
> upgrades usually mean restarting statd anyway:
>
> shut down old-statd
> convert flat file db to sqlite
> start up new-statd
>
> ...so I don't think we necessarily need to reboot the clients for
> this.
> It should (in theory) be possible to do the reverse even.
>
>> Simplicity is another reason. WTF do we need a full SQL database,
>> when
>> all we want to do is store 2 pieces of data (a hostname and a
>> cookie)?
>> It isn't as if this has been a major problem for us previously.
>>
>
> Been a little while since I took my initial look at new-statd, but I
> see that Chuck is has this (just a for-instance):
>
> rc = sqlite3_prepare_v2(db, "CREATE TABLE "
> STATD_MONITOR_TABLENAME
> " (priv BLOB,"
> " mon_name TEXT NOT NULL,"
> " my_name TEXT NOT NULL,"
> " program INTEGER,"
> " version INTEGER,"
> " procedure INTEGER,"
> " protocol TEXT NOT NULL,"
> " state INTEGER,"
> " UNIQUE(mon_name,my_name));",
> -1, &stmt, NULL);
>
>
> He's tracking some other info there too. Is this all necessary? Maybe
> not now, but having a storage engine that can cope with tracking extra
> info will make it easier to handle things like multihomed clients and
> servers correctly (something that the existing statd is not very good
> at at the moment).
>
>>>>> In any event, it's not just sync(2) that is a problem. sync(2) by
>>>>> itself is a boot performance problem, but it's the combination of
>>>>> rename and sync that is known to be especially unreliable during
>>>>> system crashes. Statd, being a crash monitor, shouldn't depend on
>>>>> rename/sync to maintain persistent data in the face of system
>>>>> instability. I'd call that a real reason to use something more
>>>>> robust.
>>>>
>>>> What are you talking about? Is this about the truncate + rename
>>>> issue
>>>> leaving empty files upon a crash?
>>>> That issue is solved trivially by doing an fsync() before you
>>>> rename the
>>>> file. That entire discussion was about whether or not existing
>>>> applications should be _required_ to do this kind of POSIX
>>>> pedantry,
>>>> when previously they could get away without it.
>>>>
>>>> IOW: that issue alone does not justify replacing the current
>>>> simple file
>>>> based scheme.
>>>>
>>>
>>> There are other reasons, not to use the simple file-based scheme
>>> too...
>>>
>>> Internationalized domain names will be easier to deal with via
>>> sqlite3,
>>> for instance.
>>
>> Please explain...
>>
>
> Well, we currently store statd info in flat files named with the
> hostname. With an internationalized domain name, we may have a
> multibyte character in that name. We could try to store that as an
> ASCII or UTF8 name, but we'd have to roll conversion routines for it.
> Why bother when we have a storage engine that does that work for us?
>
>>> Certainly we could code this up ourselves, but what's the benefit to
>>> doing that when we have a perfectly good data storage engine
>>> available?
>>
>> Why change something that works???? Rewriting from scratch is _NOT_
>> the
>> Linux way, and has usually bitten us hard when we've done it.
>>
>> The 2.6.19 rewrite of the kernel mount code springs to mind...
>>
>
> A good point. Given who I work for, I *really* hate regressions since
> they tend to mean a lot of my time tends to get eaten up.
>
> At some point however it becomes very difficult to patch up old code.
> old-statd in particular hasn't seen the attention that it probably
> should have.
>
> I trust that Chuck will be willing to fix problems that come along.
> The
> new code seems to be on par with the old in complexity so I don't
> think
> we'd be taking on a great maintenance burden with it even if Chuck
> isn't available.
>
> Could we add IPv6 support as a patchset to the existing statd instead?
> Sure.

Not as easy as you might think.

The old statd open codes a bunch of the RPC client side, which means
all of that would have to be rewritten anyway.

> That patch would be smaller than Chuck's rewrite, but I still
> think there are advantages to considering a major overhaul here.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-10 15:03:23

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
> byte in a double-byte character can be '/' or '\0'. That means the
> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
> the system's locale has to be configured correctly. If we decide not to
> depend on the file system to support UTF16 filenames, then statd has to
> be intelligent enough to figure out how to deal with converting UTF16
> hostnames before storing them as filenames. Then, we have to teach
> matchhostname() and friends how to deal with double-byte character
> strings...

Googling around.... Is this accurate?:

http://en.wikipedia.org/wiki/Internationalized_domain_name

That makes it sound like domain names are staying ascii, and they're
just adding something on top to allow encoding unicode using ascii,
which may optionally be used by applications.

--b.

2009-09-10 15:05:16

by Ben Myers

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, Sep 10, 2009 at 06:44:00PM +1000, NeilBrown wrote:
> I could summarise a lot of your points by observing that the community
> values people over process. I really think that is the right place to
> put value, because people are richer and more flexible than process.

What a refreshing point of view. I hope it catches on.

> As for the use of sql-lite ... I must admit that I wouldn't choose
> it.

A number of other projects support multiple backends. Maybe that makes
sense here.

-Ben

2009-09-10 16:14:57

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>> either
>> byte in a double-byte character can be '/' or '\0'. That means the
>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>> and
>> the system's locale has to be configured correctly. If we decide
>> not to
>> depend on the file system to support UTF16 filenames, then statd
>> has to
>> be intelligent enough to figure out how to deal with converting UTF16
>> hostnames before storing them as filenames. Then, we have to teach
>> matchhostname() and friends how to deal with double-byte character
>> strings...
>
> Googling around.... Is this accurate?:
>
> http://en.wikipedia.org/wiki/Internationalized_domain_name
>
> That makes it sound like domain names are staying ascii, and they're
> just adding something on top to allow encoding unicode using ascii,
> which may optionally be used by applications.

There is a mechanism that provides an ASCII-ized version of domain
names that may contain non-ASCII characters, expressly for
applications that need to perform DNS queries but can't be easily
converted to handle double-byte character strings. This can be
adapted for statd, though I'm not sure if the converted ASCII version
of such names specifically exclude '/'.

Internationalized domain names themselves are still expressed in
UTF16, as far as I understand it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2009-09-10 16:23:30

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>>> either
>>> byte in a double-byte character can be '/' or '\0'. That means the
>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>>> and
>>> the system's locale has to be configured correctly. If we decide
>>> not to
>>> depend on the file system to support UTF16 filenames, then statd has
>>> to
>>> be intelligent enough to figure out how to deal with converting UTF16
>>> hostnames before storing them as filenames. Then, we have to teach
>>> matchhostname() and friends how to deal with double-byte character
>>> strings...
>>
>> Googling around.... Is this accurate?:
>>
>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>
>> That makes it sound like domain names are staying ascii, and they're
>> just adding something on top to allow encoding unicode using ascii,
>> which may optionally be used by applications.
>
> There is a mechanism that provides an ASCII-ized version of domain names
> that may contain non-ASCII characters, expressly for applications that
> need to perform DNS queries but can't be easily converted to handle
> double-byte character strings. This can be adapted for statd, though I'm
> not sure if the converted ASCII version of such names specifically
> exclude '/'.
>
> Internationalized domain names themselves are still expressed in UTF16,
> as far as I understand it.

>From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears to
me that protocols (at the very least, any preexisting protocols) are all
expected to use the ascii representation on the wire, and that the
translation to unicode is meant by use for applications.

So in our case we'd continue to expect ascii domain names on the wire,
and I believe that's also what we should store in any database. But if
someone were to write a gui administrative interface to that data, for
example, they might choose to use idna for display.

--b.

2009-09-10 16:43:43

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, 2009-09-10 at 12:23 -0400, J. Bruce Fields wrote:
> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears to
> me that protocols (at the very least, any preexisting protocols) are all
> expected to use the ascii representation on the wire, and that the
> translation to unicode is meant by use for applications.
>
> So in our case we'd continue to expect ascii domain names on the wire,
> and I believe that's also what we should store in any database. But if
> someone were to write a gui administrative interface to that data, for
> example, they might choose to use idna for display.

Right. There should be no need to translate anything. Just store as is
and use as is...

Trond