LinuxLists.cc - [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Aug 5, 2009, at 2:15 PM, J. Bruce Fields wrote:

> On Wed, Aug 05, 2009 at 02:05:44PM -0400, Chuck Lever wrote:
>> On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
>>> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>>>> Provide a new implementation of statd that supports IPv6. The new
>>>> statd implementation resides under
>>>>
>>>> utils/new-statd/
>>>>
>>>> The contents of this directory are built if --enable-tirpc is set
>>>> on the ./configure command line, and sqlite3 is available on the
>>>> build system. Otherwise, the legacy version of statd, which still
>>>> resides under utils/statd/, is built.
>>>>
>>>> The goals of this re-write are:
>>>>
>>>> o Support IPv6 networking
>>>>
>>>> Support interoperation with TI-RPC-based NSM implementations.
>>>> Transport Independent RPC, or TI-RPC, provides IPv6 network support
>>>> for Linux's NSM implementation.
>>>>
>>>> To support TI-RPC, open code to construct RPC requests in socket
>>>> buffers and then schedule them has been replaced with standard
>>>> library calls.
>>>>
>>>> o Support notification via TCP
>>>>
>>>> As a secondary benefit of using TI-RPC library calls, reboot
>>>> notifications and NLM callbacks can now be sent via connection-
>>>> oriented transport protocols.
>>>>
>>>> Note that lockd does not (yet) tell statd what transport protocol
>>>> to use when sending reboot notifications. statd/sm-notify will
>>>> continue to use UDP for the time being.
>>>>
>>>> o Use an embedded database for storing on-disk callback data
>>>>
>>>> This whole exercise is for the purpose of crash robustness. There
>>>> are well-known deficiencies with simple create/rename/unlink
>>>> disk storage schemes during system crashes. Replace the current
>>>> flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>>> which uses fsync(2).
>>>
>>> If someone wants to move around that data, is it still simple to do
>>> that? (Where is it kept on the filesystem?)
>>>
>>> (I'm thinking of someone that shares it for high-availability, as in:
>>>
>>> http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>>>
>>> Or maybe somebody that just needs to move their /var partition to a
>>> different disk one day.)
>>
>> Statd's monitor lists and state number are stored in a single regular
>> file, /var/lib/nfs/statd/statdb by default. This file can be easily
>> backed up, or used on other systems, if desired. I would recommend
>> ensuring the NSM state number is reset in the latter case, which can
>> be done with the sqlite3 command.
>>
>> I've had some dialog with Lon Hohberger about clustering
>> requirements. I think we are looking at crafting a separate utility
>> that uses sqlite3 C function calls to extract data that's interesting
>> to the clustering implementation. Again, this could even be scripted
>> with bash and the sqlite3 command, but perhaps a C program is more
>> maintainable.
>
> OK, good.
>
> And for the simplest cases, it should still be enough to just copy
> /var/lib/nfs/, right?

I don't see why that wouldn't work, as long as statd/sm-notify aren't
updating the database at that moment. For safety I think there is an
sqlite3 backup mechanism for database files that respects the
library's locking semantics.
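
The backup mechanism referred to here is presumably SQLite's online backup API (sqlite3_backup_init() and friends in C). As a minimal sketch, Python's standard sqlite3 module exposes that API as Connection.backup(). The table and column names below are made up for illustration and are not statd's actual schema.

```python
import os
import sqlite3
import tempfile


def backup_database(src_path, dst_path):
    """Copy a live SQLite database using the online backup API.

    Unlike a plain file copy, this goes through SQLite's own locking,
    so a concurrent writer cannot leave the copy half-updated.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


# Demonstration against a throwaway database (hypothetical schema).
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "statdb")
copy = os.path.join(workdir, "statdb.bak")

conn = sqlite3.connect(live)
conn.execute("CREATE TABLE monitor (mon_name TEXT, my_name TEXT)")
conn.execute(
    "INSERT INTO monitor VALUES ('client.example.net', 'server.example.net')"
)
conn.commit()
conn.close()

backup_database(live, copy)
```

The copy can then be moved to another system without stopping the writers first.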

sqlite3 doesn't do anything special under the covers. It uses only
POSIX file access and locking calls, as far as I know. So I think
hosting /var on most well-behaved clustering file systems won't have
any problem with this arrangement.

One (admittedly minor) reason I did this is so we have some sample
code to try for other NFS-related daemons that need to store
information in /var robustly, potentially in clustered environments.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-08-05 18:06:18

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>> Provide a new implementation of statd that supports IPv6. The new
>> statd implementation resides under
>>
>> utils/new-statd/
>>
>> The contents of this directory are built if --enable-tirpc is set
>> on the ./configure command line, and sqlite3 is available on the
>> build system. Otherwise, the legacy version of statd, which still
>> resides under utils/statd/, is built.
>>
>> The goals of this re-write are:
>>
>> o Support IPv6 networking
>>
>> Support interoperation with TI-RPC-based NSM implementations.
>> Transport Independent RPC, or TI-RPC, provides IPv6 network support
>> for Linux's NSM implementation.
>>
>> To support TI-RPC, open code to construct RPC requests in socket
>> buffers and then schedule them has been replaced with standard
>> library calls.
>>
>> o Support notification via TCP
>>
>> As a secondary benefit of using TI-RPC library calls, reboot
>> notifications and NLM callbacks can now be sent via connection-
>> oriented transport protocols.
>>
>> Note that lockd does not (yet) tell statd what transport protocol
>> to use when sending reboot notifications. statd/sm-notify will
>> continue to use UDP for the time being.
>>
>> o Use an embedded database for storing on-disk callback data
>>
>> This whole exercise is for the purpose of crash robustness. There
>> are well-known deficiencies with simple create/rename/unlink
>> disk storage schemes during system crashes. Replace the current
>> flat-file monitor list mechanism which uses sync(2) with sqlite3,
>> which uses fsync(2).
>
> If someone wants to move around that data, is it still simple to do
> that? (Where is it kept on the filesystem?)
>
> (I'm thinking of someone that shares it for high-availability, as in:
>
> http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>
> Or maybe somebody that just needs to move their /var partition to a
> different disk one day.)

Statd's monitor lists and state number are stored in a single regular
file, /var/lib/nfs/statd/statdb by default. This file can be easily
backed up, or used on other systems, if desired. I would recommend
ensuring the NSM state number is reset in the latter case, which can
be done with the sqlite3 command.

I've had some dialog with Lon Hohberger about clustering
requirements. I think we are looking at crafting a separate utility
that uses sqlite3 C function calls to extract data that's interesting
to the clustering implementation. Again, this could even be scripted
with bash and the sqlite3 command, but perhaps a C program is more
maintainable.

>> o Share code between sm-notify and statd
>>
>> Statd and sm-notify access the same set of on-disk data. These
>> separate programs now share the same code and implementation, with
>> access to on-disk data serialized by sqlite3. The two remain
>> separate executables to allow other system facilities to send
>> reboot notifications without poking statd.
>>
>> o Reduce impact of DNS outages
>>
>> The heuristics used by SM_NOTIFY to figure out which remote peer
>> has rebooted are heavily dependent on DNS. If the DNS service is
>> slow or hangs, that will make the NSM listener unresponsive.
>> Incoming SM_NOTIFY requests are now handled in a sidecar process
>> to reduce the impact of DNS outages on the NSM service listener.
>>
>> o Proper my_name support
>>
>> The current version of statd uses gethostname(3) to generate the
>> mon_name argument of SM_NOTIFY. This value can change across a
>> reboot. The new version of statd records lockd's my_name, passed
>> by every SM_MON request, and uses that when sending SM_NOTIFY.
>>
>> This can be useful for multi-homed and DHCP configured hosts.
>>
>> o Send SM_NOTIFY more aggressively
>>
>> It has been recommended that statd/sm-notify send SM_NOTIFY
>> more aggressively (for example, to the entire list returned by
>> getaddrinfo(3)). Since SM_NOTIFY's reply is NULL, there's no
>> way to tell whether the remote peer recognized the mon_name we
>> sent. More study is required, but this implementation attempts
>> to send an SM_NOTIFY request to each address returned by
>> getaddrinfo(3).
>>
>> This re-implementation paves the way for a number of future
>> improvements. However, it does not immediately address:
>>
>> o lockd/statd start-up serialization issues
>>
>> Sending reboot notifications, starting statd and lockd, and opening
>> the lockd grace period are still determined independently in user
>> space and the kernel.
>>
>> o Binding mon_names to caller IP addresses
>>
>> By default, lockd continues to send IP addresses as the mon_name
>> argument of the SM_MON procedure. This provides a better guarantee
>> of being able to contact remote peers during a reboot, but means
>> statd must continue to use heuristics to match incoming SM_NOTIFY
>> requests with peers on the monitor list.
>>
>> o Distinct logic for NFS client- and server-side
>>
>> Client-side and server-side monitoring requirements are different.
>> Statd continues to use the same logic for both NFS client and
>> server, as the NSMv1 protocol does not provide any indication
>> that a mon_name is for a client or server peer.
>
> Note we probably don't need to be limited by the protocol here, only by
> kernel backwards-compatibility requirements, as long as this is just
> kernel<->statd communication and not something that goes across the wire
> to other statd implementations.

Agreed.

It would be possible to export the kernel's NSM host cache via sysfs,
for instance. An SM_MON upcall could cause statd to look in /sys for
more information like whether the remote peer is a client or server,
and what transport protocol and what network address the caller used
to contact the local host. This kind of scheme would work well for
both old kernels running the new statd, and new kernels running the
old statd.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-08-05 17:48:15

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
> Provide a new implementation of statd that supports IPv6. The new
> statd implementation resides under
>
> utils/new-statd/
>
> The contents of this directory are built if --enable-tirpc is set
> on the ./configure command line, and sqlite3 is available on the
> build system. Otherwise, the legacy version of statd, which still
> resides under utils/statd/, is built.
>
> The goals of this re-write are:
>
> o Support IPv6 networking
>
> Support interoperation with TI-RPC-based NSM implementations.
> Transport Independent RPC, or TI-RPC, provides IPv6 network support
> for Linux's NSM implementation.
>
> To support TI-RPC, open code to construct RPC requests in socket
> buffers and then schedule them has been replaced with standard
> library calls.
>
> o Support notification via TCP
>
> As a secondary benefit of using TI-RPC library calls, reboot
> notifications and NLM callbacks can now be sent via connection-
> oriented transport protocols.
>
> Note that lockd does not (yet) tell statd what transport protocol
> to use when sending reboot notifications. statd/sm-notify will
> continue to use UDP for the time being.
>
> o Use an embedded database for storing on-disk callback data
>
> This whole exercise is for the purpose of crash robustness. There
> are well-known deficiencies with simple create/rename/unlink
> disk storage schemes during system crashes. Replace the current
> flat-file monitor list mechanism which uses sync(2) with sqlite3,
> which uses fsync(2).

If someone wants to move around that data, is it still simple to do
that? (Where is it kept on the filesystem?)

(I'm thinking of someone that shares it for high-availability, as in:

http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3

Or maybe somebody that just needs to move their /var partition to a
different disk one day.)

> o Share code between sm-notify and statd
>
> Statd and sm-notify access the same set of on-disk data. These
> separate programs now share the same code and implementation, with
> access to on-disk data serialized by sqlite3. The two remain
> separate executables to allow other system facilities to send
> reboot notifications without poking statd.
>
> o Reduce impact of DNS outages
>
> The heuristics used by SM_NOTIFY to figure out which remote peer
> has rebooted are heavily dependent on DNS. If the DNS service is
> slow or hangs, that will make the NSM listener unresponsive.
> Incoming SM_NOTIFY requests are now handled in a sidecar process
> to reduce the impact of DNS outages on the NSM service listener.
>
> o Proper my_name support
>
> The current version of statd uses gethostname(3) to generate the
> mon_name argument of SM_NOTIFY. This value can change across a
> reboot. The new version of statd records lockd's my_name, passed
> by every SM_MON request, and uses that when sending SM_NOTIFY.
>
> This can be useful for multi-homed and DHCP configured hosts.
>
> o Send SM_NOTIFY more aggressively
>
> It has been recommended that statd/sm-notify send SM_NOTIFY
> more aggressively (for example, to the entire list returned by
> getaddrinfo(3)). Since SM_NOTIFY's reply is NULL, there's no
> way to tell whether the remote peer recognized the mon_name we
> sent. More study is required, but this implementation attempts
> to send an SM_NOTIFY request to each address returned by
> getaddrinfo(3).
>
> This re-implementation paves the way for a number of future
> improvements. However, it does not immediately address:
>
> o lockd/statd start-up serialization issues
>
> Sending reboot notifications, starting statd and lockd, and opening
> the lockd grace period are still determined independently in user
> space and the kernel.
>
> o Binding mon_names to caller IP addresses
>
> By default, lockd continues to send IP addresses as the mon_name
> argument of the SM_MON procedure. This provides a better guarantee
> of being able to contact remote peers during a reboot, but means
> statd must continue to use heuristics to match incoming SM_NOTIFY
> requests with peers on the monitor list.
>
> o Distinct logic for NFS client- and server-side
>
> Client-side and server-side monitoring requirements are different.
> Statd continues to use the same logic for both NFS client and
> server, as the NSMv1 protocol does not provide any indication
> that a mon_name is for a client or server peer.

Note we probably don't need to be limited by the protocol here, only by
kernel backwards-compatibility requirements, as long as this is just
kernel<->statd communication and not something that goes across the wire
to other statd implementations.
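
For the "Send SM_NOTIFY more aggressively" goal quoted above, walking the whole getaddrinfo(3) result list might look roughly like the sketch below. It uses Python's socket.getaddrinfo() wrapper; notify_targets is a hypothetical helper name, and the actual SM_NOTIFY RPC is not shown.

```python
import socket


def notify_targets(hostname, port=0):
    """Return each distinct (family, address) getaddrinfo() reports.

    A sender would issue one SM_NOTIFY request per entry instead of
    stopping at the first address.
    """
    seen = []
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        hostname, port, type=socket.SOCK_DGRAM
    ):
        target = (family, sockaddr[0])
        if target not in seen:
            seen.append(target)
    return seen
```

For a dual-stack peer the list would typically contain both an AF_INET and an AF_INET6 entry.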

--b.

2009-08-05 18:15:47

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, Aug 05, 2009 at 02:05:44PM -0400, Chuck Lever wrote:
> On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
>> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>>> Provide a new implementation of statd that supports IPv6. The new
>>> statd implementation resides under
>>>
>>> utils/new-statd/
>>>
>>> The contents of this directory are built if --enable-tirpc is set
>>> on the ./configure command line, and sqlite3 is available on the
>>> build system. Otherwise, the legacy version of statd, which still
>>> resides under utils/statd/, is built.
>>>
>>> The goals of this re-write are:
>>>
>>> o Support IPv6 networking
>>>
>>> Support interoperation with TI-RPC-based NSM implementations.
>>> Transport Independent RPC, or TI-RPC, provides IPv6 network support
>>> for Linux's NSM implementation.
>>>
>>> To support TI-RPC, open code to construct RPC requests in socket
>>> buffers and then schedule them has been replaced with standard
>>> library calls.
>>>
>>> o Support notification via TCP
>>>
>>> As a secondary benefit of using TI-RPC library calls, reboot
>>> notifications and NLM callbacks can now be sent via connection-
>>> oriented transport protocols.
>>>
>>> Note that lockd does not (yet) tell statd what transport protocol
>>> to use when sending reboot notifications. statd/sm-notify will
>>> continue to use UDP for the time being.
>>>
>>> o Use an embedded database for storing on-disk callback data
>>>
>>> This whole exercise is for the purpose of crash robustness. There
>>> are well-known deficiencies with simple create/rename/unlink
>>> disk storage schemes during system crashes. Replace the current
>>> flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>> which uses fsync(2).
>>
>> If someone wants to move around that data, is it still simple to do
>> that? (Where is it kept on the filesystem?)
>>
>> (I'm thinking of someone that shares it for high-availability, as in:
>>
>> http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>>
>> Or maybe somebody that just needs to move their /var partition to a
>> different disk one day.)
>
> Statd's monitor lists and state number are stored in a single regular
> file, /var/lib/nfs/statd/statdb by default. This file can be easily
> backed up, or used on other systems, if desired. I would recommend
> ensuring the NSM state number is reset in the latter case, which can be
> done with the sqlite3 command.
>
> I've had some dialog with Lon Hohberger about clustering requirements. I
> think we are looking at crafting a separate utility that uses sqlite3 C
> function calls to extract data that's interesting to the clustering
> implementation. Again, this could even be scripted with bash and the
> sqlite3 command, but perhaps a C program is more maintainable.

OK, good.

And for the simplest cases, it should still be enough to just copy
/var/lib/nfs/, right?

--b.

2009-08-05 21:22:44

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:

> sqlite3 doesn't do anything special under the covers. It uses only
> POSIX file access and locking calls, as far as I know. So I think
> hosting /var on most well-behaved clustering file systems won't have
> any problem with this arrangement.

So we're basically introducing a dependency on a completely new library
that will have to be added to boot partitions/nfsroot/etc, and we have
no real reason for doing it other than because we want to move from
using sync() to fsync()?

Sounds like a NACK to me...

Trond

2009-08-05 22:24:35

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>> sqlite3 doesn't do anything special under the covers. It uses only
>> POSIX file access and locking calls, as far as I know. So I think
>> hosting /var on most well-behaved clustering file systems won't have
>> any problem with this arrangement.
>
> So we're basically introducing a dependency on a completely new library
> that will have to be added to boot partitions/nfsroot/etc, and we have
> no real reason for doing it other than because we want to move from
> using sync() to fsync()?
>
> Sounds like a NACK to me...

Which library are you talking about, libsqlite3 or libtirpc? Because
NEITHER of those is in /lib.

In any event, it's not just sync(2) that is a problem. sync(2) by
itself is a boot performance problem, but it's the combination of
rename and sync that is known to be especially unreliable during
system crashes. Statd, being a crash monitor, shouldn't depend on
rename/sync to maintain persistent data in the face of system
instability. I'd call that a real reason to use something more robust.

Can we try to be a little more constructive, please? Asking the list
(which includes distributors, who actually have to worry about such
things) whether this would be a problem is significantly less abrasive
than just saying "NACK" outright.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-08-05 23:30:12

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> >> sqlite3 doesn't do anything special under the covers. It uses only
> >> POSIX file access and locking calls, as far as I know. So I think
> >> hosting /var on most well-behaved clustering file systems won't have
> >> any problem with this arrangement.
> >
> > So we're basically introducing a dependency on a completely new library
> > that will have to be added to boot partitions/nfsroot/etc, and we have
> > no real reason for doing it other than because we want to move from
> > using sync() to fsync()?
> >
> > Sounds like a NACK to me...
>
> Which library are you talking about, libsqlite3 or libtirpc? Because
> NEITHER of those is in /lib.

libsqlite is the problem. Unlike libtirpc, its utility has yet to be
established.

> In any event, it's not just sync(2) that is a problem. sync(2) by
> itself is a boot performance problem, but it's the combination of
> rename and sync that is known to be especially unreliable during
> system crashes. Statd, being a crash monitor, shouldn't depend on
> rename/sync to maintain persistent data in the face of system
> instability. I'd call that a real reason to use something more robust.

What are you talking about? Is this about the truncate + rename issue
leaving empty files upon a crash?
That issue is solved trivially by doing an fsync() before you rename the
file. That entire discussion was about whether or not existing
applications should be _required_ to do this kind of POSIX pedantry,
when previously they could get away without it.

IOW: that issue alone does not justify replacing the current simple file
based scheme.
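
The fsync()-before-rename sequence described above can be sketched as follows. This is an illustration in Python, not code from statd: atomic_write is a made-up helper name, a POSIX system is assumed, and the final directory fsync() is an extra step beyond what is described here, often added so that the rename itself survives a crash.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so a crash leaves old or new content."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file data reaches disk before the rename
        os.rename(tmp, path)      # atomic replacement within one file system
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Persist the directory entry as well (assumption: POSIX directory fsync).
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```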

> Can we try to be a little more constructive, please? Asking the list
> (which includes distributors, who actually have to worry about such
> things) whether this would be a problem is significantly less abrasive
> than just saying "NACK" outright.

It would be constructive if you could actually _justify_ these
backward-incompatible changes instead of hand waving, and accusing
others of being obstructionist.

Trond

2009-09-09 18:29:56

by Jeff Layton

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 05 Aug 2009 19:30:04 -0400
Trond Myklebust wrote:

> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > >> sqlite3 doesn't do anything special under the covers. It uses only
> > >> POSIX file access and locking calls, as far as I know. So I think
> > >> hosting /var on most well-behaved clustering file systems won't have
> > >> any problem with this arrangement.
> > >
> > > So we're basically introducing a dependency on a completely new library
> > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > no real reason for doing it other than because we want to move from
> > > using sync() to fsync()?
> > >
> > > Sounds like a NACK to me...
> >
> > Which library are you talking about, libsqlite3 or libtirpc? Because
> > NEITHER of those is in /lib.
>
> libsqlite is the problem. Unlike libtirpc, its utility has yet to be
> established.
>

Sorry to revive this so late, but I think we need to come to some
sort of resolution here. The only missing piece for client side IPv6
support is statd...

I'm not sure I understand the objection to using libsqlite3 here. We
certainly could roll our own routines to handle data storage, but why
would we want to do so? sqlite3 is quite good at what it does. Why
wouldn't we want to use it?

> > In any event, it's not just sync(2) that is a problem. sync(2) by
> > itself is a boot performance problem, but it's the combination of
> > rename and sync that is known to be especially unreliable during
> > system crashes. Statd, being a crash monitor, shouldn't depend on
> > rename/sync to maintain persistent data in the face of system
> > instability. I'd call that a real reason to use something more robust.
>
> What are you talking about? Is this about the truncate + rename issue
> leaving empty files upon a crash?
> That issue is solved trivially by doing an fsync() before you rename the
> file. That entire discussion was about whether or not existing
> applications should be _required_ to do this kind of POSIX pedantry,
> when previously they could get away without it.
>
> IOW: that issue alone does not justify replacing the current simple file
> based scheme.
>

There are other reasons not to use the simple file-based scheme, too...

Internationalized domain names will be easier to deal with via sqlite3,
for instance.

Certainly we could code this up ourselves, but what's the benefit to
doing that when we have a perfectly good data storage engine available?

--
Jeff Layton

2009-09-10 20:40:21

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
>>>> byte in a double-byte character can be '/' or '\0'. That means the
>>>> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
>>>> the system's locale has to be configured correctly. If we decide not to
>>>> depend on the file system to support UTF16 filenames, then statd has to
>>>> be intelligent enough to figure out how to deal with converting UTF16
>>>> hostnames before storing them as filenames. Then, we have to teach
>>>> matchhostname() and friends how to deal with double-byte character
>>>> strings...
>>>
>>> Googling around.... Is this accurate?:
>>>
>>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>
>>> That makes it sound like domain names are staying ascii, and they're
>>> just adding something on top to allow encoding unicode using ascii,
>>> which may optionally be used by applications.
>>
>> There is a mechanism that provides an ASCII-ized version of domain names
>> that may contain non-ASCII characters, expressly for applications that
>> need to perform DNS queries but can't be easily converted to handle
>> double-byte character strings. This can be adapted for statd, though I'm
>> not sure if the converted ASCII version of such names specifically
>> excludes '/'.
>>
>> Internationalized domain names themselves are still expressed in UTF16,
>> as far as I understand it.
>
> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears to
> me that protocols (at the very least, any preexisting protocols) are all
> expected to use the ascii representation on the wire, and that the
> translation to unicode is meant for use by applications.
>
> So in our case we'd continue to expect ascii domain names on the wire,
> and I believe that's also what we should store in any database. But if
> someone were to write a gui administrative interface to that data, for
> example, they might choose to use idna for display.

That's a reasonable and specific objection to my claim that our
current host record storage format is inadequate to support IDNA.
I've also confirmed that ToAscii with the UseSTD3ASCIIRules flag set
is not supposed to generate a domain label string with a '/' in it.
My remaining concern here is that we could possibly see hostnames that
are too long to be stored in directory entries of some file systems,
especially considering that the ASCII-fied Unicode names will be
longer than typical ASCII names we normally encounter today.
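
To illustrate the ASCII form under discussion, here is a minimal sketch using Python's built-in "idna" codec, which implements RFC 3490 ToASCII. The helper names are hypothetical. Python's codec does not apply UseSTD3ASCIIRules, so the sketch rejects '/' and NUL itself, and it compares the result against the directory's PC_NAME_MAX to address the length concern.

```python
import os


def to_ascii_hostname(hostname):
    """Return the ASCII (ACE) form of a possibly non-ASCII hostname."""
    ascii_name = hostname.encode("idna").decode("ascii")
    # The codec skips the STD3 checks, so screen out bytes that a
    # POSIX file name cannot contain.
    if "/" in ascii_name or "\0" in ascii_name:
        raise ValueError("hostname is not usable as a file name")
    return ascii_name


def fits_in_directory(name, directory):
    """True if `name` is short enough for a directory entry there."""
    return len(name.encode("ascii")) <= os.pathconf(directory, "PC_NAME_MAX")


# "b\u00fccher" is "bucher" with a u-umlaut as the second letter.
name = to_ascii_hostname("b\u00fccher.example")
print(name)  # xn--bcher-kva.example
```

The encoded form is longer than the original, which is the growth referred to above.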

What about multi-homed host support? The same mon_name can be used
with more than one my_name, for multi-homed hosts. Using the current
on-disk scheme, statd turns that SM_MON request into a no-op. So
additional records for the same hostname can't be stored, or we have
to resort to adding multiple lines in the same file. This is possible
to do with just POSIX file system calls, but it does add complexity to
manage several lines in each hostname file without increasing the risk
of corruption if a file update (especially the deletion of one record
in the middle) is interrupted.
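
For comparison, here is a sketch of how the multi-homed case could be expressed with an embedded database. It uses Python's sqlite3 module for brevity; the table layout is an assumption for illustration only and is not the schema from the patch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE monitor (
        mon_name TEXT NOT NULL,   -- remote peer being monitored
        my_name  TEXT NOT NULL,   -- local name lockd passed in SM_MON
        cookie   BLOB NOT NULL,   -- opaque private data from SM_MON
        UNIQUE (mon_name, my_name)
    )
    """
)


def record_monitor(conn, mon_name, my_name, cookie):
    """Store one SM_MON request; repeating the same pair is a no-op."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO monitor VALUES (?, ?, ?)",
            (mon_name, my_name, cookie),
        )


def forget_monitor(conn, mon_name, my_name):
    """Delete one record without touching the peer's other records."""
    with conn:
        conn.execute(
            "DELETE FROM monitor WHERE mon_name = ? AND my_name = ?",
            (mon_name, my_name),
        )


record_monitor(conn, "client.example.net", "eth0.server.example.net", b"\x01" * 16)
record_monitor(conn, "client.example.net", "eth1.server.example.net", b"\x02" * 16)
record_monitor(conn, "client.example.net", "eth0.server.example.net", b"\x01" * 16)
forget_monitor(conn, "client.example.net", "eth0.server.example.net")
```

Deleting the "middle" record is a single transaction here, so an interrupted update leaves either both records or just the remaining one.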

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-09 19:42:14

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
> The old statd still exists in nfs-utils. The new statd is an entirely
> separate component. Distributions can continue to use the old statd
> as long as they want. This is a red herring.

Bullshit. If they are adding IPv6 support, then they will have to
upgrade at some point.

> > Simplicity is another reason. WTF do we need a full SQL database, when
> > all we want to do is store 2 pieces of data (a hostname and a cookie)?
> > It isn't as if this has been a major problem for us previously.
>
> Because we are not storing just a hostname and a cookie. We are
> storing several different data items for each host, and we need to
> search over the records, and provide uniqueness constraints, and
> handle data conversion (for binary data like the cookie, for string
> data like the hostname, and for integers, like the prog/vers/proc
> tuple). We need to store them durably on persistent storage to have
> some protection against crashes. These are all things that an
> embedded database can do well, and that we therefore don't have to
> code ourselves.

Speaking of red herrings. Why are we adding all this crap?

This is a legacy filesystem! We shouldn't be rewriting NLM/NSM from
scratch, just add minimal support for IPv6.

> >>>> In any event, it's not just sync(2) that is a problem. sync(2) by
> >>>> itself is a boot performance problem, but it's the combination of
> >>>> rename and sync that is known to be especially unreliable during
> >>>> system crashes. Statd, being a crash monitor, shouldn't depend on
> >>>> rename/sync to maintain persistent data in the face of system
> >>>> instability. I'd call that a real reason to use something more robust.
> >>>
> >>> What are you talking about? Is this about the truncate + rename issue
> >>> leaving empty files upon a crash?
> >>> That issue is solved trivially by doing an fsync() before you rename the
> >>> file. That entire discussion was about whether or not existing
> >>> applications should be _required_ to do this kind of POSIX pedantry,
> >>> when previously they could get away without it.
> >>>
> >>> IOW: that issue alone does not justify replacing the current simple file
> >>> based scheme.
> >>>
> >>
> >> There are other reasons not to use the simple file-based scheme, too...
> >>
> >> Internationalized domain names will be easier to deal with via sqlite3,
> >> for instance.
> >
> > Please explain...
>
> IPv6 is used in Asia, where they almost certainly need to use
> non-ASCII characters in their hostnames. Internationalized domain names
> are stored in double-wide character sets. To provide reliable support
> for IDNs in statd, we will have to guarantee somehow that we can store
> an IDN as a file name (if we want to stay with the current scheme), no
> matter what file system is used for /var.

So, what's stopping us? These are POSIX filesystems. They can store any
filename as long as it doesn't contain '/' or '\0'.

> What's more, multi-homed host support will need to store multiple
> records for the same hostname. The mon_name is the same, but my_name
> is different, for each of these records. So we could do that by
> adding more than one line in each hostname file, but it's also a
> simple matter to set this up in SQL.
>
> When we want to have statd remember things like multiple addresses for
> the same hostname, or whether the remote is a client or server, we
> will need to make more adjustments to the files.
>
> As we get more and more new requirements, why lock ourselves into the
> current on-disk format? Using sqlite3 means we can store new fields and
> new records without any backwards-compatibility issues. It's all
> handled by the database code. So, we can think about the high level
> problem of getting statd to behave correctly rather than worry about
> the details of exactly how we are going to get the next data item
> stored in our current files in a backward compatible way.

Again. This is a legacy filesystem. Why are we adding requirements?

> >> Certainly we could code this up ourselves, but what's the benefit to
> >> doing that when we have a perfectly good data storage engine
> >> available?
> >
> > Why change something that works???? Rewriting from scratch is _NOT_
> > the
> > Linux way, and has usually bitten us hard when we've done it.
>
> Because we are adding a bunch of new feature requirements.
> Internationalized domain names, multi-homed host support, IPv6 and TI-
> RPC, fast boot times, keeping better track of remote host addresses,
> keeping track of which remotes are clients and which are servers, and
> support for sending notifications via TCP all require significant
> modifications to this code base.
>
> At some point you have to look at the code you have, and decide it's
> simply not going to be adequate, going forward.
>
> > The 2.6.19 rewrite of the kernel mount code springs to mind...
>
> One can just as easily argue that we've been bitten hard precisely
> because we've let things rot, or because we have inadequate testing
> for these components.
>
> Another red herring, and especially annoying because you've known I
> was rewriting statd for months. Only now, when I'm done, do you say
> "rewriting is not the Linux way."

I have _NEVER_ agreed to a rewrite of the storage formats. You sprang
this crap on me a month ago, and I made my feelings quite clear then.

2009-09-09 19:18:06

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
>> On Wed, 05 Aug 2009 19:30:04 -0400
>> Trond Myklebust wrote:
>>
>>> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
>>>> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
>>>>> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>>>>>> sqlite3 doesn't do anything special under the covers. It uses
>>>>>> only
>>>>>> POSIX file access and locking calls, as far as I know. So I
>>>>>> think
>>>>>> hosting /var on most well-behaved clustering file systems won't
>>>>>> have
>>>>>> any problem with this arrangement.
>>>>>
>>>>> So we're basically introducing a dependency on a completely new
>>>>> library
>>>>> that will have to be added to boot partitions/nfsroot/etc, and
>>>>> we have
>>>>> no real reason for doing it other than because we want to move
>>>>> from
>>>>> using sync() to fsync()?
>>>>>
>>>>> Sounds like a NACK to me...
>>>>
>>>> Which library are you talking about, libsqlite3 or libtirpc?
>>>> Because
>>>> NEITHER of those is in /lib.
>>>
>>> libsqlite is the problem. Unlike libtirpc, its utility has yet to
>>> be
>>> established.
>>>
>>
>> Sorry to revive this so late, but I think we need to come to some
>> sort of resolution here. The only missing piece for client side IPv6
>> support is statd...
>>
>> I'm not sure I understand the objection to using libsqlite3 here. We
>> certainly could roll our own routines to handle data storage, but why
>> would we want to do so? sqlite3 is quite good at what it does. Why
>> wouldn't we want to use it?
>
> Backwards compatibility is one major reason. statd already exists, and
> is in use out there. I shouldn't be forced to reboot all my clients
> when
> I upgrade the nfs-utils package on my server.

The old statd still exists in nfs-utils. The new statd is an entirely
separate component. Distributions can continue to use the old statd
as long as they want. This is a red herring.

> Simplicity is another reason. WTF do we need a full SQL database, when
> all we want to do is store 2 pieces of data (a hostname and a cookie)?
> It isn't as if this has been a major problem for us previously.

Because we are not storing just a hostname and a cookie. We are
storing several different data items for each host, and we need to
search over the records, and provide uniqueness constraints, and
handle data conversion (for binary data like the cookie, for string
data like the hostname, and for integers, like the prog/vers/proc
tuple). We need to store them durably on persistent storage to have
some protection against crashes. These are all things that an
embedded database can do well, and that we therefore don't have to
code ourselves.
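
For illustration only: a minimal sketch of how the items listed above (string, integer and binary fields, a uniqueness constraint, and searching) might map onto an embedded database. The table name, column names and helper functions are hypothetical and are not taken from the new-statd patches; Python's standard sqlite3 module stands in for the C library.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS monitor (
    mon_name TEXT    NOT NULL,   -- hostname of the monitored remote
    my_name  TEXT    NOT NULL,   -- local name the remote knows us by
    prog     INTEGER NOT NULL,   -- RPC program for the callback
    vers     INTEGER NOT NULL,   -- RPC version
    proc     INTEGER NOT NULL,   -- RPC procedure
    cookie   BLOB    NOT NULL,   -- opaque private data
    UNIQUE (mon_name, my_name)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_monitor(db, mon_name, my_name, prog, vers, proc, cookie: bytes) -> bool:
    """Insert one record; return False if the (mon_name, my_name) pair exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO monitor VALUES (?, ?, ?, ?, ?, ?)",
                (mon_name, my_name, prog, vers, proc, cookie),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def monitors_for(db, mon_name):
    """Return (my_name, cookie) for every record kept for one remote."""
    rows = db.execute(
        "SELECT my_name, cookie FROM monitor WHERE mon_name = ? ORDER BY my_name",
        (mon_name,),
    )
    return [(my_name, bytes(cookie)) for my_name, cookie in rows]
```

In this sketch the UNIQUE constraint rejects a second record for the same (mon_name, my_name) pair, while still allowing several records that share a mon_name.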

Simplicity is in the eye of the beholder. I can easily counterargue
that it's simpler to rely on a library than to duplicate the
functionality in statd. If we think it is appropriate to use glibc
for managing a memory heap rather than open coding it, and libtirpc
for handling RPC calls rather than open coding it, then why shouldn't
we use another library for managing our host records on disk, rather
than open coding our data conversion and record searching logic?

Again, this is a red herring.

>>>> In any event, it's not just sync(2) that is a problem. sync(2) by
>>>> itself is a boot performance problem, but it's the combination of
>>>> rename and sync that is known to be especially unreliable during
>>>> system crashes. Statd, being a crash monitor, shouldn't depend on
>>>> rename/sync to maintain persistent data in the face of system
>>>> instability. I'd call that a real reason to use something more
>>>> robust.
>>>
>>> What are you talking about? Is this about the truncate + rename
>>> issue
>>> leaving empty files upon a crash?
>>> That issue is solved trivially by doing an fsync() before you
>>> rename the
>>> file. That entire discussion was about whether or not existing
>>> applications should be _required_ to do this kind of POSIX pedantry,
>>> when previously they could get away without it.
>>>
>>> IOW: that issue alone does not justify replacing the current
>>> simple file
>>> based scheme.
>>>
>>
>> There are other reasons not to use the simple file-based scheme,
>> too...
>>
>> Internationalized domain names will be easier to deal with via
>> sqlite3,
>> for instance.
>
> Please explain...

IPv6 is used in Asia, where they almost certainly need to use non-
ASCII characters in their hostnames. Internationalized domain names
are stored in double-wide character sets. To provide reliable support
for IDNs in statd, we will have to guarantee somehow that we can store
an IDN as a file name (if we want to stay with the current scheme), no
matter what file system is used for /var.

What's more, multi-homed host support will need to store multiple
records for the same hostname. The mon_name is the same, but my_name
is different, for each of these records. So we could do that by
adding more than one line in each hostname file, but it's also a
simple matter to set this up in SQL.

When we want to have statd remember things like multiple addresses for
the same hostname, or whether the remote is a client or server, we
will need to make more adjustments to the files.

As we get more and more new requirements, why lock ourselves into the
current on-disk format? Using sqlite3 means we can store new fields and
new records without any backwards-compatibility issues. It's all
handled by the database code. So, we can think about the high level
problem of getting statd to behave correctly rather than worry about
the details of exactly how we are going to get the next data item
stored in our current files in a backward compatible way.
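
For illustration only: a minimal sketch of adding a field to records that already exist on disk. The table, the column names and the helper are hypothetical; the sketch only shows that existing rows pick up a default value for a column added later.

```python
import sqlite3


def add_column_if_missing(db: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add a column to an existing table; return True if it was added.

    table, column and decl are trusted identifiers here; a real tool would
    validate them, since they cannot be bound as SQL parameters.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in db.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    with db:
        db.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


# An "old format" table with one existing record (names are made up).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE monitor (mon_name TEXT NOT NULL, cookie BLOB NOT NULL)")
db.execute("INSERT INTO monitor VALUES (?, ?)", ("client.example", b"\x01"))

# Later code wants to remember whether the remote is a client or a server.
add_column_if_missing(db, "monitor", "is_server", "INTEGER NOT NULL DEFAULT 0")
print(db.execute("SELECT mon_name, is_server FROM monitor").fetchall())
# [('client.example', 0)]
```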

>> Certainly we could code this up ourselves, but what's the benefit to
>> doing that when we have a perfectly good data storage engine
>> available?
>
> Why change something that works???? Rewriting from scratch is _NOT_
> the
> Linux way, and has usually bitten us hard when we've done it.

Because we are adding a bunch of new feature requirements.
Internationalized domain names, multi-homed host support, IPv6 and TI-
RPC, fast boot times, keeping better track of remote host addresses,
keeping track of which remotes are clients and which are servers, and
support for sending notifications via TCP all require significant
modifications to this code base.

At some point you have to look at the code you have, and decide it's
simply not going to be adequate, going forward.

> The 2.6.19 rewrite of the kernel mount code springs to mind...

One can just as easily argue that we've been bitten hard precisely
because we've let things rot, or because we have inadequate testing
for these components.

Another red herring, and especially annoying because you've known I
was rewriting statd for months. Only now, when I'm done, do you say
"rewriting is not the Linux way."

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-09 23:15:42

by Steve Dickson

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On 09/09/2009 06:18 PM, Chuck Lever wrote:
> On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
>> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>>> The old statd still exists in nfs-utils. The new statd is an entirely
>>> separate component. Distributions can continue to use the old statd
>>> as long as they want. This is a red herring.
>>
>> Bullshit. If they are adding IPv6 support, then they will have to
>> upgrade at some point.
>
> I don't see a problem with a distribution upgrade using old statd and a
> fresh install using new statd. You have to install a lot of new
> components to get NFS/IPv6 support.
What new components are needed that are not already being installed??

> It's not like the only thing that needs to change is statd.
> People will install a new distribution to get IPv6 support.
> With so many simple ways to install from scratch, the days of someone
> upgrading just a few pieces of an old system to get a new feature,
> especially one as extensive as NFS/IPv6, are long gone.
I'm not sure how people could only install bits and pieces of
nfs-utils... Even 'make install' in the git tree installs everything...

>
> And you have never clearly answered why it wouldn't be enough to add a
> little code to convert the current on-disk format to sqlite3 when
> upgrading to the new statd, if upgradability is truly an important
> requirement. Possibly this is because it eliminates the only real
> technical objection you have to using sqlite3 here.
The issue I would have with using sqlite3 is that it would add yet another
requirement on nfs-utils... I really don't know how big the sqlite3 and/or
sqlite3-devel (possibly needed for builds) packages are, but it is just
one more thing that will be needed for nfs-utils to function...

>
>>>> Simplicity is another reason. WTF do we need a full SQL database, when
>>>> all we want to do is store 2 pieces of data (a hostname and a cookie)?
>>>> It isn't as if this has been a major problem for us previously.
>>>
>>> Because we are not storing just a hostname and a cookie. We are
>>> storing several different data items for each host, and we need to
>>> search over the records, and provide uniqueness constraints, and
>>> handle data conversion (for binary data like the cookie, for string
>>> data like the hostname, and for integers, like the prog/vers/proc
>>> tuple). We need to store them durably on persistent storage to have
>>> some protection against crashes. These are all things that an
>>> embedded database can do well, and that we therefore don't have to
>>> code ourselves.
>>
>> Speaking of red herrings. Why are we adding all this crap?
>>
>> This is a legacy filesystem! We shouldn't be rewriting NLM/NSM from
>> scratch, just adding minimal support for IPv6.
>
> You and Bruce brought up a number of work items related to statd,
> including having distinct statd behavior for remotes who are clients and
> remotes who are servers. Tom Talpey suggested we needed to send
> multiple SM_NOTIFY requests to each host, and use TCP to do it when
> possible, and you even specifically encouraged me to read his
> connectathon presentation on this. If Asian countries are driving the
> IPv6 requirement, why wouldn't they want IDN support as well?
> Interoperable NFS/IPv6 support requires TI-RPC. Plus, NFS/IPv6
> practically requires multi-homed NLM/NSM support -- see Alex's RFC draft
> for details on that.
So a database is needed to accomplish all this?

>
> Let me also point out that old statd is already broken in a number of
> ways, and I certainly haven't heard a lot of complaints about it. Our
> client NLM has sent "0" as our NSM state number for years, for example.
> Thus I hardly think there is a lot of risk in making changes here. It
> can only get better.
>
I can agree with you here...

>>> IPv6 is used in Asia, where they almost certainly need to use non-
>>> ASCII characters in their hostnames. Internationalized domain names
>>> are stored in double-wide character sets. To provide reliable support
>>> for IDNs in statd, we will have to guarantee somehow that we can store
>>> an IDN as a file name (if we want to stay with the current scheme), no
>>> matter what file system is used for /var.
>>
>> So, what's stopping us? These are POSIX filesystems. They can store any
>> filename as long as it doesn't contain '/' or '\0'.
>
> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
> byte in a double-byte character can be '/' or '\0'. That means the
> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
> the system's locale has to be configured correctly. If we decide not to
> depend on the file system to support UTF16 filenames, then statd has to
> be intelligent enough to figure out how to deal with converting UTF16
> hostnames before storing them as filenames. Then, we have to teach
> matchhostname() and friends how to deal with double-byte character
> strings...
Has this been a problem in the past? How are other implementations
dealing with this? Have they gone to use a db as well?

>
> Or we just tell sqlite3 that this is a double-byte character string, and
> let it handle the collation and on-disk storage details for us.
>
> The point is, this is yet another detail we have to either worry about
> and open code in statd, or we can simply rely on what's already provided
> in sqlite3. No one, repeat NO ONE, is arguing that you can't implement
> these features without sqlite3. My argument is that we quickly bury a
> whole bunch of details if we use sqlite3, and can then focus on larger
> issues. That's the prime goal of software layering with libraries.
What kind of performance hit will there be (if any)? The nice thing
about a file is you only have to read it once into a cache versus
doing a number of queries... or can one also cache queries?

>
> We can open code any or all of statd. In fact the current statd open
> codes RPC request creation in socket buffers rather than using glibc's
> RPC API, and I think we agree that is not an optimal solution. The
> question is: should we duplicate code and bugs by open coding statd's
> RPC and data storage? Or should we pretend to be modern software
> engineers, and use widely used and known good code that other people
> have written already to handle these details?
I'm all for moving forward with "modern software" but, as
a common theme with me, I'm always worried about becoming
needlessly complicated or over-engineering... which might be
the case with having statd use a db...

steved.

2009-09-15 02:45:34

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 14, 2009, at 3:08 AM, Neil Brown wrote:
> On Thursday September 10, Chuck Lever wrote:
>> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:

> But you will leave one day. How can you best make sure that you leave
> something that others can maintain????

By writing code that is self-explanatory, providing lots of comments
in the code, adding to the git log (as you suggested) and writing
expansive man pages that describe the interfaces in as clear a manner
as possible. The review process is also part of that effort.

There is also the possibility of mentoring others, as FreeBSD does,
and providing extensive written documentation and specifications in
wikis. Agile methodologies suggest that rewriting as a regular
practice is a good way for a team to retain familiarity with a code
base. Having a full test suite that can be used to verify the
behavior of new or existing code is also a way to codify requirements
and create an institutional memory of regressions, as well as to
insulate users from regressions in new code.

>> My point is that many of the items I mentioned above are expressly
>> designed to allow quicker, less risky change, precisely to decrease
>> the amount of time and effort to get new features into our code. Yet
>> we turn our back on all of them in favor of an antique "don't touch
>> that!" policy. "Don't touch that!" is not a reasonable argument
>> against replacing components that need to be replaced.
>
> The only "Don't touch that" which I am aware of relates to interfaces,
> particularly with established code.
> In the case of statd, the files in sm/ and sm.bak/ are a well
> established interface. Exactly how much is dependent on it is hard to
> say. Not much formal code I expect but maybe some obscure scripts and
> lots of sysadmin knowledge.

I'm not aware of any documentation of statd's on-disk format as a
formal interface. I have had some recent conversations with Lon about
this, to handle any dependencies his clustering scripts may have, and
he didn't throw up any flags. He told me that all we needed was to
provide a mechanism to access this data from a shell script, which we
would have in 'sqlite3' the executable.

So this is a new requirement (to me, anyway). If these files
constitute a formal interface, how can statd be modified to store
additional data or new data types in these files? Am I allowed to put
IPv6 presentation addresses in these files in place of IPv4
addresses? Am I allowed to add new fields? Not rhetorical
questions... really... how should I go about doing this and testing
the result? You seem to be suggesting that the sm/* files can't be
used for the kind of features we want to add.

> Can you run them both in parallel?? i.e. have a database with all the
> data, but also store it in the files (if the hostname can be
> represented in ASCII)... It is hard to guess how easy that would be
> and how worthwhile it would be. And it doesn't answer the question of
> whether sqlite is stable enough.

Is it even a good thing to freeze the sm/* files as a formal
interface, or should we go about providing a real documented
programming interface for this, and migrate to it? There is a real
risk to maintaining undocumented interfaces like this, and that is
that we can't make any change to this code without a significant
possibility of breaking something.

>>> I think that the switch from portmap to rpcbind was a bad idea,
>>> and I think that a wholesale replacement of statd is probably a
>>> bad idea too. It might seem like the easiest way to get something
>>> useful working, but you'll probably be paying the price for years as
>>> little regressions turn up because you didn't completely understand
>>> the original statd (and face it, who does?)
>>
>> Yes, but _why_ is it a bad idea? All I hear is "this is a bad idea"
>> and "you could do it some other way" but these are qualitative, not
>> quantitative arguments. They are religious statements, not specific
>> technical criticisms.
>
> It is a bad idea because it doesn't have the legacy of testing and
> refinement. Almost as soon as we started using it bugs were found -
> or at least differences in behaviour to portmap (something about the
> privilege level required to register a binding I think).
>
> Now I admit that no one put their hand up to add IPv6 support to
> portmap, arguably it could have been a worse idea to stay with portmap
> as it meant no IPv6. But changing was still a bad idea.
>
> Had we (had the man power to) incrementally enhanced portmap we would
> have had a much more reviewable process, and a bisectable result which
> would allow regressions to be isolated more directly.

What we have now is an inherited body of code (with its own history of
incremental improvement) that is shared with many other operating
systems, which improves our ability to interoperate with them, and
includes bug fixes that have been made to it over the years.

I think we would have had some bugs and regressions pursuing either
path. There are well-understood ways to manage these risks, either way.

But this is a sidebar.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-10 15:01:47

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 7:15 PM, Steve Dickson wrote:
> On 09/09/2009 06:18 PM, Chuck Lever wrote:
>> On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
>>> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>>>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>>>> The old statd still exists in nfs-utils. The new statd is an
>>>> entirely
>>>> separate component. Distributions can continue to use the old
>>>> statd
>>>> as long as they want. This is a red herring.
>>>
>>> Bullshit. If they are adding IPv6 support, then they will have to
>>> upgrade at some point.
>>
>> I don't see a problem with a distribution upgrade using old statd
>> and a
>> fresh install using new statd. You have to install a lot of new
>> components to get NFS/IPv6 support.
> What new components are needed that are not already being installed??

You need a kernel that can do NFS/IPv6, you need to install rpcbind
and libtirpc, you need the new mount command, you need all the user
space network pieces to manage IPv6, you need to consider firewall and
address distribution on your local network, and you need statd and
mountd/exportfs to get NFS/IPv6 support.

Configuring a system for IPv6 support can also be nontrivial, and not
something people will do on a whim.

I didn't mean to imply that some of these components are not already
installed. My point is that the required changes for NFS/IPv6 are
widespread, and that most people would opt for installing a new OS on
their systems to get these features, rather than upgrade all of these
items piecemeal.

>> And you have never clearly answered why it wouldn't be enough to
>> add a
>> little code to convert the current on-disk format to sqlite3 when
>> upgrading to the new statd, if upgradability is truly an important
>> requirement. Possibly this is because it eliminates the only real
>> technical objection you have to using sqlite3 here.
> The issue I would have with using sqlite3 is that it would add yet another
> requirement on nfs-utils... I really don't know how big the sqlite3 and/or
> sqlite3-devel (possibly needed for builds) packages are, but it is just
> one more thing that will be needed for nfs-utils to function...

sqlite.org provides a single-source-file version of sqlite3 that is
licensed and designed explicitly for folks to include in their own
code, without the need for linking a library. You can even disable a
number of build time options to reduce object size.

This means that the libsqlite3 and libsqlite3-devel packages would not
be required on either the build system or the end system, and it
eliminates the issue of whether libsqlite3.so can be moved to /lib.

>>>>> Simplicity is another reason. WTF do we need a full SQL
>>>>> database, when
>>>>> all we want to do is store 2 pieces of data (a hostname and a
>>>>> cookie)?
>>>>> It isn't as if this has been a major problem for us previously.
>>>>
>>>> Because we are not storing just a hostname and a cookie. We are
>>>> storing several different data items for each host, and we need to
>>>> search over the records, and provide uniqueness constraints, and
>>>> handle data conversion (for binary data like the cookie, for string
>>>> data like the hostname, and for integers, like the prog/vers/proc
>>>> tuple). We need to store them durably on persistent storage to
>>>> have
>>>> some protection against crashes. These are all things that an
>>>> embedded database can do well, and that we therefore don't have to
>>>> code ourselves.
>>>
>>> Speaking of red herrings. Why are we adding all this crap?
>>>
>>> This is a legacy filesystem! We shouldn't be rewriting NLM/NSM
>>> from
>>> scratch, just adding minimal support for IPv6.
>>
>> You and Bruce brought up a number of work items related to statd,
>> including having distinct statd behavior for remotes who are
>> clients and
>> remotes who are servers. Tom Talpey suggested we needed to send
>> multiple SM_NOTIFY requests to each host, and use TCP to do it when
>> possible, and you even specifically encouraged me to read his
>> connectathon presentation on this. If Asian countries are driving
>> the
>> IPv6 requirement, why wouldn't they want IDN support as well?
>> Interoperable NFS/IPv6 support requires TI-RPC. Plus, NFS/IPv6
>> practically requires multi-homed NLM/NSM support -- see Alex's RFC
>> draft
>> for details on that.
> So a database is needed to accomplish all this?

No, a database is not specifically required.

However, libsqlite3 is a library that contains all of the elements we
need: durable on-disk storage, proper data conversion for binary blobs,
single- and double-width character strings, and integers, the ability to
constrain record uniqueness, the ability to add new data items easily
to each record, and a facility for collating and searching the host
records.

sqlite3 is an embedded database, meaning the implementation is
purposely smaller than a full SQL database, and is designed explicitly
to have zero database administration requirements. sqlite3 is
designed for managing data for long-running network daemons, and it is
widely used for that purpose.

If there is some other pre-existing code that can do this, I'm open to
considering it.

>> Let me also point out that old statd is already broken in a number of
>> ways, and I certainly haven't heard a lot of complaints about it.
>> Our
>> client NLM has sent "0" as our NSM state number for years, for
>> example.
>> Thus I hardly think there is a lot of risk in making changes here.
>> It
>> can only get better.
>>
> I can agree with you here...
>
>>>> IPv6 is used in Asia, where they almost certainly need to use non-
>>>> ASCII characters in their hostnames. Internationalized domain
>>>> names
>>>> are stored in double-wide character sets. To provide reliable
>>>> support
>>>> for IDNs in statd, we will have to guarantee somehow that we can
>>>> store
>>>> an IDN as a file name (if we want to stay with the current
>>>> scheme), no
>>>> matter what file system is used for /var.
>>>
>>> So, what's stopping us? These are POSIX filesystems. They can
>>> store any
>>> filename as long as it doesn't contain '/' or '\0'.
>>
>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>> either
>> byte in a double-byte character can be '/' or '\0'. That means the
>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>> and
>> the system's locale has to be configured correctly. If we decide
>> not to
>> depend on the file system to support UTF16 filenames, then statd
>> has to
>> be intelligent enough to figure out how to deal with converting UTF16
>> hostnames before storing them as filenames. Then, we have to teach
>> matchhostname() and friends how to deal with double-byte character
>> strings...
> Has this been a problem in the past? How are other implementations
> dealing with this? Have they gone to use a db as well?

No, IDNs are recent, but it is reasonable to think that support for
internationalized domain names is a feature that would appeal to the
same folks who are driving the IPv6 requirement. This is not a hard
requirement, but it is one reason why statd's current on-disk format
is not adequate.

Yes, I understand that there are some statd implementations that use a
database rather than flat files. statd is nothing if not exactly a
mechanism for storing structured data across system crashes. That's
exactly what databases are for.

>> Or we just tell sqlite3 that this is a double-byte character
>> string, and
>> let it handle the collation and on-disk storage details for us.
>>
>> The point is, this is yet another detail we have to either worry
>> about
>> and open code in statd, or we can simply rely on what's already
>> provided
>> in sqlite3. No one, repeat NO ONE, is arguing that you can't
>> implement
>> these features without sqlite3. My argument is that we quickly
>> bury a
>> whole bunch of details if we use sqlite3, and can then focus on
>> larger
>> issues. That's the prime goal of software layering with libraries.
> What kind of performance hit will there be (if any)? The nice thing
> about a file is you only have to read it once into a cache versus
> doing a number of queries... or can one also cache queries?

sqlite3's performance for the statd application would actually be
better than what we have today.

Naturally the database is cached in memory, making queries as fast as
memory reads. The better performance comes with record insertion and
deletion. Today statd does a file create and then an O_SYNC write to
that file. This requires synchronous metadata updates to the file
system to create the new file and create a new directory entry for
it. If the directory becomes large, creating a new directory entry
becomes even slower. Likewise for record deletion, multiple
synchronous metadata updates are required to remove the directory
entry and the file containing the host record.
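
For illustration only: a minimal sketch of a one-file-per-host scheme, written with the fsync-before-rename pattern suggested earlier in this thread rather than an O_SYNC write. This is not statd's actual code; the function names and the ".tmp" suffix are hypothetical, and syncing a directory file descriptor assumes a Linux-like system.

```python
import os


def _sync_directory(directory: str) -> None:
    """Flush directory metadata so a new or removed entry survives a crash."""
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def store_record(directory: str, name: str, data: bytes) -> None:
    """Durably create or replace directory/name with data."""
    final_path = os.path.join(directory, name)
    tmp_path = final_path + ".tmp"       # hypothetical temporary-name convention

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                      # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # flush file contents before the rename
    finally:
        os.close(fd)

    os.rename(tmp_path, final_path)      # atomically replace any old record
    _sync_directory(directory)           # make the new directory entry durable


def remove_record(directory: str, name: str) -> None:
    """Durably delete the record for one host."""
    os.unlink(os.path.join(directory, name))
    _sync_directory(directory)
```

Each stored or removed record in this sketch costs at least one file sync plus one directory sync, which is the kind of synchronous metadata traffic described above.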

With sqlite3 (or any database style solution) record insertion and
deletion can usually be handled with a single O_SYNC write to the
database file.
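
For illustration only: a minimal sketch of grouping record deletion and insertion into one transaction with Python's standard sqlite3 module. The table layout and function names are hypothetical, and how many writes and syncs SQLite actually issues per commit depends on its journal mode and settings, which this sketch does not measure.

```python
import sqlite3


def open_durable(path: str) -> sqlite3.Connection:
    """Open the database and ask SQLite to sync to disk at each commit."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA synchronous = FULL")
    db.execute(
        "CREATE TABLE IF NOT EXISTS monitor ("
        "mon_name TEXT PRIMARY KEY, cookie BLOB NOT NULL)"
    )
    db.commit()
    return db


def replace_monitors(db: sqlite3.Connection, stale, fresh) -> None:
    """Delete stale names and insert fresh (name, cookie) pairs atomically.

    Either every change is committed together or, if any statement fails,
    none of them is applied.
    """
    with db:  # one transaction: commit on success, rollback on exception
        db.executemany("DELETE FROM monitor WHERE mon_name = ?", [(n,) for n in stale])
        db.executemany("INSERT INTO monitor VALUES (?, ?)", fresh)
```

A caller would pass a file path to open_durable() for real persistence; ":memory:" is enough to try the transaction behavior.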

You could argue that using sqlite3 means more CPU and memory
consumption. Perhaps, but that's a less onerous resource requirement
than synchronous disk activity, in my view.

>> We can open code any or all of statd. In fact the current statd open
>> codes RPC request creation in socket buffers rather than using
>> glibc's
>> RPC API, and I think we agree that is not an optimal solution. The
>> question is: should we duplicate code and bugs by open coding statd's
>> RPC and data storage? Or should we pretend to be modern software
>> engineers, and use widely used and known good code that other people
>> have written already to handle these details?
> I'm all for moving forward with "modern software" but, as
> a common theme with me, I'm always worried about becoming
> needlessly complicated or over-engineering... which might be
> the case with having statd use a db...

Consider what would happen if we open coded all of the details of on-
disk storage and record searching into statd itself. I think
something like sqlite3 is a better and less complex solution than open
coding because all these details are moved out of statd into a pre-
existing library, thus making statd itself architecturally simpler,
and therefore easier to understand and maintain.

The one weakness here is the dependence on SQL. That makes the statd
code uglier and more complex than I would like, and is something I
want to address.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-14 07:07:39

by NeilBrown

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thursday September 10, Chuck Lever wrote:
> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> > On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:
> >> The idea that "the Linux way" is the best and only way is ridiculous
> >> on its face, anyway. I mean, what do you expect when we have no
> >> requirements and specification process, no formal testing, C coding
> >> style conventions based on 20-year old coding practices, a hit-or-miss
> >> review process that relies more on reviewers' personal preferences
> >> than any kind of standards, no static code analysis tools, no defect
> >> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
> >> combative mailing list environment, and parts of our knowledge base
> >> and team history are lost every time a developer leaves (in this case,
> >> Olaf and Neil)? It's no wonder we never change anything unless
> >> absolutely necessary!
> >
> > And yet it largely works!
> > I could summarise a lot of your points by observing that the community
> > values people over process. I really think that is the right place to
> > put value, because people are richer and more flexible than process.
>
> Agreed, but there are risks to that approach as well, which are
> largely ignored by the Linux community. The last point in my list is
> probably the biggest risk: when people leave, we are stuck with
> decades-old code that no-one understands. Cf: statd.
>

Much of my understanding of statd is embedded in the code in
nfs-utils, and in the change log (which is much better since we
changed to git).
By discarding all that and creating a new project you risk losing that
legacy.

There can certainly be a case of discarding and starting again. I did
that with the support tools for raid (raidtools is no-more, mdadm
reigns :-). But this is not a decision to be taken lightly and is not
without its costs. The fact that raidtools was by that time
essentially an unmaintained orphan made some of those costs
unavoidable in my case.

You obviously could do the same thing - you don't need anyone's
permission: create a new project called "statd" and make it whatever
you want and hope that distributors will pick it up. If your statd
supports IPv6, and the nfs-utils one does not, then there is a good
chance that distros will pick it up as people like to see that big
tick next to "IPv6 support".

Maintaining and developing such a thing long enough to establish a
self-sustaining community would be a big effort. You would need to
compare that with the effort of taking the incremental approach and
revising the code in nfs-utils and getting those revisions accepted.
That would probably have a higher development cost, but a lower
maintenance cost. It is very hard to know up front which cost is
lower.

But you will leave one day. How can you best make sure that you leave
something that others can maintain????

> My point is that many of the items I mentioned above are expressly
> designed to allow quicker, less risky change, precisely to decrease
> the amount of time and effort to get new features into our code. Yet
> we turn our back on all of them in favor of an antique "don't touch
> that!" policy. "Don't touch that!" is not a reasonable argument
> against replacing components that need to be replaced.

The only "Don't touch that" which I am aware of relates to interfaces,
particularly with established code.
In the case of statd, the files in sm/ and sm.bak/ are a well
established interface. Exactly how much is dependent on it is hard to
say. Not much formal code I expect but maybe some obscure scripts and
lots of sysadmin knowledge.

Can you run them both in parallel?? i.e. have a database with all the
data, but also store it in the files (if the hostname can be
represented in ASCII)... It is hard to guess how easy that would be
and how worthwhile it would be. And it doesn't answer the question of
whether sqlite is stable enough.
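
One way to picture the "both in parallel" idea is the sketch below, which writes every record to a database and additionally mirrors it to a per-host file when the hostname is plain ASCII. Everything here is an assumption for illustration: the `hosts` table, the function names, and the choice to store only the cookie in the mirror file. It is not how statd stores its data.

```python
import os
import sqlite3


def is_plain_ascii_name(hostname):
    """True if the hostname can be used directly as a file name here."""
    try:
        hostname.encode("ascii")
    except UnicodeEncodeError:
        return False
    if hostname in ("", ".", ".."):
        return False
    return "/" not in hostname and "\0" not in hostname


def record_host(conn, sm_dir, hostname, cookie):
    """Store the record in the database; mirror it to a file if possible.

    Returns True when the legacy-style file was also written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hosts ("
        " hostname TEXT PRIMARY KEY, cookie BLOB NOT NULL)"
    )
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO hosts (hostname, cookie) VALUES (?, ?)",
            (hostname, cookie),
        )
    if not is_plain_ascii_name(hostname):
        return False               # database only; no file-name mirror
    with open(os.path.join(sm_dir, hostname), "wb") as f:
        f.write(cookie)
    return True
```

The database stays the authoritative copy in this sketch; the files exist only so that scripts and administrators who look in the directory still see the hosts they expect.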

>
> > I agree that combative mailing lists are a problem, but even there, I
> > believe most of the aggression is more perceived than real, and that
> > a graceful, humble, polite attitude can have a positive-feedback effect
> > too.
>
> Years ago I believed that, but I have seen much evidence to the
> contrary in this community. More often such an attitude is entirely
> ignored, or treated as an invitation for abuse, especially by people
> who have no interest in politeness. This kind of approach has no
> effect on the leaders in the Linux community, who set an example of
> extreme rudeness and belligerence.

That last comment is interesting. At the 2007 kernel summit (the last
one I was at) the topic of mailing list etiquette was discussed and
there seemed to be agreement that we, the leaders (I guess the kernel
summit attendees are the closest we have to group leadership) have a
role in damping down the fire, not building it up. My feeling is that
most high profile people do quite well, but maybe I am too forgiving.

>
> I've made an effort to stop arguing small points, and to make
> observations and not argue. I still get e-mail full of "crap" this
> and "bullshit" that and "NACK!" with little explanation.

I certainly agree that sort of response is best not sent. And I must
confess to a recent experience (on a different list) where I gave up
due to similar behaviour. That is partly why I decided to join in
this discussion (though I'm not sure if I'm being helpful yet).

>
> > Yes, there are lots of practices that might improve things that we don't
> > have standardised. But one practice we do have that has proven very
> > effective is incremental refinement. It can be hard to understand what
> > order to make changes until after you have made them, but once you
> > understand what you want to do, going back and doing it in logical
> > order really is very effective. It makes it easier for others to
> > review, it makes it easy for you to review yourself. It means
> > less controversial bits can be included quickly leaving room for the
> > more controversial bits to be discussed in isolation.
>
> I am a fan of incremental refinement, and I use that approach as often
> as I can. There are some things that incremental refinement cannot
> do, however.
>
> > I think that the switch from portmap to rpcbind was a bad idea,
> > and I think that a wholesale replacement of statd is probably a
> > bad idea too. It might seem like the easiest way to get something
> > useful working, but you'll probably be paying the price for years as
> > little regressions turn up because you didn't completely understand
> > the original statd (and face it, who does?)
>
> Yes, but _why_ is it a bad idea? All I hear is "this is a bad idea"
> and "you could do it some other way" but these are qualitative, not
> quantitative arguments. They are religious statements, not specific
> technical criticisms.

It is a bad idea because it doesn't have the legacy of testing and
refinement. Almost as soon as we started using it bugs were found -
or at least differences in behaviour to portmap (something about the
privilege level required to register a binding I think).

Now I admit that no one put their hand up to add IPv6 support to
portmap, so arguably it could have been a worse idea to stay with portmap
as it meant no IPv6. But changing was still a bad idea.

Had we had the manpower to incrementally enhance portmap, we would
have had a much more reviewable process, and a bisectable result which
would allow regressions to be isolated more directly.

>
> This leaves me with the impression that folks are responding out of a
> fear of the unknown, and not out of a considered technical opinion.
> If sqlite3 is outside of people's comfort zone, that's OK. Please
> let's be honest about it instead of slinging mud and throwing up a
> bunch of generic arguments that no one can rebut.
>
> > As for the use of sql-lite ... I must admit that I wouldn't choose
> > it. Maybe it is a good idea. If it is, you probably need to merge
> > that change early with a clear argument and tools to make it manageable
> > (e.g. a developer will want a tool to be able to look inside the
> > database easily and make changes, without having to know sql).
> > It is much easier to discuss one thing at a time on these
> > combative mailing lists ;-)
>
> That's a fine and constructive comment, thanks.
>
> There is already a tool for managing the data in the database:
> 'sqlite3' the executable, which can be used in shell scripts. There
> are also sqlite3 libraries for Python and Perl and C. This is really
> not very different from POSIX file system calls and using 'cat'. SQL
> is not difficult to learn, and a one or two page recipe document is
> easy to provide. I certainly have not used any advanced features of
> SQL to implement new statd.
>
> Given the complexity of the change, it makes it much easier to argue
> against sqlite3, however, if it is separated from the set of changes
> that motivate its use. Bruce, for example, has stated to me
> specifically that he prefers having such changes and their
> motivational requirements included in the same patch. I regularly
> code for several different maintainers, and each one has his own
> preferences, often contradicting other maintainers. I don't think
> regular maintainers have any idea how confusing and challenging this is.
>

:-)

I think there has to be room for balance here. Certainly it is best
to use new functionality as soon as it is added so the use case is
clear. But I can imagine that converting from a files database to a
sqlite database would be several patches in itself. And then if you
want to start adding 'search' or 'utf-16' functionality, that would be
more patches still. Combining all that into one big patch just to
keep the change with the motivation seems unlikely to be a win from
anyone's perspective.
Certainly having a real big changelog comment early on that
explains the value of sqlite would be essential, and that patch would
not be merged until the subsequent ones were reviewed.

> Note, however, I am not married to the specifics of sqlite3. What I
> am attached to is the ideas that the current system is inadequate for
> the kinds of features we want to add to statd, and that statd should
> not worry about the details of data storage, or search and management
> of the host records, because we have other tools that are better at
> this. If there is another solution that provides a durable and
> flexible way to store and search host records and offload the details
> to pre-existing code, I'm open. However, sqlite3 is the most widely
> used embedded database on the planet, and is eminently suitable for
> this task.
>
> > And we do have static code analysis tools. Both 'gcc' and 'sparse'
> > fit that description.
>
> Yes, gcc can provide some static analysis, if the correct options are
> specified, and care is taken to eliminate the noise of false
> positives. There is a prevailing attitude, however, that this is a
> worthless endeavor. Witness the amount of noise that comes out when
> you build a Red Hat kernel or the nfs-utils package. Witness also the
> sarcasm of Linus who repeatedly chides folks for not running sparse
> regularly.
>
> Additionally, gcc is not the best tool for this job, given the often
> oblique way it calls out errors and warnings. There are purify,
> fortify, and splint, just to name three, that are standard analysis
> tools we don't even consider.
>
> My comment goes more to the point that static analysis is not
> considered of any value.

You might be right....

I see there being a "law of diminishing returns" here.
sparse (aka "make C=1") has almost never shown me anything
interesting. So while there is a (small) cost in running it, there is
almost no perceived value.

For my own C projects I always compile with "-Wall -Werror" to
remove the cost of running it and to artificially increase the value.

I just tried "make C=1" on a couple of bits of kernel code and it did
provide some vaguely interesting things that I'll probably fix.
But there was a lot of noise like:
warning: potentially expensive pointer subtraction
which I don't think I want to fix, but don't know how to silence the
warning for just that case. It would be nice if "C=1" was a default
and either the warning, or sparse, were fixed. That might increase
the perception that this sort of thing was of value.

Some time ago Greg Banks went on a pursuit of warnings in nfs-utils
and got rid of all of them -- except those generated by rpcgen.
Have more been introduced?

NeilBrown

2009-09-10 14:09:39

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:
>> The idea that "the Linux way" is the best and only way is ridiculous
>> on its face, anyway. I mean, what do you expect when we have no
>> requirements and specification process, no formal testing, C coding
>> style conventions based on 20-year old coding practices, a hit-or-miss
>> review process that relies more on reviewers' personal preferences
>> than any kind of standards, no static code analysis tools, no defect
>> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
>> combative mailing list environment, and parts of our knowledge base
>> and team history are lost every time a developer leaves (in this case,
>> Olaf and Neil)? It's no wonder we never change anything unless
>> absolutely necessary!
>
> And yet it largely works!
> I could summarise a lot of your points by observing that the community
> values people over process. I really think that is the right place to
> put value, because people are richer and more flexible than process.

Agreed, but there are risks to that approach as well, which are
largely ignored by the Linux community. The last point in my list is
probably the biggest risk: when people leave, we are stuck with
decades-old code that no-one understands. Cf: statd.

My point is that many of the items I mentioned above are expressly
designed to allow quicker, less risky change, precisely to decrease
the amount of time and effort to get new features into our code. Yet
we turn our back on all of them in favor of an antique "don't touch
that!" policy. "Don't touch that!" is not a reasonable argument
against replacing components that need to be replaced.

> I agree that combative mailing lists are a problem, but even there, I
> believe most of the aggression is more perceived than real, and that
> a graceful, humble, polite attitude can have a positive-feedback effect
> too.

Years ago I believed that, but I have seen much evidence to the
contrary in this community. More often such an attitude is entirely
ignored, or treated as an invitation for abuse, especially by people
who have no interest in politeness. This kind of approach has no
effect on the leaders in the Linux community, who set an example of
extreme rudeness and belligerence.

I've made an effort to stop arguing small points, and to make
observations and not argue. I still get e-mail full of "crap" this
and "bullshit" that and "NACK!" with little explanation.

> Yes, there are lots of practices that might improve things that we don't
> have standardised. But one practice we do have that has proven very
> effective is incremental refinement. It can be hard to understand what
> order to make changes until after you have made them, but once you
> understand what you want to do, going back and doing it in logical
> order really is very effective. It makes it easier for others to
> review, it makes it easy for you to review yourself. It means
> less controversial bits can be included quickly leaving room for the
> more controversial bits to be discussed in isolation.

I am a fan of incremental refinement, and I use that approach as often
as I can. There are some things that incremental refinement cannot
do, however.

> I think that the switch from portmap to rpcbind was a bad idea,
> and I think that a wholesale replacement of statd is probably a
> bad idea too. It might seem like the easiest way to get something
> useful working, but you'll probably be paying the price for years as
> little regressions turn up because you didn't completely understand
> the original statd (and face it, who does?)

Yes, but _why_ is it a bad idea? All I hear is "this is a bad idea"
and "you could do it some other way" but these are qualitative, not
quantitative arguments. They are religious statements, not specific
technical criticisms.

This leaves me with the impression that folks are responding out of a
fear of the unknown, and not out of a considered technical opinion.
If sqlite3 is outside of people's comfort zone, that's OK. Please
let's be honest about it instead of slinging mud and throwing up a
bunch of generic arguments that no one can rebut.

> As for the use of sql-lite ... I must admit that I wouldn't choose
> it. Maybe it is a good idea. If it is, you probably need to merge
> that change early with a clear argument and tools to make it manageable
> (e.g. a developer will want a tool to be able to look inside the
> database easily and make changes, without having to know sql).
> It is much easier to discuss one thing at a time on these
> combative mailing lists ;-)

That's a fine and constructive comment, thanks.

There is already a tool for managing the data in the database:
'sqlite3' the executable, which can be used in shell scripts. There
are also sqlite3 libraries for Python and Perl and C. This is really
not very different from POSIX file system calls and using 'cat'. SQL
is not difficult to learn, and a one or two page recipe document is
easy to provide. I certainly have not used any advanced features of
SQL to implement new statd.
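
To give a sense of how small such a recipe could be, here is a sketch of a dump helper written against Python's standard sqlite3 module. It makes no assumption about the schema used by new statd: it asks SQLite for the list of tables and returns every row of each one. The function name and the example file name in the usage note are hypothetical.

```python
import sqlite3


def dump_database(path):
    """Return {table_name: [rows...]} for every table in the database."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master"
                " WHERE type = 'table' ORDER BY name"
            )
        ]
        contents = {}
        for table in tables:
            # Quote the identifier so unusual table names still work.
            quoted = '"' + table.replace('"', '""') + '"'
            contents[table] = conn.execute(
                "SELECT * FROM " + quoted
            ).fetchall()
        return contents
    finally:
        conn.close()
```

Calling `dump_database("statd.db")` on a database file would return a dictionary keyed by table name, which a script can print or compare without the caller writing any SQL.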

Given the complexity of the change, it makes it much easier to argue
against sqlite3, however, if it is separated from the set of changes
that motivate its use. Bruce, for example, has stated to me
specifically that he prefers having such changes and their
motivational requirements included in the same patch. I regularly
code for several different maintainers, and each one has his own
preferences, often contradicting other maintainers. I don't think
regular maintainers have any idea how confusing and challenging this is.

Note, however, I am not married to the specifics of sqlite3. What I
am attached to is the ideas that the current system is inadequate for
the kinds of features we want to add to statd, and that statd should
not worry about the details of data storage, or search and management
of the host records, because we have other tools that are better at
this. If there is another solution that provides a durable and
flexible way to store and search host records and offload the details
to pre-existing code, I'm open. However, sqlite3 is the most widely
used embedded database on the planet, and is eminently suitable for
this task.

> And we do have static code analysis tools. Both 'gcc' and 'sparse'
> fit that description.

Yes, gcc can provide some static analysis, if the correct options are
specified, and care is taken to eliminate the noise of false
positives. There is a prevailing attitude, however, that this is a
worthless endeavor. Witness the amount of noise that comes out when
you build a Red Hat kernel or the nfs-utils package. Witness also the
sarcasm of Linus who repeatedly chides folks for not running sparse
regularly.

Additionally, gcc is not the best tool for this job, given the often
oblique way it calls out errors and warnings. There are purify,
fortify, and splint, just to name three, that are standard analysis
tools we don't even consider.

My comment goes more to the point that static analysis is not
considered of any value.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-14 13:54:35

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, 2009-09-10 at 10:09 -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> > I agree that combative mailing lists are a problem, but even there, I
> > believe most of the aggression is more perceived than real, and that
> > a graceful, humble, polite attitude can have a positive-feedback effect
> > too.
>
> Years ago I believed that, but I have seen much evidence to the
> contrary in this community. More often such an attitude is entirely
> ignored, or treated as an invitation for abuse, especially by people
> who have no interest in politeness. This kind of approach has no
> effect on the leaders in the Linux community, who set an example of
> extreme rudeness and belligerence.
>
> I've made an effort to stop arguing small points, and to make
> observations and not argue. I still get e-mail full of "crap" this
> and "bullshit" that and "NACK!" with little explanation.

As you said above, you've been part of the community for years. It is
not as if you haven't learned by now that a review might turn up issues
that may give rise to a NACK, and that you need to be open to changing
your code should this happen.
If you need more information about what needs to be changed, then you
know to ask.

That hasn't been your approach in this case, though, and the responses
you got were a direct consequence of that approach.
You tried reversing the burden of proof as to why we should change an
established interface instead of supplying adequate evidence justifying
that change.
When problems were pointed out to you (e.g. backward compatibility) your
response was to deny they existed instead of proposing a change to your
code.
Finally, you tried changing the thread into a discussion about mean rude
people obstructing you and failing to give you adequate guidance.

The problems with the code remain, and you will need to change it in
order to make it acceptable. The question I haven't seen you asking, and
that you should have been asking from the very start is "what would be the
minimal set of changes?".

I'm quite willing to discuss that with you.

Trond

2009-09-10 08:44:10

by NeilBrown

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:

> The idea that "the Linux way" is the best and only way is ridiculous
> on its face, anyway. I mean, what do you expect when we have no
> requirements and specification process, no formal testing, C coding
> style conventions based on 20-year old coding practices, a hit-or-miss
> review process that relies more on reviewers' personal preferences
> than any kind of standards, no static code analysis tools, no defect
> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
> combative mailing list environment, and parts of our knowledge base
> and team history are lost every time a developer leaves (in this case,
> Olaf and Neil)? It's no wonder we never change anything unless
> absolutely necessary!

And yet it largely works!
I could summarise a lot of your points by observing that the community
values people over process. I really think that is the right place to
put value, because people are richer and more flexible than process.

I agree that combative mailing lists are a problem, but even there, I
believe most of the aggression is more perceived than real, and that
a graceful, humble, polite attitude can have a positive-feedback effect
too.

Yes, there are lots of practices that might improve things that we don't
have standardised. But one practice we do have that has proven very
effective is incremental refinement. It can be hard to understand what
order to make changes until after you have made them, but once you
understand what you want to do, going back and doing it in logical
order really is very effective. It makes it easier for others to
review, it makes it easy for you to review yourself. It means
less controversial bits can be included quickly leaving room for the
more controversial bits to be discussed in isolation.

I think that the switch from portmap to rpcbind was a bad idea,
and I think that a wholesale replacement of statd is probably a
bad idea too. It might seem like the easiest way to get something
useful working, but you'll probably be paying the price for years as
little regressions turn up because you didn't completely understand
the original statd (and face it, who does?)

As for the use of sql-lite ... I must admit that I wouldn't choose
it. Maybe it is a good idea. If it is, you probably need to merge
that change early with a clear argument and tools to make it manageable
(e.g. a developer will want a tool to be able to look inside the
database easily and make changes, without having to know sql).
It is much easier to discuss one thing at a time on these
combative mailing lists ;-)

NeilBrown

P.S.
And we do have static code analysis tools. Both 'gcc' and 'sparse'
fit that description.

2009-09-15 01:30:13

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 14, 2009, at 9:54 AM, Trond Myklebust wrote:
> On Thu, 2009-09-10 at 10:09 -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
>>> I agree that combative mailing lists are a problem, but even there, I
>>> believe most of the aggression is more perceived than real, and that
>>> a graceful, humble, polite attitude can have a positive-feedback effect
>>> too.
>>
>> Years ago I believed that, but I have seen much evidence to the
>> contrary in this community. More often such an attitude is entirely
>> ignored, or treated as an invitation for abuse, especially by people
>> who have no interest in politeness. This kind of approach has no
>> effect on the leaders in the Linux community, who set an example of
>> extreme rudeness and belligerence.
>>
>> I've made an effort to stop arguing small points, and to make
>> observations and not argue. I still get e-mail full of "crap" this
>> and "bullshit" that and "NACK!" with little explanation.
>
> As you said above, you've been part of the community for years. It is
> not as if you haven't learned by now that a review might turn up issues
> that may give rise to a NACK, and that you need to be open to changing
> your code should this happen.

The tone of your vetoes is unnecessarily aggressive, and often the
comments are entirely negative. A mention of the pieces of new statd
that you liked, for example, would have been welcome, and even useful
for this conversation.

Naturally, you are free to disagree with me as much as you like. But
I wish you could be more constructive about it. This is not a
contest... we're supposed to be working together to improve the code
base.

> If you need more information about what needs to be changed, then you
> know to ask.
>
> That hasn't been your approach in this case, though, and the responses
> you got were a direct consequence of that approach.
> You tried reversing the burden of proof as to why we should change an
> established interface instead of supplying adequate evidence justifying
> that change.

I think it's reasonable to ask for evidence on both sides.

> When problems were pointed out to you (e.g. backward compatibility) your
> response was to deny they existed instead of proposing a change to your
> code.

The problem you described seemed to stem from a basic disagreement
about how statd with IPv6 would be deployed -- an issue that neither
of us (being upstream developers and not distributors) can resolve.
Your objection still doesn't make sense to me. But see below.

> Finally, you tried changing the thread into a discussion about mean rude
> people obstructing you and failing to give you adequate guidance.

I was responding to Neil's comments, not changing the subject.

In any event, since all of the NFS maintainers have now passed their
judgement, it's clear that I will have to withdraw new statd, and
proceed with a re-write that uses the existing on-disk format. I've
never claimed that sqlite3 is a _requirement_ to solve these issues,
but only that some changes would be necessary to the on-disk format,
and that a database seemed an appropriate improvement.

No maintainers agree with that, so I will rework it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-09 22:18:42

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>> The old statd still exists in nfs-utils. The new statd is an entirely
>> separate component. Distributions can continue to use the old statd
>> as long as they want. This is a red herring.
>
> Bullshit. If they are adding IPv6 support, then they will have to
> upgrade at some point.

I don't see a problem with a distribution upgrade using old statd and
a fresh install using new statd. You have to install a lot of new
components to get NFS/IPv6 support. It's not like the only thing that
needs to change is statd. People will install a new distribution to
get IPv6 support. With so many simple ways to install from scratch,
the days of someone upgrading just a few pieces of an old system to
get a new feature, especially one as extensive as NFS/IPv6, are long
gone.

I don't hear a lot of distributors objecting to this idea.

And you have never clearly answered why it wouldn't be enough to add a
little code to convert the current on-disk format to sqlite3 when
upgrading to the new statd, if upgradability is truly an important
requirement. Possibly this is because it eliminates the only real
technical objection you have to using sqlite3 here.
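
A conversion step of the kind suggested here could look roughly like the following sketch. It assumes one file per host in a directory, with the file name serving as the hostname, and it copies each file's contents verbatim into a database row without interpreting them. The table name `imported_hosts`, the column names, and the function name are hypothetical; a real converter would need to parse whatever record format the files actually contain.

```python
import os
import sqlite3


def import_flat_files(sm_dir, conn):
    """Copy every regular file in sm_dir into the database.

    Returns the number of files read.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS imported_hosts ("
        " hostname TEXT NOT NULL UNIQUE,"
        " raw_record BLOB NOT NULL)"
    )
    files_read = 0
    with conn:  # one transaction: all records are imported, or none
        for name in sorted(os.listdir(sm_dir)):
            path = os.path.join(sm_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                data = f.read()
            # Hosts already present in the table are left untouched.
            conn.execute(
                "INSERT OR IGNORE INTO imported_hosts"
                " (hostname, raw_record) VALUES (?, ?)",
                (name, data),
            )
            files_read += 1
    return files_read
```

Running the import inside a single transaction means an interrupted upgrade leaves the database without partial results, so the conversion can simply be retried.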

>>> Simplicity is another reason. WTF do we need a full SQL database, when
>>> all we want to do is store 2 pieces of data (a hostname and a cookie)?
>>> It isn't as if this has been a major problem for us previously.
>>
>> Because we are not storing just a hostname and a cookie. We are
>> storing several different data items for each host, and we need to
>> search over the records, and provide uniqueness constraints, and
>> handle data conversion (for binary data like the cookie, for string
>> data like the hostname, and for integers, like the prog/vers/proc
>> tuple). We need to store them durably on persistent storage to have
>> some protection against crashes. These are all things that an
>> embedded database can do well, and that we therefore don't have to
>> code ourselves.
>
> Speaking of red herrings. Why are we adding all this crap?
>
> This is a legacy filesystem! We shouldn't be rewriting NLM/NSM from
> scratch, just add minimal support for IPv6.

You and Bruce brought up a number of work items related to statd,
including having distinct statd behavior for remotes who are clients
and remotes who are servers. Tom Talpey suggested we needed to send
multiple SM_NOTIFY requests to each host, and use TCP to do it when
possible, and you even specifically encouraged me to read his
connectathon presentation on this. If Asian countries are driving the
IPv6 requirement, why wouldn't they want IDN support as well?
Interoperable NFS/IPv6 support requires TI-RPC. Plus, NFS/IPv6
practically requires multi-homed NLM/NSM support -- see Alex's RFC
draft for details on that.

Which would you like me to drop?

Let me also point out that old statd is already broken in a number of
ways, and I certainly haven't heard a lot of complaints about it. Our
client NLM has sent "0" as our NSM state number for years, for
example. Thus I hardly think there is a lot of risk in making changes
here. It can only get better.

>>>>>> In any event, it's not just sync(2) that is a problem. sync(2) by
>>>>>> itself is a boot performance problem, but it's the combination of
>>>>>> rename and sync that is known to be especially unreliable during
>>>>>> system crashes. Statd, being a crash monitor, shouldn't depend on
>>>>>> rename/sync to maintain persistent data in the face of system
>>>>>> instability. I'd call that a real reason to use something more
>>>>>> robust.
>>>>>
>>>>> What are you talking about? Is this about the truncate + rename
>>>>> issue
>>>>> leaving empty files upon a crash?
>>>>> That issue is solved trivially by doing an fsync() before you
>>>>> rename the
>>>>> file. That entire discussion was about whether or not existing
>>>>> applications should be _required_ to do this kind of POSIX
>>>>> pedantry,
>>>>> when previously they could get away without it.
>>>>>
>>>>> IOW: that issue alone does not justify replacing the current
>>>>> simple file
>>>>> based scheme.
>>>>>
>>>>
>>>> There are other reasons, not to use the simple file-based scheme
>>>> too...
>>>>
>>>> Internationalized domain names will be easier to deal with via
>>>> sqlite3,
>>>> for instance.
>>>
>>> Please explain...
>>
>> IPv6 is used in Asia, where they almost certainly need to use non-
>> ASCII characters in their hostnames. Internationalized domain names
>> are stored in double-wide character sets. To provide reliable
>> support
>> for IDNs in statd, we will have to guarantee somehow that we can
>> store
>> an IDN as a file name (if we want to stay with the current scheme),
>> no
>> matter what file system is used for /var.
>
> So, what's stopping us? These are POSIX filesystems. They can store
> any
> filename as long as it doesn't contain '/' or '\0'.

IDNs are UTF16. /var therefore has to support UTF16 filenames; either
byte in a double-byte character can be '/' or '\0'. That means the
underlying fs implementation has to support UTF16 (FAT32 anyone?), and
the system's locale has to be configured correctly. If we decide not
to depend on the file system to support UTF16 filenames, then statd
has to be intelligent enough to figure out how to deal with converting
UTF16 hostnames before storing them as filenames. Then, we have to
teach matchhostname() and friends how to deal with double-byte
character strings...

Or we just tell sqlite3 that this is a double-byte character string,
and let it handle the collation and on-disk storage details for us.

The point is, this is yet another detail we have to either worry about
and open code in statd, or we can simply rely on what's already
provided in sqlite3. No one, repeat NO ONE, is arguing that you can't
implement these features without sqlite3. My argument is that we
quickly bury a whole bunch of details if we use sqlite3, and can then
focus on larger issues. That's the prime goal of software layering
with libraries.

We can open code any or all of statd. In fact the current statd open
codes RPC request creation in socket buffers rather than using glibc's
RPC API, and I think we agree that is not an optimal solution. The
question is: should we duplicate code and bugs by open coding statd's
RPC and data storage? Or should we pretend to be modern software
engineers, and use widely used and known good code that other people
have written already to handle these details?

>> What's more, multi-homed host support will need to store multiple
>> records for the same hostname. The mon_name is the same, but my_name
>> is different, for each of these records. So we could do that by
>> adding more than one line in each hostname file, but it's also a
>> simple matter to set this up in SQL.
>>
>> When we want to have statd remember things like multiple addresses
>> for
>> the same hostname, or whether the remote is a client or server, we
>> will need to make more adjustments to the files.
>>
>> As we get more and more new requirements, why lock ourselves into the
>> current on-disk format? Using sqlite3 means we can store new fields
>> and new records without any backwards-compatibility issues. It's all
>> handled by the database code. So, we can think about the high level
>> problem of getting statd to behave correctly rather than worry about
>> the details of exactly how we are going to get the next data item
>> stored in our current files in a backward compatible way.
>
> Again. This is a legacy filesystem. Why are we adding requirements?

Maybe you should ask the people who are requesting NFS/IPv6 from Red
Hat and other distributors. Or ask yourself why we would add an
engine to allow NFSv2 to shift authentication flavors without
remounting, since NFSv2 is a legacy filesystem. Or ask why you think
we should add support to statd to recognize the difference between
remote clients and servers, if this is a legacy filesystem.

I don't object to any of those work items, but I do have trouble with
you dropping in and saying "why diddle a legacy file system" when
clearly that is not a show stopper in these other cases.

>>>> Certainly we could code this up ourselves, but what's the benefit
>>>> to
>>>> doing that when we have a perfectly good data storage engine
>>>> available?
>>>
>>> Why change something that works???? Rewriting from scratch is _NOT_
>>> the
>>> Linux way, and has usually bitten us hard when we've done it.
>>
>> Because we are adding a bunch of new feature requirements.
>> Internationalized domain names, multi-homed host support, IPv6 and
>> TI-
>> RPC, fast boot times, keeping better track of remote host addresses,
>> keeping track of which remotes are clients and which are servers, and
>> support for sending notifications via TCP all require significant
>> modifications to this code base.
>>
>> At some point you have to look at the code you have, and decide it's
>> simply not going to be adequate, going forward.
>>
>>> The 2.6.19 rewrite of the kernel mount code springs to mind...
>>
>> One can just as easily argue that we've been bitten hard precisely
>> because we've let things rot, or because we have inadequate testing
>> for these components.
>>
>> Another red herring, and especially annoying because you've known I
>> was rewriting statd for months. Only now, when I'm done, do you say
>> "rewriting is not the Linux way."
>
> I have _NEVER_ agreed to a rewrite of the storage formats. You sprang
> this crap on me a month ago, and I made my feelings quite clear then.

"Rewriting is not the Linux way" is not the same as saying you don't
want to change storage formats. Don't change the subject.

The idea that "the Linux way" is the best and only way is ridiculous
on its face, anyway. I mean, what do you expect when we have no
requirements and specification process, no formal testing, C coding
style conventions based on 20-year old coding practices, a hit-or-miss
review process that relies more on reviewers' personal preferences
than any kind of standards, no static code analysis tools, no defect
metrics or bug meta-analysis tools, kernel debuggers are verboten, a
combative mailing list environment, and parts of our knowledge base
and team history are lost every time a developer leaves (in this case,
Olaf and Neil)? It's no wonder we never change anything unless
absolutely necessary!

You told me to implement IPv6 support in statd. Now you are spitting
on what I worked out without any guidance from you because you're too
busy working on IETF standards and NFSv4.1 to bother discussing
"legacy" code, other than to say "ewe!". Just how should I react to
that, pray tell?

Clearly you do not want to admit that even "minimal" IPv6 support is a
significant effort, especially given how far behind the Linux RPC and
NLM/NSM implementations are. Yelling at me, throwing a bunch of
generic objections up, and calling my code "crap" is not going to make
the problem any simpler.

Cooperating with me to get your way and giving specific and
constructive criticism will go a lot farther than your aggressive and
disrespectful attitude and obnoxious tone. I will happily sit down
and discuss this with you in a rational tone, but I will no longer
tolerate your unwarranted harassment on a public mailing list.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-10 21:26:46

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 4:49 PM, J. Bruce Fields wrote:
> On Thu, Sep 10, 2009 at 04:39:51PM -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
>>> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>>>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>>>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>>>>>> either
>>>>>> byte in a double-byte character can be '/' or '\0'. That means
>>>>>> the
>>>>>> underlying fs implementation has to support UTF16 (FAT32
>>>>>> anyone?),
>>>>>> and
>>>>>> the system's locale has to be configured correctly. If we decide
>>>>>> not to
>>>>>> depend on the file system to support UTF16 filenames, then statd
>>>>>> has
>>>>>> to
>>>>>> be intelligent enough to figure out how to deal with converting
>>>>>> UTF16
>>>>>> hostnames before storing them as filenames. Then, we have to
>>>>>> teach
>>>>>> matchhostname() and friends how to deal with double-byte
>>>>>> character
>>>>>> strings...
>>>>>
>>>>> Googling around.... Is this accurate?:
>>>>>
>>>>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>>>
>>>>> That makes it sound like domain names are staying ascii, and
>>>>> they're
>>>>> just adding something on top to allow encoding unicode using
>>>>> ascii,
>>>>> which may optionally be used by applications.
>>>>
>>>> There is a mechanism that provides an ASCII-ized version of domain
>>>> names
>>>> that may contain non-ASCII characters, expressly for applications
>>>> that
>>>> need to perform DNS queries but can't be easily converted to handle
>>>> double-byte character strings. This can be adapted for statd,
>>>> though I'm
>>>> not sure if the converted ASCII version of such names specifically
>>>> exclude '/'.
>>>>
>>>> Internationalized domain names themselves are still expressed in
>>>> UTF16,
>>>> as far as I understand it.
>>>
>>> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears
>>> to
>>> me that protocols (at the very least, any preexisting protocols) are
>>> all
>>> expected to use the ascii representation on the wire, and that the
>>> translation to unicode is meant for use by applications.
>>>
>>> So in our case we'd continue to expect ascii domain names on the
>>> wire,
>>> and I believe that's also what we should store in any database. But
>>> if
>>> someone were to write a gui administrative interface to that data,
>>> for
>>> example, they might choose to use idna for display.
>>
>> That's a reasonable and specific objection to my claim that our
>> current
>> host record storage format is inadequate to support IDNA. I've also
>> confirmed that ToAscii with the UseSTD3ASCIIRules flag set is not
>> supposed to generate a domain label string with a '/' in it. My
>> remaining concern here is that we could possibly see hostnames that
>> are
>> too long to be stored in directory entries of some file systems,
>> especially considering that the ASCII-fied Unicode names will be
>> longer
>> than typical ASCII names we normally encounter today.
>
> Googling around some more.... the normal limits for dns appear to be
> 63
> bytes per component, and 255 for the whole string, and those limits
> are
> still in force on the output of that mapping. I suspect this isn't a
> huge deal.

I bring this up because NI_MAXHOST, declared in /usr/include/netdb.h,
is 1025, not 255. You are probably correct, practically speaking.

>> What about multi-homed host support? The same mon_name can be used
>> with
>> more than one my_name, for multi-homed hosts. Using the current on-
>> disk
>> scheme, statd turns that SM_MON request into a no-op.
>
> I don't know what we're supposed to do in that case. You want to
> store
> them all so you can send notifies to them all on reboot?

Something like that. Basically I think we want to send SM_NOTIFY to
the monitored host _from_ every registered my_name we have for that
mon_name. The kernel will probably use a separate nsm_host for each
one of these, so statd should probably keep track of each of the
cookies as well.

Remembering which my_names have been used is important because the
remote often uses the sender's name or IP address to identify which
monitored host has rebooted.

We could throw up our hands and just keep track of all the my_names
that were used during the last reboot, and notify each mon_name from
all of those. That doesn't help with remembering the cookies, though,
and makes sending all notifications at reboot take longer.

>> So additional
>> records for the same hostname can't be stored, or we have to resort
>> to
>> adding multiple lines in the same file. This is possible to do
>> with just
>> POSIX file system calls, but it does add complexity to manage several
>> lines in each hostname file without increasing the risk of
>> corruption if
>> a file update (especially the deletion of one record in the middle)
>> is
>> interrupted.
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-09 19:14:04

by Jeff Layton

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 09 Sep 2009 14:39:59 -0400
Trond Myklebust wrote:

> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
> > On Wed, 05 Aug 2009 19:30:04 -0400
> > Trond Myklebust wrote:
> >
> > > On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > > > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > > > >> sqlite3 doesn't do anything special under the covers. It uses only
> > > > >> POSIX file access and locking calls, as far as I know. So I think
> > > > >> hosting /var on most well-behaved clustering file systems won't have
> > > > >> any problem with this arrangement.
> > > > >
> > > > > So we're basically introducing a dependency on a completely new
> > > > > library
> > > > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > > > no real reason for doing it other than because we want to move from
> > > > > using sync() to fsync()?
> > > > >
> > > > > Sounds like a NACK to me...
> > > >
> > > > Which library are you talking about, libsqlite3 or libtirpc? Because
> > > > NEITHER of those is in /lib.
> > >
> > > libsqlite is the problem. Unlike libtirpc, its utility has yet to be
> > > established.
> > >
> >
> > Sorry to revive this so late, but I think we need to come to some
> > sort of resolution here. The only missing piece for client side IPv6
> > support is statd...
> >
> > I'm not sure I understand the objection to using libsqlite3 here. We
> > certainly could roll our own routines to handle data storage, but why
> > would we want to do so? sqlite3 is quite good at what it does. Why
> > wouldn't we want to use it?
>
> Backwards compatibility is one major reason. statd already exists, and
> is in use out there. I shouldn't be forced to reboot all my clients when
> I upgrade the nfs-utils package on my server.
>

We could roll a conversion utility for this if it would help. nfs-utils
upgrades usually mean restarting statd anyway:

shut down old-statd
convert flat file db to sqlite
start up new-statd

...so I don't think we necessarily need to reboot the clients for this.
It should (in theory) be possible to do the reverse even.

> Simplicity is another reason. WTF do we need a full SQL database, when
> all we want to do is store 2 pieces of data (a hostname and a cookie)?
> It isn't as if this has been a major problem for us previously.
>

Been a little while since I took my initial look at new-statd, but I
see that Chuck has this (just a for-instance):

    rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_MONITOR_TABLENAME
                " (priv BLOB,"
                " mon_name TEXT NOT NULL,"
                " my_name TEXT NOT NULL,"
                " program INTEGER,"
                " version INTEGER,"
                " procedure INTEGER,"
                " protocol TEXT NOT NULL,"
                " state INTEGER,"
                " UNIQUE(mon_name,my_name));",
                -1, &stmt, NULL);

He's tracking some other info there too. Is this all necessary? Maybe
not now, but having a storage engine that can cope with tracking extra
info will make it easier to handle things like multihomed clients and
servers correctly (something that the existing statd is not very good
at at the moment).

> > > > In any event, it's not just sync(2) that is a problem. sync(2) by
> > > > itself is a boot performance problem, but it's the combination of
> > > > rename and sync that is known to be especially unreliable during
> > > > system crashes. Statd, being a crash monitor, shouldn't depend on
> > > > rename/sync to maintain persistent data in the face of system
> > > > instability. I'd call that a real reason to use something more robust.
> > >
> > > What are you talking about? Is this about the truncate + rename issue
> > > leaving empty files upon a crash?
> > > That issue is solved trivially by doing an fsync() before you rename the
> > > file. That entire discussion was about whether or not existing
> > > applications should be _required_ to do this kind of POSIX pedantry,
> > > when previously they could get away without it.
> > >
> > > IOW: that issue alone does not justify replacing the current simple file
> > > based scheme.
> > >
> >
> > There are other reasons, not to use the simple file-based scheme too...
> >
> > Internationalized domain names will be easier to deal with via sqlite3,
> > for instance.
>
> Please explain...
>

Well, we currently store statd info in flat files named with the
hostname. With an internationalized domain name, we may have a
multibyte character in that name. We could try to store that as an
ASCII or UTF8 name, but we'd have to roll conversion routines for it.
Why bother when we have a storage engine that does that work for us?

> > Certainly we could code this up ourselves, but what's the benefit to
> > doing that when we have a perfectly good data storage engine available?
>
> Why change something that works???? Rewriting from scratch is _NOT_ the
> Linux way, and has usually bitten us hard when we've done it.
>
> The 2.6.19 rewrite of the kernel mount code springs to mind...
>

A good point. Given who I work for, I *really* hate regressions since
they tend to eat up a lot of my time.

At some point however it becomes very difficult to patch up old code.
old-statd in particular hasn't seen the attention that it probably
should have.

I trust that Chuck will be willing to fix problems that come along. The
new code seems to be on par with the old in complexity so I don't think
we'd be taking on a great maintenance burden with it even if Chuck
isn't available.

Could we add IPv6 support as a patchset to the existing statd instead?
Sure. That patch would be smaller than Chuck's rewrite, but I still
think there are advantages to considering a major overhaul here.

--
Jeff Layton

2009-09-10 20:50:02

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, Sep 10, 2009 at 04:39:51PM -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
>> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>>> IDNs are UTF16. /var therefore has to support UTF16 filenames;
>>>>> either
>>>>> byte in a double-byte character can be '/' or '\0'. That means the
>>>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>>>>> and
>>>>> the system's locale has to be configured correctly. If we decide
>>>>> not to
>>>>> depend on the file system to support UTF16 filenames, then statd
>>>>> has
>>>>> to
>>>>> be intelligent enough to figure out how to deal with converting
>>>>> UTF16
>>>>> hostnames before storing them as filenames. Then, we have to teach
>>>>> matchhostname() and friends how to deal with double-byte character
>>>>> strings...
>>>>
>>>> Googling around.... Is this accurate?:
>>>>
>>>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>>
>>>> That makes it sound like domain names are staying ascii, and they're
>>>> just adding something on top to allow encoding unicode using ascii,
>>>> which may optionally be used by applications.
>>>
>>> There is a mechanism that provides an ASCII-ized version of domain
>>> names
>>> that may contain non-ASCII characters, expressly for applications
>>> that
>>> need to perform DNS queries but can't be easily converted to handle
>>> double-byte character strings. This can be adapted for statd,
>>> though I'm
>>> not sure if the converted ASCII version of such names specifically
>>> exclude '/'.
>>>
>>> Internationalized domain names themselves are still expressed in
>>> UTF16,
>>> as far as I understand it.
>>
>> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears
>> to
>> me that protocols (at the very least, any preexisting protocols) are
>> all
>> expected to use the ascii representation on the wire, and that the
>> translation to unicode is meant for use by applications.
>>
>> So in our case we'd continue to expect ascii domain names on the wire,
>> and I believe that's also what we should store in any database. But
>> if
>> someone were to write a gui administrative interface to that data, for
>> example, they might choose to use idna for display.
>
> That's a reasonable and specific objection to my claim that our current
> host record storage format is inadequate to support IDNA. I've also
> confirmed that ToAscii with the UseSTD3ASCIIRules flag set is not
> supposed to generate a domain label string with a '/' in it. My
> remaining concern here is that we could possibly see hostnames that are
> too long to be stored in directory entries of some file systems,
> especially considering that the ASCII-fied Unicode names will be longer
> than typical ASCII names we normally encounter today.

Googling around some more.... the normal limits for dns appear to be 63
bytes per component, and 255 for the whole string, and those limits are
still in force on the output of that mapping. I suspect this isn't a
huge deal.

> What about multi-homed host support? The same mon_name can be used with
> more than one my_name, for multi-homed hosts. Using the current on-disk
> scheme, statd turns that SM_MON request into a no-op.

I don't know what we're supposed to do in that case. You want to store
them all so you can send notifies to them all on reboot?

--b.

> So additional
> records for the same hostname can't be stored, or we have to resort to
> adding multiple lines in the same file. This is possible to do with just
> POSIX file system calls, but it does add complexity to manage several
> lines in each hostname file without increasing the risk of corruption if
> a file update (especially the deletion of one record in the middle) is
> interrupted.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>

2009-09-14 15:48:52

by Steve Dickson

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On 09/14/2009 03:08 AM, Neil Brown wrote:
>
> Some time ago Greg Banks went on a pursuit of warnings in nfs-utils
> and got rid of all of them -- except those generated by rpcgen.
> Have more been introduced?
The only other warnings I'm struggling with are the
warning: dereferencing pointer 'local_addr' does break strict-aliasing rules

in sm-notify.c, which seem to be a bunch of noise... But I would like
to get rid of them... Other than that, there are no warnings that
I know of...

steved.

2009-09-14 15:55:52

by Steve Dickson

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

> The problems with the code remain, and you will need to change it in
> order to make it acceptable. The question I haven't seen you asking, and
> that you should have been asking from the very start is "what would be the
> minimal set of changes?".

I too would like to see what a minimal change set would be... Just
to see how much a db is really needed... If at all..

Now with that said... I've taken on the task of how to deal with
pNFS exports and it might make sense that a db is needed to deal with
that complexity... So I'm not totally against the idea of introducing
a db to nfs-utils, I just think we need to do it for the right reason....

steved.

2009-09-09 18:40:07

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
> On Wed, 05 Aug 2009 19:30:04 -0400
> Trond Myklebust wrote:
>
> > On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > > >> sqlite3 doesn't do anything special under the covers. It uses only
> > > >> POSIX file access and locking calls, as far as I know. So I think
> > > >> hosting /var on most well-behaved clustering file systems won't have
> > > >> any problem with this arrangement.
> > > >
> > > > So we're basically introducing a dependency on a completely new
> > > > library
> > > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > > no real reason for doing it other than because we want to move from
> > > > using sync() to fsync()?
> > > >
> > > > Sounds like a NACK to me...
> > >
> > > Which library are you talking about, libsqlite3 or libtirpc? Because
> > > NEITHER of those is in /lib.
> >
> > libsqlite is the problem. Unlike libtirpc, its utility has yet to be
> > established.
> >
>
> Sorry to revive this so late, but I think we need to come to some
> sort of resolution here. The only missing piece for client side IPv6
> support is statd...
>
> I'm not sure I understand the objection to using libsqlite3 here. We
> certainly could roll our own routines to handle data storage, but why
> would we want to do so? sqlite3 is quite good at what it does. Why
> wouldn't we want to use it?

Backwards compatibility is one major reason. statd already exists, and
is in use out there. I shouldn't be forced to reboot all my clients when
I upgrade the nfs-utils package on my server.

Simplicity is another reason. WTF do we need a full SQL database, when
all we want to do is store 2 pieces of data (a hostname and a cookie)?
It isn't as if this has been a major problem for us previously.

> > > In any event, it's not just sync(2) that is a problem. sync(2) by
> > > itself is a boot performance problem, but it's the combination of
> > > rename and sync that is known to be especially unreliable during
> > > system crashes. Statd, being a crash monitor, shouldn't depend on
> > > rename/sync to maintain persistent data in the face of system
> > > instability. I'd call that a real reason to use something more robust.
> >
> > What are you talking about? Is this about the truncate + rename issue
> > leaving empty files upon a crash?
> > That issue is solved trivially by doing an fsync() before you rename the
> > file. That entire discussion was about whether or not existing
> > applications should be _required_ to do this kind of POSIX pedantry,
> > when previously they could get away without it.
> >
> > IOW: that issue alone does not justify replacing the current simple file
> > based scheme.
> >
>
> There are other reasons, not to use the simple file-based scheme too...
>
> Internationalized domain names will be easier to deal with via sqlite3,
> for instance.

Please explain...

> Certainly we could code this up ourselves, but what's the benefit to
> doing that when we have a perfectly good data storage engine available?

Why change something that works???? Rewriting from scratch is _NOT_ the
Linux way, and has usually bitten us hard when we've done it.

The 2.6.19 rewrite of the kernel mount code springs to mind...

Trond

2009-09-09 19:19:31

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 9, 2009, at 3:13 PM, Jeff Layton wrote:

> On Wed, 09 Sep 2009 14:39:59 -0400
> Trond Myklebust wrote:
>
>> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
>>> On Wed, 05 Aug 2009 19:30:04 -0400
>>> Trond Myklebust wrote:
>>>
>>>> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
>>>>> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
>>>>>> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>>>>>>> sqlite3 doesn't do anything special under the covers. It uses
>>>>>>> only
>>>>>>> POSIX file access and locking calls, as far as I know. So I
>>>>>>> think
>>>>>>> hosting /var on most well-behaved clustering file systems
>>>>>>> won't have
>>>>>>> any problem with this arrangement.
>>>>>>
>>>>>> So we're basically introducing a dependency on a completely new
>>>>>> library
>>>>>> that will have to be added to boot partitions/nfsroot/etc, and
>>>>>> we have
>>>>>> no real reason for doing it other than because we want to move
>>>>>> from
>>>>>> using sync() to fsync()?
>>>>>>
>>>>>> Sounds like a NACK to me...
>>>>>
>>>>> Which library are you talking about, libsqlite3 or libtirpc?
>>>>> Because
>>>>> NEITHER of those is in /lib.
>>>>
>>>> libsqlite is the problem. Unlike libtirpc, its utility has yet
>>>> to be established.
>>>>
>>>
>>> Sorry to revive this so late, but I think we need to come to some
>>> sort of resolution here. The only missing piece for client side IPv6
>>> support is statd...
>>>
>>> I'm not sure I understand the objection to using libsqlite3 here. We
>>> certainly could roll our own routines to handle data storage, but
>>> why
>>> would we want to do so? sqlite3 is quite good at what it does. Why
>>> wouldn't we want to use it?
>>
>> Backwards compatibility is one major reason. statd already exists,
>> and
>> is in use out there. I shouldn't be forced to reboot all my clients
>> when
>> I upgrade the nfs-utils package on my server.
>>
>
> We could roll a conversion utility for this if it would help. nfs-
> utils
> upgrades usually mean restarting statd anyway:
>
> shut down old-statd
> convert flat file db to sqlite
> start up new-statd
>
> ...so I don't think we necessarily need to reboot the clients for
> this.
> It should (in theory) be possible to do the reverse even.
>
>> Simplicity is another reason. WTF do we need a full SQL database,
>> when
>> all we want to do is store 2 pieces of data (a hostname and a
>> cookie)?
>> It isn't as if this has been a major problem for us previously.
>>
>
> Been a little while since I took my initial look at new-statd, but I
> see that Chuck has this (just a for-instance):
>
>     rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_MONITOR_TABLENAME
>                 " (priv BLOB,"
>                 " mon_name TEXT NOT NULL,"
>                 " my_name TEXT NOT NULL,"
>                 " program INTEGER,"
>                 " version INTEGER,"
>                 " procedure INTEGER,"
>                 " protocol TEXT NOT NULL,"
>                 " state INTEGER,"
>                 " UNIQUE(mon_name,my_name));",
>                 -1, &stmt, NULL);
>
>
> He's tracking some other info there too. Is this all necessary? Maybe
> not now, but having a storage engine that can cope with tracking extra
> info will make it easier to handle things like multihomed clients and
> servers correctly (something that the existing statd is not very good
> at at the moment).
>
>>>>> In any event, it's not just sync(2) that is a problem. sync(2) by
>>>>> itself is a boot performance problem, but it's the combination of
>>>>> rename and sync that is known to be especially unreliable during
>>>>> system crashes. Statd, being a crash monitor, shouldn't depend on
>>>>> rename/sync to maintain persistent data in the face of system
>>>>> instability. I'd call that a real reason to use something more
>>>>> robust.
>>>>
>>>> What are you talking about? Is this about the truncate + rename
>>>> issue leaving empty files upon a crash?
>>>> That issue is solved trivially by doing an fsync() before you
>>>> rename the file. That entire discussion was about whether or not
>>>> existing applications should be _required_ to do this kind of POSIX
>>>> pedantry, when previously they could get away without it.
>>>>
>>>> IOW: that issue alone does not justify replacing the current
>>>> simple file based scheme.
>>>>
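The fsync()-before-rename pattern referred to above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name write_file_durably() and the ".tmp" suffix are hypothetical and are not taken from nfs-utils or from either statd implementation. The sketch also leaves out EINTR handling and the fsync() on the containing directory that would be needed to make the rename itself durable.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from nfs-utils): replace the file at "path"
 * with "len" bytes from "data" so that a crash leaves either the old
 * contents or the new contents, never a truncated file.
 * Returns 0 on success, -1 on failure.
 */
static int write_file_durably(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    int n, fd;

    /* Keep the temporary file next to the target so that rename()
     * stays within a single file system. */
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Write everything, looping over short writes. */
    while (len > 0) {
        ssize_t written = write(fd, data, len);
        if (written < 0)
            goto fail;
        data += written;
        len -= (size_t)written;
    }

    /* Push the new contents to stable storage BEFORE the rename. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* Atomically point the final name at the fully written file. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A caller would use it as write_file_durably("/some/dir/hostname", buf, buflen), where the path is again only a placeholder.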
>>>
>>> There are other reasons not to use the simple file-based scheme,
>>> too...
>>>
>>> Internationalized domain names will be easier to deal with via
>>> sqlite3, for instance.
>>
>> Please explain...
>>
>
> Well, we currently store statd info in flat files named with the
> hostname. With an internationalized domain name, we may have a
> multibyte character in that name. We could try to store that as an
> ASCII or UTF8 name, but we'd have to roll conversion routines for it.
> Why bother when we have a storage engine that does that work for us?
>
>>> Certainly we could code this up ourselves, but what's the benefit to
>>> doing that when we have a perfectly good data storage engine
>>> available?
>>
>> Why change something that works???? Rewriting from scratch is _NOT_
>> the Linux way, and has usually bitten us hard when we've done it.
>>
>> The 2.6.19 rewrite of the kernel mount code springs to mind...
>>
>
> A good point. Given who I work for, I *really* hate regressions since
> they tend to eat up a lot of my time.
>
> At some point however it becomes very difficult to patch up old code.
> old-statd in particular hasn't seen the attention that it probably
> should have.
>
> I trust that Chuck will be willing to fix problems that come along.
> The new code seems to be on par with the old in complexity so I don't
> think we'd be taking on a great maintenance burden with it even if
> Chuck isn't available.
>
> Could we add IPv6 support as a patchset to the existing statd instead?
> Sure.

Not as easy as you might think.

The old statd open codes a bunch of the RPC client side, which means
all of that would have to be rewritten anyway.

> That patch would be smaller than Chuck's rewrite, but I still
> think there are advantages to considering a major overhaul here.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-10 15:03:23

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
> byte in a double-byte character can be '/' or '\0'. That means the
> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
> the system's locale has to be configured correctly. If we decide not to
> depend on the file system to support UTF16 filenames, then statd has to
> be intelligent enough to figure out how to deal with converting UTF16
> hostnames before storing them as filenames. Then, we have to teach
> matchhostname() and friends how to deal with double-byte character
> strings...

Googling around.... Is this accurate?:

http://en.wikipedia.org/wiki/Internationalized_domain_name

That makes it sound like domain names are staying ascii, and they're
just adding something on top to allow encoding unicode using ascii,
which may optionally be used by applications.

--b.

2009-09-10 15:05:16

by Ben Myers

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, Sep 10, 2009 at 06:44:00PM +1000, NeilBrown wrote:
> I could summarise a lot of your points by observing that the community
> values people over process. I really think that is the right place to
> put value, because people are richer and more flexible than process.

What a refreshing point of view. I hope it catches on.

> As for the use of sqlite ... I must admit that I wouldn't choose
> it.

A number of other projects support multiple backends. Maybe that makes
sense here.

-Ben

2009-09-10 16:14:57

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
>> byte in a double-byte character can be '/' or '\0'. That means the
>> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
>> the system's locale has to be configured correctly. If we decide not to
>> depend on the file system to support UTF16 filenames, then statd has to
>> be intelligent enough to figure out how to deal with converting UTF16
>> hostnames before storing them as filenames. Then, we have to teach
>> matchhostname() and friends how to deal with double-byte character
>> strings...
>
> Googling around.... Is this accurate?:
>
> http://en.wikipedia.org/wiki/Internationalized_domain_name
>
> That makes it sound like domain names are staying ascii, and they're
> just adding something on top to allow encoding unicode using ascii,
> which may optionally be used by applications.

There is a mechanism that provides an ASCII-ized version of domain
names that may contain non-ASCII characters, expressly for
applications that need to perform DNS queries but can't be easily
converted to handle double-byte character strings. This can be
adapted for statd, though I'm not sure if the converted ASCII version
of such names specifically excludes '/'.

Internationalized domain names themselves are still expressed in
UTF16, as far as I understand it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-09-10 16:23:30

Subject: Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>> IDNs are UTF16. /var therefore has to support UTF16 filenames; either
>>> byte in a double-byte character can be '/' or '\0'. That means the
>>> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
>>> the system's locale has to be configured correctly. If we decide not
>>> to depend on the file system to support UTF16 filenames, then statd
>>> has to be intelligent enough to figure out how to deal with converting
>>> UTF16 hostnames before storing them as filenames. Then, we have to
>>> teach matchhostname() and friends how to deal with double-byte
>>> character strings...
>>
>> Googling around.... Is this accurate?:
>>
>> http://en.wikipedia.org/wiki/Internationalized_domain_name
>>
>> That makes it sound like domain names are staying ascii, and they're
>> just adding something on top to allow encoding unicode using ascii,
>> which may optionally be used by applications.
>
> There is a mechanism that provides an ASCII-ized version of domain names
> that may contain non-ASCII characters, expressly for applications that
> need to perform DNS queries but can't be easily converted to handle
> double-byte character strings. This can be adapted for statd, though I'm
> not sure if the converted ASCII version of such names specifically
> excludes '/'.
>
> Internationalized domain names themselves are still expressed in UTF16,
> as far as I understand it.

From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears to
me that protocols (at the very least, any preexisting protocols) are all
expected to use the ascii representation on the wire, and that the
translation to unicode is meant for use by applications.

So in our case we'd continue to expect ascii domain names on the wire,
and I believe that's also what we should store in any database. But if
someone were to write a gui administrative interface to that data, for
example, they might choose to use idna for display.
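As a rough sketch of what storing only the ascii form could look like in practice, a name might be checked before it is used as a database key or a file name. The function below is hypothetical (it is not part of nfs-utils) and assumes names follow the usual letters-digits-hyphen host name convention. The "xn--" labels produced by the IDNA encoding are built from that same character set, so a name that passes cannot contain '/' or non-ASCII bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Longest label allowed in a DNS name. */
#define MAX_LABEL_LEN 63

/*
 * Hypothetical check (not part of nfs-utils): return true if "name"
 * consists only of dot-separated labels made of ASCII letters, digits
 * and hyphens. Empty labels (including a trailing dot) are rejected to
 * keep the sketch simple.
 */
static bool hostname_is_ldh(const char *name)
{
    size_t label_len = 0;

    if (name == NULL)
        return false;

    for (const char *p = name; *p != '\0'; p++) {
        char c = *p;

        if (c == '.') {
            if (label_len == 0)
                return false;   /* empty label, e.g. "a..b" */
            label_len = 0;
            continue;
        }

        bool ldh = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                   (c >= '0' && c <= '9') || c == '-';
        if (!ldh)
            return false;       /* rejects '/', spaces and non-ASCII bytes */

        if (++label_len > MAX_LABEL_LEN)
            return false;
    }

    /* The final label must be non-empty; this also rejects "". */
    return label_len > 0;
}
```

With this, "client1.example.com" and "xn--bcher-kva.example" are accepted, while "bad/name.example" or a raw UTF-8 name is refused and would need to be converted to its ascii form first.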

--b.

2009-09-10 16:43:43