LinuxLists.cc - Re: [opensuse] nfs_update_inode: inode X mode changed, Y to Z

2008-03-25 22:13:42

Subject: Re: [opensuse] nfs_update_inode: inode X mode changed, Y to Z

On Wed, Mar 26, 2008 at 08:38:22AM +1100, NeilBrown wrote:
...
> However you still need to do something about the generation number. It
> must be set to something.

Right.

> When you allocate an inode that doesn't currently exist on the device,
> you obviously cannot increment the old value and use that.

Makes sense.

> However you can do a lot better than always using 0.

I looked at the code (xfs_ialloc.c:xfs_ialloc_ag_alloc)

290 /*
291 * Set initial values for the inodes in this buffer.
292 */
293 xfs_biozero(fbuf, 0, ninodes << args.mp->m_sb.sb_inodelog);
294 for (i = 0; i < ninodes; i++) {
295 free = XFS_MAKE_IPTR(args.mp, fbuf, i);
296 free->di_core.di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
297 free->di_core.di_version = version;
298 free->di_next_unlinked = cpu_to_be32(NULLAGINO);
299 xfs_ialloc_log_di(tp, fbuf, i,
300 XFS_DI_CORE_BITS | XFS_DI_NEXT_UNLINKED);
301 }

xfs_biozero(...) turns into a memset(buf, 0, len), and since the loop that
follows doesn't change the generation number, it'll stay 0.

> The simplest would be to generate a 'random' number (get_random_bytes).
> Slightly better would be to generate a random number at boot time
> and use that, incrementing it each time it is used to set the
> generation number for an inode.

I'm not familiar enough with NFS, do you want something that's monotonically
increasing or do you just test for inequality? If it is inequality, why not
just use something like the jiffies - that should be unique enough.

> Even better would be store store that 'next generation number' in the
> superblock so there would be even less risk of the 'random' generation
> producing repeats.
> This is what ext3 does. It doesn't dynamically allocate inodes,
> but it doesn't want to pay the cost of reading an old inode from
> storage just to see what the generation number is. So it has
> a number in the superblock which is incremented on each inode allocation
> and is used as the generation number.

Something tells me that the SGI folks might not be all too happy with the
in-sb number...XFS tries to be as parallel as possible, and this would cause
the counter variable to bounce around their NUMA systems. Perhaps a per-ag
variable would be better, but I remember reading that parallelizing updates
to some inode count variable (I forget which) in the superblock
\cite{dchinner-ols2006} led to a rather big improvement. It's almost
morning down under, so I guess we'll get their comments on this soon.

Josef 'Jeff' Sipek.

--
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.
- Brian W. Kernighan

2008-03-25 23:09:51

by NeilBrown

[permalink] [raw]

Subject: Re: [opensuse] nfs_update_inode: inode X mode changed, Y to Z

On Wed, March 26, 2008 9:13 am, Josef 'Jeff' Sipek wrote:
> On Wed, Mar 26, 2008 at 08:38:22AM +1100, NeilBrown wrote:
> ...
>> However you still need to do something about the generation number. It
>> must be set to something.
>
> Right.
>
>> When you allocate an inode that doesn't currently exist on the device,
>> you obviously cannot increment the old value and use that.
>
> Makes sense.
>
>> However you can do a lot better than always using 0.
>
> I looked at the code (xfs_ialloc.c:xfs_ialloc_ag_alloc)
>
> 290 /*
> 291 * Set initial values for the inodes in this buffer.
> 292 */
> 293 xfs_biozero(fbuf, 0, ninodes <<
> args.mp->m_sb.sb_inodelog);
> 294 for (i = 0; i < ninodes; i++) {
> 295 free = XFS_MAKE_IPTR(args.mp, fbuf, i);
> 296 free->di_core.di_magic =
> cpu_to_be16(XFS_DINODE_MAGIC);
> 297 free->di_core.di_version = version;
> 298 free->di_next_unlinked =
> cpu_to_be32(NULLAGINO);
> 299 xfs_ialloc_log_di(tp, fbuf, i,
> 300 XFS_DI_CORE_BITS |
> XFS_DI_NEXT_UNLINKED);
> 301 }
>
> xfs_biozero(...) turns into a memset(buf, 0, len), and since the loop that
> follows doesn't change the generation number, it'll stay 0.
>
>> The simplest would be to generate a 'random' number (get_random_bytes).
>> Slightly better would be to generate a random number at boot time
>> and use that, incrementing it each time it is used to set the
>> generation number for an inode.
>
> I'm not familiar enough with NFS, do you want something that's
> monotonically
> increasing or do you just test for inequality? If it is inequality, why
> not
> just use something like the jiffies - that should be unique enough.
>

What we need is for the "filehandle" to be stable and unique.
By 'stable' I mean that every time I get the filehandle for a particular
file, I get the same string of bytes.
By 'uniqie' I mean that if I get two filehandles for two different
files, they must differ in at least one bit.
If a file is deleted and the inode is re-used for a new file, then the
old and new files are different and must have different file handles.

The filehandle is traditionally generated from the inode number and
a generation number, but the filesystem can actually do whatever it
likes. xfs does it with xfs_fs_encode_fh().

Certainly you could initialise the i_generation to jiffies in
xfs_ialloc_ag_alloc. That would be a suitable fix. get_random_bytes
might be better, but the difference probably wouldn't be noticeable.

NeilBrown

2008-03-26 03:38:21

by David Chinner

[permalink] [raw]

Subject: Re: [opensuse] nfs_update_inode: inode X mode changed, Y to Z

On Tue, Mar 25, 2008 at 06:13:21PM -0400, Josef 'Jeff' Sipek wrote:
> On Wed, Mar 26, 2008 at 08:38:22AM +1100, NeilBrown wrote:
> ...
> > However you still need to do something about the generation number. It
> > must be set to something.
.....
> > Even better would be store store that 'next generation number' in the
> > superblock so there would be even less risk of the 'random' generation
> > producing repeats.
> > This is what ext3 does. It doesn't dynamically allocate inodes,
> > but it doesn't want to pay the cost of reading an old inode from
> > storage just to see what the generation number is. So it has
> > a number in the superblock which is incremented on each inode allocation
> > and is used as the generation number.
>
> Something tells me that the SGI folks might not be all too happy with the
> in-sb number...
.....
> Perhaps a per-ag variable would be better,

/me goes back to the bug from last year about stable inode/gen numbers
for a HSM.

dgc> Right, except the last thing we want is yet more global state needing to
dgc> be updated in inode allocation. The best way to do this is a max generation
dgc> number per AG (held in the AGI) so that it can be updated at the same time
dgc> inodes are freed and not cause additional serialisation.

Which was soundly rejected by the HSM folk because it wraps at 4 billion
inode create/unlink cycles in an AG rather than per inode. The only thing
they were happy with was the old behaviour and so they now mount their
filesystems with ikeep. At that point the issue was dropped on the floor;
the NFS side of things apparently weren't causing any problems so we didn't
consider it urgent to fix....

Given this state of affairs (i.e. HSM using ikeep), I guess we can do
anything we want for the noikeep case. I'll cook up a patch that does
something similar to ext3 generation numbers for the initial seeding....

> but I remember reading that parallelizing updates
> to some inode count variable (I forget which) in the superblock
> \cite{dchinner-ols2006} led to a rather big improvement.

That was for in memory counters not on disk, and the problem really was
free block counts rather than free inode counts. Yes, I converted the
inode counters at the same time, but that wasn't the limiting factor.
Updates to the on disk superblock, OTOH, are a limiting factor and
that was the lazy superblock counter modifications solve....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-03-26 05:02:44

by David Chinner

[permalink] [raw]

Subject: Re: [opensuse] nfs_update_inode: inode X mode changed, Y to Z

On Wed, Mar 26, 2008 at 02:37:38PM +1100, David Chinner wrote:
> Given this state of affairs (i.e. HSM using ikeep), I guess we can do
> anything we want for the noikeep case. I'll cook up a patch that does
> something similar to ext3 generation numbers for the initial seeding....

Patch below for comments. It passes xfsqa, but there's no userspace
support for it yet. 2.6.26 is the likely target for this change.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---

Don't initialise new inode generation numbers to zero

When we allocation new inode chunks, we initialise the generation
numbers to zero. This works fine until we delete a chunk and then
reallocate it, resulting in the same inode numbers but with a
reset generation count. This can result in inode/generation
pairs of different inodes occurring relatively close together.

Given that the inode/gen pair makes up the "unique" portion of
an NFS filehandle on XFS, this can result in file handles cached
on clients being seen on the wire from the server but refer to
a different file. This causes .... issues for NFS clients.

Hence we need a unique generation number initialisation for
each inode to prevent reuse of a small portion of the generation
number space. Make this initialiser per-allocation group so
that it is not a single point of contention in the filesystem,
and increment it on every allocation within an AG to reduce the
chance that a generation number is reused for a given inode number
if the inode chunk is deleted and reallocated immediately
afterwards.

It is safe to add the agi_newinogen field to the AGI without
using a feature bit. If an older kernel is used, it simply
will not update the field on allocation. If the kernel is
updated and the field has garbage in it, then it's like having a
random seed to the generation number....

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_ag.h | 4 +++-
fs/xfs/xfs_ialloc.c | 30 ++++++++++++++++++++++--------
2 files changed, 25 insertions(+), 9 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h 2008-01-18 18:30:06.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h 2008-03-26 13:03:41.122918236 +1100
@@ -121,6 +121,7 @@ typedef struct xfs_agi {
* still being referenced.
*/
__be32 agi_unlinked[XFS_AGI_UNLINKED_BUCKETS];
+ __be32 agi_newinogen; /* inode cluster generation */
} xfs_agi_t;

#define XFS_AGI_MAGICNUM 0x00000001
@@ -134,7 +135,8 @@ typedef struct xfs_agi {
#define XFS_AGI_NEWINO 0x00000100
#define XFS_AGI_DIRINO 0x00000200
#define XFS_AGI_UNLINKED 0x00000400
-#define XFS_AGI_NUM_BITS 11
+#define XFS_AGI_NEWINOGEN 0x00000800
+#define XFS_AGI_NUM_BITS 12
#define XFS_AGI_ALL_BITS ((1 << XFS_AGI_NUM_BITS) - 1)

/* disk block (xfs_daddr_t) in the AG */
Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.c 2008-03-25 15:41:27.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c 2008-03-26 14:29:47.998554368 +1100
@@ -309,6 +309,8 @@ xfs_ialloc_ag_alloc(
free = XFS_MAKE_IPTR(args.mp, fbuf, i);
free->di_core.di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
free->di_core.di_version = version;
+ free->di_core.di_gen = agi->agi_newinogen;
+ be32_add_cpu(&agi->agi_newinogen, 1);
free->di_next_unlinked = cpu_to_be32(NULLAGINO);
xfs_ialloc_log_di(tp, fbuf, i,
XFS_DI_CORE_BITS | XFS_DI_NEXT_UNLINKED);
@@ -347,7 +349,8 @@ xfs_ialloc_ag_alloc(
* Log allocation group header fields
*/
xfs_ialloc_log_agi(tp, agbp,
- XFS_AGI_COUNT | XFS_AGI_FREECOUNT | XFS_AGI_NEWINO);
+ XFS_AGI_COUNT | XFS_AGI_FREECOUNT |
+ XFS_AGI_NEWINO | XFS_AGI_NEWINOGEN);
/*
* Modify/log superblock values for inode count and inode free count.
*/
@@ -896,11 +899,12 @@ nextag:
ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino + offset);
XFS_INOBT_CLR_FREE(&rec, offset);
rec.ir_freecount--;
+ be32_add_cpu(&agi->agi_newinogen, 1);
if ((error = xfs_inobt_update(cur, rec.ir_startino, rec.ir_freecount,
rec.ir_free)))
goto error0;
be32_add(&agi->agi_freecount, -1);
- xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
+ xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT | XFS_AGI_NEWINOGEN);
down_read(&mp->m_peraglock);
mp->m_perag[tagno].pagi_freecount--;
up_read(&mp->m_peraglock);
@@ -1320,6 +1324,11 @@ xfs_ialloc_compute_maxlevels(

/*
* Log specified fields for the ag hdr (inode section)
+ *
+ * We don't log the unlinked inode fields through here; they
+ * get logged directly to the buffer. Hence we have a discontinuity
+ * in the fields we are logging and we need two calls to map all
+ * the dirtied parts of the agi....
*/
void
xfs_ialloc_log_agi(
@@ -1342,22 +1351,27 @@ xfs_ialloc_log_agi(
offsetof(xfs_agi_t, agi_newino),
offsetof(xfs_agi_t, agi_dirino),
offsetof(xfs_agi_t, agi_unlinked),
+ offsetof(xfs_agi_t, agi_newinogen),
sizeof(xfs_agi_t)
};
+ int log_newino = fields & XFS_AGI_NEWINOGEN;
+
#ifdef DEBUG
xfs_agi_t *agi; /* allocation group header */

agi = XFS_BUF_TO_AGI(bp);
ASSERT(be32_to_cpu(agi->agi_magicnum) == XFS_AGI_MAGIC);
#endif
- /*
- * Compute byte offsets for the first and last fields.
- */
+ fields &= ~XFS_AGI_NEWINOGEN;
+
+ /* Compute byte offsets for the first and last fields. */
xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS, &first, &last);
- /*
- * Log the allocation group inode header buffer.
- */
xfs_trans_log_buf(tp, bp, first, last);
+ if (log_newino) {
+ xfs_btree_offsets(XFS_AGI_NEWINOGEN, offsets, XFS_AGI_NUM_BITS,
+ &first, &last);
+ xfs_trans_log_buf(tp, bp, first, last);
+ }
}

/*