2011-05-10 01:23:53

by Robert Morell

Subject: mmap vs. ctime bug?

I tracked an intermittent failure in one of our build systems down to
questionable kernel behavior.

The makefile for the build seems completely reasonable. It essentially does
this (greatly simplified):
output: $(OBJECTS)
	ld -o output $(OBJECTS)
	$(POSTPROCESS) output

tarball.tgz: output
	tar zcf tarball.tgz output

$(POSTPROCESS) in this case is just a program that modifies some ELF headers.
This program does so through libelf, but the important part is that libelf
operates on the file using mmap().

The problem is that the "tar" step sometimes fails with the error:
/bin/tar: output: file changed as we read it

As tar adds a file to an archive, it first stats the file, reads the entire
contents, then stats it again, and it reports the above error if the ctime
differs between the two stat calls. In the intermittent-failure case, the
ctime reported by stat(1) really has changed between the two calls.
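
(For reference, the check tar performs is roughly the following; this is a
sketch of the logic, not tar's actual source:)

    /* Sketch: stat the file, read it all, stat again, and complain if the
     * inode change time moved between the two calls. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int archive_file(const char *path)
    {
        struct stat before, after;
        char buf[65536];
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &before) < 0)
            return -1;
        while (read(fd, buf, sizeof(buf)) > 0)
            ;   /* ...the data would be written into the archive here... */
        if (fstat(fd, &after) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        if (before.st_ctime != after.st_ctime) {
            fprintf(stderr, "%s: file changed as we read it\n", path);
            return 1;
        }
        return 0;
    }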

Adding a sync between the postprocess program's termination and the tar
invocation "fixes" the problem, but it adds a significant delay to the overall
build time, so I'd prefer not to do that.

I was able to reproduce the behavior with a simple test case (attached) on
the latest git kernel, built from 26822eebb25. To run the test, simply put
test.c and the Makefile in a new directory and run "make runtest". Note that
the filesystem blocks and ctime change between the two stat invocations,
although the mtime remains the same:

# make runtest
gcc test.c -o test
rm -f out
./test out
stat out
File: `out'
Size: 268435456 Blocks: 377096 IO Block: 4096 regular file
Device: 304h/772d Inode: 655367 Links: 1
Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-05-09 18:06:24.000000000 -0700
Modify: 2011-05-09 18:06:27.000000000 -0700
Change: 2011-05-09 18:06:27.000000000 -0700
sync
stat out
File: `out'
Size: 268435456 Blocks: 524808 IO Block: 4096 regular file
Device: 304h/772d Inode: 655367 Links: 1
Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-05-09 18:06:24.000000000 -0700
Modify: 2011-05-09 18:06:27.000000000 -0700
Change: 2011-05-09 18:06:28.000000000 -0700

(Note: depending on your system, you may need to increase the "SIZE" constant
in test.c to make the ctime change visible at a resolution of 1 s.)
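
(The attached test.c is not reproduced here; below is a minimal sketch of the
kind of reproducer described above. The details, including SIZE, are
illustrative and assume the same shape as the attachment: create the file,
extend it, dirty it through a shared writable mapping, and exit without
syncing.)

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SIZE (256 * 1024 * 1024)   /* illustrative; large enough that
                                          writeback is still in flight when
                                          the program exits */

    int main(int argc, char **argv)
    {
        int fd;
        char *p;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || ftruncate(fd, SIZE) < 0)
            return 1;
        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 0xab, SIZE);         /* dirty every page via the mapping */
        /* No msync() here: the dirty pages, and the block allocations they
         * will trigger, are left to background writeback after exit. */
        munmap(p, SIZE);
        close(fd);
        return 0;
    }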


Does this seem like a bug to anyone else? For the normal "make" flow to work
properly, files really need to be done changing by the time a process exits
and wait(2) returns to the parent. The heavy-handed workaround of adding a
sync(1) throws away much of the potential benefit of the filesystem cache.
Adding an msync(MS_SYNC) call in the toy test app also "fixes" the problem,
but that's not feasible in the production environment, since libelf does the
modification internally, and besides, it seems like it shouldn't be necessary.
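
(In the sketch above, that msync() workaround amounts to one extra call before
the unmap; everything else stays the same:)

        /* Workaround: force the dirty pages out, and with them the block
         * allocation and ctime update, before the process exits. */
        if (msync(p, SIZE, MS_SYNC) < 0)
            return 1;
        munmap(p, SIZE);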

If it matters, the filesystem is a dead simple ext3 with no special mount
flags, but I suspect this is not specific to the filesystem:
# mount
/dev/hda4 on / type ext3 (rw)
[...]

Thanks,
Robert


Attachments:
test.c (892.00 B)
Makefile (114.00 B)

2011-05-11 16:37:27

by Jan Kara

Subject: Re: mmap vs. ctime bug?

Hello,

On Mon 09-05-11 18:23:48, [email protected] wrote:
> I tracked an intermittent failure in one of our build systems down to
> questionable kernel behavior.
>
> The makefile for the build seems completely reasonable. It essentially does
> this (greatly simplified):
> output: $(OBJECTS)
> ld -o output $(OBJECTS)
> $(POSTPROCESS) output
>
> tarball.tgz: output
> tar zcf tarball.tgz output
>
> $(POSTPROCESS) in this case is just a program that modifies some ELF headers.
> This program does so through libelf, but the important part is that libelf
> operates on the file using mmap().
>
> The problem is that the "tar" step sometimes fails with the error:
> /bin/tar: output: file changed as we read it
>
> As tar is adding a file to a tarball, it first stats the file, reads the entire
> file, then stats it again. It reports the above error if the ctime does not
> match between the two stat calls. In the case of the intermittent failure, the
> ctime does not match for the file as reported by stat(1).
>
> Adding a sync between the postprocess program's termination and the tar
> invocation "fixes" the problem, but adds a significant delay to the overall
> build time, so I'd prefer to not do that.
>
> I was able to reproduce the behavior with a simple test case (attached) with
> the latest git kernel built from 26822eebb25. To run the test, simply
> put test.c and the Makefile in a new directory and run "make runtest".
> Note that the filesystem blocks and ctime change between the two stat
> invocations, although the mtime remains the same:
>
> # make runtest
> gcc test.c -o test
> rm -f out
> ./test out
> stat out
> File: `out'
> Size: 268435456 Blocks: 377096 IO Block: 4096 regular file
> Device: 304h/772d Inode: 655367 Links: 1
> Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2011-05-09 18:06:24.000000000 -0700
> Modify: 2011-05-09 18:06:27.000000000 -0700
> Change: 2011-05-09 18:06:27.000000000 -0700
> sync
> stat out
> File: `out'
> Size: 268435456 Blocks: 524808 IO Block: 4096 regular file
> Device: 304h/772d Inode: 655367 Links: 1
> Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2011-05-09 18:06:24.000000000 -0700
> Modify: 2011-05-09 18:06:27.000000000 -0700
> Change: 2011-05-09 18:06:28.000000000 -0700
>
> (note: depending on your system, you may need to tweak the "SIZE" constant in
> test.c up to see ctime actually change at a resolution of 1s)
>
>
> Does this seem like a bug to anyone else? For the normal "make" flow to work
> properly, files really need to be done changing by the time a process exits and
> wait(3) returns to the parent. The heavy-hammer workaround of adding a
> sync(1) throws away a ton of potential benefit from the filesystem cache.
> Adding an msync(MS_SYNC) in the toy test app also "fixes" the problem, but
> that's not feasible in the production environment since libelf is doing the
> modification internally and besides, it seems like it shouldn't be necessary.
>
> If it matters, the filesystem is a dead simple ext3 with no special mount
> flags, but I suspect this is not specific to FS:
OK, so let me explain what happens: when a sparse file is created and
written to via mmap, we just store the data in memory. Later, we decide it's
time to store the data on disk, and thus we allocate blocks for it. At this
point we also update ctime and mtime, naturally, since the amount of space
occupied by the file has changed. I've looked at the specification, and it
says:

    The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
    MAP_SHARED will be updated after a write to the mapped region, and
    before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
    occurs.

So although I can see why the combination of this behavior and your
libelf+tar use case causes problems, the kernel behaves according to the
spec, and I don't think changing the kernel is the right solution. I'd
rather think you should be able to disable the ctime check in tar.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-11 16:24:47

by Robert Morell

Subject: Re: mmap vs. ctime bug?

On Wed, May 11, 2011 at 03:43:58AM -0700, Jan Kara wrote:
> Hello,
>
> On Mon 09-05-11 18:23:48, [email protected] wrote:

[...]

> >
> > I was able to reproduce the behavior with a simple test case (attached) with
> > the latest git kernel built from 26822eebb25. To run the test, simply
> > put test.c and the Makefile in a new directory and run "make runtest".
> > Note that the filesystem blocks and ctime change between the two stat
> > invocations, although the mtime remains the same:
> >
> > # make runtest
> > gcc test.c -o test
> > rm -f out
> > ./test out
> > stat out
> > File: `out'
> > Size: 268435456 Blocks: 377096 IO Block: 4096 regular file
> > Device: 304h/772d Inode: 655367 Links: 1
> > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2011-05-09 18:06:24.000000000 -0700
> > Modify: 2011-05-09 18:06:27.000000000 -0700
> > Change: 2011-05-09 18:06:27.000000000 -0700
> > sync
> > stat out
> > File: `out'
> > Size: 268435456 Blocks: 524808 IO Block: 4096 regular file
> > Device: 304h/772d Inode: 655367 Links: 1
> > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2011-05-09 18:06:24.000000000 -0700
> > Modify: 2011-05-09 18:06:27.000000000 -0700
> > Change: 2011-05-09 18:06:28.000000000 -0700
> >
> > (note: depending on your system, you may need to tweak the "SIZE" constant in
> > test.c up to see ctime actually change at a resolution of 1s)
> >
> >
> > Does this seem like a bug to anyone else? For the normal "make" flow to work
> > properly, files really need to be done changing by the time a process exits and
> > wait(3) returns to the parent. The heavy-hammer workaround of adding a
> > sync(1) throws away a ton of potential benefit from the filesystem cache.
> > Adding an msync(MS_SYNC) in the toy test app also "fixes" the problem, but
> > that's not feasible in the production environment since libelf is doing the
> > modification internally and besides, it seems like it shouldn't be necessary.
> >
> > If it matters, the filesystem is a dead simple ext3 with no special mount
> > flags, but I suspect this is not specific to FS:
> OK, so let me explain what happens: When a sparse file is created and
> written to via mmap, we just store the data in memory. Later, we decide
> it's time to store the data on disk and thus we allocate blocks for the
> data. At this point we also update ctime and mtime - naturally since the

Note that mtime has not changed, only ctime.

> amount of space occupied by the file has changed. I've looked at the
> specification and it says:
> The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> MAP_SHARED will be updated after a write to the mapped region, and
> before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> occurs.

Sure, that makes sense while the file is still mapped. But after munmap and
close, it seems like all updates should at least be visible as far as
software is concerned (the cache and dirty-page writeback should be
transparent).

If we want to quote specifications, see
http://pubs.opengroup.org/onlinepubs/9699919799/, section 4.8, "File Times
Update":

    [...]
    An implementation may update timestamps that are marked for update
    immediately, or it may update such timestamps periodically. At the point
    in time when an update occurs, any marked timestamps shall be set to the
    current time and the update marks shall be cleared. All timestamps that
    are marked for update shall be updated when the file ceases to be open
    by any process or before a fstat(), fstatat(), fsync(), futimens(),
    lstat(), stat(), utime(), utimensat(), or utimes() is successfully
    performed on the file.

> So although I can see why the combination of this behavior and your
> libelf+tar usecase causes problems the kernel behaves according to the spec
> and I don't think changing the kernel is the right solution. I'd rather
> think that you should be able to disable the ctime check in tar.

This really breaks basic assumptions about process lifetime and I/O. In the
basic shell flow:
$ ./a && ./b
when b is invoked, it is assumed that a has terminated and that any I/O it
performed will be visible if b tries to read it. (I assume the shell achieves
this with wait()/waitpid().) Again, it is not guaranteed that the output has
been flushed to disk, but the cache should be transparent to software.

- Robert

2011-05-11 17:01:34

by Jan Kara

Subject: Re: mmap vs. ctime bug?

On Wed 11-05-11 09:24:41, [email protected] wrote:
> On Wed, May 11, 2011 at 03:43:58AM -0700, Jan Kara wrote:
> > > I was able to reproduce the behavior with a simple test case (attached) with
> > > the latest git kernel built from 26822eebb25. To run the test, simply
> > > put test.c and the Makefile in a new directory and run "make runtest".
> > > Note that the filesystem blocks and ctime change between the two stat
> > > invocations, although the mtime remains the same:
> > >
> > > # make runtest
> > > gcc test.c -o test
> > > rm -f out
> > > ./test out
> > > stat out
> > > File: `out'
> > > Size: 268435456 Blocks: 377096 IO Block: 4096 regular file
> > > Device: 304h/772d Inode: 655367 Links: 1
> > > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
> > > Access: 2011-05-09 18:06:24.000000000 -0700
> > > Modify: 2011-05-09 18:06:27.000000000 -0700
> > > Change: 2011-05-09 18:06:27.000000000 -0700
> > > sync
> > > stat out
> > > File: `out'
> > > Size: 268435456 Blocks: 524808 IO Block: 4096 regular file
> > > Device: 304h/772d Inode: 655367 Links: 1
> > > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
> > > Access: 2011-05-09 18:06:24.000000000 -0700
> > > Modify: 2011-05-09 18:06:27.000000000 -0700
> > > Change: 2011-05-09 18:06:28.000000000 -0700
> > >
> > > (note: depending on your system, you may need to tweak the "SIZE" constant in
> > > test.c up to see ctime actually change at a resolution of 1s)
> > >
> > >
> > > Does this seem like a bug to anyone else? For the normal "make" flow to work
> > > properly, files really need to be done changing by the time a process exits and
> > > wait(3) returns to the parent. The heavy-hammer workaround of adding a
> > > sync(1) throws away a ton of potential benefit from the filesystem cache.
> > > Adding an msync(MS_SYNC) in the toy test app also "fixes" the problem, but
> > > that's not feasible in the production environment since libelf is doing the
> > > modification internally and besides, it seems like it shouldn't be necessary.
> > >
> > > If it matters, the filesystem is a dead simple ext3 with no special mount
> > > flags, but I suspect this is not specific to FS:
> > OK, so let me explain what happens: When a sparse file is created and
> > written to via mmap, we just store the data in memory. Later, we decide
> > it's time to store the data on disk and thus we allocate blocks for the
> > data. At this point we also update ctime and mtime - naturally since the
>
> Note that mtime has not changed, only ctime.
Yep, because the data was not modified. Only the inode has been modified
by the writeback.

> > amount of space occupied by the file has changed. I've looked at the
> > specification and it says:
> > The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> > MAP_SHARED will be updated after a write to the mapped region, and
> > before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> > occurs.
>
> Sure, that makes sense while the file is still mapped. But after
> munmap and close, it seems like all updates should at least be updated
> as far as software is concerned (the cache and dirty page writeback
> should be transparent).
I agree with the transparency as far as data is concerned. But it simply
cannot work for metadata; we don't know some things (like the number of used
blocks) until the data is actually written out.

> If we want to quote specifications, see:
> http://pubs.opengroup.org/onlinepubs/9699919799/
>
> "Section 4.8 "File Times Update"
> [...]
> An implementation may update timestamps that are marked for update
> immediately, or it may update such timestamps periodically. At the point
> in time when an update occurs, any marked timestamps shall be set to the
> current time and the update marks shall be cleared. All timestamps that
> are marked for update shall be updated when the file ceases to be open
> by any process or before a fstat(), fstatat(), fsync(), futimens(),
> lstat(), stat(), utime(), utimensat(), or utimes() is successfully
> performed on the file."
But the allocation of a disk block (and thus a change to the inode) happens
after the file is closed. So the timestamp is marked for update after the
file is closed, and we are consistent with the above paragraph. In fact we
should not skip updating the timestamp, because then applications would miss
that the metadata in the inode has actually changed.

> > So although I can see why the combination of this behavior and your
> > libelf+tar usecase causes problems the kernel behaves according to the spec
> > and I don't think changing the kernel is the right solution. I'd rather
> > think that you should be able to disable the ctime check in tar.
>
> This really breaks basic assumptions about process lifetime and I/O. In
> the basic shell flow:
> $ ./a && ./b
> When b is invoked, it is assumed that a has been terminated and any
> I/O it has performed will be reflected if b tries to read it. (I assume
> the shell achieves this with wait(pid)?()). Again, it is not guaranteed
> that the output be flushed to disk, but the cache should be transparent
> to software.
Again, the cache is transparent for data, not for metadata. So if b depends
on metadata changed by a, things get complicated. There are some basic things
defined by POSIX, but apart from that all bets are off. Basically the only
way to get some guarantees is to use fsync/sync, which is dumb, but that's
how it is. Sorry. If you wanted perfectly consistent metadata behavior, the
kernel would basically have to fsync the file behind the scenes, and people
certainly would not like that.
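
(The narrowest form of that guarantee would be to fsync just the one output
file rather than sync(1) the whole system; a sketch of a tiny hypothetical
helper, here called fsync-one, that the build could run between the
postprocess step and tar:)

    /* fsync-one.c: hypothetical helper, flush a single file (data and
     * metadata) instead of sync(1)-ing everything. Usage: ./fsync-one FILE
     * On Linux, fsync() on a read-only descriptor is enough to flush dirty
     * pages written through another process's (now unmapped) mapping. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fsync(fd) < 0) {
            perror(argv[1]);
            return 1;
        }
        close(fd);
        return 0;
    }

(Of course this still forces the data to disk, which is exactly the cost at
issue.)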

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-11 17:37:33

by Robert Morell

Subject: Re: mmap vs. ctime bug?

On Wed, May 11, 2011 at 10:01:30AM -0700, Jan Kara wrote:

[...]

> I agree with the transparency as far as data is concerned. But it simply
> cannot work for metadata - we don't know some things (like the number of
> used blocks) in advance until the file is written.
>
> > If we want to quote specifications, see:
> > http://pubs.opengroup.org/onlinepubs/9699919799/
> >
> > "Section 4.8 "File Times Update"
> > [...]
> > An implementation may update timestamps that are marked for update
> > immediately, or it may update such timestamps periodically. At the point
> > in time when an update occurs, any marked timestamps shall be set to the
> > current time and the update marks shall be cleared. All timestamps that
> > are marked for update shall be updated when the file ceases to be open
> > by any process or before a fstat(), fstatat(), fsync(), futimens(),
> > lstat(), stat(), utime(), utimensat(), or utimes() is successfully
> > performed on the file."
> But the allocation of a disk block (and thus a change to the inode) happens
> after the file is closed. So the timestamp is marked for update after the
> file is closed and we are consistent with the above paragraph. In fact we
> should not avoid updating the time stamp because then applications would
> miss that the metadata information in inode has actually changed.

Practically speaking, does anything that monitors ctimes actually care
about st_blocks changes? Certainly tar and other similar backup or
archive-type programs shouldn't care; they only care about data that can
be restored on a new filesystem. Maybe an acceptable change would be to
simply not trigger ctime updates based solely on disk block allocations?

I realize that this is not spec-compliant, since the "file status" has
changed, but this behavior could be selected with a filesystem mount option
that turns strict ctime behavior back on, similar to strictatime.

> > > So although I can see why the combination of this behavior and your
> > > libelf+tar usecase causes problems the kernel behaves according to the spec
> > > and I don't think changing the kernel is the right solution. I'd rather
> > > think that you should be able to disable the ctime check in tar.
> >
> > This really breaks basic assumptions about process lifetime and I/O. In
> > the basic shell flow:
> > $ ./a && ./b
> > When b is invoked, it is assumed that a has been terminated and any
> > I/O it has performed will be reflected if b tries to read it. (I assume
> > the shell achieves this with wait(pid)?()). Again, it is not guaranteed
> > that the output be flushed to disk, but the cache should be transparent
> > to software.
> Again, cache is transparent for data, not for metadata. So if b is
> dependent on metadata changed by a, things get complicated. There are some
> basic things defined by POSIX but apart from that all bets are off. Basically
> the only way to get some guarantees is to use fsync/sync which is dumb but
> that's how it is. Sorry. If you wanted that perfectly metadata consistent
> behavior, kernel would have to basically fsync the file behind the scenes
> and people certainly would not like that.

fsync/sync are much heavier-weight than should be necessary, though. None of
the data actually has to hit the disk; filesystem blocks are, at the end of
the day, just software state, so requiring disk latency here is rather
unfortunate. Alternatively, something like an fstatsync() that tar could call
on its files would be sufficient as well.
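
(Spelled out, the interface being suggested might look like this; fstatsync()
does not exist, it is just the proposal above made concrete:)

    /* Hypothetical, not a real syscall: make any pending timestamp/metadata
     * updates for fd (e.g. the ctime bump that deferred block allocation
     * will cause) visible to stat(), without forcing the data to disk. */
    int fstatsync(int fd);

    /* tar could then do: fstat(); read the file; fstatsync(); fstat() again,
     * and compare timestamps that can no longer change underneath it. */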

Thanks,
Robert

2011-05-12 12:24:52

by Jan Kara

Subject: Re: mmap vs. ctime bug?

On Wed 11-05-11 10:37:26, Robert Morell wrote:
> On Wed, May 11, 2011 at 10:01:30AM -0700, Jan Kara wrote:
> > I agree with the transparency as far as data is concerned. But it simply
> > cannot work for metadata - we don't know some things (like the number of
> > used blocks) in advance until the file is written.
> >
> > > If we want to quote specifications, see:
> > > http://pubs.opengroup.org/onlinepubs/9699919799/
> > >
> > > "Section 4.8 "File Times Update"
> > > [...]
> > > An implementation may update timestamps that are marked for update
> > > immediately, or it may update such timestamps periodically. At the point
> > > in time when an update occurs, any marked timestamps shall be set to the
> > > current time and the update marks shall be cleared. All timestamps that
> > > are marked for update shall be updated when the file ceases to be open
> > > by any process or before a fstat(), fstatat(), fsync(), futimens(),
> > > lstat(), stat(), utime(), utimensat(), or utimes() is successfully
> > > performed on the file."
> > But the allocation of a disk block (and thus a change to the inode) happens
> > after the file is closed. So the timestamp is marked for update after the
> > file is closed and we are consistent with the above paragraph. In fact we
> > should not avoid updating the time stamp because then applications would
> > miss that the metadata information in inode has actually changed.
>
> Practically speaking, does anything that monitors ctimes actually care
> about st_blocks changes? Certainly tar and other similar backup or
> archive-type programs shouldn't care; they only care about data that can
> be restored on a new filesystem. Maybe an acceptable change would be to
> simply not trigger ctime updates based solely on disk block allocations?
Yes, tar shouldn't care, but if you had e.g. a disk-space monitor, it could
care. Actually, if tar cared only about the data, it would check mtime; I
guess it also cares about owners, permissions, etc., which is why it checks
ctime.

> I realize that this is not spec-compliant since "file status" has
> changed, but this behavior could be tweaked with filesystem mount
> options to turn on struct ctime behavior, similar to strictatime.
This is possible, but you'll run into problems when you want to run both tar
and the hypothetical disk monitor...

> > > > So although I can see why the combination of this behavior and your
> > > > libelf+tar usecase causes problems the kernel behaves according to the spec
> > > > and I don't think changing the kernel is the right solution. I'd rather
> > > > think that you should be able to disable the ctime check in tar.
> > >
> > > This really breaks basic assumptions about process lifetime and I/O. In
> > > the basic shell flow:
> > > $ ./a && ./b
> > > When b is invoked, it is assumed that a has been terminated and any
> > > I/O it has performed will be reflected if b tries to read it. (I assume
> > > the shell achieves this with wait(pid)?()). Again, it is not guaranteed
> > > that the output be flushed to disk, but the cache should be transparent
> > > to software.
> > Again, cache is transparent for data, not for metadata. So if b is
> > dependent on metadata changed by a, things get complicated. There are some
> > basic things defined by POSIX but apart from that all bets are off. Basically
> > the only way to get some guarantees is to use fsync/sync which is dumb but
> > that's how it is. Sorry. If you wanted that perfectly metadata consistent
> > behavior, kernel would have to basically fsync the file behind the scenes
> > and people certainly would not like that.
>
> fsync/sync are much heavier-weight than should be necessary, though.
> None of the data has to actually hit the disk; filesystem blocks are at
> the end of the day just software state; requiring disk latency here is
> rather unfortunate.
It may seem so, but then you have to guarantee that you don't expose stale
data after a system crash, so you have to be sure the data hits the disk
before the metadata change does. You also have to guarantee that after a
crash the filesystem can be replayed to a consistent state. On top of that,
filesystem allocation structures are shared among the processes that modify
them. All these requirements add up to a rather complex system which isn't
that easy to handle, and the theoretically simple "it's just software state"
ends up being a tough practical problem.

> An alternative fstatsync() or so that tar could
> call on its files would be sufficient as well.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR