2003-03-10 20:33:54

by Daniel Phillips

Subject: [RFC] Improved inode number allocation for HTree

Summary of the problem:

HTree's random relationship between directory entry order and inode order has
been observed to cause increased cache pressure, affecting performance under
certain conditions. Most of this pressure is due to inode table usage, which
is normally several times larger than the directory itself. Even when the
inode table usage is much less than available cache, it may still exceed the
size of the journal file (8 MB by default). Though the journal only becomes
involved when blocks are modified, unfortunately, because of atime updates,
this includes all directory operations. We could suggest to users that they
should disable atime updating if they care about performance, but we ought to
be able to do better than that.

To reduce this cache pressure, what we need to do is make the directory entry
order correspond more closely to the inode number order. It doesn't have to
correspond perfectly in order to show substantial improvement, as Tomas
demonstrated with his recent getdents sorting patch.

Analysis:

To understand why sorting relatively small sections of the directory entries
reduces the cache pressure, we can think about the lifetime of an inode table
block over the course of a directory traversal. This traversal lifetime
extends from when it is first accessed (always a write as noted above) to
when it is last accessed, or the directory traversal finishes. Each inode
table block contains multiple inodes, so if those inodes are accessed
randomly, the traversal lifetime of every inode table block will tend to be
from nearly the beginning of the traversal to nearly the end, and so the load
on the cache will be nearly the same as the number of inode table blocks. If
this load is greater than available cache, various blocks will have to be
written out and reloaded later for further processing. We can call the
period between loading and evicting an inode table block its "cache period".

The number of times a block must be written/reloaded depends not only on the
number of inode table blocks covered by the directory and the availability of
cache memory, but on the probable number of inodes within the block that
will be written during each of the block's cache periods. By sorting the
directory chunkwise, the probable number of inodes processed in one cache
period increases, reducing the number of cache periods, in turn reducing disk
seeking and transfer overhead.
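The traversal-lifetime argument can be sketched numerically. The following is a toy Python model, not ext3 code (the block and directory sizes are invented); it compares a sequential inode traversal with a random one and reports the average traversal lifetime of each inode table block:

```python
import random

def block_lifetimes(access_order, inodes_per_block):
    # Traversal lifetime of an inode table block: from the first access
    # of any inode it holds to the last.
    first, last = {}, {}
    for step, inode in enumerate(access_order):
        blk = inode // inodes_per_block
        first.setdefault(blk, step)
        last[blk] = step
    return [last[b] - first[b] for b in first]

random.seed(0)
n_inodes, per_block = 4096, 32

sequential = list(range(n_inodes))
shuffled = random.sample(sequential, n_inodes)

seq_avg = sum(block_lifetimes(sequential, per_block)) / (n_inodes // per_block)
rnd_avg = sum(block_lifetimes(shuffled, per_block)) / (n_inodes // per_block)
# Sequentially, each block is live for ~31 steps; in random order it is
# live for nearly the whole traversal.
print(round(seq_avg), round(rnd_avg))
```

With random order, the cache load is close to the full number of inode table blocks, exactly as described above.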

To apply the above model to the journal, just substitute "journal period" for
"cache period", where journal period is defined as the time from when a dirty
block first enters the journal to when it is written out to disk.

We can see that local sorting of directory chunks has hardly any effect on
the traversal lifetime of a block; it only reduces the number of cache
periods. The size of the chunk of directory that we sort locally presumably
stays constant, so the probable number of inodes on each inode table block
processed in each cache period approaches one as the directory size
increases. At some point, there is no longer any benefit from sorting, and we
arrive at that point at rather realistic directory sizes.
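A quick simulation illustrates why the benefit of fixed-size chunk sorting fades. This is a hypothetical model (the chunk and block sizes are assumptions, not the parameters of Tomas's actual patch): it sorts each chunk of a random traversal order and reports the average number of inodes processed per inode table block within a chunk:

```python
import random

random.seed(1)
per_block, chunk = 32, 1024  # inodes per table block; sorting chunk size

results = []
for n_inodes in (1024, 16384, 262144):
    order = random.sample(range(n_inodes), n_inodes)  # random entry order
    hits = []
    for i in range(0, n_inodes, chunk):
        piece = sorted(order[i:i + chunk])            # sort one chunk
        blocks = {ino // per_block for ino in piece}  # table blocks touched
        hits.append(len(piece) / len(blocks))         # inodes per block
    results.append(sum(hits) / len(hits))

# inodes processed per block per chunk falls toward 1 as the directory
# grows well past the (fixed) chunk size
print([round(r, 1) for r in results])
```

At the largest size the chunk sort delivers barely more than one inode per block per cache period, i.e. essentially no benefit over no sorting at all.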

We can always do the job perfectly by sorting the whole directory. This
would require changes in every getdents-using application that cares about
performance, to either make the getdents window as big as the directory or to
do the sort in user space. A spinoff benefit of doing the full sort is that
it becomes trivial to detect and eliminate duplicate entries. Higher memory
load is a drawback, as is the work required to update user space programs.
There will also be noticeable latency introduced by the sort, which didn't
appear before. If sorting is to be done in user space then we impose the
requirement on all filesystems - even those that do not use inode numbers
internally - that they supply an inode key for sorting, and that the supplied
sorting key be meaningful in terms of cache locality.

Proposed Solution:

What if we could maintain the inodes in the inode table in the same order as
the directory entries, permanently, so that no sorting is required? If we
could do this, the traversal lifetime of each inode table block would be the
same as the number of inodes it contains, and the cache load due to the inode
table would be exactly one block. That is, the directory traversal would
process each inode table block completely before moving on to the next one.
Though the current Ext3 filesystem structure does not support this directly,
we can approximate the effect quite closely within the existing framework.

Due to Ted's recent work, we are already traversing the directory in hash
order, as opposed to physical allocation order, because of the particular
approach adopted to address the telldir/seekdir problem. This is relatively
inexpensive. So now, what we would like to do is make the inode number of
each directory entry increase monotonically, corresponding to monotonically
increasing hash values. We also want to allocate those inode numbers densely
in the inode table, to minimize seeking. If we know beforehand how many
entries the directory will have, this is easy: when we create an entry, we
multiply its hash value by the total number of entries divided by the total
hash range (2**31), add in a base target, and use that as the goal to
allocate a new inode. Due to the randomness of the hash value, we will
sometimes be forced to allocate an inode number out of order, but hopefully
there won't be too many of those, and they won't be very far out of order.
We may also end up with some gaps in the inode table. In fact, we may
intentionally leave some slack in order to reduce the number of out-of-order
allocations. These problems are not serious.

The big problem is that we don't know the number of directory entries ahead
of time. To get around this, we can use the current size of the directory,
rounded up to a power of two as the size estimate. Each time the directory
increases to a new binary size we establish a new region in the inode table,
which will map to the to-be-created half of directory entries, roughly in
order.
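The goal computation described above can be sketched in a few lines. This is an illustrative model only, not the proposed ext3 patch; the function names are invented, and slack is omitted for simplicity:

```python
HASH_RANGE = 1 << 31  # htree hash values span [0, 2**31)

def size_estimate(current_entries):
    # round the current directory size up to the next power of two
    n = 1
    while n < current_entries:
        n <<= 1
    return n

def goal_inode(hash_value, current_entries, base):
    # Map the entry's hash linearly onto an inode-number window whose
    # size is the power-of-two estimate, starting at a base target.
    window = size_estimate(current_entries)
    return base + hash_value * window // HASH_RANGE

# a directory of ~1000 entries maps hashes onto a 1024-inode window
print(goal_inode(1 << 30, 1000, 10000))  # → 10512, halfway into the window
```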

Now, consider what happens to a group of directory entries created when the
directory was so small that it mapped to a single inode table block. After a
series of leaf splits, these former neighbors will end up far apart in the
hash order, but their inode numbers will still be ordered monotonically over
a traversal. The traversal lifetime of that group of entries will be nearly
the entire directory traversal, but there is only one such block.

When the directory became larger, some newly created entries were mapped to
two new inode table blocks, the traversal lifetimes of which are roughly half
the entire traversal. There are only two such blocks, and their lifetimes do
not overlap, so these two blocks only generate one block worth of cache load.

So it goes, with each new set of entries of size 2**i generating only one
additional block of cache load. The final cache load is O(log2 N), a
pleasingly small number.
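A rough simulation supports the O(log2 N) claim. The model below (all parameters invented for illustration) creates entries in power-of-two generations, gives each generation its own contiguous inode region ordered by hash, then traverses in global hash order and measures the peak number of inode table blocks that are simultaneously "live":

```python
import random

random.seed(2)
per_block = 32   # inodes per inode table block

# Entries arrive in generations of sizes 1, 1, 2, 4, ..., 2**13 (the
# directory doubles repeatedly up to 2**14 entries). Each generation
# gets its own contiguous inode region, ordered by hash within it.
entries, base = [], 0
for size in [1] + [1 << i for i in range(14)]:
    for j, h in enumerate(sorted(random.random() for _ in range(size))):
        entries.append((h, base + j))
    base += size

entries.sort()   # traverse in global hash order
first, last = {}, {}
for step, (_, ino) in enumerate(entries):
    blk = ino // per_block
    first.setdefault(blk, step)
    last[blk] = step

# peak number of inode table blocks whose traversal lifetimes overlap
events = sorted([(first[b], 1) for b in first] +
                [(last[b] + 1, -1) for b in first])
live = peak = 0
for _, delta in events:
    live += delta
    peak = max(peak, live)

n_blocks = base // per_block
print(peak, n_blocks)   # peak stays near log2(N); n_blocks is 512 here
```

Each generation contributes roughly one concurrently live block, so the peak tracks the number of doublings rather than the directory size.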

A very nice property of this inode allocation strategy is that the ordering
will tend to be reinforced as a directory ages, as opposed to slowly decaying
towards randomness as happens with linear directories.

So this approach is attractive, especially as it can be implemented in a
small number of lines of code. The main outstanding issue I have not
addressed is how to arrange things so that the inode allocation patterns of
neighbouring directories don't interfere with each other too much. I expect
that each time the directory grows to a new power of two size, we need to
scan the inode allocation map for a relatively unpopulated region of that
size and store the region's base in the directory header. Would we need to
store the entire history of allocation bases? Probably not. This aspect
requires further consideration.

Regards,

Daniel


2003-03-10 20:52:17

by John Bradford

Subject: Re: [RFC] Improved inode number allocation for HTree

> Though the journal only becomes involved when blocks are modified,
> unfortunately, because of atime updates, this includes all directory
> operations. We could suggest to users that they should disable
> atime updating if they care about performance, but we ought to be
> able to do better than that.

On a separate note, since atime updates are not usually very important
anyway, why not have an option to cache atime updates for a long time,
or until a write occurs anyway? Holding a large number of
atime updates in a write cache is generally not going to be a major
issue - the worst case if a partition isn't cleanly unmounted is that
some atimes will be wrong.

John.

2003-03-10 21:17:46

by Andreas Schwab

Subject: Re: [RFC] Improved inode number allocation for HTree

John Bradford <[email protected]> writes:

|> > Though the journal only becomes involved when blocks are modified,
|> > unfortunately, because of atime updates, this includes all directory
|> > operations. We could suggest to users that they should disable
|> > atime updating if they care about performance, but we ought to be
|> > able to do better than that.
|>
|> On a separate note, since atime updates are not usually very important
|> anyway, why not have an option to cache atime updates for a long time,
|> or until either a write occurs anyway.

mount -o noatime

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2003-03-10 21:19:24

by Daniel Phillips

Subject: Re: [RFC] Improved inode number allocation for HTree

On Mon 10 Mar 03 22:04, John Bradford wrote:
> > Though the journal only becomes involved when blocks are modified,
> > unfortunately, because of atime updates, this includes all directory
> > operations. We could suggest to users that they should disable
> > atime updating if they care about performance, but we ought to be
> > able to do better than that.
>
> On a separate note, since atime updates are not usually very important
> anyway, why not have an option to cache atime updates for a long time,
> or until either a write occurs anyway. Holding a large number of
> atime updates in a write cache is generally not going to be a major
> issue - the worst case if a partition isn't cleanly unmounted is that
> some atimes will be wrong.

It sounds practical. Why stop there? Since Ted is seriously considering
making a batch of incompatible extensions to the on-disk format anyway, how
about adding an atime table to each block group, four bytes per inode. Even
without lazy updating, it would cut down the dirty blocks generated by r/o
operations a lot. If actual atime is the latest of the atime table value and
the inode atime value[1], then inode write operations won't generate extra
traffic. You will only get new traffic when somebody wants the real atime.
I'd put this under the category of "things to add to Ted's long list of fun
new ideas for Ext3/4".

[1] How to handle wrapping is left as an exercise for the interested reader.
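The atime-table idea can be sketched as a toy model. Nothing below is ext3 code; the class and field names are invented, and wrap handling is ignored, as the footnote suggests:

```python
class BlockGroup:
    """Toy model of the proposed per-group atime table: four bytes per
    inode, living in its own block, so read-only traffic dirties the
    table block instead of the inode table."""

    def __init__(self, n_inodes):
        self.inode_atime = [0] * n_inodes    # stored in inode table blocks
        self.table_atime = [0] * n_inodes    # stored in the atime table
        self.dirty_inode_blocks = set()      # inode table blocks to journal
        self.table_dirty = False

    def touch(self, ino, now):
        # a read-only access updates only the atime table block
        self.table_atime[ino] = now
        self.table_dirty = True

    def atime(self, ino):
        # actual atime is the latest of the two recorded values
        return max(self.inode_atime[ino], self.table_atime[ino])

g = BlockGroup(32)
g.touch(3, 42)
print(g.atime(3), g.table_dirty, g.dirty_inode_blocks)  # → 42 True set()
```

A directory traversal would dirty one atime-table block per group instead of every inode table block it reads through.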

Regards,

Daniel

2003-03-10 21:36:00

by Bryan O'Sullivan

Subject: Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree

On Mon, 2003-03-10 at 13:33, Daniel Phillips wrote:

> It sounds practical. Why stop there?

Why start? Who actually uses atime for anything at all, other than the
tiny number of shops that care about moving untouched files to tertiary
storage?

Surely if you want to heap someone else's plate with work, you should
offer a reason why :-)

<b

2003-03-10 21:38:31

by John Bradford

Subject: Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree)

> |> > Though the journal only becomes involved when blocks are modified,
> |> > unfortunately, because of atime updates, this includes all directory
> |> > operations. We could suggest to users that they should disable
> |> > atime updating if they care about performance, but we ought to be
> |> > able to do better than that.
> |>
> |> On a separate note, since atime updates are not usually very important
> |> anyway, why not have an option to cache atime updates for a long time,
> |> or until either a write occurs anyway.
>
> mount -o noatime

Well, yes, I use noatime by default, but I was thinking that there
might be users who wanted to gain most of the performance that using
noatime would give, who still wanted access times available generally, but
who were not bothered about losing them on an unclean shutdown.

In fact, taking this one step further, we could assign directories
priorities, analogous to nice values for processes, and make the
laziness of writes dependent on that.

So, for example, on a webserver, you could assign databases a high
priority for writes, but your webserver log a low priority. If the
box crashed, the data most likely to be lost would be the webserver
log, rather than the database.

Another benefit would be that you could spin down disks, and not have
them spin up just to write data to the webserver log - that could be
committed to the disk just once an hour, but a database write could
cause it to spin up immediately.

John.

2003-03-10 21:52:23

by Matthew Wilcox

Subject: Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree

On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> Why start? Who actually uses atime for anything at all, other than the
> tiny number of shops that care about moving untouched files to tertiary
> storage?
>
> Surely if you want to heap someone else's plate with work, you should
> offer a reason why :-)

"You have new mail" vs "You have mail".

--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk

2003-03-11 08:36:56

by Jakob Oestergaard

Subject: Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree

On Mon, Mar 10, 2003 at 10:02:54PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > Why start? Who actually uses atime for anything at all, other than the
> > tiny number of shops that care about moving untouched files to tertiary
> > storage?
> >
> > Surely if you want to heap someone else's plate with work, you should
> > offer a reason why :-)
>
> "You have new mail" vs "You have mail".

And having a clean /tmp using tmpwatch. Everything in my /tmp not
accessed for a few weeks gets deleted. I move cruft to /tmp every
single day, and I have not cleaned it up for years. It just stays tidy
all by itself. Of course one has to remember this when leaving a few
database files there before taking a vacation ;)

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2003-03-11 10:34:38

by Hans Reiser

Subject: Re: [RFC] Improved inode number allocation for HTree

Let's make noatime the default for VFS.

Daniel Phillips wrote:

>Hi Hans,
>
>On Tue 11 Mar 03 00:17, Hans Reiser wrote:
>
>
>>What do you think of creating a new telldir/seekdir which uses filenames
>>instead of ints to convey position?
>>
>>
>
>What do you do if that file gets deleted in the middle of a traversal?
>
So what? It still tells you where to restart. Strictly speaking, you
would want to allow the filesystem to choose what it wants to return to
indicate position. For htree and reiserfs this would be a filename or
its hash, for ext2 without htree this would be a byte offset.

>
>If I were able to design Unix over again, I'd state that if you don't lock a
>directory before traversing it then it's your own fault if somebody changes
>it under you, and I would have provided an interface to inform you about your
>bad luck. Strictly wishful thinking. (There, it feels better now.)
>
We are designing Linux. You know, Microsoft and Steve Jobs continue to
design. We should too.

Needless change should be avoided. Failure to change when something is
known to be broken leads to being inferior to those who do change.

Let's design something that does it right. Or else SCO will be right in
saying that Linux is just a Unix ripoff by people who couldn't do
anything unless Unix had been written to tell them how to do it right.

--
Hans


2003-03-11 11:16:38

by John Bradford

Subject: Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree

> > > Why start? Who actually uses atime for anything at all, other than the
> > > tiny number of shops that care about moving untouched files to tertiary
> > > storage?

What about the situation where the primary storage is a device which
has a limited number of write-cycles. That's just the sort of
application where you might be archiving data to secondary or tertiary
storage, and it would be a big advantage to save on writes. If a file
is read every ten minutes, we could just update the atime once an
hour. That would save five writes an hour.

John.

2003-03-11 12:47:35

by Helge Hafting

Subject: Re: [RFC] Improved inode number allocation for HTree

Hans Reiser wrote:
> Let's make noatime the default for VFS.
>
> Daniel Phillips wrote:
[...]
>> If I were able to design Unix over again, I'd state that if you don't
>> lock a directory before traversing it then it's your own fault if
>> somebody changes it under you, and I would have provided an interface
>> to inform you about your bad luck. Strictly wishful thinking.
>> (There, it feels better now.)

I'm happy nobody _can_ lock a directory like that. Think of it - unable
to create or delete files while some slow-moving program is traversing
the directory? Ouch. Plenty of options for DOS attacks too.
And how to do "rm *.bak" if rm locks the dir for traversal?

Helge Hafting

2003-03-11 13:26:24

by Daniel Phillips

Subject: Re: [RFC] Improved inode number allocation for HTree

On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> Hans Reiser wrote:
> > Let's make noatime the default for VFS.
> >
> > Daniel Phillips wrote:
>
> [...]
>
> >> If I were able to design Unix over again, I'd state that if you don't
> >> lock a directory before traversing it then it's your own fault if
> >> somebody changes it under you, and I would have provided an interface
> >> to inform you about your bad luck. Strictly wishful thinking.
> >> (There, it feels better now.)
>
> I'm happy nobody _can_ lock a directory like that. Think of it - unable
> to create or delete files while some slow-moving program is traversing
> the directory? Ouch. Plenty of options for DOS attacks too.
> And how to do "rm *.bak" if rm locks the dir for traversal?

<wishful thinking>
Now that you mention it, just locking out create and rename during directory
traversal would eliminate the pain. Delete is easy enough to handle during
traversal. For a B-Tree, coalescing could simply be deferred until the
traversal is finished, so reading the directory in physical storage order
would be fine. Way, way cleaner than what we have to do now.
</wishful thinking>

Regards,

Daniel

2003-03-11 18:11:04

by Andreas Dilger

Subject: Re: [RFC] Improved inode number allocation for HTree

On Mar 11, 2003 14:41 +0100, Daniel Phillips wrote:
> On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> > I'm happy nobody _can_ lock a directory like that. Think of it - unable
> > to create or delete files while some slow-moving program is traversing
> > the directory? Ouch. Plenty of options for DOS attacks too.
> > And how to do "rm *.bak" if rm locks the dir for traversal?
>
> <wishful thinking>
> Now that you mention it, just locking out create and rename during directory
> traversal would eliminate the pain. Delete is easy enough to handle during
> traversal. For a B-Tree, coalescing could simply be deferred until the
> traversal is finished, so reading the directory in physical storage order
> would be fine. Way, way cleaner than what we have to do now.
> </wishful thinking>

The other problem, of course, is that this would essentially mean that a
user-space application has a lock that prevents other processes (even the
kernel) from doing anything. At best we have to worry about an application
going sour, and at worst it is a DOS hole as Helge says.

How about something like:

death:
    for each directory
        while readdir(directory, large_buffer)
            if (directory)
                run death(directory)
        readdir(directory, small_buffer)

    sleep forever

The first getdents walks recursively down to the leaves, and then does a
single getdents with enough space to hold a single dirent, locking the
directory, but then never progressing. It also does the single-dirent
getdents on the way up the tree. Now, everything is locked out of the
entire directory tree.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-03-11 19:26:34

by Helge Hafting

Subject: Re: [RFC] Improved inode number allocation for HTree

On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote:
> On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> > I'm happy nobody _can_ lock a directory like that. Think of it - unable
> > to create or delete files while some slow-moving program is traversing
> > the directory? Ouch. Plenty of options for DOS attacks too.
> > And how to do "rm *.bak" if rm locks the dir for traversal?
>
> <wishful thinking>
> Now that you mention it, just locking out create and rename during directory
> traversal would eliminate the pain. Delete is easy enough to handle during
> traversal. For a B-Tree, coalescing could simply be deferred until the
> traversal is finished, so reading the directory in physical storage order
> would be fine. Way, way cleaner than what we have to do now.
> </wishful thinking>

Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry
about. Lock for traversal, get stuck unable to work on the files.

A "dream" solution for this might involve something like:
Directory is traversed in some well defined order (defined by the fs)
Removing a file (or moving it out of the directory) leaves
a "empty" directory entry that isn't reclaimed until no more
traversals is in progress in that directory.

New files can be created, and will be created further out than
any traversal in progress so nothing will be missed.

This approach don't lock anything out, but I guess someone evil still
could cause trouble by keeping a traversal going forever, creating one dummy
file and deleting one whenever it makes progress. The directory would
get big, filled up with placeholders until some ulimit kicks in.

Helge Hafting

2003-03-11 20:04:05

by Pascal Schmidt

Subject: Re: [RFC] Improved inode number allocation for HTree

On Tue, 11 Mar 2003 20:40:14 +0100, you wrote in linux.kernel:

> Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry
> about. Lock for traversal, get stuck unable to work on the files.

In both cases, the shell does the traversal and passes a complete list
of files to rm or mv... so rm and mv themselves don't need to do any
directory traversal.

--
Ciao,
Pascal

2003-03-11 20:04:33

by Daniel Phillips

Subject: Re: [RFC] Improved inode number allocation for HTree

On Tue 11 Mar 03 20:39, Helge Hafting wrote:
> On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote:
> > <wishful thinking>
> > Now that you mention it, just locking out create and rename during
> > directory traversal would eliminate the pain. Delete is easy enough to
> > handle during traversal. For a B-Tree, coalescing could simply be
> > deferred until the traversal is finished, so reading the directory in
> > physical storage order would be fine. Way, way cleaner than what we have
> > to do now.
> > </wishful thinking>
>
> Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry
> about.

That's equivalent to ls; the shell expands it before doing any moving. You
can construct something more interesting with ls | xargs <something nasty>
into the same directory. Since the user is trying to shoot their foot off,
let the lock be recursive, and let them do that.

> ...someone evil still
> could cause trouble by keeping a traversal going forever, creating one
> dummy file and deleting one whenever it makes progress. The directory
> would get big, filled up with placeholders until some ulimit kicks in.
>
> Helge Hafting

It's not clear that's any more evil than things they can do already, eg,

seq 1 1000000 | xargs -l1 cp -a /usr

Welllll, this is all idle speculation anyway. Nobody's going to fix the
flaw this week, if it's even possible (I suspect it is).

Regards,

Daniel

2003-03-11 21:15:11

by Hans Reiser

Subject: atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree )

Daniel Phillips wrote:

>On Tue 11 Mar 03 14:00, Helge Hafting wrote:
>
>
>>Hans Reiser wrote:
>>
>>
>>>Let's make noatime the default for VFS.
>>>
>>>Daniel Phillips wrote:
>>>
>>>
>>[...]
>>
>>
>>
>>>>If I were able to design Unix over again, I'd state that if you don't
>>>>lock a directory before traversing it then it's your own fault if
>>>>somebody changes it under you, and I would have provided an interface
>>>>to inform you about your bad luck. Strictly wishful thinking.
>>>>(There, it feels better now.)
>>>>
>>>>
>>I'm happy nobody _can_ lock a directory like that. Think of it - unable
>>to create or delete files while some slow-moving program is traversing
>>the directory? Ouch. Plenty of options for DOS attacks too.
>>And how to do "rm *.bak" if rm locks the dir for traversal?
>>
>>
>
><wishful thinking>
>Now that you mention it, just locking out create and rename during directory
>traversal would eliminate the pain. Delete is easy enough to handle during
>traversal. For a B-Tree, coalescing could simply be deferred until the
>traversal is finished, so reading the directory in physical storage order
>would be fine. Way, way cleaner than what we have to do now.
></wishful thinking>
>
>Regards,
>
>Daniel
>
You would want a directory listing command that is a single system
call, and then that could be made isolated, without risk of userspace
failing to unlock. Then you have to worry about very large directories
straining things in user space, but you can probably have the user
specify the maximum size directory his process has space for, etc.

In general, DOS issues are what makes it difficult to introduce
transactions into the kernel. We are grappling with that now in
reiser4. It seems that the formula is to create atomic plugins, and
then it is the system administrator's responsibility to verify that he
can live with whatever DOS vulnerabilities it has before he installs it
in his kernel (or our responsibility if it is sent to us for inclusion
in the main kernel). Allowing arbitrary filesystem operations to be
combined into one atomic transaction seems problematic for either user
space or the kernel, depending on what you do.

In general, allowing user space to lock things means that you trust user
space to unlock. This creates all sorts of trust troubles, and if you
force the unlock after some timeout, then the user space application
becomes vulnerable to DOS from other processes causing it to exceed the
timeout.

Ideas on this are welcome.

--
Hans


2003-03-11 23:39:05

by Jamie Lokier

Subject: Re: atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree )

Hans Reiser wrote:
> Allowing arbitrary filesystem operations to be
> combined into one atomic transaction seems problematic for either user
> space or the kernel, depending on what you do.
>
> In general, allowing user space to lock things means that you trust user
> space to unlock. This creates all sorts of trust troubles, and if you
> force the unlock after some timeout, then the user space application
> becomes vulnerable to DOS from other processes causing it to exceed the
> timeout.
>
> Ideas on this are welcome.

You can allow user space to begin a transaction, do some operations
and end a transaction, possibly returning an "abort" result which
means userspace should assume the transaction did not commit any
results and/or whatever was read in the transaction was not reliable.

On the face of it this leaves userspace susceptible to DOS or indeed
fairness/livelock problems. For example if another program is always
changing a directory entry, how can you read that whole directory
in a transaction?

Fairness/livelock problems are hard to avoid with any kinds of lock.
Even the kernel's internal locks have these problems in corner cases
(for example, remember when gettimeofday()'s clock access had to be
converted from using a spinlock to a sequence lock - and that still
doesn't _guarantee_ there is no problem in principle, it just reduces
the probability in all reasonable scenarios).

However, some remedies can be applied to filesystem transactions. If
an operation would cause some other task's transaction to eventually
return an abort code, consider sleeping for a short duration.
Randomise that duration. If the other transaction(s) have been
aborting repeatedly, consider lengthening the sleep duration and/or
specifically waiting for the other transaction to complete, to boost
the other task(s)' likelihood of transaction success. Randomise this
decision too. If you know something about the type of other
transactions (such as it is trying to implement a read-write lock by
doing atomic operations on bytes in a file), consider exactly what
policy you hope to offer (writer preference? reader preference?
something in between?)
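The randomized-backoff remedy might look something like this sketch (the function name, retry count, and delay constants are assumptions, not any real kernel interface):

```python
import random
import time

def run_transaction(attempt, max_tries=8, base_delay=0.001):
    # Retry an aborting transaction with a randomized, growing sleep,
    # giving competing transactions a window in which to commit.
    for i in range(max_tries):
        if attempt():
            return True
        # randomized exponential backoff: longer after repeated aborts
        time.sleep(random.uniform(0, base_delay * (2 ** i)))
    return False
```

A fuller version would add the escalation described above, specifically waiting for the conflicting transaction after repeated aborts; the sketch keeps only the randomized-delay part.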

By which point it has remarkable similarities to the problems of
fairness in the task scheduler, and fairness/livelock in locks.

-- Jamie

2003-03-14 21:45:00

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree)

Hi,

On Mon, 2003-03-10 at 21:50, John Bradford wrote:

> Well, yes, I use noatime by default, but I was thinking that there
> might be users who wanted to gain most of the performance that using
> noatime would give, who still wanted access times available generally, but
> who were not bothered about losing them on an unclean shutdown.

I've got new inode-flushing code which somewhat does that already. The
code in

http://people.redhat.com/sct/patches/ext3-2.4/dev-20030314/40-iflush-sct/

allows us to defer inode writes, primarily to eliminate redundant
copy-to-buffer-cache operations in mark_inode_dirty; but it has the
side-effect of allowing us to defer atime updates quite safely.

I'm currently integrating all the latest bits and pieces of orlov and
htree work into that set of patches to provide a stable base, and the
next order of business is to get ACL and O_DIRECT diffs in for 2.4
too. But beyond that, I need to get down to some serious performance
work with ext3, and the inode flush code is a major part of that.

--Stephen

2003-03-14 21:46:27

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree

Hi,

On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote:
> On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > Why start? Who actually uses atime for anything at all, other than the
> > tiny number of shops that care about moving untouched files to tertiary
> > storage?
> >
> > Surely if you want to heap someone else's plate with work, you should
> > offer a reason why :-)
>
> "You have new mail" vs "You have mail".

"nodiratime" can still help there.

--Stephen

2003-03-15 08:28:56

by jw schultz

Subject: Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree

On Fri, Mar 14, 2003 at 09:57:12PM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote:
> > On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > > Why start? Who actually uses atime for anything at all, other than the
> > > tiny number of shops that care about moving untouched files to tertiary
> > > storage?
> > >
> > > Surely if you want to heap someone else's plate with work, you should
> > > offer a reason why :-)
> >
> > "You have new mail" vs "You have mail".
>
> "nodiratime" can still help there.

As may noatime. noatime only prevents the automatic update
of atime on read. It doesn't (at least in my experience)
prevent utime(2) from updating the atime field.

Using noatime works quite well, at least with mutt, which
explicitly uses utime(2) to update atime. I cannot be
certain, but mutt may actually work better with noatime,
because then accesses by other tools (*biff & co) won't break the
mailer's notion of new mail (mtime > atime).
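The utime(2) behaviour described here is easy to demonstrate from user space. The sketch below (Python for brevity; the file name is invented) sets a file's atime explicitly, which works even on a noatime mount because noatime only suppresses the implicit update on read:

```python
import os

# noatime suppresses the automatic atime update on read(); an explicit
# utime() still sets atime, which is how a mailer can keep the
# "mtime > atime means no new mail" convention alive.
path = "mbox.demo"
with open(path, "w") as f:
    f.write("mail\n")

st = os.stat(path)
# mark the mailbox "read": push atime an hour past mtime
os.utime(path, (st.st_mtime + 3600, st.st_mtime))

st2 = os.stat(path)
print(st2.st_atime > st2.st_mtime)  # → True
os.remove(path)
```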

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt