2013-05-07 23:24:10

by Daniel Phillips

[permalink] [raw]
Subject: Tux3 Report: Faster than tmpfs, what?

When something sounds too good to be true, it usually is. But not always. Today
Hirofumi posted some nigh on unbelievable dbench results that show Tux3
beating tmpfs. To put this in perspective, we normally regard tmpfs as
unbeatable because it is just a thin shim between the standard VFS mechanisms
that every filesystem must use, and the swap device. Our usual definition of
successful optimization is that we end up somewhere between Ext4 and Tmpfs,
or in other words, faster than Ext4. This time we got an excellent surprise.

The benchmark:

dbench -t 30 -c client2.txt 1 & (while true; do sync; sleep 4; done)

Configuration:

KVM with two CPUs and 4 GB memory running on a Sandy Bridge four core host
at 3.4 GHz with 8 GB of memory. Spinning disk. (Disk drive details
to follow.)

Summary of results:

tmpfs: Throughput 1489.00 MB/sec max_latency=1.758 ms
tux3: Throughput 1546.81 MB/sec max_latency=12.950 ms
ext4: Throughput 1017.84 MB/sec max_latency=1441.585 ms

Tux3 edged out Tmpfs and stomped Ext4 righteously. What is going on?
Simple: Tux3 has a frontend/backend design that runs on two CPUs. This
allows handing off some of the work of unlink and delete to the kernel tux3d,
which runs asynchronously from the dbench task. All Tux3 needs to do in the
dbench context is set a flag in the deleted inode and add it to a dirty
list. The remaining work like truncating page cache pages is handled by the
backend tux3d. The effect is easily visible in the dbench details below
(See the Unlink and Deltree lines).
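
To make the hand-off concrete, here is a toy userspace model of the idea
(this is only an illustration with made-up names, not actual Tux3 code):
the frontend does nothing but flag the inode and put it on a list, while a
separate backend thread picks up the list and does the expensive cleanup.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct model_inode {
    int ino;
    int deleted;                  /* flag set by the frontend         */
    struct model_inode *next;     /* link on the backend's dirty list */
};

static struct model_inode *dirty_list;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t list_kick = PTHREAD_COND_INITIALIZER;

/* Frontend path: cheap, returns to the caller immediately. */
static void frontend_unlink(struct model_inode *inode)
{
    pthread_mutex_lock(&list_lock);
    inode->deleted = 1;
    inode->next = dirty_list;
    dirty_list = inode;
    pthread_cond_signal(&list_kick);
    pthread_mutex_unlock(&list_lock);
}

/* Backend thread: does the expensive part (think: truncating page cache
 * pages, freeing blocks) asynchronously from the caller. */
static void *backend(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&list_lock);
        while (!dirty_list)
            pthread_cond_wait(&list_kick, &list_lock);
        struct model_inode *todo = dirty_list;
        dirty_list = NULL;
        pthread_mutex_unlock(&list_lock);

        while (todo) {
            struct model_inode *next = todo->next;
            printf("backend: cleaning up deleted inode %d\n", todo->ino);
            free(todo);
            todo = next;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, backend, NULL);

    for (int n = 1; n <= 3; n++) {
        struct model_inode *inode = calloc(1, sizeof(*inode));
        inode->ino = n;
        frontend_unlink(inode);   /* returns without doing the cleanup */
    }
    sleep(1);                     /* give the toy backend time to drain */
    return 0;
}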

It is hard to overstate how pleased we are with these results, particularly
after our first dbench tests a couple of days ago were embarrassing: more than
five times slower than Ext4. The issue turned out to be inefficient inode
allocation. Hirofumi changed the horribly slow itable btree search to a
simple "allocate the next inode number" counter, and shazam! The slowpoke
became a superstar. This comes with a caveat: the code that produced these
numbers currently relies on a benchmark-specific hack to speed up inode
number allocation. However, we are pretty sure that our production inode
allocation algorithm will add insignificant overhead versus this temporary
hack, if only because "allocate the next inode number" is nearly always
the best strategy.
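
For the curious, the interim counter really is as trivial as it sounds.
A minimal sketch (hypothetical names, not the actual Tux3 code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static atomic_uint_fast64_t next_ino = 64;   /* first free inode number (assumed) */

/* Allocate by handing out the next number; no itable btree search. */
static uint_fast64_t alloc_inum(void)
{
    return atomic_fetch_add(&next_ino, 1);
}

int main(void)
{
    printf("%llu %llu %llu\n",
           (unsigned long long)alloc_inum(),
           (unsigned long long)alloc_inum(),
           (unsigned long long)alloc_inum());
    return 0;
}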

With directory indexing now considered a solved problem, the only big
issue we feel needs to be addressed before offering Tux3 for merge is
allocation. For now we use the same overly simplistic strategy to allocate
both disk blocks and inode numbers, which is trivially easy to defeat and
thereby generate horrible benchmark numbers on spinning disk. So the next round
of work, which I hope will only take a few weeks, consists of improving
these allocators to at least a somewhat respectable level.

For inode number allocation, I have proposed a strategy that looks a lot
like Ext2/3/4 inode bitmaps. Tux3's twist is that these bitmaps are just
volatile cache objects, never transferred to disk. My expectation is that the
overhead of allocating from these bitmaps will hardly affect today's
benchmark numbers at all, but that remains to be proven.
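
To illustrate, here is a rough sketch of the kind of in-memory bitmap
allocator I have in mind (chunk size and names are invented for the
example, and nothing here would ever be written to disk):

#include <stdint.h>
#include <stdio.h>

#define INUMS_PER_CHUNK 4096

struct inum_bitmap {
    uint64_t base;                           /* first inode number this chunk covers      */
    unsigned char bits[INUMS_PER_CHUNK / 8]; /* one bit per inode number, in memory only  */
};

/* Claim the first free inode number at or after 'goal' in this chunk. */
static int64_t bitmap_alloc_inum(struct inum_bitmap *map, uint64_t goal)
{
    if (goal >= map->base + INUMS_PER_CHUNK)
        return -1;

    unsigned i = goal > map->base ? (unsigned)(goal - map->base) : 0;

    for (; i < INUMS_PER_CHUNK; i++) {
        if (!(map->bits[i / 8] & (1u << (i % 8)))) {
            map->bits[i / 8] |= 1u << (i % 8);
            return (int64_t)(map->base + i);
        }
    }
    return -1;  /* chunk full: the caller would move on to the next chunk */
}

int main(void)
{
    struct inum_bitmap chunk = { .base = 64 };

    /* Sequential creates degenerate to "allocate the next inode number". */
    for (int n = 0; n < 4; n++)
        printf("allocated inode %lld\n", (long long)bitmap_alloc_inum(&chunk, 64));
    return 0;
}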

Detailed dbench results:

tux3:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    1477980     0.003    12.944
Close        1085650     0.001     0.307
Rename         62579     0.006     0.288
Unlink        298496     0.002     0.345
Deltree           38     0.083     0.157
Mkdir             19     0.001     0.002
Qpathinfo    1339597     0.002     0.468
Qfileinfo     234761     0.000     0.231
Qfsinfo       245654     0.001     0.259
Sfileinfo     120379     0.001     0.342
Find          517948     0.005     0.352
WriteX        736964     0.007     0.520
ReadX        2316653     0.002     0.499
LockX           4812     0.002     0.207
UnlockX         4812     0.001     0.221
Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms

tmpfs:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    1423080     0.004     1.155
Close        1045354     0.001     0.578
Rename         60260     0.007     0.470
Unlink        287392     0.004     0.607
Deltree           36     0.651     1.352
Mkdir             18     0.001     0.002
Qpathinfo    1289893     0.002     0.575
Qfileinfo     226045     0.000     0.346
Qfsinfo       236518     0.001     0.383
Sfileinfo     115924     0.001     0.405
Find          498705     0.007     0.614
WriteX        709522     0.005     0.679
ReadX        2230794     0.002     1.271
LockX           4634     0.002     0.021
UnlockX         4634     0.001     0.324
Throughput 1489 MB/sec 1 clients 1 procs max_latency=1.758 ms

ext4:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX     988446     0.005    29.226
Close         726028     0.001     0.247
Rename         41857     0.011     0.238
Unlink        199651     0.022  1441.552
Deltree           24     1.517     3.358
Mkdir             12     0.002     0.002
Qpathinfo     895940     0.003    15.849
Qfileinfo     156970     0.001     0.429
Qfsinfo       164303     0.001     0.210
Sfileinfo      80501     0.002     1.037
Find          346400     0.010     2.885
WriteX        492615     0.009    13.676
ReadX        1549654     0.002     0.808
LockX           3220     0.002     0.015
UnlockX         3220     0.001     0.010
Throughput 1017.84 MB/sec 1 clients 1 procs max_latency=1441.585 ms

Apologies for the formatting. I will get back to a real mailer soon.

Regards,

Daniel


2013-05-10 04:51:10

by Dave Chinner

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

On Tue, May 07, 2013 at 04:24:05PM -0700, Daniel Phillips wrote:
> When something sounds too good to be true, it usually is. But not always. Today
> Hirofumi posted some nigh on unbelievable dbench results that show Tux3
> beating tmpfs. To put this in perspective, we normally regard tmpfs as
> unbeatable because it is just a thin shim between the standard VFS mechanisms
> that every filesystem must use, and the swap device. Our usual definition of
> successful optimization is that we end up somewhere between Ext4 and Tmpfs,
> or in other words, faster than Ext4. This time we got an excellent surprise.
>
> The benchmark:
>
> dbench -t 30 -c client2.txt 1 & (while true; do sync; sleep 4; done)

I'm deeply suspicious of what is in that client2.txt file. dbench on
ext4 on a 4 SSD RAID0 array with a single process gets 130MB/s
(kernel is 3.9.0). Your workload gives you over 1GB/s on ext4.....

> tux3:
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 1477980 0.003 12.944
....
> ReadX 2316653 0.002 0.499
> LockX 4812 0.002 0.207
> UnlockX 4812 0.001 0.221
> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms

Hmmm... No "Flush" operations. Gotcha - you've removed the data
integrity operations from the benchmark.

Ah, I get it now - you've done that so the front end of tux3 won't
encounter any blocking operations and so can offload 100% of
operations. It also explains the sync call every 4 seconds to keep
tux3 back end writing out to disk so that a) all the offloaded work
is done by the sync process and not measured by the benchmark, and
b) so the front end doesn't overrun queues and throttle or run out
of memory.

Oh, so nicely contrived. But terribly obvious now that I've found
it. You've carefully crafted the benchmark to demonstrate a best
case workload for the tux3 architecture, then carefully not
measured the overhead of the work tux3 has offloaded, and then not
disclosed any of this, in the hope that the headline is all people
will look at.

This would make a great case study for a "BenchMarketing For
Dummies" book.

Shame for you that you sent it to a list where people see the dbench
numbers for ext4 and immediately think "that's not right" and then
look deeper. Phoronix might swallow your sensationalist headline
grab without analysis, but I don't think I'm alone in my suspicion
that there was something stinky about your numbers.

Perhaps in future you'll disclose such information with your
results, otherwise nobody is ever going to trust anything you say
about tux3....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-05-10 05:09:51

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

Aloha hardcore coders

Thank you very much for working out the facts, Dave.
You confirmed why I have had, all these years, such a suspicious feeling
when reading between the lines of the Tux3 e-mails sent to the mailing
list. That does not mean I dislike the work around the Tux3 file system
in general. Quite the contrary, it is highly interesting to watch whether
there are possibilities to bring the whole field further. But this kind
of marketing, as seen in the past, is truly not constructive.

Have fun in the sun
Christian Stroetmann

> On Tue, May 07, 2013 at 04:24:05PM -0700, Daniel Phillips wrote:
>> When something sounds too good to be true, it usually is. But not always. Today
>> Hirofumi posted some nigh on unbelievable dbench results that show Tux3
>> beating tmpfs. To put this in perspective, we normally regard tmpfs as
>> unbeatable because it is just a thin shim between the standard VFS mechanisms
>> that every filesystem must use, and the swap device. Our usual definition of
>> successful optimization is that we end up somewhere between Ext4 and Tmpfs,
>> or in other words, faster than Ext4. This time we got an excellent surprise.
>>
>> The benchmark:
>>
>> dbench -t 30 -c client2.txt 1 & (while true; do sync; sleep 4; done)
> I'm deeply suspicious of what is in that client2.txt file. dbench on
> ext4 on a 4 SSD RAID0 array with a single process gets 130MB/s
> (kernel is 3.9.0). Your workload gives you over 1GB/s on ext4.....
>
>> tux3:
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> NTCreateX 1477980 0.003 12.944
> ....
>> ReadX 2316653 0.002 0.499
>> LockX 4812 0.002 0.207
>> UnlockX 4812 0.001 0.221
>> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms
> Hmmm... No "Flush" operations. Gotcha - you've removed the data
> integrity operations from the benchmark.
>
> Ah, I get it now - you've done that so the front end of tux3 won't
> encounter any blocking operations and so can offload 100% of
> operations. It also explains the sync call every 4 seconds to keep
> tux3 back end writing out to disk so that a) all the offloaded work
> is done by the sync process and not measured by the benchmark, and
> b) so the front end doesn't overrun queues and throttle or run out
> of memory.
>
> Oh, so nicely contrived. But terribly obvious now that I've found
> it. You've carefully crafted the benchmark to demonstrate a best
> case workload for the tux3 architecture, then carefully not
> measured the overhead of the work tux3 has offloaded, and then not
> disclosed any of this, in the hope that the headline is all people
> will look at.
>
> This would make a great case study for a "BenchMarketing For
> Dummies" book.
>
> Shame for you that you sent it to a list where people see the dbench
> numbers for ext4 and immediately think "that's not right" and then
> look deeper. Phoronix might swallow your sensationalist headline
> grab without analysis, but I don't think I'm alone in my suspicion
> that there was something stinky about your numbers.
>
> Perhaps in future you'll disclose such information with your
> results, otherwise nobody is ever going to trust anything you say
> about tux3....
>
> Cheers,
>
> Dave.

2013-05-10 05:47:53

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

Dave Chinner <[email protected]> writes:

>> tux3:
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> NTCreateX 1477980 0.003 12.944
> ....
>> ReadX 2316653 0.002 0.499
>> LockX 4812 0.002 0.207
>> UnlockX 4812 0.001 0.221
>> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms
>
> Hmmm... No "Flush" operations. Gotcha - you've removed the data
> integrity operations from the benchmark.

Right. Because tux3 does not implement fsync() yet. So I did

grep -v Flush /usr/share/dbench/client.txt > client2.txt

Why is that important for the comparison?

> Ah, I get it now - you've done that so the front end of tux3 won't
> encounter any blocking operations and so can offload 100% of
> operations. It also explains the sync call every 4 seconds to keep
> tux3 back end writing out to disk so that a) all the offloaded work
> is done by the sync process and not measured by the benchmark, and
> b) so the front end doesn't overrun queues and throttle or run out
> of memory.

Our backend is still running in debugging mode (flush every 10 transactions,
for stress/debugging). There is no interface for using normal writeback
timing yet, and I haven't tackled that yet.

And if normal writeback can't beat a crappy fixed timing (4 secs), that
rather means we have to improve the writeback timing. I.e., sync should be
slower than the best timing, right?

> Oh, so nicely contrived. But terribly obvious now that I've found
> it. You've carefully crafted the benchmark to demonstrate a best
> case workload for the tux3 architecture, then carefully not
> measured the overhead of the work tux3 has offloaded, and then not
> disclosed any of this, in the hope that the headline is all people
> will look at.
>
> This would make a great case study for a "BenchMarketing For
> Dummies" book.

Simply wrong. I did this to start optimizing tux3 (we know we have many
places to optimize in tux3), and the result was that post. If you can't
see at all from it what we did with the frontend/backend design, that
makes me a bit sad.

From this result, I think I could improve tmpfs/ext4 in the same way as
tux3 (see Unlink/Deltree) if I wanted to.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-11 06:12:31

by Daniel Phillips

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

Hi Dave,

Thanks for the catch - I should indeed have noted that "modified
dbench" was used for this benchmark, thus amplifying Tux3's advantage
in delete performance. This literary oversight does not make the
results any less interesting: we beat Tmpfs on that particular load.
Beating tmpfs at anything is worthy of note. Obviously, all three
filesystems ran the same load.

We agree that "classic unadulterated dbench" is an important Linux
benchmark for comparison with other filesystems. I think we should
implement a proper fsync for that one and not just use fsync = sync.
That isn't very far in the future; however, our main focus right now is
optimizing spinning disk allocation. It probably makes logistical
sense to leave fsync as it is for now and concentrate on the more
important issues.

I do not agree with your assertion that the benchmark as run is
invalid; I agree only that the modified load should have been described
in detail. I presume you would like to see a new bakeoff using "classic"
dbench. Patience please, this will certainly come down the pipe in due
course. We might not beat Tmpfs on that load but we certainly expect
to outperform some other filesystems.

Note that Tux3 ran this benchmark using its normal strong consistency
semantics, roughly similar to Ext4's data=journal. In that light, the
results are even more interesting.

> ...you've done that so the front end of tux3 won't
> encounter any blocking operations and so can offload 100% of
> operations.

Yes, that is the entire point of our front/back design: reduce
application latency for buffered filesystem transactions.

> It also explains the sync call every 4 seconds to keep
> tux3 back end writing out to disk so that a) all the offloaded work
> is done by the sync process and not measured by the benchmark, and
> b) so the front end doesn't overrun queues and throttle or run out
> of memory.

Entirely correct. That's really nice, don't you think? You nicely
described a central part of Tux3's design: our "delta" mechanism. We
expect to spend considerable effort tuning the details of our delta
transition behaviour as time goes by. However this is not an immediate
priority because the simplistic "flush every 4 seconds" hack already
works pretty well for a lot of loads.

Thanks for your feedback,

Daniel

2013-05-11 18:35:47

by james northrup

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

also interesting information... Study of 2,047 papers on PubMed finds
that two-thirds of retracted papers were down to scientific
misconduct, not error

On Fri, May 10, 2013 at 11:12 PM, Daniel Phillips
<[email protected]> wrote:
> Hi Dave,
>
> Thanks for the catch - I should indeed have noted that "modified
> dbench" was used for this benchmark, thus amplifying Tux3's advantage
> in delete performance. This literary oversight does not make the
> results any less interesting: we beat Tmpfs on that particular load.
> Beating tmpfs at anything is worthy of note. Obviously, all three

2013-05-11 21:26:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

On Fri, May 10, 2013 at 11:12:27PM -0700, Daniel Phillips wrote:
> Hi Dave,
>
> Thanks for the catch - I should indeed have noted that "modified
> dbench" was used for this benchmark, thus amplifying Tux3's advantage
> in delete performance.

Dropping fsync() does a lot more than "amplify Tux3's advantage in
delete performance". Since fsync(2) is defined as not returning until
the data written to the file descriptor is flushed out to stable
storage --- so it is guaranteed to be seen after a system crash --- it
means that the foreground application must not continue until the data
is written by Tux3's back-end.
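
To make that contract concrete, here is a minimal example using nothing
but ordinary POSIX calls (no Tux3 specifics; the filename is made up):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("mailbox.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 1;

    const char msg[] = "queued mail\n";
    if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg))
        return 1;

    /* The application is stuck here until the filesystem -- front end,
     * back end, journal and all -- has made the data durable. */
    if (fsync(fd) != 0) {
        perror("fsync");
        return 1;
    }
    return close(fd);
}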

So it also means that any advantage of decoupling the front/back end
is nullified, since fsync(2) requires a temporal coupling. In fact,
if there are any delays introduced between when the front-end sends the
fsync request and when the back-end finishes writing the data and
then communicates this back to the front-end --- e.g., caused by
scheduler latencies --- this may end up being a disadvantage compared to
more traditional file system designs.

Like many things in file system design, there are tradeoffs. It's
perhaps more useful when having these discussions to be clear about what
you are trading off for what; in this case, the front/back design may
be good for some things, and less good for others, such as mail server
workloads where fsync(2) semantics are extremely important for
application correctness.

Best regards,

- Ted

2013-05-12 04:28:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

(resent as plain text)

On Sat, May 11, 2013 at 2:26 PM, Theodore Ts'o <[email protected]> wrote:
> Dropping fsync() does a lot more than "amplify Tux3's advantage in
> delete performance". Since fsync(2) is defined as not returning until
> the data written to the file descriptor is flushed out to stable
> storage --- so it is guaranteed to be seen after a system crash --- it
> means that the foreground application must not continue until the data
> is written by Tux3's back-end.
>
> So it also means that any advantage of decoupling the front/back end
> is nullified, since fsync(2) requires a temporal coupling. In fact,
> if there are any delays introduced between when the front-end sends the
> fsync request and when the back-end finishes writing the data and
> then communicates this back to the front-end --- e.g., caused by
> scheduler latencies --- this may end up being a disadvantage compared to
> more traditional file system designs.
>
> Like many things in file system design, there are tradeoffs. It's
> perhaps more useful when having these discussions to be clear about what
> you are trading off for what; in this case, the front/back design may
> be good for some things, and less good for others, such as mail server
> workloads where fsync(2) semantics are extremely important for
> application correctness.

Exactly, Ted. We avoided measuring the fsync load on this particular
benchmark because we have not yet optimized fsync. When we do get to
it (not an immediate priority) I expect Tux3 to perform competitively,
because our delta commit scheme does manage the job with a minimal
number of block writes. To have a really efficient fsync we need to
isolate just the changes for the fsynced file into a special "half
delta" that gets its own commit, ahead of any other pending changes
to the filesystem. There is a plan for this; however, we would rather
not get sidetracked on that project now while we are getting ready
for merge.
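
To give a rough idea of the shape of that plan, here is a toy model of
the "half delta" concept (all names invented, and this is only my sketch
of the idea, not working Tux3 code): fsync peels the target file's dirty
blocks out of the pending delta and commits only those, ahead of
everything else.

#include <stdio.h>
#include <stdlib.h>

struct dirty_block {
    int ino;                    /* owning file          */
    long blocknr;               /* block to be written  */
    struct dirty_block *next;
};

static struct dirty_block *pending;   /* the current (full) delta */

static void mark_dirty(int ino, long blocknr)
{
    struct dirty_block *b = malloc(sizeof(*b));
    b->ino = ino;
    b->blocknr = blocknr;
    b->next = pending;
    pending = b;
}

/* fsync(ino): move only that file's blocks into a "half delta" and
 * commit it now; everything else stays queued for the next delta. */
static void fsync_half_delta(int ino)
{
    struct dirty_block *half = NULL, **p = &pending;

    while (*p) {
        if ((*p)->ino == ino) {
            struct dirty_block *b = *p;
            *p = b->next;
            b->next = half;
            half = b;
        } else {
            p = &(*p)->next;
        }
    }

    while (half) {                     /* stand-in for the real commit */
        struct dirty_block *b = half;
        printf("commit block %ld of inode %d\n", b->blocknr, b->ino);
        half = b->next;
        free(b);
    }
}

int main(void)
{
    mark_dirty(1, 100);
    mark_dirty(2, 200);
    mark_dirty(1, 101);
    fsync_half_delta(1);               /* commits inode 1's blocks only */
    return 0;
}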

The point that seems to be getting a little lost in this thread is
that the benchmark, just as we ran it, models an important and common
type of workload (arguably the most common workload for real users),
and the resulting performance measurement is easily reproducible by
anyone who cares to try. In fact, I think we should prepare and
post a detailed recipe for doing just that, since the interest
level seems to be high.

Regards,

Daniel

PS for any Googlers reading: do you know that using Gmail to post to
LKML is simply maddening for all concerned? If you want to know why
then try it yourself. Plain text. Some people need it, and need it to
be reliable instead of gratuitously changing back to html at
surprising times. And static word wrap. Necessary.

2013-05-12 04:39:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

On Sat, May 11, 2013 at 11:35 AM, james northrup
<[email protected]> wrote:
> also interesting information... Study of 2,047 papers on PubMed finds
> that two-thirds of retracted papers were down to scientific
> misconduct, not error

Could you please be specific about the meaning you intend? Because
innuendo is less than useful in this forum. If you mean to say that
our posted results might not be independently verifiable then I invite
you to run the tests as described (including removing fsync) yourself.
If you require any assistance from us in doing that, we will be
pleased to provide it.

Regards,

Daniel

2013-05-13 23:22:59

by Daniel Phillips

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

Hi Ted,

You said:

> ...any advantage of decoupling the front/back end
> is nullified, since fsync(2) requires a temporal coupling

After pondering it for a while, I realized that is not
completely accurate. The reduced delete latency will
allow the dbench process to proceed to the fsync point
faster; then, if our fsync is reasonably efficient (not the
case today, but planned), we may still see an overall
speedup.

> if there are any delays introduced between when the
> front-end sends the fsync request and when the back-
> end finishes writing the data and then communicates
> this back to the front-end --- e.g., caused by scheduler
> latencies --- this may end up being a disadvantage
> compared to more traditional file system designs.

Nothing stops our frontend from calling its backend
synchronously, which is just what we intend to do for
fsync. The real design issue for Tux3 fsync is writing
out the minimal set of blocks to update a single file.
As it is now, Tux3 commits all dirty file data at each
delta, which is fine for many common loads, but not
all. Two examples of loads where this may be less
than optimal:

1) fsync (as you say)

2) multiple tasks accessing different files

To excel under those loads, Tux3 needs to be able to
break its "always commit everything" rule in an
organized way. We have considered several design
options for this but not yet prototyped any, because we
feel that work can reasonably be attacked later. As
always, we will seek the most rugged, efficient and
simple solution.

Regards,

Daniel

2013-05-14 06:25:32

by Daniel Phillips

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

Interesting, Andreas. We don't do anything as heavyweight as
allocating an inode in this path, just mark the inode dirty (which
puts it on a list) and set a bit in the inode flags.

Regards,

Daniel

2013-05-14 06:34:49

by Dave Chinner

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

On Fri, May 10, 2013 at 02:47:35PM +0900, OGAWA Hirofumi wrote:
> Dave Chinner <[email protected]> writes:
>
> >> tux3:
> >> Operation Count AvgLat MaxLat
> >> ----------------------------------------
> >> NTCreateX 1477980 0.003 12.944
> > ....
> >> ReadX 2316653 0.002 0.499
> >> LockX 4812 0.002 0.207
> >> UnlockX 4812 0.001 0.221
> >> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms
> >
> > Hmmm... No "Flush" operations. Gotcha - you've removed the data
> > integrity operations from the benchmark.
>
> Right. Because tux3 is not implementing fsync() yet. So, I did
>
> grep -v Flush /usr/share/dbench/client.txt > client2.txt
>
> Why is it important for comparing?

Because nobody could reproduce your results without working that
out. You didn't disclose that you'd made these changes, and that
makes it extremely misleading as to what the results mean. Given the
headline-grab nature of it, it's deceptive at best.

I don't care how fast tux3 is - I care about being able to reproduce
other people's results. Hence if you are going to report benchmark
results comparing filesystems then you need to tell everyone exactly
what you've tweaked and why, from the hardware all the way up to the
benchmark config.

Work on how *you* report *your* results - don't let Daniel turn them
into some silly marketing fluff that tries to grab headlines.

-Dave.
--
Dave Chinner
[email protected]

2013-05-14 07:59:34

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

Dave Chinner <[email protected]> writes:

>> Right. Because tux3 is not implementing fsync() yet. So, I did
>>
>> grep -v Flush /usr/share/dbench/client.txt > client2.txt
>>
>> Why is it important for comparing?
>
> Because nobody could reproduce your results without working that
> out. You didn't disclose that you'd made these changes, and that
> makes it extremely misleading as to what the results mean. Given the
> headline-grab nature of it, it's deceptive at best.
>
> I don't care how fast tux3 is - I care about being able to reproduce
> other people's results. Hence if you are going to report benchmark
> results comparing filesystems then you need to tell everyone exactly
> what you've tweaked and why, from the hardware all the way up to the
> benchmark config.

Thanks for the advice.
--
OGAWA Hirofumi <[email protected]>

2013-05-15 17:10:55

by Andreas Dilger

[permalink] [raw]
Subject: Re: Tux3 Report: Faster than tmpfs, what?

On 2013-05-14, at 0:25, Daniel Phillips <[email protected]> wrote:
> Interesting, Andreas. We don't do anything as heavyweight as
> allocating an inode in this path, just mark the inode dirty (which
> puts it on a list) and set a bit in the inode flags.

The new inode allocation is only needed for the truncate-to-zero case. If the inode is being deleted, it is used directly.

Sorry for the confusion; it has been a long time since I looked at that code.

Cheers, Andreas-