2001-11-24 18:36:38

by Steve Bergman

Subject: Disk hardware caching, performance, and journalling

Hi,

I made a couple of discoveries today which were surprising to me:

1. Disk hardware caching defaults to ON. (hdparm -W1 /dev/hda)
2. It makes a *big* difference in write performance.

I had always thought that the default was off. I also always assumed
that a small cache behind a large (OS) cache would make no difference.

Here are my results with bonnie under kernel 2.4.14 on a reiserfs with a
maxtor Diamond max+ 60GB udma100 drive:

Write caching on:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          256 12618 97.1 38027 36.3  9647  6.9 11250 73.6 31832 12.1 200.9  1.2


Write caching off:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          256  9917 76.3 12280 11.2  5159  3.5  9934 65.3 33056 14.1 203.9  1.4

Note that block writes are over 3 times faster with caching on.
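
As a quick check of that ratio, the figures from the two runs can be
compared directly. This is only an illustrative sketch: the key names
are labels chosen here, and the numbers are copied from the two bonnie
tables above.

```python
# Throughput figures (K/sec) copied from the two bonnie runs above.
# The key names are labels chosen for this sketch, not bonnie output.
cache_on = {"char_write": 12618, "block_write": 38027, "rewrite": 9647,
            "char_read": 11250, "block_read": 31832}
cache_off = {"char_write": 9917, "block_write": 12280, "rewrite": 5159,
             "char_read": 9934, "block_read": 33056}


def speedups(on, off):
    """Return the caching-on / caching-off throughput ratio for each test."""
    return {name: on[name] / off[name] for name in on}


ratios = speedups(cache_on, cache_off)
for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:.2f}x")
```

The write-side tests all improve with caching on, while block reads are
essentially unchanged (slightly lower, which may just be run-to-run
noise).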

So what are the implications here for journalling? Do I have to turn
off caching and suffer a huge performance hit?


-Steve Bergman


2001-11-24 19:09:32

by Andrew Morton

Subject: Re: Disk hardware caching, performance, and journalling

Steve Bergman wrote:
>
> Note that block writes are over 3 times faster with caching on.

With a large linear write, linux typically feeds requests into the
disk like this:

write 248 sectors
write 248 sectors
...
write 248 sectors
write 8 sectors
write 248 sectors
...

Now, 248+8 sectors is 128 kbytes. A track is, say, 300 kbytes.

With writebehind the disk can write that entire track in pretty
much a single spin. But if we're waiting on the result of each
request we'll lose revolutions. In synchronous mode it's going
to take three or four spins to write a track.
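
The arithmetic above can be sketched as follows. The 512-byte sector
size and the 7200 rpm spindle speed are assumptions made for this
sketch (neither is stated in the thread), the 300 kbyte track is the
"say" figure from the text, and seek and head-switch time are ignored.

```python
SECTOR_BYTES = 512     # assumption: conventional sector size
RPM = 7200             # assumption: spindle speed is not given in the thread
TRACK_KBYTES = 300     # the "say, 300 kbytes" figure from the text


def track_write_rate(track_kbytes, rpm, spins_per_track):
    """Approximate write rate in kbytes/sec when each track costs
    `spins_per_track` full revolutions (seek time ignored)."""
    seconds_per_spin = 60.0 / rpm
    return track_kbytes / (spins_per_track * seconds_per_spin)


# One 248-sector request plus one 8-sector request.
request_pair_kbytes = (248 + 8) * SECTOR_BYTES / 1024

writebehind = track_write_rate(TRACK_KBYTES, RPM, 1)
sync_three = track_write_rate(TRACK_KBYTES, RPM, 3)
sync_four = track_write_rate(TRACK_KBYTES, RPM, 4)

print(f"248+8 sectors      = {request_pair_kbytes:.0f} kbytes")
print(f"1 spin per track   ~ {writebehind:.0f} kbytes/sec")
print(f"3 spins per track  ~ {sync_three:.0f} kbytes/sec")
print(f"4 spins per track  ~ {sync_four:.0f} kbytes/sec")
```

Under these assumptions the estimate lands in the same range as the
block-write figures reported earlier in the thread, though that
agreement may be partly coincidental.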

> So what are the implications here for journalling? Do I have to turn
> off caching and suffer a huge performance hit?

In theory, yes. In my opinion, no. For ext3, at least. Caching
isn't bad per-se. It's reordering which can break the journalling
constraints. But given that the journal is, we hope, a strictly
ascending and (we really hope) contiguous chunk of blocks, it's
quite unlikely that the disk will decide to write them in an
unexpected order. This is especially true if the journal was
created when the disk was relatively unfragmented.

And if the disk _does_ write them in the wrong order, it has
to be specifically the journal commit block which was written
prior to some data blocks. And you need to lose power (not
just crash) prior to the data blocks hitting disk. It's a
very small time window containing an improbable occurrence.
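
A toy model of that failure case, for illustration only. It is not
ext3 or JBD code; the block names and the `recovery_outcome` helper are
hypothetical. It simply asks which blocks reached the platter before
power was lost and what a replay would then find.

```python
def surviving_blocks(write_order, writes_completed):
    """Blocks that reached the platter before power was lost."""
    return set(write_order[:writes_completed])


def recovery_outcome(journal_blocks, commit_block, on_disk):
    """Classify what journal replay would see after the power loss."""
    if commit_block not in on_disk:
        return "transaction ignored (safe)"
    if all(block in on_disk for block in journal_blocks):
        return "transaction replayed (safe)"
    return "commit present but blocks missing (corrupt replay)"


JOURNAL_BLOCKS = ["J1", "J2", "J3"]   # hypothetical journalled blocks
COMMIT = "COMMIT"                     # hypothetical commit block


def outcomes(write_order):
    """Outcome for a power loss after 0, 1, ..., len(write_order) writes."""
    return [recovery_outcome(JOURNAL_BLOCKS, COMMIT,
                             surviving_blocks(write_order, k))
            for k in range(len(write_order) + 1)]


in_order = outcomes(JOURNAL_BLOCKS + [COMMIT])
reordered = outcomes(["J1", COMMIT, "J2", "J3"])

for k, (a, b) in enumerate(zip(in_order, reordered)):
    print(f"power lost after {k} writes: in-order={a!r}, reordered={b!r}")
```

In this model the in-order case is safe at every point, and the
reordered case is only unsafe if power is lost after the commit block
is written but before the remaining blocks are.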

Now that's all just vigorous handwaving, and may be wrong,
and yes, we really need a way of propagating barriers down
to the request queue. But I've not seen a whisker of a report
which indicates that write reordering has caused on-recovery
corruption.


2001-11-24 19:39:51

by Florian Weimer

Subject: Re: Disk hardware caching, performance, and journalling

Andrew Morton <[email protected]> writes:

> In theory, yes. In my opinion, no. For ext3, at least. Caching
> isn't bad per-se. It's reordering which can break the journalling
> constraints. But given that the journal is, we hope, a strictly
> ascending and (we really hope) contiguous chunk of blocks, it's
> quite unlikely that the disk will decide to write them in an
> unexpected order. This is especially true if the journal was
> created when the disk was relatively unfragmented.

When the journal resides on multiple disks or disks different from the
actual data (think LVM or RAID), all bets are off. You need
synchronous write operations in these cases, I think.

--
Florian Weimer [email protected]
University of Stuttgart http://cert.uni-stuttgart.de/
RUS-CERT +49-711-685-5973/fax +49-711-685-5898

2001-11-24 19:40:31

by Mark Hahn

Subject: Re: Disk hardware caching, performance, and journalling

> So what are the implications here for journalling? Do I have to turn
> off caching and suffer a huge performance hit?

why does everyone get freaked out about disk caches? afaict,
there's only an O(50ms) window at each catastrophic power failure:
trivial for any reasonable rate of failures...

2001-11-24 21:58:02

by Phil Howard

Subject: Re: Disk hardware caching, performance, and journalling

On Sat, Nov 24, 2001 at 11:08:17AM -0800, Andrew Morton wrote:

| With writebehind the disk can write that entire track in pretty
| much a single spin. But if we're waiting on the result of each
| request we'll lose revolutions. In synchronous mode it's going
| to take three or four spins to write a track.
|
| > So what are the implications here for journalling? Do I have to turn
| > off caching and suffer a huge performance hit?
|
| In theory, yes. In my opinion, no. For ext3, at least. Caching
| isn't bad per-se. It's reordering which can break the journalling
| constraints. But given that the journal is, we hope, a strictly
| ascending and (we really hope) contiguous chunk of blocks, it's
| quite unlikely that the disk will decide to write them in an
| unexpected order. This is especially true if the journal was
| created when the disk was relatively unfragmented.
|
| And if the disk _does_ write them in the wrong order, it has
| to be specifically the journal commit block which was written
| prior to some data blocks. And you need to lose power (not
| just crash) prior to the data blocks hitting disk. It's a
| very small time window containing an improbable occurrence.
|
| Now that's all just vigorous handwaving, and may be wrong,
| and yes, we really need a way of propagating barriers down
| to the request queue. But I've not seen a whisker of a report
| which indicates that write reordering has caused on-recovery
| corruption.

What if (and maybe it is so, or maybe not) the write cache does
write-back (or write-behind) zones per track? That is, when it
gets a write request either when the cache is not dirty, or is
in the same _physical_ track that everything that is in there
dirty goes into, it gets queued. Then when some time passes,
or a request goes to a new track, it will write that track now
and move on. IWSTM this gives you the advantage of "one spin"
writes for sequential data, and still keeps the order right for
journaled data. Could they be doing this?
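
A minimal sketch of the policy described above, purely hypothetical:
nothing here is known drive firmware behaviour. The class name and the
600-sectors-per-track figure are assumptions for illustration, and the
"some time passes" flush is left out.

```python
class PerTrackWriteCache:
    """Toy drive cache that holds dirty sectors for one track at a time."""

    def __init__(self, sectors_per_track):
        self.sectors_per_track = sectors_per_track
        self.current_track = None
        self.dirty = {}         # sector number -> data, current track only
        self.platter_log = []   # (track, sorted sectors) in flush order

    def write(self, sector, data):
        """Queue a write; flush first if it targets a different track."""
        track = sector // self.sectors_per_track
        if self.dirty and track != self.current_track:
            self.flush()
        self.current_track = track
        self.dirty[sector] = data

    def flush(self):
        """Write the whole queued track out in one pass."""
        if self.dirty:
            self.platter_log.append((self.current_track, sorted(self.dirty)))
            self.dirty = {}


cache = PerTrackWriteCache(sectors_per_track=600)  # assumed track size
for sector in (0, 248, 496, 600, 10):
    cache.write(sector, b"x")
cache.flush()
print(cache.platter_log)
```

In this model writes to different tracks reach the platter in the order
they were issued, but the order of sectors within one track is still up
to the drive.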

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN | Dallas | http://linuxhomepage.com/ |
| [email protected] | Texas, USA | http://phil.ipal.org/ |
-----------------------------------------------------------------

2001-11-25 09:18:06

by Chris Wedgwood

Subject: Re: Disk hardware caching, performance, and journalling

On Sat, Nov 24, 2001 at 12:36:18PM -0600, Steve Bergman wrote:

> 1. Disk hardware caching defaults to ON. (hdparm -W1 /dev/hda)
> 2. It makes a *big* difference in write performance.

It depends on the drive: my IDE drives do default to on, my SCSI drives
do not.

The difference in write performance doesn't seem to be a problem other
than in contrived situations (e.g. streaming 5G of data to disk takes
the same amount of time either way, but untarring something and then
running 'sync' is faster with the drive caching).

It also depends on your filesystem to some extent and the operations
being performed [1].

> So what are the implications here for journalling? Do I have to
> turn off caching and suffer a huge performance hit?

Yes. I do this on workstations and it doesn't seem to hurt in
practice (only in benchmarks).

I can't comment on your bonnie++ results and I have no idea how well
they reflect reality (I assume to a large extent they try to though).




--cw

[1] XFS rm -rf some_large_dir bites with drive-caching off for example.

2001-11-25 09:20:54

by Chris Wedgwood

Subject: Re: Disk hardware caching, performance, and journalling

On Sat, Nov 24, 2001 at 11:08:17AM -0800, Andrew Morton wrote:

> In theory, yes. In my opinion, no. For ext3, at least. Caching
> isn't bad per-se. It's reordering which can break the journalling
> constraints.

Some disks[1] most definitely do reorder; I've actually been able to
demonstrate this in some circumstances (it wasn't trivial to
produce and required several attempts).



--cw

[1] SCSI, which we know does and will reorder writes when a barrier
isn't specified. It appears IDE drives can do the same, but lack of
hot-swap makes testing tedious :)

2001-11-25 09:22:14

by Chris Wedgwood

Subject: Re: Disk hardware caching, performance, and journalling

On Sat, Nov 24, 2001 at 02:39:44PM -0500, Mark Hahn wrote:

> why does everyone get freaked out about disk caches? afaict,
> there's only an O(50ms) window at each catastrophic power failure:
> trivial for any reasonable rate of failures...

If your disks are busy all the time (eg. a large mail server) then you
will trivially hit this and it will be a problem.
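
To put rough numbers on the idle versus busy cases, here is a sketch
under stated assumptions: writes arrive as a Poisson process, each
write stays in the drive cache for the full 50ms window quoted above,
and both write rates are made-up illustrative values, not measurements.

```python
import math


def p_dirty_at_failure(writes_per_sec, window_sec=0.05):
    """Probability that at least one write arrived within the last
    `window_sec` seconds, i.e. that the cache holds unwritten data at a
    randomly timed power failure (Poisson arrivals assumed)."""
    return 1.0 - math.exp(-writes_per_sec * window_sec)


idle = p_dirty_at_failure(0.01)   # hypothetical: one write every 100 s
busy = p_dirty_at_failure(200.0)  # hypothetical: busy mail spool

print(f"mostly idle disk: {idle:.4%} chance of data in cache at failure")
print(f"busy disk:        {busy:.4%} chance of data in cache at failure")
```

On a mostly idle disk the window is almost never occupied; on a disk
that is writing continuously it is occupied nearly all the time, so
practically every power failure lands inside it.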



--cw

2001-11-25 21:41:40

by Kevin P. Fleming

Subject: Re: Disk hardware caching, performance, and journalling

I think if you have a large mail server and zero power protection, you've
got much larger problems to worry about than write-behind caching on your
disk drives... my servers have never (in my memory) experienced a
catastrophic power failure, because they're too easy to avoid.

----- Original Message -----
From: "Chris Wedgwood" <[email protected]>
To: "Mark Hahn" <[email protected]>
Cc: <[email protected]>
Sent: Sunday, November 25, 2001 2:23 AM
Subject: Re: Disk hardware caching, performance, and journalling


> On Sat, Nov 24, 2001 at 02:39:44PM -0500, Mark Hahn wrote:
>
> > why does everyone get freaked out about disk caches? afaict,
> > there's only an O(50ms) window at each catastrophic power failure:
> > trivial for any reasonable rate of failures...
>
> If your disks are busy all the time (eg. a large mail server) then you
> will trivially hit this and it will be a problem.
>
>
>
> --cw

2001-11-25 22:09:50

by Chris Wedgwood

Subject: Re: Disk hardware caching, performance, and journalling

On Sun, Nov 25, 2001 at 02:45:57PM -0700, Kevin P. Fleming wrote:

> I think if you have a large mail server and zero power protection,
> you've got much larger problems to worry about than write-behind
> caching on your disk drives... my servers have never (in my
> memory) experienced a catastrophic power failure, because they're
> too easy to avoid.

In the specific case of email, you want to make certain guarantees,
and having data written to non-volatile storage is one of them.

As for power failures, given enough time and enough hardware you will
get them, even if your machines are dual or triple powered off diverse
UPSs or -48V powered; it is still possible, and it _will eventually
happen_, that something further down the line like the motherboard will
fry or whatever.
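
The "enough time and enough hardware" point can be sketched with
simple arithmetic. The 1% per-machine-per-year failure probability and
the fleet sizes are hypothetical values chosen for illustration, and
failures are assumed to be independent.

```python
def p_any_failure(p_per_machine_year, machines, years):
    """Probability of at least one sudden power-loss event across the
    fleet, assuming independent failures."""
    return 1.0 - (1.0 - p_per_machine_year) ** (machines * years)


small = p_any_failure(0.01, machines=10, years=1)    # hypothetical fleet
large = p_any_failure(0.01, machines=500, years=5)   # hypothetical fleet

print(f"10 machines, 1 year:   {small:.1%}")
print(f"500 machines, 5 years: {large:.6%}")
```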

People who assume that a small-window is small enough and decide that
is 'good enough' are dangerous :)



--cw

2001-11-25 22:30:00

by Mark Hahn

Subject: Re: Disk hardware caching, performance, and journalling

> I think if you have a large mail server and zero power protection,
> you've got much larger problems to worry about than write-behind
> caching on your disk drives... my servers have never (in my
...
> In the specific case of email; you want to make certain guarantees,
> and having data written to non-volatile storage is one of them.

it's pitiful to pretend that this minuscule risk (50ms per catastrophic
power failure) is all that stands between you and absolute stability
of storage.

> People who assume that a small-window is small enough and decide that
> is 'good enough' are dangerous :)

your religious pursuits are your own business.
the rest of the world will go on calculating probabilities of failure,
rather than emoting. using raid, redundant sites, upses, etc,
rather than obsessing on how terrible IDE disks are.

in summary: write caching on disks is *not* an impediment to robust systems.

I've omitted lkml from this reply since it has nothing to do with the kernel.

2001-11-26 01:11:24

by Bernd Eckenfels

Subject: Re: Disk hardware caching, performance, and journalling

In article <00e001c175fa$90d02b40$6caaa8c0@kevin> you wrote:
> disk drives... my servers have never (in my memory) experienced a
> catastrophic power failure, because they're too easy to avoid.

Well, even a UPS can fail. But battery-protected RAID controllers are a
MUST anyway.

Greetings
Bernd