Hi!
From http://kerneltrap.org/node/5103
"Hyper-Threading, as currently implemented on Intel Pentium Extreme
Edition, Pentium 4, Mobile Pentium 4, and Xeon processors, suffers from
a serious security flaw," Colin explains. "This flaw permits local
information disclosure, including allowing an unprivileged user to steal
an RSA private key being used on the same machine. Administrators of
multi-user systems are strongly advised to take action to disable
Hyper-Threading immediately."
"More" info here:
http://www.daemonology.net/hyperthreading-considered-harmful/
Does this flaw affect the current stable Linux kernels? Workaround?
Patch?
Thanks.
-
MG
On Fri, May 13, 2005 at 07:51:20AM +0200, Gabor MICSKO wrote:
> Does this flaw affect the current stable Linux kernels? Workaround?
> Patch?
Some pages with relevant information:
http://www.ussg.iu.edu/hypermail/linux/kernel/0403.2/0920.html
http://bugzilla.kernel.org/show_bug.cgi?id=2317
AFAICT, the workaround is something like this:
1. If possible, disable HyperThreading in BIOS.
2. If you have only one CPU, boot a UP kernel rather than SMP.
3. If you have 2 or more CPU's and you can't disable HT in the BIOS,
boot with "maxcpus=n", where "n" is the number of physical CPU's
in the computer (e.g. "maxcpus=2"). If you are running a kernel
earlier than 2.6.5 or 2.4.26, this probably isn't going to work.
If you try this, check dmesg afterward to make sure it worked
properly (see the bugzilla.kernel.org URL for details).
4. If you would try #3 but are running a 2.4.xx *vendor* kernel
(not mainline), where xx < 26, try "noht".
5. If #3 and #4 don't work, try "acpi=off".
Option #3 ("maxcpus=2") is what I expect to be deploying in the next
several hours, FWIW...
-Barry K. Nathan <[email protected]>
Barry K. Nathan wrote:
> On Fri, May 13, 2005 at 07:51:20AM +0200, Gabor MICSKO wrote:
>
>>Does this flaw affect the current stable Linux kernels? Workaround?
>>Patch?
Simple. Just boot a uniprocessor kernel, and/or disable HT in BIOS.
> Some pages with relevant information:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0403.2/0920.html
> http://bugzilla.kernel.org/show_bug.cgi?id=2317
These pages have zero information on the "flaw." In fact, I can see no
information at all proving that there is even a problem here.
Classic "I found a problem, but I'm keeping the info a secret" security
crapola.
Jeff
On Fri, May 13, 2005 at 10:10:36AM -0400, Jeff Garzik wrote:
> Barry K. Nathan wrote:
> >On Fri, May 13, 2005 at 07:51:20AM +0200, Gabor MICSKO wrote:
> >
> >>Does this flaw affect the current stable Linux kernels? Workaround?
> >>Patch?
>
> Simple. Just boot a uniprocessor kernel, and/or disable HT in BIOS.
>
>
> >Some pages with relevant information:
> >http://www.ussg.iu.edu/hypermail/linux/kernel/0403.2/0920.html
> >http://bugzilla.kernel.org/show_bug.cgi?id=2317
>
> These pages have zero information on the "flaw." In fact, I can see no
> information at all proving that there is even a problem here.
>
> Classic "I found a problem, but I'm keeping the info a secret" security
> crapola.
FYI:
http://www.daemonology.net/hyperthreading-considered-harmful/
I don't much agree with Colin about the severity of the problem, but
I've read his paper, which should be generally available later today.
It's definitely a legitimate issue.
--
Daniel Jacobowitz
CodeSourcery, LLC
Daniel Jacobowitz wrote:
> On Fri, May 13, 2005 at 10:10:36AM -0400, Jeff Garzik wrote:
>
>>Barry K. Nathan wrote:
>>
>>>On Fri, May 13, 2005 at 07:51:20AM +0200, Gabor MICSKO wrote:
>>>
>>>
>>>>Does this flaw affect the current stable Linux kernels? Workaround?
>>>>Patch?
>>
>>Simple. Just boot a uniprocessor kernel, and/or disable HT in BIOS.
>>
>>
>>
>>>Some pages with relevant information:
>>>http://www.ussg.iu.edu/hypermail/linux/kernel/0403.2/0920.html
>>>http://bugzilla.kernel.org/show_bug.cgi?id=2317
>>
>>These pages have zero information on the "flaw." In fact, I can see no
>>information at all proving that there is even a problem here.
>>
>>Classic "I found a problem, but I'm keeping the info a secret" security
>>crapola.
>
>
> FYI:
> http://www.daemonology.net/hyperthreading-considered-harmful/
Already read it. This link provides no more information than either of
the above links provide.
> I don't much agree with Colin about the severity of the problem, but
> I've read his paper, which should be generally available later today.
> It's definitely a legitimate issue.
We'll see...
As of this moment, there continues to be _zero_ information proving that
a problem exists.
Jeff
On Fri, May 13, 2005 at 10:32:48AM -0400, Jeff Garzik wrote:
> Daniel Jacobowitz wrote:
> > http://www.daemonology.net/hyperthreading-considered-harmful/
>
> Already read it. This link provides no more information than either of
> the above links provide.
He's posted his paper now.
http://www.daemonology.net/papers/htt.pdf
It's a side channel timing attack on data-dependent computation through
the L1 and L2 caches. Nice work. In-the-wild exploitation is
difficult, though; your timing gets screwed up if you get scheduled away
from your victim, and you don't even know, because you can't tell where
you were scheduled, so on any reasonably busy multiuser system it's not
clear that the attack is practical.
-andy
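[To make the attack Andy describes concrete, here is a minimal
prime-and-probe sketch in C. The cache geometry and all names are
illustrative assumptions, not details from Colin's paper, and a real
exploit would additionally have to land on the victim's HT sibling:]

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE   64              /* assumed cache line size, bytes */
#define NLINES (8192 / LINE)   /* assumed 8KB L1 data cache */

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        char *base = malloc(NLINES * LINE);
        volatile char *buf = base;
        uint64_t delta[NLINES];
        int i;

        if (!base)
                return 1;

        /* prime: touch one byte of every line to pull it into the cache */
        for (i = 0; i < NLINES; i++)
                buf[i * LINE] = 1;

        /* ...the victim runs on the HT sibling meanwhile... */

        /* probe: a slow re-read means the victim evicted that line */
        for (i = 0; i < NLINES; i++) {
                uint64_t t0 = rdtsc();
                (void)buf[i * LINE];
                delta[i] = rdtsc() - t0;
        }
        for (i = 0; i < NLINES; i++)
                printf("line %3d: %llu cycles\n", i,
                       (unsigned long long)delta[i]);
        free(base);
        return 0;
}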
More info in this paper:
http://www.daemonology.net/papers/htt.pdf
> > FYI:
> > http://www.daemonology.net/hyperthreading-considered-harmful/
>
> Already read it. This link provides no more information than either of
> the above links provide.
Gabor MICSKO <[email protected]> writes:
> Hi!
>
> From http://kerneltrap.org/node/5103
>
> "Hyper-Threading, as currently implemented on Intel Pentium Extreme
> Edition, Pentium 4, Mobile Pentium 4, and Xeon processors, suffers from
> a serious security flaw," Colin explains. "This flaw permits local
> information disclosure, including allowing an unprivileged user to steal
> an RSA private key being used on the same machine. Administrators of
> multi-user systems are strongly advised to take action to disable
> Hyper-Threading immediately."
>
> "More" info here:
> http://www.daemonology.net/hyperthreading-considered-harmful/
>
> Does this flaw affect the current stable Linux kernels? Workaround?
> Patch?
This is not a kernel problem, but a user space problem. The fix
is to change the user space crypto code to need the same number of cache line
accesses on all keys.
Disabling HT for this would be the totally wrong approach, like throwing
out the baby with the bath water.
-Andi
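[A sketch of what Andi's suggested fix means in code: the leaky version
indexes a table with a secret, so the cache line it touches reveals the
secret to a sibling thread; the uniform version touches every entry and
selects the result arithmetically. Names and sizes are hypothetical --
this is not OpenSSL code:]

#include <stdint.h>

#define TABLE_SIZE 16

/* leaky: which cache line is accessed depends on the secret */
uint32_t lookup_leaky(const uint32_t t[TABLE_SIZE], unsigned secret)
{
        return t[secret];
}

/* uniform: every call touches all TABLE_SIZE entries in the same order */
uint32_t lookup_uniform(const uint32_t t[TABLE_SIZE], unsigned secret)
{
        uint32_t r = 0;
        unsigned i;

        for (i = 0; i < TABLE_SIZE; i++) {
                /* mask is all-ones only when i == secret */
                uint32_t mask = (uint32_t)0 - (i == secret);
                r |= t[i] & mask;
        }
        return r;
}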
On Fri, 13 May 2005, Andy Isaacson wrote:
> On Fri, May 13, 2005 at 10:32:48AM -0400, Jeff Garzik wrote:
> > Daniel Jacobowitz wrote:
> > > http://www.daemonology.net/hyperthreading-considered-harmful/
> >
> > Already read it. This link provides no more information than either of
> > the above links provide.
>
> He's posted his paper now.
>
> http://www.daemonology.net/papers/htt.pdf
>
> It's a side channel timing attack on data-dependent computation through
> the L1 and L2 caches. Nice work. In-the-wild exploitation is
> difficult, though; your timing gets screwed up if you get scheduled away
> from your victim, and you don't even know, because you can't tell where
> you were scheduled, so on any reasonably busy multiuser system it's not
> clear that the attack is practical.
>
> -andy
Wouldn't scheduling appear as a rather big time delta (in measuring the
cache access times), so you would know to disregard that data point?
(Just wondering... :-) )
-Vadim
On Fri, 2005-05-13 at 20:03 +0200, Andi Kleen wrote:
> This is not a kernel problem, but a user space problem. The fix
> is to change the user space crypto code to need the same number of cache line
> accesses on all keys.
Well, this might not be trivial in general, and as pointed out by Colin
Percival, this would require a major rewrite of the OpenSSL RSA key
generation procedure. He also notes that other applications, a priori
less sensitive, might also be targeted. And obviously, it would be
impractical to ensure this property in all application code.
> Disabling HT for this would be the totally wrong approach, like throwing
> out the baby with the bath water.
Colin also mentions another work-around, at the level of the scheduler:
"[...] action must be taken to ensure that no pair of threads execute
simultaneously on the same processor core if they have different
privileges. Due to the complexities of performing such privilege checks
correctly and based on the principle that security fixes should be
chosen in such a way as to minimize the potential for new bugs to be
introduced, we recommend that existing operating systems provide the
necessary avoidance of inappropriate co-scheduling by never scheduling
any two threads on the same core, i.e., by only scheduling threads on
the first thread associated with each processor core. The more complex
solution of allowing certain "blessed" pairs of threads to be scheduled
on the same processor core is best delayed until future operating
systems where it can be extensively tested. In light of the potential
for information to be leaked across context switches, especially via the
L2 and larger cache(s), we also recommend that operating systems provide
some mechanism for processes to request special "secure" treatment,
which would include flushing all caches upon a context switch. It is not
immediately clear whether it is possible to use the occupancy of the
cache across context switches as a side channel, but if an unprivileged
user can cause his code to pre-empt a cryptographic operation
(e.g., by operating with a higher scheduling priority and being
repeatedly woken up by another process), then there is certainly a
strong possibility of a side
channel existing even in the absence of Hyper-Threading."
Is that relevant to the Linux kernel?
/er.
--
"Sleep, she is for the weak"
http://www.eleves.ens.fr/home/rannaud/
On Fri, 2005-05-13 at 20:03 +0200, Andi Kleen wrote:
> This is not a kernel problem, but a user space problem. The fix
> is to change the user space crypto code to need the same number of cache line
> accesses on all keys.
>
> Disabling HT for this would be the totally wrong approach, like throwing
> out the baby with the bath water.
>
> -Andi
Why? It's certainly reasonable to disable it for the time being and
even prudent to do so.
--
Richard F. Rebel
cat /dev/null > `tty`
> This is not a kernel problem, but a user space problem. The fix
> is to change the user space crypto code to need the same number of cache line
> accesses on all keys.
You actually also need to hit the same cache line sequence on all keys,
not just the same number, if you take a bit more care about it.
> Disabling HT for this would be the totally wrong approach, like throwing
> out the baby with the bath water.
HT for most users is pretty irrelevant; it's a neat idea but the
benchmarks don't suggest it's too big a hit
Alan Cox wrote:
> HT for most users is pretty irrelevant; it's a neat idea but the
> benchmarks don't suggest it's too big a hit
On real-world applications, I haven't seen HT boost performance by more
than 15% on a Pentium 4 -- and the usual gain is around 5%, if anything
at all. HT is a nice idea, but I don't enable it on my systems.
..Scott
On Fri, May 13, 2005 at 11:30:27AM -0700, Vadim Lobanov wrote:
> On Fri, 13 May 2005, Andy Isaacson wrote:
> > It's a side channel timing attack on data-dependent computation through
> > the L1 and L2 caches. Nice work. In-the-wild exploitation is
> > difficult, though; your timing gets screwed up if you get scheduled away
> > from your victim, and you don't even know, because you can't tell where
> > you were scheduled, so on any reasonably busy multiuser system it's not
> > clear that the attack is practical.
>
> Wouldn't scheduling appear as a rather big time delta (in measuring the
> cache access times), so you would know to disregard that data point?
>
> (Just wondering... :-) )
Good question. Yes, you can probably filter the data. The question is,
how hard is it to set up the conditions to acquire the data? You have
to be scheduled on the same core as the target process (sibling
threads). And you don't know when the target is going to be scheduled,
and on a real-world system, there are other threads competing for
scheduling; if it's SMP (2 core, 4 thread) with perfect 100% utilization
then you've only got a 33% chance of being scheduled on the right
thread, and it gets worse if the machine is idle since the kernel should
schedule you and the OpenSSL process on different cores...
Getting the conditions right is challenging. Not impossible, but
neither is it a foregone conclusion.
-andy
On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> On Fri, 2005-05-13 at 20:03 +0200, Andi Kleen wrote:
> > This is not a kernel problem, but a user space problem. The fix
> > is to change the user space crypto code to need the same number of cache line
> > accesses on all keys.
> >
> > Disabling HT for this would be the totally wrong approach, like throwing
> > out the baby with the bath water.
> >
> > -Andi
>
> Why? It's certainly reasonable to disable it for the time being and
> even prudent to do so.
No, I strongly disagree on that. The reasonable thing to do is
to fix the crypto code which has this vulnerability, not break
a useful performance enhancement for everybody else.
-Andi
On Fri, May 13, 2005 at 02:49:25PM -0400, Scott Robert Ladd wrote:
> Alan Cox wrote:
> > HT for most users is pretty irrelevant; it's a neat idea but the
> > benchmarks don't suggest it's too big a hit
>
> On real-world applications, I haven't seen HT boost performance by more
> than 15% on a Pentium 4 -- and the usual gain is around 5%, if anything
> at all. HT is a nice idea, but I don't enable it on my systems.
I saw better improvements in some cases. It always depends on the
workload, on the generation of HT (there are three around), and on lots
of other factors.
Even for your workload alone, it does not seem very rational to me
to throw away a 15% speedup with open eyes.
-Andi
On 05/13/05 02:38:03PM -0400, Richard F. Rebel wrote:
> On Fri, 2005-05-13 at 20:03 +0200, Andi Kleen wrote:
> > This is not a kernel problem, but a user space problem. The fix
> > is to change the user space crypto code to need the same number of cache line
> > accesses on all keys.
> >
> > Disabling HT for this would be the totally wrong approach, like throwing
> > out the baby with the bath water.
> >
> > -Andi
>
> Why? It's certainly reasonable to disable it for the time being and
> even prudent to do so.
And what if you have more than one physical HT processor? AFAIK there's no
way to disable HT and still run SMP at the same time.
>
> --
> Richard F. Rebel
>
> cat /dev/null > `tty`
Jim.
On Fri, 13 May 2005 20:03:58 +0200,
Andi Kleen <[email protected]> wrote:
> This is not a kernel problem, but a user space problem. The fix
> is to change the user space crypto code to need the same number of cache line
> accesses on all keys.
However, they've patched the FreeBSD kernel to "work around?" it:
ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
On Fri, May 13, 2005 at 09:16:09PM +0200, Diego Calleja wrote:
> However they've patched the FreeBSD kernel to "workaround?" it:
> ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
This patch just disables hyperthreading by default.
On Fri, 13 May 2005 14:49:25 -0400, Scott Robert Ladd <[email protected]> wrote:
>Alan Cox wrote:
>> HT for most users is pretty irrelevant; it's a neat idea but the
>> benchmarks don't suggest it's too big a hit
>
>On real-world applications, I haven't seen HT boost performance by more
>than 15% on a Pentium 4 -- and the usual gain is around 5%, if anything
>at all. HT is a nice idea, but I don't enable it on my systems.
P4-HT is great for winxp: a runaway process only gets half the CPU
resources, which keeps the system responsive. I like HT for that reason;
perhaps that's what it was designed for? Hardware fix for msft 'OS' :o)
Recently on single AMD CPU box, 2.6.latest-mm, diff got stuck, no
disk activity, 100% CPU, started another terminal, recompiled kernel
with 8K stacks and rebooted, the whole time the unkillable 'diff'
was using just over 1/2 of resources. top showed all 1GB RAM in use,
no swap activity, nothing odd in /proc/whatever -- only happened once.
I suspected 4k stacks as only change before 'crash' was turning on
samba server day before, but I didn't trace 'problem' as it wasn't
really a crash. Impressive -- seeing 2.6 handling a stupid process,
business as usual for everything else. Haven't had a problem since
changing to 8K stacks. nfs, samba and ssh terminals on reiserfs 3.6
on via sata. May have had nvidia driver installed at the time, I
now load that only when X running (rare), mostly headless use.
--Grant.
On Fri, May 13, 2005 at 03:14:43PM -0400, Jim Crilly wrote:
> And what if you have more than one physical HT processor? AFAIK there's no
> way to disable HT and still run SMP at the same time.
Actually, there is; read my post earlier in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111598859708620&w=2
To elaborate on the "check dmesg" part of that e-mail:
After you reboot with "maxcpus=2" (or however many physical CPU's you
have), you need to make sure you have messages like this, which indicate
that it really worked:
WARNING: No sibling found for CPU 0.
WARNING: No sibling found for CPU 1.
(and so on, if you have more than 2 CPU's)
-Barry K. Nathan <[email protected]>
On Fri, May 13, 2005 at 10:10:36AM -0400, Jeff Garzik wrote:
> Barry K. Nathan wrote:
> >On Fri, May 13, 2005 at 07:51:20AM +0200, Gabor MICSKO wrote:
> >
> >>Does this flaw affect the current stable Linux kernels? Workaround?
> >>Patch?
>
> Simple. Just boot a uniprocessor kernel, and/or disable HT in BIOS.
>
>
> >Some pages with relevant information:
> >http://www.ussg.iu.edu/hypermail/linux/kernel/0403.2/0920.html
> >http://bugzilla.kernel.org/show_bug.cgi?id=2317
>
> These pages have zero information on the "flaw." In fact, I can see no
> information at all proving that there is even a problem here.
I meant that those two URL's have relevant information regarding
disabling HT for those of us who can't simply boot a UP kernel or
disable HT in the BIOS, not that they had information on the flaw.
-Barry K. Nathan <[email protected]>
On Fri, May 13, 2005 at 09:05:49PM +0200, Andi Kleen wrote:
> On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> > Why? It's certainly reasonable to disable it for the time being and
> > even prudent to do so.
>
> No, I strongly disagree on that. The reasonable thing to do is
> to fix the crypto code which has this vulnerability, not break
> a useful performance enhancement for everybody else.
Pardon me for saying so, but that's bullshit. You're asking the crypto
guys to give up a 5x performance gain (that's my wild guess) by giving
up all their data-dependent algorithms and contorting their code wildly,
to avoid a microarchitectural problem with Intel's HT implementation.
There are three places to cut off the side channel, none of which is
obviously the right one.
1. The HT implementation could do the cache tricks Colin suggested in
his paper. Fairly large performance hit to address a fairly small
problem.
2. The OS could do the scheduler tricks to avoid scheduling unfriendly
threads on the same core. You're leaving a lot of the benefit of HT
on the floor by doing so.
3. Every security-sensitive app can be rigorously audited and re-written
to avoid *ever* referencing memory with the address determined by
private data.
(3) is a complete non-starter. It's just not feasible to rewrite all
that code. Furthermore, there's no way to know what code needs to be
rewritten! (Until someone publishes an advisory, that is...)
Hmm, I can't think of any reason that this technique wouldn't work to
extract information from kernel secrets, as well...
If SHA has plaintext-dependent memory references, Colin's technique
would enable an adversary to extract the contents of the /dev/random
pools. I don't *think* SHA does, based on a quick reading of
lib/sha1.c, but someone with an actual clue should probably take a look.
Andi, are you prepared to *require* that no code ever make a memory
reference as a function of a secret? Because that's what you're
suggesting the crypto people should do.
-andy
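[On the SHA question, a sketch of the SHA-1 message schedule, which is
the heart of what lib/sha1.c computes per block: every array index is a
function of the loop counter alone, never of the data being hashed, so
the access pattern leaks nothing about the input through the cache:]

#include <stdint.h>

static inline uint32_t rol32(uint32_t v, int n)
{
        return (v << n) | (v >> (32 - n));
}

void sha1_schedule(uint32_t W[80], const uint32_t block[16])
{
        int t;

        for (t = 0; t < 16; t++)
                W[t] = block[t];
        /* indices t-3, t-8, t-14, t-16 depend only on t */
        for (t = 16; t < 80; t++)
                W[t] = rol32(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
}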
On Fri, May 13, 2005 at 02:26:20PM -0700, Andy Isaacson wrote:
> On Fri, May 13, 2005 at 09:05:49PM +0200, Andi Kleen wrote:
> > On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> > > Why? It's certainly reasonable to disable it for the time being and
> > > even prudent to do so.
> >
> > No, I strongly disagree on that. The reasonable thing to do is
> > to fix the crypto code which has this vulnerability, not break
> > a useful performance enhancement for everybody else.
>
> Pardon me for saying so, but that's bullshit. You're asking the crypto
> guys to give up a 5x performance gain (that's my wild guess) by giving
> up all their data-dependent algorithms and contorting their code wildly,
> to avoid a microarchitectural problem with Intel's HT implementation.
>
> There are three places to cut off the side channel, none of which is
> obviously the right one.
> 1. The HT implementation could do the cache tricks Colin suggested in
> his paper. Fairly large performance hit to address a fairly small
> problem.
> 2. The OS could do the scheduler tricks to avoid scheduling unfriendly
> threads on the same core. You're leaving a lot of the benefit of HT
> on the floor by doing so.
> 3. Every security-sensitive app can be rigorously audited and re-written
> to avoid *ever* referencing memory with the address determined by
> private data.
>
> (3) is a complete non-starter. It's just not feasible to rewrite all
> that code. Furthermore, there's no way to know what code needs to be
> rewritten! (Until someone publishes an advisory, that is...)
>
> Hmm, I can't think of any reason that this technique wouldn't work to
> extract information from kernel secrets, as well...
>
> If SHA has plaintext-dependent memory references, Colin's technique
> would enable an adversary to extract the contents of the /dev/random
> pools. I don't *think* SHA does, based on a quick reading of
> lib/sha1.c, but someone with an actual clue should probably take a look.
SHA1 should be fine, as are the pool mixing bits. Much more
problematic is the ability to do timing attacks against the entropy
gathering itself. If an attacker can guess the TSC value that gets
mixed into the pool, that's a problem.
It might not be much of a problem though. If he's a bit off per guess
(really impressive), he'll still be many bits off by the time there's
enough entropy in the primary pool to reseed the secondary pool so he
can check his guesswork.
--
Mathematics is the supreme nostalgia of our time.
On Fri, 2005-05-13 at 22:59, Matt Mackall wrote:
> It might not be much of a problem though. If he's a bit off per guess
> (really impressive), he'll still be many bits off by the time there's
> enough entropy in the primary pool to reseed the secondary pool so he
> can check his guesswork.
You can also disable the tsc to user space in the intel processors.
That's something they anticipated as being necessary in secure
environments long ago. This makes the attack much harder.
The problem is with the *combination* of fine-grained multithreading,
a shared cache, *and* high-resolution timing via RDTSC.
A far easier fix would be to disable RDTSC.
(A third possibility would be to disable the cache, but I assume that's
too horrible to contemplate.)
When Intel implemented RDTSC, they were quite aware that it made a good
covert channel and provided an enable bit (bit 2 of CR4) to control
user-space access.
This attack is just showing that, with the tight coupling provided
by hyperthreading, it's possible to receive "interesting" data from a
process that is *not* deliberately transmitting. (Whereas the classic
problem enforcing the Bell-LaPadula model comes from preventing
*deliberate* transmission.)
If you don't want to disable it universally, how about providing,
at the OS level, a way for a task to request that RDTSC be disabled
while it is running? If another task tries to use it, it traps and one
of the two (doesn't matter which!) gets rescheduled later when the other
is not running.
If RDTSC is too annoying to disable, just interpret the same flag as a
"schedule me solo" flag that prevents scheduling anything else (at least,
not sharing the same ->mm) on the other virtual processor. (Of course,
the time should count double for scheduler fairness purposes.)
On Fri, 2005-05-13 at 23:47 +0100, Alan Cox wrote:
> On Fri, 2005-05-13 at 22:59, Matt Mackall wrote:
> > It might not be much of a problem though. If he's a bit off per guess
> > (really impressive), he'll still be many bits off by the time there's
> > enough entropy in the primary pool to reseed the secondary pool so he
> > can check his guesswork.
>
> You can also disable the tsc to user space in the intel processors.
> That's something they anticipated as being necessary in secure
> environments long ago. This makes the attack much harder.
And break the hundreds of apps that depend on rdtsc? Am I missing
something?
Lee
On 05/13/05 01:18:40PM -0700, Barry K. Nathan wrote:
> On Fri, May 13, 2005 at 03:14:43PM -0400, Jim Crilly wrote:
> > And what if you have more than one physical HT processor? AFAIK there's no
> > way to disable HT and still run SMP at the same time.
>
> Actually, there is; read my post earlier in this thread:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111598859708620&w=2
>
> To elaborate on the "check dmesg" part of that e-mail:
>
> After you reboot with "maxcpus=2" (or however many physical CPU's you
> have), you need to make sure you have messages like this, which indicate
> that it really worked:
>
> WARNING: No sibling found for CPU 0.
> WARNING: No sibling found for CPU 1.
>
> (and so on, if you have more than 2 CPU's)
But what about machines that don't enumerate physical processors before
logical ones? The comment in setup.c implies that the order in which the
BIOS presents CPUs is undefined, and if you're unlucky enough to have a
machine that presents the CPUs as physical, logical, physical, logical,
etc., you're screwed.
Jim.
On Fri, May 13, 2005 at 07:00:12PM -0400, Lee Revell wrote:
> On Fri, 2005-05-13 at 23:47 +0100, Alan Cox wrote:
> > On Fri, 2005-05-13 at 22:59, Matt Mackall wrote:
> > > It might not be much of a problem though. If he's a bit off per guess
> > > (really impressive), he'll still be many bits off by the time there's
> > > enough entropy in the primary pool to reseed the secondary pool so he
> > > can check his guesswork.
> >
> > You can also disable the tsc to user space in the intel processors.
> > That's something they anticipated as being necessary in secure
> > environments long ago. This makes the attack much harder.
>
> And break the hundreds of apps that depend on rdtsc? Am I missing
> something?
If those apps depend on rdtsc being a) present, and b) working
without providing fallbacks, they're already broken.
There's a reason it's displayed in /proc/cpuinfo's flags field,
and visible through cpuid. Apps should be testing for presence
before assuming features are present.
Dave
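[Dave's point, sketched in C: test CPUID for the TSC feature (leaf 1,
EDX bit 4 -- the bit behind the "tsc" flag in /proc/cpuinfo) and fall
back to gettimeofday() when it is absent. The helper names and the
microsecond timestamp format are illustrative:]

#include <stdint.h>
#include <sys/time.h>

int have_tsc(void)
{
        uint32_t eax, ebx, ecx, edx;

        asm volatile("cpuid"
                     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : "a" (1));
        return edx & (1u << 4);         /* CPUID.1:EDX bit 4 = TSC */
}

uint64_t timestamp(void)
{
        static int tsc = -1;

        if (tsc < 0)
                tsc = have_tsc();       /* probe once, then remember */
        if (tsc) {
                uint32_t lo, hi;
                asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
                return ((uint64_t)hi << 32) | lo;
        } else {
                struct timeval tv;
                gettimeofday(&tv, 0);
                return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
        }
}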
On Fri, 13 May 2005, Andi Kleen wrote:
> No, i strongly disagree on that. The reasonable thing to do is to
> fix the crypto code which has this vulnerability, not break a
> useful performance enhancement for everybody else.
Already done:
http://www.openssl.org/news/secadv_20030317.txt
This is old news it seems, a timing attack that has long been known
about and fixed.
regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
What happens when you cut back the jungle? It recedes.
On Fri, 2005-05-13 at 19:27 -0400, Dave Jones wrote:
> On Fri, May 13, 2005 at 07:00:12PM -0400, Lee Revell wrote:
> > On Fri, 2005-05-13 at 23:47 +0100, Alan Cox wrote:
> > > On Fri, 2005-05-13 at 22:59, Matt Mackall wrote:
> > > > It might not be much of a problem though. If he's a bit off per guess
> > > > (really impressive), he'll still be many bits off by the time there's
> > > > enough entropy in the primary pool to reseed the secondary pool so he
> > > > can check his guesswork.
> > >
> > > You can also disable the tsc to user space in the intel processors.
> > > That's something they anticipated as being necessary in secure
> > > environments long ago. This makes the attack much harder.
> >
> > And break the hundreds of apps that depend on rdtsc? Am I missing
> > something?
>
> If those apps depend on rdtsc being a) present, and b) working
> without providing fallbacks, they're already broken.
>
> There's a reason it's displayed in /proc/cpuinfo's flags field,
> and visible through cpuid. Apps should be testing for presence
> before assuming features are present.
>
Well yes but you would still have to recompile those apps. And take the
big performance hit from using gettimeofday vs rdtsc. Disabling HT by
default looks pretty good by comparison.
Lee
On Fri, May 13, 2005 at 07:38:08PM -0400, Lee Revell wrote:
> On Fri, 2005-05-13 at 19:27 -0400, Dave Jones wrote:
> > On Fri, May 13, 2005 at 07:00:12PM -0400, Lee Revell wrote:
> > > On Fri, 2005-05-13 at 23:47 +0100, Alan Cox wrote:
> > > > On Fri, 2005-05-13 at 22:59, Matt Mackall wrote:
> > > > > It might not be much of a problem though. If he's a bit off per guess
> > > > > (really impressive), he'll still be many bits off by the time there's
> > > > > enough entropy in the primary pool to reseed the secondary pool so he
> > > > > can check his guesswork.
> > > >
> > > > You can also disable the tsc to user space in the intel processors.
> > > > That's something they anticipated as being necessary in secure
> > > > environments long ago. This makes the attack much harder.
> > >
> > > And break the hundreds of apps that depend on rdtsc? Am I missing
> > > something?
> >
> > If those apps depend on rdtsc being a) present, and b) working
> > without providing fallbacks, they're already broken.
> >
> > There's a reason it's displayed in /proc/cpuinfo's flags field,
> > and visible through cpuid. Apps should be testing for presence
> > before assuming features are present.
> >
>
> Well yes but you would still have to recompile those apps.
Not if the app is written correctly. See above.
Dave
On Fri, 13 May 2005, Andy Isaacson wrote:
> On Fri, May 13, 2005 at 09:05:49PM +0200, Andi Kleen wrote:
> > On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> > > Why? It's certainly reasonable to disable it for the time being and
> > > even prudent to do so.
> >
> > No, I strongly disagree on that. The reasonable thing to do is
> > to fix the crypto code which has this vulnerability, not break
> > a useful performance enhancement for everybody else.
>
> Pardon me for saying so, but that's bullshit. You're asking the crypto
> guys to give up a 5x performance gain (that's my wild guess) by giving
> up all their data-dependent algorithms and contorting their code wildly,
> to avoid a microarchitectural problem with Intel's HT implementation.
i think your wild guess is way off. i can think of several approaches to
fix these problems which won't be anywhere near 5x.
the problem is that an attacker can observe which cache indices (rows) are
in use. one workaround is to overload the possible secrets which each
index represents.
you can overload the secrets in each cache line: for example when doing
exponentiation there is an array of bignums x**(2*n). bignums themselves
are arrays (which span multiple cache lines). do a "row/column transpose"
on this array of arrays -- suddenly each cache line contains a number of
possible secrets. if you're operating with 32-bit words in a 64 byte line
then you've achieved a 16-fold reduction in exposed information by this
transpose. there'll be almost no performance penalty.
you can overload the secrets in each cache index: abuse the associativity
of the cache. the affected processors are all 8-way associative.
ideally you'd want to arrange your data so that it all collides within the
same cache index -- and get an 8-fold reduction in exposure. the trick
here is the L2 is physically indexed, and userland code can perform only
virtual allocations. but it's not too hard to discover physical conflicts
if you really want to (using rdtsc) -- it would be done early in the
initialization of the program because it involves asking for enough memory
until the kernel gives you enough colliding pages. (a system call could
help with this if we really wanted it.)
my not-so-wild guess is a 128-fold reduction for less than 10% perf hit...
i think there's possibly another approach involving a permuted array of
indirection pointers... which is going to affect perf a bit due to the
extra indirection required, but we're talking <10% here. (i'm just not
convinced yet you can select a permutation in a manner which doesn't leak
information when the attacker can view multiple invocations of the crypto
for example.)
> If SHA has plaintext-dependent memory references, Colin's technique
> would enable an adversary to extract the contents of the /dev/random
> pools. I don't *think* SHA does, based on a quick reading of
> lib/sha1.c, but someone with an actual clue should probably take a look.
the SHA family do not have any data-dependencies in their memory access
patterns.
-dean
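[One concrete reading of dean's transpose suggestion: store the window
table for exponentiation as table[word][index] rather than
table[index][word]. Each 64-byte line then holds one word of 16
different entries, so the sequence of lines touched is identical
whichever index is used. Sizes and names are illustrative:]

#include <stdint.h>

#define WINDOW  16      /* precomputed bignums x**(2n) */
#define WORDS   32      /* 32-bit words per bignum (1024 bits) */

/* transposed layout: word w of all WINDOW entries shares a cache line */
static uint32_t table[WORDS][WINDOW];

void store_entry(unsigned idx, const uint32_t bn[WORDS])
{
        unsigned w;

        for (w = 0; w < WORDS; w++)
                table[w][idx] = bn[w];
}

void load_entry(unsigned idx, uint32_t bn[WORDS])
{
        unsigned w;

        /* each iteration reads a line holding all 16 entries' word w,
         * so the line trace is the same for every idx */
        for (w = 0; w < WORDS; w++)
                bn[w] = table[w][idx];
}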
On Fri, 2005-05-13 at 19:44 -0400, Dave Jones wrote:
> On Fri, May 13, 2005 at 07:38:08PM -0400, Lee Revell wrote:
> > On Fri, 2005-05-13 at 19:27 -0400, Dave Jones wrote:
> > > On Fri, May 13, 2005 at 07:00:12PM -0400, Lee Revell wrote:
> > > > On Fri, 2005-05-13 at 23:47 +0100, Alan Cox wrote:
> > > > > On Fri, 2005-05-13 at 22:59, Matt Mackall wrote:
> > > > > > It might not be much of a problem though. If he's a bit off per guess
> > > > > > (really impressive), he'll still be many bits off by the time there's
> > > > > > enough entropy in the primary pool to reseed the secondary pool so he
> > > > > > can check his guesswork.
> > > > >
> > > > > You can also disable the tsc to user space in the intel processors.
> > > > > That's something they anticipated as being necessary in secure
> > > > > environments long ago. This makes the attack much harder.
> > > >
> > > > And break the hundreds of apps that depend on rdtsc? Am I missing
> > > > something?
> > >
> > > If those apps depend on rdtsc being a) present, and b) working
> > > without providing fallbacks, they're already broken.
> > >
> > > There's a reason it's displayed in /proc/cpuinfo's flags field,
> > > and visible through cpuid. Apps should be testing for presence
> > > before assuming features are present.
> > >
> >
> > Well yes but you would still have to recompile those apps.
>
> Not if the app is written correctly. See above.
The apps that bother to use rdtsc vs. gettimeofday need a cheap high res
timer more than a correct one anyway - it's not guaranteed that rdtsc
provides a reliable time source at all, due to SMP and frequency scaling
issues.
I'll try to benchmark the difference. Maybe it's not that big a deal.
Lee
On Fri, 2005-05-13 at 22:51 +0000, [email protected] wrote:
> If RDTSC is too annoying to disable, just interpret the same flag as a
> "schedule me solo" flag that prevents scheduling anything else (at least,
> not sharing the same ->mm) on the other virtual processor. (Of course,
> the time should count double for scheduler fairness purposes.)
rdtsc is so unreliable on current hardware that no userspace app should
be using it anyway; it's not synchronized on SMP, power management
impacts the rate of the ticks all the time, etc. etc.
Basically it's worthless on modern machines for anything but in-kernel
busy loops.
On Sat, 2005-05-14 at 00:38, Lee Revell wrote:
> Well yes but you would still have to recompile those apps. And take the
> big performance hit from using gettimeofday vs rdtsc. Disabling HT by
> default looks pretty good by comparison.
You cannot use rdtsc for anything but rough instruction timing. The
timers for different processors run at different speeds on some SMP
systems, and the timer rates vary as processors change clock rate nowadays.
Rdtsc may also jump dramatically on a suspend/resume.
If the app uses rdtsc then generally speaking it's terminally broken. The
only exception is some profiling tools.
On Sat, May 14, 2005 at 03:37:18AM -0400, Lee Revell wrote:
> The apps that bother to use rdtsc vs. gettimeofday need a cheap high res
> timer more than a correct one anyway - it's not guaranteed that rdtsc
> provides a reliable time source at all, due to SMP and frequency scaling
> issues.
On x86-64 the cost of gettimeofday is the same as the tsc; turning off
the tsc on x86-64 is not nice (even if we usually have HPET there, so
perhaps it wouldn't be too bad). The TSC is something only the kernel (or a
person with some kernel/hardware knowledge) can use safely, knowing it'll
work fine. But on x86-64 parts of the kernel run in userland...
Preventing tasks with different uids from running on the same physical cpu
was my first idea, disabled by default via sysctl, so that only the
paranoid would enable it.
But before touching the kernel in any way, it would be really nice if
somebody could bother to demonstrate this is real, because I have a hard
time believing this is not purely vapourware. In artificial
environments a computer can recognize the difference between two faces
too, no big deal, but that doesn't mean the same software is going to
recognize millions of different faces at the airport too. So nothing has
been demonstrated in practical terms yet.
Nobody runs openssl -sign thousands of times in a row on a purely idle
system without noticing the 100% load on the other cpu for months (and
he's not root, so he can't hide his runaway 100% process; if he were root
and could modify the kernel or ps/top to hide the runaway process,
he'd have faster ways to sniff).
So to me this sounds like a purely theoretical problem. Cache covert
channels are possible too, as the paper states; next time somebody will
find how to sniff a letter out of a pdf document on a UP no-HT system by
opening and closing it some millions of times on an otherwise idle system.
We're sure not going to flush the l2 cache because of that (at least not
by default ;).
This was an interesting read, but in practice I'd rate this to have
severity 1 on a 0-100 scale, unless somebody bothers to demonstrate it
in a remotely realistic environment.
Even if this were real, if they sniff an openssl key, unless they also
crack the dns the browser will complain (not very different from not
having a certificate authority signature on a fake key). And if the
server is remotely serious they'll notice the 100% runaway process way
before he can sniff the whole key (the 100% runaway load cannot be
hidden). Most servers have some statistics, so a 100% load for weeks or
months isn't very likely to be overlooked.
On Sat, May 14, 2005 at 04:23:10PM +0100, Alan Cox wrote:
> You cannot use rdtsc for anything but rough instruction timing. The
> timers for different processors run at different speeds on some SMP
> systems, and the timer rates vary as processors change clock rate nowadays.
> Rdtsc may also jump dramatically on a suspend/resume.
x86-64 uses it for vgettimeofday very safely (i386 could do too but it
doesn't).
Anyway, I believe at least for seccomp it's worth turning off the tsc,
not just for HT but for the L2 cache too. So it's up to you: either you
turn it off completely (which isn't very nice IMHO), or I recommend
applying the patch below. This has been tested successfully on x86-64
against the current cogito repository (i686 compiles, so I didn't bother
testing ;). People selling their cpu through cpushare may appreciate this
bit for peace of mind. There's no way to get any timing info anymore
with this applied (gettimeofday is forbidden of course). The seccomp
environment is completely deterministic, so it can't be allowed to get
timing info. It has to be deterministic so that in the future I can
enable a computing mode that does parallel computing for each task, with
server side transparent checkpointing and verification that the output is
the same from all the 2/3 seller computers for each task, without the
buyer even noticing (for now the verification is left to the buyer client
side and there's no checkpointing, since that would require more kernel
changes to track the dirty bits, but it'll be easy to extend once the
basic mode is finished).
Thanks.
Signed-off-by: Andrea Arcangeli <[email protected]>
Index: arch/i386/kernel/process.c
===================================================================
--- eed337ef5e9ae7d62caa84b7974a11fddc7f06e0/arch/i386/kernel/process.c (mode:100644)
+++ uncommitted/arch/i386/kernel/process.c (mode:100644)
@@ -561,6 +561,25 @@
}
/*
+ * This function selects if the context switch from prev to next
+ * has to tweak the TSC disable bit in the cr4.
+ */
+static void disable_tsc(struct thread_info *prev,
+ struct thread_info *next)
+{
+ if (unlikely(has_secure_computing(prev) ||
+ has_secure_computing(next))) {
+ /* slow path here */
+ if (has_secure_computing(prev) &&
+ !has_secure_computing(next)) {
+ clear_in_cr4(X86_CR4_TSD);
+ } else if (!has_secure_computing(prev) &&
+ has_secure_computing(next))
+ set_in_cr4(X86_CR4_TSD);
+ }
+}
+
+/*
* switch_to(x,yn) should switch tasks from x to y.
*
* We fsave/fwait so that an exception goes off at the right time
@@ -639,6 +658,8 @@
if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr))
handle_io_bitmap(next, tss);
+ disable_tsc(prev_p->thread_info, next_p->thread_info);
+
return prev_p;
}
Index: arch/x86_64/kernel/process.c
===================================================================
--- eed337ef5e9ae7d62caa84b7974a11fddc7f06e0/arch/x86_64/kernel/process.c (mode:100644)
+++ uncommitted/arch/x86_64/kernel/process.c (mode:100644)
@@ -439,6 +439,25 @@
}
/*
+ * This function selects if the context switch from prev to next
+ * has to tweak the TSC disable bit in the cr4.
+ */
+static void disable_tsc(struct thread_info *prev,
+ struct thread_info *next)
+{
+ if (unlikely(has_secure_computing(prev) ||
+ has_secure_computing(next))) {
+ /* slow path here */
+ if (has_secure_computing(prev) &&
+ !has_secure_computing(next)) {
+ clear_in_cr4(X86_CR4_TSD);
+ } else if (!has_secure_computing(prev) &&
+ has_secure_computing(next))
+ set_in_cr4(X86_CR4_TSD);
+ }
+}
+
+/*
* This special macro can be used to load a debugging register
*/
#define loaddebug(thread,r) set_debug(thread->debugreg ## r, r)
@@ -556,6 +575,8 @@
}
}
+ disable_tsc(prev_p->thread_info, next_p->thread_info);
+
return prev_p;
}
Index: include/linux/seccomp.h
===================================================================
--- eed337ef5e9ae7d62caa84b7974a11fddc7f06e0/include/linux/seccomp.h (mode:100644)
+++ uncommitted/include/linux/seccomp.h (mode:100644)
@@ -19,6 +19,11 @@
__secure_computing(this_syscall);
}
+static inline int has_secure_computing(struct thread_info *ti)
+{
+ return unlikely(test_ti_thread_flag(ti, TIF_SECCOMP));
+}
+
#else /* CONFIG_SECCOMP */
#if (__GNUC__ > 2)
@@ -28,6 +33,7 @@
#endif
#define secure_computing(x) do { } while (0)
+#define has_secure_computing(x) 0
#endif /* CONFIG_SECCOMP */
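[For context, a hypothetical user-space view of the patch, assuming the
/proc/<pid>/seccomp interface used by the seccomp patches of this era.
Note the patch flips CR4.TSD at context-switch time, so it takes effect
once the task has been scheduled out and back in:]

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/self/seccomp", "w");

        if (!f || fputs("1", f) == EOF || fflush(f) == EOF) {
                perror("seccomp");
                return 1;
        }
        /* from here on only read/write/exit/sigreturn are allowed, and
         * with the patch above this rdtsc eventually traps (SIGSEGV)
         * instead of returning a timestamp */
        asm volatile("rdtsc" ::: "eax", "edx");
        return 0;
}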
On Sat, 14 May 2005, Paul Jakma wrote:
> http://www.openssl.org/news/secadv_20030317.txt
>
> This is old news it seems, a timing attack that has long been known
> about and fixed.
I've now been told it's a new, more involved timing attack than the
one the URL above describes a defence against.
regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
Weinberg's First Law:
Progress is only made on alternate Fridays.
On Sat, 2005-05-14 at 16:23 +0100, Alan Cox wrote:
> On Sat, 2005-05-14 at 00:38, Lee Revell wrote:
> > Well yes but you would still have to recompile those apps. And take the
> > big performance hit from using gettimeofday vs rdtsc. Disabling HT by
> > default looks pretty good by comparison.
>
> You cannot use rdtsc for anything but rough instruction timing. The
> timers for different processors run at different speeds on some SMP
> systems, and the timer rates vary as processors change clock rate nowadays.
> Rdtsc may also jump dramatically on a suspend/resume.
>
> If the app uses rdtsc then generally speaking it's terminally broken. The
> only exception is some profiling tools.
That is basically all JACK and mplayer use it for. They have RT
constraints and the tsc is used to know if we got woken up too late and
should just drop some frames. The developers are aware of the issues
with rdtsc and have chosen to use it anyway because these apps need
every ounce of CPU and cannot tolerate the overhead of gettimeofday().
Lee
On Sat, 2005-05-14 at 12:30 -0400, Lee Revell wrote:
> On Sat, 2005-05-14 at 16:23 +0100, Alan Cox wrote:
> > On Sat, 2005-05-14 at 00:38, Lee Revell wrote:
> > > Well yes but you would still have to recompile those apps. And take the
> > > big performance hit from using gettimeofday vs rdtsc. Disabling HT by
> > > default looks pretty good by comparison.
> >
> > You cannot use rdtsc for anything but rough instruction timing. The
> > timers for different processors run at different speeds on some SMP
> > systems, and the timer rates vary as processors change clock rate nowadays.
> > Rdtsc may also jump dramatically on a suspend/resume.
> >
> > If the app uses rdtsc then generally speaking it's terminally broken. The
> > only exception is some profiling tools.
>
> That is basically all JACK and mplayer use it for. They have RT
> constraints and the tsc is used to know if we got woken up too late and
> should just drop some frames. The developers are aware of the issues
> with rdtsc and have chosen to use it anyway because these apps need
> every ounce of CPU and cannot tolerate the overhead of gettimeofday().
then JACK is terminally broken if it doesn't have a fallback for non-
rdtsc cpus.
Lee Revell wrote:
> On Sat, 2005-05-14 at 16:23 +0100, Alan Cox wrote:
>
>>On Sat, 2005-05-14 at 00:38, Lee Revell wrote:
>>
>>>Well yes but you would still have to recompile those apps. And take the
>>>big performance hit from using gettimeofday vs rdtsc. Disabling HT by
>>>default looks pretty good by comparison.
>>
>>You cannot use rdtsc for anything but rough instruction timing. The
>>timers for different processors run at different speeds on some SMP
>>systems, and the timer rates vary as processors change clock rate nowadays.
>>Rdtsc may also jump dramatically on a suspend/resume.
>>
>>If the app uses rdtsc then generally speaking it's terminally broken. The
>>only exception is some profiling tools.
>
>
> That is basically all JACK and mplayer use it for. They have RT
> constraints and the tsc is used to know if we got woken up too late and
> should just drop some frames. The developers are aware of the issues
> with rdtsc and have chosen to use it anyway because these apps need
> every ounce of CPU and cannot tolerate the overhead of gettimeofday().
AFAIK, mplayer actually uses gettimeofday(). rdtsc is used in some
places for profiling and debugging purposes and not compiled in by default.
--
Jindrich Makovicka
On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> then JACK is terminally broken if it doesn't have a fallback for non-
> rdtsc cpus.
It does have a fallback, but the selection is done at compile time. It
uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
Maybe we should check at runtime, but this has always worked.
Lee
On Sat, 2005-05-14 at 13:56 -0400, Lee Revell wrote:
> On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> > then JACK is terminally broken if it doesn't have a fallback for non-
> > rdtsc cpus.
>
> It does have a fallback, but the selection is done at compile time. It
> uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
>
> Maybe we should check at runtime,
it's probably a sign that JACK isn't used on SMP systems much, at least
not on the bigger systems (like IBM's x440's) where the tsc *will*
differ wildly between cpus...
On Sat, 2005-05-14 at 19:04 +0200, Jindrich Makovicka wrote:
> AFAIK, mplayer actually uses gettimeofday(). rdtsc is used in some
> places for profiling and debugging purposes and not compiled in by default.
>
OK. The comments in the JACK code say it was copied from mplayer. I
guess the usage is not the same.
Lee
On Sat, 2005-05-14 at 20:01 +0200, Arjan van de Ven wrote:
> On Sat, 2005-05-14 at 13:56 -0400, Lee Revell wrote:
> > On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> > > then JACK is terminally broken if it doesn't have a fallback for non-
> > > rdtsc cpus.
> >
> > It does have a fallback, but the selection is done at compile time. It
> > uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
> >
> > Maybe we should check at runtime,
>
> it's probably a sign that JACK isn't used on SMP systems much, at least
> not on the bigger systems (like IBM's x440's) where the tsc *will*
> differ wildly between cpus...
Correct. The only bug reports we have seen related to the use of the
TSC are due to CPU frequency scaling. The fix is to not use it - people
who want to use their PC as a DSP for audio probably don't want their
processor slowing down anyway. And JACK is targeted at desktop and
smaller systems; it would be kind of crazy to run it on big iron.
Well, maybe there are people who like to record sessions or practice
guitar in the server room...
If gettimeofday is really as cheap as rdtsc on x86_64, we should use it.
But it's too expensive for slower x86 systems. Anyway, Andrea's fix
disables *all* high res timing including gettimeofday. Obviously no
multimedia app can tolerate this, so discussing rdtsc is really a red
herring. But multimedia apps aren't likely to run in seccomp
environments anyway.
Lee
On Sat, 2005-05-14 at 15:21 -0400, Lee Revell wrote:
> On Sat, 2005-05-14 at 20:01 +0200, Arjan van de Ven wrote:
> > On Sat, 2005-05-14 at 13:56 -0400, Lee Revell wrote:
> > > On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> > > > then JACK is terminally broken if it doesn't have a fallback for non-
> > > > rdtsc cpus.
> > >
> > > It does have a fallback, but the selection is done at compile time. It
> > > uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
> > >
> > > Maybe we should check at runtime,
> >
> > it's probably a sign that JACK isn't used on SMP systems much, at least
> > not on the bigger systems (like IBM's x440's) where the tsc *will*
> > differ wildly between cpus...
>
> Correct. The only bug reports we have seen related to the use of the
> TSC are due to CPU frequency scaling. The fix is to not use it - people
> who want to use their PC as a DSP for audio probably don't want their
> processor slowing down anyway.
it's a matter of time (my estimate is a year or two) before processors
get variable frequencies based on temperature targets etc...
and then rdtsc is really useless for this kind of thing..
On Sat, 2005-05-14 at 21:48 +0200, Arjan van de Ven wrote:
> On Sat, 2005-05-14 at 15:21 -0400, Lee Revell wrote:
> > On Sat, 2005-05-14 at 20:01 +0200, Arjan van de Ven wrote:
> > > On Sat, 2005-05-14 at 13:56 -0400, Lee Revell wrote:
> > > > On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> > > > > then JACK is terminally broken if it doesn't have a fallback for non-
> > > > > rdtsc cpus.
> > > >
> > > > It does have a fallback, but the selection is done at compile time. It
> > > > uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
> > > >
> > > > Maybe we should check at runtime,
> > >
> > > it's probably a sign that JACK isn't used on SMP systems much, at least
> > > not on the bigger systems (like IBM's x440's) where the tsc *will*
> > > differ wildly between cpus...
> >
> > Correct. The only bug reports we have seen related to the use of the
> > TSC are due to CPU frequency scaling. The fix is to not use it - people
> > who want to use their PC as a DSP for audio probably don't want their
> > processor slowing down anyway.
>
> it's a matter of time (my estimate is a year or two) before processors
> get variable frequencies based on temperature targets etc...
> and then rdtsc is really useless for this kind of thing..
I was under the impression that P4 and later processors do not vary the
TSC rate when doing frequency scaling. This is mentioned in the
documentation for the high res timers patch.
Lee
Andrea Arcangeli <[email protected]> writes:
> Nobody runs openssl -sign thousands of times in a row on a purely idle
> system without noticing the 100% load on the other cpu for months
Well, actually one does. On a normal https server, each https request
results in an operation on the private key. So if the attacker shares
the same web server as the victim it's probably rather easy for the
attacker to see when the machine is idle and launch an attack giving
him thousands of chances to spy on the victim.
But I do agree that this probably isn't all that serious; those
who really have secrets to hide won't run their https server on
a machine shared with anybody else.
/Christer
--
"Just how much can I get away with and still go to heaven?"
Freelance consultant specializing in device driver programming for Linux
Christer Weinigel <[email protected]> http://www.weinigel.se
On Sat, 14 May 2005, Arjan van de Ven wrote:
> it's a matter of time (my estimate is a year or two) before processors
> get variable frequencies based on temperature targets etc...
> and then rdtsc is really useless for this kind of thing..
what do you mean "a year or two"? processors have been doing this for
many years now.
i'm biased, but i still think transmeta did this the right way... the tsc
operates at the top frequency of the processor always.
i do a hell of a lot of microbenchmarking on various processors and i
always use tsc -- but i'm just smart enough to take multiple samples and i
try to make each sample smaller than a time slice... which avoids most of
the pitfalls, and would even work on smp boxes with tsc differences.
-dean
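[dean's sampling approach, sketched: time many short runs and keep the
minimum, so any sample inflated by preemption or an unlucky migration
simply drops out. The measured routine is a stand-in:]

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

static void work_under_test(void)
{
        /* stand-in for the code being measured */
        volatile int x = 0;
        int i;

        for (i = 0; i < 100; i++)
                x += i;
}

int main(void)
{
        uint64_t best = ~(uint64_t)0;
        int i;

        for (i = 0; i < 1000; i++) {
                uint64_t t0 = rdtsc();
                work_under_test();
                uint64_t dt = rdtsc() - t0;
                if (dt < best)          /* outliers from preemption drop out */
                        best = dt;
        }
        printf("min: %llu cycles\n", (unsigned long long)best);
        return 0;
}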
On Sat, 2005-05-14 at 19:40 -0400, Lee Revell wrote:
> > it's a matter of time (my estimate is a year or two) before processors
> > get variable frequencies based on temperature targets etc...
> > and then rdtsc is really useless for this kind of thing..
>
> I was under the impression that P4 and later processors do not vary the
> TSC rate when doing frequency scaling. This is mentioned in the
> documentation for the high res timers patch.
Seems not to be the case, and worse, during idle time the clock is allowed
to stop entirely... (and that is also happening more and more, as linux is
getting more aggressive idle support (eg no timer tick and such patches),
which will trigger bios thresholds for this even more too).
On Fri, May 13, 2005 at 12:02:44PM -0700, Andy Isaacson wrote:
> On Fri, May 13, 2005 at 11:30:27AM -0700, Vadim Lobanov wrote:
> > On Fri, 13 May 2005, Andy Isaacson wrote:
> > > It's a side channel timing attack on data-dependent computation through
> > > the L1 and L2 caches. Nice work. In-the-wild exploitation is
> > > difficult, though; your timing gets screwed up if you get scheduled away
> > > from your victim, and you don't even know, because you can't tell where
> > > you were scheduled, so on any reasonably busy multiuser system it's not
> > > clear that the attack is practical.
> >
> > Wouldn't scheduling appear as a rather big time delta (in measuring the
> > cache access times), so you would know to disregard that data point?
> >
> > (Just wondering... :-) )
>
> Good question. Yes, you can probably filter the data. The question is,
> how hard is it to set up the conditions to acquire the data? You have
> to be scheduled on the same core as the target process (sibling
> threads). And you don't know when the target is going to be scheduled,
> and on a real-world system, there are other threads competing for
> scheduling; if it's SMP (2 core, 4 thread) with perfect 100% utilization
> then you've only got a 33% chance of being scheduled on the right
> thread, and it gets worse if the machine is idle since the kernel should
> schedule you and the OpenSSL process on different cores...
>...
But if you start 3 processes in the idle case you might get a 100%
chance?
> -andy
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
On Sat, May 14, 2005 at 01:56:36PM -0400, Lee Revell wrote:
> On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> > then JACK is terminally broken if it doesn't have a fallback for non-
> > rdtsc cpus.
>
> It does have a fallback, but the selection is done at compile time. It
> uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
>
> Maybe we should check at runtime, but this has always worked.
If this is critical for JACK, runtime selection would be an improvement
for distributions like Debian that support both pre-i586 SMP systems and
current hardware.
> Lee
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
> I was under the impression that P4 and later processors do not vary the
> TSC rate when doing frequency scaling. This is mentioned in the
> documentation for the high res timers patch.
Prescott and later do not vary the TSC, but P4s before that do.
On x86-64 it is true because only Nocona is supported, which has a
p-state invariant TSC.
The latest x86-64 kernel has a special X86_CONSTANT_TSC internal
CPUID bit, which is set in that case. If some other subsystem wants to
use it, I would recommend porting that to i386 too.
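For illustration, user space could check for it roughly like this (a
sketch only; it assumes the flag is exported as "constant_tsc" in the
/proc/cpuinfo flags line, which depends on the kernel version):

/* scan /proc/cpuinfo for the constant_tsc flag before trusting
 * the TSC across frequency changes */
#include <stdio.h>
#include <string.h>

int has_constant_tsc(void)
{
        char line[1024];
        FILE *f = fopen("/proc/cpuinfo", "r");
        int found = 0;

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "flags", 5) &&
                    strstr(line, " constant_tsc"))
                        found = 1;
        fclose(f);
        return found;
}

int main(void)
{
        printf("constant_tsc: %s\n", has_constant_tsc() ? "yes" : "no");
        return 0;
}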
-Andi
On Fri, May 13, 2005 at 02:26:20PM -0700, Andy Isaacson wrote:
> On Fri, May 13, 2005 at 09:05:49PM +0200, Andi Kleen wrote:
> > On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> > > Why? It's certainly reasonable to disable it for the time being and
> > > even prudent to do so.
> >
> > No, i strongly disagree on that. The reasonable thing to do is
> > to fix the crypto code which has this vulnerability, not break
> > a useful performance enhancement for everybody else.
>
> Pardon me for saying so, but that's bullshit. You're asking the crypto
> guys to give up a 5x performance gain (that's my wild guess) by giving
> up all their data-dependent algorithms and contorting their code wildly,
> to avoid a microarchitectural problem with Intel's HT implementation.
And what you're doing is asking all the non-crypto guys to give
up a useful optimization just to fix a problem in the crypto guys'
code. The cache line information leak is just an information leak
bug in the crypto code, not a general problem.
There is much more non-crypto code than crypto code around - you
are proposing to screw the majority of code to solve a relatively
obscure problem in only a few functions, which seems like the totally
wrong approach to me.
BTW the crypto guys are always free to check for hyperthreading
themselves and use different functions. However there is a catch
there - the modern dual core processors which actually have
separate L1 and L2 caches set these CPUID bits too, to stay
compatible with old code and license managers.
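For illustration, such a check could look roughly like this (a sketch,
not anyone's actual code -- and because of the catch above, a set HTT
bit only means "multiple logical processors per package", not "shared
caches"):

/* CPUID leaf 1: EDX bit 28 is the HTT bit, EBX bits 23:16 the
 * number of logical processors per physical package */
#include <stdio.h>

static void cpuid(unsigned op, unsigned *a, unsigned *b,
                  unsigned *c, unsigned *d)
{
        __asm__ __volatile__("cpuid"
                             : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                             : "a" (op));
}

int main(void)
{
        unsigned a, b, c, d;

        cpuid(1, &a, &b, &c, &d);
        printf("HTT bit: %s, logical CPUs per package: %u\n",
               (d >> 28) & 1 ? "set" : "clear", (b >> 16) & 0xff);
        return 0;
}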
-Andi
On Sat, May 14, 2005 at 05:33:07PM +0200, Andrea Arcangeli wrote:
> On Sat, May 14, 2005 at 03:37:18AM -0400, Lee Revell wrote:
> > The apps that bother to use rdtsc vs. gettimeofday need a cheap high res
> > timer more than a correct one anyway - it's not guaranteed that rdtsc
> > provides a reliable time source at all, due to SMP and frequency scaling
> > issues.
>
> On x86-64 the cost of gettimeofday is the same of the tsc, turning off
It depends; on many systems it is more costly, e.g. on many SMP
systems we have to use HPET or even the PM timer, because the TSC is
not reliable.
> tsc on x86-64 is not nice (even if we usually have HPET there, so
> perhaps it wouldn't be too bad). TSC is something only the kernel (or a
> person with some kernel/hardware knowledge) can do safely knowing it'll
> work fine. But on x86-64 parts of the kernel runs in userland...
Agreed. It is quite complicated to decide if TSC is reliable or not
and I would not recommend user space to do this.
[hmm actually I already have constant_tsc fake cpuid bit, but
it only refers to single CPUs. I wonder if I should add another
one for SMP "synchronized_tsc". The latest mm code already has
this information, but it does not export it yet]
>
> Preventing tasks with different uid to run on the same physical cpu was
> my first idea, disabled by default via sysctl, so only if one is
> paranoid can enable it.
The paranoid should just fix their crypto code. And if they're
clinically paranoid they can always boot with noht or disable
it in the BIOS. But really I think they should just fix OpenSSL.
>
> But before touching the kernel in any way it would be really nice if
> somebody could bother to demonstrate this is real because I've an hard
> time to believe this is not purely vapourware. On artificial
Similar feeling here.
> Nobody runs openssl -sign thousand of times in a row on a pure idle
> system without noticing the 100% load on the other cpu for months (and
> he's not root so he can't hide his runaway 100% process, if he was root
> and he could modify the kernel or ps/top to hide the runaway process,
> he'd have faster ways to sniff).
Exactly.
>
> So to me this sounds a purerly theoretical problem. Cache covert
Perhaps not purely theoretical, but it is certainly not something
that needs drastic action like disabling HT in general.
> This was an interesting read, but in practice I'd rate this to have
> severity 1 on a 0-100 scale, unless somebody bothers to demonstrate it
> in a remotely realistic environment.
Agreed.
-Andi
On Fri, May 13, 2005 at 09:16:09PM +0200, Diego Calleja wrote:
> El Fri, 13 May 2005 20:03:58 +0200,
> Andi Kleen <[email protected]> escribió:
>
>
> > This is not a kernel problem, but a user space problem. The fix
> > is to change the user space crypto code to need the same number of cache line
> > accesses on all keys.
>
>
> However they've patched the FreeBSD kernel to "workaround?" it:
> ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
That's a similarly stupid idea to what they did with the disk write
cache (lowering the MTBF of their disks by considerable factors,
which is much worse than the power-off data loss problem).
Let's not go down this path please.
-Andi
On Sat, May 14, 2005 at 12:30:28PM -0400, Lee Revell wrote:
> On Sat, 2005-05-14 at 16:23 +0100, Alan Cox wrote:
> > On Sad, 2005-05-14 at 00:38, Lee Revell wrote:
> > > Well yes but you would still have to recompile those apps. And take the
> > > big performance hit from using gettimeofday vs rdtsc. Disabling HT by
> > > default looks pretty good by comparison.
> >
> > You cannot use rdtsc for anything but rough instruction timing. The
> > timers for different processors run at different speeds on some SMP
> > systems, the timer rates vary as processors change clock rate nowdays.
> > Rdtsc may also jump dramatically on a suspend/resume.
> >
> > If the app uses rdtsc then generally speaking its terminally broken. The
> > only exception is some profiling tools.
>
> That is basically all JACK and mplayer use it for. They have RT
> constraints and the tsc is used to know if we got woken up too late and
> should just drop some frames. The developers are aware of the issues
> with rdtsc and have chosen to use it anyway because these apps need
> every ounce of CPU and cannot tolerate the overhead of gettimeofday().
I would consider jack broken then. For one, it breaks
on Centrinos and on AMD systems with PowerNow and some others, which
all have frequency scaling with a non-p-state-invariant TSC.
As an additional problem, the modern Opterons which support SMP
PowerNow can even have completely different TSC frequencies
on different CPUs.
All I can recommend is to use gettimeofday() for this. The kernel
goes to considerable pains to make gettimeofday() fast, and when
it is not fast then the system in general cannot do it better.
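For what these apps need, a gettimeofday()-based check is simple enough
(a made-up sketch; the period and lateness threshold are invented, not
JACK's actual logic):

/* detect a late wakeup with gettimeofday() and drop frames,
 * instead of trusting rdtsc across p-states */
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static long long usecs(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

int main(void)
{
        const long long period = 10000;  /* e.g. a 10ms audio period */
        long long deadline = usecs() + period;

        for (;;) {
                long long now;

                usleep(period);
                now = usecs();
                if (now > deadline + period / 2)
                        fprintf(stderr, "late by %lld us, drop a frame\n",
                                now - deadline);
                deadline += period;
        }
}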
-Andi
On Sat, May 14, 2005 at 08:01:33PM +0200, Arjan van de Ven wrote:
> On Sat, 2005-05-14 at 13:56 -0400, Lee Revell wrote:
> > On Sat, 2005-05-14 at 18:44 +0200, Arjan van de Ven wrote:
> > > then JACK is terminally broken if it doesn't have a fallback for non-
> > > rdtsc cpus.
> >
> > It does have a fallback, but the selection is done at compile time. It
> > uses rdtsc for all x86 CPUs except pre-i586 SMP systems.
> >
> > Maybe we should check at runtime,
>
> it's probably a sign that JACK isn't used on SMP systems much, at least
> not on the bigger systems (like IBM's x440's) where the tsc *will*
> differ wildly between cpus...
It does not even need SMP, just use a Centrino laptop.
I suppose what the JACK guys are doing is recommending that people
disable frequency scaling, and then the sound guys complain again
that sound on linux is so hard to use. I wonder where that comes from? :)
-Andi
Hi all,
I am running a 2.6.4 kernel on my system, and I am playing a little bit
with kernel time issues and helper functions, just to understand how
things really work.
While doing that on my x86 system, I loaded a module from LDD 3rd
edition, jit.c, which uses a dynamic /proc file to return textual
information.
The info it returns is in this format and uses the kernel functions
do_gettimeofday, current_kernel_time and jiffies_to_timespec.
The output format is:
0x0009073c 0x000000010009073c 1116162967.247441
1116162967.246530656 591.586065248
0x0009073c 0x000000010009073c 1116162967.247463
1116162967.246530656 591.586065248
0x0009073c 0x000000010009073c 1116162967.247476
1116162967.246530656 591.586065248
0x0009073c 0x000000010009073c 1116162967.247489
1116162967.246530656 591.586065248
where the first two values are jiffies and jiffies_64. The next two are
do_gettimeofday and current_kernel_time, and the last value is
jiffies_to_timespec. This output text was "recorded" after 16 minutes of
uptime. Shouldn't the last value be the same as the uptime? I have
attached an output file from boot time until the time the function
resets the struct and starts counting from the beginning. Is this a bug,
or am I missing something here?
Best regards,
Chris.
On Sat, 14 May 2005 [email protected] wrote:
> On Sat, May 14, 2005 at 04:23:10PM +0100, Alan Cox wrote:
> > You cannot use rdtsc for anything but rough instruction timing. The
> > timers for different processors run at different speeds on some SMP
> > systems, the timer rates vary as processors change clock rate nowdays.
> > Rdtsc may also jump dramatically on a suspend/resume.
>
> x86-64 uses it for vgettimeofday very safely (i386 could do too but it
> doesn't).
>
> Anyway I believe at least for seccomp it's worth to turn off the tsc,
> not just for HT but for the L2 cache too. So it's up to you, either you
> turn it off completely (which isn't very nice IMHO) or I recommend to
> apply this below patch. This has been tested successfully on x86-64
> against current cogito repository (i686 compiles so I didn't bother
> testing ;). People selling the cpu through cpushare may appreciate this
> bit for a peace of mind. There's no way to get any timing info anymore
> with this applied (gettimeofday is forbidden of course).
Another possibility to get timing is from direct I/O --- i.e. initiate a
direct-io read, wait until one cache line contains new data, and you can
be sure that the next will contain new data within a certain time. The
IDE controller's bus-master operation acts here as a timer.
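Roughly like this (an untested illustration only -- a helper thread
doing O_DIRECT reads rewrites a shared buffer at approximately constant
intervals, so polling that buffer gives a crude clock even with rdtsc
disabled):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned char *buf;

static void *reader(void *arg)  /* arg: path of a big file or device */
{
        int fd = open(arg, O_RDONLY | O_DIRECT);
        off_t off = 0;

        for (;;)                /* each completed DMA is one "tick" */
                if (pread(fd, buf, 4096, off) == 4096)
                        off += 4096;
                else
                        off = 0;
}

int main(int argc, char **argv)
{
        pthread_t t;
        unsigned long ticks = 0;
        unsigned char last;

        if (posix_memalign((void **)&buf, 4096, 4096))
                return 1;       /* O_DIRECT needs aligned buffers */
        pthread_create(&t, NULL, reader, argv[1]);
        last = *(volatile unsigned char *)buf;
        for (;;) {
                unsigned char cur = *(volatile unsigned char *)buf;
                if (cur != last) {      /* buffer changed: count a tick */
                        last = cur;
                        ticks++;
                }
        }
}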
Mikulas
> The seccomp
> environment is completely deterministic so it can't be allowed to get
> timing info, it has to be deterministic so in the future I can enable a
> computing mode that does a parallel computing for each task with server
> side transparent checkpointing and verification that the output is the
> same from all the 2/3 seller computers for each task, without the buyer
> even noticing (for now the verification is left to the buyer client
> side and there's no checkpointing, since that would require more kernel
> changes to track the dirty bits but it'll be easy to extend once the
> basic mode is finished).
>
> Thanks.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
>
> Index: arch/i386/kernel/process.c
> ===================================================================
> --- eed337ef5e9ae7d62caa84b7974a11fddc7f06e0/arch/i386/kernel/process.c (mode:100644)
> +++ uncommitted/arch/i386/kernel/process.c (mode:100644)
> @@ -561,6 +561,25 @@
> }
>
> /*
> + * This function selects if the context switch from prev to next
> + * has to tweak the TSC disable bit in the cr4.
> + */
> +static void disable_tsc(struct thread_info *prev,
> + struct thread_info *next)
> +{
> + if (unlikely(has_secure_computing(prev) ||
> + has_secure_computing(next))) {
> + /* slow path here */
> + if (has_secure_computing(prev) &&
> + !has_secure_computing(next)) {
> + clear_in_cr4(X86_CR4_TSD);
> + } else if (!has_secure_computing(prev) &&
> + has_secure_computing(next))
> + set_in_cr4(X86_CR4_TSD);
> + }
> +}
> +
> +/*
> * switch_to(x,yn) should switch tasks from x to y.
> *
> * We fsave/fwait so that an exception goes off at the right time
> @@ -639,6 +658,8 @@
> if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr))
> handle_io_bitmap(next, tss);
>
> + disable_tsc(prev_p->thread_info, next_p->thread_info);
> +
> return prev_p;
> }
>
> Index: arch/x86_64/kernel/process.c
> ===================================================================
> --- eed337ef5e9ae7d62caa84b7974a11fddc7f06e0/arch/x86_64/kernel/process.c (mode:100644)
> +++ uncommitted/arch/x86_64/kernel/process.c (mode:100644)
> @@ -439,6 +439,25 @@
> }
>
> /*
> + * This function selects if the context switch from prev to next
> + * has to tweak the TSC disable bit in the cr4.
> + */
> +static void disable_tsc(struct thread_info *prev,
> + struct thread_info *next)
> +{
> + if (unlikely(has_secure_computing(prev) ||
> + has_secure_computing(next))) {
> + /* slow path here */
> + if (has_secure_computing(prev) &&
> + !has_secure_computing(next)) {
> + clear_in_cr4(X86_CR4_TSD);
> + } else if (!has_secure_computing(prev) &&
> + has_secure_computing(next))
> + set_in_cr4(X86_CR4_TSD);
> + }
> +}
> +
> +/*
> * This special macro can be used to load a debugging register
> */
> #define loaddebug(thread,r) set_debug(thread->debugreg ## r, r)
> @@ -556,6 +575,8 @@
> }
> }
>
> + disable_tsc(prev_p->thread_info, next_p->thread_info);
> +
> return prev_p;
> }
>
> Index: include/linux/seccomp.h
> ===================================================================
> --- eed337ef5e9ae7d62caa84b7974a11fddc7f06e0/include/linux/seccomp.h (mode:100644)
> +++ uncommitted/include/linux/seccomp.h (mode:100644)
> @@ -19,6 +19,11 @@
> __secure_computing(this_syscall);
> }
>
> +static inline int has_secure_computing(struct thread_info *ti)
> +{
> + return unlikely(test_ti_thread_flag(ti, TIF_SECCOMP));
> +}
> +
> #else /* CONFIG_SECCOMP */
>
> #if (__GNUC__ > 2)
> @@ -28,6 +33,7 @@
> #endif
>
> #define secure_computing(x) do { } while (0)
> +#define has_secure_computing(x) 0
>
> #endif /* CONFIG_SECCOMP */
>
On Sun, 15 May 2005, Andi Kleen wrote:
> On Fri, May 13, 2005 at 09:16:09PM +0200, Diego Calleja wrote:
> > El Fri, 13 May 2005 20:03:58 +0200,
> > Andi Kleen <[email protected]> escribió:
> >
> >
> > > This is not a kernel problem, but a user space problem. The fix
> > > is to change the user space crypto code to need the same number of cache line
> > > accesses on all keys.
> >
> >
> > However they've patched the FreeBSD kernel to "workaround?" it:
> > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
>
> That's a similar stupid idea as they did with the disk write
> cache (lowering the MTBFs of their disks by considerable factors,
> which is much worse than the power off data loss problem)
> Let's not go down this path please.
What did they do wrong with the disk write cache?
Mikulas
> -Andi
>
On Fri, 13 May 2005, Andy Isaacson wrote:
> On Fri, May 13, 2005 at 09:05:49PM +0200, Andi Kleen wrote:
> > On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> > > Why? It's certainly reasonable to disable it for the time being and
> > > even prudent to do so.
> >
> > No, i strongly disagree on that. The reasonable thing to do is
> > to fix the crypto code which has this vulnerability, not break
> > a useful performance enhancement for everybody else.
>
> Pardon me for saying so, but that's bullshit. You're asking the crypto
> guys to give up a 5x performance gain (that's my wild guess) by giving
> up all their data-dependent algorithms and contorting their code wildly,
> to avoid a microarchitectural problem with Intel's HT implementation.
That information leak can be exploited not only on HT or SMP, but on any
CPU with an L2 cache. Without HT it's much harder to get information
about the L2 cache footprint, but it's still possible. If an attacker
can make an unlimited number of connections to an ssh or http server and
manages to get 1 bit per 100 connections, it's still a problem.
Possible solutions:
1) don't use branches or data-dependent memory accesses that depend on
secret data (see the sketch below)
2) flush the cache completely when switching to a process with a
different EUID (0.2ms on a Pentium 4 with 1M cache, even worse on CPUs
with more cache).
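A sketch of what 1) means in practice (illustration only, not OpenSSL's
actual fix):

/* read every table entry and select the wanted one with a mask,
 * so the access pattern is independent of the secret index */
#include <stddef.h>
#include <stdint.h>

uint8_t ct_lookup(const uint8_t *table, size_t n, size_t secret_idx)
{
        uint8_t result = 0;
        size_t i;

        for (i = 0; i < n; i++) {
                /* mask is 0xff only when i == secret_idx */
                uint8_t mask = (uint8_t)-(i == secret_idx);
                result |= table[i] & mask;
        }
        return result;
}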
Disabling HT/SMP is not a solution. A year later someone may come up
with something like this:
* prefill the L2 cache with a known pattern
* sleep on some precise timer
* make a connection to a security application (ssh, https)
* on wakeup, read what's in the L2 cache --- you get one bit with small
probability --- but when repeated many times, it's still a problem
Mikulas
> There are three places to cut off the side channel, none of which is
> obviously the right one.
> 1. The HT implementation could do the cache tricks Colin suggested in
> his paper. Fairly large performance hit to address a fairly small
> problem.
> 2. The OS could do the scheduler tricks to avoid scheduling unfriendly
> threads on the same core. You're leaving a lot of the benefit of HT
> on the floor by doing so.
> 3. Every security-sensitive app can be rigorously audited and re-written
> to avoid *ever* referencing memory with the address determined by
> private data.
>
> (3) is a complete non-starter. It's just not feasible to rewrite all
> that code. Furthermore, there's no way to know what code needs to be
> rewritten! (Until someone publishes an advisory, that is...)
>
> Hmm, I can't think of any reason that this technique wouldn't work to
> extract information from kernel secrets, as well...
>
> If SHA has plaintext-dependent memory references, Colin's technique
> would enable an adversary to extract the contents of the /dev/random
> pools. I don't *think* SHA does, based on a quick reading of
> lib/sha1.c, but someone with an actual clue should probably take a look.
>
> Andi, are you prepared to *require* that no code ever make a memory
> reference as a function of a secret? Because that's what you're
> suggesting the crypto people should do.
>
> -andy
On Sun, May 15, 2005 at 03:51:05PM +0200, Mikulas Patocka wrote:
>
>
> On Sun, 15 May 2005, Andi Kleen wrote:
>
> > On Fri, May 13, 2005 at 09:16:09PM +0200, Diego Calleja wrote:
> > > El Fri, 13 May 2005 20:03:58 +0200,
> > > Andi Kleen <[email protected]> escribió:
> > >
> > >
> > > > This is not a kernel problem, but a user space problem. The fix
> > > > is to change the user space crypto code to need the same number of cache line
> > > > accesses on all keys.
> > >
> > >
> > > However they've patched the FreeBSD kernel to "workaround?" it:
> > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
> >
> > That's a similar stupid idea as they did with the disk write
> > cache (lowering the MTBFs of their disks by considerable factors,
> > which is much worse than the power off data loss problem)
> > Let's not go down this path please.
>
> What wrong did they do with disk write cache?
They turned it off by default, which according to disk vendors
lowers the MTBF of your disk to a fraction of the original value.
I bet the total amount of valuable data lost for FreeBSD users because
of broken disks is much much bigger than what they gained from not losing
in the rather hard to hit power off cases.
-Andi
On Sun, 15 May 2005, Andi Kleen wrote:
> On Sun, May 15, 2005 at 03:51:05PM +0200, Mikulas Patocka wrote:
> >
> >
> > On Sun, 15 May 2005, Andi Kleen wrote:
> >
> > > On Fri, May 13, 2005 at 09:16:09PM +0200, Diego Calleja wrote:
> > > > El Fri, 13 May 2005 20:03:58 +0200,
> > > > Andi Kleen <[email protected]> escribió:
> > > >
> > > >
> > > > > This is not a kernel problem, but a user space problem. The fix
> > > > > is to change the user space crypto code to need the same number of cache line
> > > > > accesses on all keys.
> > > >
> > > >
> > > > However they've patched the FreeBSD kernel to "workaround?" it:
> > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
> > >
> > > That's a similar stupid idea as they did with the disk write
> > > cache (lowering the MTBFs of their disks by considerable factors,
> > > which is much worse than the power off data loss problem)
> > > Let's not go down this path please.
> >
> > What wrong did they do with disk write cache?
>
> They turned it off by default, which according to disk vendors
> lowers the MTBF of your disk to a fraction of the original value.
>
> I bet the total amount of valuable data lost for FreeBSD users because
> of broken disks is much much bigger than what they gained from not losing
> in the rather hard to hit power off cases.
>
> -Andi
BTW, is there any blacklist of disks with a broken FLUSH CACHE command?
Or a list of companies that cheat in implementing it?
Mikulas
> There are three places to cut off the side channel, none of which is
> obviously the right one.
> 1. The HT implementation could do the cache tricks Colin suggested in
> his paper. Fairly large performance hit to address a fairly small
> problem.
As Dean pointed out that is probably not true.
> 2. The OS could do the scheduler tricks to avoid scheduling unfriendly
> threads on the same core. You're leaving a lot of the benefit of HT
> on the floor by doing so.
And probably still lose badly in some workloads.
> 3. Every security-sensitive app can be rigorously audited and re-written
> to avoid *ever* referencing memory with the address determined by
> private data.
Sure, after it has been demonstrated that this attack is actually
feasible in practice. If yes, then fix the crypto code; otherwise do
nothing.
I have no problem with crypto people being paranoid (that is their
job after all), as long as they don't try to affect non-crypto code
in the process. But the latter seems to be clearly the case here :-(
>
> (3) is a complete non-starter. It's just not feasible to rewrite all
> that code. Furthermore, there's no way to know what code needs to be
> rewritten! (Until someone publishes an advisory, that is...)
>
> Hmm, I can't think of any reason that this technique wouldn't work to
> extract information from kernel secrets, as well...
>
> If SHA has plaintext-dependent memory references, Colin's technique
> would enable an adversary to extract the contents of the /dev/random
> pools. I don't *think* SHA does, based on a quick reading of
> lib/sha1.c, but someone with an actual clue should probably take a look.
>
> Andi, are you prepared to *require* that no code ever make a memory
> reference as a function of a secret? Because that's what you're
> suggesting the crypto people should do.
No, just don't do it frequently enough that you leak enough data.
Or add dummy memory references to blend your data.
And then nobody said writing crypto code was easy. It just got a bit
harder today.
It is basically like writing smart card code, where you need
to care about such side channels. The other crypto code writers
just need to care about this too, and with this approach they will
probably avoid other timing attacks on cache misses as well. Some data
is always leaked - e.g. if you time the performance of a network server
answering RSA requests over the network - the question is just whether
it is enough and accurate enough data to aid an attacker. The paper has
shown that it is feasible in some cases, but the proof is still out
that it could actually be replicated under less controlled loads. With
more noise in the data it becomes harder. And the question is whether
the small amount of data leaked under a normal background workload is
really useful enough to lead to real-world attacks. I have severe
doubts about that. Certainly the evidence is not clear enough for a
serious step like disabling a useful performance enhancement like HT.
-Andi
P.S.: My personal opinion is that we have a far bigger crypto security
problem on many systems due to weak /dev/random seeding.
If anything is to be done, it would be better to attack that.
On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
> > > > However they've patched the FreeBSD kernel to "workaround?" it:
> > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
> > >
> > > That's a similar stupid idea as they did with the disk write
> > > cache (lowering the MTBFs of their disks by considerable factors,
> > > which is much worse than the power off data loss problem)
> > > Let's not go down this path please.
> >
> > What wrong did they do with disk write cache?
>
> They turned it off by default, which according to disk vendors
> lowers the MTBF of your disk to a fraction of the original value.
>
> I bet the total amount of valuable data lost for FreeBSD users because
> of broken disks is much much bigger than what they gained from not losing
> in the rather hard to hit power off cases.
Aren't I/O barriers a way to safely use write cache?
--
Tomasz Torcz "God, root, what's the difference?"
[email protected] "God is more forgiving."
On Sun, 15 May 2005, Tomasz Torcz wrote:
> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
> > > > > However they've patched the FreeBSD kernel to "workaround?" it:
> > > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/htt5.patch
> > > >
> > > > That's a similar stupid idea as they did with the disk write
> > > > cache (lowering the MTBFs of their disks by considerable factors,
> > > > which is much worse than the power off data loss problem)
> > > > Let's not go down this path please.
> > >
> > > What wrong did they do with disk write cache?
> >
> > They turned it off by default, which according to disk vendors
> > lowers the MTBF of your disk to a fraction of the original value.
> >
> > I bet the total amount of valuable data lost for FreeBSD users because
> > of broken disks is much much bigger than what they gained from not losing
> > in the rather hard to hit power off cases.
>
> Aren't I/O barriers a way to safely use write cache?
FreeBSD used these barriers (the FLUSH CACHE command) a long time ago.
There are rumors that some disks ignore the FLUSH CACHE command just to
get higher benchmarks in Windows, but I haven't heard of any proof. Does
anybody know which companies fake this command?
Mikulas
> > They turned it off by default, which according to disk vendors
> > lowers the MTBF of your disk to a fraction of the original value.
> >
> > I bet the total amount of valuable data lost for FreeBSD users because
> > of broken disks is much much bigger than what they gained from not losing
> > in the rather hard to hit power off cases.
>
> Aren't I/O barriers a way to safely use write cache?
yes they are. However of course they also decrease the mtbf somewhat,
although less so than entirely disabling the cache....
On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
>On Sun, 15 May 2005, Tomasz Torcz wrote:
>> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
>> > > > > However they've patched the FreeBSD kernel to
>> > > > > "workaround?" it:
>> > > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
>> > > > >t5.patch
>> > > >
>> > > > That's a similar stupid idea as they did with the disk write
>> > > > cache (lowering the MTBFs of their disks by considerable
>> > > > factors, which is much worse than the power off data loss
>> > > > problem) Let's not go down this path please.
>> > >
>> > > What wrong did they do with disk write cache?
>> >
>> > They turned it off by default, which according to disk vendors
>> > lowers the MTBF of your disk to a fraction of the original
>> > value.
>> >
>> > I bet the total amount of valuable data lost for FreeBSD users
>> > because of broken disks is much much bigger than what they
>> > gained from not losing in the rather hard to hit power off
>> > cases.
>>
>> Aren't I/O barriers a way to safely use write cache?
>
>FreeBSD used these barriers (FLUSH CACHE command) long time ago.
>
>There are rumors that some disks ignore FLUSH CACHE command just to
> get higher benchmarks in Windows. But I haven't heart of any proof.
> Does anybody know, what companies fake this command?
>
From a story I read elsewhere just a few days ago, this problem is
virtually universal even in the umpty-bucks 15,000 rpm scsi server
drives. It appears that this is just another way to crank up the
numbers and make each drive seem faster than its competition.
My gut feeling is that if this gets enough ink to get under the drive
makers' skins, we will see the issuance of a utility from the makers
that will re-program the drives, thereby enabling the proper
handling of the FLUSH CACHE command. This would be an excellent
chance, IMO, to make a bit of noise if the utility comes out but only
runs on Windows. In that event, either we hold their feet to the fire
(the preferable method), or a wrapper is written that allows it to run
on any OS with a bash-like shell manager.
>Mikulas
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.34% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
> On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
> >On Sun, 15 May 2005, Tomasz Torcz wrote:
> >> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
> >> > > > > However they've patched the FreeBSD kernel to
> >> > > > > "workaround?" it:
> >> > > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
> >> > > > >t5.patch
> >> > > >
> >> > > > That's a similar stupid idea as they did with the disk write
> >> > > > cache (lowering the MTBFs of their disks by considerable
> >> > > > factors, which is much worse than the power off data loss
> >> > > > problem) Let's not go down this path please.
> >> > >
> >> > > What wrong did they do with disk write cache?
> >> >
> >> > They turned it off by default, which according to disk vendors
> >> > lowers the MTBF of your disk to a fraction of the original
> >> > value.
> >> >
> >> > I bet the total amount of valuable data lost for FreeBSD users
> >> > because of broken disks is much much bigger than what they
> >> > gained from not losing in the rather hard to hit power off
> >> > cases.
> >>
> >> Aren't I/O barriers a way to safely use write cache?
> >
> >FreeBSD used these barriers (FLUSH CACHE command) long time ago.
> >
> >There are rumors that some disks ignore FLUSH CACHE command just to
> > get higher benchmarks in Windows. But I haven't heart of any proof.
> > Does anybody know, what companies fake this command?
> >
> >From a story I read elsewhere just a few days ago, this problem is
> virtually universal even in the umpty-bucks 15,000 rpm scsi server
> drives. It appears that this is just another way to crank up the
> numbers and make each drive seem faster than its competition.
>
> My gut feeling is that if this gets enough ink to get under the drive
> makers skins, we will see the issuance of a utility from the makers
> that will re-program the drives therefore enabling the proper
> handling of the FLUSH CACHE command. This would be an excellent
> chance IMO, to make a bit of noise if the utility comes out, but only
> runs on windows. In that event, we hold their feet to the fire (the
> prefereable method), or a wrapper is written that allows it to run on
> any os with a bash-like shell manager.
There is a large amount of yammering and speculation in this thread.
Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
Jeff
On Sun, 15 May 2005, Gene Heskett wrote:
> On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
> >On Sun, 15 May 2005, Tomasz Torcz wrote:
> >> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
> >> > > > > However they've patched the FreeBSD kernel to
> >> > > > > "workaround?" it:
> >> > > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
> >> > > > >t5.patch
> >> > > >
> >> > > > That's a similar stupid idea as they did with the disk write
> >> > > > cache (lowering the MTBFs of their disks by considerable
> >> > > > factors, which is much worse than the power off data loss
> >> > > > problem) Let's not go down this path please.
> >> > >
> >> > > What wrong did they do with disk write cache?
> >> >
> >> > They turned it off by default, which according to disk vendors
> >> > lowers the MTBF of your disk to a fraction of the original
> >> > value.
> >> >
> >> > I bet the total amount of valuable data lost for FreeBSD users
> >> > because of broken disks is much much bigger than what they
> >> > gained from not losing in the rather hard to hit power off
> >> > cases.
> >>
> >> Aren't I/O barriers a way to safely use write cache?
> >
> >FreeBSD used these barriers (FLUSH CACHE command) long time ago.
> >
> >There are rumors that some disks ignore FLUSH CACHE command just to
> > get higher benchmarks in Windows. But I haven't heart of any proof.
> > Does anybody know, what companies fake this command?
> >
> From a story I read elsewhere just a few days ago, this problem is
> virtually universal even in the umpty-bucks 15,000 rpm scsi server
> drives. It appears that this is just another way to crank up the
> numbers and make each drive seem faster than its competition.
I've just made a test on my Western Digital 40G IDE disk:
just writes without flush cache: 1min 33sec
same access pattern, but flush cache after each write: 20min 7sec (and
the disk made more noise)
(this testcase does many 1-sector writes to the same or adjacent sectors,
so the cache helps a lot here)
So it's likely that this disk honours cache flushing.
(but the disk contains another severe bug --- it corrupts its
cache-coherency logic when 256-sector accesses are used --- I asked WD
about it and got no response. 256 is represented as 0 in the IDE
registers --- that's probably where the bug came from).
I've also heard a lot of rumors about ignoring cache flush --- but I mean,
has anybody actually proven that some disk corrupts data this way? I.e.:
make a program that repeatedly does this:
write some sector
issue the flush cache command
send a packet saying what was written where
... and turn off the machine while this program runs, then see if the
disk contains all the data from the packets.
or
write many small sectors
issue flush cache
turn off power via ACPI
on next reboot, see if the disk contains all the data
Note that a disk can still ignore the FLUSH CACHE command if the cached
data are small enough to be written out on power loss, so a small FLUSH
CACHE time doesn't prove the disk is cheating.
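The first test could be sketched like this (assumptions: fsync() really
reaches the disk as FLUSH CACHE, which as discussed elsewhere in this
thread only the most recent 2.6.x kernels do; the port number and
arguments are made up):

/* write a sequence number, flush, and only then tell a logging host;
 * after pulling the plug, the file must contain a sequence number at
 * least as new as the last one the logging host received */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
        int fd = open(argv[1], O_WRONLY);       /* argv[1]: test file */
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in log;
        char buf[512];
        unsigned long seq;

        memset(&log, 0, sizeof(log));
        log.sin_family = AF_INET;
        log.sin_port = htons(9999);
        inet_aton(argv[2], &log.sin_addr);      /* argv[2]: logging host */

        for (seq = 0;; seq++) {
                memset(buf, 0, sizeof(buf));
                snprintf(buf, sizeof(buf), "%lu", seq);
                pwrite(fd, buf, sizeof(buf), 0);
                if (fsync(fd))                  /* should imply FLUSH CACHE */
                        perror("fsync");
                sendto(sock, buf, strlen(buf), 0,
                       (struct sockaddr *)&log, sizeof(log));
        }
}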
Mikulas
> My gut feeling is that if this gets enough ink to get under the drive
> makers skins, we will see the issuance of a utility from the makers
> that will re-program the drives therefore enabling the proper
> handling of the FLUSH CACHE command. This would be an excellent
> chance IMO, to make a bit of noise if the utility comes out, but only
> runs on windows. In that event, we hold their feet to the fire (the
> prefereable method), or a wrapper is written that allows it to run on
> any os with a bash-like shell manager.
>
> >Mikulas
>
> --
> Cheers, Gene
> "There are four boxes to be used in defense of liberty:
> soap, ballot, jury, and ammo. Please use in that order."
> -Ed Howdershelt (Author)
> 99.34% setiathome rank, not too shabby for a WV hillbilly
> Yahoo.com and AOL/TW attorneys please note, additions to the above
> message by Gene Heskett are:
> Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:
Jeff> On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
>> On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
>> >On Sun, 15 May 2005, Tomasz Torcz wrote:
>> >> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
>> >> > > > > However they've patched the FreeBSD kernel to
>> >> > > > > "workaround?" it:
>> >> > > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
>> >> > > > >t5.patch
>> >> > > >
>> >> > > > That's a similar stupid idea as they did with the disk write
>> >> > > > cache (lowering the MTBFs of their disks by considerable
>> >> > > > factors, which is much worse than the power off data loss
>> >> > > > problem) Let's not go down this path please.
>> >> > >
>> >> > > What wrong did they do with disk write cache?
>> >> >
>> >> > They turned it off by default, which according to disk vendors
>> >> > lowers the MTBF of your disk to a fraction of the original
>> >> > value.
>> >> >
>> >> > I bet the total amount of valuable data lost for FreeBSD users
>> >> > because of broken disks is much much bigger than what they
>> >> > gained from not losing in the rather hard to hit power off
>> >> > cases.
>> >>
>> >> Aren't I/O barriers a way to safely use write cache?
>> >
>> >FreeBSD used these barriers (FLUSH CACHE command) long time ago.
>> >
>> >There are rumors that some disks ignore FLUSH CACHE command just to
>> > get higher benchmarks in Windows. But I haven't heart of any proof.
>> > Does anybody know, what companies fake this command?
>> >
>> >From a story I read elsewhere just a few days ago, this problem is
>> virtually universal even in the umpty-bucks 15,000 rpm scsi server
>> drives. It appears that this is just another way to crank up the
>> numbers and make each drive seem faster than its competition.
>>
>> My gut feeling is that if this gets enough ink to get under the drive
>> makers skins, we will see the issuance of a utility from the makers
>> that will re-program the drives therefore enabling the proper
>> handling of the FLUSH CACHE command. This would be an excellent
>> chance IMO, to make a bit of noise if the utility comes out, but only
>> runs on windows. In that event, we hold their feet to the fire (the
>> prefereable method), or a wrapper is written that allows it to run on
>> any os with a bash-like shell manager.
Jeff> There is a large amount of yammering and speculation in this thread.
Jeff> Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
Then it must be the file system that's not controlling the disk
properly. And because this is so widespread among Linux file systems,
there must be at least one bug in the VFS (or there was one, and
everyone copied it).
At least, from:
http://developer.osdl.jp/projects/doubt/
there is a project named "diskio" which does black-box tests about this:
http://developer.osdl.jp/projects/doubt/diskio/index.html
And if we assume read-after-write access semantics of the HDD for
"SURELY" checking the data image on the disk surface (by the HDD, I
mean), then on both SCSI and ATA, ALL of the file systems fail the test.
And I was wondering whose fault it is. The file system? The device
drivers of both SCSI and ATA? Or the criterion? From Jeff's point, it
seems like the file system or the criterion...
----
Kenichi Okuyama
Kenichi Okuyama wrote:
>>>>>>"Jeff" == Jeff Garzik <[email protected]> writes:
>
>
> Jeff> On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
>
>>>On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
>>>
>>>>On Sun, 15 May 2005, Tomasz Torcz wrote:
>>>>
>>>>>On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
>>>>>
>>>>>>>>>However they've patched the FreeBSD kernel to
>>>>>>>>>"workaround?" it:
>>>>>>>>>ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
>>>>>>>>>t5.patch
>>>>>>>>
>>>>>>>>That's a similar stupid idea as they did with the disk write
>>>>>>>>cache (lowering the MTBFs of their disks by considerable
>>>>>>>>factors, which is much worse than the power off data loss
>>>>>>>>problem) Let's not go down this path please.
>>>>>>>
>>>>>>>What wrong did they do with disk write cache?
>>>>>>
>>>>>>They turned it off by default, which according to disk vendors
>>>>>>lowers the MTBF of your disk to a fraction of the original
>>>>>>value.
>>>>>>
>>>>>>I bet the total amount of valuable data lost for FreeBSD users
>>>>>>because of broken disks is much much bigger than what they
>>>>>>gained from not losing in the rather hard to hit power off
>>>>>>cases.
>>>>>
>>>>> Aren't I/O barriers a way to safely use write cache?
>>>>
>>>>FreeBSD used these barriers (FLUSH CACHE command) long time ago.
>>>>
>>>>There are rumors that some disks ignore FLUSH CACHE command just to
>>>>get higher benchmarks in Windows. But I haven't heart of any proof.
>>>>Does anybody know, what companies fake this command?
>>>>
>>>
>>>>From a story I read elsewhere just a few days ago, this problem is
>>>virtually universal even in the umpty-bucks 15,000 rpm scsi server
>>>drives. It appears that this is just another way to crank up the
>>>numbers and make each drive seem faster than its competition.
>>>
>>>My gut feeling is that if this gets enough ink to get under the drive
>>>makers skins, we will see the issuance of a utility from the makers
>>>that will re-program the drives therefore enabling the proper
>>>handling of the FLUSH CACHE command. This would be an excellent
>>>chance IMO, to make a bit of noise if the utility comes out, but only
>>>runs on windows. In that event, we hold their feet to the fire (the
>>>prefereable method), or a wrapper is written that allows it to run on
>>>any os with a bash-like shell manager.
>
>
>
> Jeff> There is a large amount of yammering and speculation in this thread.
>
> Jeff> Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
>
>
> Then it must be file system who's not controlling properly. And
> because this is so widely spread among Linux, there must be at least
> one bug existing in VFS ( or there was, and everyone copied it ).
>
> At least, from:
>
> http://developer.osdl.jp/projects/doubt/
>
> there is project name "diskio" which does black box test about this:
>
> http://developer.osdl.jp/projects/doubt/diskio/index.html
>
> And if we assume for Read after Write access semantics of HDD for
> "SURELY" checking the data image on disk surface ( by HDD, I mean ),
> on both SCSI and ATA, ALL the file system does not pass the test.
>
> And I was wondering who's bad. File system? Device driver of both
> SCSI and ATA? or criterion? From Jeff's point, it seems like file
> system or criterion...
The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
command to be generated has only been present in the most recent 2.6.x
kernels. See the "write barrier" stuff that people have been discussing.
Furthermore, read-after-write implies nothing at all. The only way
you can be assured that your data has "hit the platter" is
(1) issuing [FLUSH|SYNC] CACHE, or
(2) using FUA-style disk commands
It sounds like your test (or reasoning) is invalid.
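For reference, (1) can be done by hand on an IDE disk roughly like this
(a sketch; 0xE7 is the ATA FLUSH CACHE opcode, and the device name is
just an example):

/* issue FLUSH CACHE via HDIO_DRIVE_CMD, much as hdparm does */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

int main(void)
{
        unsigned char args[4] = { 0xE7, 0, 0, 0 };      /* FLUSH CACHE */
        int fd = open("/dev/hda", O_RDONLY);

        if (fd < 0 || ioctl(fd, HDIO_DRIVE_CMD, args) != 0)
                perror("flush cache");
        return 0;
}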
Jeff
On May 15, 2005, at 12:43:07, Jeff Garzik wrote:
> The only way to you can be assured that your data has "hit the
> platter" is
> (1) issuing [FLUSH|SYNC] CACHE, or
> (2) using FUA-style disk commands
And even then, some battery-backed RAID controllers will completely
ignore cache-flushes, because in the event of a power failure they
can maintain the cached data for anywhere from a couple days to a
month or two, depending on the quality of the card and the size of
its battery.
Jeff Garzik <[email protected]> writes:
>
> The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
> command to be generated has only been present in the most recent 2.6.x
> kernels. See the "write barrier" stuff that people have been
> discussing.
Are you sure mainline does it for fsync() file data at all? IIRC it
was only done for journal writes in reiserfs/xfs/jbd. However, since
I suppose a lot of disks flush everything pending on a FLUSH CACHE
command, it still works, assuming the file systems write the
data to disk in fsync before syncing the journal. I don't know
if they do that.
-Andi
On Sun, 15 May 2005, Jeff Garzik wrote:
> Kenichi Okuyama wrote:
> >>>>>>"Jeff" == Jeff Garzik <[email protected]> writes:
> >
> >
> > Jeff> On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
> >
> >>>On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
> >>>
> >>>>On Sun, 15 May 2005, Tomasz Torcz wrote:
> >>>>
> >>>>>On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
> >>>>>
> >>>>>>>>>However they've patched the FreeBSD kernel to
> >>>>>>>>>"workaround?" it:
> >>>>>>>>>ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
> >>>>>>>>>t5.patch
> >>>>>>>>
> >>>>>>>>That's a similar stupid idea as they did with the disk write
> >>>>>>>>cache (lowering the MTBFs of their disks by considerable
> >>>>>>>>factors, which is much worse than the power off data loss
> >>>>>>>>problem) Let's not go down this path please.
> >>>>>>>
> >>>>>>>What wrong did they do with disk write cache?
> >>>>>>
> >>>>>>They turned it off by default, which according to disk vendors
> >>>>>>lowers the MTBF of your disk to a fraction of the original
> >>>>>>value.
> >>>>>>
> >>>>>>I bet the total amount of valuable data lost for FreeBSD users
> >>>>>>because of broken disks is much much bigger than what they
> >>>>>>gained from not losing in the rather hard to hit power off
> >>>>>>cases.
> >>>>>
> >>>>> Aren't I/O barriers a way to safely use write cache?
> >>>>
> >>>>FreeBSD used these barriers (FLUSH CACHE command) long time ago.
> >>>>
> >>>>There are rumors that some disks ignore FLUSH CACHE command just to
> >>>>get higher benchmarks in Windows. But I haven't heart of any proof.
> >>>>Does anybody know, what companies fake this command?
> >>>>
> >>>
> >>>>From a story I read elsewhere just a few days ago, this problem is
> >>>virtually universal even in the umpty-bucks 15,000 rpm scsi server
> >>>drives. It appears that this is just another way to crank up the
> >>>numbers and make each drive seem faster than its competition.
> >>>
> >>>My gut feeling is that if this gets enough ink to get under the drive
> >>>makers skins, we will see the issuance of a utility from the makers
> >>>that will re-program the drives therefore enabling the proper
> >>>handling of the FLUSH CACHE command. This would be an excellent
> >>>chance IMO, to make a bit of noise if the utility comes out, but only
> >>>runs on windows. In that event, we hold their feet to the fire (the
> >>>prefereable method), or a wrapper is written that allows it to run on
> >>>any os with a bash-like shell manager.
> >
> >
> >
> > Jeff> There is a large amount of yammering and speculation in this thread.
> >
> > Jeff> Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
> >
> >
> > Then it must be file system who's not controlling properly. And
> > because this is so widely spread among Linux, there must be at least
> > one bug existing in VFS ( or there was, and everyone copied it ).
> >
> > At least, from:
> >
> > http://developer.osdl.jp/projects/doubt/
> >
> > there is project name "diskio" which does black box test about this:
> >
> > http://developer.osdl.jp/projects/doubt/diskio/index.html
> >
> > And if we assume for Read after Write access semantics of HDD for
> > "SURELY" checking the data image on disk surface ( by HDD, I mean ),
> > on both SCSI and ATA, ALL the file system does not pass the test.
> >
> > And I was wondering who's bad. File system? Device driver of both
> > SCSI and ATA? or criterion? From Jeff's point, it seems like file
> > system or criterion...
>
> The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
> command to be generated has only been present in the most recent 2.6.x
> kernels. See the "write barrier" stuff that people have been discussing.
>
> Furthermore, read-after-write implies nothing at all. The only way to
> you can be assured that your data has "hit the platter" is
> (1) issuing [FLUSH|SYNC] CACHE, or
> (2) using FUA-style disk commands
>
> It sounds like your test (or reasoning) is invalid.
The above program checks that write+[f[data]]sync took longer than the
time required to transmit the data over the IDE bus. It has nothing to
do with the FLUSH CACHE command at all.
The results just show that ext3 used to have a bug in f[data]sync in
data-journal mode and that XFS still has a bug in fdatasync on 2.4
kernels. Incorrect results in this test can't be caused by a buggy disk.
Mikulas
> Jeff
>
>
>
>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:
Jeff> Kenichi Okuyama wrote:
>>>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:
>>
>>
Jeff> On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
>>
>>>> On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
>>>>
>>>>> On Sun, 15 May 2005, Tomasz Torcz wrote:
>>>>>
>>>>>> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
>>>>>>
>>>>>>>>>> However they've patched the FreeBSD kernel to
>>>>>>>>>> "workaround?" it:
>>>>>>>>>> ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09/ht
>>>>>>>>>> t5.patch
>>>>>>>>>
>>>>>>>>> That's a similar stupid idea as they did with the disk write
>>>>>>>>> cache (lowering the MTBFs of their disks by considerable
>>>>>>>>> factors, which is much worse than the power off data loss
>>>>>>>>> problem) Let's not go down this path please.
>>>>>>>>
>>>>>>>> What wrong did they do with disk write cache?
>>>>>>>
>>>>>>> They turned it off by default, which according to disk vendors
>>>>>>> lowers the MTBF of your disk to a fraction of the original
>>>>>>> value.
>>>>>>>
>>>>>>> I bet the total amount of valuable data lost for FreeBSD users
>>>>>>> because of broken disks is much much bigger than what they
>>>>>>> gained from not losing in the rather hard to hit power off
>>>>>>> cases.
>>>>>>
>>>>> Aren't I/O barriers a way to safely use write cache?
>>>>>
>>>>> FreeBSD used these barriers (FLUSH CACHE command) long time ago.
>>>>>
>>>>> There are rumors that some disks ignore FLUSH CACHE command just to
>>>>> get higher benchmarks in Windows. But I haven't heart of any proof.
>>>>> Does anybody know, what companies fake this command?
>>>>>
>>>>
>>>>> From a story I read elsewhere just a few days ago, this problem is
>>>> virtually universal even in the umpty-bucks 15,000 rpm scsi server
>>>> drives. It appears that this is just another way to crank up the
>>>> numbers and make each drive seem faster than its competition.
>>>>
>>>> My gut feeling is that if this gets enough ink to get under the drive
>>>> makers skins, we will see the issuance of a utility from the makers
>>>> that will re-program the drives therefore enabling the proper
>>>> handling of the FLUSH CACHE command. This would be an excellent
>>>> chance IMO, to make a bit of noise if the utility comes out, but only
>>>> runs on windows. In that event, we hold their feet to the fire (the
>>>> prefereable method), or a wrapper is written that allows it to run on
>>>> any os with a bash-like shell manager.
>>
>>
>>
Jeff> There is a large amount of yammering and speculation in this thread.
>>
Jeff> Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
>>
>>
>> Then it must be file system who's not controlling properly. And
>> because this is so widely spread among Linux, there must be at least
>> one bug existing in VFS ( or there was, and everyone copied it ).
>>
>> At least, from:
>>
>> http://developer.osdl.jp/projects/doubt/
>>
>> there is project name "diskio" which does black box test about this:
>>
>> http://developer.osdl.jp/projects/doubt/diskio/index.html
>>
>> And if we assume for Read after Write access semantics of HDD for
>> "SURELY" checking the data image on disk surface ( by HDD, I mean ),
>> on both SCSI and ATA, ALL the file system does not pass the test.
>>
>> And I was wondering who's bad. File system? Device driver of both
>> SCSI and ATA? or criterion? From Jeff's point, it seems like file
>> system or criterion...
Jeff> The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
Jeff> command to be generated has only been present in the most recent 2.6.x
Jeff> kernels. See the "write barrier" stuff that people have been discussing.
Jeff> Furthermore, read-after-write implies nothing at all. The only way to
Jeff> you can be assured that your data has "hit the platter" is
Jeff> (1) issuing [FLUSH|SYNC] CACHE, or
Jeff> (2) using FUA-style disk commands
Jeff> It sounds like your test (or reasoning) is invalid.
Thank you for your information, Jeff.
I didn't see why my reasoning was invalid, since these are black-box
tests and don't care about the implementation.
But with your explanation and some logs, I see where to look.
I'll run the test with FreeBSD as soon as I get time.
If FreeBSD fails too, there must be something wrong with the reasoning.
Thanks again for the great hint.
regards,
----
Kenichi Okuyama
Andi Kleen wrote:
> And what you're doing is to ask all the non crypto guys to give
> up an useful optimization just to fix a problem in the crypto guy's
> code. The cache line information leak is just a information leak
> bug in the crypto code, not a general problem.
Portable code shouldn't even have to know that there is such a thing as a
cache line. It should be able to rely on the operating system not to let
other tasks with a different security context spy on the details of its
operation.
> There is much more non crypto code than crypto code around - you
> are proposing to screw the majority of codes to solve a relatively
> obscure problem of only a few functions, which seems like the totally
> wrong approach to me.
That I do agree with.
> BTW the crypto guys are always free to check for hyperthreading
> themselves and use different functions. However there is a catch
> there - the modern dual core processors which actually have
> separated L1 and L2 caches set these too to stay compatible
> with old code and license managers.
This is just a recipe for making it impossible to write correct code. If
you don't believe the operating system or the hardware is at all at fault
for this problem, then it would follow that they could repeat this same
problem with some new mechanism and still not be at fault. So even if the
program checked for hyper-threading, it would still not be correct. It would
have to check for every possible future way this same type of problem could
arise and hide every type of trace that they could create, even if that
trace is in optimization mechanisms and potential channels over which the
programmer has no knowledge because they don't exist yet.
Let's try a reductio ad absurdum. Surely you would agree that something
other than the crypto software is at fault if the operating system or
hardware allowed another process with a different security context to see
every instruction the code executed. The crypto authors shouldn't be
expected to make the instruction flows look identical. How different is
monitoring the memory accesses?
Portable, POSIX-compliant C software shouldn't even have to know that there
is such a thing as a cache line.
I'm not going to be unreasonable though. Hyper-threading is here, and now
that we know the potential problems, it's not unreasonable to ask developers
of crypto code to work around it. But it's not a bug in their code that they
need to fix. In fact, they can't even fix it yet because there is no
portable way to determine if you're on a machine that has hyper-threading or
not.
DS
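To make that last point concrete: the check itself is easy, but only on
x86 and only via CPUID, which is exactly the non-portability being
complained about. A rough sketch for GCC (the HTT flag is bit 28 of EDX
from CPUID leaf 1; as Andi notes, dual-core parts set it too, so it
cannot tell shared caches from split ones):

/* cpuid-htt.c: report whether the CPU advertises the HTT flag */
#include <stdio.h>

static int cpu_reports_htt(void)
{
	unsigned int eax, ebx, ecx, edx;

	__asm__ volatile("cpuid"
			 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
			 : "a"(1));
	return (edx >> 28) & 1;
}

int main(void)
{
	printf("CPUID reports HTT: %s\n",
	       cpu_reports_htt() ? "yes" : "no");
	return 0;
}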
* David Schwartz ([email protected]) wrote:
>
> Andi Kleen wrote:
>
> > And what you're doing is to ask all the non crypto guys to give
> > up an useful optimization just to fix a problem in the crypto guy's
> > code. The cache line information leak is just a information leak
> > bug in the crypto code, not a general problem.
>
> Portable code shouldn't even have to know that there is such a thing as a
> cache line. It should be able to rely on the operating system not to let
> other tasks with a different security context spy on the details of its
> operation.
I find it interesting to compare this thread with one from about a week
ago talking about how /proc/cpuinfo isn't consistent across
architectures - where we came round to the question of whether
application writers shouldn't care about, are too dumb for, shouldn't
need to know about, or can't be trusted with knowing what the real
hardware is.
Personally I think this is a good case of where the application
should take care of it - with whatever support the OS can really
give.
(That is if this is actually a real problem and not just
purely theoretical - my crypto knowledge isn't good enough
to answer that - but it feels very very abstract).
Dave
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
On Sul, 2005-05-15 at 08:30, Arjan van de Ven wrote:
> stop entirely.... (and that is also happening more and more and linux is
> getting more agressive idle support (eg no timer tick and such patches)
> which will trigger bios thresholds for this even more too.
Cyrix did TSC stop on halt a long long time ago, back when it was worth
the power difference.
Alan
Andi Kleen <[email protected]> wrote:
>
> However since
> I suppose a lot of disks flush everything pending on a flush cache
> command it still works assuming the file systems write the
> data to disk in fsync before syncing the journal. I don't know
> if they do that.
ext3 does, in data=journal and data=ordered modes.
On Sun, 2005-05-15 at 21:41 +0100, Alan Cox wrote:
> On Sul, 2005-05-15 at 08:30, Arjan van de Ven wrote:
> > stop entirely.... (and that is also happening more and more and linux is
> > getting more agressive idle support (eg no timer tick and such patches)
> > which will trigger bios thresholds for this even more too.
>
> Cyrix did TSC stop on halt a long long time ago, back when it was worth
> the power difference.
With linux going to ACPI C2 mode more... tsc is defined to halt in C2...
On Sun, 2005-05-15 at 22:48 +0200, Arjan van de Ven wrote:
> On Sun, 2005-05-15 at 21:41 +0100, Alan Cox wrote:
> > On Sul, 2005-05-15 at 08:30, Arjan van de Ven wrote:
> > > stop entirely.... (and that is also happening more and more and linux is
> > > getting more agressive idle support (eg no timer tick and such patches)
> > > which will trigger bios thresholds for this even more too.
> >
> > Cyrix did TSC stop on halt a long long time ago, back when it was worth
> > the power difference.
>
> With linux going to ACPI C2 mode more... tsc is defined to halt in C2...
JACK doesn't care about any of this now, the behavior when you
suspend/resume with a running jackd is undefined. Eventually we should
handle it, but there's no point until the ALSA drivers get proper
suspend/resume support.
Lee
On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
> >FreeBSD used these barriers (FLUSH CACHE command) long time ago.
> >
> >There are rumors that some disks ignore FLUSH CACHE command just to
> > get higher benchmarks in Windows. But I haven't heart of any proof.
> > Does anybody know, what companies fake this command?
> >
> >From a story I read elsewhere just a few days ago, this problem is
> virtually universal even in the umpty-bucks 15,000 rpm scsi server
> drives. It appears that this is just another way to crank up the
> numbers and make each drive seem faster than its competition.
Probably you're talking about this: http://www.livejournal.com/~brad/2116715.html
It hit Slashdot yesterday.
--
Tomasz Torcz "God, root, what's the difference?"
[email protected] "God is more forgiving."
On Sun, May 15, 2005 at 05:10:59PM -0400, Lee Revell wrote:
> On Sun, 2005-05-15 at 22:48 +0200, Arjan van de Ven wrote:
> > On Sun, 2005-05-15 at 21:41 +0100, Alan Cox wrote:
> > > On Sul, 2005-05-15 at 08:30, Arjan van de Ven wrote:
> > > > stop entirely.... (and that is also happening more and more and linux is
> > > > getting more agressive idle support (eg no timer tick and such patches)
> > > > which will trigger bios thresholds for this even more too.
> > >
> > > Cyrix did TSC stop on halt a long long time ago, back when it was worth
> > > the power difference.
> >
> > With linux going to ACPI C2 mode more... tsc is defined to halt in C2...
>
> JACK doesn't care about any of this now, the behavior when you
> suspend/resume with a running jackd is undefined. Eventually we should
> handle it, but there's no point until the ALSA drivers get proper
> suspend/resume support.
suspend/resume are S states, not C states. C states occur during
runtime.
Dave
On Sun, 2005-05-15 at 18:55 -0400, Dave Jones wrote:
> On Sun, May 15, 2005 at 05:10:59PM -0400, Lee Revell wrote:
> > On Sun, 2005-05-15 at 22:48 +0200, Arjan van de Ven wrote:
> > > On Sun, 2005-05-15 at 21:41 +0100, Alan Cox wrote:
> > > > On Sul, 2005-05-15 at 08:30, Arjan van de Ven wrote:
> > > > > stop entirely.... (and that is also happening more and more and linux is
> > > > > getting more agressive idle support (eg no timer tick and such patches)
> > > > > which will trigger bios thresholds for this even more too.
> > > >
> > > > Cyrix did TSC stop on halt a long long time ago, back when it was worth
> > > > the power difference.
> > >
> > > With linux going to ACPI C2 mode more... tsc is defined to halt in C2...
> >
> > JACK doesn't care about any of this now, the behavior when you
> > suspend/resume with a running jackd is undefined. Eventually we should
> > handle it, but there's no point until the ALSA drivers get proper
> > suspend/resume support.
>
> suspend/resume are S states, not C states. C states are occuring
> during runtime.
It should never go into C2 if jackd is running, because you're getting
interrupts from the audio interface at least every 100ms or so (usually
much more often) which will wake up jackd and any clients.
Lee
In principle, it is correct that CPU caches should _not_ permit or
facilitate data-leakage attacks, and disk caches should _not_ prevent
applications from ensuring that data is really transferred to non-
volatile storage.
But turning Hyper-Threading, multiple ALUs, or disk caching off in the
OS is not a solution, it is a cop-out, and as other posters have pointed
out, it simply invites other, more serious failure modes; thus the BSD
knee-jerk reactions are simply wrong, and in fact counterproductive.
The name of the game is a correct fix, not a fast one. Don't make things
worse.
So what really does need doing:
(a) a power-is-failing hook which does a dirty writeback and flushes the
cache to disk; this is the best you can do, and it is very, very cheap
to provide DC power hold-up for tens to hundreds of seconds, by which
time even the crap disks will do an autonomous writeback anyway (1-10 F
at +5V/+12V, ~ 12 USD say), or, on servers, use a UPS with, say, 30
minutes of hold-up. Well-designed servers and SAN disks have this built
in.
(b) CPU registers and caches are inherently insecure, and most
hardware designers still do not have a good enough background to
understand what the OS really needs to do this right in hardware:
so secure apps need a way to tell the OS to do an _expensive_
context switch in which it is guaranteed to flush all leaky context.
Since this is architecture-, model-, sub-architecture-, even
mask-step-dependent, it can only be done in the OS; but user-land
needs a way to tell the OS to be paranoid, after the context
save and before scheduling another real context (excluding
the idle loop). This is an API extension; ulimit?
This will let user-land avoid architecture-dependent code, and
most context switches will be no more expensive than they are
now.
Almost no applications need paranoid context flushes, and they can't
know how to do it themselves, so this has to go in the model-dependent
OS code, with a user-mode API to turn it on per-thread.
--
Mit freundlichen Grüßen, Brian.
Dr. Brian O'Mahoney
Mobile +41 (0)79 334 8035 Email: [email protected]
Bleicherstrasse 25, CH-8953 Dietikon, Switzerland
PGP Key fingerprint = 33 41 A2 DE 35 7C CE 5D F5 14 39 C9 6D 38 56 D5
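No such knob exists in any kernel discussed here; purely to illustrate
the shape of the API being proposed, a hypothetical sketch (the
PR_SET_FLUSH_ON_SWITCH constant is invented, and a real implementation
would live in the architecture-specific context-switch path):

/* Hypothetical per-thread "paranoid context switch" request. */
#include <stdio.h>
#include <sys/prctl.h>

#define PR_SET_FLUSH_ON_SWITCH 0x6b6c6446	/* invented, not a real prctl */

int main(void)
{
	/* Ask the kernel to flush architecturally leaky state (caches
	 * and the like) whenever this thread is scheduled out. */
	if (prctl(PR_SET_FLUSH_ON_SWITCH, 1, 0, 0, 0) < 0)
		perror("prctl (fails today: the flag is hypothetical)");

	/* ... perform key operations here ... */
	return 0;
}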
On Sunday 15 May 2005 11:29, Jeff Garzik wrote:
>On Sun, May 15, 2005 at 11:21:36AM -0400, Gene Heskett wrote:
>> On Sunday 15 May 2005 11:00, Mikulas Patocka wrote:
>> >On Sun, 15 May 2005, Tomasz Torcz wrote:
>> >> On Sun, May 15, 2005 at 04:12:07PM +0200, Andi Kleen wrote:
>> >> > > > > However they've patched the FreeBSD kernel to
>> >> > > > > "workaround?" it:
>> >> > > > > ftp://ftp.freebsd.org/pub/FreeBSD/CERT/patches/SA-05:09
>> >> > > > >/ht t5.patch
>> >> > > >
>> >> > > > That's a similar stupid idea as they did with the disk
>> >> > > > write cache (lowering the MTBFs of their disks by
>> >> > > > considerable factors, which is much worse than the power
>> >> > > > off data loss problem) Let's not go down this path
>> >> > > > please.
>> >> > >
>> >> > > What wrong did they do with disk write cache?
>> >> >
>> >> > They turned it off by default, which according to disk
>> >> > vendors lowers the MTBF of your disk to a fraction of the
>> >> > original value.
>> >> >
>> >> > I bet the total amount of valuable data lost for FreeBSD
>> >> > users because of broken disks is much much bigger than what
>> >> > they gained from not losing in the rather hard to hit power
>> >> > off cases.
>> >>
>> >> Aren't I/O barriers a way to safely use write cache?
>> >
>> >FreeBSD used these barriers (FLUSH CACHE command) long time ago.
>> >
>> >There are rumors that some disks ignore FLUSH CACHE command just
>> > to get higher benchmarks in Windows. But I haven't heart of any
>> > proof. Does anybody know, what companies fake this command?
>> >
>> >From a story I read elsewhere just a few days ago, this problem
>> > is
>>
>> virtually universal even in the umpty-bucks 15,000 rpm scsi server
>> drives. It appears that this is just another way to crank up the
>> numbers and make each drive seem faster than its competition.
>>
>> My gut feeling is that if this gets enough ink to get under the
>> drive makers skins, we will see the issuance of a utility from the
>> makers that will re-program the drives therefore enabling the
>> proper handling of the FLUSH CACHE command. This would be an
>> excellent chance IMO, to make a bit of noise if the utility comes
>> out, but only runs on windows. In that event, we hold their feet
>> to the fire (the prefereable method), or a wrapper is written that
>> allows it to run on any os with a bash-like shell manager.
>
>There is a large amount of yammering and speculation in this thread.
I agree, and frankly I'm just another of the yammerers as I don't
have the clout to be otherwise.
>Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
>
> Jeff
I don't think I have any drives here that do obey that, Jeff. I got
curious about this, oh, maybe a year back when this discussion first
took place on another list, and wrote a test gizmo that copied a
large file, then slept for 1 second and issued a sync command. There
was no drive LED activity until the usual 5-second delay of the
filesystem had expired. To me, that indicated that the sync command was
being returned as completed without error, and I had my shell prompt
back long before the drives' LEDs came on. Admittedly that may not be a
100% valid test, but I really did expect to see the LEDs come on as
the sync command was executed.
I also have some setup stuff for heyu that runs at various times of
the day, reconfiguring how heyu and xtend run 3 times a day here,
which depends on a valid disk file, and I've had to use sleeps to
guarantee the proper sequencing, where if the sync command
actually worked, I could get the job done quite a bit faster.
Again, probably not a valid test of the sync command, but that's the
evidence I have. I do not believe it works here, with any of the 5
drives currently spinning in these two boxes.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.34% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
Gene Heskett wrote:
> I don't think I have any drives here that do obey that, Jeff. I got
> curious about this, oh, maybe a year back when this discussion first
> took place on another list, and wrote a test gizmo that copied a
> large file, then slept for 1 second and issued a sync command. No
> drive led activity until the usual 5 second delay of the filesystem
> had expired. To me, that indicated that the sync command was being
> returned as completed without error and I had my shell prompt back
> long before the drives leds came on. Admittedly that may not be a
> 100% valid test, but I really did expect to see the leds come on as
> the sync command was executed.
> Again, probably not a valid test of the sync command, but thats the
> evidence I have. I do not believe it works here, with any of the 5
> drives currently spinning in these two boxes.
Correct, that's a pretty poor test.
Jeff
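A more direct probe than watching LEDs is to time the flush itself: a
FLUSH CACHE that returns in microseconds immediately after heavy writes
is suspect. A sketch using the HDIO_DRIVE_CMD ioctl and the ATA FLUSH
CACHE opcode from <linux/hdreg.h> (needs root; /dev/hda is an
assumption):

/* time-flush.c: issue ATA FLUSH CACHE and report how long it took */
#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/time.h>

int main(void)
{
	unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };
	struct timeval t0, t1;
	int fd = open("/dev/hda", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	gettimeofday(&t0, NULL);
	if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0)
		perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
	gettimeofday(&t1, NULL);
	printf("flush took %ld us\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_usec - t0.tv_usec));
	return 0;
}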
On Sun, 15 May 2005, Gene Heskett wrote:
> >There is a large amount of yammering and speculation in this thread.
>
> I agree, and frankly I'm just another of the yammerers as I don't
> have the clout to be otherwise.
>
> >Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
> >
> > Jeff
>
> I don't think I have any drives here that do obey that, Jeff. I got
> curious about this, oh, maybe a year back when this discussion first
> took place on another list, and wrote a test gizmo that copied a
> large file, then slept for 1 second and issued a sync command. No
> drive led activity until the usual 5 second delay of the filesystem
> had expired. To me, that indicated that the sync command was being
> returned as completed without error and I had my shell prompt back
> long before the drives leds came on. Admittedly that may not be a
> 100% valid test, but I really did expect to see the leds come on as
> the sync command was executed.
>
> I also have some setup stuff for heyu that runs at various times of
> the day, reconfigureing how heyu and xtend run 3 times a day here,
> which depends on a valid disk file, and I've had to use sleeps for
> guaranteeing the proper sequencing, where if the sync command
> actually worked, I could get the job done quite a bit faster.
>
> Again, probably not a valid test of the sync command, but thats the
> evidence I have. I do not believe it works here, with any of the 5
> drives currently spinning in these two boxes.
Note that Linux couldn't send the FLUSH CACHE command at all until very
recent 2.6 kernels. So the write cache is always dangerous under Linux,
no matter whether the disk is broken or not.
Another note: according to POSIX, sync() is asynchronous --- i.e. it
initiates the write but doesn't have to wait for completion. In Linux,
sync() waits for writes to complete, but it doesn't have to in other
OSes.
Mikulas
> --
> Cheers, Gene
> "There are four boxes to be used in defense of liberty:
> soap, ballot, jury, and ammo. Please use in that order."
> -Ed Howdershelt (Author)
> 99.34% setiathome rank, not too shabby for a WV hillbilly
> Yahoo.com and AOL/TW attorneys please note, additions to the above
> message by Gene Heskett are:
> Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
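The sync()-vs-fsync() distinction Mikulas draws, as a tiny sketch:

/* POSIX: sync() only schedules writeback (Linux happens to wait, other
 * systems need not); fsync(fd) must not return before the file's data
 * has been written. Neither implies the drive's own write cache was
 * flushed on kernels without barrier support. */
#include <unistd.h>

void schedule_global_writeback(void)
{
	sync();			/* may return before the I/O completes */
}

int wait_for_one_file(int fd)
{
	return fsync(fd);	/* blocks until the writes complete */
}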
>took place on another list, and wrote a test gizmo that copied a
>large file, then slept for 1 second and issued a sync command. No
>drive led activity until the usual 5 second delay of the filesystem
>had expired. To me, that indicated that the sync command was being
There's your clue. The drive LEDs normally reflect activity
over the ATA bus (the cable!). If they're not on, then the drive
isn't receiving data/commands from the host.
Cheers
On Sunday 15 May 2005 22:24, Mikulas Patocka wrote:
>On Sun, 15 May 2005, Gene Heskett wrote:
>> >There is a large amount of yammering and speculation in this
>> > thread.
>>
>> I agree, and frankly I'm just another of the yammerers as I don't
>> have the clout to be otherwise.
>>
>> >Most disks do seem to obey SYNC CACHE / FLUSH CACHE.
>> >
>> > Jeff
>>
>> I don't think I have any drives here that do obey that, Jeff. I
>> got curious about this, oh, maybe a year back when this discussion
>> first took place on another list, and wrote a test gizmo that
>> copied a large file, then slept for 1 second and issued a sync
>> command. No drive led activity until the usual 5 second delay of
>> the filesystem had expired. To me, that indicated that the sync
>> command was being returned as completed without error and I had my
>> shell prompt back long before the drives leds came on. Admittedly
>> that may not be a 100% valid test, but I really did expect to see
>> the leds come on as the sync command was executed.
>>
>> I also have some setup stuff for heyu that runs at various times
>> of the day, reconfigureing how heyu and xtend run 3 times a day
>> here, which depends on a valid disk file, and I've had to use
>> sleeps for guaranteeing the proper sequencing, where if the sync
>> command actually worked, I could get the job done quite a bit
>> faster.
>>
>> Again, probably not a valid test of the sync command, but thats
>> the evidence I have. I do not believe it works here, with any of
>> the 5 drives currently spinning in these two boxes.
>
>Note, that Linux can't send FLUSH CACHE command at all (until very
> recent 2.6 kernels). So write cache is always dangerous under
> Linux, no matter if disk is broken or not.
>
>Another note: according to posix, sync() is asynchronous --- i.e. it
>initiates write, but doesn't have to wait for complete. In linux,
> sync() waits for writes to complete, but it doesn't have to in
> other OSes.
>
>Mikulas
>
Hmm, I'm getting the impression I should rerun that test script if I
can find it. I believe the last time I tried it I was running a
2.4.x kernel; right now it's 2.6.12-rc1.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.34% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
On Sunday 15 May 2005 22:32, Mark Lord wrote:
> >took place on another list, and wrote a test gizmo that copied a
> >large file, then slept for 1 second and issued a sync command. No
> >drive led activity until the usual 5 second delay of the
> > filesystem had expired. To me, that indicated that the sync
> > command was being
>
>There's your clue. The drive LEDs normally reflect activity
>over the ATA bus (the cable!). If they're not on, then the drive
>isn't receiving data/commands from the host.
>
>Cheers
That was my theory too, Mark, but Jeff G. says it's not a valid
indicator. So who's right?
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.34% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
On Sun, May 15, 2005 at 03:38:22PM +0200, Mikulas Patocka wrote:
> Another possibility to get timing is from direct-io --- i.e. initiate
> direct io read, wait until one cache line contains new data and you can be
> sure that the next will contain new data in certain time. IDE controller
> bus master operation acts here as a timer.
There's no way to do direct-io through seccomp; all the fds are pipes
with the twisted userland listening on the other side of the pipe. So
disabling the tsc is more than enough to give CPUShare users peace of
mind with HT enabled, and without having to flush the l2 cache either.
CPUShare is the only case I can imagine where untrusted and random
bytecode running at 100% system load is the normal behaviour.
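Per-task TSC disabling of exactly this shape is implementable; a sketch
using the PR_SET_TSC prctl, which postdates the kernels discussed here
and is shown only to illustrate the mechanism:

/* Make RDTSC fault for this task instead of returning a timestamp. */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TSC
#define PR_SET_TSC 26		/* values as in linux/prctl.h */
#define PR_TSC_SIGSEGV 2
#endif

int main(void)
{
	if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV, 0, 0, 0) < 0)
		perror("prctl(PR_SET_TSC)");
	/* From here on, executing RDTSC raises SIGSEGV. */
	return 0;
}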
Andi Kleen <[email protected]> writes:
> On Fri, May 13, 2005 at 02:26:20PM -0700, Andy Isaacson wrote:
> > On Fri, May 13, 2005 at 09:05:49PM +0200, Andi Kleen wrote:
> > > On Fri, May 13, 2005 at 02:38:03PM -0400, Richard F. Rebel wrote:
> > > > Why? It's certainly reasonable to disable it for the time being and
> > > > even prudent to do so.
> > >
> > > No, i strongly disagree on that. The reasonable thing to do is
> > > to fix the crypto code which has this vulnerability, not break
> > > a useful performance enhancement for everybody else.
> >
> > Pardon me for saying so, but that's bullshit. You're asking the crypto
> > guys to give up a 5x performance gain (that's my wild guess) by giving
> > up all their data-dependent algorithms and contorting their code wildly,
> > to avoid a microarchitectural problem with Intel's HT implementation.
>
> And what you're doing is to ask all the non crypto guys to give
> up an useful optimization just to fix a problem in the crypto guy's
> code. The cache line information leak is just a information leak
> bug in the crypto code, not a general problem.
It is not a problem in the crypto code; it is a mis-feature of
the hardware/kernel combination. As such you must be intimately
familiar with each and every flavor of the hardware to attempt to avoid
it in the software, and that way lies madness.
First, this is a reminder that perfect security requires an audit
of the hardware as well as the software. As we are neither
auditing the hardware nor locking it down, we obviously will not
achieve perfection. The question then becomes: what can be done
to decrease the likelihood that an application will inadvertently
and unavoidably leak information through timing attacks due to unknown
hardware optimizations? Attacks that do not result from hardware
micro-architecture are another problem, and one an application can
anticipate and avoid.
Ideally a solution will be proposed that allows this problem
to be avoided using the existing POSIX API, or at least the current
Linux kernel API. But that may not be the case.
The only solution I have seen proposed so far that seems to work
is to not schedule untrusted processes simultaneously with
the security code. With the current API that sounds like
a root process killing off, or at least stopping all non-root
processes until the critical process has finished.
Potentially the scheduler can be modified to do this at a finer
grain but I don't know if this would impact the scheduler fast
path. Given the rarity and uncertainty of this, it should probably
be something that a process worried about security asks for, instead
of something it simply gets by default.
So it looks to me like the sanest way to handle this is to
allocate a pool of threads/processes, one per CPU, set the
affinity of each to a particular CPU, set the priority
of the threads to the highest in the system, and during the
critical time ensure none of the threads is sleeping.
Can someone see a better way to prevent an accidental information
leak due to hardware architecture details?
I wish there were a better way to ensure all of the threads are
running simultaneously other than giving them the highest priority
in the system, but I don't currently see an alternative.
> There is much more non crypto code than crypto code around - you
> are proposing to screw the majority of codes to solve a relatively
> obscure problem of only a few functions, which seems like the totally
> wrong approach to me.
>
> BTW the crypto guys are always free to check for hyperthreading
> themselves and use different functions. However there is a catch
> there - the modern dual core processors which actually have
> separated L1 and L2 caches set these too to stay compatible
> with old code and license managers.
And those same processors will have the same problem if they share
significant CPU resources. Ideally the entire problem set
would fit in the cache and the CPU designers would allow cache
blocks to be locked, but that is not currently the case. So a shared
L3 cache with dual-core processors will have the same problem.
In addition, a flavor of this attack may be mounted by repeatedly doing
multiplies or other activities that exercise shared functional units
and seeing how long they have to be waited for. So even hyper-threading
without a shared L2 cache may see this problem.
Eric
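A rough sketch of the mitigation Eric outlines, assuming
sched_setaffinity(2), SCHED_FIFO and one busy-waiting thread per CPU
(error handling elided; SCHED_FIFO needs root, and the main thread is
assumed to pin itself to CPU 0 for the key operation):

/* Occupy every other CPU so no untrusted sibling can be co-scheduled
 * while the key operation runs. Busy-waiting is deliberate: a sleeping
 * thread frees its CPU (or HT sibling) for an attacker. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static volatile int key_op_done;

static void *occupy_cpu(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;
	struct sched_param sp = { .sched_priority = 1 };

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* bind to one CPU */
	sched_setscheduler(0, SCHED_FIFO, &sp);		/* outrank others */

	while (!key_op_done)
		;					/* spin, don't sleep */
	return NULL;
}

int main(void)
{
	long i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tid[ncpus];

	for (i = 1; i < ncpus; i++)	/* CPU 0 runs the key operation */
		pthread_create(&tid[i], NULL, occupy_cpu, (void *)i);

	/* ... perform the RSA operation here, pinned to CPU 0 ... */

	key_op_done = 1;
	for (i = 1; i < ncpus; i++)
		pthread_join(tid[i], NULL);
	return 0;
}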
On Sun, 2005-05-15 at 19:10 -0400, Lee Revell wrote:
> On Sun, 2005-05-15 at 18:55 -0400, Dave Jones wrote:
> > On Sun, May 15, 2005 at 05:10:59PM -0400, Lee Revell wrote:
> > > On Sun, 2005-05-15 at 22:48 +0200, Arjan van de Ven wrote:
> > > > On Sun, 2005-05-15 at 21:41 +0100, Alan Cox wrote:
> > > > > On Sul, 2005-05-15 at 08:30, Arjan van de Ven wrote:
> > > > > > stop entirely.... (and that is also happening more and more and linux is
> > > > > > getting more agressive idle support (eg no timer tick and such patches)
> > > > > > which will trigger bios thresholds for this even more too.
> > > > >
> > > > > Cyrix did TSC stop on halt a long long time ago, back when it was worth
> > > > > the power difference.
> > > >
> > > > With linux going to ACPI C2 mode more... tsc is defined to halt in C2...
> > >
> > > JACK doesn't care about any of this now, the behavior when you
> > > suspend/resume with a running jackd is undefined. Eventually we should
> > > handle it, but there's no point until the ALSA drivers get proper
> > > suspend/resume support.
> >
> > suspend/resume are S states, not C states. C states are occuring
> > during runtime.
>
> It should never go into C2 if jackd is running, because you're getting
> interrupts from the audio interface at least every 100ms or so (usually
> much more often) which will wake up jackd and any clients.
you're not guaranteed not to enter C2 in that case. C2 can happen after
just a few ms.
On Sun, 15 May 2005, Jeff Garzik wrote:
> The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
> command to be generated has only been present in the most recent 2.6.x
> kernels. See the "write barrier" stuff that people have been discussing.
To make this explicit and unmistakable: Linux should be ashamed of
having put its users' data at risk for as long as it has existed, and
judging by how often I still get "barrier synch failed", it still does
with the kernel SUSE Linux 9.3 shipped with.
This has come up several times, from database and mailserver authors,
but has found no reasonable solution to date.
The documentation of which file systems request a cache flush for
fsync, and which device drivers (SCSI as well as ATA) and chipset
drivers pass this down properly, is still missing. I've asked for help
with such a list several times over the recent years; I've offered my
help in setting up and maintaining the list when sent the raw
information, but no-one cared to provide this kind of information.
I will not try again; it's no good. Kernel hackers, with a handful of
exceptions, don't care.
If they think they do in spite of my statement, they'll have to prove
their point by growing up and documenting for which combinations of
(file system, mount options, block dev driver, hardware/chip driver)
barrier synch is 100% reliable, and which file systems, chipset
drivers, block drivers, and hardware drivers are missing links in the
chain -- and request that the kernel switch off the drive's write cache
in all drives unless the whole fsync() chain works (unless defeated by
a "benchmark" kernel boot parameter).
Until then, my applications will have to recommend that users switch off
drive caches for consistency.
P. S.: Yes, the subject and this mail are provocative and exaggerated a
tiny bit. I feel that's needed to raise the necessary motivation
to finally address this issue after a decade or so.
--
Matthias Andree
> The only solution I have seen proposed so far that seems to work
> is to not schedule untrusted processes simultaneously with
> the security code. With the current API that sounds like
> a root process killing off, or at least stopping all non-root
> processes until the critical process has finished.
With virtualization and a hypervisor freely scheduling, it is quite
impossible to guarantee this. Of course, as always, the signal
is quite noisy, so it is unclear whether it is exploitable in practical
settings. In virtualized environments you cannot use ps to see
if a crypto process is running.
> And those same processors will have the same problem if the share
> significant cpu resources. Ideally the entire problem set
> would fit in the cache and the cpu designers would allow cache
> blocks to be locked but that is not currently the case. So a shared
> L3 cache with dual core processors will have the same problem.
At some point the signal gets noisy enough, and the assumptions
an attacker has to make too great, for it to be a useful attack.
For me it is not even clear it is a real attack on native Linux; at
least the setup in the paper looked highly artificial and quite
impractical. E.g. I suppose it would be quite difficult to really
synchronize to the beginning and end of the RSA encryptions on a
server that does other things too.
-Andi
> and
> request that the kernel switches off the drive's write cache in all
> drives unless the whole fsync() stuff works (unless defeated by a
> "benchmark" kernel boot parameter).
I think you missed the part where disabling the writecache decreases the
mtbf of your disk by like a factor 100 or so. At which point your
dataloss opportunity INCREASES by doing this.
Sure you can wave rhetoric around, but the fact is that linux is
improving; there now is write barrier support for ext3 (and I assume
reiserfs) for at least IDE and iirc selected scsi too.
Let's repeat that again: disabling the writecache altogether is bad for
your disk. Really bad. Barriers aren't brilliant for it either, but a
heck of a lot better. Lacking barriers, it's probably safer for your
data to have the write cache on than off.
On Sun, 15 May 2005, Mikulas Patocka wrote:
> Note that a disk can still ignore the FLUSH CACHE command if the cached
> data are small enough to be written on power loss, so a small FLUSH
> CACHE time doesn't prove the disk is cheating.
Have you seen a drive yet that writes back blocks after power loss?
I have heard rumors about this, but all OEM manuals I looked at for
drives I bought or recommended simply stated that the block currently
being written at power loss can become damaged (with write cache off),
and that the drive can lose the full write cache at power loss (with
write cache on) so this looks like daydreaming manifested as rumor.
I've heard that drives would take rotational energy from their
spinning platters and such, but I've never heard how the hardware
compensates for the timing dilation as the rotational frequency
decreases, which also requires changed filter settings for the write
channel, block encoding, delays, possibly stepping the heads and so on.
I don't believe these stories until I see evidence.
These are corner cases that a vendor would hardly optimize for.
If you know a disk drive (not battery-backed disk controller!) that
flashes its cache to NVRAM, or uses rotational energy to save its cache
on the platters, please name brand and model and where I can download
the material that documents this behavior.
On Mon, 16 May 2005, Arjan van de Ven wrote:
> I think you missed the part where disabling the writecache decreases the
> mtbf of your disk by like a factor 100 or so. At which point your
> dataloss opportunity INCREASES by doing this.
Nah, if that were a factor of 100, then it should have been in the OEM
manuals, no?
Besides that, although my small sample is not representative, I have
older drives still alive & kicking - an MTBF of 1/100 of what the vendor
stated would mean a chance of failure way above 90% by now, yet the
drive has seen 22,000 POH with write cache off and has been a system
drive for some 14,000 POH. So?
> Sure you can wave rhetoric around, but the fact is that linux is
> improving; there now is write barrier support for ext3 (and I assume
> reiserfs) for at least IDE and iirc selected scsi too.
See the problem: "I assume", "IIRC selected...". There is no
list of corroborated facts about which systems work and which don't. I
have made several attempts at compiling one, posting public calls for
data here, with no response.
I don't blame you personally; I blame the lack of documentation about
such crucial facts, and of documentation in Linux environments in
general.
--
Matthias Andree
On Fri, May 13, 2005 at 05:39:25PM -0700, dean gaudet wrote:
> same cache index -- and get an 8-fold reduction in exposure. the trick
> here is the L2 is physically indexed, and userland code can perform only
> virtual allocations. but it's not too hard to discover physical conflicts
> if you really want to (using rdtsc) -- it would be done early in the
> initialization of the program because it involves asking for enough memory
> until the kernel gives you enough colliding pages. (a system call could
> help with this if we really wanted it.)
An 8-way set-associative 1M cache is guaranteed to go at L2 speed only
up to 128K (no matter what the kernel does), but even if the secret
payload is larger than 128K, as long as the load is still distributed
evenly at each pass for each page, there's not going to be any covert
channel; the process will simply run slower than it could if it had
better page coloring.
So I don't see the need for kernel support: all userland needs to know
is the page size, and that's provided already.
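The arithmetic behind that 128K figure, assuming 64-byte lines (the
per-way answer is independent of the line size):

/* 1 MiB, 8-way set-associative: how much data maps to distinct sets? */
#include <stdio.h>

int main(void)
{
	unsigned cache = 1024 * 1024;	/* total cache size: 1 MiB */
	unsigned ways  = 8;
	unsigned line  = 64;		/* bytes per cache line (assumed) */

	unsigned sets = cache / (ways * line);	/* 2048 sets */
	unsigned span = sets * line;		/* == cache / ways */

	printf("set indices repeat every %u KiB\n", span / 1024); /* 128 */
	return 0;
}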
Gene Heskett wrote:
>
>>There's your clue. The drive LEDs normally reflect activity
>>over the ATA bus (the cable!). If they're not on, then the drive
>>isn't receiving data/commands from the host.
>
> That was my theory too Mark, but Jeff G. says its not a valid
> indicator. So who's right?
If the LEDs are connected to the controller on the motherboard,
then they are a strict indication of activity over the cable
between the drive and controller (if they function at all).
But it is possible for software to leave those LEDs permanently
in the "on" state, depending on the register sequence used.
If the LEDs are on the drive itself, they may indicate transfers
over the connector (cable) -- usually always the case -- or they
could indicate transfers to/from the media.
Cheers
>To make this explicit and unmistakable, Linux should be ashamed of
>having put its users' data at risk for as long as it has existed, and
>looking at how often I still get "barrier synch failed", it still does
>with the kernel SUSE Linux 9.3 shipped with.
With ATA drives, this is strictly a userspace "policy" decision.
Most of us want longer lifespan and 2X the performance from our hardware,
and use UPSs to guarantee continuous power & survivability.
Others want to live more dangerously on the power supply end,
but still be safe on the filesystem end -- no guarantees there,
even with "hdparm -W0" to disable the on-drive cache.
Pulling power from a writing drive is ALWAYS a bad idea,
and can permanently corrupt the track/cylinder that was being
written. This will toast a filesystem regardless of how carefully
or properly the write flushes were done.
Write caching on the drive is not as big an issue as
good reliable power for this.
Cheers
On Mon, 2005-05-16 at 13:29 +0200, Matthias Andree wrote:
> On Mon, 16 May 2005, Arjan van de Ven wrote:
>
> > I think you missed the part where disabling the writecache decreases the
> > mtbf of your disk by like a factor 100 or so. At which point your
> > dataloss opportunity INCREASES by doing this.
>
> Nah, if that were a factor of 100, then it should have been in the OEM
> manuals, no?
Why would they? Windows doesn't do it. They only need to advertise MTBF
in the default settings (and I guess in Windows).
They do talk about this if you ask them.
> So?
one sample doesn't prove the statistics are wrong.
>
> > Sure you can waive rethorics around, but the fact is that linux is
> > improving; there now is write barrier support for ext3 (and I assume
> > reiserfs) for at least IDE and iirc selected scsi too.
>
> See the problem: "I assume", "IIRC selected...". There is no
> list of corroborated facts which systems work and which don't. I have
> made several attempts in compiling one, posting public calls for data
> here, no response.
Well, what stops you from building that list by doing the actual
work yourself?
Matthias Andree wrote:
> On Sun, 15 May 2005, Mikulas Patocka wrote:
>
>
>>Note that disk can still ignore FLUSH CACHE command cached data are small
>>enough to be written on power loss, so small FLUSH CACHE time doesn't
>>prove disk cheating.
>
>
> Have you seen a drive yet that writes back blocks after power loss?
>
> I have heard rumors about this, but all OEM manuals I looked at for
> drives I bought or recommended simply stated that the block currently
> being written at power loss can become damaged (with write cache off),
> and that the drive can lose the full write cache at power loss (with
> write cache on) so this looks like daydreaming manifested as rumor.
Upon power loss, at least one ATA vendor's disks try to write out as
much data as possible.
Jeff
On Mon, 16 May 2005, Arjan van de Ven wrote:
> > See the problem: "I assume", "IIRC selected...". There is no
> > list of corroborated facts which systems work and which don't. I have
> > made several attempts in compiling one, posting public calls for data
> > here, no response.
>
> well what stops you from building that list yourself by doing the actual
> work yourself?
Two things.
#1 It's the subsystem maintainer's responsibility to arrange for such
information. I searched Documentation/* to no avail, see below.
#2 I would need to get acquainted with and understand several dozen
subsystems, drivers and so on to be able to make a substantiated
statement.
Subsystem maintainers will usually know the shape their code is in and
just need to state "not yet", "not planned", "not needed, different
layer", "work in progress" or "working since kernel version 2.6.42".
That takes a minute per maintainer, rather than my wasting countless
hours working through foreign code only to forget all of it once I know
what I wanted to know. Sounds like an unreasonable expectation? Not to
me. I had hoped, several times, that asking here would yield the first
dozen answers as a starting point.
It's not as though I could just take two weeks off and read all the
common block device code...
I still have insufficient information even for ext3 on traditional
parallel ATA interfaces, so how do I start a list without information?
$ cd linux-2.6/Documentation/
$ find -iname '*barr*'
./arm/Sharp-LH/IOBarrier
$ head -4 ../Makefile
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 11
EXTRAVERSION = .9
$
Documentation/block/biodoc.txt has some information about how things
could look two years from now. filesystems/ext3 mentions it requires a
barrier=1 mount option. There is no information on which block
interfaces support it. AIC7XXX was once reported to have it,
experimentally; I don't know what has become of that code, and I don't
have an AIC7XXX here - too expensive.
--
Matthias Andree
On Sul, 2005-05-15 at 16:00, Mikulas Patocka wrote:
> There are rumors that some disks ignore FLUSH CACHE command just to get
> higher benchmarks in Windows. But I haven't heart of any proof. Does
> anybody know, what companies fake this command?
The specification was intentionally written so that this command has to
do what it is specified to do, or be unknown (and thus error and not
appear in the ident info).
That was done by people who wanted to be very sure that any vendor who
tried to shortcut the command would have "sue me" written on their
forehead.
There are problems with a few older drives which have a write cache but
don't support cache commands.
Alan
> I have heard rumors about this, but all OEM manuals I looked at for
> drives I bought or recommended simply stated that the block currently
> being written at power loss can become damaged (with write cache off),
> and that the drive can lose the full write cache at power loss (with
> write cache on) so this looks like daydreaming manifested as rumor.
IBM drives definitely used to trash the sector in this case. The newer
ones either don't, or recover from it, presumably because people took
that to be a drive failure and returned the drive. Sometimes the people
win ;)
> flashes its cache to NVRAM, or uses rotational energy to save its cache
> on the platters, please name brand and model and where I can download
> the material that documents this behavior.
I am not aware of any IDE drive with these properties.
On Llu, 2005-05-16 at 12:12, Arjan van de Ven wrote:
> Sure you can wave rhetoric around, but the fact is that linux is
> improving; there now is write barrier support for ext3 (and I assume
> reiserfs) for at least IDE and iirc selected scsi too.
scsi supports tagging so ext3 at least is just fine.
On Mon, 16 May 2005, Mark Lord wrote:
> Most of us want longer lifespan and 2X the performance from our hardware,
> and use UPSs to guarantee continuous power & survivability.
Which is a different story and doesn't protect from a dying power
supply unit. I have replaced several PSUs that died "in mid-flight" and
were not overloaded. A UPS isn't going to help in that case. Of course
you can use a redundant PSU, a redundant UPS - but that's easily more
than a battery-backed cache on a decent RAID controller - since drive
failure will also toast file systems.
> Others want to live more dangerously on the power supply end,
> but still be safe on the filesystem end -- no guarantees there,
> even with "hdparm -W0" to disable the on-drive cache.
As long as one can rely on the kernel scheduling writes in the proper
order, no problem that I'd see. ext3 has apparently been doing this for
a long time in the default options, and I have yet to see ext3
corruption (except for massive hardware failure such as b0rked non-ECC
RAM or a harddisk that crashed its heads).
> Pulling power from a writing drive is ALWAYS a bad idea,
> and can permanently corrupt the track/cylinder that was being
> written. This will toast a filesystem regardless of how careful
> or proper the write flushes were done.
Most drive manufacturers make more extensive guarantees about what does
NOT get damaged when a write is interrupted by power loss, and they are
careful to turn the write current off quite soon after power loss. None
of the OEM manuals I looked at suggested that data already on the disk
would be damaged beyond the block that was being written.
--
Matthias Andree
I think you need to get real if you want that degree of integrity with
a PC. Your typical PC setup means your precious data:
- gets written to non-ECC-protected memory over an unprotected bus
- gets read back over the same
- each PATA command is sent without any CRC or error recovery/correction
- the PATA data is pulled out of unprotected memory over PCI
- it goes to the drive (with a CRC) and gets stored in memory
- it's probably sitting in non-ECC RAM on the disk
- it's probably fed through non-ECC DSP logic
- it's mixed on the disk with other data and may get rewritten without
  you knowing
You might want to amuse yourself trying to get the bit error rates for
the busses and RAM to start documenting the probabilities.
I'd prefer Linux turned the write cache off on old drives, but Mark Lord
has really good points even there. And for SCSI we do tagging, and the
journals can be ordered depending on your need.
You are storing 40 billion bits of information on a lump of metal and
glass rotating at 10,000 rpm, pushing into areas of quantum theory in
order to store your data. It should be no surprise that it might not be
there a month later.
You also appear confused: it isn't the maintainer's responsibility to
arrange for such info. It's the maintainer's responsibility to process
contributed patches with such info.
On Mon, 16 May 2005, Jeff Garzik wrote:
> Matthias Andree wrote:
>> On Sun, 15 May 2005, Mikulas Patocka wrote:
>>
>>
>>> Note that disk can still ignore FLUSH CACHE command cached data are small
>>> enough to be written on power loss, so small FLUSH CACHE time doesn't
>>> prove disk cheating.
>>
>> Have you seen a drive yet that writes back blocks after power loss?
>>
>> I have heard rumors about this, but all OEM manuals I looked at for
>> drives I bought or recommended simply stated that the block currently
>> being written at power loss can become damaged (with write cache off),
>> and that the drive can lose the full write cache at power loss (with
>> write cache on) so this looks like daydreaming manifested as rumor.
>
> Upon power loss, at least one ATA vendor's disks try to write out as
> much data as possible.
>
> Jeff
Then I suggest you never use such a drive. Anything that does this
will end up replacing a good track with garbage. Unless a disk drive
has a built-in power source such as super-capacitors or batteries, what
happens during a power failure is that all the electronics stop and
the discs start coasting. Eventually the heads will crash onto
the platter. Older discs had a magnetically released latch which would
send the heads to an inside landing zone. Nobody bothers anymore.
Many high-quality drives cache data. Fortunately, upon power loss
these data are NOT attempted to be written. This means that,
although you may have incomplete or even bad data on the physical
medium, at least the medium can be read and written. The sectoring
has not been corrupted (read: destroyed).
If you think about the physical process necessary to write data to
the medium, you will understand that without a large amount of
energy storage capacity on the disk, it's just not possible.
To write a sector, one needs to cache the data in a sector buffer,
putting on a sector header and trailing CRC, wait for the write
splice from the previous sector (which could be almost one rotation),
then write the data and sync to the sector. If the disc is too slow,
the data will underwrite the sector. Also, if the disc
was only 5 percent slow, the clock recovery on a subsequent
read will be off by 5 percent, outside the range of PLL lock-in,
so you write something that can never be read: a guaranteed bad block.
Combinations of journaling on media that can be completely flushed
and ordinary cache-intensive discs can result in reliable data
storage. However, a single ATA or SCSI disk just isn't a perfectly
reliable storage medium, although it's usually good enough.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.
On Mon, 16 May 2005, Alan Cox wrote:
> I'd prefer Linux turned writecache off on old drives but Mark Lord has
> really good points even there. And for scsi we do tagging and the
> journals can be ordered depending on your need.
Is tagged command queueing (we'll need the ordered tag here) compatible
with all SCSI adaptors that Linux supports?
What if tagged command queueing is switched off for some reason
(adaptor or HW incapability, user override) and the drive still has
write cache enable = true and queue algorithm modifier = 1 (which
permits out-of-order execution of write requests except for ordered
tags)? Is that something that would cause some bit of notice to be
logged? Or is that simply "do this at your own risk"? My recent SCSI
drives have been shipping with WCE=1 and QAM=0.
Am I missing a bit here?
> You also appear confused: It isn't the maintainers responsibility to
> arrange for such info. It's the maintainers responsibility to process
> contributed patches with such info.
I didn't think of arranging as in "write himself". Who writes that info
down doesn't matter, but I'd think that such documentation should always
be committed alongside the code, except in code marked experimental.
(which, in turn, should only be promoted to non-experimental if it's
properly documented).
I understand that people who understand the code are eager to focus on
the code and even if that documentation is just an unordered lists of
statement with a kernel version attached, that'd be fine. But what is a
decent code without users?
--
Matthias Andree
On Mon, 16 May 2005, Richard B. Johnson wrote:
> Then I suggest you never use such a drive. Anything that does this,
> will end up replacing a good track with garbage. Unless a disk drive
> has a built-in power source such as super-capacitors or batteries, what
> happens during a power-failure is that all electronics stops and
> the discs start coasting. Eventually the heads will crash onto
> the platter. Older discs had a magnetically released latch which would
> send the heads to an inside landing zone. Nobody bothers anymore.
IBM/Hitachi hard disk drives still use a "load/unload ramp" that
entirely moves the heads off the platters - I've known this since the
DJNA, and it is still advertised in Deskstar 7K500 and Ultrastar 15K147
to name just two examples.
--
Matthias Andree
On Fri, 13 May 2005, Scott Robert Ladd wrote:
>
> Alan Cox wrote:
> > HT for most users is pretty irrelevant, its a neat idea but the
> > benchmarks don't suggest its too big a hit
>
> On real-world applications, I haven't seen HT boost performance by more
> than 15% on a Pentium 4 -- and the usual gain is around 5%, if anything
> at all. HT is a nice idea, but I don't enable it on my systems.
HT is _wonderful_ for latency reduction.
Why people think "performance" means "throughput" is something I'll never
understand. Throughput is _always_ secondary to latency, and really only
becomes interesting when it becomes a latency number (ie "I need higher
throughput in order to process these jobs in 4 hours instead of 8" -
notice how the real issue was again about _latency_).
Now, Linux tends to have pretty good CPU latency anyway, so it's not
usually that big of a deal, but I definitely enjoyed having a HT machine
over a regular UP one. I'm told the effect was even more pronounced on
XP.
Of course, these days I enjoy having dual cores more, though, and with
multiple cores, the latency advantages of HT become much less pronounced.
As to the HT "vulnerability", it really seems to be not a whole lot
different than what people saw with early SMP and (small) direct-mapped
caches. Thank God those days are gone.
I'd be really surprised if somebody is actually able to get a real-world
attack on a real-world pgp key usage or similar out of it (and as to the
covert channel, nobody cares). It's a fairly interesting approach, but
it's certainly neither new nor HT-specific, or necessarily seem all that
worrying in real life.
(HT and modern CPU speeds just means that the covert channel is _faster_
than it has been before, since you can test the L1 at core speeds. I doubt
it helps the key attack much, though, since faster in that case cuts both
ways: the speed of testing the cache eviction may have gone up, but so has
the speed of the operation you're trying to follow, and you'd likely have
a really hard time trying to catch things in real life).
It does show that if you want to hide key operations, you want to be
careful. I don't think HT is at fault per se.
Linus
Uttered Linus Torvalds <[email protected]>, spake thus:
> It does show that if you want to hide key operations, you want to be
> careful. I don't think HT is at fault per se.
Trivially easy when two processes share the same FS namespace.
Consider two files:
$ ls -l /tmp/a /tmp/b
-rw------- 1 owner owner xxxxx /tmp/a
-rw------- 1 owner owner xxxxx /tmp/b
One file serves as a clock. Note that the permissions deny all
access to everyone except the owner. The owner user then does this,
intentionally or unintentionally:
for x in 0 0 0 1 0 0 0 0 0 1
do
	rm -f /tmp/a /tmp/b	# reset the clock so the reader sees a new bit coming
	sleep 2
	case "$x" in
	1 ) touch /tmp/a;;	# a 1 bit is transmitted as "/tmp/a exists"
	esac
	touch /tmp/b		# tick the clock: the bit may now be read
	sleep 2
done
And the baddie does this:
let n=0
let char=0
while (( n < 10 ))		# the sender transmits ten bits
do
	while [ -f /tmp/b ]; do		# wait for the clock to reset
		sleep 0.5
	done
	while [ ! -f /tmp/b ]; do	# wait for the clock to tick
		sleep 0.5
	done
	let "char = char << 1"
	if [ -f /tmp/a ]; then		# the bit is 1 iff /tmp/a exists
		let "char = char + 1"
	fi
	let "n = n + 1"
done
printf "The letter was: %b\n" "\\x$(printf '%02x' "$char")"
This is one of the classic TEMPEST problems that secure systems have
long had to deal with. See, at no time did HT ever raise its ugly
head ;-)
Cheers
On Llu, 2005-05-16 at 16:40, Matthias Andree wrote:
> On Mon, 16 May 2005, Alan Cox wrote:
> Is tagged command queueing (we'll need the ordered tag here) compatible
> with all SCSI adaptors that Linux supports?
TCQ is a device property, not a controller property.
> What if tagged command queueing is switched off for some reason
> (adaptor or HW incapability, user override) and the drive still has
> write cache enable = true and queue algorithm modifier = 1 (which
We turn the write back cache off if TCQ isn't available.
On Mon, 16 May 2005 10:33:30 EDT, Jeff Garzik said:
> Upon power loss, at least one ATA vendor's disks try to write out as
> much data as possible.
Does the firmware for this vendor's disks have enough smarts to reserve that
last little bit of power to park the heads so it's not actively writing when
it finally loses entirely?
* Alan Cox:
> On Llu, 2005-05-16 at 16:40, Matthias Andree wrote:
>> On Mon, 16 May 2005, Alan Cox wrote:
>> Is tagged command queueing (we'll need the ordered tag here) compatible
>> with all SCSI adaptors that Linux supports?
>
> TCQ is a device not controller property.
I suppose it's one in RAID controllers.
Andi Kleen <[email protected]> writes:
> > The only solution I have seen proposed so far that seems to work
> > is to not schedule untrusted processes simultaneously with
> > the security code. With the current API that sounds like
> > a root process killing off, or at least stopping all non-root
> > processes until the critical process has finished.
>
> With virtualization and a hypervisor freely scheduling it is quite
> impossible to guarantee this. Of course as always the signal
> is quite noisy so it is unclear if it is exploitable in practical
> settings. On virtualized environments you cannot use ps to see
> if a crypto process is running.
Interesting. I think that is a problem for the hypervisor maintainer.
Although that is about enough to convince me to request an
OS flag that says "please give me privacy" which can later be passed
down to the hypervisor. My gut feeling is that running under a
hypervisor is when things will be at their most vulnerable.
Where this is a threat is when there will be a lot of RSA
key transactions, at which point it is likely that the attacker
can reproduce enough of the setup to figure out the fine details.
I think discovering a crypto process will simply be a matter of
finding an https server. As for getting the timing, how about
initiating an https connection? Getting rid of the noise will certainly
be a challenge, but you will have multiple attempts.
> > And those same processors will have the same problem if the share
> > significant cpu resources. Ideally the entire problem set
> > would fit in the cache and the cpu designers would allow cache
> > blocks to be locked but that is not currently the case. So a shared
> > L3 cache with dual core processors will have the same problem.
>
> At some point the signal gets noisy enough and the assumptions
> an attacker has to make too great for it being an useful attack.
> For me it is not even clear it is a real attack on native Linux, at
> least the setup in the paper looked highly artifical and quite impractical.
> e.g. I suppose it would be quite difficult to really synchronize
> to the beginning and end of the RSA encryptions on a server that
> does other things too.
Possibly. But buffer overflow attacks when you don't know the exact
stack layout are similarly difficult, and ways have been found. And if
you have multiple chances things get easier. And if you are aiming
at something easier than brute-forcing a private key, even the littlest
bit is a help.
When people mmap pages we zero them for the same reason, so that
we don't have unintentional information leaks.
I agree that for now, because little is known, this is a highly
specialized attack. However, the trend is now towards increasingly big
SMPs. That increases the number of resources that can be shared, so the
possibility of a problem increases. At the rate Intel's CPUs are
going, we may see throttling of one CPU core when the other one
generates too much heat because it is busy doing something else
CPU-intensive. And other optimizations lead to vulnerabilities that are
much easier to imagine.
As for noise: in the area CPU designers are getting into, things
are becoming increasingly fine grained, so information is leaking
at an increasingly fine level. As the L2 cache issue has shown,
that information starts to leak below the level an application
designer has control of, at which point things get very difficult
to manage.
Exploiting an information leak is more difficult than simply gaining root on
the box, where you can simply take the information you want. But
that means leaks are exactly where a locked-down, well-administered
box will be vulnerable if a way is not found to avoid the problem.
I don't know what the consequences of having your private key
discovered are, but I have never heard of a case where identity theft
was something pleasant to fix.
Eric
On Mon, 16 May 2005 13:14:23 MDT, Eric W. Biederman said:
> Interesting. I think that is a problem for the hypervisor maintainer.
> Although that is about enough to convince me to request an
> OS flag that says "please give me privacy" which could later be passed
> down to the hypervisor. My gut feeling is that running under a hypervisor
> is when things will be at their most vulnerable.
Not really, because....
> I think discovering a crypto process will simply be a matter of
> finding an https server. As for getting the timing, how about
> initiating an https connection? Getting rid of the noise will certainly
> be a challenge, but you will have multiple attempts.
And the hypervisor is, if anything, adding noise.
Richard B. Johnson wrote:
> Then I suggest you never use such a drive. Anything that does this,
> will end up replacing a good track with garbage. Unless a disk drive
> has a built-in power source such as super-capacitors or batteries, what
> happens during a power-failure is that all electronics stops and
> the discs start coasting. Eventually the heads will crash onto
If the power to the drive is truly just cut, then this is basically what
will happen. However, I have heard, for what it's worth, that in many
cases if you pull the AC power from a typical PC, the Power Good signal
from the PSU will be de-asserted, which triggers the Reset line on all
the buses, which triggers the ATA reset line, which triggers the drive
to finish writing out the sector it is doing. There is likely enough
capacitance in the power supply to do that before the voltage drops off.
> the platter. Older discs had a magnetically released latch which would
> send the heads to an inside landing zone. Nobody bothers anymore.
Sure they do. All current or remotely recent drives (to my knowledge,
anyway) will park the heads properly at the landing zone on power-off.
If the drive is told to power off cleanly, this works as expected, and
if the power is simply cut, the remaining energy in the spinning
platters is used like a generator to provide power to move the head
actuator to the park position.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Mon, 16 May 2005, Alan Cox wrote:
> > flashes its cache to NVRAM, or uses rotational energy to save its cache
> > on the platters, please name brand and model and where I can download
> > the material that documents this behavior.
>
> I am not aware of any IDE drive with these properties.
I'm not sure I know of a SCSI drive which does that, either. It was a big
thing a few decades ago to use rotational energy to park the heads, but I
haven't seen discussion of save-to-NVRAM. Then again, I haven't been
looking for it.
What would be ideal is some cache which didn't depend on power to maintain
state, like core (remember core?) or the bubble memory which spent almost
a decade being just slightly too {slow,costly} to replace disk. There
doesn't seem to be a cost effective technology yet.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On May 17, 2005, at 09:15:52, Bill Davidsen wrote:
> What would be ideal is some cache which didn't depend on power to maintain
> state, like core (remember core?) or the bubble memory which spent almost
> a decade being just slightly too {slow,costly} to replace disk. There
> doesn't seem to be a cost effective technology yet.
I've seen some articles recently on a micro-punchcard technology that
uses grids of thousands of miniature needles and sheets of polymer
plastic that can be melted at somewhat low temperatures to create or
remove indentations in the plastic. The device can read and write each
position at a very high rate, and since there are several thousand bits
per position, with one bit for each needle, the bandwidth is enormous.
(And it scales linearly with the size of the device, too!) Purportedly
these grids can be easily built with slight modifications to modern
semiconductor etching technologies, and the polymer plastic is
reasonably simple to manufacture, so the resultant cost per device is
hundreds of times cheaper than today's drives. Likewise, they have
significantly higher memory density than current hardware due to fewer
relativistic and quantum effects (no magnetism).
Cheers,
Kyle Moffett
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$
r !y?(-)
------END GEEK CODE BLOCK------
On 5/15/05, Mark Lord <[email protected]> wrote:
> There's your clue. The drive LEDs normally reflect activity
> over the ATA bus (the cable!). If they're not on, then the drive
> isn't receiving data/commands from the host.
Mark is correct, activity indicators are associated with bus activity,
not internal drive activity.
On 5/16/05, Matthias Andree <[email protected]> wrote:
> I've heard that drives would be taking rotational energy from their
> rotating platters and such, but never heard how the hardware compensates
> the dilation with decreasing rotational frequency, which also requires
> changed filter settings for the write channel, block encoding, delays,
> possibly stepping the heads and so on. I don't believe these stories
> until I see evidence.
I'm pretty sure that most drives out there will immediately attempt to
safely retract or park the heads the instant that a power loss is
detected. Too much damage can occur if the heads fail to reach a
landing zone or ramp; trying to save "one more block of cached data"
just isn't worth the risk.
--eric
On 5/16/05, Robert Hancock <[email protected]> wrote:
> If the power to the drive is truly just cut, then this is basically what
> will happen. However, I have heard, for what it's worth, that in many
> cases if you pull the AC power from a typical PC, the Power Good signal
> from the PSU will be de-asserted, which triggers the Reset line on all
> the buses, which triggers the ATA reset line, which triggers the drive
> to finish writing out the sector it is doing. There is likely enough
> capacitance in the power supply to do that before the voltage drops off.
Yes, but as you said this isn't a power loss event. It is a hard
reset with a full write cache, which all drives on the market today
respond to by flushing the cache.
According to the spec the time to flush can exceed 30s, so your PSU
better have some honkin caps on it to ensure data integrity when you
yank the power cord out of the wall.
--eric
On May 17, 2005, at 21:41:39, Kyle Moffett wrote:
>I've seen some articles recently on a micro-punchcard technology that uses
>grids of thousands of miniature needles and sheets of polymer plastic
Bwa-ha-ha! That's rich. You should have saved that one for next April
1st!
Does it use micro-relay logic to drive the micro-punchcard reader? Or
does it have nano-technology vacuum tube logic circuits?
Good one.
-Paul
Eric,
> On 5/16/05, Robert Hancock <[email protected]> wrote:
> > If the power to the drive is truly just cut, then this is basically
> > what will happen. However, I have heard, for what it's worth, that in
> > many cases if you pull the AC power from a typical PC, the Power Good
> > signal from the PSU will be de-asserted, which triggers the Reset line
> > on all the buses, which triggers the ATA reset line, which triggers
> > the drive to finish writing out the sector it is doing. There is
> > likely enough capacitance in the power supply to do that before the
> > voltage drops off.
>
> Yes, but as you said this isn't a power loss event. It is a
> hard reset with a full write cache, which all drives on the
> market today respond to by flushing the cache.
>
> According to the spec the time to flush can exceed 30s, so
> your PSU better have some honkin caps on it to ensure data
> integrity when you yank the power cord out of the wall.
Why don't drive vendors create firmware which reserved a cache-sized
(e.g. 2MB) hole of internal drive space somewhere for such an event, and
a "cache flush caused by hard-reset" simply caused it to write the cache
to a fixed (contiguous) area of disk?
The same drive firmware on power-on could check that area and 'write
back' the data to the correct locations.
All said and done, why wouldn't a vendor (let's just say "Maxtor" :) )
implement something like this and market it as a feature?
I'd happily spend a few extra bucks for something that, given a modern
PSU providing a few mains cycles' worth of power (e.g. 50 msec), provided
higher data reliability in case of power failure.
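Something like the following, purely schematic; the on-disk layout, the
entry format and the firmware primitives are all invented here for
illustration, not taken from any real drive:

/* Hypothetical replay half of the idea: on power-up, scan a reserved
 * journal area and copy any committed entries to where they belong. */
#include <stdint.h>

#define SECTOR_SIZE   512
#define JOURNAL_SLOTS 4096            /* ~2MB of cached sectors */

struct journal_entry {
	uint64_t target_lba;          /* where the data really belongs */
	uint32_t valid;               /* written last, after the payload */
	uint8_t  data[SECTOR_SIZE];   /* one cached sector */
};

/* Firmware-provided primitives, assumed for the sketch. */
extern void journal_read(uint32_t slot, struct journal_entry *e);
extern void journal_write(uint32_t slot, const struct journal_entry *e);
extern void media_write(uint64_t lba, const void *sector);

void replay_journal(void)
{
	struct journal_entry e;
	uint32_t slot;

	for (slot = 0; slot < JOURNAL_SLOTS; slot++) {
		journal_read(slot, &e);
		if (!e.valid)
			break;                /* end of committed entries */
		media_write(e.target_lba, e.data);
		e.valid = 0;                  /* mark the slot replayed */
		journal_write(slot, &e);
	}
}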
cheers,
lincoln.
On Wed, 18 May 2005, Paul Zimmerman wrote:
> On May 17, 2005, at 21:41:39, Kyle Moffett wrote:
>> I've seen some articles recently on a micro-punchcard technology that uses
>> grids of thousands of miniature needles and sheets of polymer plastic
>
> Bwa-ha-ha! That's rich. You should have saved that one for next April
> 1st!
> Does it use micro-relay logic to drive the micro-punchcard reader? Or
> does it have nano-technology vacuum tube logic circuits?
>
> Good one.
>
> -Paul
Actually, carbon nanotubes; vacuum tubes not needed! May need a
"filament" transformer, though ^:
Cheers,
Dick Johnson
Penguin : Linux version 2.6.11.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.
Paul Zimmerman wrote:
> On May 17, 2005, at 21:41:39, Kyle Moffett wrote:
>
>>I've seen some articles recently on a micro-punchcard technology that uses
>>grids of thousands of miniature needles and sheets of polymer plastic
>
>
> Bwa-ha-ha! That's rich. You should have saved that one for next April
> 1st!
> Does it use micro-relay logic to drive the micro-punchcard reader? Or
> does it have nano-technology vacuum tube logic circuits?
>
> Good one.
No, actually. That one's for real. See:
http://www.zurich.ibm.com/st/storage/millipede.html
Looks like it will be the next generation of storage after rotating
discs.
(*grumble* ... forgot to hit 'reply all'. Sorry!)
Stephan Wonczak
"I haven't lost my mind; I know exactly where I left it."
"The meaning of my life is to make me crazy"
>>>>> "Lincoln" == Lincoln Dale \(ltd\) <Lincoln> writes:
Lincoln> why don't drive vendors create firmware which reserved a
Lincoln> cache-sized (e.g. 2MB) hole of internal drive space somewhere
Lincoln> for such an event, and a "cache flush caused by hard-reset"
Lincoln> simply caused it to write the cache to a fixed (contiguous)
Lincoln> area of disk.
Well, if you're losing power in the next X milliseconds, do you have
the time to seek to the cache holding area and settle down the head
(since you could have done a seek from the edge of the disk to the
middle), start writing, etc? Seems better to have a cache-sized flash
RAM instead where you could just keep the data there in case of power
loss.
But that's expensive, and not something most people need...
John
On Fri, 13 May 2005, Alan Cox wrote:
> > This is not a kernel problem, but a user space problem. The fix
> > is to change the user space crypto code to need the same number of cache line
> > accesses on all keys.
>
> You actually also need to hit the same cache line sequence on all keys
> if you take a bit more care about it.
>
> > Disabling HT for this would the totally wrong approach, like throwing
> > out the baby with the bath water.
>
> HT for most users is pretty irrelevant, its a neat idea but the
> benchmarks don't suggest its too big a hit
This is one of those things which can give any result depending on the
measurement. For kernel compiles I might see a 5-30% reduction in clock
time, for threaded applications like web/mail/news not much, and for
applications which communicate via shared memory up to 50% because some
blocking system calls can be avoided and cache impact is lower.
In general I have to agree that the hit isn't too big, but I haven't seen any
indication that the hole can be exploited without being able to run a
custom application on the machine, so for single-user machines and
servers the risk level seems low.
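For reference, a minimal sketch (mine, with invented names) of the
user-space fix quoted above: make every lookup touch every cache line of
the secret-indexed table, so a sibling hyper-thread watching cache
evictions learns nothing about the index:

/* Constant-cache-footprint table lookup. Every entry is read on every
 * call, so the set of cache lines touched is independent of the
 * secret index. */
#include <stddef.h>
#include <stdint.h>

#define TABLE_ENTRIES 256

static uint32_t table[TABLE_ENTRIES];

uint32_t ct_lookup(size_t secret_idx)
{
	uint32_t result = 0;
	size_t i;

	for (i = 0; i < TABLE_ENTRIES; i++) {
		/* mask is all-ones only when i == secret_idx */
		uint32_t mask = (uint32_t)0 - (uint32_t)(i == secret_idx);
		result |= table[i] & mask;
	}
	return result;
}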
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
> -----Original Message-----
> From: John Stoffel [mailto:[email protected]]
> Sent: Wednesday, 18 May 2005 11:49 PM
> To: Lincoln Dale (ltd)
> Cc: Eric D. Mudama; Robert Hancock; linux-kernel
> Subject: RE: Disk write cache (Was: Hyper-Threading Vulnerability)
>
> >>>>> "Lincoln" == Lincoln Dale \(ltd\) <Lincoln> writes:
>
> Lincoln> why don't drive vendors create firmware which reserved a
> Lincoln> cache-sized (e.g. 2MB) hole of internal drive space somewhere
> Lincoln> for such an event, and a "cache flush caused by hard-reset"
> Lincoln> simply caused it to write the cache to a fixed (contiguous)
> Lincoln> area of disk.
>
> Well, if you're losing power in the next Xmilliseconds, do
> you have the time to seek to the cache holding area and
> settle down the head (since you could have done a seek from
> the edge of the disk to the middle), start writing, etc?
I believe it's possible.
Rationale:
[1] ATX power specification (Google finds this for me at
http://www.formfactors.org/developer%5Cspecs%5CATX12V_1_3dg.pdf),
section 3.2.11 (Voltage Hold-up Time), states:
    The power supply should maintain output regulation per Section 3.2.1
    despite a loss of input power at the low-end nominal range (115 VAC /
    57 Hz or 230 VAC / 47 Hz) at maximum continuous output load, as
    applicable, for a minimum of 17 ms.
The assumption here is that T6 in figure 5 de-asserts the POWER_OK
signal early in that "minimum of 17 ms";
the spec (unfortunately) only calls for >= 1 msec.
Once again, I see that there could be a market for a combination of
p/s & peripherals that could make use of it.
Let's say that we DO have 17 msec.
[2] Hard drive response times
Picking a 'standard' high-end hard drive (Maxtor Atlas 10K V SCSI disk):
    average seek + rotational latency is measured at 7.6 msec;
    transfer rates are 89.5 MB/s at the beginning of the disk and
    53.9 MB/s at the end.
    (source:
    http://www.storagereview.com/articles/200411/200411028D300L0_2.html)
Allowing 8 msec for seek time, and writing at the 'slow' side of the
disk, writing 2MB could take ~37 msec (2 / 53.9). Allow 50% overhead
here and we have 55 msec.
55 + 8 = 63 msec.
OK, 63 msec doesn't fit into 17 msec, but as I say, a combination of p/s
and/or larger caps (and/or a more innovative design by a case or p/s
manufacturer which creates a dedicated peripheral power bus) could
close the gap.
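The same arithmetic as a throwaway sketch (the numbers are the ones
quoted above; the 50% overhead is just a fudge factor):

/* flush-budget.c: can a 2MB write cache be flushed within the ATX
 * hold-up window? Figures from the Maxtor Atlas 10K V review above. */
#include <stdio.h>

int main(void)
{
	double seek_ms   = 8.0;    /* seek + rotational latency, rounded up */
	double cache_mb  = 2.0;    /* write cache size */
	double slow_mb_s = 53.9;   /* inner-zone transfer rate */
	double holdup_ms = 17.0;   /* ATX minimum hold-up time */

	double write_ms = cache_mb / slow_mb_s * 1000.0;   /* ~37 ms */
	double total_ms = seek_ms + write_ms * 1.5;        /* ~63 ms */

	printf("flush needs ~%.0f ms, ATX guarantees only %.0f ms\n",
	       total_ms, holdup_ms);
	return 0;
}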
> Seems better to have a cache sized flash ram instead where
> you could just keep the data there in case of power loss.
>
> But that's expensive, and not something most people need...
Indeed, and that is what MS have been targeting. (Flash isn't that
expensive, but flash write times are...)
cheers,
lincoln.
>
> John
>
[email protected] wrote:
> Hi all,
> I am running a 2.6.4 kernel on my system, and I am playing a little bit
> with kernel time issues and helper functions, just to understand how
> things really work.
> While doing that on my x86 system, I loaded a module from LDD 3rd
> edition, jit.c, which uses a dynamic /proc file to return textual
> information.
> The info it returns is in this format, using the kernel functions
> do_gettimeofday, current_kernel_time and jiffies_to_timespec.
> The output format is:
> 0x0009073c 0x000000010009073c 1116162967.247441
> 1116162967.246530656 591.586065248
> 0x0009073c 0x000000010009073c 1116162967.247463
> 1116162967.246530656 591.586065248
> 0x0009073c 0x000000010009073c 1116162967.247476
> 1116162967.246530656 591.586065248
> 0x0009073c 0x000000010009073c 1116162967.247489
> 1116162967.246530656 591.586065248
> where the first two values are jiffies and jiffies_64. The next two are
> do_gettimeofday and current_kernel_time, and the last value is
> jiffies_to_timespec. This output was recorded after 16 minutes of
> uptime. Shouldn't the last value be the same as the uptime? I have
> attached an output file from boot time until the time the function
> resets the struct and starts counting from the beginning. Is this a
> bug or am I missing something here?
You are assuming that jiffies starts at zero at boot time. This is clearly not
so, even from your printouts. (It starts at a value near the overflow of the
low-order 32 bits, to flush out problems with the rollover.)
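To make that concrete, a minimal sketch (not from the original mail) of
how a module could recover the real uptime, assuming the INITIAL_JIFFIES
bias that 2.6 kernels apply (-300*HZ, so the low 32 bits wrap about five
minutes after boot):

/* Why jiffies_to_timespec(jiffies, &ts) runs ~300 seconds behind
 * uptime: subtract the boot-time bias to get the real figure. */
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/time.h>

static void print_uptime(void)
{
	struct timespec ts;

	/* Biased: this is what jit.c prints. */
	jiffies_to_timespec(jiffies, &ts);

	/* Unbiased: approximates the real uptime. */
	jiffies_to_timespec(jiffies - INITIAL_JIFFIES, &ts);
	printk(KERN_INFO "uptime ~= %ld.%09ld s\n", ts.tv_sec, ts.tv_nsec);
}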
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Alan Cox <[email protected]> writes:
> I think you need to get real if you want that degree of integrity with a
> PC
>
> Your typical PC setup means your precious data
...
All of your listed cases are low-probability events. You're quite right that
low-probability errors will always be present -- you could have just listed
cosmic rays and been finished. They're by far the most common such source of
errors.
But that doesn't mean we should just throw up our hands and say there's no way
to make computers work right, let's go home.
Making computer systems that don't randomly trash file systems in the case of
power outages isn't a hard problem. It's been solved for decades. That's *why*
fsync exists.
Oracle, Sybase, Postgres, other databases have hard requirements. They
guarantee that when they acknowledge a transaction commit the data has been
written to non-volatile media and will be recoverable even in the face of a
routine power loss.
They meet this requirement just fine on SCSI drives (where write caching
generally ships disabled) and on any OS where fsync issues a cache flush. If
the OS doesn't successfully flush the data to disk on fsync then it's quite
likely that any routine power outage will mean transactions are lost. That's
just ridiculous.
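For reference, the commit pattern those databases rely on looks like this
from user space; a minimal sketch (mine, with an invented file name),
durable only insofar as fsync really forces the data out of the drive's
write cache:

/* commit.c: write-ahead-log style commit. The transaction must not be
 * acknowledged until fsync() has returned successfully. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char rec[] = "COMMIT;\n";
	int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, rec, sizeof rec - 1) != (ssize_t)(sizeof rec - 1)) {
		perror("write");
		return 1;
	}
	if (fsync(fd) < 0) {     /* ask the OS to reach stable storage */
		perror("fsync");
		return 1;
	}
	close(fd);
	return 0;
}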
Worse, if the disk flushes the data to disk out of order it's quite likely the
entire database will be corrupted on any simple power outage. I'm not clear
whether that's the case for any common drives.
--
greg
On Sun, 29 May 2005, Greg Stark wrote:
> Oracle, Sybase, Postgres, other databases have hard requirements. They
> guarantee that when they acknowledge a transaction commit the data has been
> written to non-volatile media and will be recoverable even in the face of a
> routine power loss.
>
> They meet this requirement just fine on SCSI drives (where write caching
> generally ships disabled) and on any OS where fsync issues a cache flush. If
I don't know what facts "generally ships disabled" is based on, all of
the more recent SCSI drives (non SCA type though) I acquired came with
write cache enabled and some also with queue algorithm modifier set to 1.
> Worse, if the disk flushes the data to disk out of order it's quite
> likely the entire database will be corrupted on any simple power
> outage. I'm not clear whether that's the case for any common drives.
It's a matter of enforcing write order. To what extent such ordering
constraints are propagated by file systems and the VFS layer down to the
hardware is the grand question.
--
Matthias Andree
Matthias Andree <[email protected]> writes:
> On Sun, 29 May 2005, Greg Stark wrote:
>
> > They meet this requirement just fine on SCSI drives (where write caching
> > generally ships disabled) and on any OS where fsync issues a cache flush. If
>
> I don't know what facts "generally ships disabled" is based on, all of
> the more recent SCSI drives (non SCA type though) I acquired came with
> write cache enabled and some also with queue algorithm modifier set to 1.
People routinely post "Why does this cheap IDE drive outperform my shiny new
high end SCSI drive?" questions to the postgres mailing list. To which people
point out that the IDE numbers they've presented are physically impossible for
a 7200 RPM drive, while the SCSI numbers agree with the average
rotational latency calculated from whatever speed their SCSI drives run at.
> > Worse, if the disk flushes the data to disk out of order it's quite
> > likely the entire database will be corrupted on any simple power
> > outage. I'm not clear whether that's the case for any common drives.
>
> It's a matter of enforcing write order. To what extent such ordering
> constraints are propagated by file systems and the VFS layer down to the
> hardware is the grand question.
Well guaranteeing write order will at least mean the database isn't complete
garbage after a power event.
It still means lost transactions, something that isn't going to be acceptable
for any real-life business where those transactions are actual dollars.
--
greg
On Mon, 30 May 2005, Greg Stark wrote:
> Matthias Andree <[email protected]> writes:
>
> > On Sun, 29 May 2005, Greg Stark wrote:
> >
> > > They meet this requirement just fine on SCSI drives (where write caching
> > > generally ships disabled) and on any OS where fsync issues a cache flush. If
> >
> > I don't know what facts "generally ships disabled" is based on, all of
> > the more recent SCSI drives (non SCA type though) I acquired came with
> > write cache enabled and some also with queue algorithm modifier set to 1.
>
> People routinely post "Why does this cheap IDE drive outperform my shiny new
> high end SCSI drive?" questions to the postgres mailing list. To which people
> point out that the IDE numbers they've presented are physically impossible for
> a 7200 RPM drive, while the SCSI numbers agree with the average
> rotational latency calculated from whatever speed their SCSI drives run at.
This may have a different cause than the vendor default or the saved
setting being WCE = 0, Queue Algorithm Modifier = 0...
I would really appreciate it if the kernel printed a warning for every
partition mounted that cannot both enforce write order and guarantee
synchronous completion for f(data)sync, based on the drive's write
cache, file system type, current write barrier support and all that.
> > It's a matter of enforcing write order. To what extent such ordering
> > constraints are propagated by file systems and the VFS layer down to the
> > hardware is the grand question.
>
> Well guaranteeing write order will at least mean the database isn't complete
> garbage after a power event.
>
> It still means lost transactions, something that isn't going to be acceptable
> for any real-life business where those transactions are actual dollars.
Right, synchronous completion is the other issue. I want the kernel to
tell me if it's capable of doing that on a particular partition (given
hardware settings WRT cache, drivers, file system, and all that), either
in the docs or, if that's too confusing, via dmesg.
--
Matthias Andree