2003-08-27 12:37:21

by Timo Sirainen

[permalink] [raw]
Subject: Lockless file reading

(Maybe a bit off topic, but I couldn't get answers elsewhere and it is
about kernel behaviour)

The question is what can happen if I read() a file that's being
simultaneously updated by a write() in another process? If I'm writing
123 over XXX, is it possible that read() returns 1X3 in some conditions?
Is the behaviour filesystem specific? Any idea about other operating
systems?

I'm thinking about implementing lockless file reads that work like:

void write_data(int fd, off_t offset, void *data, size_t size) {
lock_file(fd);
pwrite(fd, data, size, offset); // or writev() or copy+single pwrite()
pwrite(fd, data, size, offset + size);
unlock_file(fd);
}

void read_data(int fd, off_t offset, void *data, size_t size) {
unsigned char buf{size*2];
for (;;) {
pread(fd, buf, size*2, offset); // or shared mmap()ed access
if (memcmp(buf, buf+size, size) == 0) break;
usleep(10);
}
memcpy(data, buf, size);
}

The only case when I see that it would break is if we write "1212" but
pread() sees "1X1X" or "X2X2" there.



2003-08-27 12:52:12

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Wed, 2003-08-27 at 15:42, Martin Konold wrote:
> > The question is what can happen if I read() a file that's being
> > simultaneously updated by a write() in another process?
>
> The behaviour is undefined.

So it's not such a good idea then? Hmh.. That'd solve a lot of problems
for me.

> > 123 over XXX, is it possible that read() returns 1X3 in some conditions?
>
> Yes. The actual order stuff is written to the disk is not guaranteed.

It doesn't matter when it's actually written to disk, if it's only seen
by read().


2003-08-27 12:48:51

by Martin Konold

[permalink] [raw]
Subject: Re: Lockless file reading

Am Wednesday 27 August 2003 02:37 pm schrieb Timo Sirainen:

Hi,

> The question is what can happen if I read() a file that's being
> simultaneously updated by a write() in another process?

The behaviour is undefined.

> 123 over XXX, is it possible that read() returns 1X3 in some conditions?

Yes. The actual order stuff is written to the disk is not guaranteed.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: [email protected]

2003-08-27 13:40:37

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Lockless file reading

On Wed, 27 Aug 2003, Timo Sirainen wrote:

> On Wed, 2003-08-27 at 15:42, Martin Konold wrote:
> > > The question is what can happen if I read() a file that's being
> > > simultaneously updated by a write() in another process?
> >
> > The behaviour is undefined.
>
> So it's not such a good idea then? Hmh.. That'd solve a lot of problems
> for me.
>
> > > 123 over XXX, is it possible that read() returns 1X3 in some conditions?
> >
> > Yes. The actual order stuff is written to the disk is not guaranteed.
>
> It doesn't matter when it's actually written to disk, if it's only seen
> by read().
>


> 123 over XXX, is it possible that read() returns 1X3 in some conditions?

I'm going to take this question literly. You are asking
if the middle-byte of a 3-byte sequence can be residual,
not yet updated.

Let's see if it is possible for the middle byte of
a 3-byte sequence to not be written when both
other bytes are written:

In the first place, everything within a buffer or
sector will be written in-order, i.e., if you have
the two end bytes, you must have the middle byte.
It's only the end-bytes that can stop anywhere.

So we look at the end of a buffer condition.

|End of buffer
XXXXXXXXXXXXXXXXXXXXXXXXXXX <-- original data
123 <-- new data
|____ Write a byte
||___ Write a word
||||_ Write a long

Clearly, regardless of how the bytes are written, if you
get a '3', but not a two, the next write must have started
at offset 1, not offset 0. So, whatever write sequence
is done internally must subsequently seek backwards to offset
zero. This is highly unlikely (although possible).

Even in machines that do load/store operations where individual
components of those stores happen in groups, access to/from
a buffer of such data is controlled (by hardware) so a write
will complete before a read occurs. So if a data element that
is at a higher address has been modified as well as a data element
at a lower address, either the hardware is capable of byte access
or the center element must also have been modified. With hardware
that can perform byte-access (ix86), the only byte-access that
is going to happen is at the end(s) of buffers.

Given the 'famous' byte-sequence 0x12345678, the following incomplete
updates are possible (big endian):

0x123456xx Byte access
0x1234xxxx Word access
0xxx345678 Byte access
0xxxxx5678 Word access

Given the 'famous' byte-sequence 0x12345678, the following incomplete
updates are possible (little endian):

0x875634xx Byte access
0x8756xxxx Word access
0xxx563412 Byte access
0xxxxx3412 Word access

So I don't see how you could ever have a sequence of 123 written,
with both '1' and '3' written, but not '2'. It is only the stuff
on the 'ends' that can be incomplete.

Anyway, if you want two (or more) processes to access the file,
you should mmap it. You can configure a mmap'ed file so that
updates appear to all readers. However, just like any shared-memory
access, you need some kind of synchronization, perhaps a semaphore,
so that you always read valid data. Usually one only needs
__valid__ data, not necessarily __current__ data. For instance,
I have one task writing incremental numbers to a specific offset
in a file (shared memory). If it's okay for the reader to get
a non-garbage number, even though it's not the latest instantaneous
number, then if you stay away from multi-part numbers (like long long),
your data will always be "good" with no locking at all. That's because
the write will complete before the CPU gets taken away and given to a
reader. What is not guaranteed is that the data are current. You need a
semaphore for that.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (794.73 BogoMips).
Note 96.31% of all statistics are fiction.


2003-08-27 14:56:51

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Wednesday, Aug 27, 2003, at 16:40 Europe/Helsinki, Richard B.
Johnson wrote:

> So I don't see how you could ever have a sequence of 123 written,
> with both '1' and '3' written, but not '2'. It is only the stuff
> on the 'ends' that can be incomplete.

That's pretty much what I was assuming without knowing how the kernel
internally really works.

> Anyway, if you want two (or more) processes to access the file,
> you should mmap it. You can configure a mmap'ed file so that
> updates appear to all readers. However, just like any shared-memory
> access, you need some kind of synchronization, perhaps a semaphore,
> so that you always read valid data. Usually one only needs
> __valid__ data, not necessarily __current__ data.

Right, that's why I don't really need read locking. The
double-writing/reading with memcmp() checking was supposed to check
that the data is valid.

I'm already using shared mmaps, but I was thinking that supporting NFS
would be nice as well. That'd work pretty much the same as write()s.

Maybe it would be possible to use some kind of error detection
checksums which would guarantee that the data either is valid or isn't,
regardless of the order in which it is written. I don't really know how
that could be done though.

2003-08-27 23:39:38

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Timo Sirainen wrote:
> Maybe it would be possible to use some kind of error detection
> checksums which would guarantee that the data either is valid or isn't,
> regardless of the order in which it is written. I don't really know how
> that could be done though.

You can use a strong checksum like MD5, if that is really faster than
locking. (Over NFS it probably is faster than fcntl()-locking, for
small data blocks).

-- Jamie

2003-08-28 00:03:52

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Richard B. Johnson wrote:
> Let's see if it is possible for the middle byte of
> a 3-byte sequence to not be written when both
> other bytes are written:

> Even in machines that do load/store operations where individual
> components of those stores happen in groups, access to/from
> a buffer of such data is controlled (by hardware) so a write
> will complete before a read occurs.

I don't understand what you mean by this.

Do you mean that the writes are forced to appear on a different CPU in
the same order that they were written? That isn't true on x86, for
two reasons: 1. writes aren't always in processor order (see
CONFIG_X86_OOSTORE and CONFIG_X86_PPRO_FENCE); 2. reads on the other
processor are out of order anyway.

> With hardware that can perform byte-access (ix86), the only
> byte-access that is going to happen is at the end(s) of buffers.

Not true. Take a look at __copy_user() in arch/i386/lib/usercopy.c.
The first few bytes are copied using "rep;movsb", which is not
guaranteed to use a word write for the aligned pair, nor is it
guaranteed a particular timing (there could be an interrupt between
each byte).

Other architectures are similar.

-- Jamie

2003-08-28 00:52:26

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 2003-08-28 at 02:39, Jamie Lokier wrote:
> Timo Sirainen wrote:
> > Maybe it would be possible to use some kind of error detection
> > checksums which would guarantee that the data either is valid or isn't,
> > regardless of the order in which it is written. I don't really know how
> > that could be done though.
>
> You can use a strong checksum like MD5, if that is really faster than
> locking. (Over NFS it probably is faster than fcntl()-locking, for
> small data blocks).

While MD5 is probably good enough, it doesn't _guarantee_ the
consistency. I just thought of a simple algorithm that I think would.
I'll go and use that unless someone proves it wrong :)

Except of course if there's 256 writes and the last one is non-ordered
and it all happens while a read() is executing.. Less than unlikely I'd
say.

I don't think there's any other potential problems with read() than that
it may not have all data up to date? Such as it would never temporarily
show the data as zero?

static int verify(const unsigned char *buf, size_t size)
{
const unsigned char *checksum = buf + size;
unsigned char xor;
size_t i;

xor = buf[0] ^ checksum[0];
for (i = 1; i < size; i++) {
if ((buf[i] ^ xor) != checksum[i])
return 0;
}
return 1;
}

void write_data(const void *data, size_t size)
{
unsigned char *checksum = buf + size;
unsigned char xor;
size_t i;

memcpy(buf, data, size);

checksum[0]++;
xor = buf[0] ^ checksum[0];

for (i = 1; i < size; i++)
checksum[i] = buf[i] ^ xor;
}

void read_data(void *data, size_t size)
{
unsigned char copy[size*2];

do {
memcpy(copy, buf, size*2);
} while (!verify(copy, size));
memcpy(data, copy, size);
}


2003-08-28 01:50:44

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Timo Sirainen wrote:
> checksum[0]++;
> xor = buf[0] ^ checksum[0];

Your algorithm isn't going to work if the new value of xor is the same
as the old value of xor.

> do {
> memcpy(copy, buf, size*2);
> } while (!verify(copy, size));
> memcpy(data, copy, size);

It isn't safe because the memcpy() can read writes done on another
processor out of order, and xor does not always change.

You can read some of the newly written bytes in both the buf[] and
checksum[] halves of the buffer, while reading some of the previous
bytes in each half. If the set of new bytes in the first half matches
the set in the second half well enough (i.e. the two sets match for
bytes which aredifferent between the old and new data blocks), and the
xor values are the same between the old and new data blocks, then your
test will accept a mix of old and new data bytes.

-- Jamie

2003-08-28 03:18:07

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Thursday, Aug 28, 2003, at 04:50 Europe/Helsinki, Jamie Lokier wrote:

>> checksum[0]++;
>> xor = buf[0] ^ checksum[0];
>
> Your algorithm isn't going to work if the new value of xor is the same
> as the old value of xor.

I was trying to prevent it with the checksum[0]++ .. but yes, you're
right.

I'm sure someone has figured out a way to make a checksum of data that
can detect if there's even a single bit wrong, if the checksum is
allowed to take as much space as the data itself. I should read more
about algorithms..

How about checksum[n] = data[n-1] ^ data[n]? That looks like it would
work.

2003-08-28 06:13:51

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Timo Sirainen wrote:
> I'm sure someone has figured out a way to make a checksum of data that
> can detect if there's even a single bit wrong, if the checksum is
> allowed to take as much space as the data itself. I should read more
> about algorithms..

You said that MD5 wasn't strong enough, and you would like a guarantee.

You won't find a guarantee unless you are prepared to use memory
barriers in your code. _Any_ checksum is going to have a chance of
false validation if you are doing out-of-order reads which can observe
parts of the old and new data, and parts of the old and new checksum.

> How about checksum[n] = data[n-1] ^ data[n]? That looks like it would
> work.

Consider when the data {1,0,1,0} is replaced by {2,0,2,0}.

-- Jamie

2003-08-28 06:01:19

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 28 Aug 2003 06:17:46 +0300, Timo Sirainen said:

> How about checksum[n] = data[n-1] ^ data[n]? That looks like it would
> work.

Unfortunately, if you swap the contents of 'n' and 'n-1', the ^ remains the same,
as it's a commutative operation.

If you want to go down this route, you *really* want to use at least a CRC
here, and understand *why* the incremental computation of the IP header
checksum works at all (hint - the fact it's so simple puts a significant upper
bound on the error-detection strength of the checksum...)

1071 Computing the Internet checksum. R.T. Braden, D.A. Borman, C.
Partridge. Sep-01-1988. (Format: TXT=54941 bytes) (Updated by
RFC1141) (Status: UNKNOWN)

1141 Incremental updating of the Internet checksum. T. Mallory, A.
Kullberg. Jan-01-1990. (Format: TXT=3587 bytes) (Updates RFC1071)
(Updated by RFC1624) (Status: INFORMATIONAL)

1624 Computation of the Internet Checksum via Incremental Update. A.
Rijsinghani, Ed.. May 1994. (Format: TXT=9836 bytes) (Updates
RFC1141) (Status: INFORMATIONAL)

http://www.ietf.org/rfc/rfc1071.txt
http://www.ietf.org/rfc/rfc1141.txt
http://www.ietf.org/rfc/rfc1624.txt


Attachments:
(No filename) (226.00 B)

2003-08-28 06:43:49

by Nagendra Singh Tomar

[permalink] [raw]
Subject: Re: Lockless file reading

Hi,
I beleive ur original post was to address the case of a reader reading
a file getting *incorrect* data due to the file being written
simultaneously by another writer process.
Why do u require file locking if there is a *single* writer ?? I don't
understand why a 123 written over XXX can result in 1X3. The kernel should
take care of this. When the writer process is writing 123 it will first be
written to the page cache. The page cache lock will be taken inside the
kernel before writing to it, so we know that writing 123 over XXX will be
atomic. Now even when this page is flushed to disk, the page lock would
be taken. So I cannot see a possibility of 123 written over XXX being read
as 1X3.
File locking is reqd to synchronize between *multiple* writers as
their writes can get interspersed. process A writing XXXX... and process B
writing YYYY.. can result in XXXYYYXY (such a fine intersperse is though
unlikely).
I am not clear whether the proposed read_data/write_data/verify are in the
user
level or inside the kernel, but I assume that they are in the user level
called by the writer and reader processes.
Write re-ordering can take place at user level (one writer processes
calling write() before other process) and/or inside the kernel (based on
the order in which the page lock was grabbed), while the checksum trick
can (to some extent) address user space re-ordering, it cannot address
kernel level reordering.



Thanx
tomar




On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Thu, 2003-08-28 at 02:39, Jamie Lokier wrote:
> > Timo Sirainen wrote:
> > > Maybe it would be possible to use some kind of error detection
> > > checksums which would guarantee that the data either is valid or
> isn't,
> > > regardless of the order in which it is written. I don't really know
> how
> > > that could be done though.
> >
> > You can use a strong checksum like MD5, if that is really faster than
> > locking. (Over NFS it probably is faster than fcntl()-locking, for
> > small data blocks).
>
> While MD5 is probably good enough, it doesn't _guarantee_ the
> consistency. I just thought of a simple algorithm that I think would.
> I'll go and use that unless someone proves it wrong :)
>
> Except of course if there's 256 writes and the last one is non-ordered
> and it all happens while a read() is executing.. Less than unlikely I'd
> say.
>
> I don't think there's any other potential problems with read() than that
> it may not have all data up to date? Such as it would never temporarily
> show the data as zero?
>
> static int verify(const unsigned char *buf, size_t size)
> {
> const unsigned char *checksum = buf + size;
> unsigned char xor;
> size_t i;
>
> xor = buf[0] ^ checksum[0];
> for (i = 1; i < size; i++) {
> if ((buf[i] ^ xor) != checksum[i])
> return 0;
> }
> return 1;
> }
>
> void write_data(const void *data, size_t size)
> {
> unsigned char *checksum = buf + size;
> unsigned char xor;
> size_t i;
>
> memcpy(buf, data, size);
>
> checksum[0]++;
> xor = buf[0] ^ checksum[0];
>
> for (i = 1; i < size; i++)
> checksum[i] = buf[i] ^ xor;
> }
>
> void read_data(void *data, size_t size)
> {
> unsigned char copy[size*2];
>
> do {
> memcpy(copy, buf, size*2);
> } while (!verify(copy, size));
> memcpy(data, copy, size);
> }
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2003-08-28 08:48:41

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Wed, 2003-08-27 at 21:42, Nagendra Singh Tomar wrote:
> Hi,
> I beleive ur original post was to address the case of a reader reading
> a file getting *incorrect* data due to the file being written
> simultaneously by another writer process.

Well, "old" data, which mixed with new data would become incorrect as a
whole.

> Why do u require file locking if there is a *single* writer ?? I don't
> understand why a 123 written over XXX can result in 1X3. The kernel should
> take care of this. When the writer process is writing 123 it will first be
> written to the page cache. The page cache lock will be taken inside the
> kernel before writing to it, so we know that writing 123 over XXX will be
> atomic. Now even when this page is flushed to disk, the page lock would
> be taken. So I cannot see a possibility of 123 written over XXX being read
> as 1X3.

That was my original plan, to just rely on such kernel behaviour. I just
don't know if it's such a good idea to rely on that, especially if I
want to keep my program portable. I'll probably fallback to that anyway
if my checksumming ideas won't work.


2003-08-28 08:58:39

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 2003-08-28 at 09:13, Jamie Lokier wrote:
> Timo Sirainen wrote:
> > I'm sure someone has figured out a way to make a checksum of data that
> > can detect if there's even a single bit wrong, if the checksum is
> > allowed to take as much space as the data itself. I should read more
> > about algorithms..
>
> You said that MD5 wasn't strong enough, and you would like a guarantee.

Yes. I don't really like it if my program heavily relies on something
that can go wrong in some situations.

> You won't find a guarantee unless you are prepared to use memory
> barriers in your code. _Any_ checksum is going to have a chance of
> false validation if you are doing out-of-order reads which can observe
> parts of the old and new data, and parts of the old and new checksum.

Not really. With {b1, b2, b1 xor b2} it doesn't matter what you read or
write first, no matter what the old data was. If they match, the result
is always either old or new.

If I want to get a checksum of 4 bytes then, I have to divide them into
two parts. Using the b1^b2 I can know if either one of them is valid,
but I can't know if they belong together.

Assuming that it's always either the previous one or the new one, I
think (once again :) that it's possible to check that by getting mixed
checksums: data = ABCD, c1 = A^B, c2 = C^D, c3 = A^C, c4 = B^D.

Except that the old data that read() sees could be even older than the
previous value. Maybe here works the growing xor-byte. I haven't thought
that far yet.


2003-08-28 10:08:52

by David Schwartz

[permalink] [raw]
Subject: RE: Lockless file reading


> On Thu, 2003-08-28 at 09:13, Jamie Lokier wrote:

> > Timo Sirainen wrote:
> > > I'm sure someone has figured out a way to make a checksum of
> > > data that
> > > can detect if there's even a single bit wrong, if the checksum is
> > > allowed to take as much space as the data itself. I should read more
> > > about algorithms..
> >
> > You said that MD5 wasn't strong enough, and you would like a guarantee.

> Yes. I don't really like it if my program heavily relies on something
> that can go wrong in some situations.

Okay, this is too much. Your alternative, assuming the kernel won't
re-order writes, is clearly relying on something that can go wrong. MD5
would be orders of magnitude more reliable.

No two data sets with the same MD5 hash are known. It will be many, many
years before anyone finds two data sets of the same size with the same MD5
hash. The odds of having two data sets just happen to have the same MD5 has
are infinitesimal.

If you compared the MD5 hashes of random data sets 1,000,000 times a second
for a million years, your odds of getting a single false match on any one of
those comparisons are less than one in a quadrillion.

DS


2003-08-28 10:08:52

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 2003-08-28 at 00:15, Nagendra Singh Tomar wrote:
> > > I beleive ur original post was to address the case of a reader
> > reading
> > > a file getting *incorrect* data due to the file being written
> > > simultaneously by another writer process.
> >
> > Well, "old" data, which mixed with new data would become incorrect as a
> > whole.
>
> What is this mixing we are talking of ??

I can easily see that the data might be divided into two separate pages
and between updating those pages, another process would have read the
first page, but not the second page. That would result in a mixed old
and new data. Probably just <old><new> or <new><old> instead of
<new><old><new> what I was worrying about, but still it would be nicer
to rely only on byte atomicity and read/write ordering. :)

> > That was my original plan, to just rely on such kernel behaviour. I just
> > don't know if it's such a good idea to rely on that, especially if I
> > want to keep my program portable. I'll probably fallback to that anyway
> > if my checksumming ideas won't work.
>
> But I don't see any problem with a single writer and >=1 reader. There is
> no question of portability.

There are of multiple writers, but it's fine for them to do locking
between each others.


2003-08-28 10:08:53

by Martin Konold

[permalink] [raw]
Subject: Re: Lockless file reading

Am Thursday 28 August 2003 11:27 am schrieben Sie:

Hi,

> It's not about CPU usage. It's mostly about being able to modify the
> file even when there's thousands of simultaneous readers that could
> otherwise keep the file locked almost constantly.

No, the readers only have to check if the file got locked by the writer(s).

> Also it'd be nice to support NFS with .lock files since no-one really
> uses lockd.

IMHO a lock file on nfs is the most inefficient locking mechanism in case of
readers and writer(s) on the same machine.

BTW: What about moving this thread away from linux-kernel? It is already clear
that the kernel does not provide these guarantees you asked for.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: [email protected]

2003-08-28 10:10:22

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 2003-08-28 at 12:13, Martin Konold wrote:
> > How about checksum[n] = data[n-1] ^ data[n]? That looks like it would
>
> I propose you first make some benchmarks and try to figure out how big the
> actual overhead of locking really is. I can easily assume that your
> "solution" is actually slower than a simple mechanism/semaphore.

It's not about CPU usage. It's mostly about being able to modify the
file even when there's thousands of simultaneous readers that could
otherwise keep the file locked almost constantly.

Also it'd be nice to support NFS with .lock files since no-one really
uses lockd.


2003-08-28 10:16:21

by Matthias Andree

[permalink] [raw]
Subject: Re: Lockless file reading

On Wed, 27 Aug 2003, Martin Konold wrote:

> Am Wednesday 27 August 2003 02:37 pm schrieb Timo Sirainen:
>
> Hi,
>
> > The question is what can happen if I read() a file that's being
> > simultaneously updated by a write() in another process?
>
> The behaviour is undefined.
>
> > 123 over XXX, is it possible that read() returns 1X3 in some conditions?
>
> Yes. The actual order stuff is written to the disk is not guaranteed.

What has the disk block write order got to do with it? It's not as
though write() would be cached but read() see the raw disk.

--
Matthias Andree

2003-08-28 10:08:53

by Nagendra Singh Tomar

[permalink] [raw]
Subject: Re: Lockless file reading


On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Thu, 2003-08-28 at 00:15, Nagendra Singh Tomar wrote:
> > > > I beleive ur original post was to address the case of a reader
> > > reading
> > > > a file getting *incorrect* data due to the file being written
> > > > simultaneously by another writer process.
> > >
> > > Well, "old" data, which mixed with new data would become incorrect
> as a
> > > whole.
> >
> > What is this mixing we are talking of ??
>
> I can easily see that the data might be divided into two separate pages
> and between updating those pages, another process would have read the
> first page, but not the second page. That would result in a mixed old
> and new data. Probably just <old><new> or <new><old> instead of
> <new><old><new> what I was worrying about, but still it would be nicer
> to rely only on byte atomicity and read/write ordering. :)

Timo, I am at loss. OK, let me put across my understanding of the problem.
Thw writer is *updating* the file and not writing it for the first time,
which means that the file can be 1000 bytes long and the writer might be
updating bytes 100 onwards. You are worried abt the reader reading
partially old and partially new data. A question.
If the writer does not want the reader to read old data why does'nt it
truncate the file and start writing from the begining every time.
Please correct me if I am missing something significant. All this while I
was thinking that the writer is writing to the file for the first time.
i.e as he keeps writing the file size increases, so the reader is sure not
to read anything else but what the writer writes.

>
> > > That was my original plan, to just rely on such kernel behaviour. I
> just
> > > don't know if it's such a good idea to rely on that, especially if I
> > > want to keep my program portable. I'll probably fallback to that
> anyway
> > > if my checksumming ideas won't work.
> >
> > But I don't see any problem with a single writer and >=1 reader. There
> is
> > no question of portability.
>
> There are of multiple writers, but it's fine for them to do locking
> between each others.
>
>

2003-08-28 10:07:25

by Nagendra Singh Tomar

[permalink] [raw]
Subject: Re: Lockless file reading


On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Wed, 2003-08-27 at 21:42, Nagendra Singh Tomar wrote:
> > Hi,
> > I beleive ur original post was to address the case of a reader
> reading
> > a file getting *incorrect* data due to the file being written
> > simultaneously by another writer process.
>
> Well, "old" data, which mixed with new data would become incorrect as a
> whole.

What is this mixing we are talking of ??

>
> > Why do u require file locking if there is a *single* writer ?? I don't
>
> > understand why a 123 written over XXX can result in 1X3. The kernel
> should
> > take care of this. When the writer process is writing 123 it will
> first be
> > written to the page cache. The page cache lock will be taken inside
> the
> > kernel before writing to it, so we know that writing 123 over XXX will
> be
> > atomic. Now even when this page is flushed to disk, the page lock
> would
> > be taken. So I cannot see a possibility of 123 written over XXX being
> read
> > as 1X3.
>
> That was my original plan, to just rely on such kernel behaviour. I just
> don't know if it's such a good idea to rely on that, especially if I
> want to keep my program portable. I'll probably fallback to that anyway
> if my checksumming ideas won't work.

But I don't see any problem with a single writer and >=1 reader. There is
no question of portability.

tomar
>
>

2003-08-28 10:10:23

by Martin Konold

[permalink] [raw]
Subject: Re: Lockless file reading

Am Thursday 28 August 2003 05:17 am schrieb Timo Sirainen:

Hi Timo,

> How about checksum[n] = data[n-1] ^ data[n]? That looks like it would

I propose you first make some benchmarks and try to figure out how big the
actual overhead of locking really is. I can easily assume that your
"solution" is actually slower than a simple mechanism/semaphore.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: [email protected]

2003-08-28 10:07:26

by David Schwartz

[permalink] [raw]
Subject: RE: Lockless file reading


> That was my original plan, to just rely on such kernel behaviour. I just
> don't know if it's such a good idea to rely on that, especially if I
> want to keep my program portable. I'll probably fallback to that anyway
> if my checksumming ideas won't work.

If you only have one writer, why not have the writer update an MD5 checksum
in the file along with the datawrite? If the reader sees an invalid
checksum, it repeats the read. This is simple, elegant, and foolproof. The
only possible flaw would be if you found two data sets with the same MD5
checksum. The instant fame would be well worth the inconvenience. ;)

DS


2003-08-28 10:27:18

by Timo Sirainen

[permalink] [raw]
Subject: RE: Lockless file reading

On Thu, 2003-08-28 at 12:56, David Schwartz wrote:
> > > You said that MD5 wasn't strong enough, and you would like a guarantee.
>
> > Yes. I don't really like it if my program heavily relies on something
> > that can go wrong in some situations.
>
> Okay, this is too much. Your alternative, assuming the kernel won't
> re-order writes, is clearly relying on something that can go wrong.

Reorder on per-byte basis? Per-page/block would still be acceptable.

Anyway, the alternative would be shared mmap()ed file. You can trust
32bit memory updates to be atomic, right?

Or what about: write("12"), fsync(), write("12")? Is it still possible
for read() to return "1x1x"?


2003-08-28 10:54:26

by Robin Rosenberg

[permalink] [raw]
Subject: Re: Lockless file reading

Aren't Linux files also streams. Writing to a stream in sequence
should update the stream in sequence, or it wouldn't be a stream,
would it?

-- robin

onsdagen den 27 augusti 2003 14.37 skrev Timo Sirainen:
> (Maybe a bit off topic, but I couldn't get answers elsewhere and it is
> about kernel behaviour)
>
> The question is what can happen if I read() a file that's being
> simultaneously updated by a write() in another process? If I'm writing
> 123 over XXX, is it possible that read() returns 1X3 in some conditions?
> Is the behaviour filesystem specific? Any idea about other operating
> systems?
>
> I'm thinking about implementing lockless file reads that work like:
>
> void write_data(int fd, off_t offset, void *data, size_t size) {
> lock_file(fd);
> pwrite(fd, data, size, offset); // or writev() or copy+single pwrite()
> pwrite(fd, data, size, offset + size);
> unlock_file(fd);
> }
>
> void read_data(int fd, off_t offset, void *data, size_t size) {
> unsigned char buf{size*2];
> for (;;) {
> pread(fd, buf, size*2, offset); // or shared mmap()ed access
> if (memcmp(buf, buf+size, size) == 0) break;
> usleep(10);
> }
> memcpy(data, buf, size);
> }
>
> The only case when I see that it would break is if we write "1212" but
> pread() sees "1X1X" or "X2X2" there.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2003-08-28 10:59:15

by Nagendra Singh Tomar

[permalink] [raw]
Subject: RE: Lockless file reading


On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Thu, 2003-08-28 at 12:56, David Schwartz wrote:
> > > > You said that MD5 wasn't strong enough, and you would like a
> guarantee.
> >
> > > Yes. I don't really like it if my program heavily relies on
> something
> > > that can go wrong in some situations.
> >
> > Okay, this is too much. Your alternative, assuming the kernel
> won't
> > re-order writes, is clearly relying on something that can go wrong.
>
> Reorder on per-byte basis? Per-page/block would still be acceptable.
>
> Anyway, the alternative would be shared mmap()ed file. You can trust
> 32bit memory updates to be atomic, right?
>
> Or what about: write("12"), fsync(), write("12")? Is it still possible
> for read() to return "1x1x"?

No. Not possible. fsync() doesn't really make a difference here. Just that
the first "12" gets its way into the disk. Th read will still read both
"12"s from the cache (in very high possibility). Even if the page caches
are stolen (due to mem shortage) the kernel file cachecing will ensure
that you get consistent data.

>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2003-08-28 12:01:45

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Timo Sirainen wrote:
> Reorder on per-byte basis? Per-page/block would still be acceptable.

The _CPUs_ can reorder on a per-byte basis, on a multiprocessor. It
has nothing to do with the kernel.

> Anyway, the alternative would be shared mmap()ed file. You can trust
> 32bit memory updates to be atomic, right?

Atomic yes (if aligned), weakly ordered though.

> Or what about: write("12"), fsync(), write("12")? Is it still possible
> for read() to return "1x1x"?

Yes it is possible, in principle.

This is what happens: the writing CPU writes "1", "2" in order. The
reading CPU reads bytes 1, 3 before the writes are
observed and bytes 0, 2 after. The CPU can do this.

The kernel doesn't prevent this, because it doesn't hold any exclusive
lock between the writer and reader during the data transfers.
Furthermore the kernel transfers a byte at a time, on some
architecture (including x86), if any buffer is not aligned.

It is very unlikely to return "1x1x", but you should know it is not
architecturally impossible. Given your incomplete knowledge of every
architectural quirk, it is more likely to occur than an MD5 collision.

On 32-bit aligned atomicity: if the block of 4 bytes is aligned in
memory, then with shared mmap you will only see whole words
transferred because all (current) Linux SMP-capable architectures
offer 32 bit atomicity. It is not a very nice assumption: it doesn't
hold for 16 bits or 64 bits, and may not hold for a future 64 bit
architecture. Keep in mind the words stay whole, but multiple words
are read out of order.

With read() and write(), even aligned 32-bit words don't work. On
some 64-bit architectures, a 32-bit word read() will be issued as 4
byte reads at the machine level, with weak memory ordering. (I'm
reading Alpha and IA64 __copy_user right now, and they both do that).

Enjoy :)
-- Jamie

2003-08-28 12:06:54

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 28 Aug 2003, Jamie Lokier wrote:

> Richard B. Johnson wrote:
> > Let's see if it is possible for the middle byte of
> > a 3-byte sequence to not be written when both
> > other bytes are written:
>
> > Even in machines that do load/store operations where individual
> > components of those stores happen in groups, access to/from
> > a buffer of such data is controlled (by hardware) so a write
> > will complete before a read occurs.
>
> I don't understand what you mean by this.
>
> Do you mean that the writes are forced to appear on a different CPU in
> the same order that they were written? That isn't true on x86, for
> two reasons: 1. writes aren't always in processor order (see
> CONFIG_X86_OOSTORE and CONFIG_X86_PPRO_FENCE); 2. reads on the other
> processor are out of order anyway.
>

I never said, nor even implied any such thing.

> > With hardware that can perform byte-access (ix86), the only
> > byte-access that is going to happen is at the end(s) of buffers.
>
> Not true. Take a look at __copy_user() in arch/i386/lib/usercopy.c.
> The first few bytes are copied using "rep;movsb", which is not
> guaranteed to use a word write for the aligned pair, nor is it
> guaranteed a particular timing (there could be an interrupt between
> each byte).
>

Of course it's true. The implimentation detail of trimming the
start or finish of a copy procedure trims the ends. That's,
in fact, why it is impossible, yes, __impossible__ for the byte-sequence
0xABC to have both 0xA and 0xC copied without the 0xB being copied also.
There are no combinations of byte/word/long writes that will allow
this to happen.

It doesn't make any difference about the endian either as I
have previously shown.

Whether or not there's an interrupt between each byte means nothing
also. In the byte copy described, an interruopt at any time will
simply leave:

XXX
XXC
XBC
ABC

Clearly, if both A and C were copied, B was copied also.
You can screw around with endian and use words and longwords
as well. It just isn't possible for, in any 3-byte sequence
for the middle byte to be missing.


If 0x0A, 0x,0B, 0x0c, exist in memory as a word (longword),
with Intel, the lowest byte is in the lowest memory location.
Therefore, we have a word of 0xXCBA or 0xCBAX, depending upon
whether the high nibble or the low nibble becomes part of the
short. Notice that the 'B' is always between 'A' and 'C'?
Therefore if A and C got copied, B must have also been copied.

> Other architectures are similar.
>
> -- Jamie
>

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (794.73 BogoMips).
Note 96.31% of all statistics are fiction.


2003-08-28 12:18:33

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Nagendra Singh Tomar wrote:
> > Or what about: write("12"), fsync(), write("12")? Is it still possible
> > for read() to return "1x1x"?
>
> Th read will still read both "12"s from the cache (in very high possibility).

That is a terrible assumption.

On all the Linux architures, even single CPUs, with no fancy memory
ordering, even 386, Pentiums etc., it is possible for read() to return
"1x1x".

Like this: the write() is preempted (see CONFIG_PREEMPT) after writing
one "1" byte, the read() runs, reads "1x" and is then preempted back
to the writer, which writes "2", calls fsync(), writes "1" and is then
preempted back to the reader, which continues and reads its second
"1x".

Interrupts caused by network packets arriving at just the wrong moment
can trigger that sort of thing.

> Even if the page caches are stolen (due to mem shortage) the kernel
> file cachecing will ensure that you get consistent data.

The kernel does not provide synchronisation between read() and write()
data transfers, and it does not always use atomic 32-bit reads and
writes either.

-- Jamie

2003-08-28 12:41:50

by Ragnar Hojland Espinosa

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, Aug 28, 2003 at 02:56:29AM -0700, David Schwartz wrote:
> No two data sets with the same MD5 hash are known. It will be many, many
> years before anyone finds two data sets of the same size with the same MD5
> hash. The odds of having two data sets just happen to have the same MD5 has
> are infinitesimal.

It can happen. It happened to me with two gifs. FWIW.
--
Ragnar Hojland - Project Manager
Linalco "Specialists in Linux and Free Software"
http://www.linalco.com Tel: +34-91-5970074 Fax: +34-91-5970083

2003-08-28 12:43:03

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Robin Rosenberg wrote:
> Aren't Linux files also streams.

No, they aren't. Stream sockets, pipes and FIFOs are. Files aren't.

-- Jamie

2003-08-28 12:40:47

by Nagendra Singh Tomar

[permalink] [raw]
Subject: Re: Lockless file reading


On Thu, 28 Aug 2003, Jamie Lokier wrote:

> Nagendra Singh Tomar wrote:
> > > Or what about: write("12"), fsync(), write("12")? Is it still
> possible
> > > for read() to return "1x1x"?
> >
> > Th read will still read both "12"s from the cache (in very high
> possibility).
>
> That is a terrible assumption.
>
> On all the Linux architures, even single CPUs, with no fancy memory
> ordering, even 386, Pentiums etc., it is possible for read() to return
> "1x1x".
>
> Like this: the write() is preempted (see CONFIG_PREEMPT) after writing
> one "1" byte, the read() runs, reads "1x" and is then preempted back
> to the writer, which writes "2", calls fsync(), writes "1" and is then
> preempted back to the reader, which continues and reads its second
> "1x".

kernel mode premption in Linux has still not sunk within me. All this
while I was assuming that the kernel cannot be preempted. But I have one
question:
While the write had "12" in its buffers and it would have grabbed the
page lock to write it into the page cache, won't it set some flag saying
that I don't want to be prempted now. I think there is a small primitive
for it in from 2.5 onwards. I don't think it will be a good idea to prempt
while it is holding the page lock. How is it possible that it just wrote
"1" and did not write "2" though it had grabbed the page lock for that
purpose.
I am yet to read the 2.5 kernel premption code.
Thanx for ur nice explaination though.

>
> Interrupts caused by network packets arriving at just the wrong moment
> can trigger that sort of thing.
>
> > Even if the page caches are stolen (due to mem shortage) the kernel
> > file cachecing will ensure that you get consistent data.
>
> The kernel does not provide synchronisation between read() and write()
> data transfers, and it does not always use atomic 32-bit reads and
> writes either.


>
> -- Jamie
>

2003-08-28 12:40:29

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Richard B. Johnson wrote:
> > > Even in machines that do load/store operations where individual
> > > components of those stores happen in groups, access to/from
> > > a buffer of such data is controlled (by hardware) so a write
> > > will complete before a read occurs.
> >
> > I don't understand what you mean by this.
> >
> > Do you mean that the writes are forced to appear on a different CPU in
> > the same order that they were written? That isn't true on x86, for
> > two reasons: 1. writes aren't always in processor order (see
> > CONFIG_X86_OOSTORE and CONFIG_X86_PPRO_FENCE); 2. reads on the other
> > processor are out of order anyway.
>
> I never said, nor even implied any such thing.

That's why I said I don't understand what you meant. What does your
first paragraph above mean?

> [...] The implimentation detail of trimming the start or finish of a
> copy procedure trims the ends. That's, in fact, why it is
> impossible, yes, __impossible__ for the byte-sequence 0xABC to have
> both 0xA and 0xC copied without the 0xB being copied also. There
> are no combinations of byte/word/long writes that will allow this to
> happen.

Correct.

> Whether or not there's an interrupt between each byte means nothing
> also. In the byte copy described, an interruopt at any time will
> simply leave:
>
> XXX
> XXC
> XBC
> ABC
>
> Clearly, if both A and C were copied, B was copied also.
> You can screw around with endian and use words and longwords
> as well. It just isn't possible for, in any 3-byte sequence
> for the middle byte to be missing.

What you say is correct about the intermediate states from
the writer's point of view, however a parallel reader _can_ observe
the middle byte missing.

The writing proceeds: A, B then C. In that order. No problem.

The writing CPU or task will always see these memory states:

XXX
AXX
ABX
ABC

Two problems for the reader, though: multiprocessor out of order
reads, and uniprocessor preemption.

Out of order means the reader can read the second byte _before_ the
first byte. Which means the reader sees this:

...
.X.
AX.
AXB

Preemption means this sequence of events can occur:

Writer's Reader's
destination buffer destination buffer
--------------------------------------------------------

XXX
writer writes A
AXX
context switch to reader
reader reads 2 bytes
-> AX.
context switch to writer
writer writes B, C
ABX
ABC
context switch to reader
reader reads final byte
-> AXB

The preemption case is a classic race between a writer and reader
moving in the same direction, alternating who is ahead.

Parallel memory access is not as simple as you make it out to be. If
you don't understand this, you will have a hard time understanding locks,
set_task_state vs. __set_task_state, the wmb/rmb/smp_rmb() macros, etc.

-- Jamie

2003-08-28 13:07:35

by Nagendra Singh Tomar

[permalink] [raw]
Subject: Re: Lockless file reading


On Thu, 28 Aug 2003, Jamie Lokier wrote:

> Nagendra Singh Tomar wrote:
> > While the write had "12" in its buffers and it would have grabbed the
>
> > page lock to write it into the page cache, won't it set some flag
> saying
> > that I don't want to be prempted now. I think there is a small
> primitive
> > for it in from 2.5 onwards. I don't think it will be a good idea to
> prempt
> > while it is holding the page lock. How is it possible that it just
> wrote
> > "1" and did not write "2" though it had grabbed the page lock for that
>
> > purpose.
>
> Nope. I don't see any disabling of preemption while the page is held.
>
> It wouldn't make sense anyway, because the copies to/from userspace
> can sleep, so there's nothing to gain by disabling preemption.

I understand what you say. Thanx.

>
> -- Jamie
>

2003-08-28 13:03:39

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Ragnar Hojland Espinosa wrote:
> It can happen. It happened to me with two gifs. FWIW.

Probability on the order of 2^-32 with MD5 any-pairs collision.
(It's not usual to have so many GIFs to compare, though :)
SHA is better, and both probably have some weakness that increases the
probability of collision.

Do you still have the GIFs?

-- Jamie

2003-08-28 13:00:57

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Nagendra Singh Tomar wrote:
> While the write had "12" in its buffers and it would have grabbed the
> page lock to write it into the page cache, won't it set some flag saying
> that I don't want to be prempted now. I think there is a small primitive
> for it in from 2.5 onwards. I don't think it will be a good idea to prempt
> while it is holding the page lock. How is it possible that it just wrote
> "1" and did not write "2" though it had grabbed the page lock for that
> purpose.

Nope. I don't see any disabling of preemption while the page is held.

It wouldn't make sense anyway, because the copies to/from userspace
can sleep, so there's nothing to gain by disabling preemption.

-- Jamie

2003-08-28 13:26:42

by Matthias Andree

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 28 Aug 2003, Nagendra Singh Tomar wrote:

> If the writer does not want the reader to read old data why does'nt it
> truncate the file and start writing from the begining every time.

To give you some background: Timo wants to store cache data in these
lockless. When these caches are truncated, they can be dropped
altogether, but the application's performance will drop drastically.

--
Matthias Andree
Encrypted mail solicited Verschl?sselte Mail erw?nscht
http://blackhole.pca.dfn.de:11371/pks/lookup?op=get&search=0x052E7D95
-> http://www.gnupg.org/(en)/related_software/frontends.html

2003-08-28 13:28:49

by Timo Sirainen

[permalink] [raw]
Subject: Re: Lockless file reading

On Thursday, Aug 28, 2003, at 15:01 Europe/Helsinki, Jamie Lokier wrote:

> This is what happens: the writing CPU writes "1", "2" in order. The
> reading CPU reads bytes 1, 3 before the writes are
> observed and bytes 0, 2 after. The CPU can do this.
>
> The kernel doesn't prevent this, because it doesn't hold any exclusive
> lock between the writer and reader during the data transfers.
> Furthermore the kernel transfers a byte at a time, on some
> architecture (including x86), if any buffer is not aligned.

Thanks, this was the kind of information I wanted to know. I assumed
there were exclusive writer locks and such.

I think I've finally decided what to do. Mostly due to realization that
the most problematic part was actually not a problem at all since it
required locking anyway. :) Here's the summary and then I'll shut up
here:

This was about index files for mailboxes. I had already written support
for maildir, but mbox support was where I got stuck. I needed four kind
of file updates:

1) File offset pointers (linked list). These are set only once, never
modified. Originally I used lowest bit to specify that the offset was
"used". I first wrote them without the bit, fsync, then the bit.
Apparently this isn't enough then, so I'll change all 4 bytes to
contain the used-bit. This gives 28bit number, but I can deal with 8
byte alignments so maximum file size is 2GB which is enough.

2) mbox message offsets. These change when messages are deleted or
sometimes with flag changes. But the mbox file has to be locked for
reading anyway if I want to use these offsets, so there's no way
someone is updating them at the same time.

3) Index summary. Total number of message count, unread message count,
etc. There's not many fields here, so I could try different kinds of
ideas here.. One that at least works is to keep them in separate file
and update it with rename().

4) Flag fields. There aren't mutually exclusive flag changes, so it
doesn't matter if I read some of the changes partially.

2003-08-28 14:03:52

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: Lockless file reading

On Thu, 28 Aug 2003 00:12:30 +0530, Nagendra Singh Tomar said:

> Why do u require file locking if there is a *single* writer ?? I don't
> understand why a 123 written over XXX can result in 1X3.

You'd be *amazed* at what can go strange (not wrong, just strange) at the
hardware level. Let's assume for the sake of argument that in your '123' and
'XXX', each character represents 4 or 8 or however many bytes wide your memory
bus is (so "123" is really a 24 byte string). The following can happen
(especially if your hardware doesn't have cache coherency):

1) CPU 0 stores the 24 bytes, and it splits across a cache line boundary.
The '12' goes in one line, '3' in the next.

2) Cache controller 0 does the writeback of '3' first.

3) Cache controller 0 starts the writeback of the other cache line,
and gets the '1' written, still waiting for next memory cycle to write '2'.

4) CPU 1 or a DMA device snarfs up the 24 bytes before the '2' gets there.
'3' got there, '1' got there, '2' will get there the *NEXT* bus cycle.

1X3. Whoops.

Real hardware does this sort of thing all the time. Consider *this* gem from
the IBM S/370 Principles of Operation (GA22-7000-10, page 5-12):

"For the instructions EDIT, EDIT AND MARK, and TRANSLATE, the portions of the
operands that are actually used in the operation may be established in a trial
execution for operand accessibility that is performed before the execution of
the instruction is started. This trial execution consists in an execution of
the instruction in which results are not stored. If the first operand of
TRANSLATE or either operand of EDIT or EDIT AND MARK is changed by another CPU
or by a channel, after the initial trial execution but before completion of
execution, the contents of any fields due to be changed by the instruction are
unpredictable. Furthermore, it is unpredictable whether or not an interruption
occurs for an access exception that was not initially applicable".

In Linux terms, the S/370 family has a hardware instruction that does a
somewhat limited 'sprintf'. It may make a dummy pass through the format string
to compute how long the result is (so it can tell if you should get a SEGV for
going off the end of a page) and it's possible for the format string to get
changed out from under it while it's doing that. In addition, if the format
*used* to be a %4d that wrote to the last 4 bytes of a page, and somebody nails
it to be a %8d halfway through, we may or may not get a SEGV when we scribble 4
bytes onto the next page in memory, even if it's a readonly page we scribbled
on....

The description of what order things are seen to happen in runs for 11 pages,
*not* including special-cases like the above....







Attachments:
(No filename) (226.00 B)

2003-08-28 17:31:13

by Ian Stirling

[permalink] [raw]
Subject: Re: Lockless file reading

>
> Ragnar Hojland Espinosa wrote:
> > It can happen. It happened to me with two gifs. FWIW.
>
> Probability on the order of 2^-32 with MD5 any-pairs collision.
> (It's not usual to have so many GIFs to compare, though :)
> SHA is better, and both probably have some weakness that increases the
> probability of collision.
>
> Do you still have the GIFs?

MD5 is 128 bit output, so that's around 2^64 pairs before you have a birthday.
There arn't that many GIFs in the world.
I'd be really surprised if there were that many pictures in the world.

2003-08-28 17:35:35

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

[email protected] wrote:
> > Probability on the order of 2^-32 with MD5 any-pairs collision.
> MD5 is 128 bit output, so that's around 2^64 pairs before you have a birthday.

Right. Dozy me :)

> > Do you still have the GIFs?
>
> There arn't that many GIFs in the world.
> I'd be really surprised if there were that many pictures in the world.

I'd be really surprised if what you saw wasn't a software error,
misreporting or miscalculating the MD5.

-- JAmie

2003-08-28 18:15:18

by Ian Stirling

[permalink] [raw]
Subject: Re: Lockless file reading

>
> [email protected] wrote:
> > > Probability on the order of 2^-32 with MD5 any-pairs collision.
> > MD5 is 128 bit output, so that's around 2^64 pairs before you have a birthday.
>
> Right. Dozy me :)
>
> > > Do you still have the GIFs?
> >
> > There arn't that many GIFs in the world.
> > I'd be really surprised if there were that many pictures in the world.
>
> I'd be really surprised if what you saw wasn't a software error,
> misreporting or miscalculating the MD5.

Or perhaps more likely, truncating the hash to 32 bits, in which case for most
people there won't be a problem, as a collision isn't likely until you get
to tens of thousands of images.

2003-08-28 20:11:45

by David B. Stevens

[permalink] [raw]
Subject: Re: Lockless file reading

[email protected] wrote:
> On Thu, 28 Aug 2003 00:12:30 +0530, Nagendra Singh Tomar said:
>
>
>>Why do u require file locking if there is a *single* writer ?? I don't
>>understand why a 123 written over XXX can result in 1X3.
>
>
> You'd be *amazed* at what can go strange (not wrong, just strange) at the
> hardware level. Let's assume for the sake of argument that in your '123' and
> 'XXX', each character represents 4 or 8 or however many bytes wide your memory
> bus is (so "123" is really a 24 byte string). The following can happen
> (especially if your hardware doesn't have cache coherency):
>
> 1) CPU 0 stores the 24 bytes, and it splits across a cache line boundary.
> The '12' goes in one line, '3' in the next.
>
> 2) Cache controller 0 does the writeback of '3' first.
>
> 3) Cache controller 0 starts the writeback of the other cache line,
> and gets the '1' written, still waiting for next memory cycle to write '2'.
>
> 4) CPU 1 or a DMA device snarfs up the 24 bytes before the '2' gets there.
> '3' got there, '1' got there, '2' will get there the *NEXT* bus cycle.
>
> 1X3. Whoops.
>
> Real hardware does this sort of thing all the time. Consider *this* gem from
> the IBM S/370 Principles of Operation (GA22-7000-10, page 5-12):
>

Snip

>
>
> In addition, if the format
> *used* to be a %4d that wrote to the last 4 bytes of a page, and somebody nails
> it to be a %8d halfway through, we may or may not get a SEGV when we scribble 4
> bytes onto the next page in memory, even if it's a readonly page we scribbled
> on....
>

Ah, but the scribble will not take place.

> The description of what order things are seen to happen in runs for 11 pages,
> *not* including special-cases like the above....
>

BTW you are quoting from my all time favorite book. Conceptual order is
not always the real order and all those mind games.

Cheers,
Dave

2003-08-28 20:24:30

by David Schwartz

[permalink] [raw]
Subject: RE: Lockless file reading


> On Thu, 2003-08-28 at 12:56, David Schwartz wrote:

> > > > You said that MD5 wasn't strong enough, and you would like
> > > > a guarantee.

> > > Yes. I don't really like it if my program heavily relies on something
> > > that can go wrong in some situations.

> > Okay, this is too much. Your alternative, assuming the kernel won't
> > re-order writes, is clearly relying on something that can go wrong.

> Reorder on per-byte basis? Per-page/block would still be acceptable.

There is no law that says the kernel can't re-order on a per-byte basis.
Memory visibility and posted writes could, at least in theory, result in
this outcome today. Future technology may make it even more probable.

> Anyway, the alternative would be shared mmap()ed file. You can trust
> 32bit memory updates to be atomic, right?

Yes, but you can't trust that multiple 32 bit memory updates won't be seen
out of order by another processor without appropriate barriers or
synchronization.

> Or what about: write("12"), fsync(), write("12")? Is it still possible
> for read() to return "1x1x"?

I believe it still is. The 'fsync' function does not synchronize another
process' view of memory. It is not a memory barrier and, in any event, if it
was a memory barrier it would be running in the wrong place. The 'fsync'
function syncs the memory with stable storage, it does not synch (or at
least, it is not guaranteed to sync) with the memory view of another
process.

Nothing stops speculative reads for the two halves of the data, each in
their own 32-bit unit, from crossing in time. Even if it's not possible on
today's hardware, it is still possible in principle.

DS


2003-08-28 20:38:22

by David Schwartz

[permalink] [raw]
Subject: RE: Lockless file reading


> On Thu, Aug 28, 2003 at 02:56:29AM -0700, David Schwartz wrote:

> > No two data sets with the same MD5 hash are known. It will
> > be many, many
> > years before anyone finds two data sets of the same size with
> > the same MD5
> > hash. The odds of having two data sets just happen to have the
> > same MD5 has
> > are infinitesimal.

> It can happen. It happened to me with two gifs. FWIW.

Find those GIFs, double-check, and publish immediately. That would be
amazingly big news and would probably cause huge numbers of people to switch
from MD5 to SHA1 overnight.

Far more likely, you are mistaken.

DS


2003-08-28 22:14:52

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: Lockless file reading

In article <[email protected]> you wrote:
> Find those GIFs, double-check, and publish immediately. That would be
> amazingly big news and would probably cause huge numbers of people to switch
> from MD5 to SHA1 overnight.

Why is that? For a hash with n bits there are at least
2^y / 2^n = 2^(y-n) files with the same hash, if they have size y bits.
Three are even more, if you consider all files up to this size.

With n=128 (md5) and y=80000 (10k file) x=2^79872 different files with the same
hash, which is:

73758771305612149138757973947840453624822288693642912037251984265061\
00114977473466610342516138996953532522212913684506495987369027376247\
15680102183899956755293229604036298841899346896312975318416816792181\
27163102430377376805525407677117167195273455353500937474682475767850\
59179157052570202917558881297004500267146311552672331119254293941288\
81200564802772472735124212232463022640846529434844027943281412745730\
85139886962923676798297010059108583228148158547397805954834594989809\
72953082523096457840274668781301014306803374729359559243555477575414\
40743184860888676456545815644014580307967389124314445607714035149073\
07589212426013390575273338209003958822345762588652779339518192543309\
86317487722170491968583142374006999722192982951269179790768422998167\
08449702141611939030460451116251870263862094621134152440907522456787\
51931691226465546945055073850028999878309015830845841569473374428818\
38069014874242916295358660843459867579762128717086485255524813917818\
93048437497315173334429979530572030202862199229371143871214566570030\
75727422692477487723119025817627046899875118936499627544458020896601\
57937925093115270533258873516793039117097872784457478242527803568668\
28631394601551771044737759212698144953003036178706026830838663271584\
01592980859584723256093157100606730021922520778222789293516580288692\
25264244660543124963539004344523631091601054705861145611115771926445\
77478534613667305980110146817904878379835989090392217842151012038447\
32903904630045111054945146422201082791203805511249222706103705145946\
71820615624444389237027551184044272914241672227827166453175228687028\
66938683553829374974281116729061361295584136766500184918365115021199\
37169541145849843445053788883266574169813935549175907371700139138138\
72502071050440159721888802465338768753780473818364722509365607509899\
90281003295942592341570547210474186910921077818634263104420109401884\
65729426198803625408484236068350211025109659994524663107915732813683\
23249026269135868577867279765840583230539982435092205081214139636131\
52716158876922016303074856693366702970999646715276255547943011782454\
30451711340888832195663444371564861489518840602769168182051363706921\
56348391007580137339506987160160155329372785718654816692279404095002\
71631290709679864507500640424102364928586816669728574941400097861766\
98623345776769276035921922328812527260139781666739266290331711100138\
69355520506331079686339512263367217460717946825089940625062127323177\
86653058251722203905147108918559635723363913322812633573852636358143\
05929154109676275195321548031664387835763338445788114635589758368895\
17974907569210022960592177775660950422645740978394883220940559863725\
17922274955785951581660617914435597347391402963730377168547070736184\
09409184312862027981853714218127309283218818430216375047782183166306\
16773059670202724953584786650536914280599829953639402884772276266175\
74429349374567546167841892184392743355864837653142407213100256586750\
41336949647306338065403113935569936012036214119515511822780434287340\
30928734800002296824907120604076897989376531011143837044498904591969\
67573317525873614600338869535377596304879781797541287906717071752299\
79432009185387838666393216748127057247821526828251843082459605902721\
91898927046995849949526294260125716903061767423516291316259448887102\
21455117996027263044925703914568589351096240255287529058558303698852\
12851051794252832424631653450392428817783210001109070180912805870873\
07386072771724226933940298630758819641696078651336928520241718916016\
03519077021589676947417102185394484893439475195997406265182485474944\
59178017073695377899466712572589814669494675411536769252318932662247\
42203943726665732107904396523581039784309768624193293669453454613822\
15749223233434417462994719818283525937442572314803099751885933759868\
16634054700507294873752867437781563278323662445301928131033228617590\
35430194103956670055593952506967036312263794347882662751144124966996\
91351677629105493982971530018612054379318441420814383739436560078198\
43138418873116377898447796602758998473352233358353697569513976917684\
45423037648290442116492517078342243822589633273690408706974851274325\
08132319738846084567101570415478070907066126456504167969509604493072\
48645914957093677769177377246870165130337742440748526644770151979128\
47211604618032267059204616591999117507548440215522308874869279619705\
44790645926520397665781107515922450033794810695414009529069301089709\
73157072428268136340204167242984153890552777305481674601484732441017\
28560824403131840588191220521738725204457330809940997294315972715083\
60725305431083136292963174671514146443703870465945021837055000033974\
71343654123104963598692282015717496687304002201274588103720513379013\
73625443109735850169152345779367135337891706905992705940277701946037\
51485390147093579878750275097249158482766343355558089592208214469361\
42276931894022033384129759100364979210001950264216905954876971408322\
93770122779579100592816803465994717468500525380304100505203551456609\
49222119718024454375095356130282351572692787346489597961086730476245\
84334725114957596993473817425633275365545763431282993581470273699983\
69757127063885542311338579547488527586375582908609448860018780419534\
19537297554404836539225811986557499970844823045184713996979552418442\
74406760049159007279015378851550531580664049422653169603166655301827\
48599352193666664508497247573119948382046576805640670494290298734858\
84417223231477087092260452907970029505261005523561715560411919810056\
26325051985436811764833953794260944157856829720560313439000417881996\
66649136236854393658468387755348570975402758189647533662960697427372\
12412647608184960382793382969692900066773207105873069629313639145519\
80713895245632909242920773336712845268455388312926807578615068583681\
14735015718512752431993230816015076649376333137154762997959604419405\
40830019834949523547804904348975842603471995834045489085385950422244\
21421527056856660062948296226796218876244177325383754033674652041604\
73794980904792158435418276687436661921442912299388028773041675317063\
41702985744933602467081363806581646503634957766157218777471663915518\
02627490937845991300518517420026319487215368719584336485771425750960\
74147906392458647321632082098445775029982834370708320917720129979770\
70455506307814353531726531033563236535836642541212334872049648262139\
75871050354308249098506454123665816967377687075573519014546199680718\
25504901192973448419707741143954204819897813037578850381034208321065\
90313529679084091518649506694968971633222734828354332911155417094032\
10171315988545986151874205830881181014135272806355975146559095444333\
46904167367553579042699361907200322278137701375922580419478734423318\
06121488248550399697475750553850800216488249799595528934562185651076\
98483900622844796406822349397868990356392508055418557772495336464051\
84315867243632263128883584334516110866716220279538827000535200550974\
00891217955190870275336026052153821633475850086009198501611505257778\
47933885680755163296870602837070256783151019285961820158384372038595\
17860622613336775309534151796578569381023269452368108582507056951247\
98793404504540912203957292887715691948144120313016926351322062639192\
87532878209614118770056914646891704747300580355118133346576439497968\
43441915977050748964333072943054209320237827891228838972888080894003\
90308897490656128265393212239169918857054561600215188074809617435067\
81432282933335020500424666690346752022172982638444273323437620748833\
52465137449659970847203193277839245531164554835482556691810957680344\
96966689035036985236456367028660016192169034841287219164333818992502\
01221473697763200027508042141909843428691750233206647821600433304062\
18891775169474513534619981893727149433699742028109319843146130300790\
48374620733930112399742313419818337405590406519992557821007628973929\
00113044791623585643793713735052077993606309171905986641896042192953\
02914470680005418423614268207498056056631718569685347881208192391048\
12736436159629914835075343477182008191330928785768860373127939973177\
63052494277204358579240174730488245462349822120037469207659121017510\
01218092291272699027706462817620604198339134745323293916578572282549\
24116244533223866364856865317114967557066663536376343128567680906720\
61764562620886039013573861476592226708854287846592304969334912469803\
12507139147064808903612376733627058308253758999654545653079870971966\
39352239897170601807293538529571930214655824923553210628065441410082\
64909257175626977532187270976620686987192211027534081125067087406140\
04272207570560217022276862167201738705591967047132029817306091706167\
88511161829178775447774522166205835983174468078354646490355262626185\
92707769692650015463072254012707290128020704879925204262894366607928\
64569688539411097568820564002708260545302498646096613541573488053035\
29464112666464799579334001389463815093626821732035568867075693046582\
56066956011728344932544832226503455818172038856422008742254965154356\
66307452484032159959349389526776258180342580543432325682116409135613\
88088767367314844519622100290218802495202321475300477340729565680783\
24011223601204141085691954438204937682332227360382936775988886545128\
93525280821911159972971397574162898085649998565393045404760240604331\
59456659730073515088831333039887309968528845217110079369331693213177\
05558397796379108248708431620233189642397710829943933934005835585902\
17513252794090844735190314941878898989170740854188417518129638056086\
07436762139072615685935122969736651233387075068866259306433410312065\
73504562435401771168805144280321002849860775467143169339505355458866\
35914375988685051115119841307132244418293162454216614670787597317746\
66702911316829023341512709925133948717354523200459998088061675060047\
00566452526710346126651489557268627487651204778699369531585918862201\
59394424745148402742835964929610173703136610804997257202644221090466\
98365297814614744314991928707978437775163172405897520668685112787838\
76745065718227414525009485074009755851181719599276687557225114880046\
03173249464073454779191764341396098872621419423704644007522244100096\
27738372944337782228361617675552100763578393891200653283263519950059\
32714093086973120414639660768933627220032232042053664212501154040067\
41174802564325712314623332216263370677113891935181869723499560650683\
86380815056042861307598414035089442567526704536760404967820726566472\
34670879286692361894969928723752609821775353401149541579651197092514\
39298884447458964811457837922247753902245800878298397829819244600259\
60945304748005444929411694097677373095316194496470938817202711690246\
79164873737588149815425937970573045094656554393364961182798975769812\
79063271606896233814472483059102258115803176451232829617731169473560\
56583472626947527000930777640604734587567158760154364347063261832083\
30385996594589042026405412914351169274159744595075192817784749916345\
62470152751910309001973543268287070146028018506822943624359351696941\
89166494104952290716061467283074936182290043354342490250743196513357\
17212673549917050251415008968860641379683205051131693384563804473690\
89903250787076735030486079965562026803790996240143092356323469108949\
07845535789793689147065546557522839031958055488735322364364131104340\
87166908595780193040642500443843720507588834041902825237277938594405\
15509518379205857942740777802594442587952768392559677020956217589081\
33581899255243337125974255338488475972519745646623282832498001302879\
97096502521398591719712234914109280948285965958478557415471893042412\
37475503214912860261669526560018289994451491775859964901871011582605\
60397222346597960243475095118473975946655966278677300109917282300025\
91196061133760791076031926996915150405486810283942043845416405992441\
04234755865756841005862837001581295460756719973524643902069661443476\
62959323040510506366321536923369435204548433728840037610657060117916\
95218686643134430952095557859827160438074418388249761882170083975363\
49979442098222565355493523403865857324728074954237197296535659279741\
76818946685603045371134466487089332172645850146950210739544416499959\
11699854986965051039658459578956685582448600003884615796462271186448\
17152459906468747763647302588605751100068841061137998158161515420480\
22318750114714645614691409738870229881884135960615681233518827274646\
97655884831524317196513396138604061144709084173127050169960969426206\
52655647298123938967996892262704749484308663243367637233409236395877\
16158838490461567162279329710757545371095127926752344490557461174904\
98350296783037094146290240832642344526951036992128964662104823342328\
54326936859341427510045192855933515626979981579879097216743613480412\
84130111347086326278853018443938104571356527897125855276423431718226\
89948184483964355088191096584486312766049989136785882316252089805747\
91742304600477823361721935335348605132172666045313701386462122287056\
68758394818039394561931249752372359921860235500907703617657471112468\
36477879522500385199108709199684161073878090521604160929658824487302\
08189162141752889696881176598927266776983392249528725466796126147755\
84837319880970278900349346884424939463657023804336647859408884059715\
87012385007902709564185245198547735808691732425755879909338114582324\
96496038851490551258638486060560640779848053453363726807356342381429\
06674320609564946079226292351275912040951590375532791423187152515437\
23852827475892552409562375800889091489173683224781236872060328011056\
48781494027495854853889407259647716580473257070974609155548989153057\
01770396652955542660500535945575049924043963937672274745980419440159\
73418948752580407334724228608155102768772518992593039260939482902676\
68501092440834712572791048398198968952196157492831517190068185690130\
54317743725067865519433837861089656827013210893936524514466555291774\
10118234887021100028017166818373039407738256677074456585169768128673\
31216559786733796846363998574307817534685493308372883174988103338379\
77555499797541822977140710754411379683506825653833835598576632482517\
06897089238260684742063540778170332874744726525838301125226609661605\
75675563264013373771866481097747281315582164490082018932862351602307\
75895597782979091503504290038602647497360256570951098586836413102875\
05985508624511969413092808738775819236922993770908434526013048298335\
63019052110438767283756977888399786317944181262208218736332067033822\
80782661106911795601344763359694750182233827973870280096262437147106\
21345220816897690852045590174605218233496926136141353580378360441714\
28438525233992119663994156239594488313118082243243369528158085152773\
57113378544700236891699113941209103476577008555151150115939550291435\
84311784078501339810706274047644249675445910838212985742593658736726\
05590639634037920190032975704406638099883233645741951743468840995139\
43095512100364094192434566607233069670007691516971518451704593177588\
61282246970125199204660143691272485688286787069963298843415512939777\
64698968031194340606574494573123788121878504563898097114591153445491\
29114045251097051865489399353650136868661016594913875104276448322279\
38508327655772735551923801839886882663244185808346800338063200999436\
16634873166641664790809023885607426914349049558044907612956164113179\
41293084012994951583428416228171274647443699568963003542273548716725\
17543291905739230393954548707021174633513110455440690091273680401298\
12811462097001540223298427500619842282324231271218044540375197467894\
28787621198614457986071126897232495087089840730512905845441161761976\
66601236384032968167285587365409796297465633367726141545996988226883\
94404671657814361092082070588158359501103729598165458354758606730331\
06290923381456909223490297830648362593407596156948828105259159998540\
44681640786318943172502512008752140731673485710162423305221273151554\
14015776177271915335460332199341190875273337813209488648738342833559\
09991792529769222004989099471330675361053443603329769538381697759223\
41487298533047095302388789420693497992006089184803515480781920674919\
09075112364663390621356691339916840245607264564259595373510560189888\
83101152955043503694233791483780253651442819969176278483653883495740\
57348801797457787746560339910935082336178277998340420138860067492161\
80813845520161575668483836859059768645273848264199471732826240207283\
38944009476728508602225488515628992665747699225194127625115010282470\
11255421483010324323934269992453569693406756352625863659617370205694\
46730077567953286193905468530935524416186408220535319273409812154850\
66824640573392916854142528081849040648607925797094385311506922861842\
29144703260429506651941695600111589700081612805239764675960599037895\
85277505278580936871197464961945900796506521050743626513365790652639\
56267995795043841000443373363843188371190067977723895217943323322350\
07131769451445041372122674930206828784929068233225437967059745018449\
24032625385535746044191547305282331530150508122071323064899819715044\
74449341068915282166440199819997121493518122097904841256853371980242\
48686646703791404029568587332628302419928168479892841143897873463368\
91001511064038809491058043208641187299641201395772810244836614652387\
82450681130790331970820232940957236616253594577206591272486713707415\
46948035707641090744852030891472775930637435026651131633879270089970\
44303247964187045852327022149531568482844729534374639100547599878437\
44909519052859764561719309116739679668953213767053177134351858393487\
62085012809637892432061609006747375297605894263669573918014684415911\
77903552657797646888181858087038285788571519109995872900507911395336\
80048417655055459938566880542365695507252330830042280914049463078624\
33569004674540996019078694725685882888530273965971773466249177582671\
42332603208213018856233359235015599937114645865169765977699011678178\
65491574334292488980604052665130357287858832893782241288755784638960\
64397859346154929859372879633696613249742905340795597385119890246889\
08688700640311194450755763678975202118000072107423667977807786843802\
46174591520518486946282593854915117798362423931239232306806408129269\
34708055912784899835082772799444653961132222086104558970432767747437\
88633596584638516429917747859701245255267497057587911651007806899367\
83605388585832329612519332348998714112329223819072111733578428201610\
89626170510883461035298725572775800706131604421628258178566484068261\
12405337928043939969536697946870334885024960249216783531029555972097\
84824503847059549111252017217901866557048983055500910349584743702027\
02270184957931102654283893318223648235659233326009586692028246496989\
60710329892077654990141054083454257677162385124341509709606214608731\
38194953833855587315682647117056971213294828892548510635009164117322\
69780817879546310416971723639232724179897303295326084797910865848021\
71103436912274354278556604836788582218736050493567224034486228435752\
75772545021730137777109249271305862854606379162479556321716369747012\
71058026509724749532545806263542358710308420248483382469723776037376\
69597983214319566329384729274790976182741310687825587124747694271202\
28994119273457226536530124689674046719121694781331975885866944397414\
80646928865900476175693828843742554200741632048619852118186250519076\
96565875277089012694058827876101297079720952161195398523023828552657\
02759085017247705446867703237841506390589295997796632748329114874936\
18403281978323594239953480189136997128462682656651813935559548642366\
91344303935616433794560077739844540045026897632503760550277165908180\
44396002251991670687351585740943555336338404437656999031317379460574\
33644039340780171677598011917530775848363921450425665793612984955814\
46229638868178745565157046098390502162983892266194312085128770307885\
03935195697010019173631562502099772732135285257616457457496713081114\
27698087930936134900937786460750887725990966034866935651522844738350\
67836613842467245269685348426064212529884648042396191418842320783973\
75995845907852205274837621277373785446822230566103387366537283783471\
14733090624177806365035413997882073568160227884766259995743233063926\
04323525627951228768897375175821617018283853002830565486213078784078\
55148910934828741195385502504757742265796083683664577440469374407863\
13164696065151429326584359520661070954626038065790919987475229950740\
98216024048599560662121214665212267608906887699061900342360030929097\
47880445733354436868505653643898815979415719223075312881714224290123\
41103905501102814618080271895011309165724524105825798203841299390522\
77992971203468621022598279406809300700081197021715469489274684846553\
77079370486851280632467450720757578010346887352264309727206549952369\
84430678997862148905029984576703819937255833785743720341765540435032\
96646736240918734453057541699910818180971797288574782433416947290678\
37280856433367250804218572773292211981900395475951298227991332978940\
27393755102812354289041266547584803540291969245091496543899626015439\
20544328434478996976635560702205616427714866475845230636742677201248\
54833564014964014561785203285877295747933756675009893281551897328021\
59418768491468356922541194834374009854217058462894766713723976242506\
33501980447527984054506287725044344865461692582424535946790342675202\
19477924778192351335239273036438409900357736968995545750602630916862\
07145824980908319494765273379305286696173544270167015094012422019461\
10110887380826756316024097822574773598557194695478247372939507966203\
29705858302812528783680908061271781318308853066172524646574009812058\
15491577412033628444516865373536840299843225279750680250512209287726\
70397253246630246078780479868717275510620978044054573646725585953242\
24758324770955832253706182177378996095677809200493210148738421043729\
74274161518014481783974080854045256838062599340760051731497627412507\
47265737662102660719976929447755810323498446434442537511040265468910\
43510141536688577299258638010109977110563868273874430671291803629871\
60935608539042562755537099055435404432514293203203439433284377871436\
57994132148348689166418229564255038075101142940868371859796958043423\
95574667099069464561787497638061471023478899430045148689551367991881\
21566086064763654052446414731052255138606247622016854079994875327395\
48336043017691288701099997564691519979028704748160129306878648076267\
78807483332466167739993336059393439458713795495027130913477137163967\
33212638853761959999943498955186589302026949447117881641750014115955\
94467826370160249803549647393175426177149199867597218424686918372730\
78980944010070138138080752013831258280466907055392611875371510020251\
51800106836557368069728552452932087455607278280099606979325749370959\
32140319187190721617428623235849046666769248377704247592637495100653\
16908459137057163937207803590262164206824274693991636821189334048631\
18357431296069511882893876953799136574175178784539883682427570859050\
23638256014536405998794647819608013589296597158733008320056824712260\
63549827856575391246729828051534769056050117993691741808160658338919\
44097905646550003580396419617760070019081705875625972878012366674679\
89292342512250048487425437546815015357199615226915228311396456047784\
77865786121932224656620803395855540010187517792883116941453865099079\
74150833888842573806312810859398218917803622873821251271610417946046\
11112073413623443045273514117087181250817162098103623756234695368183\
13329645002476824195759384817251427574077359703488481671707390563123\
29489572243615805666703488202767581029534591772098940413043218935233\
90863085401224517659534671566110922764069066448030558833184012635309\
67680593942113177332551690381005762329396768070186525893092140568792\
80842586082843501854057032456535635578215444814064662533030013057318\
31828990009857758497553071226979016224998850520128961978321074633694\
72596920941676631197534864178982171496812717822255063769725399908483\
22022524322928978677204857120686863585204689473401131106177840652358\
37371614145085538055371518912620992897032714169143489868354092566936\
48812775003327275943150768750860164679671172857179015820811479158914\
84122397219348846942722048529334135862982966040930598883092702070297\
01021054310366548846285483053033629752307019066118768758621856563205\
61077991767594949026864421630980354592182135940618795388702995460780\
15244528568453393510567298824950473694304626782827092561451548697001\
58655671621640122290190060058355755003380914495159377600243284934348\
30830232919737374256470494422199465871651077565986045910787485717984\
28640454849863473825645873938305511859807699766243416400243821342352\
03754512014015085697869209666106500847867745391449969654486637962121\
96099816021654477490828443488872228300302385780318236684548929433912\
93263596581118329352900643557103898820061908796437986587441214584341\
11988752772398357121525143168135648985355335801715938044069098157455\
59287951656187194689448537276786455279919640158591032349477352564508\
41323717085310676425390779501257241835410944478913727273009603729220\
56811212360534469226325897656316758584230613651821772025561754082913\
7151881915423973881943439830877204381696

files.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-08-28 22:01:17

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: Lockless file reading

In article <[email protected]> you wrote:
> I'd be really surprised if there were that many pictures in the world.

Well, this is about probabilty. It does not mean that you need 2^64
pictures, neighter does it mean you have a collision within 2^64 pictures.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-08-28 21:49:36

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: Lockless file reading

In article <[email protected]> you wrote:
> The kernel does not provide synchronisation between read() and write()
> data transfers, and it does not always use atomic 32-bit reads and
> writes either.

There are some gurantees for pipes, for example. Writes which are smaller
than pathconf(_PC_PIPE_BUF, pipe) will be atomic, i.e. not mixed between
multiple writers. I dont know what the gurantee for write/read or mmap is.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-08-28 23:00:20

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Bernd Eckenfels wrote:
> Why is that? For a hash with n bits there are at least
> 2^y / 2^n = 2^(y-n) files with the same hash, if they have size y bits.
> Three are even more, if you consider all files up to this size.

Correct. Now, can you tell me how many of the 2^(y-n) files are ones
you will find on the net or on anybody's hard disk?

I'll take a couple of guesses: either 0 or 1 :)

The strength of a cryptographic hash is due to the immense difficulty
of deliberately creating a file given a hash value. Another strength
is the very low probability of finding two different files with the
same hash value.

They're a bit like uncomputable numbers. (Just a little bit). Lots
of colliding files exist. But if you can find even one pair, that's
an amazing event.

Uncomputable numbers are similar. There are infinitely more
uncomputable numbers than computable ones, yet it is impossible to
write one down or even refer to a specific one of them.

-- Jamie

2003-08-28 23:02:40

by Jamie Lokier

[permalink] [raw]
Subject: Re: Lockless file reading

Bernd Eckenfels wrote:
> In article <[email protected]> you wrote:
> > I'd be really surprised if there were that many pictures in the world.
>
> Well, this is about probabilty. It does not mean that you need 2^64
> pictures, neighter does it mean you have a collision within 2^64 pictures.

It just means that if you have a collision with many fewer pictures
than that, it's such an unlikely event that a flaw in the program
calculating the hash, or a flaw in the hash algorithm itself, is more
likely than it being a random collision.

-- Jamie

2003-08-28 23:49:06

by Ian Stirling

[permalink] [raw]
Subject: Re: Lockless file readingu

>
> In article <[email protected]> you wrote:
> > I'd be really surprised if there were that many pictures in the world.
>
> Well, this is about probabilty. It does not mean that you need 2^64
> pictures, neighter does it mean you have a collision within 2^64 pictures.

Of course it dowesn't.
The probability gets rather smaller as numbers go down, and bigger as
they go up.
With 2^128 bits, the chance of a a collision between 2^64 randomly chosen
pictures is 50%.
At 2^54 pictures, it's about one in a million, and at 2^34 (enough for
several pictures of everyone alive) one in a billion billion.
At more common numbers of pictures (say 2^14) it becomes vanishingly
unlikely for anyone to have two matching pictures (even with several billion
archives)

2003-08-29 00:47:13

by David Schwartz

[permalink] [raw]
Subject: RE: Lockless file reading


> In article <[email protected]> you wrote:

> > Find those GIFs, double-check, and publish immediately.
> > That would be
> > amazingly big news and would probably cause huge numbers of
> > people to switch
> > from MD5 to SHA1 overnight.

> Why is that?

Because the security of a hash is based upon it being impractical to find
collisions. If you could demonstrate that you could find even a single
collision (by actually doing it), that would reduce confidence in MD5
massively.

> For a hash with n bits there are at least
> 2^y / 2^n = 2^(y-n) files with the same hash, if they have size y bits.
> Three are even more, if you consider all files up to this size.
>
> With n=128 (md5) and y=80000 (10k file) x=2^79872 different files
> with the same
> hash, which is:
[snip]

So what? You can, in principle, break a 4096-bit RSA key just by brute
force factorization. The security is based upon the difficulty of actually
doing it in practice. The simplicity of doing it in theory is irrelevent.
Yes, in theory you can break any public key encryption scheme or any hash
digest algrithm by brute force trial and error. In practice, however your
children's, children's, children's, children would be long dead before you
got 1% of the way there.

To date, no full MD5 collisions have been published and until earlier in
this thread, nobody even claimed to have one. And I'll bet $100 to $1 that
the claimant is in error, either because he used only part of the hash or a
hash of the hash, mis-implementing the hash, o actually hashed the same data
twice without realizing it.

DS


2003-08-29 10:00:18

by jlnance

[permalink] [raw]
Subject: Re: Lockless file readingu

On Fri, Aug 29, 2003 at 12:44:00AM +0100, [email protected] wrote:

> Of course it dowesn't.
> The probability gets rather smaller as numbers go down, and bigger as
> they go up.
> With 2^128 bits, the chance of a a collision between 2^64 randomly chosen
> pictures is 50%.
> At 2^54 pictures, it's about one in a million, and at 2^34 (enough for
> several pictures of everyone alive) one in a billion billion.
> At more common numbers of pictures (say 2^14) it becomes vanishingly
> unlikely for anyone to have two matching pictures (even with several billion
> archives)

Be careful. I remember discussing in probability class the liklyhood that
two people in a room with N people have the same birthday. N does not have
to be anywhere close to 365 for your probability of a collision to be greater
than 50%. I forget what the exact number is but its less than 30. The
image problem sounds similar, depending on exactly how you phrase it.

Thanks,

Jim

2003-08-29 11:55:56

by David Schwartz

[permalink] [raw]
Subject: RE: Lockless file readingu


> On Fri, Aug 29, 2003 at 12:44:00AM +0100, [email protected] wrote:

> > Of course it dowesn't.
> > The probability gets rather smaller as numbers go down, and bigger as
> > they go up.
> > With 2^128 bits, the chance of a a collision between 2^64
> > randomly chosen
> > pictures is 50%.
> > At 2^54 pictures, it's about one in a million, and at 2^34 (enough for
> > several pictures of everyone alive) one in a billion billion.
> > At more common numbers of pictures (say 2^14) it becomes vanishingly
> > unlikely for anyone to have two matching pictures (even with
> > several billion
> > archives)

> Be careful. I remember discussing in probability class the liklyhood that
> two people in a room with N people have the same birthday. N
> does not have
> to be anywhere close to 365 for your probability of a collision
> to be greater
> than 50%. I forget what the exact number is but its less than 30. The
> image problem sounds similar, depending on exactly how you phrase it.

He's saying that with 2^128 possible hash values, you get a collision with
50% probability with roughly 2^64 pictures. Analogously for the birthday
paradox, with 365 possible birthdays (about 2^8), you get a collision with
50% probability with less than 30 people (about 2^5).

Now does 5 really seem to be that much less than 8? His math is right and
so is yours. With 2^N possible values, all equally likely, you get a
collision with 50% probability in roughly 2^((N+1)/2).

You'll note that the birthday paradox doesn't seem so suprising in
exponential notation. In the case of MD5, with
340,282,366,920,938,463,463,374,607,431,768,211,456 possible hashes, you get
a 50% chance of a collision in roughly
12,912,720,851,596,686,131 hashes. So that could seem fairly surprising when
you take it out of exponential notation too. (Just compare the lengths of
the two numbers.)

DS


2003-08-29 15:42:51

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Lockless file readingu

On Fri, Aug 29, 2003 at 06:00:11AM -0400, [email protected] wrote:
> Be careful. I remember discussing in probability class the liklyhood that
> two people in a room with N people have the same birthday. N does not have
> to be anywhere close to 365 for your probability of a collision to be greater
> than 50%. I forget what the exact number is but its less than 30. The
> image problem sounds similar, depending on exactly how you phrase it.

This is very simple to see. Consider the probability of not having a
collision instead of having a collision. Each choice eliminates a
choice, so the probability of not colliding while taking a sample after
taking N samples from a uniform distribution over K values is (K-N)/K.
You end up getting 1 - K!/((K-N)! * K**N).

Given one possible value for each day of the year, K == 365.

So you want 1 - 365!/((365 - N)! * 365**N) > 1/2.

The first few probabilities of collision (calculated afresh) are:

0 0.0
1 0.0
2 2.739726027397249e-3
3 8.204165884781345e-3
4 1.6355912466550326e-2
5 2.713557369979358e-2
6 4.0462483649111536e-2
7 5.6235703095975365e-2
8 7.433529235166902e-2
9 9.462383388916673e-2
10 0.11694817771107757
11 0.14114137832173312
12 0.16702478883806438
13 0.19441027523242949
14 0.223102512004973
15 0.25290131976368646
16 0.2836040052528499
17 0.31500766529656066
18 0.34691141787178936
19 0.37911852603153673
20 0.41143838358058005
21 0.4436883351652058
22 0.4756953076625501
23 0.5072972343239854
24 0.5383442579145288
25 0.5686997039694639
26 0.598240820135939
27 0.626859282263242
28 0.6544614723423994
29 0.680968537477777
30 0.7063162427192686
31 0.7304546337286438
32 0.7533475278503207
33 0.774971854175772
34 0.7953168646201543
35 0.8143832388747152
36 0.8321821063798795
37 0.8487340082163846
38 0.8640678210821209
39 0.878219664366722
40 0.891231809817949
41 0.9031516114817354
42 0.9140304715618692
43 0.9239228556561199
44 0.9328853685514263
45 0.940975899465775
46 0.9482528433672547
47 0.9547744028332994
48 0.9605979728794224
49 0.9657796093226765
50 0.9703735795779884

N = 23 should do fine for your purposes.


-- wli