2021-09-16 12:19:14

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

Am Donnerstag, 16. September 2021, 12:12:48 CEST schrieb Tobias Ulmer:
> On 16/09/2021 08:38, Rolf Eike Beer wrote:
> ...
>
> > The redirection seems to be an important part of it. I now did:
> >
> > git ... 2>&1 | sha256sum
>
> I've tried to reproduce this since yesterday, but couldn't until now:
>
> 2>&1 made all the difference, took less than a minute.
>
> Different repo, different machine, but also running Tumbleweed
> 5.14.1-1-default, git 2.33.0
>
> while [ "`git --git-dir=$PWD/in/linux/.git ls-remote origin 2>&1 | tee
> failed.out | sha1sum`" = "7fa299e589bacdc908395730beff542b0fc684eb -"
> ]; do echo -n .; done
> ..........
>
> failed.out has multiple lines like this:
>
> --8<--
> 4e77f7f1261f65cff06918bc5e66d02a418fc842 refs/tags/v3.10.18^{}
> f7b8df0cc81cf82a4ac6834225bddbe46a340455a4a5d52f29d08d923ce8d232b0b497da674d
> d2c refs/tags/v3.18
> b2776bf7149bddd1f4161f14f79520f17fc1d71d refs/tags/v3.18^{}
> --8<--
>
>
> Running the same on Archlinux (5.13.13-arch1-1, 2.33.0) doesn't show the
> problem.
> This may well turn out not to be git, but a kernel issue.

Linus,

since you have been hacking around in pipe.c recently, I fear this isn't
entirely impossible. Have you any idea?

For easier reference, the complete thread is at:

https://public-inbox.org/git/[email protected]/T/

Eike
--
Rolf Eike Beer, emlix GmbH, http://www.emlix.com
Fon +49 551 30664-0, Fax +49 551 30664-11
Gothaer Platz 3, 37083 Göttingen, Germany
Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055

emlix - smart embedded open source


Attachments:
signature.asc (321.00 B)
This is a digitally signed message part.

2021-09-16 15:52:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

On Thu, 2021-09-16 at 14:17 +0200, Rolf Eike Beer wrote:
> Am Donnerstag, 16. September 2021, 12:12:48 CEST schrieb Tobias Ulmer:
> > On 16/09/2021 08:38, Rolf Eike Beer wrote:
> > ...
> >
> > > The redirection seems to be an important part of it. I now did:
> > >
> > > git ... 2>&1 | sha256sum
> >
> > I've tried to reproduce this since yesterday, but couldn't until now:
> >
> > 2>&1 made all the difference, took less than a minute.
> >
> > Different repo, different machine, but also running Tumbleweed
> > 5.14.1-1-default, git 2.33.0
> >
> > while [ "`git --git-dir=$PWD/in/linux/.git ls-remote origin 2>&1 | tee
> > failed.out | sha1sum`" = "7fa299e589bacdc908395730beff542b0fc684eb  -"
> > ]; do echo -n .; done
> > ..........
> >
> > failed.out has multiple lines like this:
> >
> > --8<--
> > 4e77f7f1261f65cff06918bc5e66d02a418fc842        refs/tags/v3.10.18^{}
> > f7b8df0cc81cf82a4ac6834225bddbe46a340455a4a5d52f29d08d923ce8d232b0b497da674d
> > d2c refs/tags/v3.18
> > b2776bf7149bddd1f4161f14f79520f17fc1d71d        refs/tags/v3.18^{}
> > --8<--
> >
> >
> > Running the same on Archlinux (5.13.13-arch1-1, 2.33.0) doesn't show the
> > problem.
> > This may well turn out not to be git, but a kernel issue.
>
> Linus,
>
> since you have been hacking around in pipe.c recently, I fear this isn't
> entirely impossible. Have you any idea?
>
> For easier reference, the complete thread is at:
>
> https://public-inbox.org/git/[email protected]/T/
>

I use git-daemon (2.33) and reference clones for my local pile of
kernel trees (74), so out of curiosity, modified the above ls-remote
loop to fit one of them, and tried to reproduce with both master.today
(ff1ffd71) and SUSE's stable branch (where Tumbleweed gets source,
currently at 5.14.4). Both kernels failed to reproduce given a few
minutes each (zzzz) to do so. I'm running Leap-15.3 vs Tumbleweed, but
that shouldn't matter.

-Mike

2021-09-17 08:08:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

On Thu, Sep 16, 2021 at 5:17 AM Rolf Eike Beer <[email protected]> wrote:
>
> Am Donnerstag, 16. September 2021, 12:12:48 CEST schrieb Tobias Ulmer:
> > > The redirection seems to be an important part of it. I now did:
> > >
> > > git ... 2>&1 | sha256sum
> >
> > I've tried to reproduce this since yesterday, but couldn't until now:
> >
> > 2>&1 made all the difference, took less than a minute.

So if that redirection is what matters, and what causes problems, I
can almost guarantee that the reason is very simple:

Your git repository (or more likely your upstream) has some problem,
it's getting reported on stderr, and because you mix stdout and stderr
with that '2>&1', you get randomly mixed output.

Then it depends on timing where the mixing happens.

Or rather, it depends on various different factors, like the buffering
done internally by stdio (where stdout generally will be
block-buffered, while stderr is usually line-buffered, which is why
you get odd mixing of the two).

But timing can be an effect particularly with "git ls-remote" and
friends, because you may get errors from the transport asynchronously.

So the different buffering ends up causing the effect of mixing things
in the middle of lines, while the timing differences due to the
asynchronous nature of the remote access pipeline will likely then
cause that odd mixing to be different.

End result: corrupted lines, and different sha256sum every time.

> > Running the same on Archlinux (5.13.13-arch1-1, 2.33.0) doesn't show the
> > problem.
> > This may well turn out not to be git, but a kernel issue.

Much more likely that the other box just doesn't have the error situation.

> since you have been hacking around in pipe.c recently, I fear this isn't
> entirely impossible. Have you any idea?

Almost certainly not the kernel. Kernel - and other - differences
could affect timing, of course, but the whole "2>&1" really is
fundamentally bogus.

If you don't have any errors, then the "2>&1" doesn't matter.

And if you *do* have errors, then by definition the "2>&1" will mix in
the errors with the output randomly and piping them together is
senseless.

Either way, it's wrong.

So what I'd suggest Tobias should do is

git ... 2> err | sha256sum

which will send the errors to the "err" file. Take a look at that file
afterwards and see what is in it.

Basically, '2&>1" is almost never the right thing to do, unless you
explicitly don't care about the output and just want to suppress it.

So "2&>1 > /dev/null" is common and natural.

Of course, people also use it when they just want to eyeball the
errors mixed in, so doing that

... 2&>1 | less

thing isn't necessarily *wrong*, but it's somewhat dangerous and
confusing. Because when you do it you do need to be very aware of the
fact that the errors and output will be *mixed*. And the mixing will
not necessarily be at all sensible.

Finally: pipes on a low level guarantee certain atomicity constraints,
so if you do low-level "write()" calls of size PIPE_BUF or less, the
contents will not be interleaved randomly. HOWEVER. That's only true
at that "write()" level. The moment you use <stdio> for your IO, you
have buffering inside of the standard IO libraries, and if your code
isn't explicitly very careful about it, using setbuf() and fflush()
and friends, you'll get that random mixing.

Anyway. That was a long email just to tell people it's almost
certainly user error, not the kernel.

Linus

2021-09-17 09:00:26

by Junio C Hamano

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

Linus Torvalds <[email protected]> writes:

> On Thu, Sep 16, 2021 at 5:17 AM Rolf Eike Beer <[email protected]> wrote:
>>
>> Am Donnerstag, 16. September 2021, 12:12:48 CEST schrieb Tobias Ulmer:
>> > > The redirection seems to be an important part of it. I now did:
>> > >
>> > > git ... 2>&1 | sha256sum
>> >
>> > I've tried to reproduce this since yesterday, but couldn't until now:
>> >
>> > 2>&1 made all the difference, took less than a minute.
>
> So if that redirection is what matters, and what causes problems, I
> can almost guarantee that the reason is very simple:
> ...
> Anyway. That was a long email just to tell people it's almost
> certainly user error, not the kernel.

Yes, 2>&1 will mix messages from the standard error stream at random
places in the output, which explains the checksum quite well.

I am not sure if it explains the initial report where

ls-remote 2>&1 | less

produced

> 6f38b5d6cfd43dde3058a10c68baae9cf17af912 refs/tags/v5.0-rc2
> 1c7fc5cbc33980acd13ae83d0b416db002fe95601e7f97f64b59514d936 refs/tags/v5.7-rc2^{}
> d0709bb6da2ab6d49b11643e98abdf79b1a2817f refs/tags/v5.7-rc3

What we see on the second line is the beginning of peeled
v5.0-rc2^{} up to the "acd13" (that is, the first 19 bytes of the
line), followed by the full line for peeled v5.7-rc2^{} (which
begins with "ae83d"). 12407 bytes in between are missing, which
is even more puzzling as it is not a nice round number.

I can sort of guess that the progress display during transfer, which
comes out on the standard error stream and uses terminal control
sequences like "go back to the end of the line without feeding a new
line", "erase to the end of the line", etc., would be contributing,
but because it is piped to "less", which would make it "visible"
(i.e. you do not get the raw escape but see three capital letters
ESC in reverse), it does not quite explain how the display was
broken.

In any case, I do not think the kernel is involved, or more
generally I do not think any "loss of output bytes" is happening
here. It's just "| less" that failed to show a range about 12k
bytes long is mystery to me ;-).


2021-09-17 12:43:02

by Mike Galbraith

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

On Thu, 2021-09-16 at 17:49 +0200, Mike Galbraith wrote:
> Both kernels failed to reproduce...

Nor did the TW kernel (now 5.14.2-1-default) reproduce, neither in my
Leap-15.3 box, nor in a TW KVM set up to play server. 'course that
doesn't mean there's no kernel bug lurking, means with certainty only
that if there is one, the posted reproducer ain't all that wonderful.

-Mike

2021-09-17 12:48:01

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

Am Donnerstag, 16. September 2021, 22:42:22 CEST schrieb Junio C Hamano:
> Linus Torvalds <[email protected]> writes:
> > On Thu, Sep 16, 2021 at 5:17 AM Rolf Eike Beer <[email protected]> wrote:
> >> Am Donnerstag, 16. September 2021, 12:12:48 CEST schrieb Tobias Ulmer:
> >> > > The redirection seems to be an important part of it. I now did:
> >> > >
> >> > > git ... 2>&1 | sha256sum
> >> >
> >> > I've tried to reproduce this since yesterday, but couldn't until now:
> >> >
> >> > 2>&1 made all the difference, took less than a minute.
> >
> > So if that redirection is what matters, and what causes problems, I
> > can almost guarantee that the reason is very simple:
> > ...
> > Anyway. That was a long email just to tell people it's almost
> > certainly user error, not the kernel.
>
> Yes, 2>&1 will mix messages from the standard error stream at random
> places in the output, which explains the checksum quite well.

If there would be any errors. The point is: if I run the command with ">/dev/
null" just to the terminals a hundred times there is never any output on
stderr at all. If I pipe stderr into a file it's empty after all of this (yes,
I did append, not overwrite).

That the particular construct in this case is sort of nonsense is granted, I
just hit it because some tool here used some very similar construct and
suddenly started failing. "less" isn't the original reproducer, it was just
something I started testing with to be able to easily visually inspect the
output.

What you need is a _fast_ git server. kernel.org or github.com seem to be too
slow for this if you don't sit somewhere in their datacenter. Use something in
your local network, a Xeon E5 with lot's of RAM and connected with 1GBit/s
Ethernet in my case.

And the reader must be "somewhat" slow. Using sha256sum works reliably for me.
Using "wc -l" does not, also md5sum and sha1sum are too fast as it seems.

When I run the whole thing with strace I can't see the effect, which isn't
really surprising. But there is a difference between the cases where I run
with redirection "2>&1":

ioctl(2, TCGETS, 0x7ffd6f119b10) = -1 ENOTTY (Inappropriate ioctl for
device)

and without:

ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0

AFAICT this is the only place where fd 2 is used at all during the whole time.

Regards,

Eike
--
Rolf Eike Beer, emlix GmbH, http://www.emlix.com
Fon +49 551 30664-0, Fax +49 551 30664-11
Gothaer Platz 3, 37083 Göttingen, Germany
Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055

emlix - smart embedded open source


Attachments:
signature.asc (321.00 B)
This is a digitally signed message part.

2021-09-18 02:53:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

On Thu, Sep 16, 2021 at 11:59 PM Rolf Eike Beer <[email protected]> wrote:
>
> When I run the whole thing with strace I can't see the effect, which isn't
> really surprising. But there is a difference between the cases where I run
> with redirection "2>&1":
>
> ioctl(2, TCGETS, 0x7ffd6f119b10) = -1 ENOTTY (Inappropriate ioctl for device)

Ehh. That format of strace implies that you didn't use "strace -f"
(which would have the PID in it).

Although maybe you edited it out.

I think the error output would come from the other process (ssh, or
whatever process you use to run "git-upload-pack" on the other end).

I still strongly doubt it's about pipes - we've had changes to them,
but if they are broken we'd see a lot more breakage than some very
incidental use by git.

But I can easily see it being timing-dependent. And yes, sadly
'strace' can often end up hiding any timing issues because it
obviously slows down the target quite a bit.

Doing "strace -o tracefile -f" in a loop would be interesting if you
can reproduce it (and then stop when you reproduce it, so that the
final 'tracefile' is the one for the case that reproduced it).

Linus

2021-09-18 07:44:31

by Jeff King

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

On Fri, Sep 17, 2021 at 08:59:07AM +0200, Rolf Eike Beer wrote:

> What you need is a _fast_ git server. kernel.org or github.com seem to be too
> slow for this if you don't sit somewhere in their datacenter. Use something in
> your local network, a Xeon E5 with lot's of RAM and connected with 1GBit/s
> Ethernet in my case.

One thing that puzzled me here: is the bad output between the server and
ls-remote, or between ls-remote and its output pipe?

I'd guess it has to be the latter, since otherwise ls-remote itself
would barf with an error message.

In that case, I'd think "git ls-remote ." would give you the fastest
outcome, because it's talking to upload-pack on the local box. But I'm
also confused how the speed could matter, as ls-remote reads the entire
input into an in-memory array, and then formats it.

We do the write using printf(). Is it possible your libc's stdio may
drop bytes when the pipe is full, rather than blocking? In general, I'd
expect write() to block, so libc doesn't have to care at all. But might
there be something in your environment putting the pipe into
non-blocking mode, and we get EAGAIN or something? If so, I'd expect
stdio to return the error.

Maybe patching Git like this would help:

diff --git a/builtin/ls-remote.c b/builtin/ls-remote.c
index f4fd823af8..5936b2b42c 100644
--- a/builtin/ls-remote.c
+++ b/builtin/ls-remote.c
@@ -146,7 +146,8 @@ int cmd_ls_remote(int argc, const char **argv, const char *prefix)
const struct ref_array_item *ref = ref_array.items[i];
if (show_symref_target && ref->symref)
printf("ref: %s\t%s\n", ref->symref, ref->refname);
- printf("%s\t%s\n", oid_to_hex(&ref->objectname), ref->refname);
+ if (printf("%s\t%s\n", oid_to_hex(&ref->objectname), ref->refname) < 0)
+ die_errno("printf failed");
status = 0; /* we found something */
}


> And the reader must be "somewhat" slow. Using sha256sum works reliably for me.
> Using "wc -l" does not, also md5sum and sha1sum are too fast as it seems.

If a slow pipe is involved, maybe:

git ls-remote . | (sleep 5; cat) | sha256sum

would help reproduce. Assuming ls-remote's output is bigger than your
system pipe buffer (which is another interesting thing to check), then
it should block for 5 seconds on write() midway through the output,
which you can verify with strace.

-Peff

2021-09-18 13:08:56

by Mike Galbraith

[permalink] [raw]
Subject: Re: data loss when doing ls-remote and piped to command

On Fri, 2021-09-17 at 08:59 +0200, Rolf Eike Beer wrote:
>
> What you need is a _fast_ git server. kernel.org or github.com seem to be too
> slow for this if you don't sit somewhere in their datacenter. Use something in
> your local network, a Xeon E5 with lot's of RAM and connected with 1GBit/s
> Ethernet in my case.

Even faster: what's coming across that wire should be a constant (is?),
variable is only delivery/consumption jitter. If there's really really
a pipe problem lurking, you should also be able to trigger by saving
the data once, and just catting it, letting interrupts etc provide
jitter. Which stdout is left of '|' in a script shouldn't matter one
whit to the interpreter/kernel conversation, they're all the same.

That said, if I had a reproducer I was confident pointed to the kernel,
I'd try to bisect.. boring as hell, but highly effective.

-Mike