2024-05-14 05:00:32

by Bagas Sanjaya

[permalink] [raw]
Subject: Fwd: [regression] [bisected] commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 corrupts data sent via pseudoterminal device

Hi,

<[email protected]> reported on Bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=218834) pseudoterminal data
corruption regression. He wrote:

> Hello. There appears to be a regression in pseudoterminal support introduced by commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0.
>
> We have a wrapper program that uses a pseudoterminal to transform data fed into input/read from output of a child process. Transformed data is written to the master device, the child process, whose stdin/stdout are mapped to the slave device, reads data from parent by reading from its stdin, processes that data and sends it back to parent by writing the response to its stdout, and then data from the child process is read by the wrapper from pseudoterminal master descriptor.
>
> This used to work fine on various Linux distros such as Amazon Linux2, SLES 12/15, openSUSE 15, RHEL 8/9 with pre 6.0 kernels. However, running our regression suite on Amazon Linux 2023 which uses a 6.x kernel revealed regression. Sometimes, when a lot of data is written by the parent process into the master device's descriptor, some data is lost or corrupted when the child process is reading it from the slave device's descriptor. We've verified that this is a kernel regression introduced via commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 by performing kernel bisection as described in https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html#introguide-bissbs. Bisection was performed using openSUSE 15.5 in VirtualBox (so the kernels had the tainted flag set) but the issue can also be reproduced outside of VirtualBox.
>
> Bisection logs and a repro test case used to perform bisection are in the repro+bisect.zip attachment.
>
> -----------------------------------------------------------------------------------
> The test script, ptybug.sh, pipes a large plain text file containing 1027800 plain text records separated by newlines into the wrapper program. Each record is a 0-padded record number padded with spaces to 80 bytes.
>
> The wrapper program sends the unmodified data read from its stdin to the child program via a pseudoterminal, reads the output from the child and prints it on its stdout. Wrapper's stdout is redirected to a file which is then compared with expected output.
>
> The child process reads exactly 1000000 records from its stdin, strips off trailing whitespace, and prints the record on its stdout.
>
> Both the wrapper and the child process mirror the data written into the pseudoterminal master descriptor/read from the pseudoterminal slave descriptor into separate files - which are identical on kernels prior to commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 and which differ on kernels that include that commit.
>
> -----------------------------------------------------------------------------------
> How to reproduce:
>
> tar xvf repro+bisect.tar.zstd
> cd repro
> ./ptybug.sh
>
> Expected output (on 5.19.0-rc1* kernels prior to commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0)
> rm -f ptybug ptybug_child gen *.o
> cc ptybug.c -o ptybug
> cc ptybug_child.c -o ptybug_child
> cc gen.c -o gen
> Test run 1 ...
> Test run 2 ...
> Test run 3 ...
> Test run 4 ...
> Test run 5 ...
> PASS
>
> Actual/bad output (on 5.19.0-rc1* starting from commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 and later kernels, including latest)
> rm -f ptybug ptybug_child gen *.o
> cc ptybug.c -o ptybug
> cc ptybug_child.c -o ptybug_child
> cc gen.c -o gen
> Test run 1 ...
> FAIL Actual and expected output does not match (gen1027800)
> FAIL
>
>
> To see the difference between data written by the master into the pseudoterminal master descriptor and data read by the child from the pseudoterminal slace descriptor (stripping off trailing spaces) run:
> diff -u gen1027800_sysin.txt.nospace gen1027800_a_out.txt.copy |more
>
> One possible output is below (every time you run this on a kernel starting from commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 the results are sligtly different). As you can see lines corresponding to records 17861-17862 have been erroneourly replaced with lines 17858-17859, and so on.
>
> --- gen1027800_expected.txt 2024-04-26 14:39:22.287121535 +0100
> +++ gen1027800_a_out.txt.copy 2024-05-13 15:52:48.915952662 +0100
> @@ -17858,8 +17858,8 @@
> 000000000000017858
> 000000000000017859
> 000000000000017860
> -000000000000017861
> -000000000000017862
> +000000000000017858
> +000000000000017859
> 000000000000017863
> 000000000000017864
> 000000000000017865
> @@ -51261,7 +51261,7 @@
> 000000000000051261
> 000000000000051262
> 000000000000051263
> -000000000000051264
> +0000000000000512651264
> 000000000000051265
> 000000000000051266
> 000000000000051267
> @@ -104576,9 +104576,9 @@
> 000000000000104576
> 000000000000104577
> 000000000000104578
> -000000000000104579
> -000000000000104580
> -000000000000104581
> +000000000000104576
> +000000000000104577
> +000000000000104578
> 000000000000104582
> 000000000000104583
> 000000000000104584
> @@ -110897,8 +110897,8 @@
> 000000000000110897
> 000000000000110898
> 000000000000110899
> -000000000000110900
> -000000000000110901
> +000000000000110898
> +000000000000110899
> 000000000000110902
> 000000000000110903
> 000000000000110904
> @@ -279673,9 +279673,9 @@
> 000000000000279673
> 000000000000279674
> 000000000000279675
> -000000000000279676
> -000000000000279677
> -000000000000279678

See Bugzilla link for the reproducer mentioned.

IMO, this is quite different from the case fixed by 56c14fb4086b2d ("tty: Fix
lookahead_buf crash with serdev").

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (5.69 kB)
signature.asc (235.00 B)
Download all attachments

2024-05-14 09:11:20

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [regression] [bisected] commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 corrupts data sent via pseudoterminal device

On Tue, May 14, 2024 at 8:00 AM Bagas Sanjaya <[email protected]> wrote:
>
> Hi,
>
> <[email protected]> reported on Bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=218834) pseudoterminal data
> corruption regression. He wrote:
>
> > Hello. There appears to be a regression in pseudoterminal support introduced by commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0.
> >
> > We have a wrapper program that uses a pseudoterminal to transform data fed into input/read from output of a child process. Transformed data is written to the master device, the child process, whose stdin/stdout are mapped to the slave device, reads data from parent by reading from its stdin, processes that data and sends it back to parent by writing the response to its stdout, and then data from the child process is read by the wrapper from pseudoterminal master descriptor.
> >
> > This used to work fine on various Linux distros such as Amazon Linux2, SLES 12/15, openSUSE 15, RHEL 8/9 with pre 6.0 kernels. However, running our regression suite on Amazon Linux 2023 which uses a 6.x kernel revealed regression. Sometimes, when a lot of data is written by the parent process into the master device's descriptor, some data is lost or corrupted when the child process is reading it from the slave device's descriptor. We've verified that this is a kernel regression introduced via commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 by performing kernel bisection as described in https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html#introguide-bissbs. Bisection was performed using openSUSE 15.5 in VirtualBox (so the kernels had the tainted flag set) but the issue can also be reproduced outside of VirtualBox.
> >
> > Bisection logs and a repro test case used to perform bisection are in the repro+bisect.zip attachment.
> >
> > -----------------------------------------------------------------------------------
> > The test script, ptybug.sh, pipes a large plain text file containing 1027800 plain text records separated by newlines into the wrapper program. Each record is a 0-padded record number padded with spaces to 80 bytes.
> >
> > The wrapper program sends the unmodified data read from its stdin to the child program via a pseudoterminal, reads the output from the child and prints it on its stdout. Wrapper's stdout is redirected to a file which is then compared with expected output.
> >
> > The child process reads exactly 1000000 records from its stdin, strips off trailing whitespace, and prints the record on its stdout.
> >
> > Both the wrapper and the child process mirror the data written into the pseudoterminal master descriptor/read from the pseudoterminal slave descriptor into separate files - which are identical on kernels prior to commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 and which differ on kernels that include that commit.
> >
> > -----------------------------------------------------------------------------------
> > How to reproduce:
> >
> > tar xvf repro+bisect.tar.zstd
> > cd repro
> > ./ptybug.sh
> >
> > Expected output (on 5.19.0-rc1* kernels prior to commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0)
> > rm -f ptybug ptybug_child gen *.o
> > cc ptybug.c -o ptybug
> > cc ptybug_child.c -o ptybug_child
> > cc gen.c -o gen
> > Test run 1 ...
> > Test run 2 ...
> > Test run 3 ...
> > Test run 4 ...
> > Test run 5 ...
> > PASS
> >
> > Actual/bad output (on 5.19.0-rc1* starting from commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 and later kernels, including latest)
> > rm -f ptybug ptybug_child gen *.o
> > cc ptybug.c -o ptybug
> > cc ptybug_child.c -o ptybug_child
> > cc gen.c -o gen
> > Test run 1 ...
> > FAIL Actual and expected output does not match (gen1027800)
> > FAIL
> >
> >
> > To see the difference between data written by the master into the pseudoterminal master descriptor and data read by the child from the pseudoterminal slace descriptor (stripping off trailing spaces) run:
> > diff -u gen1027800_sysin.txt.nospace gen1027800_a_out.txt.copy |more
> >
> > One possible output is below (every time you run this on a kernel starting from commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 the results are sligtly different). As you can see lines corresponding to records 17861-17862 have been erroneourly replaced with lines 17858-17859, and so on.
> >
> > --- gen1027800_expected.txt 2024-04-26 14:39:22.287121535 +0100
> > +++ gen1027800_a_out.txt.copy 2024-05-13 15:52:48.915952662 +0100
> > @@ -17858,8 +17858,8 @@
> > 000000000000017858
> > 000000000000017859
> > 000000000000017860
> > -000000000000017861
> > -000000000000017862
> > +000000000000017858
> > +000000000000017859
> > 000000000000017863
> > 000000000000017864
> > 000000000000017865
> > @@ -51261,7 +51261,7 @@
> > 000000000000051261
> > 000000000000051262
> > 000000000000051263
> > -000000000000051264
> > +0000000000000512651264
> > 000000000000051265
> > 000000000000051266
> > 000000000000051267
> > @@ -104576,9 +104576,9 @@
> > 000000000000104576
> > 000000000000104577
> > 000000000000104578
> > -000000000000104579
> > -000000000000104580
> > -000000000000104581
> > +000000000000104576
> > +000000000000104577
> > +000000000000104578
> > 000000000000104582
> > 000000000000104583
> > 000000000000104584
> > @@ -110897,8 +110897,8 @@
> > 000000000000110897
> > 000000000000110898
> > 000000000000110899
> > -000000000000110900
> > -000000000000110901
> > +000000000000110898
> > +000000000000110899
> > 000000000000110902
> > 000000000000110903
> > 000000000000110904
> > @@ -279673,9 +279673,9 @@
> > 000000000000279673
> > 000000000000279674
> > 000000000000279675
> > -000000000000279676
> > -000000000000279677
> > -000000000000279678
>
> See Bugzilla link for the reproducer mentioned.
>
> IMO, this is quite different from the case fixed by 56c14fb4086b2d ("tty: Fix
> lookahead_buf crash with serdev").

What's the output of `stty -a -F /dev/...` (dunno if you can do it on
pty, but at least control tty can do it)?
Or can you share the pty configuration done for this? (I haven't
looked into any shared code, maybe it's already there)


--
With Best Regards,
Andy Shevchenko