Hi,
I am posting as a representative of the NixOS Linux distribution,
about a userspace regression on 5.0-rc* which recently was backported
to the 4.14.99, 4.19.21 and 4.20.8 current LTS and stable versions.
The issue has been reported to the bug tracker, bug 202497, but seems
to have gone unnoticed by the maintainers.
The issue seems to break userspace for long-standing patterns in the
NixOS distribution, with regards to use of the shebangs.
Here is an example shebang causing an issue:
#! /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
-I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-File-Slurp-9999.25/lib/perl5/site_perl
-I/nix/store/ha8v67sl8dac92r9z07vzr4gv1y9nwqz-perl5.28.1-Net-DBus-1.1.0/lib/perl5/site_perl
-I/nix/store/dcrkvnjmwh69ljsvpbdjjdnqgwx90a9d-perl5.28.1-XML-Parser-2.44/lib/perl5/site_perl
-I/nix/store/rmji88k2zz7h4zg97385bygcydrf2q8h-perl5.28.1-XML-Twig-3.52/lib/perl5/site_perl
(The shebang was artificially wrapped spaces replaced by newlines)
Another contributor tracked the regression it to commit
8099b047ecc431518b9bb6bdbba3549bbecdc343 in the 5.0-rc* tree.
I bring no particular fix to the issue, but I believe it should at
least be fast-tracked to a revert for the stable and LTS branches, and
since 5.0 might drop soon, a solution worked on, or possibly a revert
until one is figured out.
Thanks,
--
— Samuel Dionne-Riel
Sorry, had an issue when sending the e-mail and lost a bit of context
at the end, here's the missing bit:
---
[...]
Note: I wish to be CC'ed for answers and comments as I am not
subscribed to the mailing list, thanks.
References:
* NixOS bug 53672, https://github.com/NixOS/nixpkgs/issues/53672
* Bug 202497, https://bugzilla.kernel.org/show_bug.cgi?id=202497
Thanks,
On 13/02/2019, Samuel Dionne-Riel <[email protected]> wrote:
> Hi,
>
> I am posting as a representative of the NixOS Linux distribution,
> about a userspace regression on 5.0-rc* which recently was backported
> to the 4.14.99, 4.19.21 and 4.20.8 current LTS and stable versions.
> The issue has been reported to the bug tracker, bug 202497, but seems
> to have gone unnoticed by the maintainers.
>
> The issue seems to break userspace for long-standing patterns in the
> NixOS distribution, with regards to use of the shebangs.
>
> Here is an example shebang causing an issue:
>
> #! /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
> -I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-File-Slurp-9999.25/lib/perl5/site_perl
> -I/nix/store/ha8v67sl8dac92r9z07vzr4gv1y9nwqz-perl5.28.1-Net-DBus-1.1.0/lib/perl5/site_perl
> -I/nix/store/dcrkvnjmwh69ljsvpbdjjdnqgwx90a9d-perl5.28.1-XML-Parser-2.44/lib/perl5/site_perl
> -I/nix/store/rmji88k2zz7h4zg97385bygcydrf2q8h-perl5.28.1-XML-Twig-3.52/lib/perl5/site_perl
>
> (The shebang was artificially wrapped spaces replaced by newlines)
>
> Another contributor tracked the regression it to commit
> 8099b047ecc431518b9bb6bdbba3549bbecdc343 in the 5.0-rc* tree.
>
> I bring no particular fix to the issue, but I believe it should at
> least be fast-tracked to a revert for the stable and LTS branches, and
> since 5.0 might drop soon, a solution worked on, or possibly a revert
> until one is figured out.
>
> Thanks,
>
> --
> — Samuel Dionne-Riel
>
--
— Samuel Dionne-Riel
[CC'in relevant folks]
On Thu, Feb 14, 2019 at 12:19 AM Samuel Dionne-Riel
<[email protected]> wrote:
>
> Hi,
>
> I am posting as a representative of the NixOS Linux distribution,
> about a userspace regression on 5.0-rc* which recently was backported
> to the 4.14.99, 4.19.21 and 4.20.8 current LTS and stable versions.
> The issue has been reported to the bug tracker, bug 202497, but seems
> to have gone unnoticed by the maintainers.
>
> The issue seems to break userspace for long-standing patterns in the
> NixOS distribution, with regards to use of the shebangs.
>
> Here is an example shebang causing an issue:
>
> #! /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
> -I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-File-Slurp-9999.25/lib/perl5/site_perl
> -I/nix/store/ha8v67sl8dac92r9z07vzr4gv1y9nwqz-perl5.28.1-Net-DBus-1.1.0/lib/perl5/site_perl
> -I/nix/store/dcrkvnjmwh69ljsvpbdjjdnqgwx90a9d-perl5.28.1-XML-Parser-2.44/lib/perl5/site_perl
> -I/nix/store/rmji88k2zz7h4zg97385bygcydrf2q8h-perl5.28.1-XML-Twig-3.52/lib/perl5/site_perl
This this ever work correctly? It is longer than BINPRM_BUF_SIZE.
> (The shebang was artificially wrapped spaces replaced by newlines)
>
> Another contributor tracked the regression it to commit
> 8099b047ecc431518b9bb6bdbba3549bbecdc343 in the 5.0-rc* tree.
>
> I bring no particular fix to the issue, but I believe it should at
> least be fast-tracked to a revert for the stable and LTS branches, and
> since 5.0 might drop soon, a solution worked on, or possibly a revert
> until one is figured out.
Your shebang line exceeds BINPRM_BUF_SIZE.
Before the said commit the kernel silently truncated the shebang line
(and corrupted it),
now it tells the user that the line is too long.
Thanks,
//richard
Thanks for CC'ing relevant folks.
On 13/02/2019, Richard Weinberger <[email protected]> wrote:
> [CC'in relevant folks]
>
> On Thu, Feb 14, 2019 at 12:19 AM Samuel Dionne-Riel
> <[email protected]> wrote:
>>
>> Here is an example shebang causing an issue:
>>
>> #! /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
>> -I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-File-Slurp-9999.25/lib/perl5/site_perl
>> -I/nix/store/ha8v67sl8dac92r9z07vzr4gv1y9nwqz-perl5.28.1-Net-DBus-1.1.0/lib/perl5/site_perl
>> -I/nix/store/dcrkvnjmwh69ljsvpbdjjdnqgwx90a9d-perl5.28.1-XML-Parser-2.44/lib/perl5/site_perl
>> -I/nix/store/rmji88k2zz7h4zg97385bygcydrf2q8h-perl5.28.1-XML-Twig-3.52/lib/perl5/site_perl
>
> This this ever work correctly? It is longer than BINPRM_BUF_SIZE.
>
Yes, this is longer than BINPRM_BUF_SIZE. This worked when the
interpreter knew to re-read the shebang, which among other
interpreters, perl did.
https://perl5.git.perl.org/perl.git/blob/04db542212fdad3a62f13afe741c99028f4bf799:/toke.c#l5524
>> (The shebang was artificially wrapped spaces replaced by newlines)
>>
>> Another contributor tracked the regression it to commit
>> 8099b047ecc431518b9bb6bdbba3549bbecdc343 in the 5.0-rc* tree.
>>
>> I bring no particular fix to the issue, but I believe it should at
>> least be fast-tracked to a revert for the stable and LTS branches, and
>> since 5.0 might drop soon, a solution worked on, or possibly a revert
>> until one is figured out.
>
> Your shebang line exceeds BINPRM_BUF_SIZE.
> Before the said commit the kernel silently truncated the shebang line
> (and corrupted it),
> now it tells the user that the line is too long.
>
Yes, the shebang line exceeds BINPRM_BUF_SIZE.
Before, the interpreter was still used (assuming it wasn't cut by the
length), and the interpreter was free to re-read the shebang if
desired.
With the new behaviour, instead of executing the interpreter with a
possibly truncated command line, the default script interpreter is
used, meaning that the perl script is being interpreted by the wrong
interpreter, giving copious amounts of irrelevant errors. This is not
telling the user that the line is too long.
This provably breaks the userland as under 4.14.98 the perl
interpreter is used when running the script, and under 4.14.99 bash is
used when running the script.
For a given script with a shebang of #!(124*A)ZBBB:
The behaviour as of before the regression:
sh: ./foolish: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAZ:
bad interpreter: No such file or directory
As expected, it cuts the shebang.
The behaviour following the change is the script ends up being
executed by the shell interpreter, as if there was no shebang, and +x.
Quoted from `man 3 exec`
> If the header of a file isn't recognized (the attempted execve(2) failed with the error ENOEXEC), these functions will execute the shell (/bin/sh) with the path of the file as its first argument. (If this attempt fails, no further searching is done.)
The now returned ENOEXEC might be "more right", but changes the
semantics of a truncated shebang.
Here I am not debating the validity of using a truncated shebang, but
showing that the userspace behaviour changed in an unexpected and
breaking behaviour.
Am I under a wrong assumption that the kernel shouldn't break userspace?
--
— Samuel Dionne-Riel
On Wed, Feb 13, 2019 at 3:36 PM Richard Weinberger
<[email protected]> wrote:
>
> [CC'in relevant folks]
>
> On Thu, Feb 14, 2019 at 12:19 AM Samuel Dionne-Riel
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I am posting as a representative of the NixOS Linux distribution,
> > about a userspace regression on 5.0-rc* which recently was backported
> > to the 4.14.99, 4.19.21 and 4.20.8 current LTS and stable versions.
> > The issue has been reported to the bug tracker, bug 202497, but seems
> > to have gone unnoticed by the maintainers.
> >
> > The issue seems to break userspace for long-standing patterns in the
> > NixOS distribution, with regards to use of the shebangs.
> >
> > Here is an example shebang causing an issue:
> >
> > #! /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
> > -I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-File-Slurp-9999.25/lib/perl5/site_perl
> > -I/nix/store/ha8v67sl8dac92r9z07vzr4gv1y9nwqz-perl5.28.1-Net-DBus-1.1.0/lib/perl5/site_perl
> > -I/nix/store/dcrkvnjmwh69ljsvpbdjjdnqgwx90a9d-perl5.28.1-XML-Parser-2.44/lib/perl5/site_perl
> > -I/nix/store/rmji88k2zz7h4zg97385bygcydrf2q8h-perl5.28.1-XML-Twig-3.52/lib/perl5/site_perl
>
> This this ever work correctly? It is longer than BINPRM_BUF_SIZE.
>
> > (The shebang was artificially wrapped spaces replaced by newlines)
> >
> > Another contributor tracked the regression it to commit
> > 8099b047ecc431518b9bb6bdbba3549bbecdc343 in the 5.0-rc* tree.
> >
> > I bring no particular fix to the issue, but I believe it should at
> > least be fast-tracked to a revert for the stable and LTS branches, and
> > since 5.0 might drop soon, a solution worked on, or possibly a revert
> > until one is figured out.
>
> Your shebang line exceeds BINPRM_BUF_SIZE.
> Before the said commit the kernel silently truncated the shebang line
> (and corrupted it),
> now it tells the user that the line is too long.
Yeah, it looks like it just truncates:
$ cat /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
#!/usr/bin/perl
print "Arg # 0 : $0\n";
$counter = 1;
foreach my $a (@ARGV) {
print "Arg # $counter : $a\n";
$counter++;
}
$ cat test.pl
#! /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
-I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-File-Slurp-9999.25/lib/perl5/site_perl
-I/nix/store/ha8v67sl8dac92r9z07vzr4gv1y9nwqz-perl5.28.1-Net-DBus-1.1.0/lib/perl5/site_perl
-I/nix/store/dcrkvnjmwh69ljsvpbdjjdnqgwx90a9d-perl5.28.1-XML-Parser-2.44/lib/perl5/site_perl
-I/nix/store/rmji88k2zz7h4zg97385bygcydrf2q8h-perl5.28.1-XML-Twig-3.52/lib/perl5/site_perl
print "I am the script\n";
4.20.7:
$ ./test.pl
Arg # 0 : /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
Arg # 1 : -I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-Fi
Arg # 2 : ./test.pl
4.20.8:
$ ./test.pl
Error: no such file "I am the script\n"
(My shell seems to fall back to direct shell execution)
Since this is breaking existing userspace, we should probably switch
back to the truncation, but do a WARN_ONCE or something so there's a
visible hint _somewhere_ about the (long standing) issue? What do you
think Oleg?
-Kees
--
Kees Cook
On Wed, Feb 13, 2019 at 4:41 PM Samuel Dionne-Riel
<[email protected]> wrote:
> Before, the interpreter was still used (assuming it wasn't cut by the
> length), and the interpreter was free to re-read the shebang if
> desired.
Oh awesome. Yeah, so, nevermind about the WARN_ONCE().
So, to address the "wrong binary" problem, how about we ENOEXEC only
if no newline or spaces are found in the string?
--
Kees Cook
On 13/02/2019, Kees Cook <[email protected]> wrote:
> On Wed, Feb 13, 2019 at 4:41 PM Samuel Dionne-Riel
> <[email protected]> wrote:
>> Before, the interpreter was still used (assuming it wasn't cut by the
>> length), and the interpreter was free to re-read the shebang if
>> desired.
>
> So, to address the "wrong binary" problem, how about we ENOEXEC only
> if no newline or spaces are found in the string?
>
If I understand right, you're asking whether it should return NOEXEC
if, of the first 128 bytes of the shebang, there are no spaces, but a
too long shebang? I wouldn't know for sure. The behaviour would
change. Instead failing due to trying to execute a shortened path, it
would fall back to the shell interpreter interpreting the file, which,
due to the inclusion of a specific shebang, might be a wrong
assumption still. Here I believe it's still in the "undefined
behaviour" territory, but one where it fails early for the userspace.
I don't have a strong opinion, but having a special case depending on
whitespace or not (well, possibility of the interpreter being
truncated or not) feels off. As an end-user, I would rather it
truncates, and show the truncated interpreter it tried to use
(behaviour before regression), rather than fail in a way where the
libc will continue executing using another unexpected interpreter.
Thinking in the principle of least astonishment, I would be less
surprised to see a truncated path on exec() than seeing exec() use an
unexpected interpreter.
--
— Samuel Dionne-Riel
On Wed, Feb 13, 2019 at 5:27 PM Samuel Dionne-Riel
<[email protected]> wrote:
> If I understand right, you're asking whether it should return NOEXEC
> if, of the first 128 bytes of the shebang, there are no spaces, but a
> too long shebang? I wouldn't know for sure. The behaviour would
> change. Instead failing due to trying to execute a shortened path, it
> would fall back to the shell interpreter interpreting the file, which,
> due to the inclusion of a specific shebang, might be a wrong
> assumption still. Here I believe it's still in the "undefined
> behaviour" territory, but one where it fails early for the userspace.
The original problem that was trying to be fixed here was to disallow
execution of a truncated interpreter path. It was assumed argument
truncate was just as bad, but it's not, since the interpreter can (and
does!) re-read the script to get the right arguments.
So, I've sent a fix-up patch that should disallow the path truncation,
but pass through the argument truncation as before. This passes all
the tests I built:
$ ls -l /AAA*/perl
-rwxr-xr-x 1 root root 129 Feb 13 17:17
/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl
-rwxr-xr-x 1 root root 129 Feb 13 17:17
/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl
-rwxr-xr-x 1 root root 129 Feb 13 17:17
/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl
-rwxr-xr-x 1 root root 129 Feb 13 17:17
/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl
-rwxr-xr-x 1 root root 129 Feb 13 17:17
/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl
$ ./test.pl
Arg # 0 : /nix/store/mbwav8kz8b3y471wjsybgzw84mrh4js9-perl-5.28.1/bin/perl
Arg # 1 : -I/nix/store/x6yyav38jgr924nkna62q3pkp0dgmzlx-perl5.28.1-Fi
Arg # 2 : ./test.pl
$ ./AAAA.pl
Error: no such file "I should fail to run huge interp\n"
$ ./A128.pl
Error: no such file "I should fail to run 128 byte buf interp\n"
$ ./A127.pl
Error: no such file "I should fail to run 127 byte buf interp\n"
$ ./A126.pl
Arg # 0 : '/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl'
Arg # 1 : './A126.pl'
$ ./A125space.pl
Arg # 0 : '/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/perl'
Arg # 1 : './A125space.pl'
Are you able to test the patch and report back?
Thanks again for bringing this to our attention!
--
Kees Cook
On 13/02/2019, Kees Cook <[email protected]> wrote:
> The original problem that was trying to be fixed here was to disallow
> execution of a truncated interpreter path. It was assumed argument
> truncate was just as bad, but it's not, since the interpreter can (and
> does!) re-read the script to get the right arguments.
>
> So, I've sent a fix-up patch that should disallow the path truncation,
> but pass through the argument truncation as before. This passes all
> the tests I built:
>
> [...]
>
> Are you able to test the patch and report back?
The patch works as implemented. It also fixes the specific regression
which affected NixOS. This was verified to work once applied to 4.14
in our testing infra. Confidence is high enough that I don't think I
need to test other LTS/stable versions.
Though, I have one minor doubt in mind. Looking at man 2 execve,
ENOEXEC
An executable is not in a recognized format, is for the wrong
architecture, or has some other format error that means it cannot
be executed.
I can see "or some other format error" could be misapplied to mean
ENOEXEC on failure to read the shebang, but I'm thinking it's kinda
abusing the meaning behind the failure. The format was recognized, as
a shebang, but it was impossible to use the shebang. If I were to
misuse an error code, I would probably misuse ENAMETOOLONG. I'm still
doubting ENOEXEC is safe to not cause issues since a truncated
interpreter name (not shebang) will end up with a different behaviour
than expected in the exec(3) userspace scenario, where with ENOEXEC
the shell will be used instead of failing.
Though, this is a different face to the same root regression reported
here; our initial issue with the regression can be deemed fully
resolved with the patch.
In all cases, I think the man page will need an update to describe the
new failure mode with truncated shebangs, and describe the non-failure
mode when truncating arguments.
> Thanks again for bringing this to our attention!
Thanks for the quick turnaround!
--
— Samuel Dionne-Riel
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
<[email protected]> wrote:
>
> Your shebang line exceeds BINPRM_BUF_SIZE.
> Before the said commit the kernel silently truncated the shebang line
> (and corrupted it),
> now it tells the user that the line is too long.
It doesn't matter if it "corrupted" things by truncating it. All that
matters is "it used to work, now it doesn't"
Yes, maybe it never *should* have worked. And yes, it's sad that
people apparently had cases that depended on this odd behavior, but
there we are.
I see that Kees has a patch to fix it up.
Linus
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> <[email protected]> wrote:
> >
> > Your shebang line exceeds BINPRM_BUF_SIZE.
> > Before the said commit the kernel silently truncated the shebang line
> > (and corrupted it),
> > now it tells the user that the line is too long.
>
> It doesn't matter if it "corrupted" things by truncating it. All that
> matters is "it used to work, now it doesn't"
>
> Yes, maybe it never *should* have worked. And yes, it's sad that
> people apparently had cases that depended on this odd behavior, but
> there we are.
>
> I see that Kees has a patch to fix it up.
>
Greg, I think we have a problem here.
8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
string") wasn't marked for backporting. And, presumably as a
consequence, Kees's fix "exec: load_script: allow interpreter argument
truncation" was not marked for backporting.
8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
I don't know if Oleg considered backporting that patch. I certainly
did (I always do), and I decided against doing so. Yet there it is.
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
>
> > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > <[email protected]> wrote:
> > >
> > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > Before the said commit the kernel silently truncated the shebang line
> > > (and corrupted it),
> > > now it tells the user that the line is too long.
> >
> > It doesn't matter if it "corrupted" things by truncating it. All that
> > matters is "it used to work, now it doesn't"
> >
> > Yes, maybe it never *should* have worked. And yes, it's sad that
> > people apparently had cases that depended on this odd behavior, but
> > there we are.
> >
> > I see that Kees has a patch to fix it up.
> >
>
> Greg, I think we have a problem here.
>
> 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> string") wasn't marked for backporting. And, presumably as a
> consequence, Kees's fix "exec: load_script: allow interpreter argument
> truncation" was not marked for backporting.
>
> 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
It came in 5.0-rc1, so it fits the "in a Linus released kernel"
requirement. If we are to wait until it shows up in a -final, that
would be months too late for almost all of these types of patches that
are picked up.
> I don't know if Oleg considered backporting that patch. I certainly
> did (I always do), and I decided against doing so. Yet there it is.
This came in through Sasha's tools, which give people a week or so to
say "hey, this isn't a stable patch!" and it seems everyone ignored that
:(
Where is Kees's fix? I'll be glad to queue it up, or just revert the
above commit, which ever people think is easiest.
thanks,
greg k-h
On Fri, Feb 15, 2019 at 08:00:22AM +0100, Greg Kroah-Hartman wrote:
> On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> >
> > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > <[email protected]> wrote:
> > > >
> > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > Before the said commit the kernel silently truncated the shebang line
> > > > (and corrupted it),
> > > > now it tells the user that the line is too long.
> > >
> > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > matters is "it used to work, now it doesn't"
> > >
> > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > people apparently had cases that depended on this odd behavior, but
> > > there we are.
> > >
> > > I see that Kees has a patch to fix it up.
> > >
> >
> > Greg, I think we have a problem here.
> >
> > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > string") wasn't marked for backporting. And, presumably as a
> > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > truncation" was not marked for backporting.
> >
> > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
>
> It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> requirement. If we are to wait until it shows up in a -final, that
> would be months too late for almost all of these types of patches that
> are picked up.
>
> > I don't know if Oleg considered backporting that patch. I certainly
> > did (I always do), and I decided against doing so. Yet there it is.
>
> This came in through Sasha's tools, which give people a week or so to
> say "hey, this isn't a stable patch!" and it seems everyone ignored that
> :(
>
> Where is Kees's fix? I'll be glad to queue it up, or just revert the
> above commit, which ever people think is easiest.
Ah, I see the fix now, _after_ I just pushed out a bunch of stable
releases. I'll go queue it up and push it out with just that fix in it
now...
thanks,
greg k-h
On Fri 15-02-19 08:00:22, Greg KH wrote:
> On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> >
> > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > <[email protected]> wrote:
> > > >
> > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > Before the said commit the kernel silently truncated the shebang line
> > > > (and corrupted it),
> > > > now it tells the user that the line is too long.
> > >
> > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > matters is "it used to work, now it doesn't"
> > >
> > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > people apparently had cases that depended on this odd behavior, but
> > > there we are.
> > >
> > > I see that Kees has a patch to fix it up.
> > >
> >
> > Greg, I think we have a problem here.
> >
> > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > string") wasn't marked for backporting. And, presumably as a
> > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > truncation" was not marked for backporting.
> >
> > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
>
> It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> requirement. If we are to wait until it shows up in a -final, that
> would be months too late for almost all of these types of patches that
> are picked up.
rc1 is just a too early. Waiting few more rcs or even a final release
for something that people do not see as an issue should be just fine.
Consider this particular patch and tell me why it had to be rushed in
the first place. The original code was broken for _years_ but I do not
remember anybody would be complaining.
> > I don't know if Oleg considered backporting that patch. I certainly
> > did (I always do), and I decided against doing so. Yet there it is.
>
> This came in through Sasha's tools, which give people a week or so to
> say "hey, this isn't a stable patch!" and it seems everyone ignored that
> :(
I thought we were through this already. Automagic autoselection of
patches in the core kernel (or mmotm tree patches in particular) is too
dangerous. We try hard to consider each and every patch for stable. Even
if something slips through then it is much more preferred to ask for a
stable backport in the respective email thread and wait for a conclusion
before adding it.
--
Michal Hocko
SUSE Labs
On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
> On Fri 15-02-19 08:00:22, Greg KH wrote:
> > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> > >
> > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > > <[email protected]> wrote:
> > > > >
> > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > > Before the said commit the kernel silently truncated the shebang line
> > > > > (and corrupted it),
> > > > > now it tells the user that the line is too long.
> > > >
> > > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > > matters is "it used to work, now it doesn't"
> > > >
> > > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > > people apparently had cases that depended on this odd behavior, but
> > > > there we are.
> > > >
> > > > I see that Kees has a patch to fix it up.
> > > >
> > >
> > > Greg, I think we have a problem here.
> > >
> > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > > string") wasn't marked for backporting. And, presumably as a
> > > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > > truncation" was not marked for backporting.
> > >
> > > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
> >
> > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> > requirement. If we are to wait until it shows up in a -final, that
> > would be months too late for almost all of these types of patches that
> > are picked up.
>
> rc1 is just a too early. Waiting few more rcs or even a final release
> for something that people do not see as an issue should be just fine.
> Consider this particular patch and tell me why it had to be rushed in
> the first place. The original code was broken for _years_ but I do not
> remember anybody would be complaining.
This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
came out on Jan 6. Over a month delay.
> > > I don't know if Oleg considered backporting that patch. I certainly
> > > did (I always do), and I decided against doing so. Yet there it is.
> >
> > This came in through Sasha's tools, which give people a week or so to
> > say "hey, this isn't a stable patch!" and it seems everyone ignored that
> > :(
>
> I thought we were through this already. Automagic autoselection of
> patches in the core kernel (or mmotm tree patches in particular) is too
> dangerous. We try hard to consider each and every patch for stable. Even
> if something slips through then it is much more preferred to ask for a
> stable backport in the respective email thread and wait for a conclusion
> before adding it.
We have a list of blacklisted files/subsystems for people that do not
want this to happen to their area of the kernel. The patch seemed to
make sense, and it passed all known tests that we currently have.
Sometimes things will slip through like this, it happens. And really, a
3 day turn-around-time to resolve this is pretty good, don't you think?
It also seems like we need another test to catch this problem from ever
happening again :)
thanks,
greg k-h
On Fri 15-02-19 10:20:13, Greg KH wrote:
> On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
> > On Fri 15-02-19 08:00:22, Greg KH wrote:
> > > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> > > >
> > > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > > > Before the said commit the kernel silently truncated the shebang line
> > > > > > (and corrupted it),
> > > > > > now it tells the user that the line is too long.
> > > > >
> > > > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > > > matters is "it used to work, now it doesn't"
> > > > >
> > > > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > > > people apparently had cases that depended on this odd behavior, but
> > > > > there we are.
> > > > >
> > > > > I see that Kees has a patch to fix it up.
> > > > >
> > > >
> > > > Greg, I think we have a problem here.
> > > >
> > > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > > > string") wasn't marked for backporting. And, presumably as a
> > > > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > > > truncation" was not marked for backporting.
> > > >
> > > > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
> > >
> > > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> > > requirement. If we are to wait until it shows up in a -final, that
> > > would be months too late for almost all of these types of patches that
> > > are picked up.
> >
> > rc1 is just a too early. Waiting few more rcs or even a final release
> > for something that people do not see as an issue should be just fine.
> > Consider this particular patch and tell me why it had to be rushed in
> > the first place. The original code was broken for _years_ but I do not
> > remember anybody would be complaining.
>
> This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
> came out on Jan 6. Over a month delay.
Obviously not long enough.
> > > > I don't know if Oleg considered backporting that patch. I certainly
> > > > did (I always do), and I decided against doing so. Yet there it is.
> > >
> > > This came in through Sasha's tools, which give people a week or so to
> > > say "hey, this isn't a stable patch!" and it seems everyone ignored that
> > > :(
> >
> > I thought we were through this already. Automagic autoselection of
> > patches in the core kernel (or mmotm tree patches in particular) is too
> > dangerous. We try hard to consider each and every patch for stable. Even
> > if something slips through then it is much more preferred to ask for a
> > stable backport in the respective email thread and wait for a conclusion
> > before adding it.
>
> We have a list of blacklisted files/subsystems for people that do not
> want this to happen to their area of the kernel. The patch seemed to
> make sense, and it passed all known tests that we currently have.
Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But
this is one of the area where things that make sense might still break
because it is hard to assume what userspace depends on.
> Sometimes things will slip through like this, it happens. And really, a
> 3 day turn-around-time to resolve this is pretty good, don't you think?
Yes, but that doesn't make any difference on the fact that this was not
marked for stable and I still think this is not a stable material - at
least not at this moment.
> It also seems like we need another test to catch this problem from ever
> happening again :)
Agreed on this.
--
Michal Hocko
SUSE Labs
On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:
>On Fri 15-02-19 10:20:13, Greg KH wrote:
>> On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
>> > On Fri 15-02-19 08:00:22, Greg KH wrote:
>> > > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
>> > > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
>> > > >
>> > > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
>> > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
>> > > > > > Before the said commit the kernel silently truncated the shebang line
>> > > > > > (and corrupted it),
>> > > > > > now it tells the user that the line is too long.
>> > > > >
>> > > > > It doesn't matter if it "corrupted" things by truncating it. All that
>> > > > > matters is "it used to work, now it doesn't"
>> > > > >
>> > > > > Yes, maybe it never *should* have worked. And yes, it's sad that
>> > > > > people apparently had cases that depended on this odd behavior, but
>> > > > > there we are.
>> > > > >
>> > > > > I see that Kees has a patch to fix it up.
>> > > > >
>> > > >
>> > > > Greg, I think we have a problem here.
>> > > >
>> > > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
>> > > > string") wasn't marked for backporting. And, presumably as a
>> > > > consequence, Kees's fix "exec: load_script: allow interpreter argument
>> > > > truncation" was not marked for backporting.
>> > > >
>> > > > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
>> > > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
>> > >
>> > > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
>> > > requirement. If we are to wait until it shows up in a -final, that
>> > > would be months too late for almost all of these types of patches that
>> > > are picked up.
>> >
>> > rc1 is just a too early. Waiting few more rcs or even a final release
>> > for something that people do not see as an issue should be just fine.
>> > Consider this particular patch and tell me why it had to be rushed in
>> > the first place. The original code was broken for _years_ but I do not
>> > remember anybody would be complaining.
>>
>> This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
>> came out on Jan 6. Over a month delay.
>
>Obviously not long enough.
You're assuming that if we wouldn't have taken this patch to stable
somehow someone else would notice this bug and fix it.
What test do we have that would catch this? Which testsuite tests for
long shebang lines? Where is the test added together with this patch
that covers this and similar cases?
The fact is that many patches are not tested until they get to stable,
whether we add them the same week they went upstream or months later.
This is a great case for this: I doubt anyone but NixOS does this crazy
thing with shebang lines, so who else would discover the bug?
If this is indeed a case of us jumping the gun and shipping stuff too
early before all tests are complete, please point me to the test that we
missed and I'll make sure that for any future kernel release it gets run
before we ship a stable kernel.
>> > > > I don't know if Oleg considered backporting that patch. I certainly
>> > > > did (I always do), and I decided against doing so. Yet there it is.
>> > >
>> > > This came in through Sasha's tools, which give people a week or so to
>> > > say "hey, this isn't a stable patch!" and it seems everyone ignored that
>> > > :(
>> >
>> > I thought we were through this already. Automagic autoselection of
>> > patches in the core kernel (or mmotm tree patches in particular) is too
>> > dangerous. We try hard to consider each and every patch for stable. Even
>> > if something slips through then it is much more preferred to ask for a
>> > stable backport in the respective email thread and wait for a conclusion
>> > before adding it.
>>
>> We have a list of blacklisted files/subsystems for people that do not
>> want this to happen to their area of the kernel. The patch seemed to
>> make sense, and it passed all known tests that we currently have.
>
>Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But
>this is one of the area where things that make sense might still break
>because it is hard to assume what userspace depends on.
Great, so the solution is to just not take these things into stable at
all? The solution should be to add tests to the patches that go in there
to verify their correctness and that they don't regress in the future.
If you're really concerned about subsystems being brittle the solution
is to improve their testing rather push stuff in and hope nothing
explodes.
On one hand you Ack it saying it looks great to you and should be
merged, but on the other hand you're saying that you don't really trust
the patch?
Really, if I wouldn't pick this patch now what do you think would have
happened? It would just pop up in a few months as we roll our stable
kernel forward.
>> Sometimes things will slip through like this, it happens. And really, a
>> 3 day turn-around-time to resolve this is pretty good, don't you think?
>
>Yes, but that doesn't make any difference on the fact that this was not
>marked for stable and I still think this is not a stable material - at
>least not at this moment.
Hindsight is 20/20 :)
If people were good at understanding the impact and implications their
patch has on the kernel we would never introduce new bugs!
I'll happily list a bunch more patches where folks didn't think they're
stable material, but turned out to be important fixes.
--
Thanks,
Sasha
On Fri 15-02-19 10:19:12, Sasha Levin wrote:
> On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:
> > On Fri 15-02-19 10:20:13, Greg KH wrote:
> > > On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
> > > > On Fri 15-02-19 08:00:22, Greg KH wrote:
> > > > > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > > > > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> > > > > >
> > > > > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > > > > > Before the said commit the kernel silently truncated the shebang line
> > > > > > > > (and corrupted it),
> > > > > > > > now it tells the user that the line is too long.
> > > > > > >
> > > > > > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > > > > > matters is "it used to work, now it doesn't"
> > > > > > >
> > > > > > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > > > > > people apparently had cases that depended on this odd behavior, but
> > > > > > > there we are.
> > > > > > >
> > > > > > > I see that Kees has a patch to fix it up.
> > > > > > >
> > > > > >
> > > > > > Greg, I think we have a problem here.
> > > > > >
> > > > > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > > > > > string") wasn't marked for backporting. And, presumably as a
> > > > > > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > > > > > truncation" was not marked for backporting.
> > > > > >
> > > > > > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > > > > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
> > > > >
> > > > > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> > > > > requirement. If we are to wait until it shows up in a -final, that
> > > > > would be months too late for almost all of these types of patches that
> > > > > are picked up.
> > > >
> > > > rc1 is just a too early. Waiting few more rcs or even a final release
> > > > for something that people do not see as an issue should be just fine.
> > > > Consider this particular patch and tell me why it had to be rushed in
> > > > the first place. The original code was broken for _years_ but I do not
> > > > remember anybody would be complaining.
> > >
> > > This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
> > > came out on Jan 6. Over a month delay.
> >
> > Obviously not long enough.
>
> You're assuming that if we wouldn't have taken this patch to stable
> somehow someone else would notice this bug and fix it.
>
> What test do we have that would catch this? Which testsuite tests for
> long shebang lines? Where is the test added together with this patch
> that covers this and similar cases?
The test is the "users out there". Right now we do not have any
specialized test case because we haven't even realized it might break
something. The main difference between breaking on the bleeding edge vs.
stable tree is that people running on bleeding edge are more likely to
expect a breakage while stable users would most likely prefer to not be
guinea pigs and have, well stable trees.
[...]
> > > We have a list of blacklisted files/subsystems for people that do not
> > > want this to happen to their area of the kernel. The patch seemed to
> > > make sense, and it passed all known tests that we currently have.
> >
> > Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But
> > this is one of the area where things that make sense might still break
> > because it is hard to assume what userspace depends on.
>
> Great, so the solution is to just not take these things into stable at
> all?
No, but if the patch author and the maintainer have considered the
stable tree and haven't found convincing arguments to mark for stable
then it is likely that the patch doesn't need an urgent backporting.
> The solution should be to add tests to the patches that go in there
> to verify their correctness and that they don't regress in the future.
>
> If you're really concerned about subsystems being brittle the solution
> is to improve their testing rather push stuff in and hope nothing
> explodes.
>
> On one hand you Ack it saying it looks great to you and should be
> merged, but on the other hand you're saying that you don't really trust
> the patch?
No. But I didn't consider it a stable material. You just do not really
need all the patches in the stable, right? I have already said that this
code is there for ages and fixing it is good to have for future but
considering that nobody was really complaining then a backporting just
adds a risk and as it turned out that risk was really not zero.
> Really, if I wouldn't pick this patch now what do you think would have
> happened? It would just pop up in a few months as we roll our stable
> kernel forward.
and that would be a different kernel version and people kinda expect
bugs with newer versions. This is not the case with the stable update.
But I guess we are just repeating the same discussion over and over. Our
expectations about what the stable kernel should be differs a lot. I
would like to see fewer but only important fixes while you would like to
take as many fixes as possible.
--
Michal Hocko
SUSE Labs
On 15/02/2019, Michal Hocko <[email protected]> wrote:
> On Fri 15-02-19 10:19:12, Sasha Levin wrote:
>> On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:
>> > On Fri 15-02-19 10:20:13, Greg KH wrote:
>> > > On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
>> > > > On Fri 15-02-19 08:00:22, Greg KH wrote:
>> > > > > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
>> > > > > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds
>> > > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
>> > > > > > > <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
>> > > > > > > > Before the said commit the kernel silently truncated the
>> > > > > > > > shebang line
>> > > > > > > > (and corrupted it),
>> > > > > > > > now it tells the user that the line is too long.
>> > > > > > >
>> > > > > > > It doesn't matter if it "corrupted" things by truncating it.
>> > > > > > > All that
>> > > > > > > matters is "it used to work, now it doesn't"
>> > > > > > >
>> > > > > > > Yes, maybe it never *should* have worked. And yes, it's sad
>> > > > > > > that
>> > > > > > > people apparently had cases that depended on this odd
>> > > > > > > behavior, but
>> > > > > > > there we are.
>> > > > > > >
>> > > > > > > I see that Kees has a patch to fix it up.
>> > > > > > >
>> > > > > >
>> > > > > > Greg, I think we have a problem here.
>> > > > > >
>> > > > > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate
>> > > > > > shebang
>> > > > > > string") wasn't marked for backporting. And, presumably as a
>> > > > > > consequence, Kees's fix "exec: load_script: allow interpreter
>> > > > > > argument
>> > > > > > truncation" was not marked for backporting.
>> > > > > >
>> > > > > > 8099b047ecc431518 hasn't even appeared in a Linus released
>> > > > > > kernel, yet
>> > > > > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
>> > > > >
>> > > > > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
>> > > > > requirement. If we are to wait until it shows up in a -final,
>> > > > > that
>> > > > > would be months too late for almost all of these types of patches
>> > > > > that
>> > > > > are picked up.
>> > > >
>> > > > rc1 is just a too early. Waiting few more rcs or even a final
>> > > > release
>> > > > for something that people do not see as an issue should be just
>> > > > fine.
>> > > > Consider this particular patch and tell me why it had to be rushed
>> > > > in
>> > > > the first place. The original code was broken for _years_ but I do
>> > > > not
>> > > > remember anybody would be complaining.
>> > >
>> > > This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
>> > > came out on Jan 6. Over a month delay.
>> >
>> > Obviously not long enough.
>>
>> You're assuming that if we wouldn't have taken this patch to stable
>> somehow someone else would notice this bug and fix it.
>>
>> What test do we have that would catch this? Which testsuite tests for
>> long shebang lines? Where is the test added together with this patch
>> that covers this and similar cases?
>
> The test is the "users out there". Right now we do not have any
> specialized test case because we haven't even realized it might break
> something. The main difference between breaking on the bleeding edge vs.
> stable tree is that people running on bleeding edge are more likely to
> expect a breakage while stable users would most likely prefer to not be
> guinea pigs and have, well stable trees.
> [...]
>
>> > > We have a list of blacklisted files/subsystems for people that do not
>> > > want this to happen to their area of the kernel. The patch seemed to
>> > > make sense, and it passed all known tests that we currently have.
>> >
>> > Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But
>> > this is one of the area where things that make sense might still break
>> > because it is hard to assume what userspace depends on.
>>
>> Great, so the solution is to just not take these things into stable at
>> all?
>
> No, but if the patch author and the maintainer have considered the
> stable tree and haven't found convincing arguments to mark for stable
> then it is likely that the patch doesn't need an urgent backporting.
>
>> The solution should be to add tests to the patches that go in there
>> to verify their correctness and that they don't regress in the future.
>>
>> If you're really concerned about subsystems being brittle the solution
>> is to improve their testing rather push stuff in and hope nothing
>> explodes.
>>
>> On one hand you Ack it saying it looks great to you and should be
>> merged, but on the other hand you're saying that you don't really trust
>> the patch?
>
> No. But I didn't consider it a stable material. You just do not really
> need all the patches in the stable, right? I have already said that this
> code is there for ages and fixing it is good to have for future but
> considering that nobody was really complaining then a backporting just
> adds a risk and as it turned out that risk was really not zero.
>
I'm sorry to interject here, but the issue was reported on the
Kernel.org Bugzilla on February 2nd
- https://bugzilla.kernel.org/show_bug.cgi?id=202497
In the interest of better communication, if the need arises again, how
should bugs in the RC kernels be reported so they (1) are spotted by
the right maintainers and (2) not backported even though they were
reported as causing breaking changes?
>> Really, if I wouldn't pick this patch now what do you think would have
>> happened? It would just pop up in a few months as we roll our stable
>> kernel forward.
>
> and that would be a different kernel version and people kinda expect
> bugs with newer versions. This is not the case with the stable update.
>
> But I guess we are just repeating the same discussion over and over. Our
> expectations about what the stable kernel should be differs a lot. I
> would like to see fewer but only important fixes while you would like to
> take as many fixes as possible.
> --
> Michal Hocko
> SUSE Labs
>
--
— Samuel Dionne-Riel
On Fri, Feb 15, 2019 at 11:18:30AM -0500, Samuel Dionne-Riel wrote:
>I'm sorry to interject here, but the issue was reported on the
>Kernel.org Bugzilla on February 2nd
>
> - https://bugzilla.kernel.org/show_bug.cgi?id=202497
>
>In the interest of better communication, if the need arises again, how
>should bugs in the RC kernels be reported so they (1) are spotted by
>the right maintainers and (2) not backported even though they were
>reported as causing breaking changes?
Sadly our bugzilla is rarely used, even though the information in that
particular bug report is perfect.
Maybe pinging LKML and [email protected] would be enough, specially
if you know that it's a stable commit that caused the regression.
--
Thanks,
Sasha
On Fri, Feb 15, 2019 at 04:52:00PM +0100, Michal Hocko wrote:
>On Fri 15-02-19 10:19:12, Sasha Levin wrote:
>> On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:
>> > On Fri 15-02-19 10:20:13, Greg KH wrote:
>> > > On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
>> > > > On Fri 15-02-19 08:00:22, Greg KH wrote:
>> > > > > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
>> > > > > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
>> > > > > >
>> > > > > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
>> > > > > > > <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
>> > > > > > > > Before the said commit the kernel silently truncated the shebang line
>> > > > > > > > (and corrupted it),
>> > > > > > > > now it tells the user that the line is too long.
>> > > > > > >
>> > > > > > > It doesn't matter if it "corrupted" things by truncating it. All that
>> > > > > > > matters is "it used to work, now it doesn't"
>> > > > > > >
>> > > > > > > Yes, maybe it never *should* have worked. And yes, it's sad that
>> > > > > > > people apparently had cases that depended on this odd behavior, but
>> > > > > > > there we are.
>> > > > > > >
>> > > > > > > I see that Kees has a patch to fix it up.
>> > > > > > >
>> > > > > >
>> > > > > > Greg, I think we have a problem here.
>> > > > > >
>> > > > > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
>> > > > > > string") wasn't marked for backporting. And, presumably as a
>> > > > > > consequence, Kees's fix "exec: load_script: allow interpreter argument
>> > > > > > truncation" was not marked for backporting.
>> > > > > >
>> > > > > > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
>> > > > > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
>> > > > >
>> > > > > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
>> > > > > requirement. If we are to wait until it shows up in a -final, that
>> > > > > would be months too late for almost all of these types of patches that
>> > > > > are picked up.
>> > > >
>> > > > rc1 is just a too early. Waiting few more rcs or even a final release
>> > > > for something that people do not see as an issue should be just fine.
>> > > > Consider this particular patch and tell me why it had to be rushed in
>> > > > the first place. The original code was broken for _years_ but I do not
>> > > > remember anybody would be complaining.
>> > >
>> > > This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
>> > > came out on Jan 6. Over a month delay.
>> >
>> > Obviously not long enough.
>>
>> You're assuming that if we wouldn't have taken this patch to stable
>> somehow someone else would notice this bug and fix it.
>>
>> What test do we have that would catch this? Which testsuite tests for
>> long shebang lines? Where is the test added together with this patch
>> that covers this and similar cases?
>
>The test is the "users out there". Right now we do not have any
>specialized test case because we haven't even realized it might break
>something. The main difference between breaking on the bleeding edge vs.
>stable tree is that people running on bleeding edge are more likely to
>expect a breakage while stable users would most likely prefer to not be
>guinea pigs and have, well stable trees.
>[...]
Exactly, and my argument here is that no one really tests Linus's tree.
Sure, folks run -rc kernels and report bugs, but no one actually runs
these kernels at larger scales.
Most "users out there" wouldn't see this patch until it ends up in a
stable kernel.
>> > > We have a list of blacklisted files/subsystems for people that do not
>> > > want this to happen to their area of the kernel. The patch seemed to
>> > > make sense, and it passed all known tests that we currently have.
>> >
>> > Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But
>> > this is one of the area where things that make sense might still break
>> > because it is hard to assume what userspace depends on.
>>
>> Great, so the solution is to just not take these things into stable at
>> all?
>
>No, but if the patch author and the maintainer have considered the
>stable tree and haven't found convincing arguments to mark for stable
>then it is likely that the patch doesn't need an urgent backporting.
Are you suggesting that waiting longer would somehow made this "safer"?
This goes back to my argument above.
>> The solution should be to add tests to the patches that go in there
>> to verify their correctness and that they don't regress in the future.
>>
>> If you're really concerned about subsystems being brittle the solution
>> is to improve their testing rather push stuff in and hope nothing
>> explodes.
>>
>> On one hand you Ack it saying it looks great to you and should be
>> merged, but on the other hand you're saying that you don't really trust
>> the patch?
>
>No. But I didn't consider it a stable material. You just do not really
>need all the patches in the stable, right? I have already said that this
>code is there for ages and fixing it is good to have for future but
>considering that nobody was really complaining then a backporting just
>adds a risk and as it turned out that risk was really not zero.
>
>> Really, if I wouldn't pick this patch now what do you think would have
>> happened? It would just pop up in a few months as we roll our stable
>> kernel forward.
>
>and that would be a different kernel version and people kinda expect
>bugs with newer versions. This is not the case with the stable update.
>
>But I guess we are just repeating the same discussion over and over. Our
>expectations about what the stable kernel should be differs a lot. I
>would like to see fewer but only important fixes while you would like to
>take as many fixes as possible.
Maybe to clarify here: I don't want to blindly take as much patches as I
can. I want to take patches based on testing results: if something looks
like a fix and it passes all our tests, there shouldn't be a reason not
to take it.
My view is that humans are terrible at writing and understanding code:
if folks fully understood the impact of their patches we would never
have bugs, right? Assuming we both agree here that we make mistakes and
introduce bugs, why do you think that these very same people fully
understand whether a patch should go in stable or not?
The approach of manually deciding if a patch needs to go in stable is
wrong and it doesn't scale. We need to beef up our testing story and
make these decisions based off of that, and not our error-prone brains
that introduced these bugs to begin with.
Look at the outcome of this very issue: people sprung into action and
fixed this bug quickly, but how many tests were added as a result of
this? How do we know it's not going to regress again?
--
Thanks,
Sasha
On Fri 15-02-19 13:00:26, Sasha Levin wrote:
> On Fri, Feb 15, 2019 at 04:52:00PM +0100, Michal Hocko wrote:
> > On Fri 15-02-19 10:19:12, Sasha Levin wrote:
> > > On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:
> > > > On Fri 15-02-19 10:20:13, Greg KH wrote:
> > > > > On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:
> > > > > > On Fri 15-02-19 08:00:22, Greg KH wrote:
> > > > > > > On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > > > > > > > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > > > > > > > Before the said commit the kernel silently truncated the shebang line
> > > > > > > > > > (and corrupted it),
> > > > > > > > > > now it tells the user that the line is too long.
> > > > > > > > >
> > > > > > > > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > > > > > > > matters is "it used to work, now it doesn't"
> > > > > > > > >
> > > > > > > > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > > > > > > > people apparently had cases that depended on this odd behavior, but
> > > > > > > > > there we are.
> > > > > > > > >
> > > > > > > > > I see that Kees has a patch to fix it up.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Greg, I think we have a problem here.
> > > > > > > >
> > > > > > > > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > > > > > > > string") wasn't marked for backporting. And, presumably as a
> > > > > > > > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > > > > > > > truncation" was not marked for backporting.
> > > > > > > >
> > > > > > > > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > > > > > > > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
> > > > > > >
> > > > > > > It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> > > > > > > requirement. If we are to wait until it shows up in a -final, that
> > > > > > > would be months too late for almost all of these types of patches that
> > > > > > > are picked up.
> > > > > >
> > > > > > rc1 is just a too early. Waiting few more rcs or even a final release
> > > > > > for something that people do not see as an issue should be just fine.
> > > > > > Consider this particular patch and tell me why it had to be rushed in
> > > > > > the first place. The original code was broken for _years_ but I do not
> > > > > > remember anybody would be complaining.
> > > > >
> > > > > This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1
> > > > > came out on Jan 6. Over a month delay.
> > > >
> > > > Obviously not long enough.
> > >
> > > You're assuming that if we wouldn't have taken this patch to stable
> > > somehow someone else would notice this bug and fix it.
> > >
> > > What test do we have that would catch this? Which testsuite tests for
> > > long shebang lines? Where is the test added together with this patch
> > > that covers this and similar cases?
> >
> > The test is the "users out there". Right now we do not have any
> > specialized test case because we haven't even realized it might break
> > something. The main difference between breaking on the bleeding edge vs.
> > stable tree is that people running on bleeding edge are more likely to
> > expect a breakage while stable users would most likely prefer to not be
> > guinea pigs and have, well stable trees.
> > [...]
>
> Exactly, and my argument here is that no one really tests Linus's tree.
I would beg to disagree. The testing coverage is smaller of course
because most people are running on a distribution/stable kernels.
> Sure, folks run -rc kernels and report bugs, but no one actually runs
> these kernels at larger scales.
And this just screams that a (much) more time has to pass before fixes
which are nice-to-have are passed to the stable tree - assuming they are
not fixing something that users of the said stable tree are seeing the
issue of course.
> Most "users out there" wouldn't see this patch until it ends up in a
> stable kernel.
...and this would be on a kernel version upgrade when some breakage is
expected and tolerated more than on minor version stable update.
[...]
> > But I guess we are just repeating the same discussion over and over. Our
> > expectations about what the stable kernel should be differs a lot. I
> > would like to see fewer but only important fixes while you would like to
> > take as many fixes as possible.
>
> Maybe to clarify here: I don't want to blindly take as much patches as I
> can. I want to take patches based on testing results: if something looks
> like a fix and it passes all our tests, there shouldn't be a reason not
> to take it.
There are many things we do not have any tests for. E.g. I wasn't even
aware that Perl (and others) are dealing with an excessive shebang input
by re-reading the input. There are always going to be corner cases like
that. The underlying thing is that nobody seem to be complaining about
the original issue addressed by Oleg. So why the heck should we push it
to the stable tree and _risk_ a regression.
> My view is that humans are terrible at writing and understanding code:
> if folks fully understood the impact of their patches we would never
> have bugs, right? Assuming we both agree here that we make mistakes and
> introduce bugs, why do you think that these very same people fully
> understand whether a patch should go in stable or not?
I haven't really seen a script that would be more efficient in this
evaluation. With a lack of the full test coverage I do not see this
going to change anytime soon.
> The approach of manually deciding if a patch needs to go in stable is
> wrong and it doesn't scale. We need to beef up our testing story and
> make these decisions based off of that, and not our error-prone brains
> that introduced these bugs to begin with.
>
> Look at the outcome of this very issue: people sprung into action and
> fixed this bug quickly, but how many tests were added as a result of
> this? How do we know it's not going to regress again?
Yes, the issue got identified and analyzed quickly. There was no
questioning this part. It is the regression in stable that bothers me.
You have exposed users of a tree, which is supposed to be stable, to a bug
which was totally unnecessary because nobody cared for the parsing
behavior for years.
--
Michal Hocko
SUSE Labs