Subject: Re: sysctl_writes_strict documentation + an oddity?
From: Kees Cook
To: "Michael Kerrisk (man-pages)"
Cc: lkml, Randy Dunlap, Andrew Morton, "linux-man@vger.kernel.org"
Date: Thu, 4 Jun 2015 12:36:37 -0700
In-Reply-To: <554DCB33.8080101@gmail.com>
References: <554DCB33.8080101@gmail.com>

On Sat, May 9, 2015 at 1:54 AM, Michael Kerrisk (man-pages) wrote:
> Hi Kees,
>
> I discovered that you added /proc/sys/kernel/sysctl_writes_strict in
> Linux 3.16. In passing, I'll just mention that was an API change that
> should have been CCed to linux-api@vger.kernel.org.

Sorry about that! I'm trying to get better. I think my main trigger
for this is "if I'm adding a file to Documentation/ I should probably
CC linux-api" now. :)

> Anyway, I've tried to write this file up for the proc(5) man page,
> and I have two requests:
>
> 1) Could you review this text?
> 2) I've found some behavior that surprised me, and I am wondering if it
>    is intended. Could you let me know your thoughts?
>
> ===== 1) man-page text =====
>
> The man-page text, heavily based on your text in
> Documentation/sysctl/kernel.txt, is as follows:
>
>     /proc/sys/kernel/sysctl_writes_strict (since Linux 3.16)
>            The value in this file determines how the file offset
>            affects the behavior of updating entries in files under
>            /proc/sys.
>            The file has three possible values:
>
>            -1  This provides legacy handling, with no printk warnings.
>                Each write(2) must fully contain the value to
>                be written, and multiple writes on the same file
>                descriptor will overwrite the entire value, regardless
>                of the file position.
>
>            0   (default) This provides the same behavior as for -1,
>                but printk warnings are written for processes that
>                perform writes when the file offset is not 0.
>
>            1   Respect the file offset when writing strings into
>                /proc/sys files. Multiple writes will append to the
>                value buffer. Anything written beyond the maximum
>                length of the value buffer will be ignored. Writes to
>                numeric /proc/sys entries must always be at file offset
>                0 and the value must be fully contained in the
>                buffer provided to write(2).

That looks correct, yes. Thanks!

>
> ===== 2) Behavior puzzle (a) =====
>
> The last sentence quoted from the man page was based on your sentence
>
>     Writes to numeric sysctl entries must always be at file position 0
>     and the value must be fully contained in the buffer sent in the write
>     syscall.
>
> So, I had interpreted /proc/sys/kernel/sysctl_writes_strict==1 to
> mean that if one writes into a numeric /proc/sys file at an offset
> other than zero, the write() will fail with some kind of error.

Reporting back an error wasn't something I'd tested before. Looking at
the code again now, it should be possible to make this change.
Regardless, in the case of the numeric value error condition, it's the
same as the "past the end" string error condition: "Anything written
beyond the maximum length of the value buffer will be ignored." i.e.
anything other than file offset 0 is considered "past the end of the
buffer" for a numeric value and is ignored.

> But this seems not to be the case. Instead, the write() succeeds,
> but the file is left unmodified. That's surprising, I find. So, I'm
> wondering whether the implementation deviates from your intention.
>
> There's a test program below, which takes arguments as follows
>
>     ./a.out pathname offset string

I have tests in tools/testing/selftests/sysctl for checking the various
behaviors too. They don't actually examine any error conditions from
the sysctl writing itself. It should be simple to make
sysctl_writes_strict failures return an error, though.

>
> And here's a test run that demonstrates the behavior:
>
>     $ sudo sh -c "echo 1 > /proc/sys/kernel/sysctl_writes_strict"
>     $ cat /proc/sys/kernel/pid_max
>     32768
>     $ sudo dmesg --clear
>     $ sudo ./a.out /proc/sys/kernel/pid_max 1 3000
>     write() succeeded (return value 4)
>     $ cat /proc/sys/kernel/pid_max
>     32768
>     $ dmesg
>
> As you can see above, an attempt was made to write into the
> /proc/sys/kernel/pid_max file at offset 1.
> The write() returned successfully (reporting 4 bytes written)
> but the file contents were unchanged, and no printk() warning
> was issued. Is this intended behavior?
>
> ===== 2) Behavior puzzle (b) =====
>
> In commit f88083005ab319abba5d0b2e4e997558245493c8, there is this note:
>
>     This adds the sysctl kernel.sysctl_writes_strict to control the write
>     behavior. The default (0) reports when VFS position is non-0 on a
>     write, but retains legacy behavior, -1 disables the warning, and 1
>     enables the position-respecting behavior.
>
>     The long-term plan here is to wait for userspace to be fixed in response
>     to the new warning and to then switch the default kernel behavior to the
>     new position-respecting behavior.
>
> (That last para was added to the commit message by AKPM, I see.)
>
> But, I wonder here whether /proc/sys/kernel/sysctl_writes_strict==0
> is going to help with the long-term plan. The problem is that in
> warn_sysctl_write(), pr_warn_once() is used. This means that only
> the first offending user-space application that writes to *any*
> /proc/sys file will generate the printk warning.
> If that application
> isn't fixed, then none of the other "broken" applications will be
> discovered. It therefore seems possible that it could be a very long
> time before we could "switch the default kernel behavior to the
> new position-respecting behavior".
>
> Looking over old mails
> (http://thread.gmane.org/gmane.linux.kernel/1695177/focus=23240),
> I see that you're aware of the problem, but it seems to me that
> the switch to pr_warn_once() (for fear of spamming the log) likely
> dooms the long-term plan to failure. Your thoughts?

In actual regular use, the situation that triggers the warning should
be vanishingly rare, but the condition can be trivially met by someone
intending to hit it for the purposes of filling log files. As such, it
makes sense to me to use _once to avoid spamming, but still catch a
rare usage under normal conditions.

>
> Cheers,
>
> Michael
>
>
> 8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--
>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
>
> int
> main(int argc, char *argv[])
> {
>     char *pathname;
>     off_t offset;
>     char *string;
>     int fd;
>     ssize_t numWritten;
>
>     if (argc != 4) {
>         fprintf(stderr, "Usage: %s pathname offset string\n", argv[0]);
>         exit(EXIT_FAILURE);
>     }
>
>     pathname = argv[1];
>     offset = strtoll(argv[2], NULL, 0);
>     string = argv[3];
>
>     fd = open(pathname, O_RDWR);
>     if (fd == -1)
>         errExit("open");
>
>     if (lseek(fd, offset, SEEK_SET) == -1)
>         errExit("lseek");
>
>     numWritten = write(fd, string, strlen(string));
>     if (numWritten == -1)
>         errExit("write");
>
>     printf("write() succeeded (return value %zd)\n", numWritten);
>
>     exit(EXIT_SUCCESS);
> }
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Kees Cook
Chrome OS Security