For the past few days (I was away last week...) the fd-003-kthread.c
test from the proc kselftests has been failing on arm64, this is an
nfsroot system if that makes any odds. The test output itself is:
# selftests: proc: fd-003-kthread
# fd-003-kthread: fd-003-kthread.c:113: test_readdir: Assertion `!de' failed.
# Aborted
not ok 3 selftests: proc: fd-003-kthread # exit=134
I ran a bisect which pointed at the commit
d089d9d056c048303aedd40a7f3f26593ebd040c file: convert to SLAB_TYPESAFE_BY_RCU
(I can't seem to find that on lore.) I've not done any further analysis
of what the commit is doing or anything, though it does look like the
bisect ran fairly smoothly and it looks at least plausibly related to
the issue and reverting the commit on top of -next causes the test to
start passing again.
The bisect log is below and a full log from a failing test job can be
seen here:
https://lava.sirena.org.uk/scheduler/job/154334
Thanks,
Mark
git bisect start
# bad: [7d730f1bf6f39ece2d9f3ae682f12e5b593d534d] Add linux-next specific files for 20231005
git bisect bad 7d730f1bf6f39ece2d9f3ae682f12e5b593d534d
# good: [4d6ee1bd3e3820b523d43349cbcae230fdfcb613] Merge branch 'for-linux-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc
git bisect good 4d6ee1bd3e3820b523d43349cbcae230fdfcb613
# bad: [d6dfa62947bd47317de464c3ca55a6eaafe2e5af] Merge branch 'master' of git://linuxtv.org/media_tree.git
git bisect bad d6dfa62947bd47317de464c3ca55a6eaafe2e5af
# good: [73773d2fdd7172955a4476e8956c34aa49959219] bcachefs: Data update path no longer leaves cached replicas
git bisect good 73773d2fdd7172955a4476e8956c34aa49959219
# good: [87825243dfb1a14a0d8d3ae60754c34dfc084802] Merge branch 'loongarch-next' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson.git
git bisect good 87825243dfb1a14a0d8d3ae60754c34dfc084802
# good: [3855d73729a7ab4c3a6694a6efdf0816c8ad9dd1] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
git bisect good 3855d73729a7ab4c3a6694a6efdf0816c8ad9dd1
# bad: [298370bcbb0e5a2fae6c8efd35e2b7bf4c918f54] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
git bisect bad 298370bcbb0e5a2fae6c8efd35e2b7bf4c918f54
# good: [927cf8c9dd2a58846b541d106149c5e94fd0556f] Merge branch 'nfsd-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect good 927cf8c9dd2a58846b541d106149c5e94fd0556f
# bad: [530d23ed060f9c9ddf3b03398367aba9a9577b82] Merge branch 'vfs.super' into vfs.all
git bisect bad 530d23ed060f9c9ddf3b03398367aba9a9577b82
# bad: [7291b00e6f694af2b42d223145acb77b2bf1f62c] Merge branch 'vfs.iov_iter' into vfs.all
git bisect bad 7291b00e6f694af2b42d223145acb77b2bf1f62c
# bad: [459218d4c663f094951d71bbf293fd14dbb688e1] Merge branch 'vfs.mount.write' into vfs.all
git bisect bad 459218d4c663f094951d71bbf293fd14dbb688e1
# good: [cbe52963050bac0f2bfe2c24e09f50ffdc41132b] watch_queue: Annotate struct watch_filter with __counted_by
git bisect good cbe52963050bac0f2bfe2c24e09f50ffdc41132b
# bad: [450f431b47219235870f57a3f72fa8fec0a0ba43] vfs: fix readahead(2) on block devices
git bisect bad 450f431b47219235870f57a3f72fa8fec0a0ba43
# good: [1cf2d167e7f661b687feb0bd15278b859fe1513c] vfs: shave work on failed file open
git bisect good 1cf2d167e7f661b687feb0bd15278b859fe1513c
# bad: [d089d9d056c048303aedd40a7f3f26593ebd040c] file: convert to SLAB_TYPESAFE_BY_RCU
git bisect bad d089d9d056c048303aedd40a7f3f26593ebd040c
# first bad commit: [d089d9d056c048303aedd40a7f3f26593ebd040c] file: convert to SLAB_TYPESAFE_BY_RCU
On Fri, Oct 06, 2023 at 01:04:19AM +0100, Mark Brown wrote:
> For the past few days (I was away last week...) the fd-003-kthread.c
> test from the proc kselftests has been failing on arm64, this is an
> nfsroot system if that makes any odds. The test output itself is:
I'm out this week but will look at this right when I'm back.
Mark Brown <[email protected]> writes:
> For the past few days (I was away last week...) the fd-003-kthread.c
> test from the proc kselftests has been failing on arm64, this is an
> nfsroot system if that makes any odds. The test output itself is:
>
> # selftests: proc: fd-003-kthread
> # fd-003-kthread: fd-003-kthread.c:113: test_readdir: Assertion `!de' failed.
> # Aborted
> not ok 3 selftests: proc: fd-003-kthread # exit=134
>
> I ran a bisect which pointed at the commit
>
> d089d9d056c048303aedd40a7f3f26593ebd040c file: convert to SLAB_TYPESAFE_BY_RCU
>
> (I can't seem to find that on lore.) I've not done any further analysis
> of what the commit is doing or anything, though it does look like the
> bisect ran fairly smoothly and it looks at least plausibly related to
> the issue and reverting the commit on top of -next causes the test to
> start passing again.
I'm seeing the same with the strace test-suite on s390. The problem is
that /proc/*/fd now contains the file descriptors of the calling
process, and not the target process.
Old kernel:
# ls -l /proc/1/fd
total 0
lrwx------. 1 root root 64 Oct 6 11:11 0 -> /dev/null
lrwx------. 1 root root 64 Oct 6 11:11 1 -> /dev/null
lr-x------. 1 root root 64 Oct 6 11:12 10 -> /proc/1/mountinfo
lr-x------. 1 root root 64 Oct 6 11:12 11 -> anon_inode:inotify
lr-x------. 1 root root 64 Oct 6 11:12 13 -> anon_inode:inotify
lr-x------. 1 root root 64 Oct 6 11:12 14 -> /proc/swaps
lrwx------. 1 root root 64 Oct 6 11:12 15 -> 'socket:[5419]'
lrwx------. 1 root root 64 Oct 6 11:12 16 -> 'socket:[5420]'
[..]
# ls -l /proc/2/fd
total 0
#
New kernel:
# ls -l /proc/1/fd
total 0
lrwx------. 1 root root 64 Oct 6 11:14 0 -> /dev/null
lrwx------. 1 root root 64 Oct 6 11:14 1 -> /dev/null
lrwx------. 1 root root 64 Oct 6 11:14 2 -> /dev/null
lr-x------. 1 root root 64 Oct 6 11:14 3 -> /dev/kmsg
# ls -l /proc/2/fd
ls: cannot read symbolic link '/proc/2/fd/0': No such file or directory
ls: cannot read symbolic link '/proc/2/fd/1': No such file or directory
ls: cannot read symbolic link '/proc/2/fd/2': No such file or directory
ls: cannot read symbolic link '/proc/2/fd/3': No such file or directory
total 0
lrwx------. 1 root root 64 Oct 6 11:14 0
lrwx------. 1 root root 64 Oct 6 11:14 1
lrwx------. 1 root root 64 Oct 6 11:14 2
lr-x------. 1 root root 64 Oct 6 11:14 3
On Fri, Oct 06, 2023 at 11:19:58AM +0200, Sven Schnelle wrote:
> I'm seeing the same with the strace test-suite on s390. The problem is
> that /proc/*/fd now contains the file descriptors of the calling
> process, and not the target process.
>
This is why:
+static inline struct file *files_lookup_fdget_rcu(struct files_struct *files, unsigned int fd)
+{
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+ "suspicious rcu_dereference_check() usage");
+ return lookup_fdget_rcu(fd);
+}
files argument is now thrown away, instead it always uses current.
Mateusz Guzik <[email protected]> writes:
> On Fri, Oct 06, 2023 at 11:19:58AM +0200, Sven Schnelle wrote:
>> I'm seeing the same with the strace test-suite on s390. The problem is
>> that /proc/*/fd now contains the file descriptors of the calling
>> process, and not the target process.
>>
>
> This is why:
>
> +static inline struct file *files_lookup_fdget_rcu(struct files_struct *files, unsigned int fd)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "suspicious rcu_dereference_check() usage");
> + return lookup_fdget_rcu(fd);
> +}
>
> files argument is now thrown away, instead it always uses current.
Yes, passing files to lookup_fdget_rcu() fixes the issue.
On Fri, Oct 06, 2023 at 02:54:04PM +0200, Sven Schnelle wrote:
> Mateusz Guzik <[email protected]> writes:
>
> > On Fri, Oct 06, 2023 at 11:19:58AM +0200, Sven Schnelle wrote:
> >> I'm seeing the same with the strace test-suite on s390. The problem is
> >> that /proc/*/fd now contains the file descriptors of the calling
> >> process, and not the target process.
> >>
> >
> > This is why:
> >
> > +static inline struct file *files_lookup_fdget_rcu(struct files_struct *files, unsigned int fd)
> > +{
> > + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> > + "suspicious rcu_dereference_check() usage");
> > + return lookup_fdget_rcu(fd);
> > +}
> >
> > files argument is now thrown away, instead it always uses current.
>
> Yes, passing files to lookup_fdget_rcu() fixes the issue.
so i wrote this as an immediate fixup. not the prettiest, but should
prevent the need to drop the patch from linux-next.
diff --git a/fs/file.c b/fs/file.c
index 2f6965848907..8d62d6f46982 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1024,7 +1024,7 @@ static inline struct file *files_lookup_fdget_rcu(struct files_struct *files, un
{
RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
"suspicious rcu_dereference_check() usage");
- return lookup_fdget_rcu(fd);
+ return __fget_files_rcu(files, fd, 0);
}
struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd)
On Fri, Oct 06, 2023 at 10:38:30AM +0200, Christian Brauner wrote:
> On Fri, Oct 06, 2023 at 01:04:19AM +0100, Mark Brown wrote:
> > For the past few days (I was away last week...) the fd-003-kthread.c
> > test from the proc kselftests has been failing on arm64, this is an
> > nfsroot system if that makes any odds. The test output itself is:
>
> I'm out this week but will look at this right when I'm back.
Fixed in-tree. Was a pretty dump s// bug on my side. Sorry about the
noise. Should be gone tomorrow. Thanks for the report.
On Fri, Oct 06, 2023 at 02:39:21PM +0200, Mateusz Guzik wrote:
> On Fri, Oct 06, 2023 at 11:19:58AM +0200, Sven Schnelle wrote:
> > I'm seeing the same with the strace test-suite on s390. The problem is
> > that /proc/*/fd now contains the file descriptors of the calling
> > process, and not the target process.
> >
>
> This is why:
Yes, I realized this right when I had to take a ride for 5h. Just fixed
this a while ago and ran selftests as well. Thanks for digging into it
as well.