2023-10-17 10:27:31

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> Hello Linus,
>
> could you please pull from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
>
> to get:
> * fixes for possible use-after-free issues with quota when racing with
> chown
> * fixes for ext2 crashing when xattr allocation races with another
> block allocation to the same file from page writeback code
> * fix for block number overflow in ext2
> * marking of reiserfs as obsolete in MAINTAINERS
> * assorted minor cleanups
>
> Top of the tree is df1ae36a4a0e. The full shortlog is:

This merge commit (?) broke boot on Intel Merrifield.
It has earlycon enabled and only what I got is watchdog
trigger without a bit of information printed out.

I tried to give a two bisects with the same result.

Try 1:

git bisect bad 461f35f014466c4e26dca6be0f431f57297df3f2 # good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
# good: [ef35c7ba60410926d0501e45aad299656a83826c] Revert "Revert "drm/amdgpu/display: change pipe policy for DCN 2.0""
git bisect good ef35c7ba60410926d0501e45aad299656a83826c
# good: [d68b4b6f307d155475cce541f2aee938032ed22e] Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect good d68b4b6f307d155475cce541f2aee938032ed22e
# good: [87fa732dc5ff9ea6a2e75b630f7931899e845eb1] Merge tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 87fa732dc5ff9ea6a2e75b630f7931899e845eb1
# good: [bc609f4867f6a14db0efda55a7adef4dca16762e] Merge tag 'drm-misc-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good bc609f4867f6a14db0efda55a7adef4dca16762e
# good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
# good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
# bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
# good: [7a64774add85ce673a089810fae193b02003be24] quota: use lockdep_assert_held_write in dquot_load_quota_sb
git bisect good 7a64774add85ce673a089810fae193b02003be24
# good: [b450159d0903b06ebea121a010ab9c424b67c408] ext2: introduce new flags argument for ext2_new_blocks()
git bisect good b450159d0903b06ebea121a010ab9c424b67c408
# good: [9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0] ext2: dump current reservation window info
git bisect good 9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0
# good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
# first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Try 2:

git bisect start
# status: waiting for both good and bad commits
# bad: [58720809f52779dc0f08e53e54b014209d13eebb] Linux 6.6-rc6
git bisect bad 58720809f52779dc0f08e53e54b014209d13eebb
# status: waiting for good commit(s), bad commit known
# good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
# bad: [c66403f62717e1af3be2a1873d52d2cf4724feba] Merge tag 'genpd-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
git bisect bad c66403f62717e1af3be2a1873d52d2cf4724feba
# good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
# good: [3698a75f5a98d0a6599e2878ab25d30a82dd836a] Merge tag 'drm-intel-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect good 3698a75f5a98d0a6599e2878ab25d30a82dd836a
# good: [1a35914f738c564060a14388f52a06669b09e0b3] Merge tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
git bisect good 1a35914f738c564060a14388f52a06669b09e0b3
# good: [b36e672b6b6fa4f68fc74c3b85ba9b4a615fc1d9] ASoC: tegra: merge DAI call back functions into ops
git bisect good b36e672b6b6fa4f68fc74c3b85ba9b4a615fc1d9
# good: [692f5510159c79bfa312a4e27a15e266232bfb4c] Merge tag 'asoc-v6.6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
git bisect good 692f5510159c79bfa312a4e27a15e266232bfb4c
# good: [1687d8aca5488674686eb46bf49d1d908b2672a1] Merge tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 1687d8aca5488674686eb46bf49d1d908b2672a1
# bad: [53ea7f624fb91074c2f9458832ed74975ee5d64c] Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad 53ea7f624fb91074c2f9458832ed74975ee5d64c
# good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
# bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
# good: [b0504bfe1b8acdcfb5ef466581d930835ef3c49e] ovl: add support for unique fsid per instance
git bisect good b0504bfe1b8acdcfb5ef466581d930835ef3c49e
# good: [36295542969dcfe7443f8cc5247863ed06a936d5] ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG
git bisect good 36295542969dcfe7443f8cc5247863ed06a936d5
# good: [adcd459ff805ce5e11956cfa1e9aa85471b6ae8d] ovl: validate superblock in OVL_FS()
git bisect good adcd459ff805ce5e11956cfa1e9aa85471b6ae8d
# good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
# first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

I'm full ears on how to debug this. Note, the earlyprintk doesn't work on this
platform as it has no legacy UARTs.

--
With Best Regards,
Andy Shevchenko



2023-10-17 10:29:38

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

+Ferry

On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > Hello Linus,
> >
> > could you please pull from
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
> >
> > to get:
> > * fixes for possible use-after-free issues with quota when racing with
> > chown
> > * fixes for ext2 crashing when xattr allocation races with another
> > block allocation to the same file from page writeback code
> > * fix for block number overflow in ext2
> > * marking of reiserfs as obsolete in MAINTAINERS
> > * assorted minor cleanups
> >
> > Top of the tree is df1ae36a4a0e. The full shortlog is:
>
> This merge commit (?) broke boot on Intel Merrifield.
> It has earlycon enabled and only what I got is watchdog
> trigger without a bit of information printed out.
>
> I tried to give a two bisects with the same result.
>
> Try 1:
>
> git bisect bad 461f35f014466c4e26dca6be0f431f57297df3f2 # good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
> # good: [ef35c7ba60410926d0501e45aad299656a83826c] Revert "Revert "drm/amdgpu/display: change pipe policy for DCN 2.0""
> git bisect good ef35c7ba60410926d0501e45aad299656a83826c
> # good: [d68b4b6f307d155475cce541f2aee938032ed22e] Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect good d68b4b6f307d155475cce541f2aee938032ed22e
> # good: [87fa732dc5ff9ea6a2e75b630f7931899e845eb1] Merge tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 87fa732dc5ff9ea6a2e75b630f7931899e845eb1
> # good: [bc609f4867f6a14db0efda55a7adef4dca16762e] Merge tag 'drm-misc-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> git bisect good bc609f4867f6a14db0efda55a7adef4dca16762e
> # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> # good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
> git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
> # bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
> # good: [7a64774add85ce673a089810fae193b02003be24] quota: use lockdep_assert_held_write in dquot_load_quota_sb
> git bisect good 7a64774add85ce673a089810fae193b02003be24
> # good: [b450159d0903b06ebea121a010ab9c424b67c408] ext2: introduce new flags argument for ext2_new_blocks()
> git bisect good b450159d0903b06ebea121a010ab9c424b67c408
> # good: [9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0] ext2: dump current reservation window info
> git bisect good 9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0
> # good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
> git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
> # first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
>
> Try 2:
>
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [58720809f52779dc0f08e53e54b014209d13eebb] Linux 6.6-rc6
> git bisect bad 58720809f52779dc0f08e53e54b014209d13eebb
> # status: waiting for good commit(s), bad commit known
> # good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
> git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
> # bad: [c66403f62717e1af3be2a1873d52d2cf4724feba] Merge tag 'genpd-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
> git bisect bad c66403f62717e1af3be2a1873d52d2cf4724feba
> # good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
> # good: [3698a75f5a98d0a6599e2878ab25d30a82dd836a] Merge tag 'drm-intel-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
> git bisect good 3698a75f5a98d0a6599e2878ab25d30a82dd836a
> # good: [1a35914f738c564060a14388f52a06669b09e0b3] Merge tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
> git bisect good 1a35914f738c564060a14388f52a06669b09e0b3
> # good: [b36e672b6b6fa4f68fc74c3b85ba9b4a615fc1d9] ASoC: tegra: merge DAI call back functions into ops
> git bisect good b36e672b6b6fa4f68fc74c3b85ba9b4a615fc1d9
> # good: [692f5510159c79bfa312a4e27a15e266232bfb4c] Merge tag 'asoc-v6.6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
> git bisect good 692f5510159c79bfa312a4e27a15e266232bfb4c
> # good: [1687d8aca5488674686eb46bf49d1d908b2672a1] Merge tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 1687d8aca5488674686eb46bf49d1d908b2672a1
> # bad: [53ea7f624fb91074c2f9458832ed74975ee5d64c] Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect bad 53ea7f624fb91074c2f9458832ed74975ee5d64c
> # good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
> git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
> # bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
> # good: [b0504bfe1b8acdcfb5ef466581d930835ef3c49e] ovl: add support for unique fsid per instance
> git bisect good b0504bfe1b8acdcfb5ef466581d930835ef3c49e
> # good: [36295542969dcfe7443f8cc5247863ed06a936d5] ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG
> git bisect good 36295542969dcfe7443f8cc5247863ed06a936d5
> # good: [adcd459ff805ce5e11956cfa1e9aa85471b6ae8d] ovl: validate superblock in OVL_FS()
> git bisect good adcd459ff805ce5e11956cfa1e9aa85471b6ae8d
> # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> # first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
>
> I'm full ears on how to debug this. Note, the earlyprintk doesn't work on this
> platform as it has no legacy UARTs.
>
> --
> With Best Regards,
> Andy Shevchenko
>
>

--
With Best Regards,
Andy Shevchenko


2023-10-17 10:33:02

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> +Ferry
>
> On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > Hello Linus,
> > >
> > > could you please pull from
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
> > >
> > > to get:
> > > * fixes for possible use-after-free issues with quota when racing with
> > > chown
> > > * fixes for ext2 crashing when xattr allocation races with another
> > > block allocation to the same file from page writeback code
> > > * fix for block number overflow in ext2
> > > * marking of reiserfs as obsolete in MAINTAINERS
> > > * assorted minor cleanups
> > >
> > > Top of the tree is df1ae36a4a0e. The full shortlog is:
> >
> > This merge commit (?) broke boot on Intel Merrifield.
> > It has earlycon enabled and only what I got is watchdog
> > trigger without a bit of information printed out.
> >
> > I tried to give a two bisects with the same result.
> >
> > Try 1:

+ Missed start of this

git bisect start
# status: waiting for both good and bad commits
# good: [2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
git bisect good 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
# status: waiting for bad commit, 1 good commit known
# bad: [0bb80ecc33a8fb5a682236443c1e740d5c917d1d] Linux 6.6-rc1
git bisect bad 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
# bad: [461f35f014466c4e26dca6be0f431f57297df3f2] Merge tag 'drm-next-2023-08-30' of git://anongit.freedesktop.org/drm/drm


> > git bisect bad 461f35f014466c4e26dca6be0f431f57297df3f2 # good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> > git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
> > # good: [ef35c7ba60410926d0501e45aad299656a83826c] Revert "Revert "drm/amdgpu/display: change pipe policy for DCN 2.0""
> > git bisect good ef35c7ba60410926d0501e45aad299656a83826c
> > # good: [d68b4b6f307d155475cce541f2aee938032ed22e] Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > git bisect good d68b4b6f307d155475cce541f2aee938032ed22e
> > # good: [87fa732dc5ff9ea6a2e75b630f7931899e845eb1] Merge tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good 87fa732dc5ff9ea6a2e75b630f7931899e845eb1
> > # good: [bc609f4867f6a14db0efda55a7adef4dca16762e] Merge tag 'drm-misc-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> > git bisect good bc609f4867f6a14db0efda55a7adef4dca16762e
> > # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> > # good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
> > git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
> > # bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> > git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
> > # good: [7a64774add85ce673a089810fae193b02003be24] quota: use lockdep_assert_held_write in dquot_load_quota_sb
> > git bisect good 7a64774add85ce673a089810fae193b02003be24
> > # good: [b450159d0903b06ebea121a010ab9c424b67c408] ext2: introduce new flags argument for ext2_new_blocks()
> > git bisect good b450159d0903b06ebea121a010ab9c424b67c408
> > # good: [9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0] ext2: dump current reservation window info
> > git bisect good 9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0
> > # good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
> > git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
> > # first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> >
> > Try 2:
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # bad: [58720809f52779dc0f08e53e54b014209d13eebb] Linux 6.6-rc6
> > git bisect bad 58720809f52779dc0f08e53e54b014209d13eebb
> > # status: waiting for good commit(s), bad commit known
> > # good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
> > git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
> > # bad: [c66403f62717e1af3be2a1873d52d2cf4724feba] Merge tag 'genpd-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
> > git bisect bad c66403f62717e1af3be2a1873d52d2cf4724feba
> > # good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> > git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
> > # good: [3698a75f5a98d0a6599e2878ab25d30a82dd836a] Merge tag 'drm-intel-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
> > git bisect good 3698a75f5a98d0a6599e2878ab25d30a82dd836a
> > # good: [1a35914f738c564060a14388f52a06669b09e0b3] Merge tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
> > git bisect good 1a35914f738c564060a14388f52a06669b09e0b3
> > # good: [b36e672b6b6fa4f68fc74c3b85ba9b4a615fc1d9] ASoC: tegra: merge DAI call back functions into ops
> > git bisect good b36e672b6b6fa4f68fc74c3b85ba9b4a615fc1d9
> > # good: [692f5510159c79bfa312a4e27a15e266232bfb4c] Merge tag 'asoc-v6.6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
> > git bisect good 692f5510159c79bfa312a4e27a15e266232bfb4c
> > # good: [1687d8aca5488674686eb46bf49d1d908b2672a1] Merge tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good 1687d8aca5488674686eb46bf49d1d908b2672a1
> > # bad: [53ea7f624fb91074c2f9458832ed74975ee5d64c] Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> > git bisect bad 53ea7f624fb91074c2f9458832ed74975ee5d64c
> > # good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
> > git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
> > # bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> > git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
> > # good: [b0504bfe1b8acdcfb5ef466581d930835ef3c49e] ovl: add support for unique fsid per instance
> > git bisect good b0504bfe1b8acdcfb5ef466581d930835ef3c49e
> > # good: [36295542969dcfe7443f8cc5247863ed06a936d5] ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG
> > git bisect good 36295542969dcfe7443f8cc5247863ed06a936d5
> > # good: [adcd459ff805ce5e11956cfa1e9aa85471b6ae8d] ovl: validate superblock in OVL_FS()
> > git bisect good adcd459ff805ce5e11956cfa1e9aa85471b6ae8d
> > # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> > # first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> >
> > I'm full ears on how to debug this. Note, the earlyprintk doesn't work on this
> > platform as it has no legacy UARTs.

--
With Best Regards,
Andy Shevchenko


2023-10-17 11:36:39

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

Hello!

On Tue 17-10-23 13:32:53, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > Hello Linus,
> > > >
> > > > could you please pull from
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
> > > >
> > > > to get:
> > > > * fixes for possible use-after-free issues with quota when racing with
> > > > chown
> > > > * fixes for ext2 crashing when xattr allocation races with another
> > > > block allocation to the same file from page writeback code
> > > > * fix for block number overflow in ext2
> > > > * marking of reiserfs as obsolete in MAINTAINERS
> > > > * assorted minor cleanups
> > > >
> > > > Top of the tree is df1ae36a4a0e. The full shortlog is:
> > >
> > > This merge commit (?) broke boot on Intel Merrifield.
> > > It has earlycon enabled and only what I got is watchdog
> > > trigger without a bit of information printed out.
> > >
> > > I tried to give a two bisects with the same result.
> > >
> > > Try 1:
>
> + Missed start of this
>
> git bisect start
> # status: waiting for both good and bad commits
> # good: [2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
> git bisect good 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
> # status: waiting for bad commit, 1 good commit known
> # bad: [0bb80ecc33a8fb5a682236443c1e740d5c917d1d] Linux 6.6-rc1
> git bisect bad 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
> # bad: [461f35f014466c4e26dca6be0f431f57297df3f2] Merge tag 'drm-next-2023-08-30' of git://anongit.freedesktop.org/drm/drm
>
> > > git bisect bad 461f35f014466c4e26dca6be0f431f57297df3f2 # good: [bd6c11bc43c496cddfc6cf603b5d45365606dbd5] Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> > > git bisect good bd6c11bc43c496cddfc6cf603b5d45365606dbd5
> > > # good: [ef35c7ba60410926d0501e45aad299656a83826c] Revert "Revert "drm/amdgpu/display: change pipe policy for DCN 2.0""
> > > git bisect good ef35c7ba60410926d0501e45aad299656a83826c
> > > # good: [d68b4b6f307d155475cce541f2aee938032ed22e] Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > git bisect good d68b4b6f307d155475cce541f2aee938032ed22e
> > > # good: [87fa732dc5ff9ea6a2e75b630f7931899e845eb1] Merge tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > > git bisect good 87fa732dc5ff9ea6a2e75b630f7931899e845eb1
> > > # good: [bc609f4867f6a14db0efda55a7adef4dca16762e] Merge tag 'drm-misc-next-fixes-2023-08-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> > > git bisect good bc609f4867f6a14db0efda55a7adef4dca16762e
> > > # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > > git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> > > # good: [5221002c054376fcf2f0cea1d13f00291a90222e] Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA
> > > git bisect good 5221002c054376fcf2f0cea1d13f00291a90222e
> > > # bad: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> > > git bisect bad 1500e7e0726e963f64b9785a0cb0a820b2587bad
> > > # good: [7a64774add85ce673a089810fae193b02003be24] quota: use lockdep_assert_held_write in dquot_load_quota_sb
> > > git bisect good 7a64774add85ce673a089810fae193b02003be24
> > > # good: [b450159d0903b06ebea121a010ab9c424b67c408] ext2: introduce new flags argument for ext2_new_blocks()
> > > git bisect good b450159d0903b06ebea121a010ab9c424b67c408
> > > # good: [9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0] ext2: dump current reservation window info
> > > git bisect good 9bc6fc3304d89f19c028cb4a8d6af94f9e5faeb0
> > > # good: [df1ae36a4a0e92340daea12e88d43eeb2eb013b1] ext2: Fix kernel-doc warnings
> > > git bisect good df1ae36a4a0e92340daea12e88d43eeb2eb013b1
> > > # first bad commit: [1500e7e0726e963f64b9785a0cb0a820b2587bad] Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

That's strange because I don't see anything suspicious in the merge and
furthermore I'd expent none of the changes in the merge to influence early
boot in any way. Can you share your kernel config? What root filesystem do
you use? Thanks for the report!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-17 11:46:31

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > Hello Linus,

...

> > > This merge commit (?) broke boot on Intel Merrifield.
> > > It has earlycon enabled and only what I got is watchdog
> > > trigger without a bit of information printed out.

Okay, seems false positive as with different configuration it
boots. It might be related to the size of the kernel itself.

--
With Best Regards,
Andy Shevchenko


2023-10-17 11:50:01

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 01:36:28PM +0200, Jan Kara wrote:
> On Tue 17-10-23 13:32:53, Andy Shevchenko wrote:

...

> That's strange because I don't see anything suspicious in the merge and
> furthermore I'd expent none of the changes in the merge to influence early
> boot in any way. Can you share your kernel config? What root filesystem do
> you use? Thanks for the report!

I just read this message and responded a few minutes before that the issue
somewhere else related to the configuration. Thanks for the prompt reply!

Still if you need to see the working/non-working configurations I can
share them.

--
With Best Regards,
Andy Shevchenko


2023-10-17 13:33:01

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > Hello Linus,
>
> ...
>
> > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > It has earlycon enabled and only what I got is watchdog
> > > > trigger without a bit of information printed out.
>
> Okay, seems false positive as with different configuration it
> boots. It might be related to the size of the kernel itself.

Ah, ok, that makes some sense.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-17 13:42:47

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > Hello Linus,

...

> > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > It has earlycon enabled and only what I got is watchdog
> > > > > trigger without a bit of information printed out.
> >
> > Okay, seems false positive as with different configuration it
> > boots. It might be related to the size of the kernel itself.
>
> Ah, ok, that makes some sense.

I should have mentioned that it boots with the configuration say "A",
while not with "B", where "B" = "A" + "C" and definitely the kernel
and initrd sizes in the "B" case are bigger.

--
With Best Regards,
Andy Shevchenko


2023-10-17 14:59:42

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > Hello Linus,

...

> > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > trigger without a bit of information printed out.
> > >
> > > Okay, seems false positive as with different configuration it
> > > boots. It might be related to the size of the kernel itself.
> >
> > Ah, ok, that makes some sense.
>
> I should have mentioned that it boots with the configuration say "A",
> while not with "B", where "B" = "A" + "C" and definitely the kernel
> and initrd sizes in the "B" case are bigger.

If it's a size (which is only grew from 13M->14M), it's weird.

Nevertheless, I reverted these in my local tree

85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"

and it boots again! So, after this merge something affects one of this?

I'll continuing debugging which one is a culprit, just want to share
the intermediate findings.

--
With Best Regards,
Andy Shevchenko


2023-10-17 15:15:57

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > Hello Linus,

...

> > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > trigger without a bit of information printed out.
> > > >
> > > > Okay, seems false positive as with different configuration it
> > > > boots. It might be related to the size of the kernel itself.
> > >
> > > Ah, ok, that makes some sense.
> >
> > I should have mentioned that it boots with the configuration say "A",
> > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > and initrd sizes in the "B" case are bigger.
>
> If it's a size (which is only grew from 13M->14M), it's weird.
>
> Nevertheless, I reverted these in my local tree
>
> 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
>
> and it boots again! So, after this merge something affects one of this?
>
> I'll continuing debugging which one is a culprit, just want to share
> the intermediate findings.

CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
Any ideas?

--
With Best Regards,
Andy Shevchenko


2023-10-17 15:36:01

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > Hello Linus,

...

> > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > trigger without a bit of information printed out.
> > > > >
> > > > > Okay, seems false positive as with different configuration it
> > > > > boots. It might be related to the size of the kernel itself.
> > > >
> > > > Ah, ok, that makes some sense.
> > >
> > > I should have mentioned that it boots with the configuration say "A",
> > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > and initrd sizes in the "B" case are bigger.
> >
> > If it's a size (which is only grew from 13M->14M), it's weird.
> >
> > Nevertheless, I reverted these in my local tree
> >
> > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> >
> > and it boots again! So, after this merge something affects one of this?
> >
> > I'll continuing debugging which one is a culprit, just want to share
> > the intermediate findings.
>
> CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> Any ideas?

Dropping CONFIG_QUOTA* helps as well.
So something definitely goes on in this merge commit.

--
With Best Regards,
Andy Shevchenko


2023-10-17 16:03:44

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > Hello Linus,

...

> > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > trigger without a bit of information printed out.
> > > > > >
> > > > > > Okay, seems false positive as with different configuration it
> > > > > > boots. It might be related to the size of the kernel itself.
> > > > >
> > > > > Ah, ok, that makes some sense.
> > > >
> > > > I should have mentioned that it boots with the configuration say "A",
> > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > and initrd sizes in the "B" case are bigger.
> > >
> > > If it's a size (which is only grew from 13M->14M), it's weird.
> > >
> > > Nevertheless, I reverted these in my local tree
> > >
> > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > >
> > > and it boots again! So, after this merge something affects one of this?
> > >
> > > I'll continuing debugging which one is a culprit, just want to share
> > > the intermediate findings.
> >
> > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > Any ideas?

> Dropping CONFIG_QUOTA* helps as well.

More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
to make it boot again.

And I'm done for today.

> So something definitely goes on in this merge commit.

--
With Best Regards,
Andy Shevchenko


2023-10-18 18:46:26

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > Hello Linus,
>
> ...
>
> > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > trigger without a bit of information printed out.
> > > > > > >
> > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > >
> > > > > > Ah, ok, that makes some sense.
> > > > >
> > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > and initrd sizes in the "B" case are bigger.
> > > >
> > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > >
> > > > Nevertheless, I reverted these in my local tree
> > > >
> > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > >
> > > > and it boots again! So, after this merge something affects one of this?
> > > >
> > > > I'll continuing debugging which one is a culprit, just want to share
> > > > the intermediate findings.
> > >
> > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > Any ideas?
>
> > Dropping CONFIG_QUOTA* helps as well.
>
> More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> to make it boot again.
>
> And I'm done for today.

OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
permanently in your kernel config and then bisect through the quota changes
in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
the guarantees dquot_srcu should provide") might be the culprit given your
testing but I fail to see how given I don't expect any quotas to be used
during boot of your platform... BTW, there's also fixup: 869b6ea160
("quota: Fix slow quotaoff") merged last week so you could try testing a
kernel after this fix to see whether it changes anything.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-19 08:47:14

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Wed, Oct 18, 2023 at 08:46:13PM +0200, Jan Kara wrote:
> On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> > On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > > Hello Linus,

...

> > > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > > trigger without a bit of information printed out.
> > > > > > > >
> > > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > > >
> > > > > > > Ah, ok, that makes some sense.
> > > > > >
> > > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > > and initrd sizes in the "B" case are bigger.
> > > > >
> > > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > > >
> > > > > Nevertheless, I reverted these in my local tree
> > > > >
> > > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > > >
> > > > > and it boots again! So, after this merge something affects one of this?
> > > > >
> > > > > I'll continuing debugging which one is a culprit, just want to share
> > > > > the intermediate findings.
> > > >
> > > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > > Any ideas?
> >
> > > Dropping CONFIG_QUOTA* helps as well.
> >
> > More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> > to make it boot again.
> >
> > And I'm done for today.
>
> OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
> permanently in your kernel config and then bisect through the quota changes
> in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
> the guarantees dquot_srcu should provide") might be the culprit given your
> testing but I fail to see how given I don't expect any quotas to be used
> during boot of your platform... BTW, there's also fixup: 869b6ea160
> ("quota: Fix slow quotaoff") merged last week so you could try testing a
> kernel after this fix to see whether it changes anything.

It's exactly what my initial report is about, CONFIG_DEBUG_LIST was there
always with CONFIG_QUOTA as well.

Two bisections (v6.5 .. v6.6-rc1 & something...v6.6-rc6) pointed out to
merge commit! I _had_ tried to simply revert the quota changes (I haven't
said about that before) and it didn't help. I'm so puzzled with all this.

--
With Best Regards,
Andy Shevchenko


2023-10-19 10:19:13

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu 19-10-23 11:46:58, Andy Shevchenko wrote:
> On Wed, Oct 18, 2023 at 08:46:13PM +0200, Jan Kara wrote:
> > On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> > > On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > > > Hello Linus,
>
> ...
>
> > > > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > > > trigger without a bit of information printed out.
> > > > > > > > >
> > > > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > > > >
> > > > > > > > Ah, ok, that makes some sense.
> > > > > > >
> > > > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > > > and initrd sizes in the "B" case are bigger.
> > > > > >
> > > > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > > > >
> > > > > > Nevertheless, I reverted these in my local tree
> > > > > >
> > > > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > > > >
> > > > > > and it boots again! So, after this merge something affects one of this?
> > > > > >
> > > > > > I'll continuing debugging which one is a culprit, just want to share
> > > > > > the intermediate findings.
> > > > >
> > > > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > > > Any ideas?
> > >
> > > > Dropping CONFIG_QUOTA* helps as well.
> > >
> > > More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> > > to make it boot again.
> > >
> > > And I'm done for today.
> >
> > OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
> > permanently in your kernel config and then bisect through the quota changes
> > in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
> > the guarantees dquot_srcu should provide") might be the culprit given your
> > testing but I fail to see how given I don't expect any quotas to be used
> > during boot of your platform... BTW, there's also fixup: 869b6ea160
> > ("quota: Fix slow quotaoff") merged last week so you could try testing a
> > kernel after this fix to see whether it changes anything.
>
> It's exactly what my initial report is about, CONFIG_DEBUG_LIST was there
> always with CONFIG_QUOTA as well.

Ah, ok.

> Two bisections (v6.5 .. v6.6-rc1 & something...v6.6-rc6) pointed out to
> merge commit!

I thought CONFIG_DEBUG_LIST arrived through one path, some problematic
quota change arrived through another path and because they cause problems
only together, then bisecting to the merge would be exactly the outcome.
Alas that doesn't seem to be the case :-|.

> I _had_ tried to simply revert the quota changes (I haven't
> said about that before) and it didn't help. I'm so puzzled with all this.

Aha, OK. If even reverting quota changes doesn't help, then it's really
weird...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-19 12:01:58

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 12:18:54PM +0200, Jan Kara wrote:
> On Thu 19-10-23 11:46:58, Andy Shevchenko wrote:
> > On Wed, Oct 18, 2023 at 08:46:13PM +0200, Jan Kara wrote:
> > > On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> > > > On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > > > > Hello Linus,

...

> > > > > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > > > > trigger without a bit of information printed out.
> > > > > > > > > >
> > > > > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > > > > >
> > > > > > > > > Ah, ok, that makes some sense.
> > > > > > > >
> > > > > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > > > > and initrd sizes in the "B" case are bigger.
> > > > > > >
> > > > > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > > > > >
> > > > > > > Nevertheless, I reverted these in my local tree
> > > > > > >
> > > > > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > > > > >
> > > > > > > and it boots again! So, after this merge something affects one of this?
> > > > > > >
> > > > > > > I'll continuing debugging which one is a culprit, just want to share
> > > > > > > the intermediate findings.
> > > > > >
> > > > > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > > > > Any ideas?
> > > >
> > > > > Dropping CONFIG_QUOTA* helps as well.
> > > >
> > > > More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> > > > to make it boot again.
> > > >
> > > > And I'm done for today.
> > >
> > > OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
> > > permanently in your kernel config and then bisect through the quota changes
> > > in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
> > > the guarantees dquot_srcu should provide") might be the culprit given your
> > > testing but I fail to see how given I don't expect any quotas to be used
> > > during boot of your platform... BTW, there's also fixup: 869b6ea160
> > > ("quota: Fix slow quotaoff") merged last week so you could try testing a
> > > kernel after this fix to see whether it changes anything.
> >
> > It's exactly what my initial report is about, CONFIG_DEBUG_LIST was there
> > always with CONFIG_QUOTA as well.
>
> Ah, ok.
>
> > Two bisections (v6.5 .. v6.6-rc1 & something...v6.6-rc6) pointed out to
> > merge commit!
>
> I thought CONFIG_DEBUG_LIST arrived through one path, some problematic
> quota change arrived through another path and because they cause problems
> only together, then bisecting to the merge would be exactly the outcome.
> Alas that doesn't seem to be the case :-|.
>
> > I _had_ tried to simply revert the quota changes (I haven't
> > said about that before) and it didn't help. I'm so puzzled with all this.
>
> Aha, OK. If even reverting quota changes doesn't help, then it's really
> weird...

Lemme to confirm that, it might be that I forgot to update configuration in
between.

--
With Best Regards,
Andy Shevchenko


2023-10-19 14:36:10

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

+Cc: compiler related guys (as far as my heuristics work).
Any ideas? (see below)

On Thu, Oct 19, 2023 at 03:01:43PM +0300, Andy Shevchenko wrote:
> On Thu, Oct 19, 2023 at 12:18:54PM +0200, Jan Kara wrote:
> > On Thu 19-10-23 11:46:58, Andy Shevchenko wrote:
> > > On Wed, Oct 18, 2023 at 08:46:13PM +0200, Jan Kara wrote:
> > > > On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > > > > > Hello Linus,

...

> > > > > > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > > > > > trigger without a bit of information printed out.
> > > > > > > > > > >
> > > > > > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > > > > > >
> > > > > > > > > > Ah, ok, that makes some sense.
> > > > > > > > >
> > > > > > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > > > > > and initrd sizes in the "B" case are bigger.
> > > > > > > >
> > > > > > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > > > > > >
> > > > > > > > Nevertheless, I reverted these in my local tree
> > > > > > > >
> > > > > > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > > > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > > > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > > > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > > > > > >
> > > > > > > > and it boots again! So, after this merge something affects one of this?
> > > > > > > >
> > > > > > > > I'll continuing debugging which one is a culprit, just want to share
> > > > > > > > the intermediate findings.
> > > > > > >
> > > > > > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > > > > > Any ideas?
> > > > >
> > > > > > Dropping CONFIG_QUOTA* helps as well.
> > > > >
> > > > > More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> > > > > to make it boot again.
> > > > >
> > > > > And I'm done for today.
> > > >
> > > > OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
> > > > permanently in your kernel config and then bisect through the quota changes
> > > > in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
> > > > the guarantees dquot_srcu should provide") might be the culprit given your
> > > > testing but I fail to see how given I don't expect any quotas to be used
> > > > during boot of your platform... BTW, there's also fixup: 869b6ea160
> > > > ("quota: Fix slow quotaoff") merged last week so you could try testing a
> > > > kernel after this fix to see whether it changes anything.
> > >
> > > It's exactly what my initial report is about, CONFIG_DEBUG_LIST was there
> > > always with CONFIG_QUOTA as well.
> >
> > Ah, ok.
> >
> > > Two bisections (v6.5 .. v6.6-rc1 & something...v6.6-rc6) pointed out to
> > > merge commit!
> >
> > I thought CONFIG_DEBUG_LIST arrived through one path, some problematic
> > quota change arrived through another path and because they cause problems
> > only together, then bisecting to the merge would be exactly the outcome.
> > Alas that doesn't seem to be the case :-|.
> >
> > > I _had_ tried to simply revert the quota changes (I haven't
> > > said about that before) and it didn't help. I'm so puzzled with all this.
> >
> > Aha, OK. If even reverting quota changes doesn't help, then it's really
> > weird...
>
> Lemme to confirm that, it might be that I forgot to update configuration in
> between.

So, what I have done so far.
1) I have cleaned ccaches and stuff as I used it to avoid collisions;
2) I have confirmed that CONFIG_DEBUG_LIST affects boot, the repo
I'm using is published here [0][1];
3) reverted quota patches until before this merge ([2] - last patch),
still boots;
4) reverted disabling of CONFIG_DEBUG_LIST [2], doesn't boot;
5) okay, rebased on top of merge, i.e. 1500e7e0726e, with DEBUG_LIST [3],
doesn't boot;
6) rebased [3] on one merge before, i.e. 63580f669d7f [4], voil? -- it boots!;

And (tadaam!) I have had an idea for a while to replace GCC with LLVM
(at least for this test), so [0] boots as well!

So, this merge triggered a bug in GCC, seems like... And it's _the_ merge
commit, which is so-o weird!

$ gcc --version
gcc (Debian 13.2.0-4) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[0]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-dbg-list/
[1]: https://bitbucket.org/andy-shev/linux/src/test-mrfld/
[2]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-no-quota-dbg-list/
[3]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-after-merge-dbg-list/
[4]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-before-merge/

--
With Best Regards,
Andy Shevchenko


2023-10-19 14:44:45

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

+Cc: compiler related guys (as far as my heuristics work).
Any ideas? (see below)

On Thu, Oct 19, 2023 at 03:01:43PM +0300, Andy Shevchenko wrote:
> On Thu, Oct 19, 2023 at 12:18:54PM +0200, Jan Kara wrote:
> > On Thu 19-10-23 11:46:58, Andy Shevchenko wrote:
> > > On Wed, Oct 18, 2023 at 08:46:13PM +0200, Jan Kara wrote:
> > > > On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> > > > > On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > > > > > Hello Linus,

...

> > > > > > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > > > > > trigger without a bit of information printed out.
> > > > > > > > > > >
> > > > > > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > > > > > >
> > > > > > > > > > Ah, ok, that makes some sense.
> > > > > > > > >
> > > > > > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > > > > > and initrd sizes in the "B" case are bigger.
> > > > > > > >
> > > > > > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > > > > > >
> > > > > > > > Nevertheless, I reverted these in my local tree
> > > > > > > >
> > > > > > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > > > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > > > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > > > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > > > > > >
> > > > > > > > and it boots again! So, after this merge something affects one of this?
> > > > > > > >
> > > > > > > > I'll continuing debugging which one is a culprit, just want to share
> > > > > > > > the intermediate findings.
> > > > > > >
> > > > > > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > > > > > Any ideas?
> > > > >
> > > > > > Dropping CONFIG_QUOTA* helps as well.
> > > > >
> > > > > More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> > > > > to make it boot again.
> > > > >
> > > > > And I'm done for today.
> > > >
> > > > OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
> > > > permanently in your kernel config and then bisect through the quota changes
> > > > in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
> > > > the guarantees dquot_srcu should provide") might be the culprit given your
> > > > testing but I fail to see how given I don't expect any quotas to be used
> > > > during boot of your platform... BTW, there's also fixup: 869b6ea160
> > > > ("quota: Fix slow quotaoff") merged last week so you could try testing a
> > > > kernel after this fix to see whether it changes anything.
> > >
> > > It's exactly what my initial report is about, CONFIG_DEBUG_LIST was there
> > > always with CONFIG_QUOTA as well.
> >
> > Ah, ok.
> >
> > > Two bisections (v6.5 .. v6.6-rc1 & something...v6.6-rc6) pointed out to
> > > merge commit!
> >
> > I thought CONFIG_DEBUG_LIST arrived through one path, some problematic
> > quota change arrived through another path and because they cause problems
> > only together, then bisecting to the merge would be exactly the outcome.
> > Alas that doesn't seem to be the case :-|.
> >
> > > I _had_ tried to simply revert the quota changes (I haven't
> > > said about that before) and it didn't help. I'm so puzzled with all this.
> >
> > Aha, OK. If even reverting quota changes doesn't help, then it's really
> > weird...
>
> Lemme to confirm that, it might be that I forgot to update configuration in
> between.

So, what I have done so far.
1) I have cleaned ccaches and stuff as I used it to avoid collisions;
2) I have confirmed that CONFIG_DEBUG_LIST affects boot, the repo
I'm using is published here [0][1];
3) reverted quota patches until before this merge ([2] - last patch),
still boots;
4) reverted disabling of CONFIG_DEBUG_LIST [2], doesn't boot;
5) okay, rebased on top of merge, i.e. 1500e7e0726e, with DEBUG_LIST [3],
doesn't boot;
6) rebased [3] on one merge before, i.e. 63580f669d7f [4], voil? -- it boots!;

And (tadaam!) I have had an idea for a while to replace GCC with LLVM
(at least for this test), so [0] boots as well!

So, this merge triggered a bug in GCC, seems like... And it's _the_ merge
commit, which is so-o weird!

$ gcc --version
gcc (Debian 13.2.0-4) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[0]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-dbg-list/
[1]: https://bitbucket.org/andy-shev/linux/src/test-mrfld/
[2]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-no-quota-dbg-list/
[3]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-after-merge-dbg-list/
[4]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-before-merge/

--
With Best Regards,
Andy Shevchenko


2023-10-19 14:50:01

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 05:12:59PM +0300, Andy Shevchenko wrote:
> On Thu, Oct 19, 2023 at 03:01:43PM +0300, Andy Shevchenko wrote:
> > On Thu, Oct 19, 2023 at 12:18:54PM +0200, Jan Kara wrote:
> > > On Thu 19-10-23 11:46:58, Andy Shevchenko wrote:
> > > > On Wed, Oct 18, 2023 at 08:46:13PM +0200, Jan Kara wrote:
> > > > > On Tue 17-10-23 19:02:52, Andy Shevchenko wrote:
> > > > > > On Tue, Oct 17, 2023 at 06:34:50PM +0300, Andy Shevchenko wrote:
> > > > > > > On Tue, Oct 17, 2023 at 06:14:54PM +0300, Andy Shevchenko wrote:
> > > > > > > > On Tue, Oct 17, 2023 at 05:50:10PM +0300, Andy Shevchenko wrote:
> > > > > > > > > On Tue, Oct 17, 2023 at 04:42:29PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > On Tue, Oct 17, 2023 at 03:32:45PM +0200, Jan Kara wrote:
> > > > > > > > > > > On Tue 17-10-23 14:46:20, Andy Shevchenko wrote:
> > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:32:53PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:29:27PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > > On Tue, Oct 17, 2023 at 01:27:19PM +0300, Andy Shevchenko wrote:
> > > > > > > > > > > > > > > On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> > > > > > > > > > > > > > > > Hello Linus,

...

> > > > > > > > > > > > > > > This merge commit (?) broke boot on Intel Merrifield.
> > > > > > > > > > > > > > > It has earlycon enabled and only what I got is watchdog
> > > > > > > > > > > > > > > trigger without a bit of information printed out.
> > > > > > > > > > > >
> > > > > > > > > > > > Okay, seems false positive as with different configuration it
> > > > > > > > > > > > boots. It might be related to the size of the kernel itself.
> > > > > > > > > > >
> > > > > > > > > > > Ah, ok, that makes some sense.
> > > > > > > > > >
> > > > > > > > > > I should have mentioned that it boots with the configuration say "A",
> > > > > > > > > > while not with "B", where "B" = "A" + "C" and definitely the kernel
> > > > > > > > > > and initrd sizes in the "B" case are bigger.
> > > > > > > > >
> > > > > > > > > If it's a size (which is only grew from 13M->14M), it's weird.
> > > > > > > > >
> > > > > > > > > Nevertheless, I reverted these in my local tree
> > > > > > > > >
> > > > > > > > > 85515a7f0ae7 (HEAD -> topic/mrfld) Revert "defconfig: enable DEBUG_SPINLOCK"
> > > > > > > > > 786e04262621 Revert "defconfig: enable DEBUG_ATOMIC_SLEEP"
> > > > > > > > > 76ad0a0c3f2d Revert "defconfig: enable DEBUG_INFO"
> > > > > > > > > f8090166c1be Revert "defconfig: enable DEBUG_LIST && DEBUG_OBJECTS_RCU_HEAD"
> > > > > > > > >
> > > > > > > > > and it boots again! So, after this merge something affects one of this?
> > > > > > > > >
> > > > > > > > > I'll continuing debugging which one is a culprit, just want to share
> > > > > > > > > the intermediate findings.
> > > > > > > >
> > > > > > > > CONFIG_DEBUG_LIST with this merge commit somehow triggers this issue.
> > > > > > > > Any ideas?
> > > > > >
> > > > > > > Dropping CONFIG_QUOTA* helps as well.
> > > > > >
> > > > > > More precisely it's enough to drop either from CONFIG_DEBUG_LIST and CONFIG_QUOTA
> > > > > > to make it boot again.
> > > > > >
> > > > > > And I'm done for today.
> > > > >
> > > > > OK, thanks for debugging! So can you perhaps enable CONFIG_DEBUG_LIST
> > > > > permanently in your kernel config and then bisect through the quota changes
> > > > > in the merge? My guess is commit dabc8b20756 ("quota: fix dqput() to follow
> > > > > the guarantees dquot_srcu should provide") might be the culprit given your
> > > > > testing but I fail to see how given I don't expect any quotas to be used
> > > > > during boot of your platform... BTW, there's also fixup: 869b6ea160
> > > > > ("quota: Fix slow quotaoff") merged last week so you could try testing a
> > > > > kernel after this fix to see whether it changes anything.
> > > >
> > > > It's exactly what my initial report is about, CONFIG_DEBUG_LIST was there
> > > > always with CONFIG_QUOTA as well.
> > >
> > > Ah, ok.
> > >
> > > > Two bisections (v6.5 .. v6.6-rc1 & something...v6.6-rc6) pointed out to
> > > > merge commit!
> > >
> > > I thought CONFIG_DEBUG_LIST arrived through one path, some problematic
> > > quota change arrived through another path and because they cause problems
> > > only together, then bisecting to the merge would be exactly the outcome.
> > > Alas that doesn't seem to be the case :-|.
> > >
> > > > I _had_ tried to simply revert the quota changes (I haven't
> > > > said about that before) and it didn't help. I'm so puzzled with all this.
> > >
> > > Aha, OK. If even reverting quota changes doesn't help, then it's really
> > > weird...
> >
> > Lemme to confirm that, it might be that I forgot to update configuration in
> > between.
>
> So, what I have done so far.
> 1) I have cleaned ccaches and stuff as I used it to avoid collisions;
> 2) I have confirmed that CONFIG_DEBUG_LIST affects boot, the repo
> I'm using is published here [0][1];
> 3) reverted quota patches until before this merge ([2] - last patch),
> still boots;
> 4) reverted disabling of CONFIG_DEBUG_LIST [2], doesn't boot;
> 5) okay, rebased on top of merge, i.e. 1500e7e0726e, with DEBUG_LIST [3],
> doesn't boot;
> 6) rebased [3] on one merge before, i.e. 63580f669d7f [4], voil? -- it boots!;
>
> And (tadaam!) I have had an idea for a while to replace GCC with LLVM
> (at least for this test), so [0] boots as well!
>
> So, this merge triggered a bug in GCC, seems like... And it's _the_ merge
> commit, which is so-o weird!
>
> $ gcc --version
> gcc (Debian 13.2.0-4) 13.2.0
> Copyright (C) 2023 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

For the cleaner experiment I took [0], cleaned $OUTPUT and ccache (fully)
and build the kernel with

$ clang --version
Debian clang version 16.0.6 (15)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

it boots,

Then I have done the same with GCC (see above the version)
and it does not boot.

[0]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-dbg-list/
[1]: https://bitbucket.org/andy-shev/linux/src/test-mrfld/
[2]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-no-quota-dbg-list/
[3]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-after-merge-dbg-list/
[4]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-before-merge/

--
With Best Regards,
Andy Shevchenko


2023-10-19 16:42:49

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 05:44:30PM +0300, Andy Shevchenko wrote:
> So, what I have done so far.
> 1) I have cleaned ccaches and stuff as I used it to avoid collisions;
> 2) I have confirmed that CONFIG_DEBUG_LIST affects boot, the repo
> I'm using is published here [0][1];
> 3) reverted quota patches until before this merge ([2] - last patch),
> still boots;
> 4) reverted disabling of CONFIG_DEBUG_LIST [2], doesn't boot;
> 5) okay, rebased on top of merge, i.e. 1500e7e0726e, with DEBUG_LIST [3],
> doesn't boot;
> 6) rebased [3] on one merge before, i.e. 63580f669d7f [4], voilà -- it boots!;
>
> And (tadaam!) I have had an idea for a while to replace GCC with LLVM
> (at least for this test), so [0] boots as well!
>
> So, this merge triggered a bug in GCC, seems like... And it's _the_ merge
> commit, which is so-o weird!

I'm not really a compiler person, but IMO it's highly unlikely to be a
GCC bug unless you can point to the bad code generation.

If CONFIG_DEBUG_LIST is triggering it, it's most likely some kind of
memory corruption, in which case seemingly random events can trigger the
detection of it (or lack thereof).

Any chance it boots with the following?

diff --git a/include/linux/bug.h b/include/linux/bug.h
index 348acf2558f3..29e9e3498902 100644
--- a/include/linux/bug.h
+++ b/include/linux/bug.h
@@ -84,7 +84,7 @@ static inline __must_check bool check_data_corruption(bool v) { return v; }
if (corruption) { \
if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) { \
pr_err(fmt, ##__VA_ARGS__); \
- BUG(); \
+ WARN_ON(1); \
} else \
WARN(1, fmt, ##__VA_ARGS__); \
} \
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index ae1b541446c9..395c4f5d8aa6 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -25,7 +25,7 @@
#endif

#ifdef CONFIG_DEBUG_LIST
-#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#define LIST_BL_BUG_ON(x) WARN_ON(x)
#else
#define LIST_BL_BUG_ON(x)
#endif

2023-10-19 17:06:26

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 09:42:40AM -0700, Josh Poimboeuf wrote:
> On Thu, Oct 19, 2023 at 05:44:30PM +0300, Andy Shevchenko wrote:
> > So, what I have done so far.
> > 1) I have cleaned ccaches and stuff as I used it to avoid collisions;
> > 2) I have confirmed that CONFIG_DEBUG_LIST affects boot, the repo
> > I'm using is published here [0][1];
> > 3) reverted quota patches until before this merge ([2] - last patch),
> > still boots;
> > 4) reverted disabling of CONFIG_DEBUG_LIST [2], doesn't boot;
> > 5) okay, rebased on top of merge, i.e. 1500e7e0726e, with DEBUG_LIST [3],
> > doesn't boot;
> > 6) rebased [3] on one merge before, i.e. 63580f669d7f [4], voil? -- it boots!;
> >
> > And (tadaam!) I have had an idea for a while to replace GCC with LLVM
> > (at least for this test), so [0] boots as well!
> >
> > So, this merge triggered a bug in GCC, seems like... And it's _the_ merge
> > commit, which is so-o weird!
>
> I'm not really a compiler person, but IMO it's highly unlikely to be a
> GCC bug unless you can point to the bad code generation.

Hmm... Then what's the difference between clang and GCC on the very same source
code? One of them has a bug in my opinion.

> If CONFIG_DEBUG_LIST is triggering it, it's most likely some kind of
> memory corruption, in which case seemingly random events can trigger the
> detection of it (or lack thereof).

Note disabling QUOTA has the same effect, so if it's a corruption it happens
somewhere there.

> Any chance it boots with the following?

Nope, no luck.

--
With Best Regards,
Andy Shevchenko


2023-10-19 17:26:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, 19 Oct 2023 at 10:05, Andy Shevchenko
<[email protected]> wrote:
>
> Hmm... Then what's the difference between clang and GCC on the very same source
> code? One of them has a bug in my opinion.

Compiler bugs do happen, but they are quite rare (happily).

It's almost certainly just ambiguous code that happens to work with
one code generation, and not another.

It might be as simple as just hitting a timing bug, but considering
how consistent it is for you (with a particular config), it's more
likely to be something like an optimization that just happens to
trigger some subtle ordering requirement or other.

So then the "different compiler" is really just largely equivalent to
"different optimization options", and sometimes that causes problems.

That said, the quota dependency is quite odd, since normally I
wouldn't expect the quota code to really even trigger much during
boot. When it triggers that consistently, and that early during boot,
I would expect others to have reported more of this.

Strange.

Linus

2023-10-19 17:51:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, 19 Oct 2023 at 10:26, Linus Torvalds
<[email protected]> wrote:
>
> That said, the quota dependency is quite odd, since normally I
> wouldn't expect the quota code to really even trigger much during
> boot. When it triggers that consistently, and that early during boot,
> I would expect others to have reported more of this.
>
> Strange.

Hmm. I do think the quota list handling has some odd things going on.
And it did change with the whole ->dq_free thing.

Some of it is just bad:

#ifdef CONFIG_QUOTA_DEBUG
/* sanity check */
BUG_ON(!list_empty(&dquot->dq_free));
#endif

is done under a spinlock, and if it ever triggers, the machine is
dead. Dammit, I *hate* how people use BUG_ON() for assertions. It's a
disgrace. That should be a WARN_ON_ONCE().

And it does have quite a bit of list-related changes, with the whole
series from Baokun Li changing how the ->dq_free list works.

The fact that it consistently bisects to the merge is still odd.

Linus

2023-10-19 18:10:39

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 10:51:18AM -0700, Linus Torvalds wrote:
> On Thu, 19 Oct 2023 at 10:26, Linus Torvalds
> <[email protected]> wrote:
> >
> > That said, the quota dependency is quite odd, since normally I
> > wouldn't expect the quota code to really even trigger much during
> > boot. When it triggers that consistently, and that early during boot,
> > I would expect others to have reported more of this.
> >
> > Strange.
>
> Hmm. I do think the quota list handling has some odd things going on.
> And it did change with the whole ->dq_free thing.
>
> Some of it is just bad:
>
> #ifdef CONFIG_QUOTA_DEBUG
> /* sanity check */
> BUG_ON(!list_empty(&dquot->dq_free));
> #endif
>
> is done under a spinlock, and if it ever triggers, the machine is
> dead. Dammit, I *hate* how people use BUG_ON() for assertions. It's a
> disgrace. That should be a WARN_ON_ONCE().

In my configuration

CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y

> And it does have quite a bit of list-related changes, with the whole
> series from Baokun Li changing how the ->dq_free list works.
>
> The fact that it consistently bisects to the merge is still odd.

Exactly! Imre suggested to test the merge point itself, so
far I tested the result of the merge in the upstream, but not
the branch/tag that has been merged.

Let's see if I have time this week for that. This hunting is a bit exhaustive.

--
With Best Regards,
Andy Shevchenko


2023-10-19 18:17:27

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 09:10:25PM +0300, Andy Shevchenko wrote:
> On Thu, Oct 19, 2023 at 10:51:18AM -0700, Linus Torvalds wrote:
> > On Thu, 19 Oct 2023 at 10:26, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > That said, the quota dependency is quite odd, since normally I
> > > wouldn't expect the quota code to really even trigger much during
> > > boot. When it triggers that consistently, and that early during boot,
> > > I would expect others to have reported more of this.
> > >
> > > Strange.
> >
> > Hmm. I do think the quota list handling has some odd things going on.
> > And it did change with the whole ->dq_free thing.
> >
> > Some of it is just bad:
> >
> > #ifdef CONFIG_QUOTA_DEBUG
> > /* sanity check */
> > BUG_ON(!list_empty(&dquot->dq_free));
> > #endif
> >
> > is done under a spinlock, and if it ever triggers, the machine is
> > dead. Dammit, I *hate* how people use BUG_ON() for assertions. It's a
> > disgrace. That should be a WARN_ON_ONCE().
>
> In my configuration
>
> CONFIG_QUOTA=y
> CONFIG_QUOTA_NETLINK_INTERFACE=y
> # CONFIG_QUOTA_DEBUG is not set
> CONFIG_QUOTA_TREE=y
> # CONFIG_QFMT_V1 is not set
> CONFIG_QFMT_V2=y
> CONFIG_QUOTACTL=y
>
> > And it does have quite a bit of list-related changes, with the whole
> > series from Baokun Li changing how the ->dq_free list works.
> >
> > The fact that it consistently bisects to the merge is still odd.
>
> Exactly! Imre suggested to test the merge point itself, so
> far I tested the result of the merge in the upstream, but not
> the branch/tag that has been merged.
>
> Let's see if I have time this week for that. This hunting is a bit exhaustive.

Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
makes that merge affect another (not related to the main contents of the merge)
files? Like upstream has one base, the merge has another which is older/newer
in the history?

--
With Best Regards,
Andy Shevchenko


2023-10-19 18:44:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
<[email protected]> wrote:
>
> Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
> makes that merge affect another (not related to the main contents of the merge)
> files? Like upstream has one base, the merge has another which is older/newer
> in the history?

I already looked at any obvious case of that.

The only quota-related issue on the other side is an obvious
one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
operation structures instead of ->quota_read and ->quota_write
callbacks").

It didn't affect the merge, because it was not related to any of the
changes that came in from the quota branch (it was physically close to
the change that used lockdep_assert_held_write() instead of a
WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
it).

I guess you could try reverting that one-liner after the merge, but I
_really_ don't think it is at all relevant.

What *would* probably be interesting is to start at the pre-merge
state, and rebase the code that got merged in. And then bisect *that*
series.

IOW, with the merge that triggers your bisection being commit
1500e7e0726e, do perhaps something like this:

# Name the states before the merge
git branch pre-merge 1500e7e0726e^
git branch jan-state 1500e7e0726e^2

# Now double-check that this works for you, of course.
# Just to be safe...
git checkout pre-merge
.. test-build and test-boot this with the bad config ..

# Then, let's create a new branch that is
# the rebased version of Jan's state:
git checkout -b jan-rebased jan-state
git rebase pre-merge

# Verify that the tree is the same as the merge
git diff 1500e7e0726e

# Ok, that was empty, so do a bisect on this
# rebased history
git bisect start
git bisect bad
git bisect good pre-merge

.. and see what commit it *now* claims is the bad commit.

Would you be willing to do this? It should be only a few bisects,
since Jan's branch only brought in 17 commits that the above rebases
into this test branch. So four or five bisections should pinpoint the
exact point where it goes bad.

Of course, since this is apparently about some "random code generation
issue", that exact point still may not be very interesting.

Linus

2023-10-20 11:07:58

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu 19-10-23 10:51:18, Linus Torvalds wrote:
> On Thu, 19 Oct 2023 at 10:26, Linus Torvalds
> <[email protected]> wrote:
> >
> > That said, the quota dependency is quite odd, since normally I
> > wouldn't expect the quota code to really even trigger much during
> > boot. When it triggers that consistently, and that early during boot,
> > I would expect others to have reported more of this.
> >
> > Strange.
>
> Hmm. I do think the quota list handling has some odd things going on.
> And it did change with the whole ->dq_free thing.
>
> Some of it is just bad:
>
> #ifdef CONFIG_QUOTA_DEBUG
> /* sanity check */
> BUG_ON(!list_empty(&dquot->dq_free));
> #endif
>
> is done under a spinlock, and if it ever triggers, the machine is
> dead. Dammit, I *hate* how people use BUG_ON() for assertions. It's a
> disgrace. That should be a WARN_ON_ONCE().

I agree. I should go one day and replace these BUG_ONs. This one is from
2006 when we were more accepting of such checks...

> And it does have quite a bit of list-related changes, with the whole
> series from Baokun Li changing how the ->dq_free list works.
>
> The fact that it consistently bisects to the merge is still odd.

Yes. Plus Andy said that when he reverts quota changes on top of the merge
commit the resulting kernel still crashes. Really puzzling.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-20 14:52:24

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
> On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
> <[email protected]> wrote:
> >
> > Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
> > makes that merge affect another (not related to the main contents of the merge)
> > files? Like upstream has one base, the merge has another which is older/newer
> > in the history?
>
> I already looked at any obvious case of that.
>
> The only quota-related issue on the other side is an obvious
> one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
> operation structures instead of ->quota_read and ->quota_write
> callbacks").
>
> It didn't affect the merge, because it was not related to any of the
> changes that came in from the quota branch (it was physically close to
> the change that used lockdep_assert_held_write() instead of a
> WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
> it).
>
> I guess you could try reverting that one-liner after the merge, but I
> _really_ don't think it is at all relevant.
>
> What *would* probably be interesting is to start at the pre-merge
> state, and rebase the code that got merged in. And then bisect *that*
> series.
>
> IOW, with the merge that triggers your bisection being commit
> 1500e7e0726e, do perhaps something like this:
>
> # Name the states before the merge
> git branch pre-merge 1500e7e0726e^
> git branch jan-state 1500e7e0726e^2
>
> # Now double-check that this works for you, of course.
> # Just to be safe...
> git checkout pre-merge
> .. test-build and test-boot this with the bad config ..

That's I have checked already [4], but okay, let me double check.
[5] is the same as [4] according to `git diff`.

It boots.

> # Then, let's create a new branch that is
> # the rebased version of Jan's state:
> git checkout -b jan-rebased jan-state
> git rebase pre-merge

[6] is created.

> # Verify that the tree is the same as the merge
> git diff 1500e7e0726e

Yes, nothing in output.

And it does not boot.

> # Ok, that was empty, so do a bisect on this
> # rebased history
> git bisect start
> git bisect bad
> git bisect good pre-merge
>
> .. and see what commit it *now* claims is the bad commit.

git bisect start
# status: waiting for both good and bad commits
# good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
# status: waiting for bad commit, 1 good commit known
# bad: [2447ff4196091e41d385635f9b6d003119f24199] ext2: Fix kernel-doc warnings
git bisect bad 2447ff4196091e41d385635f9b6d003119f24199
# bad: [a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6] MAINTAINERS: change reiserfs status to obsolete
git bisect bad a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6
# bad: [74fdc82e4a4302bf8a519101a40691b78d9beb6c] quota: add new helper dquot_active()
git bisect bad 74fdc82e4a4302bf8a519101a40691b78d9beb6c
# bad: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
git bisect bad e64db1c50eb5d3be2187b56d32ec39e56b739845
# good: [eea7e964642984753768ddbb710e2eefd32c7a89] ext2: remove redundant assignment to variable desc and variable best_desc
git bisect good eea7e964642984753768ddbb710e2eefd32c7a89
# first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()

> Would you be willing to do this? It should be only a few bisects,
> since Jan's branch only brought in 17 commits that the above rebases
> into this test branch. So four or five bisections should pinpoint the
> exact point where it goes bad.

See above.

I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
boot, so we found the culprit that triggers this issue.

> Of course, since this is apparently about some "random code generation
> issue", that exact point still may not be very interesting.

On top of the above I have tried the following:
1) dropping inline, replacing it to __always_inline -- no help;
2) commenting out the error message -- helps!

--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
{
int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
if (ret < 0) {
+#if 0
quota_error(dquot->dq_sb, "Can't write quota structure "
"(error %d). Quota may get out of sync!", ret);
+#endif
/* Clear dirty bit anyway to avoid infinite loop. */
clear_dquot_dirty(dquot);
}

If it's a timing issue it's related to that error message, as the new helper is
run outside of the spinlock.

What's is fishy there besides the error message being available only in one
case, is the pointer that is used for dp_op. I'm not at all familiar with the
code, but can it be that these superblocks are different for those two cases?

[4]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-before-merge/
[5]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-pre-merge/
[6]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-jan-rebased/


--
With Best Regards,
Andy Shevchenko


2023-10-20 15:06:56

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

+Cc: Baokun, the author of the patch.

On Fri, Oct 20, 2023 at 05:51:59PM +0300, Andy Shevchenko wrote:
> On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
> > On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
> > <[email protected]> wrote:
> > >
> > > Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
> > > makes that merge affect another (not related to the main contents of the merge)
> > > files? Like upstream has one base, the merge has another which is older/newer
> > > in the history?
> >
> > I already looked at any obvious case of that.
> >
> > The only quota-related issue on the other side is an obvious
> > one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
> > operation structures instead of ->quota_read and ->quota_write
> > callbacks").
> >
> > It didn't affect the merge, because it was not related to any of the
> > changes that came in from the quota branch (it was physically close to
> > the change that used lockdep_assert_held_write() instead of a
> > WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
> > it).
> >
> > I guess you could try reverting that one-liner after the merge, but I
> > _really_ don't think it is at all relevant.
> >
> > What *would* probably be interesting is to start at the pre-merge
> > state, and rebase the code that got merged in. And then bisect *that*
> > series.
> >
> > IOW, with the merge that triggers your bisection being commit
> > 1500e7e0726e, do perhaps something like this:
> >
> > # Name the states before the merge
> > git branch pre-merge 1500e7e0726e^
> > git branch jan-state 1500e7e0726e^2
> >
> > # Now double-check that this works for you, of course.
> > # Just to be safe...
> > git checkout pre-merge
> > .. test-build and test-boot this with the bad config ..
>
> That's I have checked already [4], but okay, let me double check.
> [5] is the same as [4] according to `git diff`.
>
> It boots.
>
> > # Then, let's create a new branch that is
> > # the rebased version of Jan's state:
> > git checkout -b jan-rebased jan-state
> > git rebase pre-merge
>
> [6] is created.
>
> > # Verify that the tree is the same as the merge
> > git diff 1500e7e0726e
>
> Yes, nothing in output.
>
> And it does not boot.
>
> > # Ok, that was empty, so do a bisect on this
> > # rebased history
> > git bisect start
> > git bisect bad
> > git bisect good pre-merge
> >
> > .. and see what commit it *now* claims is the bad commit.
>
> git bisect start
> # status: waiting for both good and bad commits
> # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> # status: waiting for bad commit, 1 good commit known
> # bad: [2447ff4196091e41d385635f9b6d003119f24199] ext2: Fix kernel-doc warnings
> git bisect bad 2447ff4196091e41d385635f9b6d003119f24199
> # bad: [a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6] MAINTAINERS: change reiserfs status to obsolete
> git bisect bad a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6
> # bad: [74fdc82e4a4302bf8a519101a40691b78d9beb6c] quota: add new helper dquot_active()
> git bisect bad 74fdc82e4a4302bf8a519101a40691b78d9beb6c
> # bad: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
> git bisect bad e64db1c50eb5d3be2187b56d32ec39e56b739845
> # good: [eea7e964642984753768ddbb710e2eefd32c7a89] ext2: remove redundant assignment to variable desc and variable best_desc
> git bisect good eea7e964642984753768ddbb710e2eefd32c7a89
> # first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
>
> > Would you be willing to do this? It should be only a few bisects,
> > since Jan's branch only brought in 17 commits that the above rebases
> > into this test branch. So four or five bisections should pinpoint the
> > exact point where it goes bad.
>
> See above.
>
> I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
> boot, so we found the culprit that triggers this issue.
>
> > Of course, since this is apparently about some "random code generation
> > issue", that exact point still may not be very interesting.
>
> On top of the above I have tried the following:
> 1) dropping inline, replacing it to __always_inline -- no help;
> 2) commenting out the error message -- helps!
>
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> {
> int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> if (ret < 0) {
> +#if 0
> quota_error(dquot->dq_sb, "Can't write quota structure "
> "(error %d). Quota may get out of sync!", ret);
> +#endif
> /* Clear dirty bit anyway to avoid infinite loop. */
> clear_dquot_dirty(dquot);
> }
>
> If it's a timing issue it's related to that error message, as the new helper is
> run outside of the spinlock.
>
> What's is fishy there besides the error message being available only in one
> case, is the pointer that is used for dp_op. I'm not at all familiar with the
> code, but can it be that these superblocks are different for those two cases?
>
> [4]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-before-merge/
> [5]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-pre-merge/
> [6]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-jan-rebased/
>
>
> --
> With Best Regards,
> Andy Shevchenko
>
>

--
With Best Regards,
Andy Shevchenko


2023-10-20 15:13:02

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri, Oct 20, 2023 at 06:06:22PM +0300, Andy Shevchenko wrote:
> On Fri, Oct 20, 2023 at 05:51:59PM +0300, Andy Shevchenko wrote:
> > On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
> > > On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
> > > <[email protected]> wrote:
> > > >
> > > > Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
> > > > makes that merge affect another (not related to the main contents of the merge)
> > > > files? Like upstream has one base, the merge has another which is older/newer
> > > > in the history?
> > >
> > > I already looked at any obvious case of that.
> > >
> > > The only quota-related issue on the other side is an obvious
> > > one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
> > > operation structures instead of ->quota_read and ->quota_write
> > > callbacks").
> > >
> > > It didn't affect the merge, because it was not related to any of the
> > > changes that came in from the quota branch (it was physically close to
> > > the change that used lockdep_assert_held_write() instead of a
> > > WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
> > > it).
> > >
> > > I guess you could try reverting that one-liner after the merge, but I
> > > _really_ don't think it is at all relevant.
> > >
> > > What *would* probably be interesting is to start at the pre-merge
> > > state, and rebase the code that got merged in. And then bisect *that*
> > > series.
> > >
> > > IOW, with the merge that triggers your bisection being commit
> > > 1500e7e0726e, do perhaps something like this:
> > >
> > > # Name the states before the merge
> > > git branch pre-merge 1500e7e0726e^
> > > git branch jan-state 1500e7e0726e^2
> > >
> > > # Now double-check that this works for you, of course.
> > > # Just to be safe...
> > > git checkout pre-merge
> > > .. test-build and test-boot this with the bad config ..
> >
> > That's I have checked already [4], but okay, let me double check.
> > [5] is the same as [4] according to `git diff`.
> >
> > It boots.
> >
> > > # Then, let's create a new branch that is
> > > # the rebased version of Jan's state:
> > > git checkout -b jan-rebased jan-state
> > > git rebase pre-merge
> >
> > [6] is created.
> >
> > > # Verify that the tree is the same as the merge
> > > git diff 1500e7e0726e
> >
> > Yes, nothing in output.
> >
> > And it does not boot.
> >
> > > # Ok, that was empty, so do a bisect on this
> > > # rebased history
> > > git bisect start
> > > git bisect bad
> > > git bisect good pre-merge
> > >
> > > .. and see what commit it *now* claims is the bad commit.
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> > # status: waiting for bad commit, 1 good commit known
> > # bad: [2447ff4196091e41d385635f9b6d003119f24199] ext2: Fix kernel-doc warnings
> > git bisect bad 2447ff4196091e41d385635f9b6d003119f24199
> > # bad: [a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6] MAINTAINERS: change reiserfs status to obsolete
> > git bisect bad a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6
> > # bad: [74fdc82e4a4302bf8a519101a40691b78d9beb6c] quota: add new helper dquot_active()
> > git bisect bad 74fdc82e4a4302bf8a519101a40691b78d9beb6c
> > # bad: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
> > git bisect bad e64db1c50eb5d3be2187b56d32ec39e56b739845
> > # good: [eea7e964642984753768ddbb710e2eefd32c7a89] ext2: remove redundant assignment to variable desc and variable best_desc
> > git bisect good eea7e964642984753768ddbb710e2eefd32c7a89
> > # first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
> >
> > > Would you be willing to do this? It should be only a few bisects,
> > > since Jan's branch only brought in 17 commits that the above rebases
> > > into this test branch. So four or five bisections should pinpoint the
> > > exact point where it goes bad.
> >
> > See above.
> >
> > I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
> > boot, so we found the culprit that triggers this issue.
> >
> > > Of course, since this is apparently about some "random code generation
> > > issue", that exact point still may not be very interesting.
> >
> > On top of the above I have tried the following:
> > 1) dropping inline, replacing it to __always_inline -- no help;
> > 2) commenting out the error message -- helps!
> >
> > --- a/fs/quota/dquot.c
> > +++ b/fs/quota/dquot.c
> > @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> > {
> > int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> > if (ret < 0) {
> > +#if 0
> > quota_error(dquot->dq_sb, "Can't write quota structure "
> > "(error %d). Quota may get out of sync!", ret);
> > +#endif
> > /* Clear dirty bit anyway to avoid infinite loop. */
> > clear_dquot_dirty(dquot);
> > }

Doing the same on the my branch based on top of v6.6-rc6 does not help.
So looks like a race condition somewhere happening related to that dirty bit
(as comment states it needs to be cleaned to avoid infinite loop, that's
probably what happens).

> > If it's a timing issue it's related to that error message, as the new helper is
> > run outside of the spinlock.
> >
> > What's is fishy there besides the error message being available only in one
> > case, is the pointer that is used for dp_op. I'm not at all familiar with the
> > code, but can it be that these superblocks are different for those two cases?
> >
> > [4]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-before-merge/
> > [5]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-pre-merge/
> > [6]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-jan-rebased/

--
With Best Regards,
Andy Shevchenko


2023-10-20 17:24:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri, 20 Oct 2023 at 07:52, Andy Shevchenko
<[email protected]> wrote:
>
> # first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()

Ok, so commit 024128477809 ("quota: factor out dquot_write_dquot()") pre-rebase.

Which honestly seems entirely innocuous, and the only change seems to
be a slight massaging of the return value checking, in that it did a
"if (err)" ine one place before, now it does "if (err < 0)".

And the whole "now it always warns about errors", which used to happen
only in dqput() before.

Neither seems to be very relevant, which just reinforces that yes,
this looks like a timing thing.

> On top of the above I have tried the following:
> 1) dropping inline, replacing it to __always_inline -- no help;
> 2) commenting out the error message -- helps!
>
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> {
> int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> if (ret < 0) {
> +#if 0
> quota_error(dquot->dq_sb, "Can't write quota structure "
> "(error %d). Quota may get out of sync!", ret);
> +#endif
> /* Clear dirty bit anyway to avoid infinite loop. */
> clear_dquot_dirty(dquot);
> }

The only thing quota_error() does is the varags handling and a printk,
so yeah, all that #if 0" would do even if the error triggers (and it
presumably doesn't) is to change code generation around that point,
and change timing.

But what *is* interesting is that that commit that triggers it is
before all the other list-handling changes, so the fact that this was
triggered by that merge and that one commit, *all* that really
happened to trigger your boot failure is literally this:

git log 1500e7e0726e^..024128477809

(that "1500e7e0726e^" is the pre-merge state). So it's not that the
problem was introduced by one of the other list-handling changes and
then 024128477809 just happened to change the timing. No, it's
literally that one commit that moves code around, and that one
quota_error() printout that makes the problem show for you.

So it really looks like the bug is pre-existing. Or actually a
compiler problem that is introduced by the added call that changes
code generation, but honestly, that is a very unlikely thing.

That said - while unlikely, mind just sending me the *failing* copy of
the fs/quota/dquot.o object file, and I'll take a look at the code
around that call. I've looked at enough code generation issues that
it's worth trying..

Linus

2023-10-20 17:27:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri, 20 Oct 2023 at 08:12, Andy Shevchenko
<[email protected]> wrote:
>
> > > --- a/fs/quota/dquot.c
> > > +++ b/fs/quota/dquot.c
> > > @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> > > {
> > > int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> > > if (ret < 0) {
> > > +#if 0
> > > quota_error(dquot->dq_sb, "Can't write quota structure "
> > > "(error %d). Quota may get out of sync!", ret);
> > > +#endif
> > > /* Clear dirty bit anyway to avoid infinite loop. */
> > > clear_dquot_dirty(dquot);
> > > }
>
> Doing the same on the my branch based on top of v6.6-rc6 does not help.
> So looks like a race condition somewhere happening related to that dirty bit
> (as comment states it needs to be cleaned to avoid infinite loop, that's
> probably what happens).

Hmm. Normally, dirty bits should always be cleared *before* the
write-back, not after it. Otherwise you might lose a dirty event that
happened *during* writeback.

But I don't know the quota code.

... the fact that the #if 0 doesn't help in another case does say that
it's not the quota_error() call itself. Which it really couldn't have
been (apart from timing and compiler bugs), but it's still a data
point, I guess.

Linus

2023-10-20 18:06:42

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri 20-10-23 17:51:59, Andy Shevchenko wrote:
> On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
> > On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
> > <[email protected]> wrote:
> > >
> > > Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
> > > makes that merge affect another (not related to the main contents of the merge)
> > > files? Like upstream has one base, the merge has another which is older/newer
> > > in the history?
> >
> > I already looked at any obvious case of that.
> >
> > The only quota-related issue on the other side is an obvious
> > one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
> > operation structures instead of ->quota_read and ->quota_write
> > callbacks").
> >
> > It didn't affect the merge, because it was not related to any of the
> > changes that came in from the quota branch (it was physically close to
> > the change that used lockdep_assert_held_write() instead of a
> > WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
> > it).
> >
> > I guess you could try reverting that one-liner after the merge, but I
> > _really_ don't think it is at all relevant.
> >
> > What *would* probably be interesting is to start at the pre-merge
> > state, and rebase the code that got merged in. And then bisect *that*
> > series.
> >
> > IOW, with the merge that triggers your bisection being commit
> > 1500e7e0726e, do perhaps something like this:
> >
> > # Name the states before the merge
> > git branch pre-merge 1500e7e0726e^
> > git branch jan-state 1500e7e0726e^2
> >
> > # Now double-check that this works for you, of course.
> > # Just to be safe...
> > git checkout pre-merge
> > .. test-build and test-boot this with the bad config ..
>
> That's I have checked already [4], but okay, let me double check.
> [5] is the same as [4] according to `git diff`.
>
> It boots.
>
> > # Then, let's create a new branch that is
> > # the rebased version of Jan's state:
> > git checkout -b jan-rebased jan-state
> > git rebase pre-merge
>
> [6] is created.
>
> > # Verify that the tree is the same as the merge
> > git diff 1500e7e0726e
>
> Yes, nothing in output.
>
> And it does not boot.
>
> > # Ok, that was empty, so do a bisect on this
> > # rebased history
> > git bisect start
> > git bisect bad
> > git bisect good pre-merge
> >
> > .. and see what commit it *now* claims is the bad commit.
>
> git bisect start
> # status: waiting for both good and bad commits
> # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> # status: waiting for bad commit, 1 good commit known
> # bad: [2447ff4196091e41d385635f9b6d003119f24199] ext2: Fix kernel-doc warnings
> git bisect bad 2447ff4196091e41d385635f9b6d003119f24199
> # bad: [a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6] MAINTAINERS: change reiserfs status to obsolete
> git bisect bad a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6
> # bad: [74fdc82e4a4302bf8a519101a40691b78d9beb6c] quota: add new helper dquot_active()
> git bisect bad 74fdc82e4a4302bf8a519101a40691b78d9beb6c
> # bad: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
> git bisect bad e64db1c50eb5d3be2187b56d32ec39e56b739845
> # good: [eea7e964642984753768ddbb710e2eefd32c7a89] ext2: remove redundant assignment to variable desc and variable best_desc
> git bisect good eea7e964642984753768ddbb710e2eefd32c7a89
> # first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()

Interesting, but it's a data point :)

> > Would you be willing to do this? It should be only a few bisects,
> > since Jan's branch only brought in 17 commits that the above rebases
> > into this test branch. So four or five bisections should pinpoint the
> > exact point where it goes bad.
>
> See above.
>
> I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
> boot, so we found the culprit that triggers this issue.
>
> > Of course, since this is apparently about some "random code generation
> > issue", that exact point still may not be very interesting.
>
> On top of the above I have tried the following:
> 1) dropping inline, replacing it to __always_inline -- no help;
> 2) commenting out the error message -- helps!
>
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> {
> int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> if (ret < 0) {
> +#if 0
> quota_error(dquot->dq_sb, "Can't write quota structure "
> "(error %d). Quota may get out of sync!", ret);
> +#endif
> /* Clear dirty bit anyway to avoid infinite loop. */
> clear_dquot_dirty(dquot);
> }
>
> If it's a timing issue it's related to that error message, as the new helper is
> run outside of the spinlock.
>
> What's is fishy there besides the error message being available only in one
> case, is the pointer that is used for dp_op. I'm not at all familiar with the
> code, but can it be that these superblocks are different for those two cases?

dquot->dq_sb could be different from the sb we have only if the dirty list
would get corrupted in some way. Not impossible but does not seem too
likely. Let's first check whether there are any quotas in the first place.

I've asked this already but I don't think I've got an answer: What
filesystem type is the root filesystem? Does it have any quotas (either
there would be files like aquota.user, quota.user, aquota.group,
quota.group in / or there would be quota feature enabled - how to find that
out depends on the filesystem so once I know the fs type I can give you
instructions).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-20 18:10:26

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri 20-10-23 10:26:26, Linus Torvalds wrote:
> On Fri, 20 Oct 2023 at 08:12, Andy Shevchenko
> <[email protected]> wrote:
> >
> > > > --- a/fs/quota/dquot.c
> > > > +++ b/fs/quota/dquot.c
> > > > @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> > > > {
> > > > int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> > > > if (ret < 0) {
> > > > +#if 0
> > > > quota_error(dquot->dq_sb, "Can't write quota structure "
> > > > "(error %d). Quota may get out of sync!", ret);
> > > > +#endif
> > > > /* Clear dirty bit anyway to avoid infinite loop. */
> > > > clear_dquot_dirty(dquot);
> > > > }
> >
> > Doing the same on the my branch based on top of v6.6-rc6 does not help.
> > So looks like a race condition somewhere happening related to that dirty bit
> > (as comment states it needs to be cleaned to avoid infinite loop, that's
> > probably what happens).
>
> Hmm. Normally, dirty bits should always be cleared *before* the
> write-back, not after it. Otherwise you might lose a dirty event that
> happened *during* writeback.

Yes, and normally we clear the dirty bit in dquot_commit() before writing
the dquot. However if there is an error in fs-private ->write_dquot()
helper before calling back into dquot_commit() (e.g. ext4 fails to start a
transaction), ->write_dquot() can return without clearing the dirty bit.
For dqput() to not loop indefinitely trying to clean the dquot, we clear
the dirty bit here just to be sure in case of error.

> But I don't know the quota code.
>
> ... the fact that the #if 0 doesn't help in another case does say that
> it's not the quota_error() call itself. Which it really couldn't have
> been (apart from timing and compiler bugs), but it's still a data
> point, I guess.

Yeah, that's a bit weird. I'm really curious what the problem is.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-20 18:29:48

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri, Oct 20, 2023 at 10:23:57AM -0700, Linus Torvalds wrote:
> On Fri, 20 Oct 2023 at 07:52, Andy Shevchenko
> <[email protected]> wrote:

...

> That said - while unlikely, mind just sending me the *failing* copy of
> the fs/quota/dquot.o object file, and I'll take a look at the code
> around that call. I've looked at enough code generation issues that
> it's worth trying..

For the sake of purity, I have rebuilt the whole tree to confirm it doesn't boot.
Here [7] is what I got.

I'll reply to this with the attached object file, I assume it won't go to the
mailing list, but should be available in your mailbox.

[7]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-jr/

--
With Best Regards,
Andy Shevchenko


2023-10-20 18:32:29

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri, Oct 20, 2023 at 09:29:25PM +0300, Andy Shevchenko wrote:
> On Fri, Oct 20, 2023 at 10:23:57AM -0700, Linus Torvalds wrote:
> > On Fri, 20 Oct 2023 at 07:52, Andy Shevchenko
> > <[email protected]> wrote:

...

> > That said - while unlikely, mind just sending me the *failing* copy of
> > the fs/quota/dquot.o object file, and I'll take a look at the code
> > around that call. I've looked at enough code generation issues that
> > it's worth trying..
>
> For the sake of purity, I have rebuilt the whole tree to confirm it doesn't boot.
> Here [7] is what I got.
>
> I'll reply to this with the attached object file, I assume it won't go to the
> mailing list, but should be available in your mailbox.
>
> [7]: https://bitbucket.org/andy-shev/linux/src/test-mrfld-jr/

Files: failed and with ifdeffery.

--
With Best Regards,
Andy Shevchenko



Attachments:
(No filename) (894.00 B)
dquot.o (76.55 kB)
dquot+ifdef0.o (76.18 kB)
Download all attachments

2023-10-20 19:44:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Fri, 20 Oct 2023 at 11:29, Andy Shevchenko
<[email protected]> wrote:
>
> I'll reply to this with the attached object file, I assume it won't go to the
> mailing list, but should be available in your mailbox.

Honestly, both cases (that function gets inlined twice) look
*identical* from a quick look, apart from obviously the extra call to
__quota_error().

I might be missing something, but this most definitely is not a "gcc
ends up creating very different code when it doesn't need to
synchronize around the call" thing.

So a compiler issue looks very unlikely. No absolute guarantees - I
didn't do *that* kind of walk-through instruction by instruction - but
the results actually seem to line up perfectly.

Even register allocation didn't change, making the compare between #if
0 and without rather easy.

There's one extra spill/reload due to the call in the "non-#if0" case,
and that actually made me look twice (because it spilled %eax, and
then reloaded it as %rcx), but it turns that %eax/%ecx had the same
value at the time of the spill, so even that was not a "real"
difference.

So I will claim that no, it's not the compiler. It's some unrelated
subtle timing, or possibly just a random code layout issue (because
the code addresses do obviously change).

Linus

2023-10-20 20:31:18

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

Fri, Oct 20, 2023 at 08:05:47PM +0200, Jan Kara kirjoitti:
> On Fri 20-10-23 17:51:59, Andy Shevchenko wrote:
> > On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
> > > On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
> > > <[email protected]> wrote:
> > > >
> > > > Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
> > > > makes that merge affect another (not related to the main contents of the merge)
> > > > files? Like upstream has one base, the merge has another which is older/newer
> > > > in the history?
> > >
> > > I already looked at any obvious case of that.
> > >
> > > The only quota-related issue on the other side is an obvious
> > > one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
> > > operation structures instead of ->quota_read and ->quota_write
> > > callbacks").
> > >
> > > It didn't affect the merge, because it was not related to any of the
> > > changes that came in from the quota branch (it was physically close to
> > > the change that used lockdep_assert_held_write() instead of a
> > > WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
> > > it).
> > >
> > > I guess you could try reverting that one-liner after the merge, but I
> > > _really_ don't think it is at all relevant.
> > >
> > > What *would* probably be interesting is to start at the pre-merge
> > > state, and rebase the code that got merged in. And then bisect *that*
> > > series.
> > >
> > > IOW, with the merge that triggers your bisection being commit
> > > 1500e7e0726e, do perhaps something like this:
> > >
> > > # Name the states before the merge
> > > git branch pre-merge 1500e7e0726e^
> > > git branch jan-state 1500e7e0726e^2
> > >
> > > # Now double-check that this works for you, of course.
> > > # Just to be safe...
> > > git checkout pre-merge
> > > .. test-build and test-boot this with the bad config ..
> >
> > That's I have checked already [4], but okay, let me double check.
> > [5] is the same as [4] according to `git diff`.
> >
> > It boots.
> >
> > > # Then, let's create a new branch that is
> > > # the rebased version of Jan's state:
> > > git checkout -b jan-rebased jan-state
> > > git rebase pre-merge
> >
> > [6] is created.
> >
> > > # Verify that the tree is the same as the merge
> > > git diff 1500e7e0726e
> >
> > Yes, nothing in output.
> >
> > And it does not boot.
> >
> > > # Ok, that was empty, so do a bisect on this
> > > # rebased history
> > > git bisect start
> > > git bisect bad
> > > git bisect good pre-merge
> > >
> > > .. and see what commit it *now* claims is the bad commit.
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
> > # status: waiting for bad commit, 1 good commit known
> > # bad: [2447ff4196091e41d385635f9b6d003119f24199] ext2: Fix kernel-doc warnings
> > git bisect bad 2447ff4196091e41d385635f9b6d003119f24199
> > # bad: [a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6] MAINTAINERS: change reiserfs status to obsolete
> > git bisect bad a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6
> > # bad: [74fdc82e4a4302bf8a519101a40691b78d9beb6c] quota: add new helper dquot_active()
> > git bisect bad 74fdc82e4a4302bf8a519101a40691b78d9beb6c
> > # bad: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
> > git bisect bad e64db1c50eb5d3be2187b56d32ec39e56b739845
> > # good: [eea7e964642984753768ddbb710e2eefd32c7a89] ext2: remove redundant assignment to variable desc and variable best_desc
> > git bisect good eea7e964642984753768ddbb710e2eefd32c7a89
> > # first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
>
> Interesting, but it's a data point :)
>
> > > Would you be willing to do this? It should be only a few bisects,
> > > since Jan's branch only brought in 17 commits that the above rebases
> > > into this test branch. So four or five bisections should pinpoint the
> > > exact point where it goes bad.
> >
> > See above.
> >
> > I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
> > boot, so we found the culprit that triggers this issue.
> >
> > > Of course, since this is apparently about some "random code generation
> > > issue", that exact point still may not be very interesting.
> >
> > On top of the above I have tried the following:
> > 1) dropping inline, replacing it to __always_inline -- no help;
> > 2) commenting out the error message -- helps!
> >
> > --- a/fs/quota/dquot.c
> > +++ b/fs/quota/dquot.c
> > @@ -632,8 +632,10 @@ static inline int dquot_write_dquot(struct dquot *dquot)
> > {
> > int ret = dquot->dq_sb->dq_op->write_dquot(dquot);
> > if (ret < 0) {
> > +#if 0
> > quota_error(dquot->dq_sb, "Can't write quota structure "
> > "(error %d). Quota may get out of sync!", ret);
> > +#endif
> > /* Clear dirty bit anyway to avoid infinite loop. */
> > clear_dquot_dirty(dquot);
> > }
> >
> > If it's a timing issue it's related to that error message, as the new helper is
> > run outside of the spinlock.
> >
> > What's is fishy there besides the error message being available only in one
> > case, is the pointer that is used for dp_op. I'm not at all familiar with the
> > code, but can it be that these superblocks are different for those two cases?
>
> dquot->dq_sb could be different from the sb we have only if the dirty list
> would get corrupted in some way. Not impossible but does not seem too
> likely. Let's first check whether there are any quotas in the first place.
>
> I've asked this already but I don't think I've got an answer: What
> filesystem type is the root filesystem? Does it have any quotas (either
> there would be files like aquota.user, quota.user, aquota.group,
> quota.group in / or there would be quota feature enabled - how to find that
> out depends on the filesystem so once I know the fs type I can give you
> instructions).

Oh, indeed. Sorry that I missed that. Kernel configuration you asked for is
available via bitbacket (see some messages above in the thread), I'm using
`make x86_64_defconfig` to get .config out of that.

The root filesystem is initramfs (cpio I suppose). I dunno if that ever
supported quota.

--
With Best Regards,
Andy Shevchenko


2023-10-20 20:36:47

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

Fri, Oct 20, 2023 at 12:43:56PM -0700, Linus Torvalds kirjoitti:
> On Fri, 20 Oct 2023 at 11:29, Andy Shevchenko
> <[email protected]> wrote:
> >
> > I'll reply to this with the attached object file, I assume it won't go to the
> > mailing list, but should be available in your mailbox.
>
> Honestly, both cases (that function gets inlined twice) look
> *identical* from a quick look, apart from obviously the extra call to
> __quota_error().
>
> I might be missing something, but this most definitely is not a "gcc
> ends up creating very different code when it doesn't need to
> synchronize around the call" thing.
>
> So a compiler issue looks very unlikely. No absolute guarantees - I
> didn't do *that* kind of walk-through instruction by instruction - but
> the results actually seem to line up perfectly.
>
> Even register allocation didn't change, making the compare between #if
> 0 and without rather easy.
>
> There's one extra spill/reload due to the call in the "non-#if0" case,
> and that actually made me look twice (because it spilled %eax, and
> then reloaded it as %rcx), but it turns that %eax/%ecx had the same
> value at the time of the spill, so even that was not a "real"
> difference.
>
> So I will claim that no, it's not the compiler. It's some unrelated
> subtle timing, or possibly just a random code layout issue (because
> the code addresses do obviously change).

Okay, but since now I can't use the certain configuration, the bug is
persistent to me after this merge with the GCC. Yet, you mentioned that
you would expect some reports but I don't think many people have a
configuration similar to what I have. In any case a bug is lurking somewhere
there.

Let me check next week on different CPU (but I'm quite sceptical that it
may anyhow trigger the same behaviour as if it's a timing, many parameters
are involved, including hardware clocks, etc).

That said, if you or anyone has ideas how to debug futher, I'm all ears!

--
With Best Regards,
Andy Shevchenko


2023-10-21 01:48:50

by Baokun Li

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On 2023/10/20 23:06, Andy Shevchenko wrote:
> +Cc: Baokun, the author of the patch.
>
> On Fri, Oct 20, 2023 at 05:51:59PM +0300, Andy Shevchenko wrote:
>> On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
>>> On Thu, 19 Oct 2023 at 11:16, Andy Shevchenko
>>> <[email protected]> wrote:
>>>> Meanwhile a wild idea, can it be some git (automatic) conflict resolution that
>>>> makes that merge affect another (not related to the main contents of the merge)
>>>> files? Like upstream has one base, the merge has another which is older/newer
>>>> in the history?
>>> I already looked at any obvious case of that.
>>>
>>> The only quota-related issue on the other side is an obvious
>>> one-liner: commit 86be6b8bd834 ("quota: Check presence of quota
>>> operation structures instead of ->quota_read and ->quota_write
>>> callbacks").
>>>
>>> It didn't affect the merge, because it was not related to any of the
>>> changes that came in from the quota branch (it was physically close to
>>> the change that used lockdep_assert_held_write() instead of a
>>> WARN_ON_ONCE(down_read_trylock()) sequence, but it is unrelated to
>>> it).
>>>
>>> I guess you could try reverting that one-liner after the merge, but I
>>> _really_ don't think it is at all relevant.
>>>
>>> What *would* probably be interesting is to start at the pre-merge
>>> state, and rebase the code that got merged in. And then bisect *that*
>>> series.
>>>
>>> IOW, with the merge that triggers your bisection being commit
>>> 1500e7e0726e, do perhaps something like this:
>>>
>>> # Name the states before the merge
>>> git branch pre-merge 1500e7e0726e^
>>> git branch jan-state 1500e7e0726e^2
>>>
>>> # Now double-check that this works for you, of course.
>>> # Just to be safe...
>>> git checkout pre-merge
>>> .. test-build and test-boot this with the bad config ..
>> That's I have checked already [4], but okay, let me double check.
>> [5] is the same as [4] according to `git diff`.
>>
>> It boots.
>>
>>> # Then, let's create a new branch that is
>>> # the rebased version of Jan's state:
>>> git checkout -b jan-rebased jan-state
>>> git rebase pre-merge
>> [6] is created.
>>
>>> # Verify that the tree is the same as the merge
>>> git diff 1500e7e0726e
>> Yes, nothing in output.
>>
>> And it does not boot.
>>
>>> # Ok, that was empty, so do a bisect on this
>>> # rebased history
>>> git bisect start
>>> git bisect bad
>>> git bisect good pre-merge
>>>
>>> .. and see what commit it *now* claims is the bad commit.
>> git bisect start
>> # status: waiting for both good and bad commits
>> # good: [63580f669d7ff5aa5a1fa2e3994114770a491722] Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
>> git bisect good 63580f669d7ff5aa5a1fa2e3994114770a491722
>> # status: waiting for bad commit, 1 good commit known
>> # bad: [2447ff4196091e41d385635f9b6d003119f24199] ext2: Fix kernel-doc warnings
>> git bisect bad 2447ff4196091e41d385635f9b6d003119f24199
>> # bad: [a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6] MAINTAINERS: change reiserfs status to obsolete
>> git bisect bad a7c4109a1fa7f9f8cfa9aa93e7aae52d0df820f6
>> # bad: [74fdc82e4a4302bf8a519101a40691b78d9beb6c] quota: add new helper dquot_active()
>> git bisect bad 74fdc82e4a4302bf8a519101a40691b78d9beb6c
>> # bad: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
>> git bisect bad e64db1c50eb5d3be2187b56d32ec39e56b739845
>> # good: [eea7e964642984753768ddbb710e2eefd32c7a89] ext2: remove redundant assignment to variable desc and variable best_desc
>> git bisect good eea7e964642984753768ddbb710e2eefd32c7a89
>> # first bad commit: [e64db1c50eb5d3be2187b56d32ec39e56b739845] quota: factor out dquot_write_dquot()
>>
>>> Would you be willing to do this? It should be only a few bisects,
>>> since Jan's branch only brought in 17 commits that the above rebases
>>> into this test branch. So four or five bisections should pinpoint the
>>> exact point where it goes bad.
>> See above.
>>
>> I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
>> boot, so we found the culprit that triggers this issue.
>>
Hello, Addy!

This patch does not seem to cause this problem. Just like linus said,
this patch
has only two slight differences from the previous:
1) Change "if (err)" to "if (err < 0)"
    In all the implementations of dq_op->write_dquot(), the returned
value of err
    is not greater than 0. Therefore, this does not cause behavior
inconsistency.
2) Adding quota_error()
    quota_error() does not seem to cause a boot failure.

Also, you mentioned that the root file system is initramfs. If no other
file system
that supports quota is automatically mounted during startup, it seems
that quota
does not cause this problem logically.

In my opinion, as Josh mentioned, replace the CONFIG_DEBUG_LIST related
BUG()/BUG_ON() with WARN_ON(), and then check whether the system can be
started normally. If yes, it indicates that the panic is caused by the
list corruption,
then, check for the items that may damage the list. If WARN_ON() is
recorded in
the dmesg log of the machine after the startup, it is easier to locate
the problem.

--
With Best Regards,
Baokun Li
.

2023-10-21 23:36:30

by Kees Cook

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1



On October 20, 2023 1:36:36 PM PDT, [email protected] wrote:
>That said, if you or anyone has ideas how to debug futher, I'm all ears!

I don't think this has been tried yet:

When I've had these kind of hard-to-find glitches I've used manual built-binary bisection. Assuming you have a source tree that works when built with Clang and not with GCC:
- build the tree with Clang with, say, O=build-clang
- build the tree with GCC, O=build-gcc
- make a new tree for testing: cp -a build-clang build-test
- pick a suspect .o file (or files) to copy from build-gcc into build-test
- perform a relink: "make O=build-test" should DTRT since the copied-in .o files should be newer than the .a and other targets
- test for failure, repeat

Once you've isolated it to (hopefully) a single .o file, then comes the byte-by-byte analysis or something similar...

I hope that helps! These kinds of bugs are super frustrating.

-Kees

--
Kees Cook

Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 17.10.23 12:27, Andy Shevchenko wrote:
> On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
>> Hello Linus,
>>
>> could you please pull from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
>>
>> to get:
>> * fixes for possible use-after-free issues with quota when racing with
>> chown
>> * fixes for ext2 crashing when xattr allocation races with another
>> block allocation to the same file from page writeback code
>> * fix for block number overflow in ext2
>> * marking of reiserfs as obsolete in MAINTAINERS
>> * assorted minor cleanups
>>
>> Top of the tree is df1ae36a4a0e. The full shortlog is:
>
> This merge commit (?) broke boot on Intel Merrifield.
> It has earlycon enabled and only what I got is watchdog
> trigger without a bit of information printed out.
>
> I tried to give a two bisects with the same result.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot (using info from
https://lore.kernel.org/all/[email protected]/ here):

#regzbot ^introduced 024128477809f8
#regzbot title quota: boot on Intel Merrifield after merge commit
1500e7e0726e
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-10-23 11:45:54

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Sat, Oct 21, 2023 at 04:36:19PM -0700, Kees Cook wrote:
> On October 20, 2023 1:36:36 PM PDT, [email protected] wrote:
> >That said, if you or anyone has ideas how to debug futher, I'm all ears!
>
> I don't think this has been tried yet:
>
> When I've had these kind of hard-to-find glitches I've used manual
> built-binary bisection. Assuming you have a source tree that works when built
> with Clang and not with GCC:
> - build the tree with Clang with, say, O=build-clang
> - build the tree with GCC, O=build-gcc
> - make a new tree for testing: cp -a build-clang build-test
> - pick a suspect .o file (or files) to copy from build-gcc into build-test
> - perform a relink: "make O=build-test" should DTRT since the copied-in .o
> files should be newer than the .a and other targets
> - test for failure, repeat
>
> Once you've isolated it to (hopefully) a single .o file, then comes the
> byte-by-byte analysis or something similar...
>
> I hope that helps! These kinds of bugs are super frustrating.

I'm sorry, but I can't see how this is not an error prone approach.
If it's a timing issue then the arbitrary object change may help and it doesn't
prove anything. As earlier I tried to comment out the error message, and it
worked with GCC as well. The difference is so little (according to Linus) that
it may not be suspectible. Maybe I am missing the point...

--
With Best Regards,
Andy Shevchenko


2023-10-23 12:15:13

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Mon 23-10-23 14:45:05, Andy Shevchenko wrote:
> On Sat, Oct 21, 2023 at 04:36:19PM -0700, Kees Cook wrote:
> > On October 20, 2023 1:36:36 PM PDT, [email protected] wrote:
> > >That said, if you or anyone has ideas how to debug futher, I'm all ears!
> >
> > I don't think this has been tried yet:
> >
> > When I've had these kind of hard-to-find glitches I've used manual
> > built-binary bisection. Assuming you have a source tree that works when built
> > with Clang and not with GCC:
> > - build the tree with Clang with, say, O=build-clang
> > - build the tree with GCC, O=build-gcc
> > - make a new tree for testing: cp -a build-clang build-test
> > - pick a suspect .o file (or files) to copy from build-gcc into build-test
> > - perform a relink: "make O=build-test" should DTRT since the copied-in .o
> > files should be newer than the .a and other targets
> > - test for failure, repeat
> >
> > Once you've isolated it to (hopefully) a single .o file, then comes the
> > byte-by-byte analysis or something similar...
> >
> > I hope that helps! These kinds of bugs are super frustrating.
>
> I'm sorry, but I can't see how this is not an error prone approach.
> If it's a timing issue then the arbitrary object change may help and it doesn't
> prove anything. As earlier I tried to comment out the error message, and it
> worked with GCC as well. The difference is so little (according to Linus) that
> it may not be suspectible. Maybe I am missing the point...

Given how reliably you can hit the problem with some kernels while you
cannot hit them with others (only slightly different in a code that doesn't
even get executed on your system) I suspect this is really more a code
placement issue than a timing issue. Like if during the linking phase of
vmlinux some code ends up at some position, the kernel fails, otherwise it
boots fine. Not sure how to debug such thing though. Maybe some playing
with the linker and the order of object files linked could reveal something
but I'm just guessing.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-23 12:19:16

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Sat, Oct 21, 2023 at 09:48:38AM +0800, Baokun Li wrote:
> On 2023/10/20 23:06, Andy Shevchenko wrote:
> > On Fri, Oct 20, 2023 at 05:51:59PM +0300, Andy Shevchenko wrote:
> > > On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:

...

> > > I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
> > > boot, so we found the culprit that triggers this issue.
>
> This patch does not seem to cause this problem. Just like linus said, this
> patch
> has only two slight differences from the previous:
> 1) Change "if (err)" to "if (err < 0)"
> ??? In all the implementations of dq_op->write_dquot(), the returned value
> of err
> ??? is not greater than 0. Therefore, this does not cause behavior
> inconsistency.
> 2) Adding quota_error()
> ??? quota_error() does not seem to cause a boot failure.
>
> Also, you mentioned that the root file system is initramfs. If no other file
> system
> that supports quota is automatically mounted during startup, it seems that
> quota
> does not cause this problem logically.
>
> In my opinion, as Josh mentioned, replace the CONFIG_DEBUG_LIST related
> BUG()/BUG_ON() with WARN_ON(), and then check whether the system can be
> started normally. If yes, it indicates that the panic is caused by the list
> corruption, then, check for the items that may damage the list. If WARN_ON()
> is recorded in the dmesg log of the machine after the startup, it is easier
> to locate the problem.

I mentioned that I have checked that, but okay, lemme double check it.
I took the test-mrfld-jr branch and applied that patch on top.
And as expected no luck.

fstab I have, btw is this

$ cat output/target/etc/fstab
# <file system> <mount pt> <type> <options> <dump> <pass>
/dev/root / ext2 rw,noauto 0 1
proc /proc proc defaults 0 0
devpts /dev/pts devpts defaults,gid=5,mode=620,ptmxmode=0666 0 0
tmpfs /dev/shm tmpfs mode=0777 0 0
tmpfs /tmp tmpfs mode=1777 0 0
tmpfs /run tmpfs mode=0755,nosuid,nodev 0 0
sysfs /sys sysfs defaults 0 0

Not sure if /dev/root affects this all, it's Buildroot + Busybox in initramfs
at the end.

On the booted machine
(clang build of my main branch, based on the latest -rcX):

Welcome to Buildroot
buildroot login: root

# uname -a
Linux buildroot 6.6.0-rc7-00142-g9266a02ba229 #28 SMP PREEMPT_DYNAMIC Mon Oct 23 15:00:17 EEST 2023 x86_64 GNU/Linux

# mount
rootfs on / type rootfs (rw,size=453412k,nr_inodes=113353)
devtmpfs on /dev type devtmpfs (rw,relatime,size=453412k,nr_inodes=113353,mode=755)
proc on /proc type proc (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666)
tmpfs on /dev/shm type tmpfs (rw,relatime,mode=777)
tmpfs on /tmp type tmpfs (rw,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
sysfs on /sys type sysfs (rw,relatime)

What is fishy here is the size of rootfs, it's only 30M compressed side,
I can't be ~450M decompressed. I just noticed this, dunno if it's related.

--
With Best Regards,
Andy Shevchenko


2023-10-23 13:40:59

by Baokun Li

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

Hello!

On 2023/10/23 20:19, Andy Shevchenko wrote:
> On Sat, Oct 21, 2023 at 09:48:38AM +0800, Baokun Li wrote:
>> On 2023/10/20 23:06, Andy Shevchenko wrote:
>>> On Fri, Oct 20, 2023 at 05:51:59PM +0300, Andy Shevchenko wrote:
>>>> On Thu, Oct 19, 2023 at 11:43:47AM -0700, Linus Torvalds wrote:
> ...
>
>>>> I even rebuilt again with just rebased on top of e64db1c50eb5 and it doesn't
>>>> boot, so we found the culprit that triggers this issue.
>> This patch does not seem to cause this problem. Just like linus said, this
>> patch
>> has only two slight differences from the previous:
>> 1) Change "if (err)" to "if (err < 0)"
>>     In all the implementations of dq_op->write_dquot(), the returned value
>> of err
>>     is not greater than 0. Therefore, this does not cause behavior
>> inconsistency.
>> 2) Adding quota_error()
>>     quota_error() does not seem to cause a boot failure.
>>
>> Also, you mentioned that the root file system is initramfs. If no other file
>> system
>> that supports quota is automatically mounted during startup, it seems that
>> quota
>> does not cause this problem logically.
>>
>> In my opinion, as Josh mentioned, replace the CONFIG_DEBUG_LIST related
>> BUG()/BUG_ON() with WARN_ON(), and then check whether the system can be
>> started normally. If yes, it indicates that the panic is caused by the list
>> corruption, then, check for the items that may damage the list. If WARN_ON()
>> is recorded in the dmesg log of the machine after the startup, it is easier
>> to locate the problem.
> I mentioned that I have checked that, but okay, lemme double check it.
> I took the test-mrfld-jr branch and applied that patch on top.
> And as expected no luck.
By "okay" do you mean that after replacing BUG()/BUG_ON() with WARN_ON()
you can boot up normally but don't see any prints, or does the
replacement have
no effect and still fails to boot up?
> fstab I have, btw is this
>
> $ cat output/target/etc/fstab
> # <file system> <mount pt> <type> <options> <dump> <pass>
> /dev/root / ext2 rw,noauto 0 1
> proc /proc proc defaults 0 0
> devpts /dev/pts devpts defaults,gid=5,mode=620,ptmxmode=0666 0 0
> tmpfs /dev/shm tmpfs mode=0777 0 0
> tmpfs /tmp tmpfs mode=1777 0 0
> tmpfs /run tmpfs mode=0755,nosuid,nodev 0 0
> sysfs /sys sysfs defaults 0 0
>
> Not sure if /dev/root affects this all, it's Buildroot + Busybox in initramfs
> at the end.
>
> On the booted machine
> (clang build of my main branch, based on the latest -rcX):
>
> Welcome to Buildroot
> buildroot login: root
>
> # uname -a
> Linux buildroot 6.6.0-rc7-00142-g9266a02ba229 #28 SMP PREEMPT_DYNAMIC Mon Oct 23 15:00:17 EEST 2023 x86_64 GNU/Linux
>
> # mount
> rootfs on / type rootfs (rw,size=453412k,nr_inodes=113353)
> devtmpfs on /dev type devtmpfs (rw,relatime,size=453412k,nr_inodes=113353,mode=755)
> proc on /proc type proc (rw,relatime)
> devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666)
> tmpfs on /dev/shm type tmpfs (rw,relatime,mode=777)
> tmpfs on /tmp type tmpfs (rw,relatime)
> tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
> sysfs on /sys type sysfs (rw,relatime)
>
> What is fishy here is the size of rootfs, it's only 30M compressed side,
> I can't be ~450M decompressed. I just noticed this, dunno if it's related.
>
Of the filesystems mounted above, only ext2 (aka rootfs) supports quota,
but it doesn't appear to have quota enabled.
If the quota change is causing ext2 to fail to load the root directory,
you can now do the following checks:

1) Compare the binary generated by ext2  before and after the quota patch.
2) `dumpe2fs -h /dev/root` to see if there are any useful error messages
saved on disk superblock.

Thanks!
--
With Best Regards,
Baokun Li
.

2023-10-23 16:08:28

by Kees Cook

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Mon, Oct 23, 2023 at 02:15:01PM +0200, Jan Kara wrote:
> On Mon 23-10-23 14:45:05, Andy Shevchenko wrote:
> > On Sat, Oct 21, 2023 at 04:36:19PM -0700, Kees Cook wrote:
> > > On October 20, 2023 1:36:36 PM PDT, [email protected] wrote:
> > > >That said, if you or anyone has ideas how to debug futher, I'm all ears!
> > >
> > > I don't think this has been tried yet:
> > >
> > > When I've had these kind of hard-to-find glitches I've used manual
> > > built-binary bisection. Assuming you have a source tree that works when built
> > > with Clang and not with GCC:
> > > - build the tree with Clang with, say, O=build-clang
> > > - build the tree with GCC, O=build-gcc
> > > - make a new tree for testing: cp -a build-clang build-test
> > > - pick a suspect .o file (or files) to copy from build-gcc into build-test
> > > - perform a relink: "make O=build-test" should DTRT since the copied-in .o
> > > files should be newer than the .a and other targets
> > > - test for failure, repeat
> > >
> > > Once you've isolated it to (hopefully) a single .o file, then comes the
> > > byte-by-byte analysis or something similar...
> > >
> > > I hope that helps! These kinds of bugs are super frustrating.
> >
> > I'm sorry, but I can't see how this is not an error prone approach.
> > If it's a timing issue then the arbitrary object change may help and it doesn't
> > prove anything. As earlier I tried to comment out the error message, and it
> > worked with GCC as well. The difference is so little (according to Linus) that
> > it may not be suspectible. Maybe I am missing the point...
>
> Given how reliably you can hit the problem with some kernels while you
> cannot hit them with others (only slightly different in a code that doesn't
> even get executed on your system) I suspect this is really more a code
> placement issue than a timing issue. Like if during the linking phase of
> vmlinux some code ends up at some position, the kernel fails, otherwise it
> boots fine. Not sure how to debug such thing though. Maybe some playing
> with the linker and the order of object files linked could reveal something
> but I'm just guessing.

Right -- in theory there will be some minimum subset of "from GCC"
objects that when used together in the otherwise "known good" build will
trip the failure.

--
Kees Cook

Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On 22.10.23 15:46, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 17.10.23 12:27, Andy Shevchenko wrote:
>> On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
>>> Hello Linus,
>>>
>>> could you please pull from
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
>>
>> This merge commit (?) broke boot on Intel Merrifield.
>> It has earlycon enabled and only what I got is watchdog
>> trigger without a bit of information printed out.
>>
>> I tried to give a two bisects with the same result.
>
> #regzbot ^introduced 024128477809f8
> #regzbot title quota: boot on Intel Merrifield after merge commit
> 1500e7e0726e
> #regzbot ignore-activity

Removing this from the tracking. To quote Linus from
https://lore.kernel.org/all/CAHk-=wgEHNFHpcvnp2X6-fjBngrhPYO=oHAR905Q_qk-njV31A@mail.gmail.com/

"""
The quota thing remains unexplained, and honestly seems like a timing
issue that just happens to hit Andy. Very strange, but I suspect that
without more reports (that may or may not ever happen), we're stuck.
"""

No other reports showed up afaik.

#regzbot inconclusive: individual and strange timing issue that got stuck
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-11-24 16:47:41

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1

On Wed, Nov 22, 2023 at 09:15:06AM +0100, Linux regression tracking #update (Thorsten Leemhuis) wrote:
> On 22.10.23 15:46, Linux regression tracking #adding (Thorsten Leemhuis)
> wrote:
> > On 17.10.23 12:27, Andy Shevchenko wrote:
> >> On Wed, Aug 30, 2023 at 12:24:34PM +0200, Jan Kara wrote:
> >>> Hello Linus,
> >>>
> >>> could you please pull from
> >>>
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_v6.6-rc1
> >>
> >> This merge commit (?) broke boot on Intel Merrifield.
> >> It has earlycon enabled and only what I got is watchdog
> >> trigger without a bit of information printed out.
> >>
> >> I tried to give a two bisects with the same result.
> >
> > #regzbot ^introduced 024128477809f8
> > #regzbot title quota: boot on Intel Merrifield after merge commit
> > 1500e7e0726e
> > #regzbot ignore-activity
>
> Removing this from the tracking. To quote Linus from
> https://lore.kernel.org/all/CAHk-=wgEHNFHpcvnp2X6-fjBngrhPYO=oHAR905Q_qk-njV31A@mail.gmail.com/
>
> """
> The quota thing remains unexplained, and honestly seems like a timing
> issue that just happens to hit Andy. Very strange, but I suspect that
> without more reports (that may or may not ever happen), we're stuck.
> """
>
> No other reports showed up afaik.

Yeah, have no time to dig into this more... Maybe later, who knows?

--
With Best Regards,
Andy Shevchenko