Hi Ryusuke,
today I hit this kernel bug, which seems to be related to nilfs2.
It was likely caused by improper shutdown and following nilfs2 partition
corruption. Now I can still read the data, but on the whole the
computer is not useable, because starting a process which uses the
corrupted file system simply crashes in kernel. I am actually not sure
if the filesystem is corrupted, as I don't know of any tool to check
that. The relevant parts of the dmesg log are below.
Please let me know if you are the right contact or if you need more info
about the problem.
Thank you,
Tomas
[ 0.000000] Linux version 4.19.84 (nixbld@localhost) (gcc version 8.3.0 (GCC)) #1-NixOS SMP Tue Nov 12 18:21:46 UTC 2019
[ 0.000000] Command line: initrd=\efi\nixos\4s51zw36kd1qb0ymk0charxjg8x6k5k3-initrd-linux-4.19.84-initrd.efi systemConfig=/nix/store/gdbxhzysr929abrymjqala0b5bh2fqmv-nixos-system-ushi-19.09.1258.07e66484e67 init=/nix/store/gdbxhzysr929abrymjqala0b5bh2fqmv-nixos-system-ushi-19.09.1258.07e66484e67/init loglevel=4
[ 37.741106] systemd-journald[470]: Received client request to flush runtime journal.
[ 37.749084] systemd-journald[470]: File /var/log/journal/55a4ea9159c14c0bb8767a43819c6927/system.journal corrupted or uncleanly shut down, renaming and replacing.
[ 37.810819] audit: type=1130 audit(1573985039.617:3): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-udevd comm="systemd" exe="/nix/store/v8flm2h07zcfg5k5npz56m0ayj0qm1q8-systemd-243/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 38.321561] NILFS version 2 loaded
[ 38.323236] NILFS (dm-1): mounting unchecked fs
[ 38.349185] NILFS (dm-1): recovery complete
[ 38.353228] NILFS (dm-1): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
[ 63.543941] systemd-journald[470]: File /var/log/journal/55a4ea9159c14c0bb8767a43819c6927/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[12637.085548] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[12637.085558] PGD 0 P4D 0
[12637.085567] Oops: 0000 [#1] SMP PTI
[12637.085574] CPU: 0 PID: 657 Comm: segctord Not tainted 4.19.84 #1-NixOS
[12637.085577] Hardware name: ASUSTeK COMPUTER INC. VivoBook 15_ASUS Laptop X507MA_R507MA/X507MA, BIOS X507MA.301 09/14/2018
[12637.085589] RIP: 0010:percpu_counter_add_batch+0x4/0x60
[12637.085593] Code: 89 e6 89 c7 e8 dd 3b 28 00 3b 05 fb e0 b6 00 72 d8 4c 89 ee 48 89 ef e8 7a 63 2a 00 48 89 d8 5b 5d 41 5c 41 5d c3 41 54 55 53 <48> 8b 47 20 65 44 8b 20 49 63 ec 48 63 ca 48 01 f5 48 39 e9 7e 0a
[12637.085597] RSP: 0018:ffff9d1b00a0bd20 EFLAGS: 00010006
[12637.085601] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000018
[12637.085604] RDX: 0000000000000018 RSI: 0000000000000001 RDI: 0000000000000088
[12637.085608] RBP: ffff8df67a2988d0 R08: 0000000000000000 R09: ffff8df66fe0cfe0
[12637.085611] R10: 0000000000000230 R11: 0000000000000000 R12: 0000000000000000
[12637.085614] R13: ffff8df67a298758 R14: ffff8df67a2988c8 R15: ffffccd684229a80
[12637.085618] FS: 0000000000000000(0000) GS:ffff8df67ba00000(0000) knlGS:0000000000000000
[12637.085621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12637.085624] CR2: 00000000000000a8 CR3: 000000011ac0a000 CR4: 0000000000340ef0
[12637.085628] Call Trace:
[12637.085640] __test_set_page_writeback+0x37c/0x3f0
[12637.085663] nilfs_segctor_do_construct+0x184e/0x2040 [nilfs2]
[12637.085680] nilfs_segctor_construct+0x1f5/0x2e0 [nilfs2]
[12637.085693] nilfs_segctor_thread+0x129/0x370 [nilfs2]
[12637.085706] ? nilfs_segctor_construct+0x2e0/0x2e0 [nilfs2]
[12637.085713] kthread+0x112/0x130
[12637.085719] ? kthread_bind+0x30/0x30
[12637.085728] ret_from_fork+0x1f/0x40
[12637.085734] Modules linked in: ctr ccm af_packet msr 8021q snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic hid_multitouch arc4 ath9k ath9k_common ath9k_hw ath mac80211 snd_soc_skl snd_soc_skl_ipc spi_pxa2xx_platform asus_nb_wmi snd_soc_sst_ipc snd_soc_sst_dsp asus_wmi 8250_dw i2c_designware_platform sparse_keymap i2c_designware_core wmi_bmof i915 snd_hda_ext_core nilfs2 snd_soc_acpi_intel_match snd_soc_acpi uvcvideo videobuf2_vmalloc nls_iso8859_1 videobuf2_memops videobuf2_v4l2 snd_soc_core nls_cp437 rtsx_usb_ms intel_telemetry_pltdrv vfat intel_punit_ipc intel_telemetry_core fat intel_pmc_ipc memstick videobuf2_common snd_compress kvmgt vfio_mdev mdev ath3k vfio_iommu_type1 vfio btusb ac97_bus snd_pcm_dmaengine btrtl x86_pkg_temp_thermal intel_powerclamp btbcm cec coretemp btintel
[12637.085819] crct10dif_pclmul crc32_pclmul videodev snd_hda_intel bluetooth drm_kms_helper ghash_clmulni_intel deflate media efi_pstore intel_cstate pstore intel_rapl_perf cfg80211 snd_hda_codec joydev mousedev evdev wdat_wdt serio_raw mac_hid efivars drm snd_hda_core snd_hwdep ecdh_generic snd_pcm snd_timer mei_me idma64 virt_dma snd intel_gtt agpgart i2c_i801 i2c_algo_bit mei fb_sys_fops syscopyarea soundcore rfkill processor_thermal_device sysfillrect sysimgblt intel_lpss_pci intel_soc_dts_iosf thermal wmi intel_lpss i2c_hid i2c_core battery tpm_crb button ac tpm_tis tpm_tis_core asus_wireless video pcc_cpufreq tpm rng_core pinctrl_geminilake int3400_thermal int3403_thermal pinctrl_intel int340x_thermal_zone acpi_thermal_rel iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
[12637.085912] nf_defrag_ipv4 libcrc32c ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables iptable_filter sch_fq_codel loop cpufreq_powersave tun tap macvlan bridge stp llc kvm_intel kvm irqbypass efivarfs ip_tables x_tables ipv6 crc_ccitt autofs4 ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_crypt algif_skcipher af_alg rtsx_usb_sdmmc mmc_core rtsx_usb hid_generic usbhid hid sd_mod input_leds led_class atkbd libps2 ahci libahci xhci_pci libata xhci_hcd aesni_intel usbcore aes_x86_64 crypto_simd scsi_mod cryptd glue_helper crc32c_intel usb_common rtc_cmos i8042 serio dm_mod
[12637.086000] CR2: 00000000000000a8
[12637.086005] ---[ end trace ee0079180c990cd2 ]---
[12637.120805] RIP: 0010:percpu_counter_add_batch+0x4/0x60
[12637.120807] Code: 89 e6 89 c7 e8 dd 3b 28 00 3b 05 fb e0 b6 00 72 d8 4c 89 ee 48 89 ef e8 7a 63 2a 00 48 89 d8 5b 5d 41 5c 41 5d c3 41 54 55 53 <48> 8b 47 20 65 44 8b 20 49 63 ec 48 63 ca 48 01 f5 48 39 e9 7e 0a
[12637.120809] RSP: 0018:ffff9d1b00a0bd20 EFLAGS: 00010006
[12637.120811] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000018
[12637.120812] RDX: 0000000000000018 RSI: 0000000000000001 RDI: 0000000000000088
[12637.120814] RBP: ffff8df67a2988d0 R08: 0000000000000000 R09: ffff8df66fe0cfe0
[12637.120815] R10: 0000000000000230 R11: 0000000000000000 R12: 0000000000000000
[12637.120816] R13: ffff8df67a298758 R14: ffff8df67a2988c8 R15: ffffccd684229a80
[12637.120818] FS: 0000000000000000(0000) GS:ffff8df67ba00000(0000) knlGS:0000000000000000
[12637.120820] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12637.120821] CR2: 00000000000000a8 CR3: 0000000138e0a000 CR4: 0000000000340ef0
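As a quick consistency check on the oops above, the faulting address can be reproduced from the register dump. The sketch below decodes the four instruction bytes that start at the marked position in the Code: line (48 8b 47 20) and adds the displacement to RDI. It is illustrative only: the variable names are made up for this example, and it says nothing about which nilfs2 structure was involved.

```python
# Values copied from the oops above.
CR2 = 0x00000000000000A8  # faulting address reported by the kernel
RDI = 0x0000000000000088  # RDI at the time of the fault

# The byte in <...> starts the faulting instruction: 48 8b 47 20.
# 0x48 = REX.W prefix, 0x8b = MOV r64, r/m64,
# 0x47 = ModRM (mod=01, reg=rax, rm=rdi), 0x20 = 8-bit displacement,
# i.e. "mov 0x20(%rdi),%rax".
code = bytes.fromhex("48 8b 47 20")
modrm = code[2]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
disp8 = code[3]

# mod=1 means [base + disp8]; reg=0 is rax; rm=7 is rdi.
assert (mod, reg, rm) == (1, 0, 7)

fault_address = RDI + disp8
print(hex(fault_address), fault_address == CR2)
```

RDI plus the displacement matching CR2 suggests that percpu_counter_add_batch was handed a near-NULL pointer (0x88). That would fit a member offset taken from a NULL structure pointer, but the log alone does not confirm which structure that is.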
Hi,
> It was likely caused by improper shutdown and following nilfs2 partition
> corruption. Now I can still read the data, but on the whole the
> computer is not useable, because starting a process which uses the
> corrupted file system simply crashes in kernel.
Thank you for reporting the issue.
Let me ask you a few questions:
1) Is the crash reproducible in your environment?
2) Can you mount the corrupted(?) partition from a recent kernel version?
3) Does the read-only mount option (-r) avoid the crash?
Thanks,
Ryusuke Konishi
> On Nov 18, 2019, at 7:51 PM, Ryusuke Konishi <[email protected]> wrote:
>
> Hi,
>
>> It was likely caused by improper shutdown and following nilfs2 partition
>> corruption. Now I can still read the data, but on the whole the
>> computer is not useable, because starting a process which uses the
>> corrupted file system simply crashes in kernel.
>
> Thank you for reporting the issue.
> Let me ask you a few questions:
>
> 1) Is the crash reproducible in the environment ?
> 2) Can you mount the corrupted(?) partition from a recent version of kernel ?
> 3) Does read-only mount option (-r) work to avoid the crash ?
I believe it could be important to know more details about the partition too:
(1) the partition size?
(2) the logical block size?
(3) the segment size?
(4) how the partition was created?
(5) the version of tools that created the partition?
(6) the amount of free space on the partition?
Thanks,
Viacheslav Dubeyko.
Hi Ryusuke,
thanks for your answer.
Ryusuke Konishi <[email protected]> writes:
> 1) Is the crash reproducible in the environment ?
yes
> 2) Can you mount the corrupted(?) partition from a recent version of
> kernel ?
> 3) Does read-only mount option (-r) work to avoid the crash ?
I'll have access to the computer sometime next week so I'll try this out
and let you know.
Thank you,
Tomas
Hi Ryusuke,
>> 2) Can you mount the corrupted(?) partition from a recent version of
>> kernel ?
this will take me some time to figure out
>> 3) Does read-only mount option (-r) work to avoid the crash ?
Mounting read-only doesn't seem to crash.
At least after mounting the partition read-only, none of the following
crashes:
- running lscp
- running sudo find . -type f inside the mounted partition
- cat <some random file on the nilfs partition>
The crash I was seeing happened during rsync (while writing, I guess).
Other info that might be relevant:
- the nilfs partition was on top of LUKS
- the corruption probably happened during shutdown:
the shutdown hung for a long time waiting for the nilfs
disk (IIRC it waits for 1m30s), and even after that it did
not finish, so I turned the computer off without waiting
further. After the next start, I got the crash.
- I got the same problem on another disk recently
Regards,
Tomas
Hi Viacheslav,
Viacheslav Dubeyko <[email protected]> writes:
> (1) the partition size?
The first disk with the crash was 1TB.
The second disk with the crash, which I have with me, is 2TB:
$ lsblk /dev/sdb
NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdb           8:16   0  1.8T  0 disk
└─extdisk   254:2    0  1.8T  0 crypt /mnt/b
> (2) the logical block size?
> (3) the segment size?
How can I find out (2) and (3)?
Here is the output of nilfs-tune:
$ sudo nilfs-tune -l /dev/mapper/extdisk
nilfs-tune 2.2.7
Filesystem volume name: backup1_nilfs2
Filesystem UUID: 7d9708f9-464f-41b7-a0c6-eda18741012f
Filesystem magic number: 0x3434
Filesystem revision #: 2.0
Filesystem features: (none)
Filesystem state: valid
Filesystem OS type: Linux
Block size: 4096
Filesystem created: Thu Dec 27 14:14:14 2018
Last mount time: Fri Dec 20 13:06:15 2019
Last write time: Thu Jan 23 13:04:30 2020
Mount count: 15
Maximum mount count: 50
Reserve blocks uid: 0 (user root)
Reserve blocks gid: 0 (group root)
First inode: 11
Inode size: 128
DAT entry size: 32
Checkpoint size: 192
Segment usage size: 16
Number of segments: 238465
Device size: 2000396834816
First data block: 1
# of blocks per segment: 2048
Reserved segments %: 5
Last checkpoint #: 9884
Last block address: 280841435
Last sequence #: 137120
Free blocks count: 207591424
Commit interval: 0
# of blks to create seg: 0
CRC seed: 0x5172270a
CRC check sum: 0x2ef767d2
CRC check data size: 0x00000118
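Questions (2) and (3) can probably be answered from this listing, taking "logical block size" to mean the filesystem block size. The sketch below is a rough illustration, not an official nilfs-utils method: it parses a few fields copied from the output above and derives the segment size as block size times blocks per segment. The `listing` and `fields` names are made up for the example.

```python
# Fields copied from the nilfs-tune listing above.
listing = """\
Block size: 4096
Number of segments: 238465
Device size: 2000396834816
# of blocks per segment: 2048
"""

# Split each line at the last colon into a name and an integer value.
fields = {}
for line in listing.splitlines():
    key, _, value = line.rpartition(":")
    fields[key.strip()] = int(value)

block_size = fields["Block size"]
segment_size = block_size * fields["# of blocks per segment"]

# Prints the block size in bytes, the segment size in bytes and in MiB.
print(block_size, segment_size, segment_size // 2**20)

# Sanity check: the device should hold that many whole segments.
assert fields["Device size"] // segment_size == fields["Number of segments"]
```

The final assertion is only a plausibility check that the derived segment size agrees with the reported device size and segment count.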
It seems strange that the last write time is today, even though I
mounted the partition read-only:
/dev/mapper/extdisk on /mnt/b type nilfs2 (ro,relatime)
> (4) how the partition was created?
using parted
then cryptsetup luksFormat
then cryptsetup luksOpen
then mkfs.nilfs2
> (5) the version of tools that created the partition?
How can I find this out? Is it saved somewhere?
> (6) the amount of free space on the partition?
/dev/mapper/extdisk 1.9T 1.1T 699G 61% /mnt/b
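The 699G figure can be cross-checked against the nilfs-tune listing. The sketch below assumes that the available space is the free block count minus the reserved segments, with the reserved percentage rounded up to whole segments. That is an assumption about how nilfs2 reports free space, not something stated in this thread, and the free block count may also have changed between the two commands.

```python
# Values copied from the nilfs-tune listing in this message.
block_size = 4096          # "Block size"
blocks_per_segment = 2048  # "# of blocks per segment"
segments = 238465          # "Number of segments"
reserved_percent = 5       # "Reserved segments %"
free_blocks = 207591424    # "Free blocks count"

# Assumption: reserved segments are the percentage rounded up
# to a whole number of segments (integer ceiling division).
reserved_segments = (segments * reserved_percent + 99) // 100
reserved_blocks = reserved_segments * blocks_per_segment

# Assumption: available space = free blocks minus reserved blocks.
available_blocks = free_blocks - reserved_blocks
available_gib = available_blocks * block_size / 2**30

print(reserved_segments, available_blocks, round(available_gib, 1))
```

Under these assumptions the result lands just below 699 GiB, which is compatible with the df line above if df rounds up.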
Regards,
Tomas
Hi,
The crash is reproducible in my environment.
The kernel version is 4.19.86 (Gentoo).
NILFS2 with kernel 4.19.82 works well.
I did the following for the test:
i) mount the corrupt partition with the read-only option
(this partition causes "mounting fs with errors" at every rw mount)
i-1) wait a few minutes ... no crash
i-2) fs access (ls, du, ...) ... no crash
ii) create a small NILFS2 fs and mount it read-write
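dd if=/dev/zero of=/tmp/n bs=1M count=500
# mkfs step implied by "create a small NILFS2 fs"; not listed in the original report
mkfs.nilfs2 /tmp/n
mount -o loop /tmp/n /mnt/tmp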
ii-1) wait a few minutes ... no crash
ii-2) touch a file in the fs ... crash (within a few seconds)
In <CAKFNMo=k1wVHOwXhTLEOJ+A-nwmvJ+sN_PPa8kY8fMxrQ4R+Jw@mail.gmail.com>;
Ryusuke Konishi <[email protected]> wrote
as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
> Hi,
>
>> It was likely caused by improper shutdown and following nilfs2 partition
>> corruption. Now I can still read the data, but on the whole the
>> computer is not useable, because starting a process which uses the
>> corrupted file system simply crashes in kernel.
>
> Thank you for reporting the issue.
> Let me ask you a few questions:
>
> 1) Is the crash reproducible in the environment ?
> 2) Can you mount the corrupted(?) partition from a recent version of kernel ?
> 3) Does read-only mount option (-r) work to avoid the crash ?
>
> Thanks,
> Ryusuke Konishi
>> [12637.120812] RDX: 0000000000000018 RSI: 0000000000000001 RDI: 0000000000000088
>> [12637.120814] RBP: ffff8df67a2988d0 R08: 0000000000000000 R09: ffff8df66fe0cfe0
>> [12637.120815] R10: 0000000000000230 R11: 0000000000000000 R12: 0000000000000000
>> [12637.120816] R13: ffff8df67a298758 R14: ffff8df67a2988c8 R15: ffffccd684229a80
>> [12637.120818] FS: 0000000000000000(0000) GS:ffff8df67ba00000(0000) knlGS:0000000000000000
>> [12637.120820] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [12637.120821] CR2: 00000000000000a8 CR3: 0000000138e0a000 CR4: 0000000000340ef0
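The register dump above can be cross-checked against the fault address. The byte marked <48> on the "Code:" line starts the sequence 48 8b 47 20, which on x86-64 encodes "mov 0x20(%rdi),%rax". With RDI = 0x88, that load touches 0x88 + 0x20 = 0xa8, the same value as CR2. An RDI this small looks like the address of a member inside a structure reached through a NULL base pointer, but that last step is an interpretation of the dump, not something stated in the thread. A minimal sketch of the arithmetic, where fault_addr is a hypothetical helper name:

```shell
# fault_addr BASE DISP: print BASE + DISP as a hex address.
# fault_addr is a hypothetical helper for illustration, not an existing tool.
fault_addr() {
    printf '0x%x\n' "$(( $1 + $2 ))"
}

# RDI from the oops (0x88) plus the displacement encoded in
# "48 8b 47 20" (0x20); the result should equal CR2.
fault_addr 0x88 0x20
```

If the printed value differs from CR2 in some other oops, the faulting instruction probably uses a different base register or addressing form and has to be decoded again.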
Hi,

I just found that my /tmp is not tmpfs; it is on the root partition (!!!???).
And I use LUKS for it:

  LUKS - VG - LV (root, usr, ...)

I want to try "ii)" without LUKS/LVM, but I cannot reboot right now.

In <[email protected]>;
ARAI Shun-ichi <[email protected]> wrote
as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
> Hi,
>
> It is reproducible in my environment.
> Kernel version is 4.19.86 (Gentoo).
> NILFS2 with kernel 4.19.82 works well.
>
> I did the following tests.
>
> i) mount the corrupt partition with the read-only option
>    (this partition causes "mounting fs with errors" at every rw mount)
>   i-1) wait a few minutes ... no crash
>   i-2) fs access (ls, du, ...) ... no crash
>
> ii) create a small NILFS2 fs and mount it read-write
>       dd if=/dev/zero of=/tmp/n bs=1M count=500
>       mount -o loop /tmp/n /mnt/tmp
>   ii-1) wait a few minutes ... no crash
>   ii-2) touch a file in the fs ... crash (within a few seconds)
>
>
> In <CAKFNMo=k1wVHOwXhTLEOJ+A-nwmvJ+sN_PPa8kY8fMxrQ4R+Jw@mail.gmail.com>;
> Ryusuke Konishi <[email protected]> wrote
> as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
>
>> Hi,
>>
>>> It was likely caused by improper shutdown and following nilfs2 partition
>>> corruption. Now I can still read the data, but on the whole the
>>> computer is not useable, because starting a process which uses the
>>> corrupted file system simply crashes in kernel.
>>
>> Thank you for reporting the issue.
>> Let me ask you a few questions:
>>
>> 1) Is the crash reproducible in the environment ?
>> 2) Can you mount the corrupted(?) partition from a recent version of kernel ?
>> 3) Does read-only mount option (-r) work to avoid the crash ?
>>
>> Thanks,
>> Ryusuke Konishi
>>
>> On Mon, Nov 18, 2019 at 2:34 Tomas Hlavaty <[email protected]> wrote:
>>>
>>> Hi Ryusuke,
>>>
>>> today I got this bug in kernel, which seems to be related to nilfs2.
>>>
>>> It was likely caused by improper shutdown and following nilfs2 partition
>>> corruption. Now I can still read the data, but on the whole the
>>> computer is not useable, because starting a process which uses the
>>> corrupted file system simply crashes in kernel. I am actually not sure
>>> if the filesystem is corrupted, as I don't know about any tool to check
>>> that. The relevant parts of dmesg log are bellow.
>>>
>>> Please let me know if you are the right contact or if you need more info
>>> about the problem.
>>>
>>> Thank you,
>>>
>>> Tomas
Hi,

FYI, here are additional test results.

In my previous mail I reproduced this problem with a clean NILFS2 fs.
"Clean" means that the filesystem was re-created before every test.
In this mail, I tried to reproduce it with/without VG/LV, LUKS, and loopback.

* Not reproduced
  USB stick - primary partition - NILFS2
  USB stick - primary partition - VG/LV - NILFS2
  USB stick - primary partition - VG/LV - LUKS - NILFS2
  USB stick - primary partition - LUKS - VG/LV - NILFS2
  USB stick - primary partition - LUKS - VG/LV - LUKS - NILFS2
  /tmp (tmpfs) - regular file - NILFS2 (loopback mount, kernel 4.19.82)
  USB stick - primary partition(512MiB) - NILFS2

* Reproduced (always, immediately)
  /tmp (tmpfs) - regular file - NILFS2 (loopback mount)
  USB stick - primary partition - ext4 - regular file - NILFS2 (loopback mount)

Test conditions:
  kernel 4.19.86 (same as previous test)
  NILFS2/ext4 filesystem, VG/LV, LUKS were made with default parameters
  size of "primary partition" in USB stick is approx. 14GiB
  size of "regular file" is approx. 512MiB
  "reproduce": mount NILFS2, touch file, sync
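
The "reproduce" procedure for the loopback case can be written down as a small script. This is a sketch under assumptions: the function names, the image path and the mount point are made up for illustration, mkfs.nilfs2 is taken from nilfs-utils, and the mount step needs root. On an affected kernel the final sync is expected to crash the machine, so this should only be run in a disposable VM.

```shell
# make_image PATH SIZE_MIB: create a zero-filled regular file of SIZE_MIB MiB.
# (hypothetical helper name)
make_image() {
    dd if=/dev/zero of="$1" bs=1M count="$2" 2>/dev/null
}

# reproduce IMAGE MOUNTPOINT: make a fresh NILFS2 fs on IMAGE, loopback-mount
# it, touch a file and sync. (hypothetical helper name)
# WARNING: on an affected kernel the sync may oops the kernel.
reproduce() {
    mkfs.nilfs2 "$1" &&
    mount -o loop "$1" "$2" &&
    touch "$2/testfile" &&
    sync
}
```

For example, "make_image /tmp/n 512 && reproduce /tmp/n /mnt/tmp" corresponds to the "/tmp - regular file - NILFS2 (loopback mount)" case with an image of roughly the size used in the tests.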
And,
In <[email protected]>;
ARAI Shun-ichi <[email protected]> wrote
as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
> Hi,
>
> FYI, reporting additional test results.
>
> In my previous mail I reproduced this problem with a clean NILFS2 fs.
> "Clean" means that the filesystem was re-created before every test.
> In this mail, I tried to reproduce it with/without VG/LV, LUKS, and loopback.
>
> * Not reproduced
> USB stick - primary partition - NILFS2
> USB stick - primary partition - VG/LV - NILFS2
> USB stick - primary partition - VG/LV - LUKS - NILFS2
> USB stick - primary partition - LUKS - VG/LV - NILFS2
> USB stick - primary partition - LUKS - VG/LV - LUKS - NILFS2
> /tmp (tmpfs) - regular file - NILFS2 (loopback mount, kernel 4.19.82)
> USB stick - primary partition(512MiB) - NILFS2
>
> * Reproduced (always, immediately)
> /tmp (tmpfs) - regular file - NILFS2 (loopback mount)
> USB stick - primary partition - ext4 - regular file - NILFS2 (loopback mount)
this loopback problem is also seen in kernel 5.5.4.
> Test conditions:
> kernel 4.19.86 (same as previous test)
> NILFS2/ext4 filesystem, VG/LV, LUKS were made with default parameters
> size of "primary partition" in USB stick is approx. 14GiB
> size of "regular file" is approx. 512MiB
> "reproduce": mount NILFS2, touch file, sync
This is my first post to the LKML, so please be kind :) I also have
been affected by this bug. The bug is triggered whenever a write
happens to the filesystem, which means mounting read-only is an
available option to recover data. I took the time to do a full bisect
on the kernel sources and have identified the commit where the
breakage happens.
Regarding versions, I can confirm that 4.19.83 is stable with regard
to NILFS, and 4.19.84 and later are broken. I can also confirm that
5.3.10 works fine and have heard that 5.3.12 breaks NILFS as well. I
can also confirm that the 5.4.18 kernel still has this issue. I did
not trace how far back the issue goes on the 5.4.x series, or even in
more detail on the 5.3.x series.
To simplify my bisection task, I used the 4.19.x series, and
determined that commit d3b3c0a14615c495118acc4bdca23d53eea46ed2 is the
commit that breaks NILFS. Furthermore, when reverting this commit on
otherwise clean 4.19.84 kernel sources, the NILFS issue does not occur
anymore.
I'm not familiar enough with NILFS's internals to determine why the
small caching change to the kernel from that commit breaks NILFS, nor
can I offer a patch to fix it (besides reverting the offending change)
but I can confirm that this is the initial cause. I also know there
have been a lot of changes to kernel caching in more recent (5.4 /
5.5 / 5.6) kernels, so perhaps there are still more diagnostics to do.
I have the test VM that I used for bisection available if someone
wants to coordinate with me to put together a patch for this, but
ideally someone can take my diagnostics effort here and make use of it
directly. I saved dmesg logs from both good and bad cases and I can
send them if someone is interested. I can also provide some level of
detailed system setup instructions to reproduce the issue. I did my
testing against an existing external hard drive, but I have been able
to reproduce the issue consistently against a freshly created loopback
mount as well, so it is not just caused by disk corruption or an
unclean unmount.
- Brian
On Sat, Feb 15, 2020 at 8:11 PM ARAI Shun-ichi <[email protected]> wrote:
>
> And,
>
> In <[email protected]>;
> ARAI Shun-ichi <[email protected]> wrote
> as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
>
> this loopback problem is seen in Kernel 5.5.4.
Thank you Arai-san,
Your method with loopback device worked to reproduce the issue
even where the bug doesn't easily hit for physical devices.
Regards,
Ryusuke Konishi
On Sun, Feb 16, 2020 at 11:11 AM ARAI Shun-ichi <[email protected]> wrote:
>
> And,
>
> In <[email protected]>;
> ARAI Shun-ichi <[email protected]> wrote
> as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
>
> this loopback problem is seen in Kernel 5.5.4.
Tomas Hlavaty <[email protected]> writes:
>>> 2) Can you mount the corrupted(?) partition from a recent version of
>>> kernel ?
I tried the following Linux kernel versions:
- v4.19
- v5.4
- v5.5.11
and I still get the crash.
In Msg <[email protected]>;
Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
> Tomas Hlavaty <[email protected]> writes:
>>>> 2) Can you mount the corrupted(?) partition from a recent version of
>>>> kernel ?
>
> I tried the following Linux kernel versions:
>
> - v4.19
> - v5.4
> - v5.5.11
>
> and still get the crash
Ryusuke Konishi pointed out:
In Msg <CAKFNMomjWkNvHvHkEp=Jv_BiGPNj=oLEChyoXX1yCj5xctAkMA@mail.gmail.com>;
Subject "Re: BUG: kernel NULL pointer dereference, address: 00000000000000a8":
> As the result of bisection, it turned out that commit
> f4bdb2697ccc9cecf1a9de86905c309ad901da4c on 5.3.y
> ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages")
> triggers the crash.
This commit modifies __filemap_fdatawrite_range() as follows.

[before]
        if (!mapping_cap_writeback_dirty(mapping))
                return 0;

[after]
        if (!mapping_cap_writeback_dirty(mapping) ||
            !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                return 0;

I did a simple test with the following code (kernel 5.5.13).

[test]
        if (!mapping_cap_writeback_dirty(mapping) ||
            mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
                return 0;

With this change the crash did not occur (I have not tested long-term
operation). So I think that the crash may be related to
PAGECACHE_TAG_TOWRITE.

One possible(?) scenario is:

0. some write operation
1. sync (WB_SYNC_ALL)
2. pages are tagged PAGECACHE_TAG_TOWRITE
3. __filemap_fdatawrite_range() is called and returns successfully
   (but as a no-op)
4. some data is freed
   (because of 3.)
5. crash when testing/setting writeback on the freed data
     nilfs_segctor_do_construct()
       nilfs_segctor_prepare_write()
         set_page_writeback()

How about this?
> In Msg <[email protected]>;
> Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
>
>> Tomas Hlavaty <[email protected]> writes:
>>>>> 2) Can you mount the corrupted(?) partition from a recent version of
>>>>> kernel ?
>>
>> I tried the following Linux kernel versions:
>>
>> - v4.19
>> - v5.4
>> - v5.5.11
>>
>> and still get the crash
I found conditions to reproduce this issue with Linux 5.7-rc3:

- CONFIG_MEMCG=y *and* CONFIG_BLK_CGROUP=y
- When the NILFS2 file system writes to a device, the device file has
  never been written to by other programs since boot
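
Whether a given kernel meets the first condition can be checked from its build configuration. A minimal sketch, assuming the configuration is available as a plain-text file (for example /boot/config-$(uname -r) on many distributions; that location is an assumption and varies between systems). has_both_options is a hypothetical helper name:

```shell
# has_both_options CONFIG_FILE: succeed only if both CONFIG_MEMCG and
# CONFIG_BLK_CGROUP are set to "y" in the given kernel config file.
# (hypothetical helper for illustration)
has_both_options() {
    grep -q '^CONFIG_MEMCG=y$' "$1" &&
    grep -q '^CONFIG_BLK_CGROUP=y$' "$1"
}
```

For example, has_both_options "/boot/config-$(uname -r)" && echo "first condition met". This only covers the kernel configuration; the second condition depends on what has written to the device since boot.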

The following is an example with a CONFIG_MEMCG=y and
CONFIG_BLK_CGROUP=y kernel. If you run mkfs and then mount, it works
because the mkfs command has written data to the device file before
mounting:

    # mkfs -t nilfs2 /dev/sda1
    mkfs.nilfs2 (nilfs-utils 2.2.7)
    Start writing file system initial data to the device
    Blocksize:4096 Device:/dev/sda1 Device Size:267386880
    File system initialization succeeded !!
    # mount /dev/sda1 /mnt
    # touch /mnt
    # sync
    #

Loopback mount seems to be the same - if you do losetup, mkfs and
mount on a loopback device, it works:

    # losetup /dev/loop0 foo
    # mkfs -t nilfs2 /dev/loop0
    mkfs.nilfs2 (nilfs-utils 2.2.7)
    Start writing file system initial data to the device
    Blocksize:4096 Device:/dev/loop0 Device Size:267386880
    File system initialization succeeded !!
    # mount /dev/loop0 /mnt
    # touch /mnt
    # sync
    #

But if you do mkfs on a file and use mount -o loop, it may fail,
depending on whether the loopback device assigned by the mount command
was used or not before mounting:
# /sbin/mkfs.nilfs2 ./foo
mkfs.nilfs2 (nilfs-utils 2.2.7)
Start writing file system initial data to the device
Blocksize:4096 Device:./foo Device Size:268435456
File system initialization succeeded !!
# mount -o loop ./foo /mnt
[ 36.371331] NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
# touch /mnt
# sync
[ 40.252869] BUG: kernel NULL pointer dereference, address: 00000000000000a8
(snip)
After reboot, it fails:
# mount /dev/sda1 /mnt
[ 14.021188] NILFS (sda1): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
# touch /mnt
# sync
[ 20.576309] BUG: kernel NULL pointer dereference, address: 00000000000000a8
(snip)
But if you do a dummy write to the device file before mounting, it
works:
# dd if=/dev/sda1 of=/dev/sda1 count=1
1+0 records in
1+0 records out
512 bytes copied, 0.0135982 s, 37.7 kB/s
# mount /dev/sda1 /mnt
[ 52.604560] NILFS (sda1): mounting unchecked fs
[ 52.613335] NILFS (sda1): recovery complete
[ 52.613877] NILFS (sda1): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
# touch /mnt
# sync
#
# losetup /dev/loop0 foo
# dd if=/dev/loop0 of=/dev/loop0 count=1
1+0 records in
1+0 records out
512 bytes copied, 0.0243797 s, 21.0 kB/s
# mount /dev/loop0 /mnt
[ 271.915595] NILFS (loop0): mounting unchecked fs
[ 272.049603] NILFS (loop0): recovery complete
[ 272.049724] NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
# touch /mnt
# sync
#
I think the dummy write is a simple workaround for now, unless
NILFS2 is mounted at boot time. But I have been using NILFS2 for /home for
years, so I would like to know of better workarounds.
Thank you! This is very helpful information, and does seem to be a
workaround.
Like you, I have my home directory on a separate NILFS2 filesystem. As a
temporary solution, I removed the line from /etc/fstab for that
filesystem and added your dd suggestion along with a manual mount of the
home filesystem to /etc/rc.local. /home is now mounted properly at boot
with any of the newer kernels I tried.
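For anyone wanting to do the same, here is a rough sketch of what such an rc.local fragment could look like. It is not my exact file: the function names and the DRY_RUN switch are made up for illustration, and the device and mount point in the example call are placeholders that have to match your own system. DRY_RUN defaults to 1 here so that the commands are only printed until you deliberately turn it off.

```shell
#!/bin/sh
# Sketch of the "dummy write, then mount" workaround from this thread.
# run, mount_nilfs_with_dummy_write and DRY_RUN are hypothetical names.
DRY_RUN=${DRY_RUN:-1}

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

mount_nilfs_with_dummy_write() {
    dev=$1
    dir=$2
    # Rewrite the first 512-byte sector with its own contents, so the
    # device file has been written once before NILFS2 writes to it.
    run dd if="$dev" of="$dev" count=1 || return 1
    run mount "$dev" "$dir"
}

# Example call for rc.local (placeholder device and mount point):
#   DRY_RUN=0
#   mount_nilfs_with_dummy_write /dev/sda1 /home
```

The dd step copies the sector onto itself, so it should not change the data on the device, but it has to run before the mount and only works once the /etc/fstab entry no longer mounts the filesystem earlier in boot.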
Thanks,
Tom
On 4/30/20 5:38 AM, Hideki EIRAKU wrote:
> I think the dummy write is a simple workaround for now, unless
> mounting NILFS2 at boot time. But I have been using NILFS2 /home for
> years, I would like to know better workarounds.
>
Hi,
This bug turned out to be caused by the set_page_writeback() call for
segment summary buffers and super root buffers in
nilfs_segctor_prepare_write().
set_page_writeback() can call inc_wb_stat(inode_to_wb(inode),
WB_WRITEBACK), where inode_to_wb(inode) is NULL if inode_attach_wb() has
not been called in advance. To ensure inode_attach_wb() is called,
mark_buffer_dirty() should be called for those buffers.
The following patch fixes this issue, but I got another oops at
nilfs_segctor_complete_write() during a stress test. So, I'm still
investigating.
Regards,
Ryusuke Konishi
===
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 445eef4..f6b5ca8 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1650,6 +1650,8 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 		list_for_each_entry(bh, &segbuf->sb_segsum_buffers,
 				    b_assoc_buffers) {
+			set_buffer_uptodate(bh);
+			mark_buffer_dirty(bh);
 			if (bh->b_page != bd_page) {
 				if (bd_page) {
 					lock_page(bd_page);
@@ -1665,6 +1667,8 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 				    b_assoc_buffers) {
 			set_buffer_async_write(bh);
 			if (bh == segbuf->sb_super_root) {
+				set_buffer_uptodate(bh);
+				mark_buffer_dirty(bh);
 				if (bh->b_page != bd_page) {
 					lock_page(bd_page);
 					clear_page_dirty_for_io(bd_page);
===
On Thu, 30 Apr 2020 08:27:47 -0700, Tom <[email protected]> wrote:
> Thank you! This is very helpful information, and does seem to be a
> workaround.
>
> Like you, I have my home directory on a separate NILFS2 filesystem. As
> a temporary solution, I removed the line from /etc/fstab for that
> filesystem and added your dd suggestion along with a manual mount of
> the home filesystem to /etc/rc.local. /home is now mounted properly
> at boot with any of the newer kernels I tried.
>
> Thanks,
> Tom