2013-07-29 10:25:06

by Nix

Subject: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

My server's ARC-1210 has been working fine for years, but when I
upgraded from 3.10.1, it started failing:

Instead of

[ 0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
[ 0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
Driver Version 1.20.00.15 2010/08/05
[...]

[ 4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[ 4.115399] sd 7:0:0:1: [sdd] No Caching mode page present
[ 4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[ 4.118081] sdd: sdd1
[ 4.124363] sd 7:0:0:1: [sdd] No Caching mode page present
[ 4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[ 4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk

I now see (timestamps and some of the right edge chopped off because not
captured on my camera, no netconsole as this machine has all my storage
and is my loghost, and with this bug it can't get at any of that
storage).

sd 7:0:0:1: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sdd] No Caching mode page present
sd 7:0:0:1: [sdd] Assuming drive cache: write through
sdd: sdd1
sd 7:0:0:1: [sdd] No Caching mode page present
sd 7:0:0:1: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sdd] Attached SCSI removable disk
arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr: executing bus reset eh.....num_resets=0, num_[...]

arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset ....
arcmsr0: waiting for hw bus reset return, retry=0
arcmsr0: waiting for hw bus reset return, retry=1
Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
arcmsr: scsi bus reset eh returns with success
[and back to the top of the error messages again, apparently forever,
not that the machine would be much use without its RAID array even
if this loop terminated at some point, so I only gave it a couple
of minutes]

The failure happens precisely at the moment we transition to early
userspace, so presumably userspace I/O is failing (or something related
to raw device access, perhaps, since the first thing it does is a
vgscan).

I haven't bisected yet (sorry, I have work to do which means this
machine must be running right now), but nothing has changed in the
arcmsr controller, nor in SCSI-land excepting

commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
Author: Martin K. Petersen <[email protected]>
Date: Thu Jun 6 22:15:55 2013 -0400

SCSI: sd: Update WRITE SAME heuristics

so my, admittedly largely baseless, suspicions currently fall there.


Obviously, at this point, this machine has no modules loaded (it has
almost none loaded even when fully operational)

.config, unchanged from 3.10.1 to 3.10.3:

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_LZMA=y
CONFIG_DEFAULT_HOSTNAME="spindle"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_FHANDLE=y
CONFIG_AUDIT=y
CONFIG_HAVE_GENERIC_HARDIRQS=y

CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

CONFIG_TREE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_FANOUT=8
CONFIG_RCU_FANOUT_LEAF=8
CONFIG_LOG_BUF_SHIFT=18
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_NAMESPACES=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_UIDGID_CONVERTED=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE="usr/initramfs.spindle"
CONFIG_INITRAMFS_ROOT_UID=99
CONFIG_INITRAMFS_ROOT_GID=101
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_INITRAMFS_COMPRESSION_LZMA=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HOTPLUG=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_UID16=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_PCI_QUIRKS=y
CONFIG_HAVE_PERF_EVENTS=y

CONFIG_PERF_EVENTS=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_JUMP_LABEL=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y

CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y

CONFIG_PARTITION_ADVANCED=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BLOCK_COMPAT=y

CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_MUTEX_SPIN_ON_OWNER=y

CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_NO_BOOTMEM=y
CONFIG_MCORE2=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_P6_NOP=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=8
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_MICROCODE_INTEL_LIB=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_SECCOMP=y
CONFIG_HZ_100=y
CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

CONFIG_PM_RUNTIME=y
CONFIG_PM=y
CONFIG_ACPI=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_I2C=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_IPMI=m
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
CONFIG_ACPI_BLACKLIST_YEAR=0
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y

CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y

CONFIG_X86_ACPI_CPUFREQ=y

CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
CONFIG_INTEL_IDLE=y

CONFIG_I7300_IDLE_IOAT_CHANNEL=y
CONFIG_I7300_IDLE=y

CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_IOAPIC=y
CONFIG_PCI_LABEL=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y

CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y

CONFIG_PACKET=y
CONFIG_PACKET_DIAG=y
CONFIG_UNIX=y
CONFIG_UNIX_DIAG=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_PNP=y
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_INET_UDP_DIAG=y
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_ACCT=m
CONFIG_NETFILTER_XTABLES=y

CONFIG_NETFILTER_XT_MARK=y

CONFIG_NETFILTER_XT_TARGET_LOG=y

CONFIG_NETFILTER_XT_MATCH_ECN=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=y
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
CONFIG_NETFILTER_XT_MATCH_NFACCT=m
CONFIG_NETFILTER_XT_MATCH_OWNER=y

CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MANGLE=y

CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_HAVE_NET_DSA=y
CONFIG_LLC=y
CONFIG_DNS_RESOLVER=y
CONFIG_NETLINK_DIAG=y
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
CONFIG_BQL=y
CONFIG_BPF_JIT=y

CONFIG_BT=y
CONFIG_BT_RFCOMM=y

CONFIG_BT_HCIBTUSB=y
CONFIG_FIB_RULES=y
CONFIG_HAVE_BPF_JIT=y


CONFIG_UEVENT_HELPER_PATH=""
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y

CONFIG_PNP=y

CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_CDROM_PKTCDVD=y
CONFIG_CDROM_PKTCDVD_BUFFERS=16




CONFIG_HAVE_IDE=y

CONFIG_SCSI_MOD=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_PROC_FS=y

CONFIG_BLK_DEV_SD=y
CONFIG_BLK_DEV_SR=y
CONFIG_CHR_DEV_SG=y
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_SCAN_ASYNC=y

CONFIG_SCSI_LOWLEVEL=y
CONFIG_SCSI_ARCMSR=y
CONFIG_ATA=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y

CONFIG_SATA_AHCI=y
CONFIG_MD=y
CONFIG_BLK_DEV_DM=y
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y

CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_SBP2=m
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
CONFIG_DUMMY=m
CONFIG_MII=y
CONFIG_MACVLAN=y
CONFIG_MACVTAP=y
CONFIG_TUN=y

CONFIG_VHOST_NET=y
CONFIG_VHOST_RING=y

CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E1000E=y



CONFIG_INPUT=y

CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1680
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=1050
CONFIG_INPUT_EVDEV=y

CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=y

CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_LIBPS2=y
CONFIG_GAMEPORT=y

CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_UNIX98_PTYS=y

CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4

CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_POWEROFF=m
CONFIG_NVRAM=m
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=y
CONFIG_I2C_HELPER_AUTO=y


CONFIG_I2C_I801=y

CONFIG_PPS=y



CONFIG_PTP_1588_CLOCK=y

CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIO_DEVRES=y
CONFIG_HWMON=y
CONFIG_HWMON_VID=y

CONFIG_SENSORS_W83793=y

CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
CONFIG_THERMAL_GOV_STEP_WISE=y
CONFIG_SSB_POSSIBLE=y

CONFIG_BCMA_POSSIBLE=y


CONFIG_MFD_CORE=y
CONFIG_LPC_ICH=y
CONFIG_MEDIA_SUPPORT=y

CONFIG_MEDIA_CAMERA_SUPPORT=y
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2=y



CONFIG_MEDIA_SUBDRV_AUTOSELECT=y

CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=1

CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=256
CONFIG_DUMMY_CONSOLE=y
CONFIG_SOUND=y
CONFIG_SOUND_OSS_CORE=y
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=y
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_HRTIMER=y
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_VMASTER=y
CONFIG_SND_KCTL_JACK=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_PCI=y
CONFIG_SND_HDA_INTEL=y
CONFIG_SND_HDA_PREALLOC_SIZE=64
CONFIG_SND_HDA_INPUT_JACK=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0

CONFIG_HID=y
CONFIG_HID_GENERIC=y

CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_EZKEY=y
CONFIG_HID_KYE=y
CONFIG_HID_KENSINGTON=y
CONFIG_HID_LOGITECH=y
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y

CONFIG_USB_HID=y

CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB_ARCH_HAS_XHCI=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y

CONFIG_USB_DEFAULT_PERSIST=y
CONFIG_USB_DYNAMIC_MINORS=y

CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_PCI=y
CONFIG_USB_UHCI_HCD=y

CONFIG_USB_ACM=y


CONFIG_USB_STORAGE=y


CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_PL2303=m

CONFIG_EDAC=y
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_I7CORE=y
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"

CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y



CONFIG_RTC_DRV_CMOS=y


CONFIG_VIRT_DRIVERS=y
CONFIG_VIRTIO=m

CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_BALLOON=m


CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_IOMMU_SUPPORT=y


CONFIG_MEMORY=y

CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y

CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_JBD2=y
CONFIG_FS_MBCACHE=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QUOTA_TREE=y
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_FUSE_FS=y
CONFIG_CUSE=y
CONFIG_GENERIC_ACL=y


CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"

CONFIG_PROC_FS=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso-8859-1"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=y
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_UTF8=m

CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_STACKTRACE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y

CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_LATENCYTOP=y
CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_BRANCH_PROFILE_NONE=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_STRICT_DEVMEM=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_RODATA=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
CONFIG_DEFAULT_IO_DELAY_TYPE=0

CONFIG_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_PATH=y
CONFIG_SECURITY_APPARMOR=y
CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=0
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_DEFAULT_SECURITY=""
CONFIG_CRYPTO=y

CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_WORKQUEUE=y


CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_ECB=y


CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_SHA256=y

CONFIG_CRYPTO_AES=y


CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_BINARY_PRINTF=y

CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_CRC16=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
CONFIG_CRC32_SLICEBY8=y
CONFIG_ZLIB_INFLATE=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_OID_REGISTRY=y

--
NULL && (void)


2013-07-29 13:02:01

by Bernd Schubert

Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:
> My server's ARC-1210 has been working fine for years, but when I
> upgraded from 3.10.1, it started failing:
>
> Instead of
>
> [ 0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> [ 0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
> Driver Version 1.20.00.15 2010/08/05
> [...]
>
> [ 4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [ 4.115399] sd 7:0:0:1: [sdd] No Caching mode page present
> [ 4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [ 4.118081] sdd: sdd1
> [ 4.124363] sd 7:0:0:1: [sdd] No Caching mode page present
> [ 4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [ 4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk
>
> I now see (timestamps and some of the right edge chopped off because not
> captured on my camera, no netconsole as this machine has all my storage
> and is my loghost, and with this bug it can't get at any of that
> storage).
>
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sd 7:0:0:1: [sdd] No Caching mode page present
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sdd: sdd1
> sd 7:0:0:1: [sdd] No Caching mode page present
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sd 7:0:0:1: [sdd] Attached SCSI removable disk
> arcmsr0: abort device command of scsi id = 0 lun = 1
> arcmsr0: abort device command of scsi id = 0 lun = 0
> arcmsr: executing bus reset eh.....num_resets=0, num_[...]
>
> arcmsr0: wait 'abort all outstanding command' timeout
> arcmsr0: executing hw bus reset ....
> arcmsr0: waiting for hw bus reset return, retry=0
> arcmsr0: waiting for hw bus reset return, retry=1
> Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> arcmsr: scsi bus reset eh returns with success
> [and back to the top of the error messages again, apparently forever,
> not that the machine would be much use without its RAID array even
> if this loop terminated at some point, so I only gave it a couple
> of minutes]
>
> The failure happens precisely at the moment we transition to early
> userspace, so presumably userspace I/O is failing (or something related
> to raw device access, perhaps, since the first thing it does is a
> vgscan).
>
> I haven't bisected yet (sorry, I have work to do which means this
> machine must be running right now), but nothing has changed in the
> arcmsr controller, nor in SCSI-land excepting
>
> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
> Author: Martin K. Petersen <[email protected]>
> Date: Thu Jun 6 22:15:55 2013 -0400
>
> SCSI: sd: Update WRITE SAME heuristics
>
> so my, admittedly largely baseless, suspicions currently fall there.
>
>
> Obviously, at this point, this machine has no modules loaded (it has
> almost none loaded even when fully operational)

I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
patch is only in 3.10.3, but not yet in 3.10.1. And I don't think this
commit can cause your issue at all, a failing heuristics would enable
WRITE SAME and would cause issues with linux-md, but there shouldn't
happen anything directly in the scsi-layer.
Which was your last working kernel version?


Thanks,
Bernd

2013-07-29 13:05:51

by Nix

Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 29 Jul 2013, Bernd Schubert said:

> Hi Nick,
>
> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>> arcmsr0: abort device command of scsi id = 0 lun = 1
>> arcmsr0: abort device command of scsi id = 0 lun = 0
>> arcmsr: executing bus reset eh.....num_resets=0, num_[...]
>>
>> arcmsr0: wait 'abort all outstanding command' timeout
>> arcmsr0: executing hw bus reset ....
>> arcmsr0: waiting for hw bus reset return, retry=0
>> arcmsr0: waiting for hw bus reset return, retry=1
>> Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
>> arcmsr: scsi bus reset eh returns with success
>> [and back to the top of the error messages again, apparently forever,
>> not that the machine would be much use without its RAID array even
>> if this loop terminated at some point, so I only gave it a couple
>> of minutes]
>>
>> The failure happens precisely at the moment we transition to early
>> userspace, so presumably userspace I/O is failing (or something related
>> to raw device access, perhaps, since the first thing it does is a
>> vgscan).
>>
>> I haven't bisected yet (sorry, I have work to do which means this
>> machine must be running right now), but nothing has changed in the
>> arcmsr controller, nor in SCSI-land excepting
>>
>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>> Author: Martin K. Petersen <[email protected]>
>> Date: Thu Jun 6 22:15:55 2013 -0400
[...]
>> Obviously, at this point, this machine has no modules loaded (it has
>> almost none loaded even when fully operational)
>
> I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
> patch is only in 3.10.3, but not yet in 3.10.1.

... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
3.10.2.)

> And I don't think this
> commit can cause your issue at all, a failing heuristics would enable
> WRITE SAME and would cause issues with linux-md, but there shouldn't
> happen anything directly in the scsi-layer. Which was your last
> working kernel version?

3.10.1. :)

No changes to arcmsr between those versions... I suspect I'll have to
bisect, which will be a complete pig because every failure means a hard
powerdown of this box. Always-on servers rarely appreciate hard
powerdowns :(

--
NULL && (void)

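[Editorial sketch, not part of the original mail.] The bisection discussed
above could be set up roughly as follows. This assumes a linux-stable
checkout that carries the v3.10.1 and v3.10.3 tags and in which the commit
hash quoted in the report resolves. The path restriction, the branch name
revert-test and the helper function names are illustrative assumptions, not
anything taken from the thread.

```shell
#!/bin/sh
# Sketch only: run from a linux-stable checkout with the v3.10.x tags.
GOOD=v3.10.1   # last kernel reported working
BAD=v3.10.3    # first kernel reported failing
SUSPECT=98dcc2946adbe4349ef1ef9b99873b912831edd4  # hash quoted in the report
PATHS="drivers/scsi block"  # assumption: the culprit touches one of these

# List the candidate commits between the two releases for the chosen paths.
list_candidates() {
    git log --oneline "$GOOD..$BAD" -- $PATHS
}

# Cheap first test: the failing release with only the suspect reverted.
# "revert-test" is a hypothetical branch name.
try_revert() {
    git checkout -b revert-test "$BAD" &&
    git revert --no-edit "$SUSPECT"
}

# Fall back to a real bisection. After each test boot, mark the result
# with "git bisect good" or "git bisect bad"; finish with "git bisect reset".
start_bisect() {
    git bisect start "$BAD" "$GOOD" -- $PATHS
}
```

Sourcing the file only defines the helpers. Running list_candidates first
shows how many commits are in play, and try_revert costs a single test boot,
which matters when every failed boot means a hard powerdown. If the path
restriction turns out to be wrong, the bisection can be restarted without
the trailing "-- $PATHS".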
2013-07-29 14:17:04

by Bernd Schubert

Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 07/29/2013 03:05 PM, Nix wrote:
> On 29 Jul 2013, Bernd Schubert said:
>
>> Hi Nick,
>>
>> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>>> arcmsr0: abort device command of scsi id = 0 lun = 1
>>> arcmsr0: abort device command of scsi id = 0 lun = 0
>>> arcmsr: executing bus reset eh.....num_resets=0, num_[...]
>>>
>>> arcmsr0: wait 'abort all outstanding command' timeout
>>> arcmsr0: executing hw bus reset ....
>>> arcmsr0: waiting for hw bus reset return, retry=0
>>> arcmsr0: waiting for hw bus reset return, retry=1
>>> Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
>>> arcmsr: scsi bus reset eh returns with success
>>> [and back to the top of the error messages again, apparently forever,
>>> not that the machine would be much use without its RAID array even
>>> if this loop terminated at some point, so I only gave it a couple
>>> of minutes]
>>>
>>> The failure happens precisely at the moment we transition to early
>>> userspace, so presumably userspace I/O is failing (or something related
>>> to raw device access, perhaps, since the first thing it does is a
>>> vgscan).
>>>
>>> I haven't bisected yet (sorry, I have work to do which means this
>>> machine must be running right now), but nothing has changed in the
>>> arcmsr controller, nor in SCSI-land excepting
>>>
>>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>>> Author: Martin K. Petersen <[email protected]>
>>> Date: Thu Jun 6 22:15:55 2013 -0400
> [...]
>>> Obviously, at this point, this machine has no modules loaded (it has
>>> almost none loaded even when fully operational)
>>
>> I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
>> patch is only in 3.10.3, but not yet in 3.10.1.
>
> ... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
> 3.10.2.)

Hmm, indeed that points to this commit. I just don't see what could fail
there.

Could you try to run these commands with 3.10.1?

# # check if reporting opcodes works
# sg_opcodes -v -n /dev/sdX

# check ata information page
# sg_vpd --page=0x89 /dev/sdX

>
>> And I don't think this
>> commit can cause your issue at all, a failing heuristics would enable
>> WRITE SAME and would cause issues with linux-md, but there shouldn't
>> happen anything directly in the scsi-layer. Which was your last
>> working kernel version?
>
> 3.10.1. :)

Whoops, sorry, I missed that in your first sentence.

>
> No changes to arcmsr between those versions... I suspect I'll have to
> bisect, which will be a complete pig because every failure means a hard
> powerdown of this box. Always-on servers rarely appreciate hard
> powerdowns :(
>

Maybe just revert this commit? Helpful would be some scsi logging to see
which command actually fails. I guess you don't have a serial console?


Thanks,
Bernd

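[Editorial sketch, not part of the original mail.] The SCSI logging
suggested above is controlled by a bitmask with three bits per facility.
The snippet below computes one plausible value. It assumes the facility
offsets from drivers/scsi/scsi_logging.h (error handling at bit 0,
mid-layer queueing at bit 9, mid-layer completion at bit 12) and an
arbitrarily chosen verbosity of 3. It also assumes a kernel rebuilt with
CONFIG_SCSI_LOGGING=y, which the posted .config does not list.

```shell
#!/bin/sh
# Bit offsets of three logging facilities (see drivers/scsi/scsi_logging.h).
ERROR_SHIFT=0        # error handler: aborts and resets
MLQUEUE_SHIFT=9      # mid-layer command queueing
MLCOMPLETE_SHIFT=12  # mid-layer command completion
LEVEL=3              # assumption: per-facility verbosity, valid range 0..7

# Combine the per-facility levels into a single mask.
MASK=$(( (LEVEL << ERROR_SHIFT) | (LEVEL << MLQUEUE_SHIFT) | (LEVEL << MLCOMPLETE_SHIFT) ))

# Print the boot parameter; the scsi_mod prefix is needed because the
# SCSI core is built in here (CONFIG_SCSI=y).
echo "scsi_mod.scsi_logging_level=$MASK"
```

Because the failure hits before userspace can do anything useful, the value
has to go on the kernel command line rather than into
/proc/sys/dev/scsi/logging_level. The posted .config already has
CONFIG_SERIAL_8250_CONSOLE=y, so adding something like console=ttyS0,115200
(port and speed are assumptions) would let a second machine capture the
output that the camera cut off.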
2013-07-29 14:26:50

by Martin K. Petersen

Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Nick" == Nick Alcock <[email protected]> writes:

Nick> My server's ARC-1210 has been working fine for years, but when I
Nick> upgraded from 3.10.1, it started failing:

Nick> [ 0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model
Nick> ARC-1210 [ 0.804028] scsi0 : Areca SATA Host Adapter RAID
Nick> Controller
Nick> Driver Version 1.20.00.15 2010/08/05
Nick> [...]

Interesting. Please provide the output of:

# sg_inq /dev/sdd
# sg_vpd /dev/sdd
# sg_vpd -p ai /dev/sdd

--
Martin K. Petersen Oracle Linux Engineering

2013-07-29 14:27:40

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:

Bernd> I tested this patch with ARC-1260 and F/W V1.49, no issues.

It could be due to the firmware version discrepancy.

--
Martin K. Petersen Oracle Linux Engineering

2013-07-29 15:01:59

by Nix

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 29 Jul 2013, Bernd Schubert spake thusly:

> On 07/29/2013 03:05 PM, Nix wrote:
>> On 29 Jul 2013, Bernd Schubert said:
>>> I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
>>> patch is only in 3.10.3, but not yet in 3.10.1.
>>
>> ... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
>> 3.10.2.)
>
> Hmm, indeed that points to this commit. I just don't see what could fail there.
>
> Could you try to run these commands with 3.10.1?
>
> # # check if reporting opcodes works
> # sg_opcodes -v -n /dev/sdX
>
> # check ata information page
> # sg_vpd --page=0x89 /dev/sdX

If this might cause the same problem I think I'd better wait until work
is done for the day and the machine is no longer loaded, and can be
rebooted without harm...

>> No changes to arcmsr between those versions... I suspect I'll have to
>> bisect, which will be a complete pig because every failure means a hard
>> powerdown of this box. Always-on servers rarely appreciate hard
>> powerdowns :(
>>
>
> Maybe just revert this commit? Helpful would be some scsi logging to
> see which command actually fails. I guess you don't have a serial
> console?

Not at that stage, no! And, yes, a test revert of this one commit will
be the first thing I try this evening / tomorrow morning (depending on
system load).

--
NULL && (void)

2013-07-29 20:04:54

by Nix

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 29 Jul 2013, Bernd Schubert spake thusly:
> Could you try to run these commands with 3.10.1?
>
> # # check if reporting opcodes works
> # sg_opcodes -v -n /dev/sdX

spindle:/boot# sg_opcodes -v -n /dev/sda
inquiry cdb: 12 00 00 00 24 00
Report Supported Operation Codes cmd: a3 0c 00 00 00 00 00 00 20 00 00 00
Report Supported Operation Codes: Fixed format, current; Sense key: Illegal Request
Additional sense: Invalid command operation code
Info fld=0x0 [0]
Sense Key Specific: Error in Command byte 3840
Report supported operation codes: operation not supported

(sdb is the same, obviously, since they are both separate RAID volumes
controlled by the same controller.)

> # check ata information page
> # sg_vpd --page=0x89 /dev/sdX

spindle:/boot# sg_vpd --page=0x89 /dev/sda
ATA information VPD page:
fetching VPD page failed

Not very helpful, I know :(

I'll try rebooting into a kernel with that commit reverted next.

Areca controllers appear to be a bit weird: e.g. they needed special
support in smartctl...

>> No changes to arcmsr between those versions... I suspect I'll have to
>> bisect, which will be a complete pig because every failure means a hard
>> powerdown of this box. Always-on servers rarely appreciate hard
>> powerdowns :(
>
> Maybe just revert this commit? Helpful would be some scsi logging to
> see which command actually fails. I guess you don't have a serial
> console?

I could set one up, in theory, but the problem is that all my machines
are rather dependent on my NFS-mounted $HOME. Guess where it's mounted
from... in any case, the machine has no serial port, so it would have to
be a usb-serial console, and we know exactly how reliable those are :/

--
NULL && (void)

2013-07-29 20:15:46

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Nix" == Nix <[email protected]> writes:

Nix> spindle:/boot# sg_vpd --page=0x89 /dev/sda ATA information VPD
Nix> page: fetching VPD page failed

Please add -v

I'll also need the output of:

# sg_vpd -vl


Nix> I'll try rebooting into a kernel with that commit reverted next.

Doesn't matter as far as the sg commands are concerned...

--
Martin K. Petersen Oracle Linux Engineering

2013-07-29 21:09:42

by Nix

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 29 Jul 2013, Bernd Schubert uttered the following:

> On 07/29/2013 03:05 PM, Nix wrote:
>> On 29 Jul 2013, Bernd Schubert said:
>>
>>> Hi Nick,
>>>
>>> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>>>> arcmsr0: abort device command of scsi id = 0 lun = 1
>>>> arcmsr0: abort device command of scsi id = 0 lun = 0
>>>> arcmsr: executing bus reset eh.....num_resets=0, num_[...]
>>>>
>>>> arcmsr0: wait 'abort all outstanding command' timeout
>>>> arcmsr0: executing hw bus reset ....
>>>> arcmsr0: waiting for hw bus reset return, retry=0
>>>> arcmsr0: waiting for hw bus reset return, retry=1
>>>> Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
>>>> arcmsr: scsi bus reset eh returns with success
>>>> [and back to the top of the error messages again, apparently forever,
>>>> not that the machine would be much use without its RAID array even
>>>> if this loop terminated at some point, so I only gave it a couple
>>>> of minutes]
>>>>
>>>> The failure happens precisely at the moment we transition to early
>>>> userspace, so presumably userspace I/O is failing (or something related
>>>> to raw device access, perhaps, since the first thing it does is a
>>>> vgscan).
>>>>
>>>> I haven't bisected yet (sorry, I have work to do which means this
>>>> machine must be running right now), but nothing has changed in the
>>>> arcmsr controller, nor in SCSI-land excepting
>>>>
>>>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>>>> Author: Martin K. Petersen <[email protected]>
>>>> Date: Thu Jun 6 22:15:55 2013 -0400

I can now confirm that reverting this commit causes this problem to go
away, and my machine boots fine again.

Please revert (and figure out what is wrong so that 3.11 doesn't
implode in the same way? I'm happy to assist...)

(My apologies if a 'please revert' from someone bitten by a stable
regression isn't adequate reason to revert the thing: I've never been
quite sure who should report regressions in stable patches to Greg. It
should at least be *evidence*. So here's my "it crashed and now it
doesn't" evidence. :} )

--
NULL && (void)

2013-07-29 23:35:01

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Nix" == Nix <[email protected]> writes:

Bernd,

Nix> I can now confirm that reverting this commit causes this problem to
Nix> go away, and my machine boots fine again.

Can you please send me the output of sg_inq with your 1.49 firmware?

I made a tweak that allowed Nix to boot but we're trying to find a good
blacklist trigger. And that's tricky given that Areca allows you to
manually specify the SCSI model string for each volume...

--
Martin K. Petersen Oracle Linux Engineering

2013-07-30 00:29:03

by Douglas Gilbert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 13-07-29 05:09 PM, Nix wrote:
> On 29 Jul 2013, Bernd Schubert uttered the following:
>
>> On 07/29/2013 03:05 PM, Nix wrote:
>>> On 29 Jul 2013, Bernd Schubert said:
>>>
>>>> Hi Nick,
>>>>
>>>> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>>>>> arcmsr0: abort device command of scsi id = 0 lun = 1
>>>>> arcmsr0: abort device command of scsi id = 0 lun = 0
>>>>> arcmsr: executing bus reset eh.....num_resets=0, num_[...]
>>>>>
>>>>> arcmsr0: wait 'abort all outstanding command' timeout
>>>>> arcmsr0: executing hw bus reset ....
>>>>> arcmsr0: waiting for hw bus reset return, retry=0
>>>>> arcmsr0: waiting for hw bus reset return, retry=1
>>>>> Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
>>>>> arcmsr: scsi bus reset eh returns with success
>>>>> [and back to the top of the error messages again, apparently forever,
>>>>> not that the machine would be much use without its RAID array even
>>>>> if this loop terminated at some point, so I only gave it a couple
>>>>> of minutes]
>>>>>
>>>>> The failure happens precisely at the moment we transition to early
>>>>> userspace, so presumably userspace I/O is failing (or something related
>>>>> to raw device access, perhaps, since the first thing it does is a
>>>>> vgscan).
>>>>>
>>>>> I haven't bisected yet (sorry, I have work to do which means this
>>>>> machine must be running right now), but nothing has changed in the
>>>>> arcmsr controller, nor in SCSI-land excepting
>>>>>
>>>>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>>>>> Author: Martin K. Petersen <[email protected]>
>>>>> Date: Thu Jun 6 22:15:55 2013 -0400
>
> I can now confirm that reverting this commit causes this problem to go
> away, and my machine boots fine again.
>
> Please revert (and figure out what is wrong so that 3.11 doesn't
> implode in the same way? I'm happy to assist...)

Hi,
Please supply the information that Martin Petersen asked
for.

I just examined a more recent Areca SAS RAID controller
and would describe it as the SCSI device from hell. One solution
to this problem is to modify the arcmsr driver so it returns
a more consistent set of lies to the management SCSI commands that
Martin is asking about.

Doug Gilbert

2013-07-30 00:56:33

by Nix

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 30 Jul 2013, Douglas Gilbert outgrape:

> Please supply the information that Martin Petersen asked
> for.

Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.

It looks like a solution is possible that will let us boot *both* my
controller (with its old 2009-era firmware) *and* his. We just have
to let Martin implement it. Give him time, I only got a successful
boot out of it an hour ago :)

> I just examined a more recent Areca SAS RAID controller
> and would describe it as the SCSI device from hell. One solution
> to this problem is to modify the arcmsr driver so it returns
> a more consistent set of lies to the management SCSI commands that
> Martin is asking about.

I can't help notice that something is skewy in its error handling, too.
When the controller errors, even resetting the bus doesn't seem to be
enough to bring it back :/ I've seen errors from it before which did
*not* lead to it imploding forever, but this is apparently not one such.

Certainly Areca-the-company has... issues with communication with the
community (i.e., they don't). A shame I didn't know that before I bought
the controller and made all my data completely dependent on it, really.
Shame, the controller otherwise works very well (fast, and has coped
with a disk failure with aplomb).

--
NULL && (void)

2013-07-30 18:09:52

by Bernd Schubert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 07/30/2013 01:34 AM, Martin K. Petersen wrote:
>>>>>> "Nix" == Nix <[email protected]> writes:
>
> Bernd,
>
> Nix> I can now confirm that reverting this commit causes this problem to
> Nix> go away, and my machine boots fine again.
>
> Can you please send me the output of sg_inq with your 1.49 firmware?
>
> I made a tweak that allowed Nix to boot but we're trying to find a good
> blacklist trigger. And that's tricky given that Areca allows you to
> manually specify the SCSI model string for each volume...
>

Sorry it got a bit late today.

Here it is.

> (wheezy)fslab1:~# sg_inq -v /dev/sdc
> inquiry cdb: 12 00 00 00 24 00
> standard INQUIRY:
> inquiry cdb: 12 00 00 00 60 00
> PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
> [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2
> SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 BQue=0
> EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1
> [RelAdr=0] WBus16=1 Sync=0 Linked=0 [TranDis=0] CmdQue=1
> [SPI: Clocking=0x3 QAS=0 IUS=0]
> length=96 (0x60) Peripheral device type: disk
> Vendor identification: Hitachi
> Product identification: HDS724040KLSA80
> Product revision level: R001
> inquiry cdb: 12 01 00 00 fc 00
> inquiry cdb: 12 01 80 00 fc 00
> Unit serial number: KRFS2CRAHXJZVD

Besides the firmware, the difference might be that I'm exporting single
disks without any areca-raidset in between.
I can try to confirm that tomorrow, I just need the system as it is till
tomorrow noon.


Cheers,
Bernd

2013-07-30 18:14:49

by Bernd Schubert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 07/30/2013 02:56 AM, Nix wrote:
> On 30 Jul 2013, Douglas Gilbert outgrape:
>
>> Please supply the information that Martin Petersen asked
>> for.
>
> Did it in private IRC (the advantage of working for the same division of
> the same company!)
>
> I didn't realise the original fix was actually implemented to allow
> Bernd, with a different Areca controller, to boot... obviously, in that
> situation, reversion is wrong, since that would just replace one won't-
> boot situation with another.

Unless there is a very simple fix, the commit should be reverted, imho. It
would be better then to remove write-same support from the md-layer.


Cheers,
Bernd

2013-07-30 21:21:07

by Nix

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 30 Jul 2013, Bernd Schubert told this:

> On 07/30/2013 02:56 AM, Nix wrote:
>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>
>>> Please supply the information that Martin Petersen asked
>>> for.
>>
>> Did it in private IRC (the advantage of working for the same division of
>> the same company!)
>>
>> I didn't realise the original fix was actually implemented to allow
>> Bernd, with a different Areca controller, to boot... obviously, in that
>> situation, reversion is wrong, since that would just replace one won't-
>> boot situation with another.
>
> Unless there is a very simple fix, the commit should be reverted, imho. It
> would be better then to remove write-same support from the md-layer.

I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.

--
NULL && (void)

2013-07-31 00:25:05

by Nick Alcock

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 30 Jul 2013, Bernd Schubert told this:

> On 07/30/2013 01:34 AM, Martin K. Petersen wrote:
>> (wheezy)fslab1:~# sg_inq -v /dev/sdc
>> inquiry cdb: 12 00 00 00 24 00
>> standard INQUIRY:
>> inquiry cdb: 12 00 00 00 60 00
>> PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
>> [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2
>> SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 BQue=0
>> EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1
>> [RelAdr=0] WBus16=1 Sync=0 Linked=0 [TranDis=0] CmdQue=1
>> [SPI: Clocking=0x3 QAS=0 IUS=0]
>> length=96 (0x60) Peripheral device type: disk
>> Vendor identification: Hitachi
>> Product identification: HDS724040KLSA80
>> Product revision level: R001
>> inquiry cdb: 12 01 00 00 fc 00
>> inquiry cdb: 12 01 80 00 fc 00
>> Unit serial number: KRFS2CRAHXJZVD
>
> Besides the firmware, the difference might be that I'm exporting single disks without any areca-raidset in between.
> I can try to confirm that tomorrow, I just need the system as it is till tomorrow noon.

Aaah. Yeah, it looks like in JBOD mode it's just passing things straight
on to the disk: that vendor ID is a dead giveaway. For all I know my
earlier firmware does the same, but for obvious reasons I can't really
test that! Quite possibly it's passing *everything* on to the disk,
including all SCSI commands, in which case we don't actually know that
your Areca controller supports the VPD page we thought it did: quite
possibly only this underlying disk does.

You can get a degree of info on the underlying disks in the array even
if it's in RAID mode -- smartctl does it, for instance -- but it takes
Areca-specific code and chattering to the sg devices directly. I bet
that in JBOD mode, the sg device is the only exposure the controller has
to the world, and *all* the /dev/sd* devices are just passthroughs.

--
NULL && (void)

2013-07-31 03:11:03

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Doug" == Douglas Gilbert <[email protected]> writes:

Doug> I just examined a more recent Areca SAS RAID controller and would
Doug> describe it as the SCSI device from hell. One solution to this
Doug> problem is to modify the arcmsr driver so it returns a more
Doug> consistent set of lies to the management SCSI commands that Martin
Doug> is asking about.

Yeah. This is quite the challenge given that the product id is
user-specified and product revision hardcoded. I can match on "Areca" in
the vendor id and that's about it :(

My current approach is to tweak the driver so that I can set
skip_vpd_pages for the ATA models. Under the assumption that the SAS
controllers actually feature the ATA Information VPD...
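
To illustrate the matching problem: in standard INQUIRY data the vendor identification occupies bytes 8-15 (space padded), the product identification bytes 16-31, and the revision bytes 32-35. If only the vendor field is trustworthy, a match can look at nothing else. The following is a minimal user-space sketch under that assumption; inquiry_vendor_is() is a hypothetical helper, not the kernel's device-info matching code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the 8-byte, space-padded vendor
 * identification field (bytes 8-15 of standard INQUIRY data) equals
 * `vendor`, and 0 otherwise.
 */
static int inquiry_vendor_is(const uint8_t *inq, size_t len, const char *vendor)
{
	size_t n;

	assert(inq != NULL && vendor != NULL);
	n = strlen(vendor);

	/* Need at least bytes 0-15; the vendor field is 8 bytes wide. */
	if (len < 16 || n > 8)
		return 0;
	if (memcmp(inq + 8, vendor, n) != 0)
		return 0;
	/* The rest of the field must be padding. */
	for (size_t i = n; i < 8; i++)
		if (inq[8 + i] != ' ')
			return 0;
	return 1;
}
```

Called on a 36-byte INQUIRY response, inquiry_vendor_is(buf, 36, "Areca") would succeed for a volume reporting that vendor string and fail for the "Hitachi" response shown earlier in the thread.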

--
Martin K. Petersen Oracle Linux Engineering

2013-07-31 03:16:11

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:

Bernd,

>> Product revision level: R001

It's clearly not verbatim passthrough...

Bernd> Besides the firmware, the difference might be that I'm exporting
Bernd> single disks without any areca-raidset in between. I can try to
Bernd> confirm that tomorrow, I just need the system as it is till
Bernd> tomorrow noon.

That would be a great data point. I don't have any Areca boards.

--
Martin K. Petersen Oracle Linux Engineering

2013-07-31 03:20:17

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

>>>>> "Nick" == Nick Alcock <[email protected]> writes:

Nick> in which case we don't actually know that your Areca controller
Nick> supports the VPD page we thought it did: quite possibly only this
Nick> underlying disk does.

The ATA Information VPD page is created by the SCSI-ATA Translation
layer. The controller firmware in this case.

--
Martin K. Petersen Oracle Linux Engineering

2013-07-31 17:51:14

by Bernd Schubert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 07/31/2013 05:15 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:
>
> Bernd,
>
>>> Product revision level: R001
>
> It's clearly not verbatim passthrough...
>
> Bernd> Besides the firmware, the difference might be that I'm exporting
> Bernd> single disks without any areca-raidset in between. I can try to
> Bernd> confirm that tomorrow, I just need the system as it is till
> Bernd> tomorrow noon.
>
> That would be a great data point. I don't have any Areca boards.
>

Just tested it, areca-raidset does not make a difference, but the
firmware version does. After downgrading to 1.46 I have the same issue.

It is getting a bit late for me, but as this is a pure development system,
which is also booted over nfs, I can investigate it tomorrow.


Cheers,
Bernd

2013-07-31 18:40:57

by Bernd Schubert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 07/31/2013 07:51 PM, Bernd Schubert wrote:
> On 07/31/2013 05:15 AM, Martin K. Petersen wrote:
>>>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:
>>
>> Bernd,
>>
>>>> Product revision level: R001
>>
>> It's clearly not verbatim passthrough...
>>
>> Bernd> Besides the firmware, the difference might be that I'm exporting
>> Bernd> single disks without any areca-raidset in between. I can try to
>> Bernd> confirm that tomorrow, I just need the system as it is till
>> Bernd> tomorrow noon.
>>
>> That would be a great data point. I don't have any Areca boards.
>>
>
> Just tested it, areca-raidset does not make a difference, but the
> firmware version does. After downgrading to 1.46 I have the same issue.
>
> It is getting a bit late for me, but as this is a pure development system,
> which is also booted over nfs, I can investigate it tomorrow.
>


I couldn't resist and captured a few logs, see attachment. Is
0xffff88007feff180 the VPD inquiry? Can't we just cancel that command if
submitting it fails and then turn off VPD inquiry for this controller?


Attachments:
kern.log (120.47 kB)

2013-08-01 14:34:20

by Bernd Schubert

[permalink] [raw]
Subject: [PATCH] scsi disk: Use its own buffer for the vpd request

Once I noticed that scsi_get_vpd_page() works fine from other function
calls and that it is not 0x89, but already 0x0 that fails, fixing it became
easy.

Nix, any chance you could verify it also works for you?


From: Bernd Schubert <[email protected]>

Somehow older areca firmware versions have issues with
scsi_get_vpd_page() and a large buffer.
Even scsi_get_vpd_page(, page=0,) failed in sd_read_write_same(),
while a similar request from sd_read_block_limits() worked fine.
Limiting the buf-size to 64-bytes fixes the issue with F/W V1.46.

Fixes a regression with areca controllers and older firmware versions
introduced by commit: 66c28f97120e8a621afd5aa7a31c4b85c547d33d

Reported-by: Nix <[email protected]>
Signed-off-by: Bernd Schubert <[email protected]>
CC: [email protected]
---
drivers/scsi/sd.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 80f39b8..02e50ae 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2651,13 +2651,16 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 	struct scsi_device *sdev = sdkp->device;
 
 	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
+		/* too large values might cause issues with arcmsr */
+		int vpd_buf_len = 64;
+
 		sdev->no_report_opcodes = 1;
 
 		/* Disable WRITE SAME if REPORT SUPPORTED OPERATION
 		 * CODES is unsupported and the device has an ATA
 		 * Information VPD page (SAT).
 		 */
-		if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
+		if (!scsi_get_vpd_page(sdev, 0x89, buffer, vpd_buf_len))
 			sdev->no_write_same = 1;
 	}
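
The branch this hunk touches can be modelled in plain C outside the kernel. This is a simplified sketch that assumes only what the hunk shows: struct dev_flags and apply_write_same_heuristic() are hypothetical names, and the two booleans stand in for the outcomes of scsi_report_opcode() and scsi_get_vpd_page(sdev, 0x89, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two scsi_device flags set in the hunk. */
struct dev_flags {
	bool no_report_opcodes;
	bool no_write_same;
};

/*
 * Model of the heuristic:
 *   report_opcodes_ok - REPORT SUPPORTED OPERATION CODES was answered
 *   has_ata_info_vpd  - the ATA Information VPD page (0x89) was fetched
 */
static void apply_write_same_heuristic(struct dev_flags *f,
				       bool report_opcodes_ok,
				       bool has_ata_info_vpd)
{
	assert(f != NULL);

	if (!report_opcodes_ok) {
		f->no_report_opcodes = true;

		/* A SAT device that cannot report opcodes: do not try
		 * WRITE SAME at all. */
		if (has_ata_info_vpd)
			f->no_write_same = true;
	}
}
```

In the failure reported in this thread the VPD request itself appears to trigger the abort/reset loop, so with F/W V1.46 the probe never returns a usable answer for this decision.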

2013-08-01 14:37:59

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

Whoops, the title is wrong, it should have been:

[PATCH] scsi disk: Limit get_vpd_page buf size

On 08/01/2013 04:34 PM, Bernd Schubert wrote:
> Once I noticed that scsi_get_vpd_page() works fine from other function
> calls and that it is not 0x89, but already 0x0 that fails fixing it became
> easy.
>
> Nix, any chance you could verify it also works for you?
>
>
> From: Bernd Schubert <[email protected]>
>
> Somehow older areca firmware versions have issues with
> scsi_get_vpd_page() and a large buffer.
> Even scsi_get_vpd_page(, page=0,) failed in sd_read_write_same(),
> while a similar request from sd_read_block_limits() worked fine.
> Limiting the buf-size to 64-bytes fixes the issue with F/W V1.46.
>
> Fixes a regression with areca controllers and older firmware versions
> introduced by commit: 66c28f97120e8a621afd5aa7a31c4b85c547d33d
>
> Reported-by: Nix <[email protected]>
> Signed-off-by: Bernd Schubert <[email protected]>
> CC: [email protected]
> ---
> drivers/scsi/sd.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 80f39b8..02e50ae 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2651,13 +2651,16 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
>  	struct scsi_device *sdev = sdkp->device;
>
>  	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
> +		/* too large values might cause issues with arcmsr */
> +		int vpd_buf_len = 64;
> +
>  		sdev->no_report_opcodes = 1;
>
>  		/* Disable WRITE SAME if REPORT SUPPORTED OPERATION
>  		 * CODES is unsupported and the device has an ATA
>  		 * Information VPD page (SAT).
>  		 */
> -		if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
> +		if (!scsi_get_vpd_page(sdev, 0x89, buffer, vpd_buf_len))
>  			sdev->no_write_same = 1;
>  	}
>
>

2013-08-01 14:55:23

by Bernd Schubert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 07/30/2013 11:20 PM, Nix wrote:
> On 30 Jul 2013, Bernd Schubert told this:
>
>> On 07/30/2013 02:56 AM, Nix wrote:
>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>
>>>> Please supply the information that Martin Petersen asked
>>>> for.
>>>
>>> Did it in private IRC (the advantage of working for the same division of
>>> the same company!)
>>>
>>> I didn't realise the original fix was actually implemented to allow
>>> Bernd, with a different Areca controller, to boot... obviously, in that
>>> situation, reversion is wrong, since that would just replace one won't-
>>> boot situation with another.
>>
>> Unless there is a very simple fix, the commit should be reverted, imho. It
>> would be better then to remove write-same support from the md-layer.
>
> I'm not using md on that machine, just LVM. Our suspicion is that ext4
> is doing a WRITE SAME for some reason.
>

I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with
lazy init it also will happen after mounting the file system, while lazy
init is running (inode zeroing).


Cheers,
Bernd

2013-08-01 16:04:40

by Nix

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 1 Aug 2013, Bernd Schubert verbalised:

> On 07/30/2013 11:20 PM, Nix wrote:
>> On 30 Jul 2013, Bernd Schubert told this:
>>
>>> On 07/30/2013 02:56 AM, Nix wrote:
>>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>>
>>>>> Please supply the information that Martin Petersen asked
>>>>> for.
>>>>
>>>> Did it in private IRC (the advantage of working for the same division of
>>>> the same company!)
>>>>
>>>> I didn't realise the original fix was actually implemented to allow
>>>> Bernd, with a different Areca controller, to boot... obviously, in that
>>>> situation, reversion is wrong, since that would just replace one won't-
>>>> boot situation with another.
>>>
>>> Unless there is a very simple fix, the commit should be reverted, imho. It
>>> would be better then to remove write-same support from the md-layer.
>>
>> I'm not using md on that machine, just LVM. Our suspicion is that ext4
>> is doing a WRITE SAME for some reason.
>
> I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with
> lazy init it also will happen after mounting the file system, while
> lazy init is running (inode zeroing).

Well, it'll happen the first few times you mount the fs. If your fs is
years old (as mine are) the inode tables will probably have been
initialized by now!

--
NULL && (void)

2013-08-01 16:21:49

by Bernd Schubert

[permalink] [raw]
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

On 08/01/2013 06:04 PM, Nix wrote:
> On 1 Aug 2013, Bernd Schubert verbalised:
>
>> On 07/30/2013 11:20 PM, Nix wrote:
>>> On 30 Jul 2013, Bernd Schubert told this:
>>>
>>>> On 07/30/2013 02:56 AM, Nix wrote:
>>>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>>>
>>>>>> Please supply the information that Martin Petersen asked
>>>>>> for.
>>>>>
>>>>> Did it in private IRC (the advantage of working for the same division of
>>>>> the same company!)
>>>>>
>>>>> I didn't realise the original fix was actually implemented to allow
>>>>> Bernd, with a different Areca controller, to boot... obviously, in that
>>>>> situation, reversion is wrong, since that would just replace one won't-
>>>>> boot situation with another.
>>>>
>>>> Unless there is a very simple fix, the commit should be reverted, imho. It
>>>> would be better then to remove write-same support from the md-layer.
>>>
>>> I'm not using md on that machine, just LVM. Our suspicion is that ext4
>>> is doing a WRITE SAME for some reason.
>>
>> I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with
>> lazy init it also will happen after mounting the file system, while
>> lazy init is running (inode zeroing).
>
> Well, it'll happen the first few times you mount the fs. If your fs is
> years old (as mine are) the inode tables will probably have been
> initialized by now!
>

I'm frequently doing tests with millions of files and reformatting is
way faster than deleting all these files.

2013-08-02 03:00:37

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:

Bernd,

Bernd> Once I noticed that scsi_get_vpd_page() works fine from other
Bernd> function calls and that it is not 0x89, but already 0x0 that
Bernd> fails fixing it became easy.

Bernd> Nix, any chance you could verify it also works for you?

Do we get an appropriate error back when we try to issue WRITE SAME
10/16? If so, I'm OK with this fix.

And thanks for looking into this!

--
Martin K. Petersen Oracle Linux Engineering

2013-08-02 23:46:19

by Nick Alcock

[permalink] [raw]
Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

On 1 Aug 2013, Bernd Schubert told this:

> Once I noticed that scsi_get_vpd_page() works fine from other function
> calls and that it is not 0x89, but already 0x0 that fails fixing it became
> easy.
>
> Nix, any chance you could verify it also works for you?

Sorry for the delay: it's hard for me to verify this during the working
week.

I'll check it tomorrow -- after I've run a backup! :} (why yes, bugs of
this nature do frighten me a bit. I know it's superstition, but I'm
always wondering whether the SCSI controller will come back again
whenever that post-error bus reset happens.)

--
NULL && (void)

2013-08-03 11:17:49

by Nick Alcock

[permalink] [raw]
Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

On 1 Aug 2013, Bernd Schubert stated:

> Once I noticed that scsi_get_vpd_page() works fine from other function
> calls and that it is not 0x89, but already 0x0 that fails fixing it became
> easy.
>
> Nix, any chance you could verify it also works for you?

Confirmed, thank you!

> Somehow older areca firmware versions have issues with
> scsi_get_vpd_page() and a large buffer.

I wonder if they're using math modulo SD_BUF_SIZE-1 by mistake, so they
misinterpret this as zero? (Still, doing math modulo 511 seems very
odd, even if this firmware *does* only support 512-byte sectors.)
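
A quick check of the arithmetic, as a sketch only: with SD_BUF_SIZE at 512, 512 % 511 is 1 rather than 0, so a literal modulo-511 would not yield zero, whereas a 9-bit mask (512 & 511) would. Another way to arrive at zero, offered purely as an unconfirmed hypothesis, is a target that reads only byte 4 of the INQUIRY CDB as the allocation length (the SPC-2 layout), while SPC-3 spreads it over bytes 3-4. The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 6-byte INQUIRY CDB for a VPD page with
 * the allocation length in bytes 3-4, big-endian (SPC-3 layout).
 */
static void build_vpd_inquiry(uint8_t cdb[6], uint8_t page, uint16_t len)
{
	assert(cdb != NULL);
	cdb[0] = 0x12;			/* INQUIRY */
	cdb[1] = 0x01;			/* EVPD bit */
	cdb[2] = page;			/* VPD page code */
	cdb[3] = (uint8_t)(len >> 8);	/* allocation length, high byte */
	cdb[4] = (uint8_t)(len & 0xff);	/* allocation length, low byte */
	cdb[5] = 0x00;			/* control */
}

/* Length seen by a target that honours both bytes. */
static unsigned int alloc_len_full(const uint8_t cdb[6])
{
	return ((unsigned int)cdb[3] << 8) | cdb[4];
}

/* Length seen by a (hypothetical) target that reads byte 4 only. */
static unsigned int alloc_len_low_byte_only(const uint8_t cdb[6])
{
	return cdb[4];
}
```

For a 512-byte request the CDB carries 02 00 in bytes 3-4, so the byte-4-only reading is 0; for the 64-byte request used in the patch both readings agree on 64.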

--
NULL && (void)

2013-08-26 20:23:24

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

Martin,

sorry for my late reply, I entirely lost track of this (customer issues,
vacation, lots of main work, ...).

On 08/02/2013 05:00 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:
>
> Bernd,
>
> Bernd> Once I noticed that scsi_get_vpd_page() works fine from other
> Bernd> function calls and that it is not 0x89, but already 0x0 that
> Bernd> fails fixing it became easy.
>
> Bernd> Nix, any chance you could verify it also works for you?
>
> Do we get an appropriate error back when we try to issue WRITE SAME
> 10/16? If so, I'm OK with this fix.
>
> And thanks for looking into this!
>


Is testing with sg_write_same sufficient?

With F/W V1.49:

> (squeeze)fslab2:~# lsscsi | grep sda
> [2:0:0:0] disk ATA HDS724040KLSA80 KFAO /dev/sda

> (squeeze)fslab2:~# strace -f sg_write_same --10 -v --num=0 --lba=0 /dev/sda

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[10]=[41, 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=60000, flags=0, data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], status=02, masked_status=01, sb[18]=[70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, 00, 00, 20, 00, 00, 00, 00, 00], host_status=0, driver_status=0x8, resid=0, duration=0, info=0x1}) = 0
> write(2, "Write same: Fixed format, curre"..., 114Write same: Fixed format, current; Sense key: Illegal Request
> Additional sense: Invalid command operation code
> ) = 114
> write(2, "Write same(10) command not suppo"..., 37Write same(10) command not supported
> ) = 37


> (squeeze)fslab2:~# strace -f sg_write_same --16 -v --num=0 --lba=0 /dev/sda

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[16]=[93, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=60000, flags=0, data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], status=02, masked_status=01, sb[18]=[70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, 00, 00, 24, 00, 00, 00, 00, 00], host_status=0, driver_status=0x8, resid=0, duration=0, info=0x1}) = 0
> write(2, "Write same: Fixed format, curre"..., 104Write same: Fixed format, current; Sense key: Illegal Request
> Additional sense: Invalid field in cdb
> ) = 104
> write(2, "bad field in Write same(16) cdb,"..., 63bad field in Write same(16) cdb, option probably not supported
> ) = 63



Now with F/W V1.46

> (squeeze)fslab2:~# lsscsi | grep sdk
> [10:0:1:2] disk Hitachi HDS724040KLSA80 R001 /dev/sdk

> (squeeze)fslab2:~# cat /sys/class/scsi_host/host10/host_fw_model
> ARC-1260


> (squeeze)fslab2:~# strace -f sg_write_same --10 -v --num=0 --lba=0 /dev/sdk
> execve("/usr/bin/sg_write_same", ["sg_write_same", "--10", "-v", "--num=0", "--lba=0", "/dev/sdk"], [/* 26 vars */]) = 0

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[10]=[41, 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=60000, flags=0, data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], status=00, masked_status=00, sb[19]=[f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, 00, 00, 20, 00, 00, 00, 02, 00, 00], host_status=0, driver_status=0x8, resid=0, duration=0, info=0x1}) = 0
> write(2, "Write same: Fixed format, curre"..., 134Write same: Fixed format, current; Sense key: Illegal Request
> Additional sense: Invalid command operation code
> Info fld=0x0 [0]
> ) = 134
> write(2, "Write same(10) command not suppo"..., 37Write same(10) command not supported
> ) = 37

> (squeeze)fslab2:~# strace -f sg_write_same --16 -v --num=0 --lba=0 /dev/sdk
> execve("/usr/bin/sg_write_same", ["sg_write_same", "--16", "-v", "--num=0", "--lba=0", "/dev/sdk"], [/* 26 vars */]) = 0

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[16]=[93, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=60000, flags=0, data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], status=00, masked_status=00, sb[19]=[f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, 00, 00, 20, 00, 00, 00, 02, 00, 00], host_status=0, driver_status=0x8, resid=0, duration=0, info=0x1}) = 0
> write(2, "Write same: Fixed format, curre"..., 134Write same: Fixed format, current; Sense key: Illegal Request
> Additional sense: Invalid command operation code
> Info fld=0x0 [0]
> ) = 134
> write(2, "Write same(16) command not suppo"..., 37Write same(16) command not supported
> ) = 37
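
For reading the sb[] dumps above, here is a minimal sketch of a fixed-format sense decoder. The byte offsets follow the fixed sense data layout: response code in byte 0 (top bit is the "valid" flag), sense key in the low nibble of byte 2, ASC and ASCQ in bytes 12 and 13. The function and table names are made up for this illustration, and the lookup tables only cover the values seen in these traces.

```python
SENSE_KEY_NAMES = {0x5: "Illegal Request"}
ASC_NAMES = {
    (0x20, 0x00): "Invalid command operation code",
    (0x24, 0x00): "Invalid field in cdb",
}


def parse_dump(text):
    """Turn a strace-style '70, 00, 05, ...' dump into a list of ints."""
    return [int(tok.strip(), 16) for tok in text.split(",")]


def decode_fixed_sense(sb):
    """Decode the fields of interest from fixed-format sense data."""
    response_code = sb[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key, asc, ascq = sb[2] & 0x0F, sb[12], sb[13]
    return {
        "info_valid": bool(sb[0] & 0x80),
        "sense_key": SENSE_KEY_NAMES.get(key, hex(key)),
        "additional_sense": ASC_NAMES.get((asc, ascq), f"asc={asc:#04x} ascq={ascq:#04x}"),
    }


# Sense buffers copied from the four SG_IO traces above.
TRACES = {
    "V1.49 WRITE SAME(10)": "70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, 00, 00, 20, 00, 00, 00, 00, 00",
    "V1.49 WRITE SAME(16)": "70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, 00, 00, 24, 00, 00, 00, 00, 00",
    "V1.46 WRITE SAME(10)": "f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, 00, 00, 20, 00, 00, 00, 02, 00, 00",
    "V1.46 WRITE SAME(16)": "f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, 00, 00, 20, 00, 00, 00, 02, 00, 00",
}

for label, dump in TRACES.items():
    print(label, decode_fixed_sense(parse_dump(dump)))
```

Decoded this way, all four requests come back as Illegal Request. V1.49 reports ASC 0x20 for WRITE SAME(10) and ASC 0x24 for WRITE SAME(16), while V1.46 reports ASC 0x20 for both and sets the valid bit in byte 0.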


Is this sufficient, or do you need something else?


Thanks,
Bernd


2013-08-30 10:02:18

by Nix

Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

On 1 Aug 2013, Bernd Schubert said:

> Once I noticed that scsi_get_vpd_page() works fine from other function
> calls and that it is not 0x89, but already 0x0 that fails fixing it became
> easy.
>
> Nix, any chance you could verify it also works for you?

As an aside, this commit does indeed fix the bug I reported, but it
doesn't seem to have gone anywhere, not even into -stable.

Is it held up somehow?

(stable has

commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
Author: Martin K. Petersen <[email protected]>
Date: Tue Jul 30 22:58:34 2013 -0400

SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages is set

which IIRC was eventually found not to be necessary, because this fix
works fine instead?)

Possibly I'm misremembering the order of month-old events and Martin's
fix was eventually considered better... in which case, sorry for the noise.

--
NULL && (void)

2013-08-31 01:51:20

by Greg KH

Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

On Fri, Aug 30, 2013 at 11:01:56AM +0100, Nix wrote:
> On 1 Aug 2013, Bernd Schubert said:
>
> > Once I noticed that scsi_get_vpd_page() works fine from other function
> > calls and that it is not 0x89, but already 0x0 that fails fixing it became
> > easy.
> >
> > Nix, any chance you could verify it also works for you?
>
> As an aside, this commit does indeed fix the bug I reported, but it
> doesn't seem to have gone anywhere, not even into -stable.
>
> Is it held up somehow?
>
> (stable has
>
> commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
> Author: Martin K. Petersen <[email protected]>
> Date: Tue Jul 30 22:58:34 2013 -0400
>
> SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages is set
>
> which IIRC was eventually found not to be necessary, because this fix
> works fine instead?)
>
> Possibly I'm misremembering the order of month-old events and Martin's
> fix was eventually considered better... in which case, sorry for the noise.

Is that other patch even needed anymore, now that Martin's patch is in
the tree?

thanks,

greg k-h

2013-08-31 19:48:19

by Nix

Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

On 31 Aug 2013, Greg KH said:
> On Fri, Aug 30, 2013 at 11:01:56AM +0100, Nix wrote:
>> On 1 Aug 2013, Bernd Schubert said:
>>
>> > Once I noticed that scsi_get_vpd_page() works fine from other function
>> > calls and that it is not 0x89, but already 0x0 that fails fixing it became
>> > easy.
>> >
>> > Nix, any chance you could verify it also works for you?
>>
>> As an aside, this commit does indeed fix the bug I reported, but it
>> doesn't seem to have gone anywhere, not even into -stable.
>>
>> Is it held up somehow?
>>
>> (stable has
>>
>> commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
>> Author: Martin K. Petersen <[email protected]>
>> Date: Tue Jul 30 22:58:34 2013 -0400
>>
>> SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages is set
>>
>> which IIRC was eventually found not to be necessary, because this fix
>> works fine instead?)
>>
>> Possibly I'm misremembering the order of month-old events and Martin's
>> fix was eventually considered better... in which case, sorry for the noise.
>
> Is that other patch even needed anymore, now that Martin's patch is in
> the tree?

My understanding is that this patch is rather better, since Martin's
patch prevents sending of the extended INQUIRY command at all: this one
just uses a reduced buffer size, but can still issue the command. (But I
may be misunderstanding everything.)

--
NULL && (void)

2013-09-01 18:40:57

by Bernd Schubert

Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

On 08/31/2013 09:48 PM, Nix wrote:
> On 31 Aug 2013, Greg KH said:
>> On Fri, Aug 30, 2013 at 11:01:56AM +0100, Nix wrote:
>>> On 1 Aug 2013, Bernd Schubert said:
>>>
>>>> Once I noticed that scsi_get_vpd_page() works fine from other function
>>>> calls and that it is not 0x89, but already 0x0 that fails fixing it became
>>>> easy.
>>>>
>>>> Nix, any chance you could verify it also works for you?
>>>
>>> As an aside, this commit does indeed fix the bug I reported, but it
>>> doesn't seem to have gone anywhere, not even into -stable.
>>>
>>> Is it held up somehow?
>>>
>>> (stable has
>>>
>>> commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
>>> Author: Martin K. Petersen <[email protected]>
>>> Date: Tue Jul 30 22:58:34 2013 -0400
>>>
>>> SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages is set
>>>
>>> which IIRC was eventually found not to be necessary, because this fix
>>> works fine instead?)
>>>
>>> Possibly I'm misremembering the order of month-old events and Martin's
>>> fix was eventually considered better... in which case, sorry for the noise.
>>
>> Is that other patch even needed anymore, now that Martin's patch is in
>> the tree?
>
> My understanding is that this patch is rather better, since Martin's
> patch prevents sending of the extended INQUIRY command at all: this one
> just uses a reduced buffer size, but can still issue the command. (But I
> may be misunderstanding everything.)

Hmm, I wonder if 7562523e84ddc742fe1f9db8bd76b01acca89f6b (linus tree) /
0ac10bd036f0f3b8ce7ac2390446eab9531c72eb (stable-tree) always works. It
tests if sdev->skip_vpd_pages is set, but as far as I can see this only
gets set for Seagate drives via BLIST_SKIP_VPD_PAGES.
So if anything else than a Seagate drive is connected to an Areca
controller with older firmware it will still fail.
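
To make the concern concrete, here is a toy model of the behaviours discussed in this thread. Everything in it is illustrative: the names are invented, it is not the kernel code, ORIGINAL_LEN assumes the 512-byte buffer mentioned earlier, and REDUCED_LEN is a placeholder because the reduced size is not stated in this mail.

```python
from dataclasses import dataclass

ORIGINAL_LEN = 512  # assumed large buffer size from earlier in the thread
REDUCED_LEN = 64    # placeholder for "a reduced buffer size" (assumption)


@dataclass
class Device:
    name: str
    skip_vpd_pages: bool  # modelled after sdev->skip_vpd_pages


def vpd_request_len(dev, skip_check, reduced_buffer):
    """Return the VPD request length, or None if no VPD request is sent.

    skip_check     -- honour skip_vpd_pages (the commit already in the tree)
    reduced_buffer -- use the smaller buffer (the patch under discussion)
    """
    if skip_check and dev.skip_vpd_pages:
        return None
    return REDUCED_LEN if reduced_buffer else ORIGINAL_LEN


blacklisted = Device("blacklisted drive", skip_vpd_pages=True)
other = Device("any other drive", skip_vpd_pages=False)

# With only the skip check, a non-blacklisted drive still gets the large request.
print(vpd_request_len(blacklisted, skip_check=True, reduced_buffer=False))
print(vpd_request_len(other, skip_check=True, reduced_buffer=False))
# With the reduced buffer, every drive gets the small request.
print(vpd_request_len(other, skip_check=True, reduced_buffer=True))
```

In this model the skip check alone leaves the large request in place for every device that does not have the flag set, which is the case described above.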


Cheers,
Bernd

2013-09-20 22:51:49

by Martin K. Petersen

Subject: Re: [PATCH] scsi disk: Use its own buffer for the vpd request

>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:

[Sorry about the delay. Catching up on a couple of weeks' worth of email]

Bernd> So if anything else than a Seagate drive is connected to an Areca
Bernd> controller with older firmware it will still fail.

It's just blacklisting a specific Seagate drive. However, skip_vpd_pages
is also set by USB.

I'm still completely in favor of your patch to reduce the VPD buffer
size. Please resubmit and feel free to add my Acked-by:.

--
Martin K. Petersen Oracle Linux Engineering