2005-01-14 08:27:04

by Andrew Morton

Subject: 2.6.11-rc1-mm1


ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/


- Added bk-xfs to the -mm "external trees" lineup.

- Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
haven't yet taken as close a look at LTT as I should have. Probably neither
have you.

It needs a bit of work on the kernel<->user periphery, which is not a big
deal.

As does relayfs, IMO. It seems to need some regularised way in which a
userspace relayfs client can tell relayfs what file(s) to use. LTT is
currently using some ghastly stick-a-pathname-in-/proc thing. Relayfs
should provide this service.

relayfs needs a closer look too. A lot of advanced instrumentation
projects seem to require it, but none of them have been merged. Lots of
people say "use netlink instead" and lots of other people say "err, we think
relayfs is better". This is a discussion which needs to be had.

- The 2.6.10-mm3 announcement was munched by the vger filters, sorry. One of
the uml patches had an inopportune substring in its name (oh pee tee hyphen
oh you tee). Nice trick if you meant it ;)

- Big update to the ext3 extended attribute support. agruen, tridge and sct
have been cooking this up for a while. samba4 proved to be a good
stress test.

- davej's "2.6 post-Halloween features" document has been added to -mm as
Documentation/feature-list-2.6.txt in the hope that someone will review it
and help keep it up-to-date.

- Added FUSE (filesystem in userspace) for people to play with. Am agnostic
as to whether it should be merged (haven't read it at all closely yet,
either), but I am impressed by the amount of care which has obviously gone
into it. Opinions sought.




Changes since 2.6.10-mm3:


linus.patch
bk-alsa.patch
bk-arm.patch
bk-cifs.patch
bk-cpufreq.patch
bk-drm-via.patch
bk-i2c.patch
bk-ide-dev.patch
bk-input.patch
bk-dtor-input.patch
bk-kbuild.patch
bk-kconfig.patch
bk-netdev.patch
bk-ntfs.patch
bk-pci.patch
bk-usb.patch
bk-xfs.patch

Latest versions of everyone's bk trees.

-m32r-include-nodemaskh-for-build-fix.patch
-acpi_smp_processor_id-warning-fix.patch
-sn2-trivial-nodemaskh-include-fix.patch
-split-bprm_apply_creds-into-two-functions.patch
-merge-_vm_enough_memorys-into-a-common-helper.patch
-ppc64-fix-iommu-cleanup-regression.patch
-ppc64-rename-perf-counter-register-defines.patch
-dmi_iterate-fix.patch
-arch-i386-kernel-cpu-mtrr-too-many-bits-are-masked-off-from-cr4.patch
-pm-introduce-pm_message_t.patch
-mark-older-power-managment-as-deprecated.patch
-swsusp-device-power-management-fix.patch
-swsusp-properly-suspend-and-resume-all-devices.patch
-m32r-employ-new-kernel-api-abi.patch
-m68k-update-defconfigs-for-2610.patch
-mmc_wbsd-depends-on-isa.patch
-m68k-remove-nowhere-referenced-files.patch
-direct-write-vs-truncate-deadlock.patch
-random-whitespace-cleanups.patch
-random-remove-pool-resizing-sysctl.patch
-cciss-update-to-version-264.patch
-reiserfs-vs-8115-test-adjustment.patch
-export-get_sb_pseudo.patch
-proc_kcore-correct-double-accounting-of-elf_buflen.patch
-remove-intermezzo-maintainers-entry.patch
-3c59x-reload-eeprom-values-at-rmmod-for-needy-cards.patch
-3c59x-remove-eeprom_reset-for-3c905b.patch
-3c59x-add-eeprom_reset-for-3c900-boomerang.patch
-3c59x-pm-fix.patch
-3c59x-missing-pci_disable_device.patch
-3c59x-use-netdev_priv.patch
-3c59x-make-use-of-generic_mii_ioctl.patch
-3c59x-vortex-select-mii.patch
-3c59x-support-more-ethtool_ops.patch
-inux-269-fs-proc-basec-array-size.patch
-linux-269-fs-proc-proc_ttyc-avoid-array.patch
-optimize-prefetch-usage-in-list_for_each_xxx.patch
-signalc-convert-assertion-to-bug_on.patch
-right-severity-level-for-fatal-message.patch
-remove-unused-drivers-char-rio-cdprotoh.patch
-remove-unused-drivers-char-rsf16fmih.patch
-mtd-added-nec-upd29f064115-support.patch
-ide-cd-is-very-noisy.patch
-signedness-fix-in-deadline-ioschedc.patch
-cleanup-virtual-console-selectionc-interface.patch
-warn-about-cli-sti-co-uses-even-on-up.patch
-remove-umsdos-from-tree.patch
-kill-quota_v2c-printk-of-size_t-warning.patch
-silence-numerous-size_t-warnings-in-drivers-acpi-processor_idlec.patch
-make-irda-string-tables-conditional-on-config_irda_debug.patch
-fix-unresolved-mtd-symbols-in-scx200_docflashc.patch
-fix-module_param-type-mismatch-in-drivers-char-n_hdlcc.patch
-drivers-char-misc-cleanups.patch
-pktcdvd-make-two-functions-static.patch
-pktcdvd-grep-friendly-function-prototypes.patch
-pktcdvd-small-documentation-update.patch
-isofs-remove-useless-include.patch
-synaptics-remove-unused-struct-member-variable.patch
-kill-one-if-x-vfreex-usage.patch
-smbfs-make-some-functions-static.patch
-mips-fixed-build-error-about-nec-vr4100-series.patch
-efs-make-a-struct-static-fwd.patch
-fs-ext3-possible-cleanups.patch
-fs-ext2-xattrc-make-ext2_xattr_list-static.patch
-fs-hugetlbfs-inodec-make-4-functions-static.patch
-remove-nr_super-define.patch
-i2o-fix-init-exit-section-usage.patch
-use-modern-format-for-pci-apic-irq-transform-printks.patch
-coda-bounds-checking.patch
-coda-use-list_for_each_entry_safe.patch
-coda-make-global-code-static.patch
-coda-remove-unused-coda_mknod.patch
-coda-rename-coda_psdev-to-coda.patch
-remove-outdated-smbfs-changelog.patch
-update-geerts-address-in-credits.patch
-cputime-introduce-cputime.patch
-cputime-microsecond-based-cputime-for-s390.patch
-4level-swapoff-hang-fix.patch
-snd-intel8x0-ac97-quirk-entries-for-hp-xw6200-xw8000.patch
-igxb-build-fix.patch
-eepro-build-fix.patch
-3c515-warning-fix.patch
-ixgb-whitespace-fix.patch
-fix-expand_stack-smp-race.patch
-ppc-fix-idle-with-interrupts-disabled.patch
-ppc-remove-duplicate-define.patch
-ppc-include-missing-header.patch
-ppc64-move-hotplug-cpu-functions-to-smp_ops.patch
-ppc64-kprobes-breaks-bug-handling.patch
-ppc64-fix-numa-build.patch
-ppc64-enhance-oops-printing.patch
-ppc64-fix-xmon-longjmp-handling.patch
-ppc64-make-xmon-print-bug-warnings.patch
-ppc64-xtime-gettimeofday-can-get-out-of-sync.patch
-ppc64-pci-cleanup.patch
-ppc64-remove-flush_instruction_cache.patch
-ppc64-interrupt-code-cleanup.patch
-ppc64-fix-rtas_set_indicator9005.patch
-ppc64-make-numa-code-handle-unexpected-layouts.patch
-ppc64-semicolon-in-rtasdc.patch
-improved-wait_8254_wraparound.patch
-kprobes-dont-steal-interrupts-from-vm86.patch
-apic-lapic-hanging-problems-on-nforce2-system.patch
-x86_64-work-around-another-aperture-bios-bug-on-opteron.patch
-x86_64-hack-to-disable-clustered-mode-on-amd-systems.patch
-x86_64-updates-for-x86-64-boot-optionstxt.patch
-x86_64-update-defconfig.patch
-x86_64-remove-old-checksumc.patch
-x86_64-fix-sparse-warnings.patch
-x86_64-fix-some-gcc-4-warnings-in-arch-x86_64.patch
-i386-port-missing-cpuid-bits-from-x86-64-to-i386.patch
-i386-amd-dual-core-support-for-i386.patch
-i386-count-both-multi-cores-and-smp-siblings-in.patch
-i386-count-both-multi-cores-and-smp-siblings-in-fix.patch
-i386-export-phys_proc_id.patch
-x86_64-move-memset_io-out-of-line-to-avoid-warnings.patch
-x86_64-fix-ioremap-attribute-restoration-on-i386-and.patch
-x86_64-fix-tlb-reporting-on-k8.patch
-x86_64-change_page_attr-logic-fixes-from-andrea.patch
-x86_64-fix-mptables-printk.patch
-x86_64-add-new-key-syscalls.patch
-x86_64-remove-direct-mem_map-references.patch
-x86_64-remove-check-that-limited-max-number-of-io-apic.patch
-x86_64-prevent-gcc-from-generating-mmx-code-by-mistake.patch
-x86_64-dont-sync-apic-arbs-on-p4s.patch
-x86_64-cleanups-preparing-for-memory-hotplug.patch
-x86_64-remove-unused-prototypes.patch
-x86_64-fix-a-lot-of-broken-white-space-in.patch
-x86_64-fix-signal-fpu-leak-on-i386-and-x86-64.patch
-x86_64-disable-conforming-bit-on-user32_cs-segment.patch
-x86_64-notify-user-of-mce-events.patch
-uml-add-some-pudding.patch
-uml-use-va_end-wherever-va_args-are-used.patch
-uml-split-out-arch-specific-syscalls-from-generic-ones.patch
-uml-three-level-page-table-support.patch
-uml-x86-64-core-support.patch
-uml-x86-64-config-support.patch
-uml-factor-out-register-saving-and-restoring.patch
-uml-x86_64-ptrace-support.patch
-uml-separate-out-signal-reception.patch
-uml-make-a-common-misconfiguration-impossible.patch
-uml-separate-out-the-time-code.patch
-uml-x86-64-headers.patch
-uml-split-out-arch-link-address-definitions.patch
-uml-dont-use-__nr_waitpid-on-arches-which-dont-have-it.patch
-uml-use-va_copy.patch
-uml-code-tidying.patch
-uml-use-for_each_cpu.patch
-uml-2610-ptrace-updates.patch
-uml-add-the-new-syscalls.patch
-uml-64-bit-cleanups.patch
-uml-silence-some-message-from-the-console-driver.patch
-uml-add-a-missing-include.patch
-uml-sparse-annotations.patch
-uml-fix-sys_call_table-syntax.patch
-uml-fix-make-clean.patch
-uml-define-config_input-better.patch
-uml-fix-a-compile-warning.patch
-seclvl-add-missing-dependency.patch
-binfmt_elf-fix-return-error-codes-and-early-corrupt-binary-detection.patch
-fix-setattr-attr_size-locking-for-nfsd.patch
-pcmcia-new-ds-cs-interface.patch
-pcmcia-call-device-drivers-from-ds-not-from-cs.patch
-pcmcia-unify-bind_mtd-and-pcmcia_bind_mtd.patch
-pcmcia-unfiy-bind_device-and-pcmcia_bind_device.patch
-pcmcia-device-model-integration-can-only-be-submitted-under-gpl.patch
-pcmcia-add-pcmcia_devices.patch
-pcmcia-remove-socket_bind_t-use-pcmcia_devices-instead.patch
-pcmcia-remove-internal-module-use-count-use-module_refcount-instead.patch
-pcmcia-set-drivers-owner-field.patch
-pcmcia-move-pcmcia_unregister_client-to-ds.patch
-pcmcia-device-model-integration-can-only-be-submitted-under-gpl-part-2.patch
-pcmcia-use-kref-instead-of-native-atomic-counter.patch
-pcmcia-add-pcmcia_putget_socket.patch
-pcmcia-grab-a-reference-to-the-cs-socket-in-ds.patch
-pcmcia-get-a-reference-to-ds-socket-for-each-pcmcia_device.patch
-pcmcia-add-a-pointer-to-client-in-struct-pcmcia_device.patch
-pcmcia-use-pcmcia_device-in-send_event.patch
-pcmcia-use-pcmcia_device-to-mark-clients-as-stale.patch
-pcmcia-code-moving-in-ds.patch
-pcmcia-use-pcmcia_device-in-register_client.patch
-pcmcia-direct-ordered-unbind-of-devices.patch
-pcmcia-bug-on-dev_list-=-null.patch
-pcmcia-bug-if-clients-are-kept-too-long.patch
-pcmcia-move-struct-client_t-inside-struct-pcmcia_device.patch
-pcmcia-use-driver_find-in-ds.patch
-pcmcia-set_netdev-for-network-devices.patch
-pcmcia-set_netdev-for-wireless-network-devices.patch
-pcmcia-reduce-stack-usage-in-ds_ioctl-randy-dunlap.patch
-pcmcia-add-disable_clkrun-option.patch
-pcmcia-rename-pcmcia-devices.patch
-pcmcia-pd6729-e-mail-update.patch
-pcmcia-pd6729-cleanups.patch
-pcmcia-pd6729-isa_irq-handling.patch
-pcmcia-remove-obsolete-code.patch
-pcmcia-remove-pending_events.patch
-pcmcia-remove-client_attributes.patch
-pcmcia-remove-unneeded-parameter-from-rsrc_mgr.patch
-pcmcia-remove-dev_info-from-client.patch
-pcmcia-remove-mtd-and-bulkmem-replaced-by-pcmciamtd.patch
-pcmcia-per-socket-resource-database.patch
-pcmcia-validate_mem-only-for-non-statically-mapped-sockets.patch
-pcmcia-adjust_io_region-only-for-non-statically-mapped-sockets.patch
-pcmcia-find_io_region-only-for-non-statically-mapped-sockets.patch
-pcmcia-find_mem_region-only-for-non-statically-mapped-sockets.patch
-pcmcia-adjust_-and-release_resources-only-for-non-statically-mapped-sockets.patch
-pcmcia-move-resource-handling-code-only-for-non-statically-mapped-sockets-to-other-file.patch
-pcmcia-make-rsrc_nonstatic-an-independend-module.patch
-pcmcia-allocate-resource-database-per-socket.patch
-pcmcia-remove-typedef.patch
-pcmcia-grab-lock-in-resource_release.patch
-sched-make-preempt_bkl-depend-on-preempt-alone.patch
-use-mmiowb-in-qla1280c.patch
-bug-on-error-handlings-in-ext3-under-i-o-failure.patch
-bug-on-error-handlings-in-ext3-under-i-o-failure-fix.patch
-scsi-aic7xxx-kill-kernel-22-ifdefs.patch

Merged

+sparc64-nodemask-build-fix.patch

sparc64 compile fix

+selinux-fix-error-handling-code-for-policy-load.patch

SELinux fix

+generic-irq-code-missing-export-of-probe_irq_mask.patch

parisc fix

+infiniband-ipoib-use-correct-static-rate-in-ipoib.patch
+infiniband-mthca-trivial-formatting-fix.patch
+infiniband-mthca-support-rdma-atomic-attributes-in-qp-modify.patch
+infiniband-mthca-clean-up-allocation-mapping-of-hca-context-memory.patch
+infiniband-mthca-add-needed-rmb-in-event-queue-poll.patch
+infiniband-core-remove-debug-printk.patch
+infiniband-make-more-code-static.patch
+infiniband-core-set-byte_cnt-correctly-in-mad-completion.patch
+infiniband-core-add-qp-number-to-work-completion-struct.patch
+infiniband-core-add-node_type-and-phys_state-sysfs-attrs.patch
+infiniband-mthca-clean-up-computation-of-hca-memory-map.patch
+infiniband-core-fix-handling-of-0-hop-directed-route-mads.patch
+infiniband-core-add-more-parameters-to-process_mad.patch
+infiniband-core-add-qp_type-to-struct-ib_qp.patch
+infiniband-core-add-ib_find_cached_gid-function.patch
+infiniband-update-copyrights-for-new-year.patch
+infiniband-ipoib-move-structs-from-stack-to-device-private-struct.patch
+infiniband-core-rename-handle_outgoing_smp.patch

infiniband updates

+seagate-st3200822as-sata-disk-needs-to-be-in-sil_blacklist-as-well.patch

SATA blacklist entry

-agpgart-allow-multiple-backends-to-be-initialized-fix.patch
-agpgart-add-bridge-assignment-missed-in-agp_allocate_memory.patch

Folded into agpgart-allow-multiple-backends-to-be-initialized.patch

+agpgart-add-agp_find_bridge-function.patch
+agpgart-allow-drivers-to-allocate-memory-local-to.patch
-agp-x86_64-build-fix.patch

More work on the support-multiple-agp-busses patches.

+orphaned-pagecache-memleak-fix.patch

Fix a weird memory leak on the page LRU. This isn't right yet.

+mark-page-accessed-in-filemapc-not-quite-right.patch

Page aging fix

+netpoll-fix-napi-polling-race-on-smp.patch

netpoll oops fix

+tun-tan-arp-monitor-support.patch

Make the tun/tap driver play right with ARP monitoring.

+atmel_cs-add-support-lg-lw2100n-wlan-pcmcia-card.patch

Add firmware support for another wlan card.

+ppc32-fix-mpc8272ads.patch
+ppc32-add-freescale-pq2fads-support.patch

ppc32 updates

+ppc64-make-hvlpevent_unregisterhandler-work.patch
+ppc64-make-iseries_veth-call-flush_scheduled_work.patch
+ppc64-iommu-avoid-isa-io-space-on-power3.patch

ppc64 updates

+frv-remove-mandatory-single-step-debugging-diversion.patch
+frv-excess-whitespace-cleanup.patch

arch/frv updates

+x86_64-i386-increase-command-line-size.patch
+x86_64-add-brackets-to-bitops.patch
+x86_64-move-early-cpu-detection-earlier.patch
+x86_64-disable-uselib-when-possible.patch
+x86_64-optimize-nodemask-operations-slightly.patch
+x86_64-fix-a-bug-in-timer_suspend.patch
+x86-consolidate-code-segment-base-calculation.patch

x86_64 update

+swsusp-more-small-fixes.patch
+swsusp-dm-use-right-levels-for-device_suspend.patch
+swsusp-update-docs.patch
+acpi-comment-whitespace-updates.patch
+make-suspend-work-with-ioapic.patch
+swsusp-refrigerator-cleanups.patch

swsusp update

+uml-avoid-null-dereference-in-linec.patch
+uml-readd-config_magic_sysrq-for-uml.patch
+uml-commentary-addition-to-recent-sysemu-fix.patch
+uml-drop-unused-buffer_headh-header-from-hostfs.patch
+uml-delete-unused-header-umnh.patch
+uml-commentary-about-sigwinch-handling-for-consoles.patch
+uml-fail-xterm_open-when-we-have-no-display.patch
+uml-depend-on-usermode-in-drivers-block-kconfig-and-drop-arch-um-kconfig_block.patch
+uml-makefile-simplification-and-correction.patch
+uml-fix-compilation-for-missing-headers.patch
+uml-fix-some-uml-own-initcall-macros.patch
+uml-refuse-to-run-without-skas-if-no-tt-mode-in.patch
+uml-for-ubd-cmdline-param-use-colon-as-delimiter.patch
+uml-allow-free-ubd-flag-ordering.patch
+uml-move-code-from-ubd_user-to-ubd_kern.patch
+uml-fix-and-cleanup-code-in-ubd_kernc-coming-from-ubd_userc.patch
+uml-add-stack-content-to-dumps.patch
+uml-add-stack-addresses-to-dumps.patch
+uml-update-ld-scripts-to-newer-binutils.patch

UML update

+reintroduce-export_symboltask_nice-for-binfmt_elf32.patch

s/390 build fix

+csum_and_copy_from_user-gcc4-warning-fixes-m32r-fix.patch

m32r build fix

+fixups-for-block2mtd.patch

block2mtd update

+poll-mini-optimisations.patch

teeny poll() speedup

+file_tableexpand_files-code-cleanup.patch
+file_tableexpand_files-code-cleanup-remove-debug.patch

code consolidation

+mtrr-size-and-base-debug.patch

Debug an mtrr bug.

+minor-ext3-speedup.patch

Reduce ext3 CPU consumption a little.

+move-read-only-and-immutable-checks-into-permission.patch
+factor-out-common-code-around-follow_link-invocation.patch

Code cleanups/consolidation

+relayfs-doc.patch
+relayfs-common-files.patch
+relayfs-locking-lockless-implementation.patch
+relayfs-headers.patch

relayfs

+ltt-core-implementation.patch
+ltt-core-headers.patch
+ltt-kconfig-fix.patch
+ltt-kernel-events.patch
+ltt-kernel-events-tidy.patch
+ltt-kernel-events-build-fix.patch
+ltt-fs-events.patch
+ltt-fs-events-tidy.patch
+ltt-ipc-events.patch
+ltt-mm-events.patch
+ltt-net-events.patch
+ltt-architecture-events.patch

LTT.

+lock-initializer-cleanup-ppc.patch
+lock-initializer-cleanup-m32r.patch
+lock-initializer-cleanup-video.patch
+lock-initializer-cleanup-ide.patch
+lock-initializer-cleanup-sound.patch
+lock-initializer-cleanup-sh.patch
+lock-initializer-cleanup-ppc64.patch
+lock-initializer-cleanup-security.patch
+lock-initializer-cleanup-core.patch
+lock-initializer-cleanup-media-drivers.patch
+lock-initializer-cleanup-networking.patch
+lock-initializer-cleanup-block-devices.patch
+lock-initializer-cleanup-s390.patch
+lock-initializer-cleanup-usermode.patch
+lock-initializer-cleanup-scsi.patch
+lock-initializer-cleanup-sparc.patch
+lock-initializer-cleanup-v850.patch
+lock-initializer-cleanup-i386.patch
+lock-initializer-cleanup-drm.patch
+lock-initializer-cleanup-firewire.patch
+lock-initializer-cleanup-arm26.patch
+lock-initializer-cleanup-m68k.patch
+lock-initializer-cleanup-network-drivers.patch
+lock-initializer-cleanup-mtd.patch
+lock-initializer-cleanup-x86_64.patch
+lock-initializer-cleanup-filesystems.patch
+lock-initializer-cleanup-ia64.patch
+lock-initializer-cleanup-raid.patch
+lock-initializer-cleanup-isdn.patch
+lock-initializer-cleanup-parisc.patch
+lock-initializer-cleanup-sparc64.patch
+lock-initializer-cleanup-arm.patch
+lock-initializer-cleanup-misc-drivers.patch
+lock-initializer-cleanup-alpha.patch
+lock-initializer-cleanup-character-devices.patch
+lock-initializer-cleanup-drivers-serial.patch
+lock-initializer-cleanup-frv.patch

spinlock and rwlock initialiser cleanups

+ext3-ea-revert-cleanup.patch
+ext3-ea-revert-old-ea-in-inode.patch
+ext3-ea-mbcache-cleanup.patch
+ext2-ea-race-in-ext-xattr-sharing-code.patch
+ext3-ea-ext3-do-not-use-journal_release_buffer.patch
+ext3-ea-ext3-factor-our-common-xattr-code-unnecessary-lock.patch
+ext3-ea-ext-no-spare-xattr-handler-slots-needed.patch
+ext3-ea-cleanup-and-prepare-ext3-for-in-inode-xattrs.patch
+ext3-ea-hide-ext3_get_inode_loc-in_mem-option.patch
+ext3-ea-in-inode-extended-attributes-for-ext3.patch

Big ext3+EA update with various fixes

+fix-race-between-core-dumping-and-exec.patch
+fix-exec-deadlock-when-ptrace-used-inside-the-thread-group.patch
+ptrace-unlocked-access-to-last_siginfo-resending.patch
+clear-false-pending-signal-indication-in-core-dump.patch

Various ptrace/signal/coredump fixes

+pcmcia-remove-irq_type_time.patch
+pcmcia-ignore-driver-irq-mask.patch
+pcmcia-remove-irq_mask-and-irq_list-parameters-from-pcmcia-drivers.patch
+pcmcia-use-irq_mask-to-mark-irqs-as-unusable.patch
+pcmcia-remove-racy-try_irq.patch
+pcmcia-modify-irq_mask-via-sysfs.patch
+pcmcia-remove-includes-in-rsrc_mgr-which-arent-necessary-any-longer.patch

pcmcia updates.

+sched-fix-preemption-race-core-i386.patch
+sched-make-use-of-preempt_schedule_irq-ppc.patch
+sched-make-use-of-preempt_schedule_irq-arm.patch

CPU scheduler preemption fix

+fbdev-cleanup-broken-edid-fixup-code.patch
+fbcon-catch-blank-events-on-both-device-and-console-level.patch
+fbcon-fix-compile-error.patch
+fbdev-fbmon-cleanup.patch
+i810fb-module-param-fix.patch
+atyfb-fix-module-parameter-descriptions.patch
+radeonfb-fix-init-exit-section-usage.patch
+pxafb-reorder-add_wait_queue-and-set_current_state.patch
+sa1100fb-reorder-add_wait_queue-and-set_current_state.patch
+backlight-add-backlight-lcd-device-basic-support.patch
+fbdev-add-w100-framebuffer-driver.patch

fbdev/fbcon update

+post-halloween-doc.patch

davej's 2.6 feature list

+fuse-maintainers-kconfig-and-makefile-changes.patch
+fuse-core.patch
+fuse-device-functions.patch
+fuse-read-only-operations.patch
+fuse-read-write-operations.patch
+fuse-file-operations.patch
+fuse-mount-options.patch
+fuse-extended-attribute-operations.patch
+fuse-readpages-operation.patch
+fuse-nfs-export.patch
+fuse-direct-i-o.patch

Filesystem in userspace.

+ieee1394-adds-a-disable_irm-option-to-ieee1394ko.patch

New command line option for firewire.

+fix-typo-in-arch-i386-kconfig.patch

Fix a tpyo.

+random-whitespace-doh.patch
+random-entropy-debugging-improvements.patch
+random-run-time-configurable-debugging.patch
+random-periodicity-detection-fix.patch
+random-add_input_randomness.patch

random driver updates

+various-kconfig-fixes.patch

Fix a huge number of Kconfig typos and brainos.





number of patches in -mm: 434
number of changesets in external trees: 314
number of patches in -mm only: 417
total patches: 731
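
The four figures above appear to be related by simple arithmetic. The sketch below checks that reading. It assumes that "patches in -mm only" excludes the 17 external-tree rollups (linus.patch plus the sixteen bk-*.patch trees listed under "Changes since 2.6.10-mm3"), and that "total patches" is the -mm-only count plus the external changesets. The variable and function names are illustrative and not part of any -mm tooling.

```python
# Rollup patches for external trees, copied from the announcement.
EXTERNAL_TREES = [
    "linus.patch", "bk-alsa.patch", "bk-arm.patch", "bk-cifs.patch",
    "bk-cpufreq.patch", "bk-drm-via.patch", "bk-i2c.patch",
    "bk-ide-dev.patch", "bk-input.patch", "bk-dtor-input.patch",
    "bk-kbuild.patch", "bk-kconfig.patch", "bk-netdev.patch",
    "bk-ntfs.patch", "bk-pci.patch", "bk-usb.patch", "bk-xfs.patch",
]

# Figures copied from the summary.
PATCHES_IN_MM = 434
EXTERNAL_CHANGESETS = 314
MM_ONLY = 417
TOTAL = 731


def derive_counts():
    """Recompute the derived figures under the assumptions stated above."""
    # Assumption: each external tree is carried as exactly one rollup patch.
    mm_only = PATCHES_IN_MM - len(EXTERNAL_TREES)
    # Assumption: the total counts external changesets individually.
    total = mm_only + EXTERNAL_CHANGESETS
    return mm_only, total


if __name__ == "__main__":
    mm_only, total = derive_counts()
    print("mm-only matches:", mm_only == MM_ONLY)
    print("total matches:", total == TOTAL)
```

Under those assumptions both derived values agree with the figures quoted in the summary.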




All 434 patches:


linus.patch

sparc64-nodemask-build-fix.patch
sparc64: nodemask build fix

selinux-fix-error-handling-code-for-policy-load.patch
SELinux: fix error handling code for policy load

generic-irq-code-missing-export-of-probe_irq_mask.patch
generic irq code missing export of probe_irq_mask()

infiniband-ipoib-use-correct-static-rate-in-ipoib.patch
InfiniBand/IPoIB: use correct static rate in IPoIB

infiniband-mthca-trivial-formatting-fix.patch
InfiniBand/mthca: trivial formatting fix

infiniband-mthca-support-rdma-atomic-attributes-in-qp-modify.patch
InfiniBand/mthca: support RDMA/atomic attributes in QP modify

infiniband-mthca-clean-up-allocation-mapping-of-hca-context-memory.patch
InfiniBand/mthca: clean up allocation mapping of HCA context memory

infiniband-mthca-add-needed-rmb-in-event-queue-poll.patch
InfiniBand/mthca: add needed rmb() in event queue poll

infiniband-core-remove-debug-printk.patch
InfiniBand/core: remove debug printk

infiniband-make-more-code-static.patch
InfiniBand: make more code static

infiniband-core-set-byte_cnt-correctly-in-mad-completion.patch
InfiniBand/core: set byte_cnt correctly in MAD completion

infiniband-core-add-qp-number-to-work-completion-struct.patch
InfiniBand/core: add QP number to work completion struct

infiniband-core-add-node_type-and-phys_state-sysfs-attrs.patch
InfiniBand/core: add node_type and phys_state sysfs attrs

infiniband-mthca-clean-up-computation-of-hca-memory-map.patch
InfiniBand/mthca: clean up computation of HCA memory map

infiniband-core-fix-handling-of-0-hop-directed-route-mads.patch
InfiniBand/core: fix handling of 0-hop directed route MADs

infiniband-core-add-more-parameters-to-process_mad.patch
InfiniBand/core: add more parameters to process_mad

infiniband-core-add-qp_type-to-struct-ib_qp.patch
InfiniBand/core: add qp_type to struct ib_qp

infiniband-core-add-ib_find_cached_gid-function.patch
InfiniBand/core: add ib_find_cached_gid function

infiniband-update-copyrights-for-new-year.patch
InfiniBand: update copyrights for new year

infiniband-ipoib-move-structs-from-stack-to-device-private-struct.patch
InfiniBand/ipoib: move structs from stack to device private struct

infiniband-core-rename-handle_outgoing_smp.patch
InfiniBand/core: rename handle_outgoing_smp

ia64-acpi-build-fix.patch
ia64 acpi build fix

ia64-config_apci_numa-fix.patch
ia64 CONFIG_APCI_NUMA fix

bk-acpi-revert-20041210.patch
bk-acpi-revert-20041210

acpi-report-errors-in-fanc.patch
ACPI: report errors in fan.c

acpi-flush-tlb-when-pagetable-changed.patch
acpi: flush TLB when pagetable changed

acpi-kfree-fix.patch
a

bk-alsa.patch

bk-arm.patch

bk-cifs.patch

bk-cpufreq.patch

bk-drm-via.patch

bk-i2c.patch

bk-ide-dev.patch

ide-dev-build-fix.patch
ide-dev-build-fix

bk-input.patch

bk-dtor-input.patch

alps-touchpad-detection-fix.patch
ALPS touchpad detection fix

bk-kbuild.patch

bk-kconfig.patch

seagate-st3200822as-sata-disk-needs-to-be-in-sil_blacklist-as-well.patch
Seagate ST3200822AS SATA disk needs to be in sil_blacklist as well

bk-netdev.patch

bk-ntfs.patch

bk-pci.patch

bk-usb.patch

bk-xfs.patch

mm.patch
add -mmN to EXTRAVERSION

fix-smm-failures-on-e750x-systems.patch
fix SMM failures on E750x systems

agpgart-allow-multiple-backends-to-be-initialized.patch
agpgart: allow multiple backends to be initialized
agpgart-allow-multiple-backends-to-be-initialized fix
agpgart: add bridge assignment missed in agp_allocate_memory
x86_64 agp failure fix

agpgart-add-agp_find_bridge-function.patch
agpgart: add agp_find_bridge function

agpgart-allow-drivers-to-allocate-memory-local-to.patch
agpgart: allow drivers to allocate memory local to the bridge

drm-add-support-for-new-multiple-agp-bridge-agpgart-api.patch
drm: add support for new multiple agp bridge agpgart api

fb-add-support-for-new-multiple-agp-bridge-agpgart-api.patch
fb: add support for new multiple agp bridge agpgart api

agpgart-add-bridge-parameter-to-driver-functions.patch
agpgart: add bridge parameter to driver functions

vm-pageout-throttling.patch
vm: pageout throttling

make-tree_lock-an-rwlock.patch
make mapping->tree_lock an rwlock

orphaned-pagecache-memleak-fix.patch
orphaned pagecache memleak fix

mark-page-accessed-in-filemapc-not-quite-right.patch
mark-page-accessed in filemap.c not quite right

must-fix.patch
must fix lists update
must fix list update
mustfix update
must-fix update
mustfix lists

pcnet32-79c976-with-fiber-optic.patch
pcnet32: 79c976 with fiber optic fix

add-omap-support-to-smc91x-ethernet-driver.patch
Add OMAP support to smc91x Ethernet driver

restore-net-sched-iptc-after-iptables-kmod-cleanup.patch
Restore net/sched/ipt.c After iptables Kmod Cleanup

b44-bounce-buffer-fix.patch
b44 bounce buffering fix

netpoll-fix-napi-polling-race-on-smp.patch
netpoll: fix NAPI polling race on SMP

tun-tan-arp-monitor-support.patch
tun/tap ARP monitor support

atmel_cs-add-support-lg-lw2100n-wlan-pcmcia-card.patch
atmel_cs: Add support LG LW2100N WLAN PCMCIA card

ppc32-fix-mpc8272ads.patch
ppc32: Fix mpc8272ads

ppc32-add-freescale-pq2fads-support.patch
ppc32: Add Freescale PQ2FADS support

ppc64-make-hvlpevent_unregisterhandler-work.patch
ppc64: make HvLpEvent_unregisterHandler() work

ppc64-make-iseries_veth-call-flush_scheduled_work.patch
ppc64: make iseries_veth call flush_scheduled_work()

ppc64-iommu-avoid-isa-io-space-on-power3.patch
ppc64: iommu: avoid ISA io space on POWER3

ppc64-reloc_hide.patch

frv-remove-mandatory-single-step-debugging-diversion.patch
FRV: Remove mandatory single-step debugging diversion

frv-excess-whitespace-cleanup.patch
FRV: Excess whitespace cleanup

superhyway-bus-support.patch
SuperHyway bus support

x86_64-i386-increase-command-line-size.patch
x86_64/i386: increase command line size

x86_64-add-brackets-to-bitops.patch
x86_64: Add brackets to bitops

x86_64-move-early-cpu-detection-earlier.patch
x86_64: Move early CPU detection earlier

x86_64-disable-uselib-when-possible.patch
x86_64: Disable uselib when possible

x86_64-optimize-nodemask-operations-slightly.patch
x86_64: Optimize nodemask operations slightly

x86_64-fix-a-bug-in-timer_suspend.patch
Fix a bug in timer_suspend() on x86_64

x86-consolidate-code-segment-base-calculation.patch
x86: consolidate code segment base calculation

xen-vmm-4-add-ptep_establish_new-to-make-va-available.patch
Xen VMM #4: add ptep_establish_new to make va available

xen-vmm-4-return-code-for-arch_free_page.patch
Xen VMM #4: return code for arch_free_page

xen-vmm-4-return-code-for-arch_free_page-fix.patch
Get rid of arch_free_page() warning

xen-vmm-4-runtime-disable-of-vt-console.patch
Xen VMM #4: runtime disable of VT console

xen-vmm-4-has_arch_dev_mem.patch
Xen VMM #4: HAS_ARCH_DEV_MEM

xen-vmm-4-split-free_irq-into-teardown_irq.patch
Xen VMM #4: split free_irq into teardown_irq

swsusp-more-small-fixes.patch
swsusp: more small fixes

swsusp-dm-use-right-levels-for-device_suspend.patch
swsusp/dm: Use right levels for device_suspend()

swsusp-update-docs.patch
swsusp: update docs

acpi-comment-whitespace-updates.patch
acpi: comment/whitespace updates

make-suspend-work-with-ioapic.patch
make suspend work with ioapic

swsusp-refrigerator-cleanups.patch
swsusp: refrigerator cleanups

uml-avoid-null-dereference-in-linec.patch
uml: avoid NULL dereference in line.c

uml-readd-config_magic_sysrq-for-uml.patch
uml: readd CONFIG_MAGIC_SYSRQ for UML

uml-commentary-addition-to-recent-sysemu-fix.patch
uml: Commentary addition to recent SYSEMU fix.

uml-drop-unused-buffer_headh-header-from-hostfs.patch
uml: drop unused buffer_head.h header from hostfs

uml-delete-unused-header-umnh.patch
uml: delete unused header umn.h

uml-commentary-about-sigwinch-handling-for-consoles.patch
uml: commentary about SIGWINCH handling for consoles

uml-fail-xterm_open-when-we-have-no-display.patch
uml: fail xterm_open when we have no $DISPLAY

uml-depend-on-usermode-in-drivers-block-kconfig-and-drop-arch-um-kconfig_block.patch
uml: depend on !USERMODE in drivers/block/Kconfig and drop arch/um/Kconfig_block

uml-makefile-simplification-and-correction.patch
uml: Makefile simplification and correction.

uml-fix-compilation-for-missing-headers.patch
uml: fix compilation for missing headers

uml-fix-some-uml-own-initcall-macros.patch
uml: fix some UML own initcall macros

uml-refuse-to-run-without-skas-if-no-tt-mode-in.patch
uml: refuse to run without skas if no tt mode in

uml-for-ubd-cmdline-param-use-colon-as-delimiter.patch
uml: for ubd cmdline param use colon as delimiter

uml-allow-free-ubd-flag-ordering.patch
uml: allow free ubd flag ordering

uml-move-code-from-ubd_user-to-ubd_kern.patch
uml: move code from ubd_user to ubd_kern

uml-fix-and-cleanup-code-in-ubd_kernc-coming-from-ubd_userc.patch
uml: fix and cleanup code in ubd_kern.c coming from ubd_user.c

uml-add-stack-content-to-dumps.patch
uml: add stack content to dumps

uml-add-stack-addresses-to-dumps.patch
uml: add stack addresses to dumps

uml-update-ld-scripts-to-newer-binutils.patch
uml: update ld scripts to newer binutils

reintroduce-export_symboltask_nice-for-binfmt_elf32.patch
reintroduce task_nice export for binfmt_elf32

wacom-tablet-driver.patch
wacom tablet driver

force-feedback-support-for-uinput.patch
Force feedback support for uinput

kmap_atomic-takes-char.patch
kmap_atomic takes char*

kmap_atomic-takes-char-fix.patch
kmap_atomic-takes-char-fix

kmap_atomic-fallout.patch
kmap_atomic fallout

kunmap-fallout-more-fixes.patch
kunmap-fallout-more-fixes

make-sysrq-f-call-oom_kill.patch
make sysrq-F call oom_kill()

allow-admin-to-enable-only-some-of-the-magic-sysrq-functions.patch
Allow admin to enable only some of the Magic-Sysrq functions

sort-out-pci_rom_address_enable-vs-ioresource_rom_enable.patch
Sort out PCI_ROM_ADDRESS_ENABLE vs IORESOURCE_ROM_ENABLE

csum_and_copy_from_user-gcc4-warning-fixes.patch
csum_and_copy_from_user gcc4 warning fixes

csum_and_copy_from_user-gcc4-warning-fixes-m32r-fix.patch
csum_and_copy_from_user-gcc4-warning-fixes m32r fix

smbfs-fixes.patch
smbfs fixes

irqpoll.patch
irqpoll

fixups-for-block2mtd.patch
fixups for block2mtd

poll-mini-optimisations.patch
poll: mini optimisations

file_tableexpand_files-code-cleanup.patch
file_table:expand_files() code cleanup

file_tableexpand_files-code-cleanup-remove-debug.patch
file_tableexpand_files-code-cleanup-remove-debug

mtrr-size-and-base-debug.patch
mtrr size-and-base debugging

minor-ext3-speedup.patch
Minor ext3 speedup

move-read-only-and-immutable-checks-into-permission.patch
move read-only and immutable checks into permission()

factor-out-common-code-around-follow_link-invocation.patch
factor out common code around ->follow_link invocation

relayfs-doc.patch
relayfs: doc

relayfs-common-files.patch
relayfs: common files

relayfs-locking-lockless-implementation.patch
relayfs: locking/lockless implementation

relayfs-headers.patch
relayfs: headers

ltt-core-implementation.patch
ltt: core implementation

ltt-core-headers.patch
ltt: core headers

ltt-kconfig-fix.patch
ltt kconfig fix

ltt-kernel-events.patch
ltt: kernel/ events

ltt-kernel-events-tidy.patch
ltt-kernel-events tidy

ltt-kernel-events-build-fix.patch
ltt-kernel-events-build-fix

ltt-fs-events.patch
ltt: fs/ events

ltt-fs-events-tidy.patch
ltt-fs-events tidy

ltt-ipc-events.patch
ltt: ipc/ events

ltt-mm-events.patch
ltt: mm/ events

ltt-net-events.patch
ltt: net/ events

ltt-architecture-events.patch
ltt: architecture events

lock-initializer-cleanup-ppc.patch
Lock initializer cleanup: PPC

lock-initializer-cleanup-m32r.patch
Lock initializer cleanup: M32R

lock-initializer-cleanup-video.patch
Lock initializer cleanup: Video

lock-initializer-cleanup-ide.patch
Lock initializer cleanup: IDE

lock-initializer-cleanup-sound.patch
Lock initializer cleanup: sound

lock-initializer-cleanup-sh.patch
Lock initializer cleanup: SH

lock-initializer-cleanup-ppc64.patch
Lock initializer cleanup: PPC64

lock-initializer-cleanup-security.patch
Lock initializer cleanup: Security

lock-initializer-cleanup-core.patch
Lock initializer cleanup: Core

lock-initializer-cleanup-media-drivers.patch
Lock initializer cleanup: media drivers

lock-initializer-cleanup-networking.patch
Lock initializer cleanup: Networking

lock-initializer-cleanup-block-devices.patch
Lock initializer cleanup: Block devices

lock-initializer-cleanup-s390.patch
Lock initializer cleanup: S390

lock-initializer-cleanup-usermode.patch
Lock initializer cleanup: UserMode

lock-initializer-cleanup-scsi.patch
Lock initializer cleanup: SCSI

lock-initializer-cleanup-sparc.patch
Lock initializer cleanup: SPARC

lock-initializer-cleanup-v850.patch
Lock initializer cleanup: V850

lock-initializer-cleanup-i386.patch
Lock initializer cleanup: I386

lock-initializer-cleanup-drm.patch
Lock initializer cleanup: DRM

lock-initializer-cleanup-firewire.patch
Lock initializer cleanup: Firewire

lock-initializer-cleanup-arm26.patch
Lock initializer cleanup - (ARM26)

lock-initializer-cleanup-m68k.patch
Lock initializer cleanup: M68K

lock-initializer-cleanup-network-drivers.patch
Lock initializer cleanup: Network drivers

lock-initializer-cleanup-mtd.patch
Lock initializer cleanup: MTD

lock-initializer-cleanup-x86_64.patch
Lock initializer cleanup: X86_64

lock-initializer-cleanup-filesystems.patch
Lock initializer cleanup: Filesystems

lock-initializer-cleanup-ia64.patch
Lock initializer cleanup: IA64

lock-initializer-cleanup-raid.patch
Lock initializer cleanup: Raid

lock-initializer-cleanup-isdn.patch
Lock initializer cleanup: ISDN

lock-initializer-cleanup-parisc.patch
Lock initializer cleanup: PARISC

lock-initializer-cleanup-sparc64.patch
Lock initializer cleanup: SPARC64

lock-initializer-cleanup-arm.patch
Lock initializer cleanup: ARM

lock-initializer-cleanup-misc-drivers.patch
Lock initializer cleanup: Misc drivers

lock-initializer-cleanup-alpha.patch
Lock initializer cleanup - (ALPHA)

lock-initializer-cleanup-character-devices.patch
Lock initializer cleanup: character devices

lock-initializer-cleanup-drivers-serial.patch
Lock initializer cleanup: drivers/serial

lock-initializer-cleanup-frv.patch
Lock initializer cleanup: FRV

ext3-ea-revert-cleanup.patch
ext3-ea-revert-cleanup

ext3-ea-revert-old-ea-in-inode.patch
revert old ea-in-inode patch

ext3-ea-mbcache-cleanup.patch
ext3/EA: mbcache cleanup

ext2-ea-race-in-ext-xattr-sharing-code.patch
ext3/EA: Race in ext[23] xattr sharing code

ext3-ea-ext3-do-not-use-journal_release_buffer.patch
ext3/EA: Ext3: do not use journal_release_buffer

ext3-ea-ext3-factor-our-common-xattr-code-unnecessary-lock.patch
ext3/EA: Ext3: factor out common xattr code; unnecessary lock

ext3-ea-ext-no-spare-xattr-handler-slots-needed.patch
ext3/EA: Ext[23]: no spare xattr handler slots needed

ext3-ea-cleanup-and-prepare-ext3-for-in-inode-xattrs.patch
ext3/EA: Cleanup and prepare ext3 for in-inode xattrs

ext3-ea-hide-ext3_get_inode_loc-in_mem-option.patch
ext3/EA: Hide ext3_get_inode_loc in_mem option

ext3-ea-in-inode-extended-attributes-for-ext3.patch
ext3/EA: In-inode extended attributes for ext3

speedup-proc-pid-maps.patch
Speed up /proc/pid/maps

speedup-proc-pid-maps-fix.patch
Speed up /proc/pid/maps fix

speedup-proc-pid-maps-fix-fix.patch
speedup-proc-pid-maps fix fix

speedup-proc-pid-maps-fix-fix-fix.patch
speedup /proc/<pid>/maps(4th version)

inotify.patch
inotify

ioctl-rework-2.patch
ioctl rework #2

ioctl-rework-2-fix.patch
ioctl-rework-2 fix

make-standard-conversions-work-with-compat_ioctl.patch
make standard conversions work with compat_ioctl.

fget_light-fput_light-for-ioctls.patch
fget_light/fput_light for ioctls

macros-to-detect-existance-of-unlocked_ioctl-and-ioctl_compat.patch
macros to detect existence of unlocked_ioctl and ioctl_compat

fix-coredump_wait-deadlock-with-ptracer-tracee-on-shared-mm.patch
fix coredump_wait deadlock with ptracer & tracee on shared mm

fix-race-between-core-dumping-and-exec.patch
fix race between core dumping and exec with shared mm

fix-exec-deadlock-when-ptrace-used-inside-the-thread-group.patch
fix exec deadlock when ptrace used inside the thread group

ptrace-unlocked-access-to-last_siginfo-resending.patch
ptrace: unlocked access to last_siginfo (resending)

clear-false-pending-signal-indication-in-core-dump.patch
clear false pending signal indication in core dump

pcmcia-remove-irq_type_time.patch
pcmcia: remove IRQ_TYPE_TIME

pcmcia-ignore-driver-irq-mask.patch
pcmcia: ignore driver IRQ mask

pcmcia-remove-irq_mask-and-irq_list-parameters-from-pcmcia-drivers.patch
pcmcia: remove irq_mask and irq_list parameters from PCMCIA drivers

pcmcia-use-irq_mask-to-mark-irqs-as-unusable.patch
pcmcia: use irq_mask to mark IRQs as (un)usable

pcmcia-remove-racy-try_irq.patch
pcmcia: remove racy try_irq()

pcmcia-modify-irq_mask-via-sysfs.patch
pcmcia: modify irq_mask via sysfs

pcmcia-remove-includes-in-rsrc_mgr-which-arent-necessary-any-longer.patch
pcmcia: remove #includes in rsrc_mgr which aren't necessary any longer

kgdb-ga.patch
kgdb stub for ia32 (George Anzinger's one)
kgdb: warning fix
kgdb buffer overflow fix
kgdb: warning fix
kgdb: CONFIG_DEBUG_INFO fix
x86_64 fixes
correct kgdb.txt Documentation link (against 2.6.1-rc1-mm2)
kgdb: fix for recent gcc
kgdb warning fixes
THREAD_SIZE fixes for kgdb
Fix stack overflow test for non-8k stacks
kgdb-ga.patch fix for i386 single-step into sysenter
fix TRAP_BAD_SYSCALL_EXITS on i386
add TRAP_BAD_SYSCALL_EXITS config for i386
kgdb-is-incompatible-with-kprobes
kgdb-ga-build-fix
kgdb-ga-fixes

kgdb-kill-off-highmem_start_page.patch
kgdb: kill off highmem_start_page

kgdboe-netpoll.patch
kgdb-over-ethernet via netpoll
kgdboe: fix configuration of MAC address

kgdb-x86_64-support.patch
kgdb-x86_64-support.patch for 2.6.2-rc1-mm3
kgdb-x86_64-warning-fixes
kgdb-x86_64-fix
kgdb-x86_64-serial-fix
kprobes exception notifier fix

dev-mem-restriction-patch.patch
/dev/mem restriction patch

dev-mem-restriction-patch-allow-reads.patch
dev-mem-restriction-patch: allow reads

jbd-remove-livelock-avoidance.patch
JBD: remove livelock avoidance code in journal_dirty_data()

journal_add_journal_head-debug.patch
journal_add_journal_head-debug

list_del-debug.patch
list_del debug check

unplug-can-sleep.patch
unplug functions can sleep

firestream-warnings.patch
firestream warnings

perfctr-core.patch
perfctr: core
perfctr: remove bogus perfctr_sample_thread() calls

perfctr-i386.patch
perfctr: i386

perfctr-x86-core-updates.patch
perfctr x86 core updates

perfctr-x86-driver-updates.patch
perfctr x86 driver updates

perfctr-x86-driver-cleanup.patch
perfctr: x86 driver cleanup

perfctr-prescott-fix.patch
Prescott fix for perfctr

perfctr-x86-update-2.patch
perfctr x86 update 2

perfctr-x86_64.patch
perfctr: x86_64

perfctr-x86_64-core-updates.patch
perfctr x86_64 core updates

perfctr-ppc.patch
perfctr: PowerPC

perfctr-ppc32-driver-update.patch
perfctr: ppc32 driver update

perfctr-ppc32-mmcr0-handling-fixes.patch
perfctr ppc32 MMCR0 handling fixes

perfctr-ppc32-update.patch
perfctr ppc32 update

perfctr-ppc32-update-2.patch
perfctr ppc32 update

perfctr-virtualised-counters.patch
perfctr: virtualised counters

perfctr-remap_page_range-fix.patch

virtual-perfctr-illegal-sleep.patch
virtual perfctr illegal sleep

make-perfctr_virtual-default-in-kconfig-match-recommendation.patch
Make PERFCTR_VIRTUAL default in Kconfig match recommendation in help text

perfctr-ifdef-cleanup.patch
perfctr ifdef cleanup

perfctr-update-2-6-kconfig-related-updates.patch
perfctr: Kconfig-related updates

perfctr-virtual-updates.patch
perfctr virtual updates

perfctr-virtual-cleanup.patch
perfctr: virtual cleanup

perfctr-ppc32-preliminary-interrupt-support.patch
perfctr ppc32 preliminary interrupt support

perfctr-update-5-6-reduce-stack-usage.patch
perfctr: reduce stack usage

perfctr-interrupt-support-kconfig-fix.patch
perfctr interrupt_support Kconfig fix

perfctr-low-level-documentation.patch
perfctr low-level documentation

perfctr-inheritance-1-3-driver-updates.patch
perfctr inheritance: driver updates

perfctr-inheritance-2-3-kernel-updates.patch
perfctr inheritance: kernel updates

perfctr-inheritance-3-3-documentation-updates.patch
perfctr inheritance: documentation updates

perfctr-inheritance-locking-fix.patch
perfctr inheritance locking fix

perfctr-api-changes-first-step.patch
perfctr API changes: first step

perfctr-virtual-update.patch
perfctr virtual update

perfctr-x86-64-ia32-emulation-fix.patch
perfctr x86-64 ia32 emulation fix

perfctr-sysfs-update-1-4-core.patch
perfctr sysfs update: core

perfctr-sysfs-update.patch
Perfctr sysfs update

perfctr-sysfs-update-2-4-x86.patch
perfctr sysfs update: x86

perfctr-sysfs-update-3-4-x86-64.patch
perfctr sysfs update: x86-64
perfctr: syscall numbers in x86-64 ia32-emulation
perfctr x86_64 native syscall numbers fix

perfctr-sysfs-update-4-4-ppc32.patch
perfctr sysfs update: ppc32

sched-fix-preemption-race-core-i386.patch
sched: fix preemption race (Core/i386)

sched-make-use-of-preempt_schedule_irq-ppc.patch
sched: make use of preempt_schedule_irq() (PPC)

sched-make-use-of-preempt_schedule_irq-arm.patch
sched: make use of preempt_schedule_irq (ARM)

add-do_proc_doulonglongvec_minmax-to-sysctl-functions.patch
Add do_proc_doulonglongvec_minmax to sysctl functions
add-do_proc_doulonglongvec_minmax-to-sysctl-functions-fix
add-do_proc_doulonglongvec_minmax-to-sysctl-functions fix 2

add-sysctl-interface-to-sched_domain-parameters.patch
Add sysctl interface to sched_domain parameters

allow-modular-ide-pnp.patch
allow modular ide-pnp

allow-x86_64-to-reenable-interrupts-on-contention.patch
Allow x86_64 to reenable interrupts on contention

i386-cpu-hotplug-updated-for-mm.patch
i386 CPU hotplug updated for -mm

ppc64-fix-cpu-hotplug.patch
ppc64: fix hotplug cpu

serialize-access-to-ide-devices.patch
serialize access to ide devices

disable-atykb-warning.patch
disable atykb "too many keys pressed" warning

export-file_ra_state_init-again.patch
Export file_ra_state_init() again

cachefs-filesystem.patch
CacheFS filesystem

numa-policies-for-file-mappings-mpol_mf_move-cachefs.patch
numa-policies-for-file-mappings-mpol_mf_move for cachefs

cachefs-release-search-records-lest-they-return-to-haunt-us.patch
CacheFS: release search records lest they return to haunt us

fix-64-bit-problems-in-cachefs.patch
Fix 64-bit problems in cachefs

cachefs-fixed-typos-that-cause-wrong-pointer-to-be-kunmapped.patch
cachefs: fixed typos that cause wrong pointer to be kunmapped

cachefs-return-the-right-error-upon-invalid-mount.patch
CacheFS: return the right error upon invalid mount

fix-cachefs-barrier-handling-and-other-kernel-discrepancies.patch
Fix CacheFS barrier handling and other kernel discrepancies

remove-error-from-linux-cachefsh.patch
Remove #error from linux/cachefs.h

cachefs-warning-fix-2.patch
cachefs warning fix 2

cachefs-linkage-fix-2.patch
cachefs linkage fix

cachefs-build-fix.patch
cachefs build fix

cachefs-documentation.patch
CacheFS documentation

add-page-becoming-writable-notification.patch
Add page becoming writable notification

add-page-becoming-writable-notification-fix.patch
do_wp_page_mk_pte_writable() fix

add-page-becoming-writable-notification-build-fix.patch
add-page-becoming-writable-notification build fix

provide-a-filesystem-specific-syncable-page-bit.patch
Provide a filesystem-specific sync'able page bit

provide-a-filesystem-specific-syncable-page-bit-fix.patch
provide-a-filesystem-specific-syncable-page-bit-fix

provide-a-filesystem-specific-syncable-page-bit-fix-2.patch
provide-a-filesystem-specific-syncable-page-bit-fix-2

make-afs-use-cachefs.patch
Make AFS use CacheFS

afs-cachefs-dependency-fix.patch
afs-cachefs-dependency-fix

split-general-cache-manager-from-cachefs.patch
Split general cache manager from CacheFS

turn-cachefs-into-a-cache-backend.patch
Turn CacheFS into a cache backend

rework-the-cachefs-documentation-to-reflect-fs-cache-split.patch
Rework the CacheFS documentation to reflect FS-Cache split

update-afs-client-to-reflect-cachefs-split.patch
Update AFS client to reflect CacheFS split

assign_irq_vector-section-fix.patch
assign_irq_vector __init section fix

kexec-i8259-shutdowni386.patch
kexec: i8259-shutdown.i386

kexec-i8259-shutdown-x86_64.patch
kexec: x86_64 i8259 shutdown

kexec-apic-virtwire-on-shutdowni386patch.patch
kexec: apic-virtwire-on-shutdown.i386.patch

kexec-apic-virtwire-on-shutdownx86_64.patch
kexec: apic-virtwire-on-shutdown.x86_64

kexec-ioapic-virtwire-on-shutdowni386.patch
kexec: ioapic-virtwire-on-shutdown.i386

kexec-apic-virt-wire-fix.patch
kexec: apic-virt-wire fix

kexec-ioapic-virtwire-on-shutdownx86_64.patch
kexec: ioapic-virtwire-on-shutdown.x86_64

kexec-e820-64bit.patch
kexec: e820-64bit

kexec-kexec-generic.patch
kexec: kexec-generic

kexec-ide-spindown-fix.patch
kexec-ide-spindown-fix

kexec-ifdef-cleanup.patch
kexec ifdef cleanup

kexec-machine_shutdownx86_64.patch
kexec: machine_shutdown.x86_64

kexec-kexecx86_64.patch
kexec: kexec.x86_64

kexec-kexecx86_64-4level-fix.patch
kexec-kexecx86_64-4level-fix

kexec-kexecx86_64-4level-fix-unfix.patch
kexec-kexecx86_64-4level-fix unfix

kexec-machine_shutdowni386.patch
kexec: machine_shutdown.i386

kexec-kexeci386.patch
kexec: kexec.i386

kexec-use_mm.patch
kexec: use_mm

kexec-loading-kernel-from-non-default-offset.patch
kexec: loading kernel from non-default offset

kexec-loading-kernel-from-non-default-offset-fix.patch
kdump: fix bss compile error

kexec-enabling-co-existence-of-normal-kexec-kernel-and-panic-kernel.patch
kexec: enabling co-existence of normal kexec kernel and panic kernel

kexec-ppc-support.patch
kexec: ppc support

crashdump-documentation.patch
crashdump: documentation

crashdump-memory-preserving-reboot-using-kexec.patch
crashdump: memory preserving reboot using kexec

crashdump-memory-preserving-reboot-using-kexec-fix.patch
kdump: Fix for boot problems on SMP

kdump-config_discontigmem-fix.patch
kdump: CONFIG_DISCONTIGMEM fix

crashdump-routines-for-copying-dump-pages.patch
crashdump: routines for copying dump pages

crashdump-routines-for-copying-dump-pages-kmap-fiddle.patch
crashdump-routines-for-copying-dump-pages-kmap-fiddle

crashdump-kmap-build-fix.patch
crashdump kmap build fix

crashdump-register-snapshotting-before-kexec-boot.patch
crashdump: register snapshotting before kexec boot

crashdump-elf-format-dump-file-access.patch
crashdump: ELF format dump file access

crashdump-linear-raw-format-dump-file-access.patch
crashdump: linear/raw format dump file access

crashdump-minor-bug-fixes-to-kexec-crashdump-code.patch
crashdump: minor bug fixes to kexec crashdump code

crashdump-cleanups-to-the-kexec-based-crashdump-code.patch
crashdump: cleanups to the kexec based crashdump code

x86-rename-apic_mode_exint.patch
x86: rename APIC_MODE_EXINT

x86-local-apic-fix.patch
x86: local apic fix

new-bitmap-list-format-for-cpusets.patch
new bitmap list format (for cpusets)

cpusets-big-numa-cpu-and-memory-placement.patch
cpusets - big numa cpu and memory placement

cpusets-config_cpusets-depends-on-smp.patch
Cpusets: CONFIG_CPUSETS depends on SMP

cpusets-move-cpusets-above-embedded.patch
move CPUSETS above EMBEDDED

cpusets-fix-cpuset_get_dentry.patch
cpusets : fix cpuset_get_dentry()

cpusets-fix-race-in-cpuset_add_file.patch
cpusets: fix race in cpuset_add_file()

cpusets-remove-more-casts.patch
cpusets: remove more casts

cpusets-make-config_cpusets-the-default-in-sn2_defconfig.patch
cpusets: make CONFIG_CPUSETS the default in sn2_defconfig

cpusets-document-proc-status-allowed-fields.patch
cpusets: document proc status allowed fields

cpusets-dont-export-proc_cpuset_operations.patch
Cpusets - Dont export proc_cpuset_operations

cpusets-display-allowed-masks-in-proc-status.patch
cpusets: display allowed masks in proc status

cpusets-simplify-cpus_allowed-setting-in-attach.patch
cpusets: simplify cpus_allowed setting in attach

cpusets-remove-useless-validation-check.patch
cpusets: remove useless validation check

cpusets-tasks-file-simplify-format-fixes.patch
Cpusets tasks file: simplify format, fixes

cpusets-simplify-memory-generation.patch
Cpusets: simplify memory generation

cpusets-interoperate-with-hotplug-online-maps.patch
cpusets: interoperate with hotplug online maps

cpusets-alternative-fix-for-possible-race-in.patch
cpusets: alternative fix for possible race in cpuset_tasks_read()

cpusets-remove-casts.patch
cpusets: remove void* typecasts

reiser4-sb_sync_inodes.patch
reiser4: vfs: add super_operations.sync_inodes()

reiser4-allow-drop_inode-implementation.patch
reiser4: export vfs inode.c symbols

reiser4-truncate_inode_pages_range.patch
reiser4: vfs: add truncate_inode_pages_range()

reiser4-export-remove_from_page_cache.patch
reiser4: export pagecache add/remove functions to modules

reiser4-export-page_cache_readahead.patch
reiser4: export page_cache_readahead to modules

reiser4-reget-page-mapping.patch
reiser4: vfs: re-check page->mapping after calling try_to_release_page()

reiser4-rcu-barrier.patch
reiser4: add rcu_barrier() synchronization point

reiser4-export-inode_lock.patch
reiser4: export inode_lock to modules

reiser4-export-pagevec-funcs.patch
reiser4: export pagevec functions to modules

reiser4-export-radix_tree_preload.patch
reiser4: export radix_tree_preload() to modules

reiser4-export-find_get_pages.patch

reiser4-radix-tree-tag.patch
reiser4: add new radix tree tag

reiser4-radix_tree_lookup_slot.patch
reiser4: add radix_tree_lookup_slot()

reiser4-perthread-pages.patch
reiser4: per-thread page pools

reiser4-include-reiser4.patch
reiser4: add to build system

reiser4-doc.patch
reiser4: documentation

reiser4-only.patch
reiser4: main fs

reiser4-recover-read-performance.patch
reiser4: recover read performance

reiser4-export-find_get_pages_tag.patch
reiser4-export-find_get_pages_tag

reiser4-add-missing-context.patch

add-acpi-based-floppy-controller-enumeration.patch
Add ACPI-based floppy controller enumeration.

possible-dcache-bug-debugging-patch.patch
Possible dcache BUG: debugging patch

serial-add-support-for-non-standard-xtals-to-16c950-driver.patch
serial: add support for non-standard XTALs to 16c950 driver

add-support-for-possio-gcc-aka-pcmcia-siemens-mc45.patch
Add support for Possio GCC AKA PCMCIA Siemens MC45

mpsc-driver-patch.patch
serial: MPSC driver

generic-serial-cli-conversion.patch
generic-serial cli() conversion

specialix-io8-cli-conversion.patch
Specialix/IO8 cli() conversion

sx-cli-conversion.patch
SX cli() conversion

revert-allow-oem-written-modules-to-make-calls-to-ia64-oem-sal-functions.patch
revert "allow OEM written modules to make calls to ia64 OEM SAL functions"

md-add-interface-for-userspace-monitoring-of-events.patch
md: add interface for userspace monitoring of events.

make-acpi_bus_register_driver-consistent-with-pci_register_driver-again.patch
make acpi_bus_register_driver() consistent with pci_register_driver()

remove-lock_section-from-x86_64-spin_lock-asm.patch
remove LOCK_SECTION from x86_64 spin_lock asm

kfree_skb-dump_stack.patch
kfree_skb-dump_stack

cancel_rearming_delayed_work.patch
cancel_rearming_delayed_work()
make cancel_rearming_delayed_workqueue static

ipvs-deadlock-fix.patch
ipvs deadlock fix

minimal-ide-disk-updates.patch
Minimal ide-disk updates

use-find_trylock_page-in-free_swap_and_cache-instead-of-hand-coding.patch
use find_trylock_page in free_swap_and_cache instead of hand coding

fbdev-cleanup-broken-edid-fixup-code.patch
fbdev: Cleanup broken edid fixup code

fbcon-catch-blank-events-on-both-device-and-console-level.patch
fbcon: Catch blank events on both device and console level

fbcon-fix-compile-error.patch
fbcon: Fix compile error

fbdev-fbmon-cleanup.patch
fbdev: Fbmon cleanup

i810fb-module-param-fix.patch
i810fb: Module param fix

atyfb-fix-module-parameter-descriptions.patch
atyfb: Fix module parameter descriptions

radeonfb-fix-init-exit-section-usage.patch
radeonfb: Fix init/exit section usage

pxafb-reorder-add_wait_queue-and-set_current_state.patch
pxafb: Reorder add_wait_queue() and set_current_state()

sa1100fb-reorder-add_wait_queue-and-set_current_state.patch
sa1100fb: Reorder add_wait_queue() and set_current_state()

backlight-add-backlight-lcd-device-basic-support.patch
backlight: Add Backlight/LCD device basic support

fbdev-add-w100-framebuffer-driver.patch
fbdev: Add w100 framebuffer driver

raid5-overlapping-read-hack.patch
raid5 overlapping read hack

figure-out-who-is-inserting-bogus-modules.patch
Figure out who is inserting bogus modules

detect-atomic-counter-underflows.patch
detect atomic counter underflows

waiting-10s-before-mounting-root-filesystem.patch
retry mounting the root filesystem at boot time

post-halloween-doc.patch
post halloween doc

periodically-scan-redzone-entries-and-slab-control-structures.patch
periodically scan redzone entries and slab control structures

fuse-maintainers-kconfig-and-makefile-changes.patch
Subject: [PATCH 1/11] FUSE - MAINTAINERS, Kconfig and Makefile changes

fuse-core.patch
Subject: [PATCH 2/11] FUSE - core

fuse-device-functions.patch
Subject: [PATCH 3/11] FUSE - device functions

fuse-read-only-operations.patch
Subject: [PATCH 4/11] FUSE - read-only operations

fuse-read-write-operations.patch
Subject: [PATCH 5/11] FUSE - read-write operations

fuse-file-operations.patch
Subject: [PATCH 6/11] FUSE - file operations

fuse-mount-options.patch
Subject: [PATCH 7/11] FUSE - mount options

fuse-extended-attribute-operations.patch
Subject: [PATCH 8/11] FUSE - extended attribute operations

fuse-readpages-operation.patch
Subject: [PATCH 9/11] FUSE - readpages operation

fuse-nfs-export.patch
Subject: [PATCH 10/11] FUSE - NFS export

fuse-direct-i-o.patch
Subject: [PATCH 11/11] FUSE - direct I/O

ieee1394-adds-a-disable_irm-option-to-ieee1394ko.patch
ieee1394: add a disable_irm option to ieee1394.ko

fix-typo-in-arch-i386-kconfig.patch
Fix typo in arch/i386/Kconfig

random-whitespace-doh.patch
random: whitespace doh

random-entropy-debugging-improvements.patch
random: entropy debugging improvements

random-run-time-configurable-debugging.patch
random: run-time configurable debugging

random-periodicity-detection-fix.patch
random: periodicity detection fix

random-add_input_randomness.patch
random: add_input_randomness

various-kconfig-fixes.patch
various Kconfig fixes




2005-01-14 08:47:25

by Andi Kleen

Subject: Re: 2.6.11-rc1-mm1

Andrew Morton writes:
>
> - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> haven't yet taken as close a look at LTT as I should have. Probably neither
> have you.

I think it would be better to have a standard set of kprobes instead
of all the ugly LTT hooks. kprobes could then log to relayfs or another
fast logging mechanism.

The advantage of this would be that it has no impact on fast paths
unless enabled (LTT slows down a kernel quite considerably just
by being compiled in).

> As does relayfs, IMO. It seems to need some regularised way in which a
> userspace relayfs client can tell relayfs what file(s) to use. LTT is
> currently using some ghastly stick-a-pathname-in-/proc thing. Relayfs
> should provide this service.
>
> relayfs needs a closer look too. A lot of advanced instrumentation
> projects seem to require it, but none of them have been merged. Lots of
> people say "use netlink instead" and lots of other people say "err, we think
> relayfs is better". This is a discussion which needs to be had.

imho relayfs and netlink are for completely different problem spaces.
relayfs is for relaying a lot of data quickly (e.g. for kernel
instrumentation). There it fills a niche that printk doesn't fill
(since it's too slow). netlink is quite slow (allocates data for each
event, does lots of other gunk), but a useful extensible format
for low frequency events.

For the problems that relayfs solves netlink is totally unusable
due to low efficiency (you could as well use printk, but that is
also too slow). I think a low overhead logging mechanism is very
much needed, because I find myself reinventing it quite often
when I need to debug some timing sensitive problem. Trying to
tackle these with printk is hopeless because it changes timing too much.

The problem relayfs has IMHO is that it is too complicated. It
seems to suffer from either an overfull specification or second system
effect. There are lots of different options to do everything,
instead of a nice simple fast path that does one thing efficiently.
IMHO before merging it should go through a diet, keeping only
the paths that are actually needed and dropping a lot of the current
baggage.

Preferably that would be only the fastest options (an extremely simple
per CPU buffer with an inlined fast path that drops data on buffer overflow),
leaving out anything more complicated. My ideal is something
like the old SGI ktrace, which was an extremely simple mechanism
to do lockless per CPU logging of binary data efficiently and
to read that from a user daemon.
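
As a rough illustration of the kind of logger described above, here is a
minimal userspace sketch in C, not kernel code and not taken from relayfs or
ktrace. The names (cpu_log, log_write, log_drain) and the sizes NCPUS and
BUF_SIZE are hypothetical. It models one buffer per CPU, a write path that is
a single bounds check plus a copy, and records that are dropped and counted
when the buffer is full instead of blocking the writer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCPUS    4   /* hypothetical CPU count for this model */
#define BUF_SIZE 64  /* hypothetical per-CPU buffer size in bytes */

struct cpu_log {
	unsigned char data[BUF_SIZE];
	size_t used;            /* bytes currently stored */
	unsigned long dropped;  /* records discarded because the buffer was full */
};

static struct cpu_log logs[NCPUS];

/*
 * Fast path: one bounds check and one copy, no lock, never blocks.
 * Assumes cpu < NCPUS and that only the owning CPU writes its buffer
 * (in a kernel that would mean running with preemption disabled).
 * Returns 1 if the record was stored, 0 if it was dropped.
 */
static inline int log_write(unsigned int cpu, const void *rec, size_t len)
{
	struct cpu_log *l = &logs[cpu];

	assert(cpu < NCPUS);
	if (len > BUF_SIZE - l->used) {
		l->dropped++;   /* overflow: lose the record, keep the timing */
		return 0;
	}
	memcpy(l->data + l->used, rec, len);
	l->used += len;
	return 1;
}

/* Reader side (the "user daemon"): copy out up to outlen bytes and free them. */
static size_t log_drain(unsigned int cpu, void *out, size_t outlen)
{
	struct cpu_log *l = &logs[cpu];
	size_t n = l->used < outlen ? l->used : outlen;

	memcpy(out, l->data, n);
	memmove(l->data, l->data + n, l->used - n);
	l->used -= n;
	return n;
}
```

The dropped counter is the piece that matters for debugging: a reader can at
least tell that data was lost, and how much, which is the usual signal to make
the buffers bigger.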

-Andi

2005-01-14 09:20:46

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Andi Kleen wrote:
> I think it would be better to have a standard set of kprobes instead
> of all the ugly LTT hooks. kprobes could then log to relayfs or another
> fast logging mechanism.
>
> The advantage of this would be that it has no impact on fast paths
> unless enabled (LTT slows down a kernel quite considerably just
> by being compiled in).

There are different ways to look at this. For one thing, the current
ltt hooks aren't as fast as they should be (i.e. we check whether
the tracing is enabled for a certain event way too far in the code-path.)
This should be rather simple to fix, whether by checking for the
event's logging as early as possible or by using one of the hooking
frameworks that generate noops which cost nothing until tracing is
enabled. None of this is really difficult. What is difficult is trying
to maintain the LTT patches outside the kernel while trying to add all
the bells-and-whistles that make such a thing lightweight and effective.

As far as kprobes go, you still need to have some form or another
of marking the code for key events, unless you keep maintaining a set
of kprobes-able points separately, which really makes it unusable for
the rest of us, as the users of LTT have discovered over time (having
to create a new patch for every new kernel that comes out.) Yet I do
see the point of being able to add the stuff dynamically.

So lately I've been thinking that there may be a middle-ground here
where everyone could be happy. Define three states for the hooks:
disabled, static, marker. The third one just adds some info into
System.map for allowing the automation of the insertion of kprobes
hooks (though you would still need the debugging info to find the
values of the variables that you want to log.) Hence, you get to
choose which type of poison you prefer. For my part, I think the
noop/early-check should be sufficient to get better performance from
the existing hook-set.
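
To make the three hook states concrete, here is a small C sketch under stated
assumptions. It is not LTT code: TRACE_HOOK, HOOK_MODE, trace_enabled_mask and
trace_log_slow are hypothetical names. In the "static" state the hook is an
early bitmask test in front of the slow logging call; in the "disabled" state
it compiles to nothing. The "marker" state also compiles to nothing here,
since recording the site's location for later kprobes insertion would need
toolchain support that this sketch does not attempt.

```c
#include <assert.h>

/* The three hook states from the text; chosen at build time in this sketch. */
#define HOOK_DISABLED 0
#define HOOK_STATIC   1
#define HOOK_MARKER   2

#ifndef HOOK_MODE
#define HOOK_MODE HOOK_STATIC
#endif

enum trace_event { EV_SYSCALL, EV_IRQ, EV_SCHED, EV_MAX };

static unsigned long trace_enabled_mask;  /* one bit per event type */
static unsigned long trace_hits[EV_MAX];  /* stands in for real logging */

/* Slow path: only reached when the event is enabled. */
static void trace_log_slow(enum trace_event ev, long arg)
{
	(void)arg;  /* a real tracer would format and store the argument */
	trace_hits[ev]++;
}

#if HOOK_MODE == HOOK_STATIC
/* Early check: a single test on the fast path before any call is made. */
#define TRACE_HOOK(ev, arg)                                     \
	do {                                                    \
		if (trace_enabled_mask & (1UL << (ev)))         \
			trace_log_slow((ev), (arg));            \
	} while (0)
#else
/*
 * HOOK_DISABLED and HOOK_MARKER generate no code in this sketch.
 * A marker build would additionally record where the hook sits.
 */
#define TRACE_HOOK(ev, arg) do { } while (0)
#endif
```

With the default HOOK_STATIC build, a hook for an event whose bit is clear
costs one load and one branch, which is the "early check" option discussed
above.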

> imho relayfs and netlink are for completely different problem spaces.
> relayfs is for relaying a lot of data quickly (e.g. for kernel
> instrumentation). There it fills a niche that printk doesn't fill
> (since it's too slow). netlink is quite slow (allocates data for each
> event, does lots of other gunk), but a useful extensible format
> for low frequency events.
>
> For the problems that relayfs solves netlink is totally unusable
> due to low efficiency (you could as well use printk, but that is
> also too slow). I think a low overhead logging mechanism is very
> much needed, because I find myself reinventing it quite often
> when I need to debug some timing sensitive problem. Trying to
> tackle these with printk is hopeless because it changes timing too much.

This is a very positive review, thanks.

> The problem relayfs has IMHO is that it is too complicated. It
> seems to suffer from either an overfull specification or second system
> effect. There are lots of different options to do everything,
> instead of a nice simple fast path that does one thing efficiently.
> IMHO before merging it should go through a diet, keeping only
> the paths that are actually needed and dropping a lot of the current
> baggage.
>
> Preferably that would be only the fastest options (an extremely simple
> per CPU buffer with an inlined fast path that drops data on buffer overflow),
> leaving out anything more complicated. My ideal is something
> like the old SGI ktrace, which was an extremely simple mechanism
> to do lockless per CPU logging of binary data efficiently and
> to read that from a user daemon.

Certainly we are more than willing to accommodate any reasonable
changes. Some of the "overfeatures" you've noticed actually stem
from our trying to implement a number of things over relayfs. For
example, we've ported printk over to relayfs and have been able
to obtain lossless printk by implementing dynamically resizable
buffers. That doesn't mean there isn't room for improvement. If
there are any specific changes you think are required, we'd be
glad to take a look at them.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || 1-866-677-4546

2005-01-14 10:27:49

by Nikita Danilov

Subject: Re: 2.6.11-rc1-mm1

Andi Kleen writes:

[...]

>
> Preferably that would be only the fastest options (an extremely simple
> per CPU buffer with an inlined fast path that drops data on buffer overflow),

Logging mechanism that loses data is worse than useless. It's only too
often that one spends a lot of time trying to reproduce some condition
with logging on, only to find out that nothing was logged.

[...]

>
> -Andi

Nikita.

2005-01-14 10:38:39

by Andi Kleen

Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 01:27:27PM +0300, Nikita Danilov wrote:
> Andi Kleen writes:
>
> [...]
>
> >
> > Preferably that would be only the fastest options (an extremely simple
> > per CPU buffer with an inlined fast path that drops data on buffer overflow),
>
> Logging mechanism that loses data is worse than useless. It's only too
> often that one spends a lot of time trying to reproduce some condition
> with logging on, only to find out that nothing was logged.

When you have a timing bug and your logger starts to block randomly
you also won't debug anything. Fix is to make your buffers bigger.

-Andi

2005-01-14 10:59:05

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Andi Kleen wrote:
> When you have a timing bug and your logger starts to block randomly
> you also won't debug anything. Fix is to make your buffers bigger.

relayfs allows you to choose which is best for you.

From Documentation/filesystems/relayfs.txt:
...
int relay_open(channel_path, bufsize, nbufs, channel_flags,
               channel_callbacks, start_reserve, end_reserve,
               rchan_start_reserve, resize_min, resize_max, mode,
               init_buf, init_buf_size)
...
- resize_min - if set, this signifies that the channel is
  auto-resizeable. The value specifies the size that the channel will
  try to maintain as a normal working size, and that it won't go
  below. The client makes use of the resizing callbacks and
  relay_realloc_buffer() and relay_replace_buffer() to actually effect
  the resize.

- resize_max - if set, this signifies that the channel is
  auto-resizeable. The value specifies the maximum size the channel
  can have as a result of resizing.
...

LTT uses fixed-sized channels, but the implementation of printk-
over-relayfs used resize_min and resize_max to allow automatic
sizing (grep for relay_open):
http://www.opersys.com/ftp/pub/relayfs/patch-printk-on-relayfs-2.6.0-test1
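
How a bounded auto-resize could behave is sketched below. The thread does not describe relayfs's actual resizing policy, so the thresholds (grow above 3/4 full, shrink below 1/4 full) and the name `next_buf_size` are assumptions made purely for illustration. Only the clamping to resize_min and resize_max follows the documentation quoted above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical policy: double the buffer when it is nearly full, halve it
 * when it is mostly empty, and clamp the result to [resize_min, resize_max].
 */
static size_t next_buf_size(size_t cur, size_t used,
			    size_t resize_min, size_t resize_max)
{
	size_t next = cur;

	assert(resize_min > 0 && resize_min <= resize_max);

	if (used > cur - cur / 4)	/* more than 3/4 full: grow */
		next = cur * 2;
	else if (used < cur / 4)	/* less than 1/4 full: shrink */
		next = cur / 2;

	if (next < resize_min)
		next = resize_min;
	if (next > resize_max)
		next = resize_max;
	return next;
}
```

With resize_min == resize_max this degenerates to a fixed-size channel, which is how LTT is said to use it.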

... now I'm going to get some sleep ... I'll catch up later with
further discussion ...

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-14 12:36:43

by Miklos Szeredi

Subject: Re: 2.6.11-rc1-mm1

> - Added FUSE (filesystem in userspace) for people to play with. Am agnostic
> as to whether it should be merged (haven't read it at all closely yet,
> either), but I am impressed by the amount of care which has obviously gone
> into it. Opinions sought.

Great, thanks Andrew!

Miklos

2005-01-14 13:04:29

by Kasper Sandberg

Subject: Re: 2.6.11-rc1-mm1

On Fri, 2005-01-14 at 00:23 -0800, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
>
>
> - Added bk-xfs to the -mm "external trees" lineup.
>
> - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> haven't yet taken as close a look at LTT as I should have. Probably neither
> have you.
>
> It needs a bit of work on the kernel<->user periphery, which is not a big
> deal.
>
> As does relayfs, IMO. It seems to need some regularised way in which a
> userspace relayfs client can tell relayfs what file(s) to use. LTT is
> currently using some ghastly stick-a-pathname-in-/proc thing. Relayfs
> should provide this service.
>
> relayfs needs a closer look too. A lot of advanced instrumentation
> projects seem to require it, but none of them have been merged. Lots of
> people say "use netlink instead" and lots of other people say "err, we think
> relayfs is better". This is a discussion which needs to be had.
>
> - The 2.6.10-mm3 announcement was munched by the vger filters, sorry. One of
> the uml patches had an inopportune substring in its name (oh pee tee hyphen
> oh you tee). Nice trick if you meant it ;)
>
> - Big update to the ext3 extended attribute support. agruen, tridge and sct
> have been cooking this up for a while. samba4 proved to be a good
> stress test.
>
> - davej's "2.6 post-Halloween features" document has been added to -mm as
> Documentation/feature-list-2.6.txt in the hope that someone will review it
> and help keep it up-to-date.
>
> - Added FUSE (filesystem in userspace) for people to play with. Am agnostic
> as to whether it should be merged (haven't read it at all closely yet,
> either), but I am impressed by the amount of care which has obviously gone
> into it. Opinions sought.

I really believe FUSE is a good thing to have merged. I use it, and it
works really well. My vote is to get it in.

<snip>

2005-01-14 15:07:22

by Barry K. Nathan

Subject: Re: 2.6.11-rc1-mm1

This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
the computer off, on some computers. Here's the RH Bugzilla report where
most of the discussion took place:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761

In the Fedora kernels it turned out to be due to kexec. I'll see if I
can narrow it down further.

-Barry K. Nathan <[email protected]>

2005-01-14 15:25:06

by Roman Zippel

Subject: Re: 2.6.11-rc1-mm1

Hi,

On Fri, 14 Jan 2005, Andi Kleen wrote:

> > - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> > haven't yet taken as close a look at LTT as I should have. Probably neither
> > have you.
>
> I think it would be better to have a standard set of kprobes instead
> of all the ugly LTT hooks. kprobes could then log to relayfs or another
> fast logging mechanism.

kprobes is not portable.

> The problem relayfs has IMHO is that it is too complicated. It
> seems to either suffer from an overfull specification or second system
> effect. There are lots of different options to do everything,
> instead of a nice simple fast path that does one thing efficiently.

I have to agree with this. relayfs should resemble a very simple pipe,
maybe making it possible to write the data directly to disk.
LTT has the same problem. It still does way too much at event time; it
should just pump the data to disk and postprocess it later. I think it's
better to implement multiple traces in user space via a daemon, which
synchronizes multiple users.

> IMHO before merging it should go through a diet and only keep
> the paths that are actually needed, dropping a lot of the current
> baggage.

While I agree this is needed, I don't think it's a reason against merging.
It should just be made clear that the API is not stable and will change.

bye, Roman

2005-01-14 15:31:31

by Roman Zippel

Subject: Re: 2.6.11-rc1-mm1

Hi,

On Fri, 14 Jan 2005, Karim Yaghmour wrote:

> Andi Kleen wrote:
> > When you have a timing bug and your logger starts to block randomly,
> > you won't debug anything either. The fix is to make your buffers bigger.
>
> relayfs allows you to choose which is best for you.
>
> From Documentation/filesystems/relayfs.txt:
> ...
> int relay_open(channel_path, bufsize, nbufs, channel_flags,
> channel_callbacks, start_reserve, end_reserve,
> rchan_start_reserve, resize_min, resize_max, mode,
> init_buf, init_buf_size)

You don't think that's a little overkill?
BTW it should return a pointer, not an id. With an id, the channel needs
to be looked up at every further access, killing the effects of any
lockless mechanism.
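
The difference between the two styles can be sketched as follows. None of this is the relayfs API: `channel_table`, `write_by_id`, `open_channel` and `write_by_ptr` are hypothetical names. In a real kernel the id lookup would presumably also need locking or reference counting, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHANNELS 4

struct channel {
	int in_use;
	unsigned long events;
};

static struct channel channel_table[MAX_CHANNELS];

/* Id-based style: every write pays for a table lookup and validity check. */
static int write_by_id(int id)
{
	if (id < 0 || id >= MAX_CHANNELS || !channel_table[id].in_use)
		return -1;
	channel_table[id].events++;
	return 0;
}

/* Pointer-based style: open returns the object itself. */
static struct channel *open_channel(void)
{
	for (int i = 0; i < MAX_CHANNELS; i++) {
		if (!channel_table[i].in_use) {
			channel_table[i].in_use = 1;
			channel_table[i].events = 0;
			return &channel_table[i];
		}
	}
	return NULL;	/* table full */
}

/* Writes through the pointer touch the channel directly, with no lookup. */
static void write_by_ptr(struct channel *ch)
{
	assert(ch != NULL);
	ch->events++;
}
```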

bye, Roman

2005-01-14 15:35:32

by Zwane Mwaikambo

Subject: Re: 2.6.11-rc1-mm1

On Fri, 14 Jan 2005, Andrew Morton wrote:

> - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> haven't yet taken as close a look at LTT as I should have. Probably neither
> have you.

Just a few things from a quick look:

- What's with all the ltt_*_bit? Please use the ones provided by the
kernel.

- I see cpu_has_tsc, can't you use sched_clock?

- ltt_log_event isn't preempt safe

- num_cpus isn't hotplug cpu safe, and you should be using the
for_each_online_cpu iterators

- Code style: you have large hunks of code with blocks of the following
form. You can save processor cycles by placing an if (incoming_process)
branch earlier. This code is in _ltt_log_event, which I presume executes
frequently.

	if (event_id == LTT_EV_SCHEDCHANGE)
		incoming_process = (struct task_struct *) (((ltt_schedchange *) event_struct)->in);

	if ((trace->tracing_gid == 1) && (current->egid != trace->traced_gid)) {
		if (incoming_process == NULL)
			return 0;
		else if (incoming_process->egid != trace->traced_gid)
			return 0;
	}
	... [ more of the same ]
	if ((trace->tracing_uid == 1) && (current->euid != trace->traced_uid)) {
		if (incoming_process == NULL)
			return 0;
		else if (incoming_process->euid != trace->traced_uid)
			return 0;
	}
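
One way the suggested restructuring might look, as a user-space sketch: `incoming` is tested once instead of inside every filter block. The struct layouts and the name `event_passes_filter` are invented for the example, and only the gid and uid checks shown above are covered; the elided "more of the same" blocks are not reconstructed.

```c
#include <assert.h>
#include <stddef.h>

struct task_ids {
	unsigned int egid, euid;
};

struct trace_filter {
	int tracing_gid, tracing_uid;
	unsigned int traced_gid, traced_uid;
};

/* Returns 1 if the event should be logged, 0 if it is filtered out. */
static int event_passes_filter(const struct trace_filter *f,
			       const struct task_ids *cur,
			       const struct task_ids *incoming)
{
	int gid_ok, uid_ok;

	assert(f != NULL && cur != NULL);

	gid_ok = f->tracing_gid != 1 || cur->egid == f->traced_gid;
	uid_ok = f->tracing_uid != 1 || cur->euid == f->traced_uid;

	/* Single test of incoming: without it only the current task counts. */
	if (incoming != NULL) {
		gid_ok = gid_ok || incoming->egid == f->traced_gid;
		uid_ok = uid_ok || incoming->euid == f->traced_uid;
	}
	return gid_ok && uid_ok;
}
```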

2005-01-14 16:56:33

by Dave Jones

Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 07:07:14AM -0800, Barry K. Nathan wrote:
> This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> the computer off, on some computers. Here's the RH Bugzilla report where
> most of the discussion took place:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761
>
> In the Fedora kernels it turned out to be due to kexec. I'll see if I
> can narrow it down further.

For *some* users. It still affects others.
My Compaq Evo showed the bug with 2.6.9 vanilla, went away with 2.6.10
vanilla.

Dave

2005-01-14 17:35:47

by Adrian Bunk

Subject: [patch] 2.6.11-rc1-mm1: ip_tables.c: ipt_find_target must be EXPORT_SYMBOL'ed

On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>...
> All 434 patches:
>...
> restore-net-sched-iptc-after-iptables-kmod-cleanup.patch
> Restore net/sched/ipt.c After iptables Kmod Cleanup
>...

This causes the following error with CONFIG_NET_ACT_IPT=m:

<-- snip -->

if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.6.11-rc1-mm1; fi
WARNING: /lib/modules/2.6.11-rc1-mm1/kernel/net/sched/ipt.ko needs unknown symbol ipt_find_target

<-- snip -->


The fix is simple:


Signed-off-by: Adrian Bunk <[email protected]>

--- linux-2.6.11-rc1-mm1-modular/net/ipv4/netfilter/ip_tables.c.old	2005-01-14 18:03:18.000000000 +0100
+++ linux-2.6.11-rc1-mm1-modular/net/ipv4/netfilter/ip_tables.c	2005-01-14 18:04:17.000000000 +0100
@@ -488,6 +488,7 @@
 		return NULL;
 	return target;
 }
+EXPORT_SYMBOL(ipt_find_target);
 
 static int match_revfn(const char *name, u8 revision, int *bestp)
 {

2005-01-14 17:43:28

by Patrick McHardy

Subject: Re: [patch] 2.6.11-rc1-mm1: ip_tables.c: ipt_find_target must be EXPORT_SYMBOL'ed

Adrian Bunk wrote:

>On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>
>>...
>>All 434 patches:
>>...
>>restore-net-sched-iptc-after-iptables-kmod-cleanup.patch
>> Restore net/sched/ipt.c After iptables Kmod Cleanup
>>...
>>
>
>This causes the following error with CONFIG_NET_ACT_IPT=m:
>
><-- snip -->
>
>if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.6.11-rc1-mm1; fi
>WARNING: /lib/modules/2.6.11-rc1-mm1/kernel/net/sched/ipt.ko needs unknown symbol ipt_find_target
>
><-- snip -->
>
The fix is already in Dave's tree.

Regards
Patrick

2005-01-14 17:55:36

by Barry K. Nathan

Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 11:56:12AM -0500, Dave Jones wrote:
> For *some* users. It still affects others.
> My Compaq Evo showed the bug with 2.6.9 vanilla, went away with 2.6.10
> vanilla.

Ok, I didn't know that.

Anyway, I've dug a bit deeper into my particular case, and there's now
some more information here:
http://bugme.osdl.org/show_bug.cgi?id=4041

-Barry K. Nathan <[email protected]>

2005-01-14 18:35:58

by Andrew Morton

Subject: Re: 2.6.11-rc1-mm1

Kasper Sandberg <[email protected]> wrote:
>
> I really believe FUSE is a good thing to have merged. I use it, and it
> works really well.

What filesystem(s) do you use, and why?

2005-01-14 19:08:09

by Bill Davidsen

Subject: Re: 2.6.11-rc1-mm1

Kasper Sandberg wrote:
> On Fri, 2005-01-14 at 00:23 -0800, Andrew Morton wrote:
>
>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
>>
>>
>>- Added bk-xfs to the -mm "external trees" lineup.
>>
>>- Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
>> haven't yet taken as close a look at LTT as I should have. Probably neither
>> have you.
>>
>> It needs a bit of work on the kernel<->user periphery, which is not a big
>> deal.
>>
>> As does relayfs, IMO. It seems to need some regularised way in which a
>> userspace relayfs client can tell relayfs what file(s) to use. LTT is
>> currently using some ghastly stick-a-pathname-in-/proc thing. Relayfs
>> should provide this service.
>>
>> relayfs needs a closer look too. A lot of advanced instrumentation
>> projects seem to require it, but none of them have been merged. Lots of
>> people say "use netlink instead" and lots of other people say "err, we think
>> relayfs is better". This is a discussion which needs to be had.
>>
>>- The 2.6.10-mm3 announcement was munched by the vger filters, sorry. One of
>> the uml patches had an inopportune substring in its name (oh pee tee hyphen
>> oh you tee). Nice trick if you meant it ;)
>>
>>- Big update to the ext3 extended attribute support. agruen, tridge and sct
>> have been cooking this up for a while. samba4 proved to be a good
>> stress test.
>>
>>- davej's "2.6 post-Halloween features" document has been added to -mm as
>> Documentation/feature-list-2.6.txt in the hope that someone will review it
>> and help keep it up-to-date.
>>
>>- Added FUSE (filesystem in userspace) for people to play with. Am agnostic
>> as to whether it should be merged (haven't read it at all closely yet,
>> either), but I am impressed by the amount of care which has obviously gone
>> into it. Opinions sought.
>
>
> I really believe FUSE is a good thing to have merged. I use it, and it
> works really well. My vote is to get it in.

I like the idea, but I also like the practice of letting a feature like
this sit in -mm for a few weeks or even a month until people have a
chance to break^H^H^H^H^Htest it a bit.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-01-14 19:13:32

by Rogério Brito

Subject: Re: 2.6.11-rc1-mm1

On Jan 14 2005, Andrew Morton wrote:
> Kasper Sandberg <[email protected]> wrote:
> > I really believe FUSE is a good thing to have merged. I use it, and it
> > works really well.
>
> What filesystem(s) do you use, and why?

I'm not the person you asked, but I will answer
anyway.

I have never used a -mm kernel tree before, but seeing that fuse got
included made me download the patch to try it.

I'll be using gmailfs (which needs fuse) just to see how things work with
Debian's testing (sarge) userland.


Hope this is another data point of interest, Rogério.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogério Brito - [email protected] - http://www.ime.usp.br/~rbrito
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

2005-01-14 19:41:23

by Peter Buckingham

Subject: Re: 2.6.11-rc1-mm1

Andrew Morton wrote:
> Kasper Sandberg <[email protected]> wrote:
>
>>I really believe FUSE is a good thing to have merged. I use it, and it
>> works really well.
>
>
> What filesystem(s) do you use, and why?

we're currently prototyping a lightweight network filesystem proxy using
fuse.

peter

2005-01-14 21:08:23

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> You don't think that's a little overkill?

I can see why you'd say this as a first impression, but really it isn't.

Here's a simple primer to this call's parameters:
channel_path, mode:
	Where does this appear in relayfs and what rights do
	user-space apps have over it (rwx).
bufsize, nbufs:
	Usually things have to be subdivided in sub-buffers to make
	both writing and reading simple. LTT uses this to allow,
	among other things, random trace access.
channel_flags, channel_callbacks:
	General channel management (should we write over unread data,
	is data delivered in bulk or in units, what granularity of
	timestamping is required, who should we call to initialize/
	finalize the content of a sub-buffer.) All of these are used
	by LTT, for example, in a number of ways.
start_reserve, end_reserve, rchan_start_reserve:
	Some subsystems, like LTT, need to be able to write some key
	data at sub-buffer boundaries. This is to specify how much
	space is required for said data.
resize_min, resize_max:
	Allow for dynamic resizing of buffer.
init_buf, init_buf_size:
	Is there an initial buffer containing some data that should
	be used to initialize the channel's content. If you're doing
	init-time tracing, for example, you need to have a pre-allocated
	static buffer that is copied to relayfs once relayfs is mounted.

As you can see, most of this is already used in one way or another by
LTT. The only thing LTT doesn't use is the dynamic resizing, but as was
said earlier in this thread, some people actually want to have this.
If it really came to it, we could drop this and resubmit when somebody
actually requests this, but my understanding is that the previous poster
did indeed indicate his need for this.

> BTW it should return a pointer, not an id. With an id, the channel needs
> to be looked up at every further access, killing the effects of any
> lockless mechanism.

Sounds reasonable. We will review this.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-14 21:57:10

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Zwane Mwaikambo wrote:
> Just a few things from a quick look:

Thanks for the feedback. I've added your suggestions to my to-do list.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-14 22:41:33

by Tim Bird

Subject: Re: 2.6.11-rc1-mm1

Andrew Morton wrote:
> - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> haven't yet taken as close a look at LTT as I should have. Probably neither
> have you.
>
> It needs a bit of work on the kernel<->user periphery, which is not a big
> deal.
>
> As does relayfs, IMO. It seems to need some regularised way in which a
> userspace relayfs client can tell relayfs what file(s) to use. LTT is
> currently using some ghastly stick-a-pathname-in-/proc thing. Relayfs
> should provide this service.
>
> relayfs needs a closer look too. A lot of advanced instrumentation
> projects seem to require it, but none of them have been merged. Lots of
> people say "use netlink instead" and lots of other people say "err, we think
> relayfs is better". This is a discussion which needs to be had.

Thanks very much. I know lots of embedded folks who will be happy to
see this discussion take place. (As an aside, I'll try to encourage
some of our more shy members to speak up and participate in the
discussion as well. I know Hitachi has been doing some work on
tracing, and I'd hate to see duplicate effort.)

BTW - I agree with most of the relayfs comments. It seems like overkill
for the kernel developer doing a "casual", ad-hoc trace. I'll try to
work with Karim on the suggested improvements.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-01-14 22:48:40

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

On Fri, 2005-01-14 at 00:23 -0800, Andrew Morton wrote:
> - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> haven't yet taken as close a look at LTT as I should have. Probably neither
> have you.

I have. Maybe you should have. I really don't see a good argument to
include this code.

The "non-locking" claim is nice, but a do { } while loop in the slot
reservation for every event including a do { } while loop in the slow
path is just a replacement of locking without actually using a lock. I
don't care whether this is likely or unlikely to happen, it's just bogus
to add a non constant time path for debugging/tracing purposes.
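
For readers unfamiliar with the pattern being criticised, a lockless slot reservation typically looks something like the sketch below. This is a generic C11 compare-and-swap loop and is not taken from the relayfs source; `slot_offset`, `reserve_slot` and the 128-byte buffer are assumptions. The loop retries whenever another writer got in first, so its iteration count has no constant bound.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLOT_BUF_SIZE 128	/* hypothetical buffer size, in bytes */

static _Atomic size_t slot_offset;	/* next free byte in the buffer */

/*
 * Reserve len bytes. Returns the start offset, or (size_t)-1 if the
 * buffer is full. The compare-and-swap fails and the loop repeats each
 * time another writer moved slot_offset in the meantime.
 */
static size_t reserve_slot(size_t len)
{
	size_t old = atomic_load(&slot_offset);
	size_t next;

	assert(len > 0);
	do {
		if (len > SLOT_BUF_SIZE - old)
			return (size_t)-1;
		next = old + len;
	} while (!atomic_compare_exchange_weak(&slot_offset, &old, next));

	return old;
}
```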

Default timestamp measuring with do_gettimeofday is also contrary to the
non locking argument. There is
a) a lock in there
b) it might loop because it's a sequential lock.

If you have no TSC, you can at least do jiffies + event-number based
tracing, which is not so fine-grained but still gives you the timeline
of the events.
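
A minimal sketch of such a coarse timestamp, assuming a jiffies-like tick counter and a per-buffer sequence number. The names `coarse_stamp`, `make_stamp` and `stamp_cmp` are hypothetical, and tick wrap-around is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

struct coarse_stamp {
	unsigned long tick;	/* coarse clock, e.g. a jiffies-like counter */
	unsigned long seq;	/* per-buffer event number, never reused */
};

static unsigned long stamp_seq;	/* hypothetical per-CPU counter */

/* Taking a stamp is two loads and an increment: no lock, no loop. */
static struct coarse_stamp make_stamp(unsigned long tick)
{
	struct coarse_stamp s = { tick, stamp_seq++ };
	return s;
}

/* Orders events from the same buffer: <0, 0 or >0, like strcmp(). */
static int stamp_cmp(const struct coarse_stamp *a, const struct coarse_stamp *b)
{
	assert(a != NULL && b != NULL);
	if (a->tick != b->tick)
		return a->tick < b->tick ? -1 : 1;
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	return 0;
}
```

Events within one tick cannot be placed in time any more precisely, but their relative order is preserved.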

There is also no need to do time diff calculations / conversions, this
can be done in userspace postprocessing.

Adding 150k of relayfs source in order to do tracing is scary. I don't see
any real advantage over a nicely implemented per-CPU ringbuffer, which is
lock free and does not add variable timed delays in the log path. Don't
tell me that a ringbuffer is not suitable; it's a question of size, and
it is the same problem for relayfs. If you don't have enough buffers, it
does not work. This applies to every implementation of trace buffering
you do. In space-constrained systems relayfs is even worse, as it needs
more memory than the plain ringbuffer.

The ringbuffer has a nice advantage. In case the system crashes, you can
retrieve the last and therefore most interesting information from the
ringbuffer without any hassle via BDI or, in the worst case, via a serial
dump. You can even copy the tail of the buffer into permanent storage
like buffered SRAM so it can be retrieved after reboot.
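
The kind of ringbuffer argued for here can be sketched in a few lines. This is a single-writer, user-space illustration with made-up names (`ring`, `ring_put`, `ring_tail`) and a tiny power-of-two size; a per-CPU kernel version would additionally need preemption and interrupt handling, which is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4	/* must be a power of two */

struct ring {
	unsigned long slot[RING_SLOTS];
	unsigned long head;	/* total number of events ever written */
};

/* Constant-time write: the oldest event is silently overwritten. */
static void ring_put(struct ring *r, unsigned long ev)
{
	assert(r != NULL);
	r->slot[r->head & (RING_SLOTS - 1)] = ev;
	r->head++;
}

/* Copy up to max of the newest events, oldest first; returns the count. */
static size_t ring_tail(const struct ring *r, unsigned long *out, size_t max)
{
	size_t avail = r->head < RING_SLOTS ? r->head : RING_SLOTS;
	size_t n = max < avail ? max : avail;

	assert(out != NULL);
	for (size_t i = 0; i < n; i++)
		out[i] = r->slot[(r->head - n + i) & (RING_SLOTS - 1)];
	return n;
}
```

`ring_tail` is the part that matters after a crash: the newest events are always present, at the cost of losing the oldest ones.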

Splitting the trace into different paths is nice to have but I don't see
a single point which cannot be done by a userspace (hostside)
postprocessing tool. It adds another non time constant component to the
trace path. Even the per CPU ringbuffers can be nicely synchronized by a
userspace postprocessing tool without adding complex synchronization
functions.

Replacing printk with a varargs print into an event buffer is a nice idea
to replace serial logging of long-lasting debug features. Must we really
include 150k of source for this, or can we just increase the log buffer size
or improve printk itself?
In the case of time-related tracing it's just overkill. The printk
information is mostly a string, which can be replaced by the address at
which the printk happens. Any arguments that are available can be
dumped in binary form. All this information can be converted into human
readable form by postprocessing.
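
The idea of logging an address plus raw arguments and formatting later can be illustrated as below. This is a user-space sketch under simplifying assumptions: exactly two `long` arguments, `%ld` conversions only, and decoding in the same process, so the format pointer is still valid. A real postprocessing tool would have to resolve the address against the kernel image. `bin_record`, `log_binary` and `decode_record` are invented names.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_ARGS 2

struct bin_record {
	const char *fmt;	/* address of the format string, not a copy */
	long arg[MAX_ARGS];	/* raw arguments, unformatted */
};

/* Fast path: store a pointer and two words, no string formatting. */
static void log_binary(struct bin_record *rec, const char *fmt,
		       long a0, long a1)
{
	assert(rec != NULL && fmt != NULL);
	rec->fmt = fmt;
	rec->arg[0] = a0;
	rec->arg[1] = a1;
}

/* Postprocessing: formatting happens later, outside the traced path. */
static int decode_record(const struct bin_record *rec, char *out, size_t len)
{
	return snprintf(out, len, rec->fmt, rec->arg[0], rec->arg[1]);
}
```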

I wonder whether the various formatting options of the trace are really
of any value. I need neither strings, hex strings nor XML-formatted
information from the kernel. A maximum of 8192 bytes of user information
makes me frown. Tracing is not a copy-to-userspace function, or am I
missing something?

All tracepoints are unconditionally compiled into the kernel, whether
they are enabled or not. Why is it necessary to check the enabled bit
for information I'm not interested in? Why can't I compile this away by
not enabling the tracepoint at all?

I don't need to point out the various coding style issues again, but I
question whether
atomic_set(&var, atomic_read(&var) | bit);
which can be found in several places, is really doing what it suggests
to do.
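
The problem with that construct is that the read and the set are two separate atomic operations, so an update made in between by another CPU is lost. The sketch below shows the difference using C11 atomics in user space rather than the kernel's atomic_t API. `trace_flags` and both function names are hypothetical. In the kernel itself, set_bit() is the usual tool for this job.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint trace_flags;

/* Racy: another CPU can change trace_flags between the load and the store. */
static void set_flag_racy(unsigned int bit)
{
	atomic_store(&trace_flags, atomic_load(&trace_flags) | bit);
}

/* One atomic read-modify-write: no concurrent update can be lost. */
static void set_flag_atomic(unsigned int bit)
{
	assert(bit != 0);
	atomic_fetch_or(&trace_flags, bit);
}
```

Single-threaded, the two behave the same. The lost update only shows up under concurrency, which is why it tends to survive testing.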

I did a short test on a 300MHz PIII box and the maximum time spent in
the log path (interrupts disabled during measurement) is about 30us.
Extrapolated to a 74MHz ARM SoC it will sum up to ~90-120us, which makes
it purely useless.

Summary:

1. The code is not doing what it claims to do.
2. The code adds unnecessary overhead
3. It's not useful for low speed systems.

Question:
Why is the code included ?

tglx



2005-01-14 22:52:43

by Andre Eisenbach

Subject: Re: 2.6.11-rc1-mm1

On Fri, 14 Jan 2005 00:23:52 -0800, Andrew Morton <[email protected]> wrote:
> - Added FUSE (filesystem in userspace) for people to play with. Am agnostic
> as to whether it should be merged (haven't read it at all closely yet,
> either), but I am impressed by the amount of care which has obviously gone
> into it. Opinions sought.

This is great news!

As a long time user of KDE's kio-slaves, I was always missing the
kio-slave functionality on the command line and in non-kde programs.
FUSE provides a kio-slave interface, but hopefully the inclusion of
FUSE in the mm-kernel will cause more "fuse native" filesystems to
come out which provide the functionality of the various kio-slaves.

Some things I'd like to see (as I am currently using the KIO
equivalent) implemented as FUSE fs:
- "fish", virtual file access over ssh
- "audiocd", virtual audio cd filesystem which copies and encodes
audio tracks on the fly
- "ftp", virtual file system ftp server access
etc..

Imagination is the limit, and since it can be implemented in userspace
pretty easily with FUSE, I am looking forward to seeing what people can
come up with, and hope that FUSE is here to stay.

Cheers,
Andre

2005-01-14 23:03:19

by Tim Bird

Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour wrote:
> Roman Zippel wrote:
>>You don't think that's a little overkill?

Based on the descriptions below, I think Roman is right. There's
too much going on here for the average user. I haven't looked closely,
but some of the stuff seems to be for esoteric use cases. There are
two ways to approach it:
- add a simplified API for the most common usage
- strip out the stuff that's not really needed, and figure out
workarounds for things (like tracing initialization) that need
special assistance.

Some of these options (e.g. bufsize) are available to the user
via tracedaemon. I can honestly say I haven't got a clue what
to use for some of them, and so always leave them at defaults.

> I can see why you'd say this as a first impression, but really it isn't.
>
> Here's a simple primer to this call's parameters:
> channel_path, mode:
> Where does this appear in relayfs and what rights do
> user-space apps have over it (rwx).
> bufsize, nbufs:
> Usually things have to be subdivided in sub-buffers to make
> both writing and reading simple. LTT uses this to allow,
> among other things, random trace access.

Could these be simplified to a few enumerated modes?

> channel_flags, channel_callbacks:
> start_reserve, end_reserve, rchan_start_reserve:
> resize_min, resize_max:
> init_buf, init_buf_size:

It seems like you could remove these from relay_open() and move them to
get()/set() operations if you wanted to simplify the open API.
Or, you could create other (separate) APIs to pre-fill the buffer or
reserve space. Do you want me to take a look at this and propose
some specific changes? (I won't get to this until Monday, though).
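
To make the first suggestion concrete, here is one shape a slimmed-down open plus optional setters could take. It is a sketch of the idea only, not a proposal for the actual relayfs interface: `chan_opts`, `chan_open_simple`, `chan_set_reserve` and `chan_set_resize` are invented names, and the "channel" is just an options struct.

```c
#include <assert.h>
#include <stddef.h>

struct chan_opts {
	size_t bufsize, nbufs;
	size_t start_reserve, end_reserve;
	size_t resize_min, resize_max;	/* 0 means "fixed size" */
};

/* Simple open: only the two sizes are required, the rest default to 0. */
static struct chan_opts chan_open_simple(size_t bufsize, size_t nbufs)
{
	struct chan_opts o = { 0 };

	assert(bufsize > 0 && nbufs > 0);
	o.bufsize = bufsize;
	o.nbufs = nbufs;
	return o;
}

/* Optional knobs move to setters that most callers never touch. */
static void chan_set_reserve(struct chan_opts *o, size_t start, size_t end)
{
	o->start_reserve = start;
	o->end_reserve = end;
}

/* Returns 0 on success, -1 if the range is empty or inverted. */
static int chan_set_resize(struct chan_opts *o, size_t min, size_t max)
{
	if (min == 0 || min > max)
		return -1;
	o->resize_min = min;
	o->resize_max = max;
	return 0;
}
```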

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-01-14 23:15:36

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


[repost. first reply had wrong lkml CC.]

Hello Thomas,

First, thanks for the feedback, it's greatly appreciated.

Lots of stuff in here. I don't mean to drop any of your arguments, but
I'm going to reply to this in a way that makes this reply and further
responses as useful as possible to outsiders. Let me know if you
think I've dropped something important.

Thomas Gleixner wrote:

>> The "non-locking" claim is nice, but a do { } while loop in the slot
>> reservation for every event including a do { } while loop in the slow
>> path is just a replacement of locking without actually using a lock. I
>> don't care whether this is likely or unlikely to happen, it's just bogus
>> to add a non constant time path for debugging/tracing purposes.


relayfs implements two schemes: lockless and locking. The latter uses
standard linear locking mechanisms. If you need stringent constant
time, you know what to do.


>> Default timestamp measuring with do_gettimeofday is also contrary to the
>> non locking argument. There is
>> a) a lock in there
>> b) it might loop because it's a sequential lock.


That's true, but that's not a limitation of relayfs per se. We'd gladly
use any timing facility available to us. We already use the TSC when
available.


>> If you have no TSC, you can at least do jiffies + event-number based
>> tracing, which is not so fine-grained but still gives you the timeline
>> of the events.


Interesting. I've added this to the to-do list.


>> There is also no need to do time diff calculations / conversions, this
>> can be done in userspace postprocessing.


Ah yes, that's the kind of thing that you learn by getting bitten by it.
The problem is the size of the data stream. Diffs are an easy and
rather inexpensive way of reducing trace sizes. Logging 2 or 4 more bytes
per event when you've got tens of thousands of events occurring per second
does have a noticeable impact. If this is really a sticking point, we
could provide a way for writing full time-stamps.
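
The size argument can be illustrated with a small delta encoder: one full 64-bit base value, then 32-bit differences. The widths and the function names are assumptions chosen for the example and are not LTT's on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode n increasing 64-bit timestamps as one full base value plus
 * n - 1 32-bit deltas. Returns 0 on success, -1 if the input is not
 * increasing or a delta does not fit in 32 bits.
 */
static int encode_deltas(const uint64_t *ts, size_t n,
			 uint64_t *base, uint32_t *delta)
{
	assert(n > 0);
	*base = ts[0];
	for (size_t i = 1; i < n; i++) {
		uint64_t d = ts[i] - ts[i - 1];

		if (ts[i] < ts[i - 1] || d > UINT32_MAX)
			return -1;
		delta[i - 1] = (uint32_t)d;
	}
	return 0;
}

/* Userspace side: rebuild the full timestamps from base and deltas. */
static void decode_deltas(uint64_t base, const uint32_t *delta, size_t n,
			  uint64_t *ts)
{
	ts[0] = base;
	for (size_t i = 1; i < n; i++)
		ts[i] = ts[i - 1] + delta[i - 1];
}
```

With these widths each event after the first carries 4 bytes of timestamp instead of 8, and the full values are recovered in postprocessing.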


>> you do. In space-constrained systems relayfs is even worse, as it needs
>> more memory than the plain ringbuffer.


Don't get us wrong, we can strip this down to make this a stupid ring-
buffer. But the fact of the matter is that in trying to use such a thing,
you will find yourself reimplementing the exact things we did for the
same purposes.


>> The ringbuffer has a nice advantage. In case the system crashes, you can
>> retrieve the last and therefore most interesting information from the
>> ringbuffer without any hassle via BDI or, in the worst case, via a serial
>> dump. You can even copy the tail of the buffer into permanent storage
>> like buffered SRAM so it can be retrieved after reboot.


And there's a reason why you can't do that with relayfs? We've looked at
this and interfacing between relayfs and crashdump is trivial.


>> Splitting the trace into different paths is nice to have but I don't see
>> a single point which cannot be done by a userspace (hostside)
>> postprocessing tool. It adds another non time constant component to the
>> trace path. Even the per CPU ringbuffers can be nicely synchronized by a
>> userspace postprocessing tool without adding complex synchronization
>> functions.


Again, life is a merciless teacher. LTT did initially start with a single
eat-your-breakfast-dinner-and-supper-in-one-place buffer. But that just
doesn't scale. If you're doing flight-recording, for example, you need
to have a separate channel which contains process creation/exit,
otherwise you have a hard time interpreting the data.


>> In the case of time-related tracing it's just overkill. The printk
>> information is mostly a string, which can be replaced by the address at
>> which the printk happens. Any arguments that are available can be
>> dumped in binary form. All this information can be converted into human
>> readable form by postprocessing.


I'm sorry, I don't understand your argument here.


>> I wonder whether the various formatting options of the trace are really
>> of any value. I need neither strings, hex strings nor XML-formatted
>> information from the kernel. A maximum of 8192 bytes of user information
>> makes me frown. Tracing is not a copy-to-userspace function, or am I
>> missing something?


Dynamically created custom events and events directed by the likes of
DProbes need something to write to, and user-space utilities must have
a way of determining what format this data was written in. That's all
there is to see here.


>> All tracepoints are unconditionally compiled into the kernel, whether
>> they are enabled or not. Why is it necessary to check the enabled bit
>> for information I'm not interested in? Why can't I compile this away by
>> not enabling the tracepoint at all?


But you can. Have a look at include/linux/ltt-events.h:
#else /* defined(CONFIG_LTT) */
#define ltt_ev(ID, DATA)
#define ltt_ev_trap_entry(ID, EIP)
#define ltt_ev_trap_exit()
#define ltt_ev_irq_entry(ID, KERNEL)
#define ltt_ev_irq_exit()
#define ltt_ev_schedchange(OUT, IN)
#define ltt_ev_soft_irq(ID, DATA)
#define ltt_ev_process(ID, DATA1, DATA2)
#define ltt_ev_process_exit(DATA1, DATA2)
#define ltt_ev_file_system(ID, DATA1, DATA2, FILE_NAME)
#define ltt_ev_timer(ID, SDATA, DATA1, DATA2)
#define ltt_ev_memory(ID, DATA)
#define ltt_ev_socket(ID, DATA1, DATA2)
#define ltt_ev_ipc(ID, DATA1, DATA2)
#define ltt_ev_network(ID, DATA)
#define ltt_ev_heartbeat()
#endif /* defined(CONFIG_LTT) */
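
The pattern in that header, reduced to a stand-alone sketch: when the config symbol is not defined, the hook macro expands to an empty statement and its arguments are never evaluated. `CONFIG_SKETCH_TRACE`, `sketch_ev` and `traced_work` are stand-in names, not LTT's. This switches all hooks at once and does not give the per-tracepoint compile-time selection asked about above.

```c
#include <assert.h>

/* #define CONFIG_SKETCH_TRACE 1 */	/* hypothetical switch, left off */

static unsigned long sketch_events;

#ifdef CONFIG_SKETCH_TRACE
#define sketch_ev(ID, DATA) \
	do { (void)(ID); (void)(DATA); sketch_events++; } while (0)
#else
/* Expands to an empty statement: the arguments are not even evaluated. */
#define sketch_ev(ID, DATA) do { } while (0)
#endif

static int traced_work(int x)
{
	assert(x >= 0);
	sketch_ev(1, x);	/* costs nothing when tracing is compiled out */
	return x * 2;
}
```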


>> I don't need to point out the various coding style issues again, but I
>> question whether
>> atomic_set(&var, atomic_read(&var) | bit);
>> which can be found in several places, is really doing what it suggests
>> to do.


If there are actual code snippets you think are broken, we'll gladly
fix them.


>> I did a short test on a 300MHz PIII box and the maximum time spent in
>> the log path (interrupts disabled during measurement) is about 30us.
>> Extrapolated to a 74MHz ARM SoC it will sum up to ~90-120us, which makes
>> it purely useless.


Granted tracing is not free, but please avoid spreading FUD without
actually carrying out proper testing. We've done quite a large number
of tests and we've demonstrated over and over that LTT, and ltt-over-
relayfs, is actually very efficient. If you're interested in actual
test data, then you may want to check out the following:
http://www.opersys.com/ftp/pub/LTT/Documentation/ltt-usenix.ps.gz
http://lwn.net/Articles/13870/

We are aware of the cost of the various tracing components, as you
can see by my earlier posting about early-checking to minimize the
cost of the tracing hooks for kernel compiled with them, and are
open for any optimization. If you have any concrete suggestions, save
the scrap-everything-I-know-better (which is really unproductive as
you would anyway have to go down the same path we have), we are more
than willing to entertain them.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-14 23:25:25

by Tim Bird

Subject: Re: 2.6.11-rc1-mm1

Thomas Gleixner wrote:
> On Fri, 2005-01-14 at 00:23 -0800, Andrew Morton wrote:
>
>>- Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
>> haven't yet taken as close a look at LTT as I should have. Probably neither
>> have you.
>
> I have. Maybe you should have. I really don't see a good argument to
> include this code.

[ Lots of excellent criticisms omitted.]

I don't want to be argumentative, but possibly (to answer your last
question first), there are two reasons to put this in -mm:
- there's no tracing infrastructure in the kernel now (except for
kprobes, which provides hooks for creating tracepoints dynamically,
but not 1) supporting infrastructure for timestamping, managing event
data, etc., or 2) a static list of generally useful tracepoints).
- to generate this discussion.

>
> I did a short test on a 300MHz PIII box and the maximum time spent in
> the log path (interrupts disabled during measurement) is about 30us.
> Extrapolated to a 74MHz ARM SoC it will sum up to ~ 90-120us, which makes
> it purely useless.

I've used it for various tasks, and I know others who have. I wouldn't
recommend it in its present form for deep scheduling tweaks or debugging
kernel race conditions (which it is more likely to mask than
it is to find), but inapplicability there hardly makes it worthless for
other things.

>
> Summary:
>
> 1. The code is not doing what it claims to do.
I'm guessing the sense of this is in the micro-claims which are implied
(e.g. runs lockless and therefore avoids cache thrashing), rather than
the high-level claim of providing useful information in some situations.
It clearly does the latter. At least it has for me.

> 2. The code adds unnecessary overhead
I agree it could be improved. The threshold for "unnecessary" varies
by task.

> 3. It's not useful for low speed systems.
I've used it on low speed systems.

> Question:
> Why is the code included ?
See above.

By the way, don't think that your comments are not appreciated.
I'm not particularly glued to any specific part of the implementation.
I'm excited to see tracing discussed here, if only to avoid
duplicate efforts and point out danger areas, for multiple tracing
projects that I am aware of.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-01-15 00:01:43

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

Hi Karim,

On Fri, 2005-01-14 at 18:09 -0500, Karim Yaghmour wrote:
> >> The "non-locking" claim is nice, but a do { } while loop in the slot
> >> reservation for every event including a do { } while loop in the slow
> >> path is just a replacement of locking without actually using a lock. I
> >> don't care whether this is likely or unlikely to happen, it's just bogus
> >> to add a non constant time path for debugging/tracing purposes.
>
> relayfs implements two schemes: lockless and locking. The latter uses
> standard linear locking mechanisms. If you need stringent constant
> time, you know what to do.

It's not only me who needs constant time. Everybody interested in
tracing will need that. In my opinion it's a principle of tracing.

The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
locks by do { } while loops. So what ?

> >> Default timestamp measuring with do_gettimeofday is also contrary to the
> >> non locking argument. There is
> >> a) a lock in there
> >> b) it might loop because it's a sequential lock.
>
> >> If you have no TSC you can do at least a jiffies + event-number based,
> >> not so finegrained tracing which gives you at least the timeline of the
> >> events.
>
> Interesting. I've added this to the to-do list.

Interesting. I read this phrase more than once in the discussion of your
patch. When will the to-do list be done ?

> >> There is also no need to do time diff calculations / conversions, this
> >> can be done in userspace postprocessing.
>
> Ah yes, that's the kind of thing that you learn by getting bitten by it.
> The problem is the size of the data stream. Diffs are an easy and a
> rather inexpensive way of reducing trace sizes. Logging 2 or 4 more bytes
> per event when you've got tens of thousands of events occurring per second
> does have a noticeable impact. If this is really a sticking point, we
> could provide a way for writing full time-stamps.

I'm impressed by your sudden awareness of time constraints. Allowing 8192
bytes of user event size, string printing with varargs and XML tracing
is not biting you ?

If you only store the low 32 bits of the TSC you have a valid timeline,
provided you do the math in your postprocessor. Depending on the
speed, 16 bits are enough.
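
The postprocessing math for truncated timestamps can be sketched as follows. This is a minimal userspace sketch, assuming that two consecutive events are always less than one full 32-bit counter wrap apart; the names `tsc_unwrap` and `unwrap_tsc32` are hypothetical and not part of LTT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical postprocessor state. */
struct tsc_unwrap {
    uint64_t last;   /* last reconstructed 64-bit timestamp */
    int primed;      /* 0 until the first sample has been seen */
};

/*
 * Extend a truncated 32-bit timestamp to 64 bits.
 * Assumption: consecutive events are less than 2^32 ticks apart, so the
 * unsigned 32-bit difference equals the true delta even across a wrap.
 */
static uint64_t unwrap_tsc32(struct tsc_unwrap *st, uint32_t low)
{
    uint32_t delta;

    assert(st != NULL);
    if (!st->primed) {
        st->last = low;
        st->primed = 1;
        return st->last;
    }
    delta = low - (uint32_t)st->last;   /* wraps modulo 2^32 */
    st->last += delta;
    return st->last;
}
```

The same idea works for 16-bit stamps as long as the event rate guarantees at least one event per wrap period.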

> >> you do. In space-constrained systems relayfs is even worse as it needs
> >> more memory than the plain ringbuffer.
>
> Don't get us wrong, we can strip this down to make this a stupid ring-
> buffer. But the fact of the matter is that in trying to use such a thing,
> you will find yourself reimplementing the exact things we did for the
> same purposes.

A ring buffer is not stupid at all. I have implemented tracing with ring
buffers already, so I know the limitations and the PITA.

OTOH ringbuffers _ARE_ lockless and constant-time, and allow you
to implement the splitting and related functionality in userspace
postprocessing, which has to be done anyway.
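
The kind of ring buffer described here can be sketched as below. This is an illustration only, assuming a single writer per buffer (for example one buffer per CPU, written from one context); the structure and function names are hypothetical and the slot count is tiny on purpose.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* must be a power of two; small for illustration */

struct trace_event {
    uint32_t stamp;     /* e.g. low 32 bits of a cycle counter */
    uint16_t id;        /* event number */
    uint16_t data;      /* event payload */
};

struct trace_ring {
    struct trace_event slot[RING_SLOTS];
    uint32_t head;      /* total number of events ever written */
};

/* Constant-time write: one mask, one store; the oldest entry is overwritten. */
static void ring_log(struct trace_ring *r, uint32_t stamp,
                     uint16_t id, uint16_t data)
{
    struct trace_event *e = &r->slot[r->head & (RING_SLOTS - 1)];

    e->stamp = stamp;
    e->id = id;
    e->data = data;
    r->head++;
}

/* Postmortem helper: fetch the n-th newest event (0 = newest). */
static const struct trace_event *ring_newest(const struct trace_ring *r,
                                             uint32_t n)
{
    assert(n < RING_SLOTS && n < r->head);
    return &r->slot[(r->head - 1 - n) & (RING_SLOTS - 1)];
}
```

Splitting, filtering and merging of per-CPU buffers would then happen in the userspace postprocessor, as argued above.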

Do not tell me that streaming out data in a constant stream is worse
than putting them into nodes of a filesystem and retrieving them from
there.

Setting up a simple /dev/proc/sys interface and doing a
cat /xxx/trace/cpuX >file/interface/whatever
is not less efficient than the conversion of your data into a file.

> >> The ringbuffer has a nice advantage. In case the system crashes you can
> >> retrieve the last and therefore most interesting information from the
> >> ringbuffer without any hassle via BDI or in the worst case via a serial
> >> dump. You can even copy the tail of the buffer into a permanent storage
> >> like buffered SRAM so it can be retrieved after reboot.
>
>
> And there's a reason why you can't do that with relayfs? We've looked at
> this and interfacing between relayfs and crashdump is trivial.

Sure, I have to grab stuff out of a filesystem instead of simply doing
for (....)
sendserial(buffer[i]);

I know you can provide a nice function for doing so, but it will take
another xxx kB of code instead of a 10 line simple solution.

> >> Splitting the trace into different paths is nice to have but I don't see
> >> a single point which cannot be done by a userspace (hostside)
> >> postprocessing tool. It adds another non time constant component to the
> >> trace path. Even the per CPU ringbuffers can be nicely synchronized by a
> >> userspace postprocessing tool without adding complex synchronization
> >> functions.
>
>
> Again life is a merciless teacher. LTT did initially start with a single
> eat-your-breakfast-dinner-and-supper-in-one-place buffer. But that just
> doesn't scale. If you're doing flight-recording, for example, you need
> to have a separate channel which contains process creation/exit,
> otherwise you have a hard time interpreting the data.

Haha. If you have eventstamps and timestamps (even the jiffie + event
based ones) nothing is hard to interpret. I guess the ethereal guys are
rolling on the floor and laughing.

The kernel is not the place to fix your postprocessing problems. Sure
you have to do more complicated stuff, but you move the burden from
kernel to a place where it does not hurt.

What's hard about interpreting and filtering a stream of data ?

> >> In case of time related tracing it's just overkill. The printk
> >> information is mostly a string, which can be replaced by the address on
> >> which the printk is happening. The maybe available arguments can be
> >> dumped in binary form. All this information can be converted into human
> >> readable form by postprocessing.
>
> I'm sorry, I don't understand your argument here.

What's complicated ? In case I want to have timing related tracing which
includes printks, then storing the address where the printk is coming
from is enough instead of a variable-length string. Storing some args in
binary form with this address should not be too hard to achieve.

Again, it's a postprocessing problem.
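
A rough userspace sketch of this idea follows. It assumes a fixed-size record holding an address plus two raw arguments; here the address of the format string stands in for the printk call-site address, which a real postprocessor would resolve against the kernel image. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define REC_MAX_ARGS 2

/* Fixed-size record: no string is copied in the trace path. */
struct printk_rec {
    const char *site;              /* stand-in for the call-site address */
    uint32_t arg[REC_MAX_ARGS];    /* raw binary arguments */
};

/* Trace path: store an address and the arguments, nothing else. */
static void rec_printk(struct printk_rec *rec, const char *fmt,
                       uint32_t a0, uint32_t a1)
{
    rec->site = fmt;
    rec->arg[0] = a0;
    rec->arg[1] = a1;
}

/* Postprocessing: turn the record back into human-readable text. */
static int rec_format(const struct printk_rec *rec, char *out, size_t len)
{
    assert(rec->site != NULL);
    return snprintf(out, len, rec->site,
                    (unsigned)rec->arg[0], (unsigned)rec->arg[1]);
}
```

The record has the same size for every message, so the cost in the trace path does not depend on the length of the text.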

> >> I wonder whether the various formatting options of the trace are really
> >> of any value. I need neither strings, HEX strings nor XML formatted
> >> information from the kernel. Max. 8192 Byte of user information makes me
> >> frown. Tracing is not a copy to userspace function or am I missing
> >> something ?
> Dynamically created custom events and events directed by the likes of
> DProbes need something to write to, and user-space utilities must have
> a way of determining what format this data was written in. That's all
> there is to see here.

And therefore I need strings, HEX strings, XML ? A simple number and the
data behind it give you all you need.

Again, it's a postprocessing problem.

> >> All tracepoints are unconditionally compiled into the kernel, whether
> >> they are enabled or not. Why is it necessary to check the enabled bit
> >> for information I'm not interested in ? Why can't I compile this away by
> >> not enabling the tracepoint at all ?

> But you can. Have a look at include/linux/ltt-events.h:
> #else /* defined(CONFIG_LTT) */
> #define ltt_ev(ID, DATA)
> #define ltt_ev_trap_entry(ID, EIP)
> #define ltt_ev_trap_exit()

Sure, I'm aware that I can switch everything off, but I cannot deselect
specific tracepoints at compile time to reduce the overhead.

If I want to have custom tracepoints for my specific problem, then why do I
need the overhead of the other stuff ?

> >> I don't need to point out the various coding style issues again, but I
> >> question if
> >> atomic_set(&var, atomic_read(&var) | bit);
> >> which can be found in several places is really doing what it suggests
> >> to do.
>
> If there are actual code snippets you think are broken, we'll gladly
> fix them.

If you consider the above example, which is taken from your code, as sane
then we can stop talking about this.
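
The concern with that pattern can be illustrated in userspace with C11 atomics. This is a sketch, assuming the pattern in question is a separate read followed by a separate set on the same variable; the function names are hypothetical and this is not the LTT code itself.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint flags;

/*
 * Two separate atomic operations: another CPU can change flags between
 * the load and the store, and that update is then silently overwritten.
 */
static void set_flag_racy(unsigned bit)
{
    assert(bit != 0);
    atomic_store(&flags, atomic_load(&flags) | bit);
}

/* One atomic read-modify-write: there is no window for a lost update. */
static void set_flag_atomic(unsigned bit)
{
    assert(bit != 0);
    atomic_fetch_or(&flags, bit);
}
```

Both functions give the same result when only one thread runs; they differ only under concurrent modification, which is exactly the case atomic variables are meant for.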

> >> I did a short test on a 300MHz PIII box and the maximum time spent in
> >> the log path (interrupts disabled during measurement) is about 30us.
> >> Extrapolated to a 74MHz ARM SoC it will sum up to ~ 90-120us, which makes
> >> it purely useless
>
> Granted tracing is not free, but please avoid spreading FUD without
> actually carrying out proper testing. We've done quite a large number
> of tests and we've demonstrated over and over that LTT, and ltt-over-
> relayfs, is actually very efficient. If you're interested in actual
> test data, then you may want to check out the following:
> http://www.opersys.com/ftp/pub/LTT/Documentation/ltt-usenix.ps.gz
> http://lwn.net/Articles/13870/

Karim, please do not use the FUD argument.

I do not doubt that it is efficient from your point of view.

But if short tests show this and I'm able to prove those numbers, you can
hardly claim that the scaling from a 300MHz PIII to a 74MHz ARM SoC is wrong.
It's simple math.
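
The arithmetic behind that extrapolation can be written down as a small helper. It assumes, as a simplification, that the log path needs the same number of cycles on both CPUs, so time scales with the clock ratio only; cache, memory and instruction-set differences are ignored. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a measured duration by the ratio of the two clock frequencies. */
static uint32_t scale_us(uint32_t measured_us, uint32_t src_mhz,
                         uint32_t dst_mhz)
{
    assert(dst_mhz != 0);
    return (uint32_t)(((uint64_t)measured_us * src_mhz) / dst_mhz);
}
```

With the numbers from the thread, 30us at 300MHz scales to roughly 121us at 74MHz, which is the upper end of the quoted ~90-120us range.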

> We are aware of the cost of the various tracing components, as you
> can see by my earlier posting about early-checking to minimize the
> cost of the tracing hooks for kernel compiled with them, and are
> open for any optimization. If you have any concrete suggestions, save
> the scrap-everything-I-know-better (which is really unproductive as
> you would anyway have to go down the same path we have), we are more
> than willing to entertain them.

Yes, the "you would anyway have to go down the same path we have"
argument really scares me away from doing so.

I don't buy this kind of argument.

tglx


2005-01-15 00:20:15

by Andi Kleen

Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 02:58:38PM -0800, Tim Bird wrote:
> > Roman Zippel wrote:
> >>You don't think that's a little overkill?
> >
> Based on the descriptions below, I think Roman is right. There's
> too much going on here for the average user. I haven't looked closely,
> but some of the stuff seems to be for esoteric use cases. There are
> two ways to approach it:
> - add a simplified API for the most common usage
> - strip out the stuff that's not really needed, and figure out
> workarounds for things (like tracing initialization) that need
> special assistance.
>
> Some of these options (e.g. bufsize) are available to the user
> via tracedaemon. I can honestly say I haven't got a clue what
> to use for some of them, and so always leave them at defaults.

This is a strong cue that they are unneeded.

> > I can see why you'd say this as a first impression, but really it isn't.
> >
> > Here's a simple primer to this call's parameters:
> > channel_path, mode:
> > Where does this appear in relayfs and what rights do
> > user-space apps have over it (rwx).
> > bufsize, nbufs:
> > Usually things have to be subdivided in sub-buffers to make
> > both writing and reading simple. LTT uses this to allow,
> > among other things, random trace access.
> Could these be simplified to a few enumerated modes?

Just make it a global single define in the source.

>
> > channel_flags, channel_callbacks:
> > start_reserve, end_reserve, rchan_start_reserve:
> > resize_min, resize_max:
> > init_buf, init_buf_size:
>
> It seems like you could remove these from relay_open() and move them to
> get()/set() operations if you wanted to simplify the open API.

I think everything for which no clear need has been demonstrated should
be removed. If there is a real need it can still be re-added later.
But in the current form it is far too complicated and too fat.

> Or, you could create other (separate) APIs to pre-fill the buffer or
> reserve space. Do you want me to take a look at this and propose
> some specific changes? (I won't get to this until Monday, though).

No, no, it needs far fewer APIs, not more.

-Andi

2005-01-15 00:22:34

by Andrew Morton

Subject: Re: 2.6.11-rc1-mm1

Thomas Gleixner wrote:
>
> ...
> I'm impressed by your sudden awareness of time constraints. Allowing 8192
> bytes of user event size, string printing with varargs and XML tracing
> is not biting you ?

? I see no XML in there.

akpm:/usr/src/25> grep -i xml patches/ltt* patches/relayfs*
patches/ltt-core-headers.patch:+#define LTT_CUSTOM_EV_FORMAT_TYPE_XML 3
akpm:/usr/src/25>

>
> Haha. If you have eventstamps and timestamps (even the jiffie + event
> based ones) nothing is hard to interpret. I guess the ethereal guys are
> rolling on the floor and laughing.
>
> The kernel is not the place to fix your postprocessing problems. Sure
> you have to do more complicated stuff, but you move the burden from
> kernel to a place where it does not hurt.

I thought Karim said that this was a form of data compression.

>
> Yes, the "you would anyway have to go down the same path we have"
> argument really scares me away from doing so.
>
> I don't buy this kind of argument.

I do. When someone has been working on a real-world project for several
years we *need* to understand all the problems which that person
encountered before we can competently review the implementation. Surely
you've been there before: you throw out all the old stuff, write a new one
and once you've addressed all the warts and corner cases and
weird-but-valid requirements it ends up with the same complexity as the
original.

2005-01-15 00:24:45

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

Hi Tim,

On Fri, 2005-01-14 at 15:22 -0800, Tim Bird wrote:
> [ Lots of excellent criticisms omitted.]

Thanks for the compliment :)

> I don't want to be argumentative, but possibly (to answer your last
> question first), there are twofold reasons to put this in -mm:
> - there's no tracing infrastructure in the kernel now (except for
> kprobes - which provides hooks for creating tracepoints dynamically,
> but not 1) supporting infrastructure for timestamping, managing event
> data, etc., and 2) a static list of generally useful tracepoints.
> - to generate this discussion.

I have no objection at all to put instrumentation into the kernel. Quite
the contrary, I would appreciate it.

Putting tracepoints into the kernel is great.
Providing a trace/log/instrumentation framework is great.
Adding the given overhead is not.

> I've used it for various tasks, and I know others who have. I wouldn't
> recommend it in its present form for deep scheduling tweaks or debugging
> kernel race conditions (which it is more likely to mask than
> it is to find), but inapplicability there hardly makes it worthless for
> other things.

Putting a 200k patch into the kernel for limited usage, and maybe
blocking a simple, non-intrusive and more generic implementation by
its mere presence, makes it inapplicable enough.

The goal, IMHO, is to merge the instrumentation points from LTT and other
projects like DSKI, together with the places where in-kernel
instrumentation for specific purposes is already available, and to use a
simple and effective framework which moves the burden into postprocessing
and provides a simple postmortem dump interface.

When this is available, trace tool developers can concentrate on
postprocessing improvement rather than moving postprocessing
incapabilities into the kernel.

> By the way, don't think that your comments are not appreciated.
> I'm not particularly glued to any specific part of the implementation.
> I'm excited to see tracing discussed here, if only to avoid
> duplicate efforts and point out danger areas, for multiple tracing
> projects that I am aware of.

So am I.

tglx


2005-01-15 01:07:19

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

On Fri, 2005-01-14 at 16:26 -0800, Andrew Morton wrote:
> ? I see no XML in there.
>
> akpm:/usr/src/25> grep -i xml patches/ltt* patches/relayfs*
> patches/ltt-core-headers.patch:+#define LTT_CUSTOM_EV_FORMAT_TYPE_XML 3
> akpm:/usr/src/25>

And what is this define for ?

> > The kernel is not the place to fix your postprocessing problems. Sure
> > you have to do more complicated stuff, but you move the burden from
> > kernel to a place where it does not hurt.
>
> I thought Karim said that this was a form of data compression.

Adding data compression in the form of an additional computation is really
inventive. Providing the information in a way that lets postprocessing tools
do the job without adding computations to the kernel is the goal. I
pointed out a couple of those possibilities in my previous mail.

> >
> > Yes, the "you would anyway have to go down the same path we have"
> > argument really scares me away from doing so.
> >
> > I don't buy this kind of argument.
>
> I do. When someone has been working on a real-world project for several
> years we *need* to understand all the problems which that person
> encountered before we can competently review the implementation.

I've been working on real-world problems for quite a long time and your
argument should apply the other way too. I have implemented
instrumentation in different flavours before, so I know exactly what I'm
talking about.

I'm well aware of the value of someone's experience and I'm not
going to throw it away, but I don't see the reverse: that accepting this
forces me to blindly agree with arguments from those persons.

> Surely you've been there before: you throw out all the old stuff,
> write a new one and once you've addressed all the warts and corner
> cases and weird-but-valid requirements it ends up with the same
> complexity as the original.

I disagree at this point.

Accepting the maturity of an implementation just from the argument that
somebody has done this for some time and therefore gained
experience is a quite weak argument, if one can show the opposite
by just reading the code and making a short real-life test.

If the goal is to provide some "cool to have" instrumentation in the
kernel, then I stop arguing immediately.

But this cannot be the goal. If we introduce instrumentation facilities
into the kernel, then they must be for general use, optimized for
non-intrusiveness, and replace all the other "[] provide measurement X"
config options instead of introducing parallel mechanisms.

I do not accept unnecessary complexity in the kernel, when you are able
to achieve the same goal by putting more thought into the
postprocessing. The kernel code is responsible for providing a simple and
fast interface for those tasks and nothing more. I don't see the point
why we need 150k of additional code with limitations/problems, which are
even obvious without running it, instead of a simple interface to
userland where different postprocessors can compete to do the job more
or less perfectly.

As I pointed out in my reply to Tim, I would be happy to have
instrumentation in the kernel, but I'm not willing to pay the price
which is requested by the currently discussed implementation.

tglx



2005-01-15 01:12:25

by Roman Zippel

Subject: Re: 2.6.11-rc1-mm1

Hi,

On Fri, 14 Jan 2005, Karim Yaghmour wrote:

> As you can see, most of this is already used in one way or another by
> LTT. The only thing LTT doesn't use is the dynamic resizing, but as was
> said earlier in this thread, some people actually want to have this.

This doesn't mean everything has to be put into a single call. Several
parameters can still be set after creation.

> start_reserve, end_reserve, rchan_start_reserve:
> Some subsystems, like LTT, need to be able to write some key
> data at sub-buffer boundaries. This is to specify how much
> space is required for said data.

Why should a subsystem care about the details of the buffer management?
You could move all this into the relay layer by making a relay channel
an event channel. I know you want to save space, but having a magic
event_struct_size array is not a good idea. If you have so many events
that a little more overhead causes problems, the tracing results won't be
reliable anymore anyway.
Simplicity and maintainability are far more important than saving a few
bytes. The general case should be fast and simple; leave the complexity to
the special cases.

bye, Roman

2005-01-15 01:20:50

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Thomas Gleixner wrote:
> Putting a 200k patch into the kernel for limited usage, and maybe
> blocking a simple, non-intrusive and more generic implementation by
> its mere presence, makes it inapplicable enough.

I think you've missed the other thread where people are claiming that
it's so generic as to be arcane ...

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-15 01:24:28

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Hello Thomas,

Gee Thomas, I guess you really want to take this one until the last
man is standing. Feel free to use the ad-hominem tone if it suits
you. Don't hold it against me though if I don't bite :)

Thomas Gleixner wrote:
> It's not only me who needs constant time. Everybody interested in
> tracing will need that. In my opinion it's a principle of tracing.

relayfs is a generalized buffering mechanism. Tracing is one application
it serves. Check out the web site: "high-speed data-relay filesystem."
Fancy name huh ...

> The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
> locks by do { } while loops. So what ?

Well for one thing, a portion of code running in user-context won't
disable interrupts while it's attempting to get buffer space, and
therefore won't impact on interrupt delivery.
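
Both sides of this argument are visible in a generic compare-and-swap reservation loop. The following is a userspace sketch using C11 atomics, not the relayfs code; `reserve_slot`, `write_offset` and `BUF_SIZE` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 4096u   /* illustrative buffer size */

static _Atomic size_t write_offset;   /* next free byte in the buffer */

/*
 * Reserve len bytes; returns the start offset, or (size_t)-1 if full.
 * The loop repeats only when another writer moved write_offset between
 * the load and the compare-and-swap. It takes no lock and disables no
 * interrupts, but its running time is not strictly bounded.
 */
static size_t reserve_slot(size_t len)
{
    size_t old = atomic_load(&write_offset);
    size_t next;

    assert(len > 0);
    do {
        if (len > BUF_SIZE - old)
            return (size_t)-1;
        next = old + len;
        /* on failure, old is refreshed with the current value */
    } while (!atomic_compare_exchange_weak(&write_offset, &old, next));

    return old;
}
```

Whether an occasional retry is acceptable depends on the use case, which is the point being debated above.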

> Interesting. I read this phrase more than once in the discussion of your
> patch. When will the to-do list be done ?

Well of course you hear it more than once, we are getting _a lot_ of
interesting feedback. Forgive me if I actually take the time to wait
a day or two for most everyone's feedback to come in and carry out
recommendations properly. Don't worry, I won't hold the changes too
long :)

> I'm impressed by your sudden awareness of time constraints. Allowing 8192
> bytes of user event size, string printing with varargs and XML tracing
> is not biting you ?

Use of these by definition sacrifices performance. It's there because
some people actually need it. Again, if you have some concrete advice
as to what needs to be changed, we'll gladly hear it.

> If you only store the low 32 bits of the TSC you have a valid timeline,
> provided you do the math in your postprocessor. Depending on the
> speed, 16 bits are enough.

We're already storing the low 32 bits of the TSC where available.

> A ring buffer is not stupid at all. I have implemented tracing with ring
> buffers already, so I know the limitations and the PITA.
>
> OTOH ringbuffers _ARE_ lockless and constant-time, and allow you
> to implement the splitting and related functionality in userspace
> postprocessing, which has to be done anyway.

We've had this debate before if you're interested to dig in the archives.
Here's a suggested implementation by Ingo:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103273730326318&w=2
And here are some reasons why this is incomplete:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103273967727564&w=2

> Do not tell me that streaming out data in a constant stream is worse
> than putting them into nodes of a filesystem and retrieving them from
> there.
>
> Setting up a simple /dev/proc/sys interface and doing a
> cat /xxx/trace/cpuX >file/interface/whatever
> is not less efficient than the conversion of your data into a file.

Clearly you haven't read the implementation and/or aren't familiar with
its use. Usually, what you want to do is open(), mmap(), write(), there
is no "conversion" to a file. The filesystem abstraction is just a
namespace holder for us.
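
The open(), mmap(), write() sequence can be sketched with plain POSIX calls. This is only an illustration: an ordinary file stands in for a relayfs channel file, the helper name `drain_channel` is hypothetical, and real relayfs files may need a different way of finding the buffer size than fstat().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a channel file and write its content to out_fd.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t drain_channel(const char *path, int out_fd)
{
    struct stat st;
    void *buf;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    buf = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid */
    if (buf == MAP_FAILED)
        return -1;
    n = write(out_fd, buf, (size_t)st.st_size);   /* no intermediate copy */
    munmap(buf, (size_t)st.st_size);
    return n;
}
```

The data goes from the mapped buffer straight to the output descriptor, which is the "no conversion" point made above.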

> Sure, I have to grab stuff out of a filesystem instead of simply doing
> for (....)
> sendserial(buffer[i]);
>
> I know you can provide a nice function for doing so, but it will take
> another xxx kB of code instead of a 10 line simple solution.

Again, you haven't read the implementation and aren't familiar with its
mechanics. Basically, you should just need to provide the pointer to
the beginning of the relayfs buffer and do what you suggest above.

> Haha. If you have eventstamps and timestamps (even the jiffie + event
> based ones) nothing is hard to interpret. I guess the ethereal guys are
> rolling on the floor and laughing.
>
> The kernel is not the place to fix your postprocessing problems. Sure
> you have to do more complicated stuff, but you move the burden from
> kernel to a place where it does not hurt.
>
> What's hard about interpreting and filtering a stream of data ?

Umm, not having enough information in order for interpreting the data?

There is no postprocessing done in the kernel, please stop making
false claims. What is done is provide enough information to allow
simpler post-processing later. Splitting the stream on a per-cpu basis
is certainly not without merit. Plus, there is no cost in doing this,
each channel has a different ID and logging to it does not require
any form of string lookup (currently we're just checking a table to
make sure the ID is valid, but Roman suggested we dump this for pure
pointers instead and we've added this to our list.)

> What's complicated ? In case I want to have timing related tracing which
> includes printks, then storing the address where the printk is coming
> from is enough instead of a variable-length string. Storing some args in
> binary form with this address should not be too hard to achieve.
>
> Again, it's a postprocessing problem.

Sorry, I don't see how this is relevant to either relayfs or LTT.

> And therefore I need strings, HEX strings, XML ? A simple number and the
> data behind it give you all you need.
>
> Again, it's a postprocessing problem.

But that's exactly what we got already. Here's from include/linux/ltt-events.h:
/* Custom declared events */
/* ***WARNING*** These structures should never be used as is, use the
   provided custom event creation and logging functions. */
typedef struct _ltt_new_event {
        /* Basics */
        u32 id; /* Custom event ID */
        char type[LTT_CUSTOM_EV_TYPE_STR_LEN]; /* Event type description */
        char desc[LTT_CUSTOM_EV_DESC_STR_LEN]; /* Detailed event description */

        /* Custom formatting */
        u32 format_type; /* Type of formatting */
        char form[LTT_CUSTOM_EV_FORM_STR_LEN]; /* Data specific to format */
} LTT_PACKED_STRUCT ltt_new_event;
typedef struct _ltt_custom {
        u32 id; /* Event ID */
        u32 data_size; /* Size of data recorded by event */
        void *data; /* Data recorded by event */
} LTT_PACKED_STRUCT ltt_custom;

The ltt_new_event struct is only used once when the event is created.
Everything afterwards goes through an ltt_custom struct.
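
How a postprocessor might walk a stream of such "number plus data" records can be sketched as follows. The byte layout used here (32-bit id, 32-bit size, then the payload, in host byte order) is an assumption made for illustration; the thread does not show the actual LTT trace format, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Append one record (id, size, payload) at offset off.
 * Returns the new offset, or 0 if the record does not fit. */
static size_t put_custom(uint8_t *buf, size_t cap, size_t off,
                         uint32_t id, const void *data, uint32_t size)
{
    assert(buf != NULL);
    if (off > cap || cap - off < 8u + (size_t)size)
        return 0;
    memcpy(buf + off, &id, 4);
    memcpy(buf + off + 4, &size, 4);
    memcpy(buf + off + 8, data, size);
    return off + 8 + size;
}

/* Postprocessing side: walk the stream and count records with a given id. */
static unsigned count_id(const uint8_t *buf, size_t len, uint32_t want)
{
    unsigned hits = 0;
    size_t off = 0;

    while (len - off >= 8) {
        uint32_t id, size;

        memcpy(&id, buf + off, 4);
        memcpy(&size, buf + off + 4, 4);
        if (size > len - off - 8)
            break;                  /* truncated record, stop here */
        if (id == want)
            hits++;
        off += 8 + (size_t)size;
    }
    return hits;
}
```

The reader never needs to understand a payload in order to skip it, so unknown custom events do not break the parse.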

> Sure, I'm aware that I can switch everything off, but I cannot deselect
> specific tracepoints at compile time to reduce the overhead.
>
> If I want to have custom tracepoints for my specific problem, then why do I
> need the overhead of the other stuff ?

Ah, ok, you weren't as clear earlier. I don't see anything that precludes
us from adding the appropriate kconfig/#ifdef machinery to allow this. I'll
gladly take a patch from you.
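
Such per-tracepoint machinery could look roughly like the sketch below. The macro names `ltt_ev_irq_entry` and `ltt_ev_socket` are taken from the header quoted earlier in the thread; the `CONFIG_LTT_EV_*` symbols and `log_event()` are hypothetical and only stand in for whatever Kconfig options and logging call a real patch would use.

```c
#include <assert.h>

/* Hypothetical per-tracepoint switches; real ones would come from Kconfig. */
#define CONFIG_LTT_EV_IRQ 1
/* CONFIG_LTT_EV_SOCKET is deliberately left undefined. */

static int events_logged;

/* Stand-in for the real logging call. */
static void log_event(int id)
{
    assert(id >= 0);
    events_logged++;
}

#ifdef CONFIG_LTT_EV_IRQ
#define ltt_ev_irq_entry(ID, KERNEL) log_event(ID)
#else
#define ltt_ev_irq_entry(ID, KERNEL) do { } while (0)
#endif

#ifdef CONFIG_LTT_EV_SOCKET
#define ltt_ev_socket(ID, DATA1, DATA2) log_event(ID)
#else
#define ltt_ev_socket(ID, DATA1, DATA2) do { } while (0)
#endif

static void handle_something(void)
{
    ltt_ev_irq_entry(3, 1);     /* compiled in */
    ltt_ev_socket(7, 0, 0);     /* expands to an empty statement */
}
```

A deselected tracepoint leaves no code and no enabled-bit check behind at its call site.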

> If you consider the above example, which is taken from your code, as sane
> then we can stop talking about this.

That's not the point. You're bending over backwards as far as you can reach
trying to raise as much mud as you can, but when pressed for actual
constructive input you hide behind a strawman argument. If you don't
have anything to say, then stop whining.

> Karim, please do not use the FUD argument.
>
> I do not doubt that it is efficient from your point of view.
>
> But if short tests show this and I'm able to prove those numbers, you can
> hardly claim that the scaling from a 300MHz PIII to a 74MHz ARM SoC is wrong.
> It's simple math.

I like calling things by their name. You can say what you will but I
will bet on the casual observer's sense of reality to differentiate
between your "short tests" and the rounds of benchmarks we ran and
the results that we documented.

> Yes, the "you would anyway have to go down the same path we have"
> argument really scares me away from doing so.
>
> I don't buy this kind of argument.

You have every right to contest what I'm saying. But if you do wish
to enforce that right, it seems to me that I have the right to not
have my time wasted by having to parse through your unnecessary
ad-hominem attacks. There are justifications for our choices, and I
will do my best to present them to you.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-15 01:24:06

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Thomas Gleixner wrote:
> I do not accept unnecessary complexity in the kernel, when you are able
> to achieve the same goal by putting more thoughts into the
> postprocessing. The kernel code is responsible to provide a simple and
> fast interface for those tasks and nothing more. I don't see the point
> why we need 150k additional code with limitations/problems, which are
> even obvious without running it, instead of a simple interface to
> userland where different postprocessors can compete to do the job more
> or less perfect.

You have previously demonstrated that you do not understand the
implementation you are criticizing. You keep repeating the size
of the patch like a mantra, yet when pressed for actual bits of
code that need fixing, you use a circular argument to slip away.

If you feel that there is some unnecessary processing being done
in the kernel, please show me the piece of code affected so that
it can be fixed if it is broken.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-15 03:03:16

by William Lee Irwin III

Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> - Added bk-xfs to the -mm "external trees" lineup.
> - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
> haven't yet taken as close a look at LTT as I should have. Probably neither
> have you.
> It needs a bit of work on the kernel<->user periphery, which is not a big
> deal.
[...]

No idea what hit me just yet. x86-64 doesn't boot. Still going through
the various architectures. The same system (including the initrd FPOS
bullcrap, though, of course, I'm using an initrd built just for this
kernel) boots various 2.6.x up to 2.6.10-mm1. There are vague indications
something in/around SCSI and/or initrd's has violently exploded in my face.


-- wli

Booting '2.6.11-rc1-mm1'

kernel (hd0,0)/vmlinuz-2.6.11-rc1-mm1 early_printk=serial root=/dev/sda2 consol
e=ttyS0,9600 profile=1 debug initcall_debug nmi_watchdog=2 elevator=cfq splash=
silent showopts resume=/dev/sda3 desktop
[Linux-bzImage, setup=0x1600, size=0x1c4711]
initrd (hd0,0)/initrd-2.6.11-rc1-mm1
[Linux-initrd @ 0x37ceb000, 0x304d0d bytes]

Bootdata ok (command line is early_printk=serial root=/dev/sda2 console=ttyS0,9600 profile=1 debug initcall_debug nmi_watchdog=2 elevator=cfq splash=silent showopts resume=/dev/sda3 desktop)
Linux version 2.6.11-rc1-mm1 (wli@residue) (gcc version 3.3.3 (SuSE Linux)) #2 SMP Fri Jan 14 18:00:33 PST 2005
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000ebbd0 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ffd0000 (usable)
BIOS-e820: 000000007ffd0000 - 000000007ffdf000 (ACPI data)
BIOS-e820: 000000007ffdf000 - 0000000080000000 (ACPI NVS)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
ACPI: RSDP (v000 ACPIAM ) @ 0x00000000000f6710
ACPI: RSDT (v001 A M I OEMRSDT 0x05000427 MSFT 0x00000097) @ 0x000000007ffd0000
ACPI: FADT (v002 A M I OEMFACP 0x05000427 MSFT 0x00000097) @ 0x000000007ffd0200
ACPI: MADT (v001 A M I OEMAPIC 0x05000427 MSFT 0x00000097) @ 0x000000007ffd0390
ACPI: MCFG (v001 Intel Cayuse 0x00000001 MSFT 0x00000001) @ 0x000000007ffd0420
ACPI: OEMB (v001 A M I AMI_OEM 0x05000427 MSFT 0x00000097) @ 0x000000007ffdf040
ACPI: HPET (v001 A M I OEMHPET 0x05000427 MSFT 0x00000097) @ 0x000000007ffd7460
ACPI: DSDT (v001 CYCRB CYCRB039 0x00000039 INTL 0x02002026) @ 0x0000000000000000
No NUMA configuration found
Faking a node at 0000000000000000-0000000180000000
Bootmem setup node 0 0000000000000000-0000000180000000
On node 0 totalpages: 1572864
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 1568768 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 15:3 APIC version 16
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec81000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec81000, GSI 24-47
ACPI: IOAPIC (id[0x0a] address[0xfec81400] gsi_base[48])
IOAPIC[2]: apic_id 10, version 32, address 0xfec81400, GSI 48-71
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
ACPI: HPET id: 0x8086a202 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Checking aperture...
Built 1 zonelists
Initializing CPU#0
Kernel command line: early_printk=serial root=/dev/sda2 console=ttyS0,9600 profile=1 debug initcall_debug nmi_watchdog=2 elevator=cfq splash=silent showopts resume=/dev/sda3 desktop
kernel profiling enabled (shift: 1)
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 3400.235 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
Placing software IO TLB between 0x7528000 - 0x9528000
Memory: 4048952k/6291456k available (2395k kernel code, 0k reserved, 1484k data, 224k init)
Calibrating delay loop... 6750.20 BogoMIPS (lpj=3375104)
Security Framework v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU0: Thermal monitoring enabled (TM1)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU0: Intel(R) Xeon(TM) CPU 3.40GHz stepping 04
per-CPU timeslice cutoff: 1023.90 usecs.
task migration cache decay timeout: 2 msecs.
Booting processor 1/6 rip 6000 rsp ffff81007ff95f58
Initializing CPU#1
Calibrating delay loop... 6782.97 BogoMIPS (lpj=3391488)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 3
CPU1: Thermal monitoring enabled (TM1)
Intel(R) Xeon(TM) CPU 3.40GHz stepping 04
Booting processor 2/1 rip 6000 rsp ffff810037c8df58
Initializing CPU#2
Calibrating delay loop... 6782.97 BogoMIPS (lpj=3391488)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU2: Thermal monitoring enabled (TM1)
Intel(R) Xeon(TM) CPU 3.40GHz stepping 04
Booting processor 3/7 rip 6000 rsp ffff81007ff03f58
Initializing CPU#3
Calibrating delay loop... 6782.97 BogoMIPS (lpj=3391488)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 3
CPU3: Thermal monitoring enabled (TM1)
Intel(R) Xeon(TM) CPU 3.40GHz stepping 04
Total of 4 processors activated (27099.13 BogoMIPS).
Using local APIC timer interrupts.
Detected 12.500 MHz APIC timer.
checking TSC synchronization across 4 CPUs: passed.
time.c: Using HPET based timekeeping.
Brought up 4 CPUs
CPU0 attaching sched-domain:
domain 0: span 05
groups: 01 04
domain 1: span 0f
groups: 05 0a
domain 2: span 0f
groups: 0f
CPU1 attaching sched-domain:
domain 0: span 0a
groups: 02 08
domain 1: span 0f
groups: 0a 05
domain 2: span 0f
groups: 0f
CPU2 attaching sched-domain:
domain 0: span 05
groups: 04 01
domain 1: span 0f
groups: 05 0a
domain 2: span 0f
groups: 0f
CPU3 attaching sched-domain:
domain 0: span 0a
groups: 08 02
domain 1: span 0f
groups: 0a 05
domain 2: span 0f
groups: 0f
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
Calling initcall 0xffffffff805633a0: cpufreq_tsc+0x0/0x90()
Calling initcall 0xffffffff8056e390: init_elf32_binfmt+0x0/0x10()
Calling initcall 0xffffffff80570180: helper_init+0x0/0x40()
Calling initcall 0xffffffff80570280: pm_init+0x0/0x30()
Calling initcall 0xffffffff80570400: ksysfs_init+0x0/0x30()
Losing some ticks... checking if CPU frequency changed.
Calling initcall 0xffffffff80572510: filelock_init+0x0/0x40()
Calling initcall 0xffffffff80572ce0: init_script_binfmt+0x0/0x10()
Calling initcall 0xffffffff80572cf0: init_elf_binfmt+0x0/0x10()
Calling initcall 0xffffffff805809f0: netlink_proto_init+0x0/0x200()
NET: Registered protocol family 16
Calling initcall 0xffffffff805744c0: kobject_uevent_init+0x0/0x40()
Calling initcall 0xffffffff805745a0: pcibus_class_init+0x0/0x10()
Calling initcall 0xffffffff80574c20: pci_driver_init+0x0/0x10()
Calling initcall 0xffffffff80578520: tty_class_init+0x0/0x30()
Calling initcall 0xffffffff8057ac90: register_node_type+0x0/0x10()
Calling initcall 0xffffffff80566490: mtrr_if_init+0x0/0x80()
Calling initcall 0xffffffff8057f100: pci_direct_init+0x0/0x1b0()
PCI: Using configuration type 1
Calling initcall 0xffffffff8057fe30: pci_mmcfg_init+0x0/0x90()
PCI: Using MMCONFIG at e0000000
Calling initcall 0xffffffff805662f0: mtrr_init+0x0/0x1a0()
mtrr: v2.0 (20020519)
Calling initcall 0xffffffff8056d290: topology_init+0x0/0x70()
Calling initcall 0xffffffff801541a0: pm_sysrq_init+0x0/0x20()
Calling initcall 0xffffffff80572240: init_bio+0x0/0x190()
Calling initcall 0xffffffff805754d0: fbmem_init+0x0/0xb0()
Calling initcall 0xffffffff80577622: acpi_init+0x0/0x1f1()
ACPI: Subsystem revision 20041203
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
Calling initcall 0xffffffff8057792c: acpi_ec_init+0x0/0x5e()
Calling initcall 0xffffffff80577d09: acpi_pci_root_init+0x0/0x20()
Calling initcall 0xffffffff80577e85: acpi_pci_link_init+0x0/0x42()
Calling initcall 0xffffffff80577ec7: acpi_power_init+0x0/0x74()
Calling initcall 0xffffffff80577f3b: acpi_system_init+0x0/0xc7()
Calling initcall 0xffffffff80578002: acpi_event_init+0x0/0x3e()
Calling initcall 0xffffffff80578040: acpi_scan_init+0x0/0xc4()
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.2
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P2._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P2.P2P3._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P2.P2P4._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P6._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 *7 10 11 12 14 15)
Calling initcall 0xffffffff80578e60: misc_init+0x0/0x90()
Calling initcall 0xffffffff8057ae10: device_init+0x0/0x40()
Calling initcall 0xffffffff8057e850: input_init+0x0/0x170()
Calling initcall 0xffffffff8057f320: pci_acpi_init+0x0/0x130()
PCI: Using ACPI for IRQ routing
** PCI interrupts are no longer routed automatically. If this
** causes a device to stop working, it is probably because the
** driver failed to call pci_enable_device(). As a temporary
** workaround, the "pci=routeirq" argument restores the old
** behavior. If this argument makes the device work again,
** please email the output of "lspci" to [email protected]
** so I can fix the driver.
Calling initcall 0xffffffff8057f450: pci_legacy_init+0x0/0x100()
Calling initcall 0xffffffff8057f970: pcibios_irq_init+0x0/0x450()
Calling initcall 0xffffffff8057fdc0: pcibios_init+0x0/0x70()
Calling initcall 0xffffffff80580360: net_dev_init+0x0/0x200()
Calling initcall 0xffffffff805808f0: pktsched_init+0x0/0xc0()
Calling initcall 0xffffffff805809b0: tc_filter_init+0x0/0x40()
Calling initcall 0xffffffff80563430: late_hpet_init+0x0/0xc0()
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 69ns tick, 3 64-bit timers
Calling initcall 0xffffffff8056c1f0: pci_iommu_init+0x0/0x610()
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Calling initcall 0xffffffff80572490: init_pipe_fs+0x0/0x50()
Calling initcall 0xffffffff80578104: acpi_motherboard_init+0x0/0x1bc()
Calling initcall 0xffffffff805782c0: chr_dev_init+0x0/0x90()
Calling initcall 0xffffffff8057ed20: cpufreq_gov_performance_init+0x0/0x10()
Calling initcall 0xffffffff8057ed30: pcibios_assign_resources+0x0/0xf0()
Calling initcall 0xffffffff8057fec0: fill_mp_bus_to_cpumask+0x0/0x100()
Calling initcall 0xffffffff80112800: time_init_device+0x0/0x30()
Calling initcall 0xffffffff80564a80: init_timer_sysfs+0x0/0x30()
Calling initcall 0xffffffff80564a50: i8259A_init_sysfs+0x0/0x30()
Calling initcall 0xffffffff80564f50: vsyscall_init+0x0/0x90()
Calling initcall 0xffffffff805654b0: sbf_init+0x0/0xd0()
Calling initcall 0xffffffff80566040: mce_init_device+0x0/0xf0()
Calling initcall 0xffffffff80565fd0: periodic_mcheck_init+0x0/0x30()
Calling initcall 0xffffffff80568300: init_lapic_sysfs+0x0/0x40()
Calling initcall 0xffffffff805697d0: ioapic_init_sysfs+0x0/0xd0()
Calling initcall 0xffffffff8056d330: x8664_sysctl_init+0x0/0x20()
Calling initcall 0xffffffff8056e370: ia32_init+0x0/0x20()
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
Calling initcall 0xffffffff8056e3a0: ia32_binfmt_init+0x0/0x20()
Calling initcall 0xffffffff8056e3c0: init_syscall32+0x0/0x120()
Calling initcall 0xffffffff8056e4e0: init_aout_binfmt+0x0/0x10()
Calling initcall 0xffffffff8056f450: create_proc_profile+0x0/0x410()
Calling initcall 0xffffffff8056f930: ioresources_init+0x0/0x50()
Calling initcall 0xffffffff8056fae0: uid_cache_init+0x0/0xb0()
Calling initcall 0xffffffff8056fed0: param_sysfs_init+0x0/0x200()
Calling initcall 0xffffffff805700d0: init_posix_timers+0x0/0xb0()
Calling initcall 0xffffffff805701c0: init+0x0/0x60()
Calling initcall 0xffffffff80570220: proc_dma_init+0x0/0x30()
Calling initcall 0xffffffff8014f870: percpu_modinit+0x0/0x90()
Calling initcall 0xffffffff80570250: kallsyms_init+0x0/0x30()
Calling initcall 0xffffffff805702b0: ikconfig_init+0x0/0x40()
Calling initcall 0xffffffff80570370: audit_init+0x0/0x90()
audit: initializing netlink socket (disabled)
audit(1105757136.391:0): initialized
Calling initcall 0xffffffff80570fb0: init_per_zone_pages_min+0x0/0x50()
Calling initcall 0xffffffff80571ae0: pdflush_init+0x0/0x20()
Calling initcall 0xffffffff80571b00: cpucache_init+0x0/0x30()
Calling initcall 0xffffffff80571e80: kswapd_init+0x0/0x60()
Calling initcall 0xffffffff80571f20: procswaps_init+0x0/0x30()
Calling initcall 0xffffffff80571f50: hugetlb_init+0x0/0xb0()
Total HugeTLB memory allocated, 0
Calling initcall 0xffffffff805720d0: init_tmpfs+0x0/0xe0()
Calling initcall 0xffffffff805724e0: fasync_init+0x0/0x30()
Calling initcall 0xffffffff80572b20: aio_setup+0x0/0x70()
Calling initcall 0xffffffff80572b90: eventpoll_init+0x0/0xf0()
Calling initcall 0xffffffff80572c80: init_sys32_ioctl+0x0/0x60()
Calling initcall 0xffffffff80572d00: init_mbcache+0x0/0x30()
Calling initcall 0xffffffff80572d30: dquot_init+0x0/0x100()
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Calling initcall 0xffffffff80572e30: dnotify_init+0x0/0x30()
Calling initcall 0xffffffff805732b0: init_devpts_fs+0x0/0x40()
Calling initcall 0xffffffff805732f0: init_ext2_fs+0x0/0x70()
Calling initcall 0xffffffff805733a0: init_ramfs_fs+0x0/0x10()
Calling initcall 0xffffffff805733c0: init_hugetlbfs_fs+0x0/0x80()
Calling initcall 0xffffffff80573440: init_minix_fs+0x0/0x60()
Calling initcall 0xffffffff805734a0: init_iso9660_fs+0x0/0x70()
Calling initcall 0xffffffff805735a0: init_nfs_fs+0x0/0xa0()
Calling initcall 0xffffffff80573d00: init_nlm+0x0/0x30()
Calling initcall 0xffffffff80573d30: ipc_init+0x0/0x20()
Calling initcall 0xffffffff80573ed0: init_mqueue_fs+0x0/0xe0()
Calling initcall 0xffffffff80574100: selinux_nf_ip_init+0x0/0x60()
SELinux: Registering netfilter hooks
Calling initcall 0xffffffff80574240: init_sel_fs+0x0/0x70()
Calling initcall 0xffffffff805742b0: selnl_init+0x0/0x50()
Calling initcall 0xffffffff80574300: sel_netif_init+0x0/0x80()
Calling initcall 0xffffffff80574420: init_crypto+0x0/0x20()
Initializing Cryptographic API
Calling initcall 0xffffffff80574470: init+0x0/0x10()
Calling initcall 0xffffffff80574480: init+0x0/0x40()
Calling initcall 0xffffffff80228860: pci_init+0x0/0x30()
Intel E7520/7320/7525 detected.<7>Calling initcall 0xffffffff80574c30: pci_sysfs_init+0x0/0x40()
Calling initcall 0xffffffff80574c70: pci_proc_init+0x0/0x70()
Calling initcall 0xffffffff805751a0: fb_console_init+0x0/0x70()
Calling initcall 0xffffffff80576b50: vesafb_init+0x0/0x68()
Calling initcall 0xffffffff80577e50: irqrouter_init_sysfs+0x0/0x35()
Calling initcall 0xffffffff80578350: rand_initialize+0x0/0x1b0()
Calling initcall 0xffffffff80578550: tty_init+0x0/0x1e0()
Calling initcall 0xffffffff80578770: inotify_init+0x0/0x100()
inotify device minor=63
Calling initcall 0xffffffff80578870: pty_init+0x0/0x5f0()
Calling initcall 0xffffffff80579430: rtc_init+0x0/0x200()
Real Time Clock Driver v1.12
Calling initcall 0xffffffff80579630: hpet_init+0x0/0x70()
hpet_acpi_add: no address or irqs in _CRS
Calling initcall 0xffffffff805796a0: nvram_init+0x0/0x90()
Non-volatile memory driver v1.2
Calling initcall 0xffffffff80579790: agp_init+0x0/0x30()
Linux agpgart interface v0.101 (c) Dave Jones
Calling initcall 0xffffffff805798a0: serio_init+0x0/0x60()
Calling initcall 0xffffffff80579990: i8042_init+0x0/0x650()
Calling initcall 0xffffffff8057a3d0: serial8250_init+0x0/0x110()
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Calling initcall 0xffffffff8057a5b0: serial8250_pci_init+0x0/0x10()
Calling initcall 0xffffffff80286e90: elevator_global_init+0x0/0x10()
Calling initcall 0xffffffff8057ae50: noop_init+0x0/0x10()
io scheduler noop registered
Calling initcall 0xffffffff8057ae60: as_init+0x0/0x60()
io scheduler anticipatory registered
Calling initcall 0xffffffff8057aec0: deadline_init+0x0/0x60()
io scheduler deadline registered
Calling initcall 0xffffffff80294810: cfq_init+0x0/0xb0()
io scheduler cfq registered (default)
Calling initcall 0xffffffff8057af20: rd_init+0x0/0x1c0()
RAMDISK driver initialized: 16 RAM disks of 128000K size 1024 blocksize
Calling initcall 0xffffffff8057b150: loop_init+0x0/0x340()
loop: loaded (max 8 devices)
Calling initcall 0xffffffff8057b500: net_olddevs_init+0x0/0xe0()
Calling initcall 0xffffffff80296e30: aec62xx_ide_init+0x0/0x10()
Calling initcall 0xffffffff80297540: ali15x3_ide_init+0x0/0x10()
Calling initcall 0xffffffff80298630: amd74xx_ide_init+0x0/0x10()
Calling initcall 0xffffffff80299820: atiixp_ide_init+0x0/0x10()
Calling initcall 0xffffffff80299dd0: cmd64x_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029b200: sc1200_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029bd20: cy82c693_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029c050: hpt34x_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029c730: hpt366_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029e640: ns87415_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029ea00: pdc202xx_ide_init+0x0/0x10()
Calling initcall 0xffffffff8029fac0: pdc202new_ide_init+0x0/0x10()
Calling initcall 0xffffffff8057c3c0: piix_ide_init+0x0/0xd0()
Calling initcall 0xffffffff802a1050: rz1000_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a1130: svwks_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a1ac0: siimage_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a31f0: sis5513_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a4280: slc90e66_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a47e0: triflex_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a4cb0: via_ide_init+0x0/0x10()
Calling initcall 0xffffffff802a5f40: generic_ide_init+0x0/0x10()
Calling initcall 0xffffffff8057df40: ide_init+0x0/0x80()
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Calling initcall 0xffffffff8057e820: ide_generic_init+0x0/0x20()
Probing IDE interface ide0...
hda: TEAC DW-548D, ATAPI CD/DVD-ROM drive
ide1: I/O resource 0x170-0x177 not free.
ide1: ports already in use, skipping probe
Probing IDE interface ide2...
ide2: Wait for ready failed before probe !
Probing IDE interface ide3...
ide3: Wait for ready failed before probe !
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
Probing IDE interface ide5...
ide5: Wait for ready failed before probe !
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Calling initcall 0xffffffff802b3bf0: idedisk_init+0x0/0x10()
Calling initcall 0xffffffff802b5cb0: ide_cdrom_init+0x0/0x20()
hda: ATAPI 48X DVD-ROM CD-R/RW drive, 2048kB Cache
Uniform CD-ROM driver Revision: 3.20
Calling initcall 0xffffffff802ba550: idefloppy_init+0x0/0x30()
ide-floppy driver 0.99.newide
Calling initcall 0xffffffff8057e840: cdrom_init+0x0/0x10()
Calling initcall 0xffffffff8057e9c0: mousedev_init+0x0/0xe0()
mice: PS/2 mouse device common for all mice
Calling initcall 0xffffffff8057eaa0: atkbd_init+0x0/0x20()
Calling initcall 0xffffffff8057eac0: psmouse_init+0x0/0xb0()
Calling initcall 0xffffffff8057eb70: pcspkr_init+0x0/0x80()
input: PC Speaker
Calling initcall 0xffffffff8057ebf0: md_init+0x0/0x130()
md: md driver 0.90.1 MAX_MD_DEVS=256, MD_SB_DISKS=27
Calling initcall 0xffffffff80580140: flow_cache_init+0x0/0x220()
Calling initcall 0xffffffff805807c0: llc_init+0x0/0x70()
Calling initcall 0xffffffff80580830: snap_init+0x0/0x40()
Calling initcall 0xffffffff80580870: rif_init+0x0/0x80()
Calling initcall 0xffffffff805816b0: inet_init+0x0/0x3f0()
NET: Registered protocol family 2
IP: routing cache hash table of 32768 buckets, 512Kbytes
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
Calling initcall 0xffffffff80583970: tcpdiag_init+0x0/0x30()
Calling initcall 0xffffffff80583b70: af_unix_init+0x0/0x80()
NET: Registered protocol family 1
Calling initcall 0xffffffff80583bf0: init_sunrpc+0x0/0x50()
Calling initcall 0xffffffff80583c40: init_rpcsec_gss+0x0/0x40()
Calling initcall 0xffffffff80583c80: init_kerberos_module+0x0/0x25()
Calling initcall 0xffffffff80568c60: init_lapic_nmi_sysfs+0x0/0x40()
Calling initcall 0xffffffff8025480c: acpi_poweroff_init+0x0/0x3a()
Calling initcall 0xffffffff80577450: acpi_wakeup_device_init+0x0/0xec()
ACPI wakeup devices:
P0P1 MC97 USB1 USB2 USB3 USB4 EUSB P2P3 P2P4
Calling initcall 0xffffffff8057755d: acpi_sleep_init+0x0/0xc5()
ACPI: (supports S0 S1 S3 S4 S5)
Calling initcall 0xffffffff80254d18: acpi_sleep_proc_init+0x0/0x94()
Calling initcall 0xffffffff80578500: seqgen_init+0x0/0x20()
Calling initcall 0xffffffff8057a3a0: serial8250_late_console_init+0x0/0x30()
Calling initcall 0xffffffff8057aab0: early_uart_console_switch+0x0/0x90()
Calling initcall 0xffffffff802e6090: net_random_reseed+0x0/0x50()
Calling initcall 0xffffffff80582820: ip_auto_config+0x0/0xf00()
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
RAMDISK: Compressed image found at block 0
VFS: Waiting 19sec for root device...
VFS: Waiting 18sec for root device...
VFS: Waiting 17sec for root device...
VFS: Waiting 16sec for root device...
VFS: Waiting 15sec for root device...
VFS: Waiting 14sec for root device...
VFS: Waiting 13sec for root device...
VFS: Waiting 12sec for root device...
VFS: Waiting 11sec for root device...
VFS: Waiting 10sec for root device...
VFS: Waiting 9sec for root device...
VFS: Waiting 8sec for root device...
VFS: Waiting 7sec for root device...
VFS: Waiting 6sec for root device...
VFS: Waiting 5sec for root device...
VFS: Waiting 4sec for root device...
VFS: Waiting 3sec for root device...
VFS: Waiting 2sec for root device...
VFS: Waiting 1sec for root device...
VFS: Cannot open root device "sda2" or unknown-block(0,0)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

2005-01-15 04:13:14

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> This doesn't mean everything has to be put into a single call. Several
> parameters can still be set after creation.

I don't have a problem with that. If that's preferable, then we can do
it this way too.

> Why should a subsystem care about the details of the buffer management?

Because it wants to enforce a data format on buffer boundaries.

Let me explain how this applies in the case of LTT, but this easily
generalizes itself to any sort of subsystem that needs to transfer
large amounts of information between the kernel and user-space. And
to avoid any confusion, let me repeat that relayfs is not intended
just for conveying debug/performance/trace info.

Basically, in the case of LTT at least, the kernel tracing infrastructure
must provide a stream of data to the user-space tools that they will in
turn process and display to the user. At this point it must be said that
what you write and how you write it in the trace depends largely on a
few key issues. Namely:
- How much data you expect to be generating.
- What you intend to do with it.

Given ltt's target audience (mainstream developers, sysadmins, and power-
users), one of the goals was to have a trace format that provided
easy browsing forward and backwards, and random access. Initially,
this was implemented using two 1MB buffers, one that was being written to
while the other one was being written to disk. So, in essence, we had
random access at 1MB boundaries. For reading backwards, the size of the
event is written at the end of the event and we just need to read
2 bytes prior to the current event to know where the previous event
started.
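
A minimal user-space sketch of that backward step follows. It assumes,
beyond what is stated above, that the trailing 2-byte size covers the whole
record (including the size field itself) and is stored in host byte order;
prev_event_offset() is a hypothetical helper, not part of the LTT tools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Given the offset of the current event inside buf, return the offset of
 * the previous event, or (size_t)-1 if there is none or the data looks
 * corrupt.
 * Assumption: the 2 bytes just before the current event hold the total
 * size of the previous record, size field included.
 */
size_t prev_event_offset(const uint8_t *buf, size_t cur)
{
    uint16_t size;

    assert(buf != NULL);
    if (cur < sizeof(size))
        return (size_t)-1;      /* already at the first event */

    /* memcpy avoids unaligned access on strict architectures. */
    memcpy(&size, buf + cur - sizeof(size), sizeof(size));
    if (size < sizeof(size) || size > cur)
        return (size_t)-1;      /* corrupt or truncated record */

    return cur - size;
}
```

Calling this repeatedly walks the buffer from the last event back to the
first without any index built beforehand.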

Eventually we found that this format was rather bulky, and that it
recorded superfluous data. Amongst other things we relied on a single
buffer, so with each event we logged the CPU-ID of the processor on
which the event occurred. So, in order to reduce the amount of data
recorded and in trying to obtain better performance at runtime by
avoiding a call to do_gettimeofday for every event, we did the
following:
- Eliminate the CPU-ID => use per-cpu buffers instead.
- Stop calling do_gettimeofday when possible => instead write a
complete time-stamp at sub-buffer boundaries (beginning and end;
because of clock drift) and only read the lower-half of the TSC
for each event. Determining an event's actual time is done in
post-mortem in user-space.
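
One way the post-mortem step could work is linear interpolation between the
two boundary stamps. The sketch below is an illustration under assumptions,
not the actual LTT user-space code: struct boundary and event_time_usec()
are hypothetical names, the wall-clock time is flattened to microseconds,
and the low 32 bits of the TSC are assumed to wrap at most once between the
start and the end of a sub-buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of a sub-buffer boundary stamp. */
struct boundary {
    int64_t  usec;  /* wall-clock time in microseconds */
    uint32_t tsc;   /* low 32 bits of the TSC at that moment */
};

/* Estimate the wall-clock time of an event from its 32-bit TSC sample. */
int64_t event_time_usec(const struct boundary *start,
                        const struct boundary *end,
                        uint32_t event_tsc)
{
    /* Unsigned subtraction stays correct across a single 32-bit wrap. */
    uint32_t span = end->tsc - start->tsc;
    uint32_t off  = event_tsc - start->tsc;
    uint64_t elapsed;

    assert(end->usec >= start->usec);
    if (span == 0)
        return start->usec;     /* degenerate sub-buffer */

    elapsed = (uint64_t)(end->usec - start->usec);
    return start->usec + (int64_t)((uint64_t)off * elapsed / span);
}
```

Because both boundaries carry a full time-stamp, clock drift is absorbed
per sub-buffer rather than accumulating over the whole trace.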

So how does this translate in practice? Here's the trace header. This
is written only once at the start of the trace:
/* Information logged when a trace is started */
typedef struct _ltt_trace_start {
        u32 magic_number;
        u32 arch_type;
        u32 arch_variant;
        u32 system_type;
        u8 major_version;
        u8 minor_version;

        u32 buffer_size;
        ltt_event_mask event_mask;
        ltt_event_mask details_mask;
        u8 log_cpuid;
        u8 use_tsc;
        u8 flight_recorder;
} LTT_PACKED_STRUCT ltt_trace_start;

This is written at the beginning of every new sub-buffer:
/* Start of trace buffer information */
typedef struct _ltt_buffer_start {
        struct timeval time; /* Time stamp of this buffer */
        u32 tsc; /* TSC of this buffer, if applicable */
        u32 id; /* Unique buffer ID */
} LTT_PACKED_STRUCT ltt_buffer_start;

This is written at the end of every sub-buffer:
typedef struct _ltt_buffer_end {
        struct timeval time; /* Time stamp of this buffer */
        u32 tsc; /* TSC of this buffer, if applicable */
} LTT_PACKED_STRUCT ltt_buffer_end;

As you can see, we can't just dump this information in an event channel.
This is really intrinsic to how the trace data is going to be read
later on. Removing this data would require more data for each event to
be logged, and require parsing through the trace before reading it in
order to obtain markers allowing random access. This wouldn't be so
bad if we were expecting users to use LTT sporadically for very short
periods of time. However, given ltt's target audience (i.e. need to
run traces for hours, maybe days, weeks), traces would rapidly become
useless because while plowing through a few hundred KBs of data and
allocating RAM for building internal structures as you go is fine,
plowing through tens of GBs of data, possibly hundreds, requires that
you come up with a format that won't require unreasonable resources
from your system, while incurring negligible runtime costs for generating
it. We believe the format we currently have achieves the right balance
here.

So what happens now is that ltt tells relayfs when creating a channel
how much space it needs for these basic structures, and provides it
with callbacks which are invoked at boundaries for filling the actual
reserved space. In all other circumstances, here's what we are writing
into the relayfs buffer for each event:
- Event ID (1 byte)
- Time delta (4 bytes) => this is the low 32 bits of the TSC or a
diff between the current do_gettimeofday and the one at buffer start.
- Event details (variable length, see include/linux/ltt-events.h)
- Event size (2 bytes)
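
To make the layout concrete, here is a user-space sketch that packs one
record in the order listed above. pack_event() is a hypothetical helper and
not the in-kernel code. It further assumes host byte order and that the
trailing size counts the whole record, including the size field itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EVENT_FIXED_BYTES 7u    /* 1 (ID) + 4 (time delta) + 2 (size) */

/*
 * Pack one event into dst. Returns the number of bytes written, or 0 if
 * the record does not fit in the remaining room or in the 16-bit size.
 */
size_t pack_event(uint8_t *dst, size_t room, uint8_t id, uint32_t delta,
                  const void *details, size_t details_len)
{
    size_t total;
    uint16_t size;

    assert(dst != NULL);
    if (details_len > UINT16_MAX - EVENT_FIXED_BYTES)
        return 0;               /* size field could not hold it */
    total = EVENT_FIXED_BYTES + details_len;
    if (total > room)
        return 0;               /* caller must switch sub-buffers */

    size = (uint16_t)total;
    dst[0] = id;                                    /* event ID */
    memcpy(dst + 1, &delta, sizeof(delta));         /* time delta */
    if (details_len > 0)
        memcpy(dst + 5, details, details_len);      /* event details */
    memcpy(dst + 5 + details_len, &size, sizeof(size)); /* event size */
    return total;
}
```

Placing the size last is what allows a reader positioned at any event to
step back to the previous one.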

Of course there are possible improvements. For one thing, we've
discussed dropping the "event size" altogether, relying on smaller
buffers and dynamically creating sub-buffer indexing tables for reading
backwards. This is still part of a work in progress which aims at
creating an even better and more flexible format. Of course in an
ideal world this new format and the corresponding user tools would
be available as we speak, but there's only so much that can be done
without having an existing solid base to work from. As usual,
we're open to any other outside suggestions.

> You could move all this into the relay layer by making a relay channel
> an event channel. I know you want to save space, but having a magic
> event_struct_size array is not a good idea. If you have that much events,
> that a little more overhead causes problems, the tracing results won't be
> reliable anymore anyway.

I hope what I said above explains why this isn't possible.

> Simplicity and maintainability are far more important than saving a few
> bytes, the general case should be fast and simple, leave the complexity to
> the special cases.

I agree. I also realize that not all relayfs clients will have the
same requirements as ltt. Already, ltt uses a few things from relayfs
that others are unlikely to need. For example, it directly invokes
relay_lock_channel() to directly lock a channel and relay_write_direct()
to directly write to the buffers without relying on the usual
relay_write() which takes care of both. This allows LTT to do
zero-copy (i.e. no need to pack a buffer before committing it.) Other
subsystems may actually not use any relayfs function to write, but
instead write directly to a channel as if it was an allocated buffer
(which in fact it is). In all cases, though, the open(), mmap(),
write() semantic makes it very simple for user-space applications
to process channeled data.

So here's a suggested change. Instead of the current relay_open()
API, here are three replacement functions (inspired by Tim's input
and your comments above):
    relay_open(channel_path, mode, bufsize, nbufs);
    relay_set_property(property, value);
    relay_get_property(property, &value);

Is this more palatable?
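
For illustration only, here is a user-space mock of how the three calls
might fit together. Nothing below is relayfs code: the struct, the
property keys and the mock_ prefix are hypothetical, and the explicit
channel argument on the property calls is an assumption, since the
proposal above does not say how a channel is identified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical property keys; the thread does not name any. */
enum relay_property { RELAY_PROP_BUFSIZE, RELAY_PROP_NBUFS, RELAY_PROP_MAX };

/* User-space stand-in for a channel; real relayfs state is far richer. */
struct mock_channel {
    const char *path;
    int mode;
    long props[RELAY_PROP_MAX];
};

/* Mirrors the proposed relay_open(channel_path, mode, bufsize, nbufs). */
static int mock_relay_open(struct mock_channel *ch, const char *channel_path,
                           int mode, long bufsize, long nbufs)
{
    assert(ch != NULL);
    if (channel_path == NULL || bufsize <= 0 || nbufs <= 0)
        return -1;
    ch->path = channel_path;
    ch->mode = mode;
    ch->props[RELAY_PROP_BUFSIZE] = bufsize;
    ch->props[RELAY_PROP_NBUFS] = nbufs;
    return 0;
}

/* The proposal omits a channel argument; one is assumed here. */
static int mock_relay_set_property(struct mock_channel *ch,
                                   enum relay_property property, long value)
{
    if (ch == NULL || property >= RELAY_PROP_MAX)
        return -1;
    ch->props[property] = value;
    return 0;
}

static int mock_relay_get_property(const struct mock_channel *ch,
                                   enum relay_property property, long *value)
{
    if (ch == NULL || value == NULL || property >= RELAY_PROP_MAX)
        return -1;
    *value = ch->props[property];
    return 0;
}
```

The point of the split is that only the sizing parameters are needed at
instantiation, while everything else can be adjusted afterwards.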

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-15 04:18:25

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Tim Bird wrote:
> Some of these options (e.g. bufsize) are available to the user
> via tracedaemon. I can honestly say I haven't got a clue what
> to use for some of them, and so always leave them at defaults.

Yes, but those defaults were chosen by a person who understood the
kernel part's use of the buffer space, right? Presumably if you
are writing your own relayfs client you know what type of
throughput to expect and what size you'd like your buffers to
be (bufsize and nbufs), so you need to be able to set this somehow
and it only seems right that this be done upon instantiation.

> Could these be simplified to a few enumerated modes?

I don't see how. Do you have actual examples?

As for the other fields, please see my response to Roman.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-15 08:43:19

by Miklos Szeredi

Subject: Re: 2.6.11-rc1-mm1

Some things I'd like to see (as I am currently using the KIO
equivalent) implemented as FUSE fs:
- "fish", virtual file access over ssh

This is already available here:

http://sourceforge.net/projects/fuse

You need to download fuse-2.2-pre3 and sshfs-1.0. It should work on
any kernel including the 2.6.10-rc1-mm1 with FUSE compiled in.

Miklos

2005-01-15 08:45:48

by Miklos Szeredi

Subject: Re: 2.6.11-rc1-mm1

Sorry about the missing quotes. It should read:

You wrote:
> Some things I'd like to see (as I am currently using the KIO
> equivalent) implemented as FUSE fs:
> - "fish", virtual file access over ssh

This is already available here:

http://sourceforge.net/projects/fuse

You need to download fuse-2.2-pre3 and sshfs-1.0. It should work on
any kernel including the 2.6.10-rc1-mm1 with FUSE compiled in.

Miklos

2005-01-15 09:57:20

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

Hi Karim,

On Fri, 2005-01-14 at 20:14 -0500, Karim Yaghmour wrote:
> Gee Thomas, I guess you really want to take this one until the last
> man is standing. Feel free to use the ad-hominem tone if it suits
> you. Don't hold it against me though if I don't bite :)

No personal offence was intended.

> Thomas Gleixner wrote:
> > It's not only me, who needs constant time. Everybody interested in
> > tracing will need that. In my opinion its a principle of tracing.
>
> relayfs is a generalized buffering mechanism. Tracing is one application
> it serves. Check out the web site: "high-speed data-relay filesystem."
> Fancy name huh ...

I do not doubt that.

But hardwiring an instrumentation framework on it is also hardwiring
implicit restrictions on the usability of the instrumentation for
certain purposes.

> > The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
> > locks by do { } while loops. So what ?
>
> Well for one thing, a portion of code running in user-context won't
> disable interrupts while it's attempting to get buffer space, and
> therefore won't impact on interrupt delivery.

The do {} while loops are in the fast ltt_log_event path

> Clearly you haven't read the implementation and/or aren't familiar with
> its use. Usually, what you want to do is open(), mmap(), write(), there
> is no "conversion" to a file. The filesystem abstraction is just a
> namespace holder for us.

I have read it and tried it. I don't see why I couldn't map a
ringbuffer into user space.
I'm not beating on the ringbuffer, I'm just using it as an example. :)

> That's not the point. You're bending backwards as far as you can reach
> trying to raise as much mud as you can, but when pressed for actual
> constructive input you hide behind a strawman argument. If you don't
> have anything to say, then stop whining.

I gave constructive criticism, and along the way pointed out the
restrictions and weaknesses of the implementation.

tglx


2005-01-15 10:20:57

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

On Fri, 2005-01-14 at 20:25 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
>
> You have previously demonstrated that you do not understand the
> implementation you are criticizing. You keep repeating the size
> of the patch like a mantra, yet when pressed for actual bits of
> code that need fixing, you use a circular argument to slip away.

Yeah, did you answer any of my arguments, apart from claiming that I'm
too stupid to understand how it works?

I completely understand what this code does and I'm not beating on the
patch size. I'm beating on the timing burden and the restrictions
imposed by the implementation.

I have no objection against relayfs itself. I can just leave the config
switch off, so it does not affect me.

Adding instrumentation to the kernel is a good thing.

I just don't like the idea that instrumentation is bound to relayfs and
adds a feature to the kernel which fits a restricted set of problems,
rather than providing a generic, optimized instrumentation framework
where one can use relayfs as a backend if it fits one's needs. Making
this less glued together leaves open the possibility of using other
backends.

> If you feel that there is some unncessary processing being done
> in the kernel, please show me the piece of code affected so that
> it can be fixed if it is broken.

Just doing codepath analysis shows me:

There is a loop in ltt_log_event which forces each event to be
processed twice. Splitting traces is postprocessing and can be done
elsewhere.

_ltt_log_event contains quite a bunch of if(...) processing decisions
which have to be evaluated for _each_ event.

The relay_reserve code can loop in the do { } while() and even go into a
slow path where another do { } while() is found.
So it cannot be used in fast paths or for tracking down timing-related
problems, because it adds variable time overhead.

Because the ltt_log_event path is not preempt safe, you can
actually hit the additional pass through the do { } while() loop.
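
The variable-time concern can be seen in any lockless reservation
scheme. The sketch below is not the relay_reserve code; it is a toy
user-space version built on C11 atomics, with hypothetical names, that
only shows where the retry comes from.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy buffer: only the write offset matters for this illustration. */
struct toy_buf {
    _Atomic size_t offset;  /* next free byte */
    size_t size;            /* total capacity */
};

/*
 * Reserve `len` bytes without taking a lock. Returns the start offset
 * of the reserved slot, or (size_t)-1 if the buffer is full. `retries`,
 * if non-NULL, receives how many times the loop had to start over.
 */
static size_t toy_reserve(struct toy_buf *b, size_t len, unsigned *retries)
{
    size_t old;
    unsigned n = 0;

    assert(b != NULL);
    old = atomic_load(&b->offset);
    for (;;) {
        if (len > b->size - old) {
            old = (size_t)-1;  /* no room left */
            break;
        }
        /* Succeeds only if nobody moved the offset since we read it. */
        if (atomic_compare_exchange_weak(&b->offset, &old, old + len))
            break;
        n++;  /* `old` now holds the fresh offset; try again */
    }
    if (retries != NULL)
        *retries = n;
    return old;
}
```

Each failed compare-and-swap means another writer (or an interrupt on
the same CPU) got in first, so the number of iterations depends on
contention rather than being a fixed cost.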

I pointed out before that it is not possible to select at compile time
the events which I'm interested in. I get either nothing
or everything. If I want to use instrumentation for a particular
problem, why must I process a loop of _ltt_log_event calls for stuff I
do not need instead of just compiling it away?

If I compile an event in, then adding a couple of checks to the
instrumentation macro itself does not hurt as much as leaving the
straight code path for a disabled event.
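
A minimal sketch of that idea, with per-event compile-time switches and
a cheap run-time check inside the macro. All names are hypothetical; in
a kernel the TRACE_CONFIG_* switches would presumably come from Kconfig.

```c
#include <assert.h>

/* Hypothetical per-event switches; 1 = compiled in, 0 = compiled out. */
#define TRACE_CONFIG_SYSCALL 1
#define TRACE_CONFIG_IRQ     0

enum { TRACE_EV_SYSCALL, TRACE_EV_IRQ };

static unsigned long trace_enabled_mask;  /* run-time enable bits */
static int trace_logged;                  /* stand-in for the backend */

static void trace_backend_log(int event_id)
{
    assert(event_id >= 0);
    trace_logged++;
}

/* Compiled-in event: one cheap run-time check in the macro itself. */
#define TRACE_EVENT_ON(id)                          \
    do {                                            \
        if (trace_enabled_mask & (1UL << (id)))     \
            trace_backend_log(id);                  \
    } while (0)

/* Compiled-out event: expands to nothing at all. */
#define TRACE_EVENT_OFF(id) do { } while (0)

#if TRACE_CONFIG_SYSCALL
#define trace_syscall() TRACE_EVENT_ON(TRACE_EV_SYSCALL)
#else
#define trace_syscall() TRACE_EVENT_OFF(TRACE_EV_SYSCALL)
#endif

#if TRACE_CONFIG_IRQ
#define trace_irq() TRACE_EVENT_ON(TRACE_EV_IRQ)
#else
#define trace_irq() TRACE_EVENT_OFF(TRACE_EV_IRQ)
#endif
```

With this layout a disabled event costs nothing, and an enabled one
costs a single bit test before any logging function is called.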

tglx


2005-01-15 13:08:23

by Thomas Gleixner

Subject: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

</Flame off>

On Fri, 2005-01-14 at 15:22 -0800, Tim Bird wrote:
> but not 1) supporting infrastructure for timestamping, managing event
> data, etc., and 2) a static list of generally useful tracepoints.

Both points are well taken. That's the essential minimum that
instrumentation needs.

I'd like to see this infrastructure usable for all kinds of
instrumentation mechanisms which are built in to the kernel already or
functions which are used for similar purposes in experimental trees and
other instrumentation related projects.

This requires separating the backend from the infrastructure, so you
can choose, from a set of backends, the one which fits the intended use
best.

One of those backends is LTT+relayfs.
I really respect the work you have done there, but please accept that I
just see the limitations and try to figure out a way to make it more
generic and flexible before it is cemented into the kernel and makes it
hard to use for other interesting instrumentation aspects and maybe
enforces redundant implementation of infrastructure related
functionality.

E.g. tracking down timing-related issues can make use of such
functionality if the infrastructure is provided separately.
I guess a lot of developers would be happy to use it if it is already
around in the kernel, and it can help testers give better
information to developers.

tglx


2005-01-16 00:59:51

by Joseph Fannin

Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/

> waiting-10s-before-mounting-root-filesystem.patch
> retry mounting the root filesystem at boot time

With this patch, initrds seem to get 'skipped'. I think this is
probably the cause for the reports of problems with RAID too.

Just after loading the initrd (RAMDISK: Loading 5284KiB [1 disk]
into ram disk...) the kernel tries to mount the real root fs -- if the
necessary drivers are built-in, it proceeds from there; if not, not.

I'm guessing that when the initrd code calls mount_block_root() to
mount the ramdisk, this bit makes it decide to try to mount the real
root instead:

    if (!ROOT_DEV) {
        ROOT_DEV = name_to_dev_t(saved_root_name);
        create_dev(name, ROOT_DEV, root_device_name);
    }

Perhaps this should not be done until after the first attempt to
mount fails? Sorry, I haven't had nearly enough coffee today to
attempt to make a patch. :-)


--
Joseph Fannin
[email protected]

"Bull in pure form is rare; there is usually some contamination by data."
-- William Graves Perry Jr.



2005-01-16 02:12:16

by Karim Yaghmour

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Hello Thomas,

I don't mind having a general discussion about instrumentation, but
it has to be understood that the topic is so general and means so
many different things to different people that we are unlikely to
reach any useful consensus. Believe me, it's not for the lack of
trying. More below.

Thomas Gleixner wrote:
> </Flame off>

:D

> One of those backends is LTT+relayfs.
> I really respect the work you have done there, but please accept that I
> just see the limitations and try to figure out a way to make it more
> generic and flexible before it is cemented into the kernel and makes it
> hard to use for other interesting instrumentation aspects and maybe
> enforces redundant implementation of infrastructure related
> functionality.
>
> E.g. tracking down timing related issues can make use from such
> functionality if the infrastructure is provided seperately.
> I guess a lot of developers would be happy to use it when it is already
> around in the kernel and it can help testers for giving better
> information to developers.

I would invite you to review the history behind LTT and the history
behind the efforts to get LTT integrated in the kernel (which are
two separate topics.) If you look back, you will see that I worked
very hard trying to get people to think about a common framework
and that I and others made numerous suggestions in this regard. Here
are a few examples:

- DProbes (kprobes ancestor):
Shortly after dprobes came out in 2000, I was one of the first to
suggest that there could be interfacing between both to allow
dynamically added trace points. We worked with, and eventually
joined forces with, the IBM team working on this and very early
on, LTT and DProbes were interfacing:
http://marc.theaimsgroup.com/?l=linux-kernel&m=97079714009328&w=2
- OProfile:
When the time came to integrate oprofile in the kernel, I tried to push
for oprofile to use ltt as its logging engine (to John's utter
horror). relayfs didn't exist at the time, and obviously oprofile
made it in without relying on ltt.
Here's a posting from July 2002 where I suggested oprofile rely on
ltt. In that same posting I listed a number of drivers/subsystems
that already contained tracing statements. Obviously I was pointing
out that there was an opportunity to create a common, uniform
infrastructure based on ltt:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102624656615567&w=2
- Syscalltrack:
In replying to a posting of someone looking for tracing info, there
was a brief discussion as to how syscalltrack could use ltt instead
of: a) redirecting the syscall table, b) have its own buffering
mechanism. Again, relayfs didn't exist at the time:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102822343523369&w=2
- Event logging:
When there was discussion about event logging, there was suggestion
to use ltt's engine. Again, relayfs wasn't there:
http://marc.theaimsgroup.com/?l=linux-kernel&m=101836133400796&w=2

And there are many other cases. As you can see, it's not as if
I didn't try to have this discussion before. Unfortunately, interest
in this was rather limited.

In addition, and this is a very important issue, quite a few
kernel developers mistook LTT for a kernel debugging tool, which
it was never meant to be. When, in fact, if you ask those who have
looked at using it for that purpose (try Marcelo or Andrea) you will
see that they didn't find it to be appropriate for them. And
rightly so, it was never meant for that purpose. Even lately, when
I suggested Ingo try using relayfs instead of his custom tracing
code for his preemption work, he looked at it and said that it
wasn't suited, but would consider reusing parts of it if it were
in the kernel.

So, in general, one thing I learned over the years is to not touch
the topic of kernel debugging even with a 10 foot pole when
discussing LTT.

What you are hinting at here (mention of developers vs. testers,
for example), and your stated preference for the type of ring-buffer
you described earlier clearly goes in the direction I've learned to
avoid: buffering support for the general purpose of kernel debugging.

Let me say outright that I see the relevance of what you are looking
for, but let me also say that what we tried to achieve with relayfs
is to provide a general mechanism for kernel subsystems that need to
convey large amounts of data to user-space. We did not attempt to
solve the problem of providing a buffering framework for core kernel
debugging. As I mentioned to Ingo in the mail I referred to earlier
regarding the type of buffering you are looking for:
> The above tracer may indeed be very appropriate for kernel development,
> but it doesn't provide enough functionality for the requirements of
> mainstream users.

If there is interest for using either relayfs and/or ltt for that
purpose, then this is an entirely different mandate and a few things
would need to be added for that to happen. For starters, we could
add another mode to relayfs. Currently, it supports a locking and
a lockless buffering scheme. We could also have ring-buffer mode
which would function very much as you, and Ingo before, have
described. But let me be crystal clear about this: don't count on
me to make a case for it on LKML. I've had enough flak as it is.
If you believe this is necessary, then you are welcome to make a
case for it, and obtain support from others on LKML. Obviously, as
the maintainers of relayfs, we see no reason to avoid extending it
for purposes others may find it useful for and/or accepting patches
to that end, if indeed such extensions don't preclude its adoption
in the mainline kernel.

Hope this helps clarify things a little,

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 02:39:15

by Roman Zippel

Subject: Re: 2.6.11-rc1-mm1

Hi,

On Fri, 14 Jan 2005, Karim Yaghmour wrote:

> > Why should a subsystem care about the details of the buffer management?
>
> Because it wants to enforce a data format on buffer boundaries.

It's interesting to read more about ltt's requirements, but I still think
it's possible to leave this work to the relayfs layer.
Why not just move the ltt buffer management into relayfs and provide a
small library, which extracts the event stream again? Otherwise you have
to duplicate this work for every serious relayfs user anyway.
Completely abstracting the buffer management would make the whole
interface simpler and it would be a lot easier to change without breaking
everything. E.g. it would be possible to use per cpu buffers and remove
the need for different locking mechanisms. For a good tracing mechanism
it's not just important that it's lockless, but also that the cpus don't
share cache lines in the fast path. In this regard relayfs/ltt really
still has too much overhead, and the complex relayfs API isn't making it
easy to fix this.
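
As a rough illustration of the per cpu idea, here is a user-space
sketch in which every CPU owns a buffer that starts on its own cache
line. The 64-byte line size, the CPU count and all names are
assumptions made for the example, and in a kernel the caller would also
have to stay on its CPU while writing.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE     64   /* assumption: common line size */
#define NR_CPUS_SKETCH 4    /* hypothetical CPU count */
#define PCPU_BUF_BYTES 256

/* One buffer per CPU; the alignment keeps CPUs off each other's lines. */
struct pcpu_buf {
    alignas(CACHE_LINE) size_t offset;  /* next free byte in data[] */
    uint8_t data[PCPU_BUF_BYTES];
};

static struct pcpu_buf pcpu_bufs[NR_CPUS_SKETCH];

/*
 * Append to the buffer owned by `cpu`. No lock and no atomic operation
 * is needed as long as only that CPU ever writes to its own buffer.
 */
static int pcpu_write(unsigned cpu, const void *src, size_t len)
{
    struct pcpu_buf *b;

    assert(cpu < NR_CPUS_SKETCH);
    b = &pcpu_bufs[cpu];
    if (len > PCPU_BUF_BYTES - b->offset)
        return -1;  /* buffer full */
    memcpy(b->data + b->offset, src, len);
    b->offset += len;
    return 0;
}
```

Because the struct is aligned to the line size, its size is padded to a
multiple of it, so neighbouring array elements never share a line.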

bye, Roman

2005-01-16 03:11:18

by Roman Zippel

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

Hi,

On Sat, 15 Jan 2005, Karim Yaghmour wrote:

> In addition, and this is a very important issue, quite a few
> kernel developers mistook LTT for a kernel debugging tool, which
> it was never meant to be. When, in fact, if you ask those who have
> looked at using it for that purpose (try Marcelo or Andrea) you will
> see that they didn't find it to be appropriate for them. And
> rightly so, it was never meant for that purpose. Even lately, when
> I suggested Ingo try using relayfs instead of his custom tracing
> code for his preemption work, he looked at it and said that it
> wasn't suited, but would consider reusing parts of it if it were
> in the kernel.

Well, that's really a core problem. We don't want to duplicate
infrastructure, which practically does the same. So if relayfs isn't
usable in this kind of situation, it really raises the question whether
relayfs is usable at all. We need to make relayfs generally usable,
otherwise it will join the fate of devfs.

bye, Roman

2005-01-16 04:08:32

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Hello Thomas,

In the interest of not spreading the thread too thin, I'm replying to
both emails at the same time.

Thomas Gleixner wrote:
>>relayfs is a generalized buffering mechanism. Tracing is one application
>>it serves. Check out the web site: "high-speed data-relay filesystem."
>>Fancy name huh ...
>
>
> I do not doubt that.
>
> But hardwiring an instrumentation framework on it is also hardwiring
> implicit restrictions on the usability of the instrumentation for
> certain purposes.

To a certain extent this is true. Please refer to my reply to your RFC
for a discussion of this.

>>Well for one thing, a portion of code running in user-context won't
>>disable interrupts while it's attempting to get buffer space, and
>>therefore won't impact on interrupt delivery.
>
>
> The do {} while loops are in the fast ltt_log_event path

You mean that it would impact interrupt delivery? This code's behavior
has actually been carefully studied, and what has been seen is that
the code almost never loops, and when it does, it very rarely does
so more than twice. In the case of an interrupt, you'd have to receive
an interrupt while reserving space for logging the current interrupt's
occurrence for the loop to be done twice. I've CC'ed Bob Wisniewski
on this as he's the one who implemented this code and studied its
behavior in depth.

> Yeah, did you answer one of my arguments except claiming that I'm to
> stupid to understand how it works ?

If I misspoke, then I apologize. For one thing, I've never thought
of you as stupid. I'm just trying to get specifics here.

> I just dont like the idea, that instrumentation is bound on relayfs and
> adds a feature to the kernel which fits for a restricted set of problems
> rather than providing a generic optimized instrumentation framework,
> where one can use relayfs as a backend, if it fits his needs. Making
> this less glued together leaves the possibility to use other backends.

Yes, I understand and I hope my other mail properly addresses this issue.

> There is a loop in ltt_log_event, which enforces the processing of each
> event twice. Spliting traces is postprocessing and can be done
> elsewhere.

Sorry, this is not postprocessing. Let me explain:

Basically, the ltt framework allows only one tracing session to be active
at any given time. IOW, if you were planning on starting a 2 week trace
and after doing so wanted to trace an application for a short 10s, then
you are screwed: LTT won't allow you to do that. Currently this is a
limitation which we haven't heard any complaints about, so we're not
going to generalize it until there is proof that people really need this.

However, there are cases where you want to have tracing running at _all_
times in what is referred to as flight-recorder mode and only dump the
content of the buffers when something special happens. Yet, those who
are interested in having this 24x7 mode also know enough about tracing
that they do need to actually trace other things for short periods
without disrupting their flight-recording. That's why there's a loop.
An event will be processed twice only if you're tracing AND
flight-recording at the same time.

There is no way to do an equivalent of what I just described with any
form of postprocessing.

Here's the proper snippet from include/linux/ltt-events.h:
    /* We currently support 2 traces, normal trace and flight recorder */
    #define NR_TRACES 2
    #define TRACE_HANDLE 0
    #define FLIGHT_HANDLE 1

> In _ltt_log_event lives quite a bunch of if(...) processing decisions
> which have to be evaluated for _each_ event.

Correct, and I'm honest enough with myself to admit that this is the bit
of code that I think needs the most reviewing. So, in order to help
you help me, here's the various code snippets and things I can think
of which would help make the code faster/simpler:

Here's the preamble where we make some basic sanity checks:

    if (!trace)
        return -ENOMEDIUM;

    if (trace->paused)
        return -EBUSY;

    tracer_handle = trace->trace_handle;

    if (!trace->flight_recorder && (trace->daemon_task_struct == NULL))
        return -ENODEV;

    channel_handle = trace_channel_handle(tracer_handle, cpu_id);

    if ((trace->tracer_started == 1) || (event_id == LTT_EV_START) || (event_id == LTT_EV_BUFFER_START))
        goto trace_event;

    return -EBUSY;

trace_event:
    if (!ltt_test_bit(event_id, &trace->traced_events))
        return 0;

Basically, unless we've succeeded in all those if's, we're not going to
write anything. I think we could get rid of the first four by simply
maintaining a state-machine for the tracer. Then we could either have
a single if or even use function pointers (though I think this costs
more) to call or not call _ltt_log_event. As for checking whether the
event has a certain ID (EV_START or EV_BUFFER_START and ltt_test_bit),
we could do the testing at the event's occurrence (i.e. as soon as the
event occurs, check whether it's being monitored right there and drop
it otherwise.)
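
One way such a state machine could look, as a user-space sketch. The
state names and the error table are hypothetical; the return values
follow the preamble quoted above, including the rule that start events
may be logged before the tracer is started.

```c
#include <assert.h>
#include <errno.h>

#ifndef ENOMEDIUM
#define ENOMEDIUM 123  /* Linux value; fallback for other platforms */
#endif

/* Hypothetical states replacing the chain of separate checks. */
enum tracer_state {
    TRACER_ABSENT,     /* no trace set up (was: !trace) */
    TRACER_PAUSED,     /* was: trace->paused */
    TRACER_NO_DAEMON,  /* no daemon and not flight-recording */
    TRACER_STOPPED,    /* set up but not started yet */
    TRACER_RUNNING,
    TRACER_NR_STATES
};

static const int tracer_state_err[TRACER_NR_STATES] = {
    [TRACER_ABSENT]    = -ENOMEDIUM,
    [TRACER_PAUSED]    = -EBUSY,
    [TRACER_NO_DAEMON] = -ENODEV,
    [TRACER_STOPPED]   = -EBUSY,
    [TRACER_RUNNING]   = 0,
};

/* Returns 0 if the event may be logged, a negative errno otherwise. */
static int tracer_may_log(enum tracer_state state, int is_start_event)
{
    assert(state < TRACER_NR_STATES);
    if (state == TRACER_RUNNING)
        return 0;  /* the common case: a single comparison */
    if (state == TRACER_STOPPED && is_start_event)
        return 0;  /* start events are logged before tracing starts */
    return tracer_state_err[state];
}
```

The state would be updated on the slow control path (start, stop,
pause, daemon exit), so the logging path only reads one value.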

Here's the part where we check if some basic filtering requirements
have been met:

    if ((event_id != LTT_EV_START) && (event_id != LTT_EV_BUFFER_START)) {
        if (event_id == LTT_EV_SCHEDCHANGE)
            incoming_process = (struct task_struct *) (((ltt_schedchange *) event_struct)->in);
        if ((trace->tracing_pid == 1) && (current->pid != trace->traced_pid)) {
            if (incoming_process == NULL)
                return 0;
            else if (incoming_process->pid != trace->traced_pid)
                return 0;
        }
        if ((trace->tracing_pgrp == 1) && (process_group(current) != trace->traced_pgrp)) {
            if (incoming_process == NULL)
                return 0;
            else if (process_group(incoming_process) != trace->traced_pgrp)
                return 0;
        }
        if ((trace->tracing_gid == 1) && (current->egid != trace->traced_gid)) {
            if (incoming_process == NULL)
                return 0;
            else if (incoming_process->egid != trace->traced_gid)
                return 0;
        }
        if ((trace->tracing_uid == 1) && (current->euid != trace->traced_uid)) {
            if (incoming_process == NULL)
                return 0;
            else if (incoming_process->euid != trace->traced_uid)
                return 0;
        }
        if (event_id == LTT_EV_SCHEDCHANGE)
            (((ltt_schedchange *) event_struct)->in) = incoming_process->pid;
    }

First, the first inner if (LTT_EV_SCHEDCHANGE) really ought to be outside.
Instead we should modify ltt_log_event from:
    int ltt_log_event(u8 event_id,
                      void *event_struct)
to:
    int ltt_log_event(u8 event_id,
                      void *event_struct,
                      void *data,
                      int data_len)

where data is used to pass the pointer to the incoming process' task struct,
and reused below in conjunction with data_len for other purposes.

and have something like this instead in the code:
    if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
        return -EINVAL;

where ltt_filter is the filtering function, called only when there is any
sort of filtering being done.
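
A possible shape for such a filter, sketched in user space. The structs
are mock stand-ins for the trace and task state, and the signature
differs from the ltt_filter() call above, which is only outlined in
this mail. The pass/drop rule follows the existing checks: an active
criterion passes if either the current or the incoming task matches.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins; the real code uses struct task_struct and the trace. */
struct mock_task { int pid, pgrp, egid, euid; };

struct mock_filter {
    int tracing_pid, traced_pid;
    int tracing_pgrp, traced_pgrp;
    int tracing_gid, traced_gid;
    int tracing_uid, traced_uid;
};

/* One criterion: passes if inactive, or if either task matches. */
static int criterion_ok(int active, int wanted, int cur, const int *incoming)
{
    if (!active || cur == wanted)
        return 1;
    return incoming != NULL && *incoming == wanted;
}

/* Returns 1 if the event should be logged, 0 if it should be dropped. */
static int mock_ltt_filter(const struct mock_filter *f,
                           const struct mock_task *cur,
                           const struct mock_task *in)
{
    assert(f != NULL && cur != NULL);
    return criterion_ok(f->tracing_pid, f->traced_pid, cur->pid,
                        in ? &in->pid : NULL)
        && criterion_ok(f->tracing_pgrp, f->traced_pgrp, cur->pgrp,
                        in ? &in->pgrp : NULL)
        && criterion_ok(f->tracing_gid, f->traced_gid, cur->egid,
                        in ? &in->egid : NULL)
        && criterion_ok(f->tracing_uid, f->traced_uid, cur->euid,
                        in ? &in->euid : NULL);
}
```

Here `in` plays the role of the data pointer: it is NULL for every
event except a scheduling change.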

Then we calculate the size of this event:
    data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size);

    if (ltt_test_bit(event_id, &trace->log_event_details_mask)) {
        data_size += event_struct_size[event_id];
        switch (event_id) {
        case LTT_EV_FILE_SYSTEM:
            if ((((ltt_file_system *) event_struct)->event_sub_id == LTT_EV_FILE_SYSTEM_EXEC)
                || (((ltt_file_system *) event_struct)->event_sub_id == LTT_EV_FILE_SYSTEM_OPEN)) {
                var_data_beg = ((ltt_file_system *) event_struct)->file_name;
                var_data_len = ((ltt_file_system *) event_struct)->event_data2 + 1;
                data_size += (uint16_t) var_data_len;
            }
            break;
        case LTT_EV_CUSTOM:
            var_data_beg = ((ltt_custom *) event_struct)->data;
            var_data_len = ((ltt_custom *) event_struct)->data_size;
            data_size += (uint16_t) var_data_len;
            break;
        }
    }

Here we reuse data and data_len, and remove the checking for whether the
user wants to log event details or not in order to remove this if/switch
altogether. The log_event_details_mask was a feature I added early on
in LTT's life and I don't know of anyone for whom this was really crucial.
We could revive it later if it became important.

Then we check whether we should be logging the CPU-ID:
    if ((trace->log_cpuid == 1) && (event_id != LTT_EV_START) && (event_id != LTT_EV_BUFFER_START))
        data_size += sizeof(cpu_id);

Frankly this is legacy code for when ltt only supported one trace buffer,
and I don't know that we need to keep it. Clearly if you've got many
CPUs you don't want to be using one buffer. So this code can go.

Now we do the relayfs part:
    rchan = rchan_get(channel_handle);
    if (rchan == NULL)
        return -ENODEV;

    relay_lock_channel(rchan, flags); /* nop for lockless */
    reserved = relay_reserve(rchan, data_size, &time_stamp, &time_delta, &reserve_code, &interrupting);

    if (reserve_code & RELAY_WRITE_DISCARD) {
        events_lost(trace->trace_handle, cpu_id)++;
        bytes_written = 0;
        goto check_buffer_switch;
    }

First, the rchan_get() really ought to go. As Roman suggested, relayfs
shouldn't be handing out IDs, it should be handing out pointers. Once
this is changed in relayfs, this piece of code will go and be replaced
by something like:
    atomic_inc(&rchan->refcount);

The rest is ok.

At this point we actually write to the buffer:
    if ((trace->log_cpuid == 1) && (event_id != LTT_EV_START)
        && (event_id != LTT_EV_BUFFER_START))
        relay_write_direct(reserved,
                           &cpu_id,
                           sizeof(cpu_id));

    relay_write_direct(reserved,
                       &event_id,
                       sizeof(event_id));

    relay_write_direct(reserved,
                       &time_delta,
                       sizeof(time_delta));

    if (ltt_test_bit(event_id, &trace->log_event_details_mask)) {
        relay_write_direct(reserved,
                           event_struct,
                           event_struct_size[event_id]);
        if (var_data_len)
            relay_write_direct(reserved,
                               var_data_beg,
                               var_data_len);
    }

    relay_write_direct(reserved,
                       &data_size,
                       sizeof(data_size));

    bytes_written = data_size;

As above, the CPU-Id and the check for log_event_details_mask should
go. And the details snippet should look something like this:

    relay_write_direct(reserved,
                       event_struct,
                       event_struct_size[event_id]);
    if (data_len)
        relay_write_direct(reserved,
                           data,
                           data_len);
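
Put together, the simplified write path could look roughly like the
user-space sketch below. put_bytes() is a stand-in for
relay_write_direct(), the names are hypothetical, and the 32-bit width
of the time delta is an assumption, since this mail does not state it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `len` bytes to *pos and advance it. */
static void put_bytes(uint8_t **pos, const void *src, size_t len)
{
    memcpy(*pos, src, len);
    *pos += len;
}

/*
 * Pack one record: event ID, time delta, fixed-size details, optional
 * variable-length data, then the total record size.
 * Returns the number of bytes written, or 0 if the record won't fit.
 */
static size_t pack_event(uint8_t *dst, size_t room, uint8_t event_id,
                         uint32_t time_delta,
                         const void *details, size_t details_len,
                         const void *data, size_t data_len)
{
    uint8_t *pos = dst;
    uint16_t data_size;
    size_t total = sizeof(event_id) + sizeof(time_delta) + details_len
                   + data_len + sizeof(data_size);

    assert(dst != NULL);
    if (total > room || total > UINT16_MAX)
        return 0;
    data_size = (uint16_t)total;

    put_bytes(&pos, &event_id, sizeof(event_id));
    put_bytes(&pos, &time_delta, sizeof(time_delta));
    if (details_len)
        put_bytes(&pos, details, details_len);
    if (data_len)
        put_bytes(&pos, data, data_len);
    put_bytes(&pos, &data_size, sizeof(data_size));
    return total;
}
```

With data and data_len passed in by the caller, the per-event switch
that used to locate the variable-length part is no longer needed.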

Finally, we complete the relayfs management:

check_buffer_switch:
    if ((event_id == LTT_EV_SCHEDCHANGE) && (tracer_handle == TRACE_HANDLE) && current_traces[FLIGHT_HANDLE].active)
        (((ltt_schedchange *) event_struct)->in) = (u32)incoming_process;

    /* We need to commit even if we didn't write anything because
       that's how the deliver callback is invoked. */
    relay_commit(rchan, reserved, bytes_written, reserve_code, interrupting);

    relay_unlock_channel(rchan, flags);
    rchan_put(rchan);

For this bit, it's the if() that ought to go now that we would be using
data and data_len. Also, the rchan_put() should be replaced with the
following once relayfs is changed:
    atomic_dec(&rchan->refcount);
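
For reference, the get/put pairing can be sketched in user space with
C11 atomics. The struct and function names are hypothetical. One
difference from a bare decrement: the put below reports whether it
dropped the last reference, which is what a caller needs in order to
know when the channel can be freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Mock channel carrying only the reference count. */
struct mock_rchan {
    atomic_int refcount;
};

/* Take a reference; the caller must already hold a valid pointer. */
static void mock_rchan_get(struct mock_rchan *rchan)
{
    assert(rchan != NULL);
    atomic_fetch_add(&rchan->refcount, 1);
}

/* Drop a reference. Returns 1 if this was the last one, 0 otherwise. */
static int mock_rchan_put(struct mock_rchan *rchan)
{
    assert(rchan != NULL);
    /* atomic_fetch_sub returns the value held before the decrement. */
    return atomic_fetch_sub(&rchan->refcount, 1) == 1;
}
```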

Let me know if you have additional suggestions.

> The relay_reserve code can loop in the do { } while() and even go into a
> slow path where another do { } while() is found.
> So it can not be used in fast paths and for timing related problem
> tracking, because it adds variable time overhead.

True. But remember what I said earlier, if timing is an issue you need to
be using the locking scheme.

> Due to the fact, that the ltt_log_event path is not preempt safe you can
> actually hit the additional go in the do { } while() loop.

Yes, we should have something like this instead:
    u32 cpu;

    preempt_disable();
    cpu = smp_processor_id();
    for (i = 0; i < NR_TRACES; i++) {
        trace = current_traces[i].active;
        err[i] = _ltt_log_event(trace, event_id, event_struct, cpu);
    }
    preempt_enable();

This better?

> I pointed out before, that it is not possible to selectively select the
> events which I'm interested in during compile time. I get either nothing
> or everything. If I want to use instrumentation for a particular
> problem, why must I process a loop of _ltt_log_event calls for stuff I
> do not need instead of just compiling it away ?

Like I said, that's an easy hack in Kconfig.

> If I compile a event in, then adding a couple of checks into the
> instrumentation macro itself does not hurt as much as leaving the
> straight code path for a disabled event.

Right, like I said above, the instrumentation macros should check for
the event's logging as early as possible.

As you can see, I am open to your feedback. The above improvements
will go in the ltt code.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 04:16:34

by Karim Yaghmour

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Hello Roman,

Roman Zippel wrote:
> On Sat, 15 Jan 2005, Karim Yaghmour wrote:
>>In addition, and this is a very important issue, quite a few
>>kernel developers mistook LTT for a kernel debugging tool, which
>>it was never meant to be. When, in fact, if you ask those who have
>>looked at using it for that purpose (try Marcelo or Andrea) you will
>>see that they didn't find it to be appropriate for them. And
>>rightly so, it was never meant for that purpose. Even lately, when
>>I suggested Ingo try using relayfs instead of his custom tracing
>>code for his preemption work, he looked at it and said that it
>>wasn't suited, but would consider reusing parts of it if it were
>>in the kernel.
>
> Well, that's really a core problem. We don't want to duplicate
> infrastructure, which practically does the same. So if relayfs isn't
> usable in this kind of situation, it really raises the question whether
> relayfs is usable at all. We need to make relayfs generally usable,
> otherwise it will join the fate of devfs.

Hmm, coming from you I will take this as a pretty strong endorsement
for what I was suggesting earlier: provide a basic buffering mode
in relayfs to be used in kernel debugging. However, it must be
understood that this is separate from the existing modes and ltt,
for example, could not use such a basic infrastructure. If this is
ok with you, and no one wants to complain too loudly about this, I
will go ahead and add this to our to-do list for relayfs.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 05:54:36

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> It's interesting to read more about ltt's requirements, but I still think
> it's possible to leave this work to the relayfs layer.

Ok, I'm willing to play ball, but can you be a little bit more specific?

> Why not just move the ltt buffer management into relayfs and provide a
> small library, which extracts the event stream again? Otherwise you have
> to duplicate this work for every serious relayfs user anyway.

Ok, I've been meditating over what you say above for some time in order
to understand how best to follow what you are suggesting. So here's
what I've been able to come up with. Let me know if you have other
suggestions:

Drop the buffer-start/end callbacks altogether. Instead, allow user
to specify in the channel properties whether they want to have
sub-buffer delimiters. If so, relayfs would automatically prepend
and append the structures currently written by ltt:
/* Start of trace buffer information */
typedef struct _ltt_buffer_start {
        struct timeval time; /* Time stamp of this buffer */
        u32 tsc; /* TSC of this buffer, if applicable */
        u32 id; /* Unique buffer ID */
} LTT_PACKED_STRUCT ltt_buffer_start;

/* End of trace buffer information */
typedef struct _ltt_buffer_end {
        struct timeval time; /* Time stamp of this buffer */
        u32 tsc; /* TSC of this buffer, if applicable */
} LTT_PACKED_STRUCT ltt_buffer_end;

This would also allow dropping the start_reserve, end_reserve, and
channel_start_reserve. The latter can be added by ltt as its first
event.
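
To make the delimiter idea concrete, here is a rough user-space sketch, assuming the buffering layer writes a start record at the front of each sub-buffer and an end record at its tail. The struct and function names (subbuf_start, subbuf_end, subbuf_open, subbuf_close, subbuf_payload_size) are hypothetical, and the structs are simplified stand-ins for ltt_buffer_start and ltt_buffer_end with fixed-width fields instead of struct timeval.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for ltt_buffer_start / ltt_buffer_end. */
struct subbuf_start { uint64_t time_us; uint32_t tsc; uint32_t id; };
struct subbuf_end   { uint64_t time_us; uint32_t tsc; };

/*
 * Space left for events in one sub-buffer once room for both
 * delimiters has been set aside. Returns 0 if it is too small.
 */
static size_t subbuf_payload_size(size_t bufsize)
{
    size_t overhead = sizeof(struct subbuf_start) + sizeof(struct subbuf_end);

    return bufsize > overhead ? bufsize - overhead : 0;
}

/* Write the start delimiter at the front; returns the bytes used. */
static size_t subbuf_open(unsigned char *buf, uint64_t now_us,
                          uint32_t tsc, uint32_t id)
{
    struct subbuf_start s = { now_us, tsc, id };

    memcpy(buf, &s, sizeof(s));
    return sizeof(s);
}

/* Write the end delimiter into the last bytes of the sub-buffer. */
static void subbuf_close(unsigned char *buf, size_t bufsize,
                         uint64_t now_us, uint32_t tsc)
{
    struct subbuf_end e = { now_us, tsc };

    assert(bufsize >= sizeof(struct subbuf_start) + sizeof(e));
    memcpy(buf + bufsize - sizeof(e), &e, sizeof(e));
}
```

With fixed positions for both records, a reader can find them without parsing any events, which is what makes the client-side reserve callbacks unnecessary in this sketch.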

Is this what you are looking for, and is there something else we should
be doing?

> Completely abstracting the buffer management would make the whole
> interface simpler and it would be a lot easier to change without breaking
> everything. E.g. it would be possible to use per cpu buffers and remove
> the need for different locking mechanisms, for a good tracing mechanism
> it's not just important that it's lockless, but also that the cpus don't
> share cache lines in the fast path. In this regard relayfs/ltt has really
> still too much overhead and the complex relayfs API isn't really making it
> easy to fix this.

The per-cpu buffering issue is really specific to the client. It just
so happens that LTT creates one channel for each CPU. Not everyone
who needs to ship lots of data to user-space needs/wants one channel
per cpu. You could, for example, use a relayfs channel as a big
chunk of memory visible to both a user-space app and its kernel buddy
in order to exchange data without either ever needing more
than one such channel for your entire subsystem.

As for lockless vs. locking there is a need for both. Not having
to get locks has obvious advantages, but if you require strict
timing you will want to use the locking scheme because its logging
time is linear (see Thomas' complaints about lockless elsewhere
in this thread, and Ingo's complaints about relayfs somewhere back
in October.)

But in trying to make things simpler, here's a reworked API:

rchan* relay_open(channel_path, mode, bufsize, nbufs);
int relay_close(*rchan);
int relay_reset(*rchan);
int relay_write(*rchan, *data_ptr, count, **wrote_pos);

int relay_info(*rchan, *channel_info);
void relay_set_property(*rchan, property, value);
void relay_get_property(*rchan, property, *value);

For direct writing (currently already used by ltt, for example):

char* relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
void relay_commit(*rchan, *from, len, reserve_code, interrupting);

These are the related macros:

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

As I hinted elsewhere, we would now have three modes for relayfs
channels:
- locking => relies on local_irq_save.
- lockless => relies on try_reserve/fail->retry (based on cmpxchg).
- kdebug => this is for kernel debugging.

The last one could be based on Ingo's tracing code, or any
implementation suggestions by Thomas. It wouldn't do all
the checks and provide all the capabilities of the other two
mechanisms, but would really be a hot-path logger with only
minimalistic provisions for content loss and other such things.

(note to Tom: time_delta_offset that used to be in relay_write
should be a property set using relay_set_property).

What I'm dropping for now is all the functions that allow a
subsystem to read from a channel from within the kernel. So,
for example, if you want to obtain large amounts of data from
user-space via a relayfs channel you won't be able to. Here
are the functions that would go:

rchan_reader *add_rchan_reader(channel_id, auto_consume)
int remove_rchan_reader(rchan_reader *reader)
rchan_reader *add_map_reader(channel_id)
int remove_map_reader(rchan_reader *reader)
int relay_read(reader, buf, count, wait, *actual_read_offset)
void relay_buffers_consumed(reader, buffers_consumed)
void relay_bytes_consumed(reader, bytes_consumed, read_offset)
int relay_bytes_avail(reader)
int rchan_full(reader)
int rchan_empty(reader)

We could add these at a later time when/if needed. Removing
these changes nothing for ltt.

Also, we should try to get rid of the following. They are there
for allowing dynamically-resizable buffers, but if we are to
make buffer-management opaque, then this should be done
internally (Tom: I can't remember the rationale for these. Let
me know if there's a reason why they must be kept.)

int relay_realloc_buffer(*rchan, nbufs, async)
int relay_replace_buffer(*rchan)

I think this is a pretty major change and simplification of the
API along the lines of what others have asked for. Let me know
what you think.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 15:20:28

by Robert Wisniewski

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour writes:
>
> Hello Thomas,
>
> In the interest of keeping the thread from spreading too thin, I'm replying to
> both emails at the same time.
>
> Thomas Gleixner wrote:
> >>relayfs is a generalized buffering mechanism. Tracing is one application
> >>it serves. Check out the web site: "high-speed data-relay filesystem."
> >>Fancy name huh ...
> >
> >
> > I do not doubt that.
> >
> > But hardwiring an instrumentation framework on it is also hardwiring
> > implicit restrictions on the usability of the instrumentation for
> > certain purposes.
>
> To a certain extent this is true. Please refer to my reply to your RFC
> for a discussion of this.
>
> >>Well for one thing, a portion of code running in user-context won't
> >>disable interrupts while it's attempting to get buffer space, and
> >>therefore won't impact on interrupt delivery.
> >
> >
> > The do {} while loops are in the fast ltt_log_event path

As Greg's comments implicitly involved this issue as well, maybe it's worth
expanding on what is going on here. The idea behind the lockless tracing
is for each process/thread to atomically reserve space in the buffer, then
write in the events. Also note that buffers are per-processor. So the do
{} while loop loads the current index, does a calculation and attempts to
use the calculated value (which is the old index + length of current event)
to atomically compare_and_swap with the actual index pointer. As Karim
correctly notes, the only way this will fail is if an interrupt occurred
during the couple of instruction calculation, i.e., between when the old
value was loaded and when we do the CAS, so it's unlikely, but even much
more unlikely that, as he notes, this process would be woken up only for a
couple of instructions and re-interrupted. Back to Greg's volatile issue:
The reason the index needs to be volatile (or as was originally coded the
reason we clobbered the registers) is to make sure the compiler knows the
index value needs to get reloaded from memory each time around the loop.
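
The reservation loop described above can be sketched in user space as follows. This is only an illustration: C11 atomics stand in for the kernel's cmpxchg, and trace_buf, reserve, BUF_SIZE and RESERVE_FAILED are hypothetical names, not the relayfs or ltt code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 4096u
#define RESERVE_FAILED ((size_t)-1)

/* One such buffer per CPU is assumed, as in the description above. */
struct trace_buf {
    _Atomic size_t index;           /* offset of the next free byte */
    unsigned char data[BUF_SIZE];
};

/*
 * Reserve len bytes. Returns the start offset of the reserved space,
 * or RESERVE_FAILED if the event does not fit in what is left.
 */
static size_t reserve(struct trace_buf *b, size_t len)
{
    size_t old, new_index;

    assert(b != NULL);
    old = atomic_load(&b->index);
    do {
        if (len > BUF_SIZE - old)
            return RESERVE_FAILED;
        new_index = old + len;      /* old index + length of this event */
        /*
         * If another writer (e.g. an interrupt handler) moved the index
         * since we loaded it, the exchange fails, 'old' is refreshed
         * with the current value, and the calculation is redone.
         */
    } while (!atomic_compare_exchange_weak(&b->index, &old, new_index));

    return old;                     /* the caller writes its event here */
}
```

The atomic load and the refreshed 'old' on failure play the role of the volatile index: the value is re-read from memory on every pass rather than being cached in a register.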

Hope this helps. I'm certainly happy to discuss in more length if there's
any concerns/questions.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[email protected]

>
> You mean that it would impact on interrupt delivery? This code's behavior
> has actually been carefully studied, and what has been seen is that
> the code almost never loops, and when it does, it very rarely does
> it more than twice. In the case of an interrupt, you'd have to receive
> an interrupt while reserving space for logging the current interrupt's
> occurrence for the loop to be done twice. I've CC'ed Bob Wisniewski
> on this as he's the one that implemented this code and studied its
> behavior in depth.
>
> > Yeah, did you answer one of my arguments except claiming that I'm too
> > stupid to understand how it works ?
>
> If I misspoke, then I apologize. For one thing, I've never thought
> of you as stupid. I'm just trying to get specifics here.
>
> > I just don't like the idea, that instrumentation is bound on relayfs and
> > adds a feature to the kernel which fits for a restricted set of problems
> > rather than providing a generic optimized instrumentation framework,
> > where one can use relayfs as a backend, if it fits his needs. Making
> > this less glued together leaves the possibility to use other backends.
>
> Yes, I understand and I hope my other mail properly addresses this issue.
>
> > There is a loop in ltt_log_event, which enforces the processing of each
> > event twice. Splitting traces is postprocessing and can be done
> > elsewhere.
>
> Sorry, this is not postprocessing. Let me explain:
>
> Basically, the ltt framework allows only one tracing session to be active
> at all times. IOW, if you were planning on starting a 2 week trace and
> after doing so wanted to trace a short 10s on an application then you are
> screwed, LTT won't allow you to do that. Currently this is a limitation
> which we haven't heard any complaints about, so we're not going to
> generalize it until there is proof that people really need this.
>
> However, there are cases where you want to have tracing running at _all_
> times in what is referred to as flight-recorder mode and only dump the
> content of the buffers when something special happens. Yet, those who
> are interested in having this 24x7 mode also know enough about tracing
> that they do need to actually trace other things for short periods
> without disrupting their flight-recording. That's why there's a loop.
> An event will be processed twice only if you're tracing AND
> flight-recording at the same time.
>
> There is no way to do an equivalent of what I just described with any
> form of postprocessing.
>
> Here's the proper snippet from include/linux/ltt-events.h:
> /* We currently support 2 traces, normal trace and flight recorder */
> #define NR_TRACES 2
> #define TRACE_HANDLE 0
> #define FLIGHT_HANDLE 1
>
> > In _ltt_log_event lives quite a bunch of if(...) processing decisions
> > which have to be evaluated for _each_ event.
>
> Correct, and I'm honest enough with myself to admit that this is the bit
> of code that I think needs the most reviewing. So, in order to help
> you help me, here's the various code snippets and things I can think
> of which would help make the code faster/simpler:
>
> Here's the preamble where we make some basic sanity checks:
>
> if (!trace)
>         return -ENOMEDIUM;
>
> if (trace->paused)
>         return -EBUSY;
>
> tracer_handle = trace->trace_handle;
>
> if (!trace->flight_recorder && (trace->daemon_task_struct == NULL))
>         return -ENODEV;
>
> channel_handle = trace_channel_handle(tracer_handle, cpu_id);
>
> if ((trace->tracer_started == 1) || (event_id == LTT_EV_START) || (event_id == LTT_EV_BUFFER_START))
>         goto trace_event;
>
> return -EBUSY;
>
> trace_event:
> if (!ltt_test_bit(event_id, &trace->traced_events))
>         return 0;
>
> Basically, unless we've succeeded in all those if's, we're not going to
> write anything. I think we could get rid of the first 4 ones by simply
> maintaining a state-machine for the tracer. Then we could either have
> a single if or even use function pointers (though I think this costs
> more) to call or not call _ltt_log_event. As for checking whether the
> event has a certain ID (EV_START or EV_BUFFER_START and ltt_test_bit),
> we could do the testing at the event's occurrence (i.e. as soon as the
> event occurs, check whether it's being monitored right there and drop
> it otherwise.)
>
> Here's the part where we check if some basic filtering requirements
> have been made:
>
> if ((event_id != LTT_EV_START) && (event_id != LTT_EV_BUFFER_START)) {
> if (event_id == LTT_EV_SCHEDCHANGE)
> incoming_process = (struct task_struct *) (((ltt_schedchange *) event_struct)->in);
> if ((trace->tracing_pid == 1) && (current->pid != trace->traced_pid)) {
> if (incoming_process == NULL)
> return 0;
> else if (incoming_process->pid != trace->traced_pid)
> return 0;
> }
> if ((trace->tracing_pgrp == 1) && (process_group(current) != trace->traced_pgrp)) {
> if (incoming_process == NULL)
> return 0;
> else if (process_group(incoming_process) != trace->traced_pgrp)
> return 0;
> }
> if ((trace->tracing_gid == 1) && (current->egid != trace->traced_gid)) {
> if (incoming_process == NULL)
> return 0;
> else if (incoming_process->egid != trace->traced_gid)
> return 0;
> }
> if ((trace->tracing_uid == 1) && (current->euid != trace->traced_uid)) {
> if (incoming_process == NULL)
> return 0;
> else if (incoming_process->euid != trace->traced_uid)
> return 0;
> }
> if (event_id == LTT_EV_SCHEDCHANGE)
> (((ltt_schedchange *) event_struct)->in) = incoming_process->pid;
> }
>
> First, the first inner if (LTT_EV_SCHEDCHANGE) really ought to be outside.
> Instead we should modify ltt_log_event from:
> int ltt_log_event(u8 event_id,
> void *event_struct)
> to:
> int ltt_log_event(u8 event_id,
> void *event_struct,
> void *data,
> int data_len)
>
> where data is used to pass the pointer to the incoming process' task struct,
> and reused below in conjunction with data_len for other purposes.
>
> and have something like this instead in the code:
> if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
> return -EINVAL;
>
> where ltt_filter is the filtering function, called only when there is any
> sort of filtering being done.
>
> Then we calculate the size of this event:
> data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size);
>
>
> if (ltt_test_bit(event_id, &trace->log_event_details_mask)) {
> data_size += event_struct_size[event_id];
> switch (event_id) {
> case LTT_EV_FILE_SYSTEM:
> if ((((ltt_file_system *) event_struct)->event_sub_id == LTT_EV_FILE_SYSTEM_EXEC)
> || (((ltt_file_system *) event_struct)->event_sub_id == LTT_EV_FILE_SYSTEM_OPEN)) {
> var_data_beg = ((ltt_file_system *) event_struct)->file_name;
> var_data_len = ((ltt_file_system *) event_struct)->event_data2 + 1;
> data_size += (uint16_t) var_data_len;
> }
> break;
> case LTT_EV_CUSTOM:
> var_data_beg = ((ltt_custom *) event_struct)->data;
> var_data_len = ((ltt_custom *) event_struct)->data_size;
> data_size += (uint16_t) var_data_len;
> break;
> }
> }
>
> Here we reuse data and data_len, and remove the checking for whether the
> user wants to log event details or not in order to remove this if/switch
> altogether. The log_event_details_mask was a feature I added early on
> in LTT's life and I don't know of anyone for whom this was really crucial.
> We could revive it later if it became important.
>
> Then we check whether we should be logging the CPU-ID:
> if ((trace->log_cpuid == 1) && (event_id != LTT_EV_START) && (event_id != LTT_EV_BUFFER_START))
> data_size += sizeof(cpu_id);
>
> Frankly this is legacy code for when ltt only supported one trace buffer,
> and I don't know that we need to keep it. Clearly if you've got many
> CPUs you don't want to be using one buffer. So this code can go.
>
> Now we do the relayfs part:
> rchan = rchan_get(channel_handle);
> if (rchan == NULL)
> return -ENODEV;
>
> relay_lock_channel(rchan, flags); /* nop for lockless */
> reserved = relay_reserve(rchan, data_size, &time_stamp, &time_delta, &reserve_code, &interrupting);
>
> if (reserve_code & RELAY_WRITE_DISCARD) {
> events_lost(trace->trace_handle, cpu_id)++;
> bytes_written = 0;
> goto check_buffer_switch;
> }
>
> First, the rchan_get() really ought to go. As Roman suggested, relayfs
> shouldn't be handing out IDs, it should be handing out pointers. Once this
> is changed in relayfs, this piece of code will go and be replaced by
> something like:
> atomic_inc(&rchan->refcount);
>
> The rest is ok.
>
> At this point we actually write to the buffer:
> if ((trace->log_cpuid == 1) && (event_id != LTT_EV_START)
> && (event_id != LTT_EV_BUFFER_START))
> relay_write_direct(reserved,
> &cpu_id,
> sizeof(cpu_id));
>
> relay_write_direct(reserved,
> &event_id,
> sizeof(event_id));
>
> relay_write_direct(reserved,
> &time_delta,
> sizeof(time_delta));
>
> if (ltt_test_bit(event_id, &trace->log_event_details_mask)) {
> relay_write_direct(reserved,
> event_struct,
> event_struct_size[event_id]);
> if (var_data_len)
> relay_write_direct(reserved,
> var_data_beg,
> var_data_len);
> }
>
> relay_write_direct(reserved,
> &data_size,
> sizeof(data_size));
>
> bytes_written = data_size;
>
> As above, the CPU-Id and the check for log_event_details_mask should
> go. And the details snippet should look something like this:
>
> relay_write_direct(reserved,
> event_struct,
> event_struct_size[event_id]);
> if (data_len)
> relay_write_direct(reserved,
> data,
> data_len);
>
> Finally, we complete the relayfs management:
>
> check_buffer_switch:
> if ((event_id == LTT_EV_SCHEDCHANGE) && (tracer_handle == TRACE_HANDLE) && current_traces[FLIGHT_HANDLE].active)
> (((ltt_schedchange *) event_struct)->in) = (u32)incoming_process;
>
> /* We need to commit even if we didn't write anything because
> that's how the deliver callback is invoked. */
> relay_commit(rchan, reserved, bytes_written, reserve_code, interrupting);
>
> relay_unlock_channel(rchan, flags);
> rchan_put(rchan);
>
> For this bit, it's the if() that ought to go now that we would be using
> data and data_len. Also, the rchan_put() should be replaced with the
> following once relayfs is changed:
> atomic_dec(&rchan->refcount);
>
> Let me know if you have additional suggestions.
>
> > The relay_reserve code can loop in the do { } while() and even go into a
> > slow path where another do { } while() is found.
> > So it can not be used in fast paths and for timing related problem
> > tracking, because it adds variable time overhead.
>
> True. But remember what I said earlier, if timing is an issue you need to
> be using the locking scheme.
>
> > Due to the fact, that the ltt_log_event path is not preempt safe you can
> > actually hit the additional go in the do { } while() loop.
>
> Yes, we should have something like this instead:
> u32 cpu;
>
> preempt_disable();
> cpu = smp_processor_id();
> for (i = 0; i < NR_TRACES; i++) {
>         trace = current_traces[i].active;
>         err[i] = _ltt_log_event(trace, event_id, event_struct, cpu);
> }
> preempt_enable();
>
> This better?
>
> > I pointed out before, that it is not possible to selectively select the
> > events which I'm interested in during compile time. I get either nothing
> > or everything. If I want to use instrumentation for a particular
> > problem, why must I process a loop of _ltt_log_event calls for stuff I
> > do not need instead of just compiling it away ?
>
> Like I said, that's an easy hack in Kconfig.
>
> > If I compile a event in, then adding a couple of checks into the
> > instrumentation macro itself does not hurt as much as leaving the
> > straight code path for a disabled event.
>
> Right, like I said above, the instrumentation macros should check for
> the event's logging as early as possible.
>
> As you can see, I am open to your feedback. The above improvements
> will go in the ltt code.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 16:15:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 04:11:38PM -0500, Karim Yaghmour wrote:
> Where does this appear in relayfs and what rights do
> user-space apps have over it (rwx).

Why would you want anything but read access?

> bufsize, nbufs:
> Usually things have to be subdivided in sub-buffers to make
> both writing and reading simple. LTT uses this to allow,
> among other things, random trace access.

I think random access is overkill. Keeping the code simple is more
important and user-space can post-process it.

> resize_min, resize_max:
> Allow for dynamic resizing of buffer.

Auto-resizing sounds like a really bad idea.

> init_buf, init_buf_size:
> Is there an initial buffer containing some data that should
> be used to initialize the channel's content. If you're doing
> init-time tracing, for example, you need to have a pre-allocated
> static buffer that is copied to relayfs once relayfs is mounted.

And why can't you do this from that code? It just needs an initcall-like
thing that runs after mounting of relayfs.

2005-01-16 16:19:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Sat, Jan 15, 2005 at 01:24:16AM +0100, Thomas Gleixner wrote:
> Putting a 200k patch into the kernel for limited usage and maybe
> restricting a generic simple non intrusive and more generic
> implementation by its mere presence is making it inapplicable enough.
>
> Merge the instrumentation points from ltt and other projects like DSKI
> and the places where in kernel instrumentation for specific purposes is
> already available and use a simple and effective framework which moves
> the burden into postprocessing and provides a simple postmortem dump
> interface, is the goal IMHO.
>
> When this is available, trace tool developers can concentrate on
> postprocessing improvement rather than moving postprocessing
> incapabilities into the kernel.

I completely agree with that statement. We've been working in most
areas of the kernel to move or keep complexity and policy in userspace.
The same should be true for a tracing framework.

2005-01-16 16:22:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 06:09:23PM -0500, Karim Yaghmour wrote:
> relayfs implements two schemes: lockless and locking. The latter uses
> standard linear locking mechanisms. If you need stringent constant
> time, you know what to do.

the lockless mode is really just loops around cmpxchg. It's spinlocks
reinvented poorly.

2005-01-16 16:46:52

by Daniel Drake

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Joseph Fannin wrote:
> On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>
>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
>
>
>>waiting-10s-before-mounting-root-filesystem.patch
>> retry mounting the root filesystem at boot time
>
>
> With this patch, initrds seem to get 'skipped'. I think this is
> probably the cause for the reports of problems with RAID too.

This seems likely and is probably also the cause of wli's problems mentioned
elsewhere in this thread.

I had overlooked the way that initrd's work in that part of the boot sequence.
Will investigate.

Daniel

2005-01-16 16:52:53

by Roman Zippel

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,

On Sun, 16 Jan 2005, Karim Yaghmour wrote:

> The per-cpu buffering issue is really specific to the client. It just
> so happens that LTT creates one channel for each CPU. Not everyone
> who needs to ship lots of data to user-space needs/wants one channel
> per cpu. You could, for example, use a relayfs channel as a big
> chunk of memory visible to both a user-space app and its kernel buddy
> in order to exchange data without either ever needing more
> than one such channel for your entire subsystem.

It seems we first need to specify what relayfs is actually supposed to
be. Is it a relaying mechanism for large amounts of data from kernel to
user space or is it a general communication channel between kernel and
user space? You have to choose one; if you mix contradictory requirements,
you'll never get a simple abstraction layer and relayfs will always be a
pain to work with.

> > Why not just move the ltt buffer management into relayfs and provide a
> > small library, which extracts the event stream again? Otherwise you have
> > to duplicate this work for every serious relayfs user anyway.
>
> Ok, I've been meditating over what you say above for some time in order
> to understand how best to follow what you are suggesting. So here's
> what I've been able to come up with. Let me know if you have other
> suggestions:
>
> Drop the buffer-start/end callbacks altogether. Instead, allow user
> to specify in the channel properties whether they want to have
> sub-buffer delimiters. If so, relayfs would automatically prepend
> and append the structures currently written by ltt:
> /* Start of trace buffer information */
> typedef struct _ltt_buffer_start {
> struct timeval time; /* Time stamp of this buffer */
> u32 tsc; /* TSC of this buffer, if applicable */
> u32 id; /* Unique buffer ID */
> } LTT_PACKED_STRUCT ltt_buffer_start;
>
> /* End of trace buffer information */
> typedef struct _ltt_buffer_end {
> struct timeval time; /* Time stamp of this buffer */
> u32 tsc; /* TSC of this buffer, if applicable */
> } LTT_PACKED_STRUCT ltt_buffer_end;

You can make it even simpler by dropping this completely. Every buffer is
simply a list of events and you can let ltt periodically write a timer
event. In userspace you can randomly seek to buffer boundaries and search
for the timer events. It will require a bit more work for userspace, but
even large amounts of tracing data stay manageable.

> As for lockless vs. locking there is a need for both. Not having
> to get locks has obvious advantages, but if you require strict
> timing you will want to use the locking scheme because its logging
> time is linear (see Thomas' complaints about lockless elsewhere
> in this thread, and Ingo's complaints about relayfs somewhere back
> in October.)

But why does it have to be done in relayfs? Simply leave it to the user to write
an extra id field:

event_id = atomic_inc_return(&event_cnt);

Userspace can then easily restore the original order of events.
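
A minimal user-space sketch of that reordering step, assuming each event record carries the sequence number taken from the shared counter. The struct event layout and the names cmp_seq and merge_by_seq are hypothetical, and counter wrap-around is ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical event record as seen by the post-processing tool. */
struct event {
    uint32_t seq;   /* value returned by the shared atomic counter */
    uint16_t cpu;   /* buffer the event was read from */
    uint16_t type;
};

/* Order events by sequence number (wrap-around not handled). */
static int cmp_seq(const void *a, const void *b)
{
    const struct event *x = a;
    const struct event *y = b;

    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Restore global order for events gathered from several buffers. */
static void merge_by_seq(struct event *ev, size_t n)
{
    assert(ev != NULL);
    if (n > 1)
        qsort(ev, n, sizeof(*ev), cmp_seq);
}
```

Since each per-CPU stream is already ordered, a k-way merge would be cheaper than a full sort for large traces; qsort just keeps the sketch short.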

bye, Roman

2005-01-16 18:19:07

by Tom Zanussi

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour writes:
>
> What I'm dropping for now is all the functions that allow a
> subsystem to read from a channel from within the kernel. So,
> for example, if you want to obtain large amounts of data from
> user-space via a relayfs channel you won't be able to. Here
> are the functions that would go:
>
> rchan_reader *add_rchan_reader(channel_id, auto_consume)
> int remove_rchan_reader(rchan_reader *reader)
> rchan_reader *add_map_reader(channel_id)
> int remove_map_reader(rchan_reader *reader)
> int relay_read(reader, buf, count, wait, *actual_read_offset)
> void relay_buffers_consumed(reader, buffers_consumed)
> void relay_bytes_consumed(reader, bytes_consumed, read_offset)
> int relay_bytes_avail(reader)
> int rchan_full(reader)
> int rchan_empty(reader)
>
> We could add these at a later time when/if needed. Removing
> these changes nothing for ltt.

One of the things that uses these functions to read from a channel
from within the kernel is the relayfs code that implements read(2), so
taking them away means you wouldn't be able to use read() on a relayfs
file. That wouldn't matter for ltt since it mmaps the file, but there
are existing users of relayfs that do use relayfs this way. In fact,
most of the bug reports I've gotten are from people using it in this
mode. That doesn't mean though that it's necessarily the right thing
for relayfs or these users to be doing if they have suitable
alternatives for passing lower-volume messages in this way. As others
have mentioned, that seems to be the major question - should relayfs
concentrate on being solely a high-speed data relay mechanism or
should it try to be more, as it currently is implemented? If the
former, then I wonder if you need a filesystem at all - all you have
is a collection of mmappable buffers and the only thing the filesystem
provides is the namespace. Removing read()/write() and filesystem
support would of course greatly simplify the code; I'd like to hear
from any existing users though and see what they'd be missing.

ltt would still need at least relay_buffers_consumed() though. This
is used to support the 'no-overwrite' option, which means that when
the buffers are full i.e. the daemon has fallen behind and needs to
catch up, channel writing is 'suspended' until it catches up.
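
As a rough model of the 'no-overwrite' behaviour, assuming the channel simply counts sub-buffers produced and consumed: chan_state, chan_full, chan_switch_subbuf and chan_buffers_consumed are hypothetical names, with the last one standing in for what relay_buffers_consumed() reports to the channel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one channel in no-overwrite mode. */
struct chan_state {
    unsigned int nbufs;      /* sub-buffers in the ring */
    unsigned long produced;  /* sub-buffers filled by the writer */
    unsigned long consumed;  /* sub-buffers the daemon has finished with */
};

/* True when every sub-buffer still holds unread data. */
static bool chan_full(const struct chan_state *c)
{
    return c->produced - c->consumed >= c->nbufs;
}

/*
 * Writer side: try to move on to the next sub-buffer. Returns false
 * (writing is suspended) while the reader has not caught up.
 */
static bool chan_switch_subbuf(struct chan_state *c)
{
    if (chan_full(c))
        return false;
    c->produced++;
    return true;
}

/* Reader side: report that n sub-buffers have been consumed. */
static void chan_buffers_consumed(struct chan_state *c, unsigned long n)
{
    assert(n <= c->produced - c->consumed);
    c->consumed += n;
}
```

Without the consumed count the writer has no way to tell a free sub-buffer from an unread one, which is why this one call has to stay even if the other reader functions go.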

>
> Also, we should try to get rid of the following. They are there
> for allowing dynamically-resizable buffers, but if we are to
> make buffer-management opaque, then this should be done
> internally (Tom: I can't remember the rationale for these. Let
> me know if there's a reason why they must be kept.)
>
> int relay_realloc_buffer(*rchan, nbufs, async)
> int relay_replace_buffer(*rchan)

relay_realloc_buffer actually does the work of allocating the new
buffer space used for resizing, and since it can sleep, it's done
in the background using a work queue. When everything's ready, the
channel buffer can then be replaced, thus relay_replace_buffer().

The only user of channel resizing that I know of is the 'dynamically
resizeable printk replacement' I posted a while back, and that
apparently doesn't have any users, so I'd be happy to get rid of all
the resizing code.

Tom

>
> I think this is a pretty major change and simplification of the
> API along the lines of what others have asked for. Let me know
> what you think.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

--
Regards,

Tom Zanussi <[email protected]>
IBM Linux Technology Center/RAS

2005-01-16 18:46:54

by Daniel Drake

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,

Joseph Fannin wrote:
> On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>
>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
>
>
>>waiting-10s-before-mounting-root-filesystem.patch
>> retry mounting the root filesystem at boot time
>
>
> With this patch, initrds seem to get 'skipped'. I think this is
> probably the cause for the reports of problems with RAID too.

This patch should do the job. Replaces the existing
waiting-10s-before-mounting-root-filesystem.patch in 2.6.11-rc1-mm1.

Daniel


Attachments:
waiting-10s-before-mounting-root-filesystem.patch (2.84 kB)

2005-01-16 19:21:30

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Joseph Fannin wrote:
>> With this patch, initrds seem to get 'skipped'. I think this is
>> probably the cause for the reports of problems with RAID too.

On Sun, Jan 16, 2005 at 07:09:31PM +0000, Daniel Drake wrote:
> This seems likely and is probably also the cause of wli's problems
> mentioned elsewhere in this thread.
> I had overlooked the way that initrd's work in that part of the boot
> sequence. Will investigate.

akpm suspected this immediately, and my tests confirmed it.

I should probably do the work to make the box boot with CONFIG_MODULES=n
as I don't like initrd's or modules anyway (new points of failure).


-- wli

2005-01-16 19:40:21

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Christoph,

Christoph Hellwig wrote:
> Why would you want anything but read access?

Fine, we can make it read-only; we'll drop the "mode" field.

> I think random access is overkill. Keeping the code simple is more
> important and user-space can post-process it.

It's overkill if you're thinking in terms of KBs or MBs of data.
It isn't if you're looking at GBs and 100GBs. Please read my
other posting as to who is using this and how.

But regardless of access, you have to have some way of telling
relayfs the size of the channel you want. bufsize and nbufs
just tell relayfs the size of the buffers you want and how many
buffers there are in the ring, both of which are really basic
to any sort of buffering scheme.
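
To make the two parameters concrete, here is a minimal user-space sketch, not relayfs code. It assumes a channel is simply nbufs sub-buffers of bufsize bytes laid out back to back; the names chan_geometry, chan_total_size and chan_locate are hypothetical and do not appear in relayfs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a channel: a ring of nbufs sub-buffers,
 * each bufsize bytes long. This is not the relayfs data structure. */
struct chan_geometry {
    size_t bufsize; /* bytes per sub-buffer */
    size_t nbufs;   /* number of sub-buffers in the ring */
};

/* Total bytes the channel can hold before it wraps. */
static size_t chan_total_size(const struct chan_geometry *g)
{
    return g->bufsize * g->nbufs;
}

/* Map a running byte count to (sub-buffer index, offset within it). */
static void chan_locate(const struct chan_geometry *g, size_t written,
                        size_t *buf_idx, size_t *offset)
{
    assert(g->bufsize > 0 && g->nbufs > 0);
    size_t pos = written % chan_total_size(g);
    *buf_idx = pos / g->bufsize;
    *offset = pos % g->bufsize;
}
```

With four 4096-byte sub-buffers, byte 20490 of the stream lands in sub-buffer 1 at offset 10, because the ring has already wrapped once.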

> Auto-resizing sounds like a really bad idea.

Ok, it will go.

> And why can't you do this from that code? It just needs an initcall-like
> thing that runs after mounting of relayfs.

Ok, we'll leave it to the caller to do a relay_write() with his
init-bufs at startup.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 19:42:42

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Christoph Hellwig wrote:
> the lockless mode is really just loops around cmpxchg. It's spinlocks
> reinvented poorly.

I beg to differ. You have to use different spinlocks depending on
where you are:
- serving user-space
- bh-derivatives
- irq

lockless is the same primitive regardless of your current state,
it's not the same as spinlocks.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 19:45:04

by Tom Zanussi

Subject: Re: 2.6.11-rc1-mm1

Christoph Hellwig writes:
> On Fri, Jan 14, 2005 at 04:11:38PM -0500, Karim Yaghmour wrote:
> > Where does this appear in relayfs and what rights do
> > user-space apps have over it (rwx).
>
> Why would you want anything but read access?

This would allow an application to write trace events of its own to a
trace stream for instance. Also, I added a user-requested 'feature'
whereby write()s on a relayfs channel would be sent to a callback that
could be used to interpret 'out-of-band' commands sent from the
userspace application. And if lockless logging were being used, this
could provide a cheaper way for applications to write to the trace
buffer than having to do it via syscall.

>
> > bufsize, nbufs:
> > Usually things have to be subdivided in sub-buffers to make
> > both writing and reading simple. LTT uses this to allow,
> > among other things, random trace access.
>
> I think random access is overkill. Keeping the code simple is more
> important and user-space can post-process it.
>
> > resize_min, resize_max:
> > Allow for dynamic resizing of buffer.
>
> Auto-resizing sounds like a really bad idea.

It also doesn't seem to be really useful to anyone, so we should
probably remove it.

Tom

>
> > init_buf, init_buf_size:
> > Is there an initial buffer containing some data that should
> > be used to initialize the channel's content. If you're doing
> > init-time tracing, for example, you need to have a pre-allocated
> > static buffer that is copied to relayfs once relayfs is mounted.
>
> And why can't you do this from that code? It just needs an initcall-like
> thing that runs after mounting of relayfs.
>

--
Regards,

Tom Zanussi <[email protected]>
IBM Linux Technology Center/RAS

2005-01-16 20:12:11

by Robert Wisniewski

Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour writes:
>
> Christoph Hellwig wrote:
> > the lockless mode is really just loops around cmpxchg. It's spinlocks
> > reinvented poorly.

Christoph,
Sadly they're not the same: atomic operations provide a set of
functionality that simple spin locks do not give you. Consider two
different processes each executing the following code:

int global_val;

modify_val_spin()
{
acquire_spin_lock()
// calculate some_value based on global_val
// for example c=global_val; if (c%2) some_value=10; else some_value=20;
global_val = global_val + some_value
release_spin_lock()
}

modify_val_atomic()
{
do
// calculate some_value based on global_val
// for example c=global_val; if (c%2) some_value=10; else some_value=20;
global_val = global_val + some_value
while (compare_and_store(global_val, , ))
}

What's the difference? If two processes execute this code
simultaneously and one gets interrupted in the middle of modify_val_spin,
then the other wastes its entire quantum spinning for the lock. In
modify_val_atomic, if one process gets interrupted, no problem: the other
process can proceed through, then when the first one runs again the CAS
will fail, and it will go around the loop again. Now imagine it was the
kernel involved...
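
The retry loop described above can be sketched in user space with C11 atomics. This is only an illustration under assumptions: atomic_compare_exchange_weak stands in for the compare_and_store of the pseudocode, a parity test stands in for the "calculate some_value" step, and the function takes the counter as a parameter purely to keep the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Adds 10 or 20 to *val depending on the parity of the value it saw,
 * retrying if another thread changed *val in the meantime.
 * Returns the value that this call installed. */
static int modify_val_atomic(_Atomic int *val)
{
    assert(val != NULL);

    int old = atomic_load(val);
    int new_val;

    do {
        /* Compute the update from the snapshot, never from *val itself. */
        int some_value = (old % 2) ? 10 : 20;
        new_val = old + some_value;
        /* On failure the call reloads `old` with the current contents of
         * *val, so the next iteration recomputes from fresh data. */
    } while (!atomic_compare_exchange_weak(val, &old, new_val));

    return new_val;
}
```

A writer that is preempted between the load and the exchange holds nothing that others must wait for; it simply loses the race and goes around again, which is the property being argued for here.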

I don't claim to have all the answers and am happy to have discussion on
something, but the attitude expressed by "It's spinlocks reinvented
poorly." is not conducive to a useful exchange even if you were correct.

>
> I beg to differ. You have to use different spinlocks depending on
> where you are:
> - serving user-space
> - bh-derivatives
> - irq
>
> lockless is the same primitive regardless of your current state,
> it's not the same as spinlocks.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[email protected]

2005-01-16 20:33:36

by Andrew Morton

Subject: Re: 2.6.11-rc1-mm1

Robert Wisniewski <[email protected]> wrote:
>
> modify_val_spin()
> {
> acquire_spin_lock()
> // calculate some_value based on global_val
> // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> global_val = global_val + some_value
> release_spin_lock()
> }
>
> modify_val_atomic()
> {
> do
> // calculate some_value based on global_val
> // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> global_val = global_val + some_value
> while (compare_and_store(global_val, , ))
> }
>
> What's the difference. The deal is if two processes execute this code
> simultaneously and one gets interrupted in the middle of modify_val_spin,
> then the other wastes its entire quantum spinning for the lock. In the
> modify_val_atomic if one process gets interrupted, no problem, the other
> process can proceed through, then when the first one runs again the CAS
> will fail, and it will go around the loop again.

One could use spin_lock_irq(). The performance would be similar.

> Now imagine it was the kernel involved...

Or are you saying that userspace does the above as well?

2005-01-16 20:39:42

by Christoph Hellwig

Subject: Re: 2.6.11-rc1-mm1

On Sun, Jan 16, 2005 at 03:11:00PM -0500, Robert Wisniewski wrote:
> int global_val;
>
> modify_val_spin()
> {
> acquire_spin_lock()
> // calculate some_value based on global_val
> // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> global_val = global_val + some_value
> release_spin_lock()
> }
>
> modify_val_atomic()
> {
> do
> // calculate some_value based on global_val
> // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> global_val = global_val + some_value
> while (compare_and_store(global_val, , ))
> }
>
> What's the difference. The deal is if two processes execute this code
> simultaneously and one gets interrupted in the middle of modify_val_spin,
> then the other wastes its entire quantum spinning for the lock. In the
> modify_val_atomic if one process gets interrupted, no problem, the other
> process can proceed through, then when the first one runs again the CAS
> will fail, and it will go around the loop again. Now imagine it was the
> kernel involved...

Just prevent that with spin_lock_irq. But anyway this example doesn't
fit the ltt code. cmpxchg loops can make lots of sense for such simple
loops, but as soon as you need to do significant work in the loop it
starts to get problematic. Your example would btw be better off using
atomic_t and its primitives so you abstract away the actual implementation
and the architecture can choose the most efficient implementation.

2005-01-16 21:08:38

by Robert Wisniewski

Subject: Re: 2.6.11-rc1-mm1

Andrew Morton writes:
> Robert Wisniewski <[email protected]> wrote:
> >
> > modify_val_spin()
> > {
> > acquire_spin_lock()
> > // calculate some_value based on global_val
> > // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> > global_val = global_val + some_value
> > release_spin_lock()
> > }
> >
> > modify_val_atomic()
> > {
> > do
> > // calculate some_value based on global_val
> > // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> > global_val = global_val + some_value
> > while (compare_and_store(global_val, , ))
> > }
> >
> > What's the difference. The deal is if two processes execute this code
> > simultaneously and one gets interrupted in the middle of modify_val_spin,
> > then the other wastes its entire quantum spinning for the lock. In the
> > modify_val_atomic if one process gets interrupted, no problem, the other
> > process can proceed through, then when the first one runs again the CAS
> > will fail, and it will go around the loop again.
>
> One could use spin_lock_irq(). The performance would be similar.

Yes, on some architectures I think you're right (on some archs, though, I'm
not so sure) - Ingo and I had that debate a while ago. But as you astutely
noted or asked below, the original intent was to be able to use a single shared
buffer for user and kernel space. In fact, the lockless design of tracing
in K42, which motivated this design, does that. For a couple of reasons we
have not (yet?) done that for LTT. But, for example, NPTL could have made
use of it when they were investigating a tracing facility. Recently,
another company using LTT for device driver and video debugging is very
interested in cheap user space tracing in conjunction with kernel tracing
because they need both sets of events to understand what is up. The debate
is still open for the best way to get cheap user space logging, but there
seems to be an increasing need for it by the community.

>
> > Now imagine it was the kernel involved...
>
> Or are you saying that userspace does the above as well?

:-) - as above. Furthermore, it seems that reducing the places where
interrupts are disabled would be a good thing? By not introducing
additional interrupt-disabled sections, tracing has less of an impact. I was
also pointing out that Christoph's statement that spin locks and atomic ops
are the same is not accurate (except perhaps in limited cases, but then you
must make such arguments - not necessarily good), and we had good reasons for
using an atomic op.

Thanks.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[email protected]

2005-01-16 21:11:26

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> It seems we first need to specify, what relayfs actually is supposed to
> be. Is it a relaying mechanism for large amount of data from kernel to
> user space or is it a general communication channel between kernel and
> user space? You have to choose one, if you mix contradicting requirements,
> you'll never get a simple abstraction layer and relayfs will always be a
> pain to work with.

I think we want to concentrate on the former, though I suspect the latter
will happen eventually. But let's keep our focus on providing a mechanism
for relaying large amounts of data from the kernel to user-space.

> You can make it even simpler by dropping this completely. Every buffer is
> simply a list of events and you can let ltt write periodically a timer
> event. In userspace you can randomly seek at buffer boundaries and search
> for the timer events. It will require a bit more work for userspace, but
> even large amount of tracing data stays manageable.

We already do write a heartbeat event periodically to have readable
traces in the case where the lower 32 bits of the TSC wrap-around.

As I mentioned elsewhere, please don't think of this in terms of
KBs or MBs of data. What we're talking about here is GBs if not
100GBs of data. Having to read each sub-buffer from the start until you
hit a heartbeat really is a killer for such large traces. If there
were a significant impact on relayfs from having this I would have
understood the argument, but relayfs needs to do buffer management
anyway, so I don't see that much complexity being added by allowing
the channel user to ask relayfs for delimiters.

> Userspace can then easily restore the original order of events.

As above, restoring the original order of events is fine if you are
looking at MBs or KBs of data. It's just totally unrealistic for
the amounts of data we want to handle.

But like I said earlier, the added relayfs mode (kdebug) would allow
for exactly what you are suggesting:
event_id = atomic_inc_return(&event_cnt);
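
As a rough user-space illustration of that idea, the sketch below stamps each event with a global sequence number and lets post-processing sort on it. Everything here is an assumption made for the example: the struct event layout, the helper names, and the use of C11 atomic_fetch_add in place of the kernel's atomic_inc_return. Counter wrap-around is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical event record; the real layout is not given in the thread. */
struct event {
    unsigned int id; /* global sequence number */
    int cpu;         /* which per-CPU buffer the event came from */
};

static atomic_uint event_cnt;

/* User-space stand-in for atomic_inc_return(): returns the new value. */
static unsigned int next_event_id(void)
{
    return atomic_fetch_add(&event_cnt, 1) + 1;
}

static int cmp_event_id(const void *a, const void *b)
{
    const struct event *ea = a, *eb = b;
    return (ea->id > eb->id) - (ea->id < eb->id);
}

/* Post-processing step: order events gathered from all buffers by id. */
static void restore_order(struct event *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    if (n > 1)
        qsort(ev, n, sizeof(*ev), cmp_event_id);
}
```

The cost trade-off under discussion is visible even here: the kernel side pays one atomic increment per event, and the sort is pushed entirely to user space.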

So here's the new API based on input from Christoph and Tom:

rchan* relay_open(channel_path, bufsize, nbufs);
int relay_close(*rchan);
int relay_reset(*rchan);
int relay_write(*rchan, *data_ptr, count, **wrote-pos);

int relay_info(*rchan, *channel_info);
void relay_set_property(*rchan, property, value);
void relay_get_property(*rchan, property, *value);

For direct writing (currently already used by ltt, for example):

char* relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
void relay_commit(*rchan, *from, len, reserve_code, interrupting);
void relay_buffers_consumed(*rchan, u32);
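
The reserve/commit pattern behind the direct-writing calls can be sketched as follows. This is a single-threaded user-space model under stated assumptions, not the relayfs implementation: sketch_chan, sketch_reserve and sketch_commit are hypothetical names, and the timestamp, error and interrupt arguments of the real prototypes are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64

/* Hypothetical single-buffer channel; not the relayfs rchan. */
struct sketch_chan {
    char buf[SKETCH_BUF_SIZE];
    size_t reserved;  /* bytes handed out to writers */
    size_t committed; /* bytes writers have finished filling */
};

/* Reserve len bytes; returns where to write, or NULL if they do not fit. */
static char *sketch_reserve(struct sketch_chan *c, size_t len)
{
    assert(c != NULL);
    if (len > SKETCH_BUF_SIZE - c->reserved)
        return NULL;
    char *slot = c->buf + c->reserved;
    c->reserved += len;
    return slot;
}

/* Tell the channel that len previously reserved bytes now hold data. */
static void sketch_commit(struct sketch_chan *c, size_t len)
{
    assert(c != NULL && c->committed + len <= c->reserved);
    c->committed += len;
}
```

A writer would call sketch_reserve(), fill the returned slot with memcpy(), then call sketch_commit(); a reader may only trust bytes below the committed mark.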

These are the related macros:

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

What we are dropping for later review: read/write semantics from
user-space. It has to be understood that we believe that this is
a major drawback. For one thing, you won't be able to do something
like:
$ cat /relayfs/xchg/my-file > ~/test-data

Instead, you will have to write a custom app that does open(),
mmap(), write(). We could still provide a small app/library that
did this automagically, but you've got to admit that nothing
beats the real thing.
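
For a sense of what such a custom app involves, here is a minimal sketch of the open(), mmap(), write() sequence using plain POSIX calls. It is written against an ordinary file and assumes fstat() reports a usable size; whether a relayfs channel file behaves that way is not stated in this thread. The function name copy_mapped is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst by mapping src read-only and writing the mapping out.
 * Returns the number of bytes copied, or -1 on error. */
static long copy_mapped(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    struct stat st;
    if (fstat(in, &st) < 0 || st.st_size <= 0) {
        close(in);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
    close(in); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;

    long copied = -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out >= 0) {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(out, (const char *)map + done, len - done);
            if (n <= 0)
                break;
            done += (size_t)n;
        }
        if (done == len)
            copied = (long)done;
        close(out);
    }
    munmap(map, len);
    return copied;
}
```

Compared with the one-line cat invocation above, this is roughly the amount of boilerplate every consumer would have to carry.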

Also note that there are people who currently use this already,
so there will be some unhappy campers.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-16 21:15:42

by Robert Wisniewski

Subject: Re: 2.6.11-rc1-mm1

Christoph Hellwig writes:
> On Sun, Jan 16, 2005 at 03:11:00PM -0500, Robert Wisniewski wrote:
> > int global_val;
> >
> > modify_val_spin()
> > {
> > acquire_spin_lock()
> > // calculate some_value based on global_val
> > // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> > global_val = global_val + some_value
> > release_spin_lock()
> > }
> >
> > modify_val_atomic()
> > {
> > do
> > // calculate some_value based on global_val
> > // for example c=global_val; if (c%0) some_value=10; else some_value=20;
> > global_val = global_val + some_value
> > while (compare_and_store(global_val, , ))
> > }
> >
> > What's the difference. The deal is if two processes execute this code
> > simultaneously and one gets interrupted in the middle of modify_val_spin,
> > then the other wastes its entire quantum spinning for the lock. In the
> > modify_val_atomic if one process gets interrupted, no problem, the other
> > process can proceed through, then when the first one runs again the CAS
> > will fail, and it will go around the loop again. Now imagine it was the
> > kernel involved...
>
> Just prevent that with spin_lock_irq. But anyway this example doesn't
> fit the ltt code. cmpxchg loops can make lots of sense for such simple
> loops, but as soon as you need to do significant work in the loop it
> starts to get problematic. Your example would btw be better off using

The loop in question is where we grab the current (old) index and perform a
computation (or three). The only expensive operation is the timestamp
acquisition, which has been modified to use the cheaper rtsc, so I still
think that's within the realm of a reasonably simple loop. I think what
you want to avoid is starting to walk a (potentially indeterminate) data
structure in such an atomic op loop.

> atomic_t and its primitives so you abstract away the actual implementation
> and the architecture can choose the most efficient implementation.
>

That's an interesting thought because it might address Andrew's concern.
We'll investigate. Thanks.

-bob

2005-01-16 21:40:46

by Arjan van de Ven

Subject: Re: 2.6.11-rc1-mm1

On Sun, 2005-01-16 at 16:06 -0500, Robert Wisniewski wrote:

> :-) - as above. Furthermore, it seems that reducing the places where
> interrupts are disabled would be a good thing?

Depends on the price. On several cpus, disabling interrupts is hundreds
of times cheaper than doing an atomic op.

2005-01-16 23:43:44

by Thomas Gleixner

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

On Sat, 2005-01-15 at 23:23 -0500, Karim Yaghmour wrote:
> > Well, that's really a core problem. We don't want to duplicate
> > infrastructure, which practically does the same. So if relayfs isn't
> > usable in this kind of situation, it really raises the question whether
> > relayfs is usable at all. We need to make relayfs generally usable,
> > otherwise it will join the fate of devfs.
>
> Hmm, coming from you I will take this as a pretty strong endorsement
> for what I was suggesting earlier: provide a basic buffering mode
> in relayfs to be used in kernel debugging. However, it must be
> understood that this is separate from the existing modes and ltt,
> for example, could not use such a basic infrastructure. If this is
> ok with you, and no one wants to complain too loudly about this, I
> will go ahead and add this to our to-do list for relayfs.

This implies separating

- infrastructure
- event registration
- transport mechanism

tglx


2005-01-17 01:38:53

by Thomas Gleixner

Subject: Re: 2.6.11-rc1-mm1

On Sun, 2005-01-16 at 16:18 -0500, Karim Yaghmour wrote:

> We already do write a heartbeat event periodically to have readable
> traces in the case where the lower 32 bits of the TSC wrap-around.

Which is every 1.42 seconds on a 3GHz machine. I guess we don't have
GB's of data when the 1.42 seconds elapse without an event.
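
The wrap interval follows directly from the counter width and the clock rate. A small sketch of the arithmetic, assuming the counter advances once per CPU cycle (the helper name tsc32_wrap_seconds is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until the low 32 bits of a cycle counter wrap at cpu_hz. */
static double tsc32_wrap_seconds(double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return (double)(UINT64_C(1) << 32) / cpu_hz;
}
```

For cpu_hz = 3e9 this gives 2^32 / 3e9, a little over 1.4 seconds, in line with the figure quoted above.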

> > Userspace can then easily restore the original order of events.
>
> As above, restoring the original order of events is fine if you are
> looking at mbs or kbs of data. It's just totally unrealistic for
> the amounts of data we want to handle.

I still don't see the point. The implicit ability of LTT to allow
tracing of up to 8192 bytes of user data, strings and XML makes this
necessary. I do not see any necessity to integrate these special usage
modes instead of a generically usable instrumentation implementation.

If relayfs is giving those users the ability to do so then they can do
it, but I object to the fact that LTT/relayfs is occupying the place of a
more generic implementation in the way it is implemented now.

For normal event tracing you have about 32-64 bytes of data per event. So
disabling interrupts in order to copy this amount of information into a
buffer is cheaper on most architectures than doing the whole magic in
LTT and relayfs. This also keeps your buffers consistent and does not
need any magic for postprocessing.

Sorting out disabled events in the hot path and moving the if
(pid/gid/grp) whatever stuff into userspace postprocessing is not an
alien request.

You are talking of Gigabytes of data. In what time ?

Let's do some math.

For simplicity all events use 64 Byte event space.

~ 64kB/sec for 1000 events/s (event frequency 1kHz) ( 1 ms)
1024kB/sec for 16 events/ms (event frequency 16kHz) (62 us)
2048kB/sec for 32 events/ms (event frequency 32kHz) (31 us)
4096kB/sec for 64 events/ms (event frequency 64kHz) (15 us)
8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)

where a 100Mbit network can theoretically transport 10240kB/sec and
practically does 4000-8000 kB/sec.
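
The table rows can be reproduced with two one-line helpers. This sketch assumes "k" means 1024 in both kHz and kB, which is what makes the table's figures come out exactly; with k = 1000 the results differ by a few percent. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of trace data produced per second. */
static uint64_t trace_bytes_per_sec(uint64_t events_per_sec,
                                    uint64_t bytes_per_event)
{
    return events_per_sec * bytes_per_event;
}

/* Average gap between two events, in microseconds. */
static double event_period_us(uint64_t events_per_sec)
{
    assert(events_per_sec > 0);
    return 1e6 / (double)events_per_sec;
}
```

For example, 128 * 1024 events per second at 64 bytes each is 8192 * 1024 bytes per second, with one event roughly every 7.6 microseconds, matching the last row.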

An event period of 8us even on a 3 GHz machine is a complete illusion,
because we already spend a couple of usecs in servicing the legacy 8254
timer.

So the realistic assumption on a 3GHz machine is definitely below 64kHz,
which means we have to handle max. 4MB of data per second.

I'm not impressed. Disabling interrupts for a couple of nanoseconds to
store the trace data in the buffer does not hurt at all. Running through
a big bunch of out of cache line instructions does.

If you try to trace more than this amount you are toast anyway.

Please spare me the "reality has bitten" arguments. The whole if(..)
scenario in _ltt_event_log() is doing postprocessing, which can be done
in userspace. I don't care about the required time as long as it does
not introduce additional burden into the kernel.

> Also note that there are people who currently use this already,
> so there will be some unhappy campers.

Be aware that there are some unhappy campers in the kernel community too
when the special purpose tracing is included instead of a generally usable
framework.

tglx


2005-01-17 01:47:58

by Karim Yaghmour

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Thomas Gleixner wrote:
> This implies separating
>
> - infrastructure
> - event registration
> - transport mechanism

Like I said in my first response: we can't be everything for everybody,
the requirements are just too broad. ISO tried it with OSI. Have a
look at net/* for the result.

Currently, LTT provides the first two in one piece, and relayfs
provides the third. Like I acknowledged earlier, there is room for
generalizing the transport mechanism, and I'm thinking of amending
the relayfs API proposal further and rename the modes to make them
more straight-forward:
- Managed (locking or lockless.)
- Ad-Hoc (which works like Ingo, yourself, and others have requested.)

If you really want to define layers, then there are actually four
layers:
1- hooking mechanism
2- event definition / registration
3- event management infrastructure
4- transport mechanism

LTT currently does 1, 2 & 3. Clearly, as in the mail I referred to
earlier, there is code in the kernel that already does 1, 2, 3,
and 4 in very hardwired/ad-hoc fashion and there isn't anyone asking
for them to remove it. We're offering 4 separately and are putting
LTT on top of it. If you want to get 1 & 2 separately, have a look
at kernel hooks and genevent:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/
http://www.listserv.shafik.org/pipermail/ltt-dev/2003-January/000408.html

We'd gladly take a serious look at using the former if it was
included, and there is work in progress on making
the latter the standard way of declaring LTT events instead
of using a static ltt-events.h.

Five years ago, there was a discussion about integrating GKHI into
the kernel (the kernel hooks ancestor). Have a look for yourself
as to the response to this suggestion (basically people weren't
ready to accept a generalized hooking mechanism without a defined
set of hooks, and then others didn't like the idea at all because
creating general hooks in the kernel which anybody can register
to creates legal and maintenance problems ... basically it's a
can of worms):
http://marc.theaimsgroup.com/?l=linux-kernel&m=97371908916365&w=2

There's only so much we can push into the kernel at the same time.
Not to mention that before you can be generic, you've got to have
some specific implementation to start working from. I believe
that what we've ironed out through the discussion of the past
two days is a good basis.

There is some irony in all this. For years, we were told that we
couldn't make it into the kernel because we were perceived as
providing a kernel debugging tool, and now that we're starting
to get our things seriously reviewed we're being told that maybe
it ain't really that useful because those who want to do kernel
debugging can't use it as-is ... go figure.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 02:16:57

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Thomas Gleixner wrote:
> Which is every 1.42 seconds on a 3GHz machine. I guess we don't have
> GB's of data when the 1.42 seconds elapse without an event.

My argument was about being able to browse the amount of data I was
referring to. The heartbeat thing was an aside to Roman as to the
fact that we already do what he's suggesting.

> I still don't see the point. The implicit ability of LTT to allow
> tracing of up to 8192 bytes of user data, strings and XML makes this
> necessary. I do not see any necessity to integrate these special usage
> modes instead of a generically usable instrumentation implementation.

I've already clarified your mischaracterization of custom events;
you are being disingenuous here. If you want a generalized hooking
mechanism, feel free to ask Andrew to take kernel hooks:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

> If relayfs is giving those users the ability to do so then they can do
> it, but I object to the fact that LTT/relayfs is occupying the place of a
> more generic implementation in the way it is implemented now.

Again, damned if we do, damned if we don't. LTT isn't meant for kernel
debugging per se, though you can use it to that end to a certain extent.
However, if you are kernel debugging, you will find the ad-hoc mode I'm
talking about adding to relayfs quite useful.

> For normal event tracing you have about 32-64 bytes of data per event. So
> disabling interrupts in order to copy this amount of information into a
> buffer is cheaper on most architectures than doing the whole magic in
> LTT and relayfs. This also keeps your buffers consistent and does not
> need any magic for postprocessing.

Oh, now you want to lighten the weight on postprocessing? Come on, Thomas,
please stop wasting my time.

Note, however, that we are thinking of dropping the lockless scheme
for now. We will pick up this discussion separately further down the
road.

> Sorting out disabled events in the hot path and moving the if
> (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> alien request.

It is. Have you even read what I suggested to change in my other mail:
if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
return -EINVAL;

You're not honestly telling me that checking for any_filtering is
going to ruin your day.
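
The point about the guard can be shown with a small user-space sketch. The names below are hypothetical stand-ins for the fragment above (the real ltt_filter takes an event struct and data, which are omitted), and the parity policy is an arbitrary example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the names in the fragment above. */
static bool any_filtering;        /* false: no filter is installed */
static unsigned int filter_calls; /* counts how often the filter ran */

static bool ltt_filter_sketch(int event_id)
{
    assert(event_id >= 0);
    filter_calls++;
    return event_id % 2 == 0; /* example policy: keep even ids only */
}

/* Returns 0 if the event would be logged, -1 if it is filtered out. */
static int log_event_sketch(int event_id)
{
    /* && short-circuits: with any_filtering false the filter never runs,
     * so the unfiltered hot path costs one load and one branch. */
    if (any_filtering && !ltt_filter_sketch(event_id))
        return -1;
    return 0;
}
```

With filtering switched off, the per-event overhead is a single flag test; the filter function is only entered once someone has asked for filtering.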

> You are talking of Gigabytes of data. In what time ?
>
> Let's do some math.
>
> For simplicity all events use 64 Byte event space.
>
> ~ 64kB/sec for 1000 events/s (event frequency 1kHz) ( 1 ms)
> 1024kB/sec for 16 events/ms (event frequency 16kHz) (62 us)
> 2048kB/sec for 32 events/ms (event frequency 32kHz) (31 us)
> 4096kB/sec for 64 events/ms (event frequency 64kHz) (15 us)
> 8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)
>
> where a 100Mbit network can theoretically transport 10240kB/sec and
> practically does 4000-8000 kB/sec.
>
> An event period of 8us even on a 3 GHz machine is a complete illusion,
> because we already spend a couple of usecs in servicing the legacy 8254
> timer.
>
> So the realistic assumption on a 3GHz machine is definitely below 64kHz,
> which means we have to handle max. 4MB of data per second.

Actually, on a PII-350MHz, I was already generating 0.5MB/s of data
just by running an X session. If we assume that a machine 10 times
faster generates 10 times as many events, we've already got 5MB/s,
and I'm sure that there are heavier cases than X.

Here's the paper if you want to read it:
http://www.opersys.com/ftp/pub/LTT/Documentation/ltt-usenix.ps.gz

> I'm not impressed. Disabling interrupts for a couple of nanoseconds to
> store the trace data in the buffer does not hurt at all. Running through
> a big bunch of out of cache line instructions does.

Like I said above, fighting for/against lockless is not our immediate
goal, and we will likely remove it.

> If you try to trace more than this amount you are toast anyway.
>
> Please spare me the "reality has bitten" arguments. The whole if(..)
> scenario in _ltt_event_log() is doing postprocessing, which can be done
> in userspace. I don't care about the required time as long as it does
> not introduce additional burden into the kernel.

Not even Ingo hinted at getting rid of filtering. Remember the earlier
e-mail I referred to? Here's what he was suggesting:
> void trace(event, data1, data2, data3)
> {
> int cpu = smp_processor_id();
> int idx, pending, *curr = curr_idx + cpu;
> struct trace_event *t;
> unsigned long flags;
>
> if (!event_wanted(current, event, data1, data2, data3))
> return;
>
> local_irq_save(flags);
>
> idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
> pending = ++curr_pending[cpu];
>
> t = trace_ring[cpu] + idx;
>
> t->event = event;
> rdtscll(t->timestamp);
> t->data1 = data1;
> t->data2 = data2;
> t->data3 = data3;
>
> if (curr_pending == TRACE_LOW_WATERMARK && tracer_task)
> wake_up_process(tracer_task);
>
> local_irq_restore(flags);
> }

Notice the "event_wanted()"?

Original found here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103273730326318&w=2

Again, Thomas, I don't mind hearing you out, but please don't waste
my time.

> Be aware that there are some unhappy campers in the kernel community too
> when the special purpose tracing is included instead of a generally usable
> framework.

Like I said, we are willing to accommodate those who want to be able
to use relayfs for kernel debugging purposes, but we can hardly
be blamed for not making LTT a generic kernel debugging tool as this
is exactly the excuse many kernel developers had for not including
LTT to start with. It's just totally disingenuous to give us
grief for claiming that we are doing something and then later turn
around and blame us for not doing it ... sheesh ...

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 06:47:14

by S. P. Prasanna

Subject: Re: 2.6.11-rc1-mm1

Hi Karim,

> Thomas Gleixner wrote:
>> It's not only me who needs constant time. Everybody interested in
>> tracing will need that. In my opinion it's a principle of tracing.
>
> relayfs is a generalized buffering mechanism. Tracing is one application
> it serves. Check out the web site: "high-speed data-relay filesystem."
> Fancy name huh ...
>
>> The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
>> locks by do { } while loops. So what ?
>

How about combining the "buffering mechanism of relayfs" and the
"kernel -> user space transport by debugfs"?
This will also remove lots of complicated code from relayfs.

Thanks
Prasanna
--

Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<[email protected]>

2005-01-17 10:26:48

by Thomas Gleixner

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

On Sun, 2005-01-16 at 20:54 -0500, Karim Yaghmour wrote:

> If you really want to define layers, then there are actually four
> layers:
> 1- hooking mechanism
> 2- event definition / registration
> 3- event management infrastructure
> 4- transport mechanism
>
> LTT currently does 1, 2 & 3. Clearly, as in the mail I referred to
> earlier, there is code in the kernel that already does 1, 2, 3,
> and 4 in very hardwired/ad-hoc fashion and there isn't anyone asking
> for them to remove it. We're offering 4 separately and are putting
> LTT on top of it. If you want to get 1 & 2 separately, have a look
> at kernel hooks and genevent:

I know that there is enough code which does x,y,z hardcoded/hardwired
already.

That's the point. Adding another hardwired implementation does not give
us a possibility to solve the hardwired problem of the already available
stuff.

tglx


2005-01-17 12:20:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Sun, 2005-01-16 at 21:24 -0500, Karim Yaghmour wrote:

> > Sorting out disabled events in the hot path and moving the if
> > (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> > alien request.
>
> It is. Have you even read what I suggested to change in my other mail:
>     if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
>         return -EINVAL;

Sorting out disabled events is the filtering you have to do in kernel
and you should do it in the hot path or remove the unnecessary
tracepoints at compile time.

> > 4096kB/sec for 64 events/ms (event frequency 64kHz) (15 us)
> > 8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)

> Actually, on a PII-350MHz, I was already generating 0.5MB/s of data
> just by running an X session. If we assume that a machine 10 times
> faster generates 10 times as many events, we've already got 5MB/s,
> and I'm sure that there are heavier cases than X.

You are not answering my argument. 8MB/sec is an event frequency of
128kHz when we assume 64 bytes/event. It's one event every 8us. So every
unnecessary computation, every leaving the hotpath for nothing is just
giving you performance loss.
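
The arithmetic behind these figures can be checked with a small C sketch.
It assumes fixed 64-byte events and binary units (1kB = 1024 bytes), which
is what makes 8192kB/sec come out at 128 events/ms. The helper names are
made up for illustration and are not part of LTT or relayfs.

```c
#include <assert.h>
#include <stdint.h>

/* Events per second for a given trace bandwidth and a fixed event size. */
static uint64_t events_per_sec(uint64_t bytes_per_sec, uint32_t event_size)
{
    assert(event_size > 0);
    return bytes_per_sec / event_size;
}

/* Average gap between two consecutive events, in nanoseconds (truncated). */
static uint64_t ns_per_event(uint64_t events_per_second)
{
    assert(events_per_second > 0);
    return 1000000000ULL / events_per_second;
}
```

With these, 4096kB/sec gives 65536 events/sec (roughly 15us per event) and
8192kB/sec gives 131072 events/sec (roughly 8us per event), which matches
the table quoted above.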

> Not even Ingo hinted at getting rid of filtering. Remember the earlier
> e-mail I referred to? Here's what he was suggesting:

I said:
> > Sorting out disabled events in the hot path

s/Sorting/Filtering/

I never said this should not be done.

> Like I said, we are willing to accommodate those who want to be able
> to use relayfs for kernel debugging purposes, but we can hardly
> be blamed for not making LTT a generic kernel debugging tool as this
> is exactly the excuse many kernel developers had for not including
> LTT to start with. It's just totally disingenuous to give us
> grief for claiming that we are doing something and then later turn
> around and blame us for not doing it ... sheesh ...

Separating layers as I suggested before is not making it a generic
debugging tool. It makes parts of those layers available for other usage
and gives us the chance to reuse the parts for cleaning up already
available code which has the same hardwired structure.

tglx



2005-01-17 13:55:21

by Roman Zippel

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,

On Sun, 16 Jan 2005, Karim Yaghmour wrote:

> > You can make it even simpler by dropping this completely. Every buffer is
> > simply a list of events and you can let ltt write periodically a timer
> > event. In userspace you can randomly seek at buffer boundaries and search
> > for the timer events. It will require a bit more work for userspace, but
> > even large amounts of tracing data stay manageable.
>
> We already do write a heartbeat event periodically to have readable
> traces in the case where the lower 32 bits of the TSC wrap-around.
>
> As I mentioned elsewhere, please don't think of this in terms of
> kbs or mbs of data. What we're talking about here is gbs if not
> 100gbs of data. Having to start reading each sub-buffer until you
> hit a heartbeat really is a killer for such large traces. If there
> was a significant impact on relayfs for having this I would have
> understood the argument, but relayfs needs to do buffer-management
> anyway, so I don't see that much complexity being added by allowing
> the channel user to ask relayfs for delimiters.

Periodically can also mean a buffer start call back from relayfs
(although that would mean the first entry is not guaranteed) or a
(per cpu) eventcnt from the subsystem. The amount of needed search would
be limited. The main point is from the relayfs POV the buffer structure
has always the same (simple) structure.
You have to be more specific, what's so special about this amount of data.
You likely want to (incrementally) build an index file, so you don't have
to repeat the searches, but even with your current format you would
benefit from such an index file.

> > Userspace can then easily restore the original order of events.
>
> As above, restoring the original order of events is fine if you are
> looking at mbs or kbs of data. It's just totally unrealistic for
> the amounts of data we want to handle.

Why is it "totally unrealistic"?

> But like I said earlier, the added relayfs mode (kdebug) would allow
> for exactly what you are suggesting:
> event_id = atomic_inc_return(&event_cnt);

Actually that would be already too much for low level kernel debugging.
Why do you want to put this into relayfs?
What are the _specific_ reasons you need these various modes, why can't
you build any special requirements on top of a very light weight relay
mechanism?

bye, Roman

2005-01-17 15:50:26

by Robert Wisniewski

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Arjan van de Ven writes:
> On Sun, 2005-01-16 at 16:06 -0500, Robert Wisniewski wrote:
>
> > :-) - as above. Furthermore, it seems that reducing the places where
> > interrupts are disabled would be a good thing?
>
> depends on the price. On several cpus, disabling interrupts is hundreds
> of times cheaper than doing an atomic op.

Wow - disabling interrupts is handfuls to tens of cycles, so that means
some architectures take thousands of cycles to do atomic operations. Then
I would definitely agree we should not be using atomic operations on those.
FWIW, out of curiosity, which archs make atomic ops so expensive?

Andrew, on the broader note. If the community feels disabling interrupts
is the better way to go for the variables (I think it's index and count) we
were protecting with atomic ops then as the code stands things should be
fine with that approach and we can make that change.

Thanks for your attention to looking through this.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[email protected]

2005-01-17 16:13:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Mon, Jan 17, 2005 at 10:48:52AM -0500, Robert Wisniewski wrote:
> Wow - disabling interrupts is handfuls to tens of cycles, so that means
> some architectures take thousands of cycles to do atomic operations. Then
> I would definitely agree we should not be using atomic operations on those.
> FWIW, out of curiosity, which archs make atomic ops so expensive?
>
> Andrew, on the broader note. If the community feels disabling interrupts
> is the better way to go for the variables (I think it's index and count) we
> were protecting with atomic ops then as the code stands things should be
> fine with that approach and we can make that change.

The thing I'm unhappy with is what the code does currently. I haven't
looked at the code enough nor thought about the problem enough to tell
you what's the right thing to do. Knowing that will involve review of
the architecture and serious benchmarking on a few platforms.

2005-01-17 16:16:39

by Tom Zanussi

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour writes:
>
> Hello Roman,
>
>
> What we are dropping for later review: read/write semantics from
> user-space. It has to be understood that we believe that this is
> a major drawback. For one thing, you won't be able to do something
> like:
> $ cat /relayfs/xchg/my-file > ~/test-data
>
> Instead, you will have to write a custom app that does open(),
> mmap(), write(). We could still provide a small app/library that
> did this automagically, but you've got to admit that nothing
> beats the real thing.
>

Maybe we could use FUSE to provide read()/write() for relayfs files -
opening a FUSE relayfs file would open and mmap the actual relayfs
file, read() would move around in the buffer using basically the
current relayfs read logic moved down into the FUSE filesystem read
fileop, and write() could write directly to the buffer...

Tom

> Also note that there are people who currently use this already,
> so there will be some unhappy campers.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

--
Regards,

Tom Zanussi <[email protected]>
IBM Linux Technology Center/RAS

2005-01-17 17:14:19

by Matthias Urlichs

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi, Andrew Morton wrote on Fri, 14 Jan 2005 10:35:34 -0800:

> What filesystem(s) do you use, and why?

sshfs (best idea for file access through firewalls).
gmailfs (best free off-site backup facility).
Will use encfs as soon as FUSE is in mainline
(I'm using cryptoloop now, but that's not sanely backupable.)

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [email protected]


2005-01-17 20:25:49

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Thomas Gleixner wrote:
> Sorting out disabled events is the filtering you have to do in kernel
> and you should do it in the hot path or remove the unnecessary
> tracepoints at compile time.

Do you actually read my replies or do you just grep for something
you can object to? If you care to read my replies you will see that
this has already been answered.

> You are not answering my argument. 8MB/sec is an event frequency of
> 128kHz when we assume 64 bytes/event. It's one event every 8us. So every
> unnecessary computation, every leaving the hotpath for nothing is just
> giving you performance loss.

I have, you just choose not to read. Here's what I said earlier:
> Note, however, that we are thinking of dropping the lockless scheme
> for now. We will pick up this discussion separately further down the
> road.

IOW, we will be using cli/sti. So there is no "leaving the hotpath".

> I said:
>
>>>Sorting out disabled events in the hot path
>
>
> s/Sorting/Filtering/
>
> I never said this should not be done.

You're either on crack or I don't know how to read English. Here's what
you said:
> Sorting out disabled events in the hot path and moving the if
> (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> alien request.

Clearly you are suggesting moving the filtering into user-space.

> Separating layers as I suggested before is not making it a generic
> debugging tool. It makes parts of those layers available for other usage
> and gives us the chance to reuse the parts for cleaning up already
> available code which has the same hardwired structure.

This has already been answered.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 20:27:16

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Thomas Gleixner wrote:
> That's the point. Adding another hardwired implementation does not give
> us a possibility to solve the hardwired problem of the already available
> stuff.

Well then, like I said before, you know what you need to do:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 21:20:41

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> Periodically can also mean a buffer start call back from relayfs
> (although that would mean the first entry is not guaranteed) or a
> (per cpu) eventcnt from the subsystem. The amount of needed search would
> be limited. The main point is from the relayfs POV the buffer structure
> has always the same (simple) structure.

But two e-mails ago, you told us to drop the start_reserve and end_reserve
and move the details of the buffer management into relayfs and out of
ltt? Either we have a callback, like you suggest, and then we need to
reserve some space to make sure that the callback is guaranteed to have
the first entry, or we drop the callback and provide an option to the
user for relayfs to write this first entry for him. Providing a callback
without reservation is no different than relying purely on the heartbeat,
which, like I said before and for the reasons illustrated below, is
unrealistic.

> You have to be more specific, what's so special about this amount of data.
> You likely want to (incrementally) build an index file, so you don't have
> to repeat the searches, but even with your current format you would
> benefit from such an index file.
[snip]
>>As above, restoring the original order of events is fine if you are
>>looking at mbs or kbs of data. It's just totally unrealistic for
>>the amounts of data we want to handle.
>
>
> Why is it "totally unrealistic"?

Ok, let's expand a little here on the amount of data. Say you're getting
2MB/s of data (which is not unrealistic on a loaded system.) That means
that if I'm tracing for 2 days, I've got 345GB of data (~7.2GB/hour).
In practice, users aren't necessarily interested in plowing through the
entire 345GB, they just want to view a given portion of it. Now, if I
follow what you are suggesting, I have to go through the entire 345GB to:
a) create indexes, b) reorder events, and likely c) have to rewrite
another 345GB of data. And I haven't yet discussed the kind of problems
you would encounter in trying to reorder such a beast that contains,
by definition, variable-sized events. For one thing, if event N+1 doesn't
follow N, then you would be forced to browse forward until you actually
found it before you could write a properly ordered trace. And it just
takes a few processes that are interrupted and forced to sleep here and
there to make this unusable. That's without the RAM or fs space required
to store those index tables ... At 3 to 12 bytes per event, that's a lot
of space for indexes ...

If I keep things as they are with ordered events and delimiters on buffer
boundaries, I can skip to any place within this 345GB and start processing
from there.
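
A rough sketch of what "skip to any place" could look like for a trace
reader, assuming fixed-size sub-buffers whose start delimiter carries a
timestamp. The function names and the in-memory begin_ts array are
hypothetical; a real reader would fetch each timestamp from the file at
subbuf_offset(i) instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of sub-buffer idx in a trace made of fixed-size sub-buffers. */
static uint64_t subbuf_offset(size_t idx, uint64_t subbuf_size)
{
    return (uint64_t)idx * subbuf_size;
}

/*
 * begin_ts[i] is the timestamp stored in the start delimiter of
 * sub-buffer i, in ascending order. Returns the index of the last
 * sub-buffer whose start timestamp is <= t, or 0 if t precedes the trace.
 */
static size_t find_subbuf(const uint64_t *begin_ts, size_t n, uint64_t t)
{
    size_t lo = 0, hi = n;

    assert(n > 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (begin_ts[mid] <= t)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

The binary search touches only about log2(n) delimiters, so locating a
point in time does not require reading the events in between.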

And that's for two days. If you're a sysadmin encountering a transient
problem on a server, you may actually want more than that.

>>But like I said earlier, the added relayfs mode (kdebug) would allow
>>for exactly what you are suggesting:
>> event_id = atomic_inc_return(&event_cnt);
>
>
> Actually that would be already too much for low level kernel debugging.
> Why do you want to put this into relayfs?

I don't. I was just saying that with the adhoc mode, a relayfs client
could use the code snippet you were suggesting.

> What are the _specific_ reasons you need these various modes, why can't
> you build any special requirements on top of a very light weight relay
> mechanism?

Because of the opposite requirements.

Here are the two modes I'm suggesting in relayfs and how they operate:

Managed:
- Presumes active user-space daemon interested in catching _all_ events.
- Allows N buffers in buffer ring
- Provides limit-checking (callback on end of sub-buffer)
- Provides buffer delimiters (writes timestamp at beg and end)
- Suited for all types of event sizes (both fixed and variable) at
very high frequency.
- Daemon is woken up when buffer is ready for writing, executes a
write() on an mmaped area and notifies relevant kernel subsystem,
which in turn notifies relayfs that buffer can now be reused.
- Relies on proper abstraction of cli/sti.

Ad-Hoc:
- Presumes transient userspace tool interested in event snapshots.
- Single circular buffer.
- No limits checking (or very basic: as in stop if overwrite).
- No buffer delimiters.
- Best suited for fixed-size events at extreme high frequency.
- User-space tool simply does a write() on an mmaped area and
exits or goes back to sleep.
- Relies on proper abstraction of cli/sti.

Basically, the ad-hoc mode abides by the principles of KISS, whereas
the managed mode is more elaborate, for clients like LTT.
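
As an illustration of the ad-hoc mode only, here is a minimal userspace
sketch of a single circular buffer of fixed-size events that overwrites
the oldest entries and can be snapshotted at any time. The struct and
function names are invented for this example, the buffer is tiny on
purpose, and no locking or cli/sti abstraction is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 4u   /* tiny on purpose, for illustration */

struct ring {
    uint64_t slot[RING_SLOTS];
    size_t   head;      /* total number of events ever written */
};

/* Write one fixed-size event, overwriting the oldest one when full. */
static void ring_put(struct ring *r, uint64_t ev)
{
    assert(r != NULL);
    r->slot[r->head % RING_SLOTS] = ev;
    r->head++;
}

/* Copy the most recent events into out, oldest first; returns the count. */
static size_t ring_snapshot(const struct ring *r, uint64_t *out)
{
    size_t n = r->head < RING_SLOTS ? r->head : RING_SLOTS;
    size_t first = r->head - n;

    for (size_t i = 0; i < n; i++)
        out[i] = r->slot[(first + i) % RING_SLOTS];
    return n;
}
```

There are no sub-buffer switches, callbacks or delimiters on the write
path here, which is the point of keeping this mode separate.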

Rhetorical: Couldn't the ad-hoc mode case be a special case of the
managed mode? In theory yes, in practice no. The various conditionals
and code paths for switching buffers, invoking callbacks, writing
delimiters and the like, which make this mode useful to clients like
LTT, will always be a problem for those seeking the shortest path to
buffer committal. In the case of Ingo, for example, I'm sure he'd
probably go in the code and "#if 0" it to make sure it doesn't slow
him down.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 21:31:29

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Christoph,

Christoph Hellwig wrote:
> The thing I'm unhappy with is what the code does currently. I haven't
> looked at the code enough nor thought about the problem enough to tell
> you what's the right thing to do. Knowing that will involve review of
> the architecture and serious benchmarking on a few platforms.

Like I was saying elsewhere, we are likely going to drop the lockless
code for now (i.e. the code that does the cmpxchg). Instead we will
depend on normal cli/sti abstractions.
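
For readers following along, this is roughly the shape of a
cmpxchg-based reservation loop, written as a userspace model with C11
atomics. It is not the relayfs code; struct chan and reserve_lockless()
are hypothetical names. It does show why the scheme was described earlier
in the thread as replacing locks with do { } while loops.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct chan {
    _Atomic size_t offset;  /* next free byte in the buffer */
    size_t size;            /* total buffer size in bytes */
};

/* Reserve len bytes; returns the start offset, or (size_t)-1 if full. */
static size_t reserve_lockless(struct chan *c, size_t len)
{
    size_t old, new;

    assert(len > 0);
    old = atomic_load(&c->offset);
    do {
        if (len > c->size - old)
            return (size_t)-1;
        new = old + len;
        /* On failure, old is reloaded with the current offset and we retry. */
    } while (!atomic_compare_exchange_weak(&c->offset, &old, new));
    return old;
}
```

The cli/sti variant would instead do the same plain read-modify-write
with local interrupts disabled, which only protects against writers on
the same CPU and therefore fits per-CPU buffers.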

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 22:26:06

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Fri, Jan 14, 2005 at 06:58:10PM -0800, William Lee Irwin III wrote:
> No idea what hit me just yet. x86-64 doesn't boot. Still going through
> the various architectures. The same system (including the initrd FPOS
> bullcrap, though, of course, I'm using an initrd built just for this
> kernel) boots various 2.6.x up to 2.6.10-mm1. There are vague indications
> something in/around SCSI and/or initrd's has violently exploded in my face.

With the waiting 10s patch backed out, things seem to be going well:

$ ssh analyticity
Last login: Mon Jan 17 14:03:13 2005 from meromorphy
Linux analyticity 2.6.11-rc1-mm1 #5 SMP Sat Jan 15 01:25:23 PST 2005 sparc64 GNU/Linux
$ uptime
14:10:55 up 10 min, 7 users, load average: 0.10, 0.40, 0.31

Now I just have to remember to set up ip route add 192.168.1.0/24 dev
eth3 via 192.168.1.1 instead of just ip route add 192.168.1.0/24 dev
eth3 so I can tftpboot the thing (well, it took all of 10s to figure
out, but it may not next time). Routing changes are painful.


-- wli

2005-01-17 22:38:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Mon, 2005-01-17 at 15:32 -0500, Karim Yaghmour wrote:
> You're either on crack or I don't know how to read english. Here's what
> you said:

Maybe you should read your own comment about ad-hominem attacks earlier
in this thread and consider if it might apply to you.

I know what I have said. I said reduce the filtering to the absolute
minimum and do the rest in userspace.

The now builtin filters are defined to fit somebody's needs or idea of
what the user should / wants to see. They will not fit everybody's
needs / ideas. So we start modifying, adding and #ifdefing kernel
filters, which is a scary vision.

Enabling and disabling events is a valid basic filter request, which
should live in the kernel. Anything else should go into userspace, IMO.

tglx


2005-01-17 22:59:48

by Robert Wisniewski

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Thomas Gleixner writes:
> On Mon, 2005-01-17 at 15:32 -0500, Karim Yaghmour wrote:
> > You're either on crack or I don't know how to read english. Here's what
> > you said:
>
> Maybe you should read your own comment about ad-hominem attacks earlier
> in this thread and consider if it might apply to you.
>
> I know what I have said. I said reduce the filtering to the absolute
> minimum and do the rest in userspace.
>
> The now builtin filters are defined to fit somebody's needs or idea of
> what the user should / wants to see. They will not fit everybody's
> needs / ideas. So we start modifying, adding and #ifdefing kernel
> filters, which is a scary vision.
>
> Enabling and disabling events is a valid basic filter request, which
> should live in the kernel. Anything else should go into userspace, IMO.
>
> tglx

I believe (and Karim can correct me if I'm wrong) the idea is to have
groups of events that can be disabled and enabled via a one word mask. No
checking multiple variables, no #ifdefing, something very streamlined. By
userspace I assume you mean post-processing, i.e., if the user/library/etc
needs to log events they use the same simple facility.
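
A minimal sketch of such a one-word mask, assuming at most 32 event
groups. The group names and helper functions are hypothetical and are
not taken from the LTT patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event groups, one bit each. */
enum { EV_GROUP_SCHED = 0, EV_GROUP_IRQ = 1, EV_GROUP_FS = 2 };

static uint32_t enabled_groups;     /* one word, one bit per group */

static void group_enable(unsigned g)
{
    assert(g < 32);
    enabled_groups |= UINT32_C(1) << g;
}

static void group_disable(unsigned g)
{
    assert(g < 32);
    enabled_groups &= ~(UINT32_C(1) << g);
}

/* Hot-path check: a single load, shift and mask. */
static int group_enabled(unsigned g)
{
    assert(g < 32);
    return (enabled_groups >> g) & 1u;
}
```

A trace point would call group_enabled() first and return immediately
when its group is off, so disabled events cost one test.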

I think we agree to optimize/streamline performance for the gathering and
do work in the post processing. There is an outstanding patch that makes
strides in this direction.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[email protected]

2005-01-17 23:28:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

On Mon, 2005-01-17 at 15:34 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > That's the point. Adding another hardwired implementation does not give
> > us a possibility to solve the hardwired problem of the already available
> > stuff.
>
> Well then, like I said before, you know what you need to do:
> http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

Oh, I guess my English must be really bad.

I was talking about separation of layers, so why do I need
kernelhooks ?

The separation of layers makes it possible to actually reuse
functionality and gives the possibility that existing hardwired stuff
can be cleaned up to use the new functionality too.

If we add another hardwired implementation then we do not have said
benefits.

tglx



2005-01-17 23:36:25

by J.A. Magallon

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


On 2005.01.16, Daniel Drake wrote:
> Hi,
>
> Joseph Fannin wrote:
> > On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> >
> >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
> >
> >
> >>waiting-10s-before-mounting-root-filesystem.patch
> >> retry mounting the root filesystem at boot time
> >
> >
> > With this patch, initrds seem to get 'skipped'. I think this is
> > probably the cause for the reports of problems with RAID too.
>
> This patch should do the job. Replaces the existing
> waiting-10s-before-mounting-root-filesystem.patch in 2.6.11-rc1-mm1.
>
> Daniel
>

> Retry up to 20 times if mounting the root device fails. This fixes booting
> from usb-storage devices, which no longer make their partitions immediately
> available. Also cleans up the mount_block_root() function.
>
> Based on an earlier patch from William Park <[email protected]>
>
> Signed-off-by: Daniel Drake <[email protected]>
>

This does not patch against -mm1. -mm1 looks like a mix of plain 2.6.10
and your code.
Could you revamp it against -mm1, please ? I looked at it but seems out
of my understanding...

TIA

--
J.A. Magallon <jamagallon()able!es> \ Software is like sex:
werewolf!able!es \ It's better when it's free
Mandrakelinux release 10.2 (Cooker) for i586
Linux 2.6.10-jam4 (gcc 3.4.3 (Mandrakelinux 10.2 3.4.3-3mdk)) #2


2005-01-17 23:41:20

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Thomas Gleixner wrote:
> I know what I have said. I said reduce the filtering to the absolute
> minimum and do the rest in userspace.

You keep adopting the interpretation which best suits you, taking
quotes out of context, and keep repeating things that have already
been answered. There are limits to one's patience.

What you did is change your position twice. It's there for anyone to see.

> The now builtin filters are defined to fit somebody's needs or idea of
> what the user should / wants to see. They will not fit everybody's
> needs / ideas. So we start modifying, adding and #ifdefing kernel
> filters, which is a scary vision.

Ah, finally. Here's an actual suggestion. _IF_ you want, I'll just
export a ltt_set_filter(*callback) and rewrite the if in
_ltt_log_event() to:
    if ((ltt_filter != NULL) && !(ltt_filter(event_id, event_struct, data)))
        return -EINVAL;

You're always welcome to do the following from anywhere in your code:
ltt_set_filter(NULL);
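
To make the proposal concrete, here is a userspace model of such a
filter hook. Only the names ltt_set_filter and ltt_filter come from the
text above; the callback signature, log_event() and the even_only()
example filter are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed callback type: returns non-zero to keep the event. */
typedef int (*ltt_filter_fn)(int event_id, void *event_struct, void *data);

static ltt_filter_fn ltt_filter;    /* NULL means "no filtering" */

static void ltt_set_filter(ltt_filter_fn fn)
{
    ltt_filter = fn;
}

/* Stand-in for the logging path; the real one would write to the channel. */
static int log_event(int event_id, void *event_struct, void *data)
{
    assert(event_id >= 0);
    if (ltt_filter != NULL && !ltt_filter(event_id, event_struct, data))
        return -EINVAL;
    /* ... the event would be committed to the buffer here ... */
    return 0;
}

/* Example filter: keep only even event ids. */
static int even_only(int event_id, void *event_struct, void *data)
{
    (void)event_struct;
    (void)data;
    return (event_id % 2) == 0;
}
```

With no filter registered the hot path pays for a single NULL test,
and ltt_set_filter(NULL) restores that state at any time.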

> Enabling and disabling events is a valid basic filter request, which
> should live in the kernel. Anything else should go into userspace, IMO.

What you are suggesting is that a system administrator that wants to
monitor his sendmail server over a period of three weeks should
just postprocess 1.8TB (1MB/s) of data because Thomas Gleixner didn't
like the idea of kernel event filtering based on anything but events.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 23:33:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Mon, 2005-01-17 at 17:42 -0500, Robert Wisniewski wrote:

> I believe (and Karim can correct me if I'm wrong) the idea is to have
> groups of events that can be disabled and enabled via a one word mask. No
> checking multiple variables, no #ifdefing, something very streamlined. By
> userspace I assume you mean post-processing, i.e., if the user/library/etc
> needs to log events they use the same simple facility.

Yes, I was talking about postprocessing in userspace.

The logging of userspace events is a completely separate issue. You have
to solve the timestamp problem and do the correlation to kernel events
in the postprocessing.

> I think we agree to optimize/streamline performance for the gathering and
> do work in the post processing. There is an outstanding patch that makes
> strides in this direction.

Ack.

Have you any plans to separate the layers into different pieces, so they
provide better reusability ?

tglx


2005-01-17 23:56:57

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Thomas Gleixner wrote:
> If we add another hardwired implementation then we do not have said
> benefits.

Please stop handwaving. Folks like Andrew, Christoph, Zwane, Roman,
and others actually made specific requests for changes in the code.
What makes you think you're so special that you are entitled to
stay on the side and handwave about concepts?

If there is a limitation with the code, please present actual
snippets that need to be changed and suggest alternatives. That's
what everyone else does on this list.

If you want to clean-up the existing tracing code in the kernel,
then here are some ltt calls you may be interested in:
    int ltt_create_event(char *event_type,
                         char *event_desc,
                         int format_type,
                         char *format_data);
    int ltt_log_raw_event(int event_id, int event_size, void *event_data);

And here's an actual example:
    ...
    delta_id = ltt_create_event("Delta",
                                NULL,
                                CUSTOM_EVENT_FORMAT_TYPE_HEX,
                                NULL);
    ...
    ltt_log_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event);
    ...
    ltt_destroy_event(delta_id);

You can then use LibLTT to read the trace and extract your custom
events and format your binary data as it suits you.

Save the bandwidth and start cleaning.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-17 23:58:16

by Roman Zippel

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> > Periodically can also mean a buffer start call back from relayfs
> > (although that would mean the first entry is not guaranteed) or a
> > (per cpu) eventcnt from the subsystem. The amount of needed search would
> > be limited. The main point is from the relayfs POV the buffer structure
> > has always the same (simple) structure.
>
> But two e-mails ago, you told us to drop the start_reserve and end_reserve
> and move the details of the buffer management into relayfs and out of
> ltt? Either we have a callback, like you suggest, and then we need to
> reserve some space to make sure that the callback is guaranteed to have
> the first entry, or we drop the callback and provide an option to the
> user for relayfs to write this first entry for him. Providing a callback
> without reservation is no different than relying purely on the heartbeat,
> which, like I said before and for the reasons illustrated below, is
> unrealistic.

Why is it so important that it's at the start of the buffer? What's wrong
with a special event _near_ the start of a buffer?

> > Why is it "totally unrealistic"?
>
> Ok, let's expand a little here on the amount of data. Say you're getting
> 2MB/s of data (which is not unrealistic on a loaded system.) That means
> that if I'm tracing for 2 days, I've got 345GB of data (~7.2GB/hour).
> In practice, users aren't necessarily interested in plowing through the
> entire 345GB, they just want to view a given portion of it. Now, if I
> follow what you are suggesting, I have to go through the entire 345GB to:
> a) create indexes, b) reorder events, and likely c) have to rewrite
> another 345GB of data. And I haven't yet discussed the kind of problems
> you would encounter in trying to reorder such a beast that contains,
> by definition, variable-sized events. For one thing, if event N+1 doesn't
> follow N, then you would be forced to browse forward until you actually
> found it before you could write a properly ordered trace. And it just
> takes a few processes that are interrupted and forced to sleep here and
> there to make this unusable. That's without the RAM or fs space required
> to store those index tables ... At 3 to 12 bytes per event, that's a lot
> of space for indexes ...
>
> If I keep things as they are with ordered events and delimiters on buffer
> boundaries, I can skip to any place within this 345GB and start processing
> from there.

What gives you the idea that you can't do this with what I proposed?
You can still seek freely within the data at buffer boundaries and you
only have to search a little into the buffer to find the delimiter. Events
are not completely at random, so that the little reordering can be done at
runtime. Sorry, but I don't get what kind of unsolvable problems you see
here.
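
A sketch of the search described here: starting at a buffer boundary,
walk the variable-sized events until the first timer event is found. The
event header layout and the EV_TIMER id are assumptions for this example
and are not the LTT trace format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EV_TIMER 1u     /* hypothetical id of the periodic timer event */

struct ev_hdr {
    uint16_t type;
    uint16_t size;      /* total event size in bytes, header included */
};

/* Returns the offset of the first timer event in buf, or len if none. */
static size_t find_timer_event(const unsigned char *buf, size_t len)
{
    size_t off = 0;

    assert(buf != NULL);
    while (len - off >= sizeof(struct ev_hdr)) {
        struct ev_hdr h;

        memcpy(&h, buf + off, sizeof h);
        if (h.size < sizeof h || h.size > len - off)
            break;      /* corrupt or truncated event, stop scanning */
        if (h.type == EV_TIMER)
            return off;
        off += h.size;
    }
    return len;
}
```

The cost of the scan is bounded by how often the timer event is written,
which is the trade-off under discussion.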

> Rhetorical: Couldn't the ad-hoc mode case be a special case of the
> managed mode?

Wrong question. What compromises can be made on both sides to create a
common simple framework? Your unwillingness to compromise a little on the
ltt requirements really amazes me.

bye, Roman

2005-01-18 00:02:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Mon, 2005-01-17 at 18:41 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > I know what I have said. I said reduce the filtering to the absolute
> > minimum and do the rest in userspace.
>
> You keep adopting the interpretation which best suits you, taking
> quotes out of context, and keep repeating things that have already
> been answered. There are limits to one's patience.

I said before: "Sorting out disabled events is the filtering you
have to do in kernel and you should do it in the hot path or
remove the unnecessary tracepoints at compile time."

This is exactly what I stated above. I omitted the addon of "do the rest
in userspace", as it was obvious enough.

> What you did is change your position twice. It's there for anyone to see.

Sorry, I didn't know that you are representing anyone.

> > The now builtin filters are defined to fit somebody's needs or idea of
> > what the user should / wants to see. They will not fit everybody's
> > needs / ideas. So we start modifying, adding and #ifdefing kernel
> > filters, which is a scary vision.
>
> Ah, finally. Here's an actual suggestion. _IF_ you want, I'll just
> export a ltt_set_filter(*callback) and rewrite the if in
> _ltt_log_event() to:
> if ((ltt_filter != NULL) && !(&ltt_filter(event_id, event_struct, data)))
> return -EINVAL;
>
> You're always welcome to do the following from anywhere in your code:
> ltt_set_filter(NULL);

Provide a hook, export it and load your filters as a module, but keep
the filters out of the mainline kernel code.
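
For illustration only, here is a minimal userspace sketch of the hook
arrangement discussed above. ltt_set_filter() and the ltt_filter pointer
are the names used in this thread; the callback signature, the return
convention (nonzero keeps the event), ltt_log_event_sketch() and
keep_even_events() are assumptions made up for the example, not code
from LTT.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed callback type: return nonzero to keep the event, 0 to drop it. */
typedef int (*ltt_filter_fn)(unsigned int event_id, void *event_struct,
                             void *data);

/* NULL means that no filter module is currently loaded. */
static ltt_filter_fn ltt_filter;

/* The exported hook: a filter module registers itself here, NULL removes it. */
void ltt_set_filter(ltt_filter_fn fn)
{
    ltt_filter = fn;
}

/* Hypothetical stand-in for _ltt_log_event(); only the filter check is modelled. */
int ltt_log_event_sketch(unsigned int event_id, void *event_struct, void *data)
{
    if (ltt_filter != NULL && !ltt_filter(event_id, event_struct, data))
        return -EINVAL;   /* rejected by the loaded filter */
    return 0;             /* the event would be written to the trace here */
}

/* Hypothetical filter a module might supply: keep only even event IDs. */
int keep_even_events(unsigned int event_id, void *event_struct, void *data)
{
    (void)event_struct;
    (void)data;
    return (event_id % 2) == 0;
}
```

With no filter set every event passes; after
ltt_set_filter(keep_even_events) odd event IDs get -EINVAL, and
ltt_set_filter(NULL) restores the unfiltered behaviour.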

> > Enabling and disabling events is a valid basic filter request, which
> > should live in the kernel. Anything else should go into userspace, IMO.
>
> What you are suggesting is that a system administrator that wants to
> monitor his sendmail server over a period of three weeks should
> just postprocess 1.8TB (1MB/s) of data because Thomas Gleixner didn't
> like the idea of kernel event filtering based on anything but events.

A real common scenario with a broad range of users. And everybody has to
like the idea of hardwired filters in the kernel code to make the life
of this sysadmin easier.

See above about hooks.

Maybe some simple pipe would be helpful too:
read_stream | prefilter | buildbuffers | storeit

tglx


2005-01-18 00:04:31

by Daniel Drake

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

J.A. Magallon wrote:
> This does not patch against -mm1. -mm1 looks like a mix of plain 2.6.10
> and your code.
> Could you revamp it against -mm1, please ? I looked at it but seems out
> of my understanding...

My patch replaces the one in -mm1.

Just revert the waiting-10s-... patch that is in 2.6.11-rc1-mm1 using patch -p1 -R
Then apply the one I attached to the last mail normally.

I'll also be sending in a cleaner version of the patch shortly.

Daniel

2005-01-18 00:23:21

by Daniel Drake

[permalink] [raw]
Subject: [PATCH] Wait and retry mounting root device (revised)

Retry up to 20 times if mounting the root device fails. This fixes booting
from usb-storage devices, which no longer make their partitions immediately
available.

This should allow booting from root=/dev/sda1 and root=8:1 style parameters,
whilst not breaking booting from RAID or initrd :)
I have also cleaned up the mount_block_root() function a bit.

Based on an earlier patch from William Park <[email protected]>
Replaces the existing waiting-10s-before-mounting-root-filesystem.patch patch
in 2.6.11-rc1-mm1

Signed-off-by: Daniel Drake <[email protected]>


Attachments:
boot-delay-retry-v3.patch (2.67 kB)

2005-01-18 00:34:17

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

On Tue, Jan 18, 2005 at 02:54:24AM +0000, Daniel Drake wrote:
> Retry up to 20 times if mounting the root device fails. This fixes booting
> from usb-storage devices, which no longer make their partitions immediately
> available.

Sigh... So we can very well get device coming up in the middle of a loop
and get the actual attempts to mount the sucker in wrong order. How nice...

Folks, that's not a solution. And kludges like that really have no
business being there - they only hide the problem and make it harder
to reproduce.

2005-01-18 00:38:58

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

Al Viro wrote:
> On Tue, Jan 18, 2005 at 02:54:24AM +0000, Daniel Drake wrote:
>
>>Retry up to 20 times if mounting the root device fails. This fixes booting
>>from usb-storage devices, which no longer make their partitions immediately
>>available.
>
>
> Sigh... So we can very well get device coming up in the middle of a loop
> and get the actual attempts to mount the sucker in wrong order. How nice...
>
> Folks, that's not a solution. And kludges like that really have no
> business being there - they only hide the problem and make it harder
> to reproduce.

Is there a solution other than initrd/initramfs ?

Thanks,
--
~Randy

2005-01-18 01:04:47

by William Park

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

On Tue, Jan 18, 2005 at 12:34:13AM +0000, Al Viro wrote:
> On Tue, Jan 18, 2005 at 02:54:24AM +0000, Daniel Drake wrote:
> > Retry up to 20 times if mounting the root device fails. This fixes
> > booting from usb-storage devices, which no longer make their
> > partitions immediately available.
>
> Sigh... So we can very well get device coming up in the middle of a
> loop and get the actual attempts to mount the sucker in wrong order.
> How nice...
>
> Folks, that's not a solution. And kludges like that really have no
> business being there - they only hide the problem and make it harder
> to reproduce.

The problem at hand is that a USB key drive (which is my immediate
concern) takes 5sec to show up. So, it's a much better approach than
'initrd'.

--
William Park <[email protected]>, Toronto, Canada
Slackware Linux -- because I can type.

2005-01-18 01:14:59

by Roman Zippel

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> a) create indexes, b) reorder events, and likely c) have to rewrite

An additional comment about the order of events. What you're doing in
lockless_reserve is bogus anyway. There is no single correct time to
write into the event. By artificially synchronizing event order and event
time you only cheat yourself. You either take it into account during
postprocessing that events can be interrupted or the time stamp doesn't
seem to be that important, but there is nothing you can do during the
recording of the event except completely disabling interrupts.

bye, Roman

2005-01-18 02:45:33

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> An additional comment about the order of events. What you're doing in
> lockless_reserve is bogus anyway. There is no single correct time to
> write into the event. By artificially synchronizing event order and event
> time you only cheat yourself. You either take it into account during
> postprocessing that events can be interrupted or the time stamp doesn't
> seem to be that important, but there is nothing you can do during the
> recording of the event except completely disabling interrupts.

Correct and like I said before, we are dropping the lockless scheme.
Ergo, disabling interrupts we will.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 02:58:05

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Thomas Gleixner wrote:
> Provide a hook, export it and load your filters as a module, but keep
> the filters out of the mainline kernel code.

Great idea! I will do exactly that.

Thanks,

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 03:56:28

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> Why is it so important that it's at the start of the buffer? What's wrong
> with a special event _near_ the start of a buffer?
[snip]
> What gives you the idea, that you can't do this with what I proposed?
> You can still seek freely within the data at buffer boundaries and you
> only have to search a little into the buffer to find the delimiter. Events
> are not completely at random, so that the little reordering can be done at
> runtime. Sorry, but I don't get what kind of unsolvable problems you see
> here.

Actually I just checked the code and this is a non-issue. The callback
can only be called when the condition is met, which itself happens only
on buffer switch, which itself only happens when we try to reserve
something bigger than what is left in the buffer. IOW, there is no need
for reserving anything. Here's what the code does:
if (!finalizing) {
bytes_written = rchan->callbacks->buffer_start ...
cur_write_pos(rchan) += bytes_written;
}

With that said, I hope we've agreed that we'll have a callback for
letting relayfs clients know that they need to write the beginning of
the buffer event. There won't be any associated reserve. Conversely,
I hope it is not too much to ask to have an end-of-buffer callback.
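
As a rough illustration of the behaviour described above (the
start-of-buffer callback runs only on a buffer switch, and the write
position is simply advanced by what the callback wrote), here is a small
userspace sketch. Everything in it is hypothetical: struct chan_sketch,
chan_init(), chan_reserve() and write_start_marker() are invented for
the example and are not the relayfs code, and the sketch ignores
locking and whether the next sub-buffer has been consumed yet.

```c
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 64
#define N_SUBBUFS   2

struct chan_sketch;

/* Client callback: writes its start-of-buffer event, returns bytes written. */
typedef size_t (*buffer_start_fn)(struct chan_sketch *c, char *start);

struct chan_sketch {
    char buf[N_SUBBUFS][SUBBUF_SIZE];
    size_t cur;                   /* index of the active sub-buffer */
    size_t write_pos;             /* next free offset in that sub-buffer */
    buffer_start_fn buffer_start;
};

void chan_init(struct chan_sketch *c, buffer_start_fn cb)
{
    memset(c, 0, sizeof(*c));
    c->buffer_start = cb;
    if (cb)
        c->write_pos = cb(c, c->buf[0]);
}

/* Switch sub-buffers; the callback's output needs no separate reserve. */
static void switch_buffer(struct chan_sketch *c)
{
    c->cur = (c->cur + 1) % N_SUBBUFS;
    c->write_pos = 0;
    if (c->buffer_start)
        c->write_pos += c->buffer_start(c, c->buf[c->cur]);
}

/* Reserve len bytes; a switch happens only when len exceeds what is left. */
char *chan_reserve(struct chan_sketch *c, size_t len)
{
    char *slot;

    if (len > SUBBUF_SIZE - c->write_pos)
        switch_buffer(c);
    if (len > SUBBUF_SIZE - c->write_pos)
        return NULL;              /* too big even for a fresh sub-buffer */
    slot = c->buf[c->cur] + c->write_pos;
    c->write_pos += len;
    return slot;
}

/* Example client callback: a 4-byte marker at the start of each sub-buffer. */
size_t write_start_marker(struct chan_sketch *c, char *start)
{
    (void)c;
    memcpy(start, "STRT", 4);
    return 4;
}
```

An end-of-buffer callback would slot into switch_buffer() the same way,
called on the old sub-buffer before cur is advanced.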

> Wrong question. What compromises can be made on both sides to create a
> common simple framework? Your unwillingness to compromise a little on the
> ltt requirements really amazes me.

Roman, of all people I've been more than happy to change my stuff following
your recommendations. Do I have to list how far down relayfs has been
stripped down? I mean, we got rid of the lockless scheme (which was
one of ltt's explicit requirements), we got rid of the read/write capabilities
for user-space, etc. And we are now only left with the bare-bones API:
rchan* relay_open(channel_path, bufsize, nbufs, flags, *callbacks);
int relay_close(*rchan);
int relay_reset(*rchan);
int relay_write(*rchan, *data_ptr, count, **wrote-pos);

char* relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
void relay_commit(*rchan, *from, len, reserve_code, interrupting);
void relay_buffers_consumed(*rchan, u32);

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

This is a far cry from what we had before; have a look at the
relayfs.txt file in 2.6.11-rc1-mm1's Documentation/filesystems if
you want to compare. Please at least acknowledge as much.

I'm more than willing to compromise, but at least give me something
substantive to feed on. I've explained why I believe there needs to be
two modes for relayfs. If you don't think they are appropriate, then
please explain why. Either my experience blinds me or it rightly
compels me to continue defending it.

You ask what compromises can be found from both sides to obtain a
single implementation. I have looked at this, and given how
stripped down it has become, anything less from relayfs will make
it useless for LTT. IOW, I would have to reimplement a buffering
scheme within LTT outside of relayfs.

Can't you see that not all buffering schemes are adapted to all
applications and that it's preferable to have a single API
transparently providing separate mechanisms instead of a single
mechanism that doesn't satisfy any of its users?

If I can't convince you of the concept, can I at least convince
you to withhold your final judgement until you actually see the
code for the managed vs. ad-hoc schemes?

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 04:30:54

by Aaron Cohen

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,
I'm very much a newbie to all of this, but I'm finding this
discussion fairly interesting.

I've got a quick question and I just want to be clear that it
doesn't have a political agenda behind it.

Here goes, why can't LTT and/or relayfs, work similar to the way
syslog does and just fill a buffer (aka ring-buffer or whatever is
appropriate), while a userspace daemon of some kind periodically reads
that buffer and massages it. I'm probably being naive but if the
difficulty is with huge several hundred-gig files, the daemon if it
monitors the buffer often enough could stuff it into a database or
whatever high-performance format you need.

It also seems to me that Linus' nascent "splice and tee" work would
be really useful for something like this to avoid a lot of unnecessary
copying by the userspace daemon.


On Mon, 17 Jan 2005 23:03:46 -0500, Karim Yaghmour <[email protected]> wrote:
>
> Hello Roman,
>
> Roman Zippel wrote:
> > Why is it so important that it's at the start of the buffer? What's wrong
> > with a special event _near_ the start of a buffer?
> [snip]
> > What gives you the idea, that you can't do this with what I proposed?
> > You can still seek freely within the data at buffer boundaries and you
> > only have to search a little into the buffer to find the delimiter. Events
> > are not completely at random, so that the little reordering can be done at
> > runtime. Sorry, but I don't get what kind of unsolvable problems you see
> > here.
>
> Actually I just checked the code and this is a non-issue. The callback
> can only be called when the condition is met, which itself happens only
> on buffer switch, which itself only happens when we try to reserve
> something bigger than what is left in the buffer. IOW, there is no need
> for reserving anything. Here's what the code does:
> if (!finalizing) {
> bytes_written = rchan->callbacks->buffer_start ...
> cur_write_pos(rchan) += bytes_written;
> }
>
> With that said, I hope we've agreed that we'll have a callback for
> letting relayfs clients know that they need to write the beginning of
> the buffer event. There won't be any associated reserve. Conversely,
> I hope it is not too much to ask to have an end-of-buffer callback.
>
> > Wrong question. What compromises can be made on both sides to create a
> > common simple framework? Your unwillingness to compromise a little on the
> > ltt requirements really amazes me.
>
> Roman, of all people I've been more than happy to change my stuff following
> your recommendations. Do I have to list how far down relayfs has been
> stripped down? I mean, we got rid of the lockless scheme (which was
> one of ltt's explicit requirements), we got rid of the read/write capabilities
> for user-space, etc. And we are now only left with the bare-bones API:
> rchan* relay_open(channel_path, bufsize, nbufs, flags, *callbacks);
> int relay_close(*rchan);
> int relay_reset(*rchan);
> int relay_write(*rchan, *data_ptr, count, **wrote-pos);
>
> char* relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
> void relay_commit(*rchan, *from, len, reserve_code, interrupting);
> void relay_buffers_consumed(*rchan, u32);
>
> #define relay_write_direct(DEST, SRC, SIZE) \
> #define relay_lock_channel(RCHAN, FLAGS) \
> #define relay_unlock_channel(RCHAN, FLAGS) \
>
> This is a far cry from what we had before; have a look at the
> relayfs.txt file in 2.6.11-rc1-mm1's Documentation/filesystems if
> you want to compare. Please at least acknowledge as much.
>
> I'm more than willing to compromise, but at least give me something
> substantive to feed on. I've explained why I believe there needs to be
> two modes for relayfs. If you don't think they are appropriate, then
> please explain why. Either my experience blinds me or it rightly
> compels me to continue defending it.
>
> You ask what compromises can be found from both sides to obtain a
> single implementation. I have looked at this, and given how
> stripped down it has become, anything less from relayfs will make
> it useless for LTT. IOW, I would have to reimplement a buffering
> scheme within LTT outside of relayfs.
>
> Can't you see that not all buffering schemes are adapted to all
> applications and that it's preferable to have a single API
> transparently providing separate mechanisms instead of a single
> mechanism that doesn't satisfy any of its users?
>
> If I can't convince you of the concept, can I at least convince
> you to withhold your final judgement until you actually see the
> code for the managed vs. ad-hoc schemes?
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 04:39:31

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Aaron Cohen wrote:
> I've got a quick question and I just want to be clear that it
> doesn't have a political agenda behind it.

:)

> Here goes, why can't LTT and/or relayfs, work similar to the way
> syslog does and just fill a buffer (aka ring-buffer or whatever is
> appropriate), while a userspace daemon of some kind periodically reads
> that buffer and massages it. I'm probably being naive but if the
> difficulty is with huge several hundred-gig files, the daemon if it
> monitors the buffer often enough could stuff it into a database or
> whatever high-performance format you need.

Because of the bandwidth it is not possible to do any sort of live
processing of any kind. The only thing the daemon can possibly do
is write large blocks of tracing info to disk as rapidly as possible.

> It also seems to me that Linus' nascent "splice and tee" work would
> be really useful for something like this to avoid a lot of unnecessary
> copying by the userspace daemon.

There is no copying by the userspace daemon. All it does is open(),
then mmap(), and then it sleeps until it is woken up by the ltt
kernel subsystem. When that happens, it only does a write() on the
mmaped area, tells the ltt subsystem that it committed X number of
sub-buffers and goes back asleep. This is all zero-copy.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 07:23:16

by Tom Zanussi

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour writes:
>
> Aaron Cohen wrote:
> > I've got a quick question and I just want to be clear that it
> > doesn't have a political agenda behind it.
>
> :)
>
> > Here goes, why can't LTT and/or relayfs, work similar to the way
> > syslog does and just fill a buffer (aka ring-buffer or whatever is
> > appropriate), while a userspace daemon of some kind periodically reads
> > that buffer and massages it. I'm probably being naive but if the
> > difficulty is with huge several hundred-gig files, the daemon if it
> > monitors the buffer often enough could stuff it into a database or
> > whatever high-performance format you need.
>
> Because of the bandwidth it is not possible to do any sort of live
> processing of any kind. The only thing the daemon can possibly do
> is write large blocks of tracing info to disk as rapidly as possible.

I have to disagree. Awhile back, if you remember, I posted a patch to
the LTT daemon that would monitor the trace stream in real time, and
process it using an embedded Perl interpreter, no less:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2

It didn't seem to have any problems keeping up with the trace stream
even though it was monitoring all LTT event types (and a couple of
others - custom events injected using kprobes) and not doing any
filtering in the kernel, through kernel compiles, normal X traffic,
etc. I don't know what volume of event traffic would cause this model
to break down, but I think it shows that at least some level of
non-trivial live processing is possible...

Tom

>
> > It also seems to me that Linus' nascent "splice and tee" work would
> > be really useful for something like this to avoid a lot of unnecessary
> > copying by the userspace daemon.
>
> There is no copying by the userspace daemon. All it does is open(),
> then mmap(), and then it sleeps until it is woken up by the ltt
> kernel subsystem. When that happens, it only does a write() on the
> mmaped area, tells the ltt subsystem that it committed X number of
> sub-buffers and goes back asleep. This is all zero-copy.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

--
Regards,

Tom Zanussi <[email protected]>
IBM Linux Technology Center/RAS

2005-01-18 08:02:35

by Andries Brouwer

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

On Tue, Jan 18, 2005 at 02:54:24AM +0000, Daniel Drake wrote:

> Retry up to 20 times if mounting the root device fails. This fixes booting
> from usb-storage devices, which no longer make their partitions immediately
> available.
>
> This should allow booting from root=/dev/sda1 and root=8:1 style
> parameters, whilst not breaking booting from RAID or initrd :)
> I have also cleaned up the mount_block_root() function a bit.

+ if (err == -EACCES && (flags | MS_RDONLY) == 0)
+ err = sys_mount(name, "/root", fs, flags | MS_RDONLY, data);
+

It is rather unlikely that (flags | MS_RDONLY) == 0 ...
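
To spell that out with a tiny standalone sketch: MS_RDONLY is 1 in
<sys/mount.h> (redefined below only to keep the example
self-contained), so OR-ing it in can never give 0, and the retry branch
is dead code. Presumably a mask with & was intended; that reading of the
patch is an assumption, and the helper names are invented for the
example.

```c
#include <stdbool.h>

/* Same value as MS_RDONLY in <sys/mount.h>; local name to stay standalone. */
#define SKETCH_MS_RDONLY 1UL

/* The test as posted: OR-ing in a nonzero bit never yields 0. */
bool posted_test(unsigned long flags)
{
    return (flags | SKETCH_MS_RDONLY) == 0;
}

/* Presumably intended: true when the read-only bit is not set yet. */
bool masked_test(unsigned long flags)
{
    return (flags & SKETCH_MS_RDONLY) == 0;
}
```

posted_test() is false for every value of flags, while masked_test() is
true exactly when the mount was not already being attempted read-only.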

I don't like the 20 - so arbitrary.
And since we are going to panic anyway, why not wait indefinitely?

Suppose we have kernel command line options
rootdev=, rootpttype=, root=, rootfstype=, rootwait=
telling the kernel what device is the root device,
what type of partition table it has,
on which partition the root filesystem lives,
what type of filesystem it has,
and whether we want to wait until it becomes available instead of panicking.

If we wait, possibly after the first failure to mount, do a printk
to tell the user: waiting for device to become available.

rootwait can have several values: for example, with a boot/root floppy combo,
we want the user to hit enter or so before accessing the device.
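
A minimal userspace sketch of the waiting behaviour, assuming the
simplest form of rootwait (0 = fail at once as today, nonzero = wait
indefinitely). The callback types, mount_root_wait() and the fake
device are all hypothetical and exist only so the loop can be run
outside the kernel; fprintf() stands in for printk().

```c
#include <stdio.h>

/* Hypothetical hooks standing in for the mount attempt and the delay. */
typedef int  (*try_mount_fn)(void *ctx);   /* returns 0 on success */
typedef void (*wait_fn)(void *ctx);        /* sleeps between attempts */

/*
 * Try to mount until it works. With rootwait == 0 the first failure is
 * reported as -1 so the caller can panic as before. Otherwise the number
 * of failed attempts that preceded the successful mount is returned.
 */
int mount_root_wait(try_mount_fn try_mount, wait_fn wait_a_bit, void *ctx,
                    int rootwait)
{
    int failures = 0;

    while (try_mount(ctx) != 0) {
        if (!rootwait)
            return -1;
        if (failures == 0)   /* tell the user once, after the first failure */
            fprintf(stderr, "waiting for device to become available\n");
        failures++;
        wait_a_bit(ctx);
    }
    return failures;
}

/* Simulated device that becomes mountable after a fixed number of waits. */
struct fake_dev {
    int waits_until_ready;
};

int fake_try_mount(void *ctx)
{
    struct fake_dev *d = ctx;
    return d->waits_until_ready > 0 ? -1 : 0;
}

void fake_wait(void *ctx)
{
    struct fake_dev *d = ctx;
    d->waits_until_ready--;
}
```

The "hit enter first" variant for a boot/root floppy combo would only
need a different wait_a_bit hook.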

Andries

2005-01-18 08:06:14

by Andries Brouwer

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

On Mon, Jan 17, 2005 at 04:02:15PM -0800, Randy.Dunlap wrote:
> Al Viro wrote:
> >On Tue, Jan 18, 2005 at 02:54:24AM +0000, Daniel Drake wrote:
> >
> >>Retry up to 20 times if mounting the root device fails. This fixes
> >>booting
> >>from usb-storage devices, which no longer make their partitions
> >>immediately
> >>available.
> >
> >
> >Sigh... So we can very well get device coming up in the middle of a loop
> >and get the actual attempts to mount the sucker in wrong order. How
> >nice...
> >
> >Folks, that's not a solution. And kludges like that really have no
> >business being there - they only hide the problem and make it harder
> >to reproduce.
>
> Is there a solution other than initrd/initramfs ?

On the one hand, I entirely agree with Al - this guessing business
is a bad kludge, and building complications on top of it makes
things worse.

On the other hand, we do already have the rootfstype= option,
so one can avoid trying things in the "wrong" order.

Andries

2005-01-18 08:20:23

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

Randy.Dunlap wrote:

> Al Viro wrote:
>
>> On Tue, Jan 18, 2005 at 02:54:24AM +0000, Daniel Drake wrote:
>>
>>> Retry up to 20 times if mounting the root device fails. This fixes
>>> booting
>>> from usb-storage devices, which no longer make their partitions
>>> immediately
>>> available.
>>
>>
>>
>> Sigh... So we can very well get device coming up in the middle of a
>> loop
>> and get the actual attempts to mount the sucker in wrong order. How
>> nice...
>>
>> Folks, that's not a solution. And kludges like that really have no
>> business being there - they only hide the problem and make it harder
>> to reproduce.
>
>
> Is there a solution other than initrd/initramfs ?

There is a solution that seems obvious to me, so obvious that it ought
to be the first solution to try. And it is guaranteed not to mess up
raid or anything else. So perhaps there is something wrong with it, or
someone would have done this already? Here it is:

Apparently, USB devices don't appear immediately (after powerup? after
USB bus initialization?). We know this, therefore the USB block driver
should know this. The USB block driver should know that 10s (or
whatever) hasn't yet passed, and simply block any attempt to access
block devices (or scan for them), knowing that it will not work yet but
that any device will be there after the pause. A root mount on USB will
then succeed at the _first_ try every time, so there is no need for
retries.

This solution is guaranteed not to mess up raid or anything else,
because the fix is done in the driver for the "odd" devices, not in the
upper layer trying to use the device as a root fs.

Surely someone must have thought of this before. Is there any reason
why this won't work well?

The only thing I can think of is that partition scanning will cause a
delay on every system with USB block devices compiled in, but this
could be postponed when root isn't on USB. Partition scanning is moving
to "early userspace" anyway, isn't it? In the meantime, people without
USB root who don't want a bootup delay can use modular USB and load the
module later in some boot script.

Helge Hafting

2005-01-18 08:46:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

On Mon, 2005-01-17 at 18:57 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > If we add another hardwired implementation then we do not have said
> > benefits.
>
> Please stop handwaving. Folks like Andrew, Christoph, Zwane, Roman,
> and others actually made specific requests for changes in the code.
> What makes you think you're so special that you think you are
> entitled to stay on the side and handwave about concepts.

So the points you added to your todo list which were brought up by me
are worthless ?

I'm not handwaving. I started this RFC to move the discussion into a
general discussion about instrumentation. A couple of people are
seriously interested in doing this. If you are not interested then ignore
the thread, but you're way not in a position to tell me to shut up.

You turned this thread into your LTT prayer wheel.

Roman pointed out your unwillingness to create a common framework
before. But I have to disagree with him in one point. It's not amazing,
it's annoying.

> If there is a limitation with the code, please present actual
> snippets that need to be changed and suggest alternatives. That's
> what everyone else does on this list.

I pointed you to actually broken code and you accused me of throwing
mud.

> Save the bandwidth

Please remove me from cc, it's a good start to save bandwidth.

> and start cleaning.

Yes, I did already start cleaning

cat ../broken-out/ltt* | patch -p1 -R

tglx


2005-01-18 08:50:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

Helge Hafting <[email protected]> wrote:
>
> The USB block driver should know that 10s (or whatever) hasn't yet
> passed, and simply
> block any attempt to access block devices (or scan for them) knowing
> that it will
> not work yet, but any device will be there after the pause. A root mount
> on USB will
> then succeed at the _first_ try everytime, so no need for retries.

Maybe a simple delay somewhere in the boot sequence would suffice? Boot
with `mount_delay=10'.

But it sure would be nice to simply get this stuff right somehow. If the
USB block driver knows that discovery is still in progress it should wait
until it has completed. (I suggested that before, but wasn't 100% convinced
by the answer).

2005-01-18 11:40:12

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hello,

I'm a developer of yet another kernel tracer, LKST. My co-developers and I
are very glad to hear that LTT was merged into the -mm tree and to talk
about kernel tracers on this ML, because we think that a kernel
event tracer is useful for debugging Linux systems and improving kernel
reliability.

Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
>
>>- Added the Linux Trace Toolkit (and hence relayfs). Mainly because I
>> haven't yet taken as close a look at LTT as I should have. Probably neither
>> have you.
>
>
> I think it would be better to have a standard set of kprobes instead
> of all the ugly LTT hooks. kprobes could then log to relayfs or another
> fast logging mechanism.

I agree.
I'm interested in kprobes. Currently, LKST can switch each hook off and
on. But even when a hook is disabled, there is a little overhead
(one conditional-jump instruction must still be executed). I think
kprobes-based hooks can completely remove this overhead. Moreover,
kprobes-based hooks can be inserted dynamically at a code point
specified by the user. This feature is very useful for debugging. So I
plan to move LKST to kprobes-based hooks, and I'm developing a
prototype implementation.


> The problem relayfs has IMHO is that it is too complicated. It
> seems to either suffer from a overfull specification or second system
> effect. There are lots of different options to do everything,
> instead of a nice simple fast path that does one thing efficiently.
> IMHO before merging it should go through a diet and only keep
> the paths that are actually needed and dropping a lot of the current
> baggage.
>
> Preferably that would be only the fastest options (extremely simple
> per CPU buffer with inlined fast path that drop data on buffer overflow),
> with leaving out anything more complicated. My ideal is something
> like the old SGI ktrace which was an extremely simple mechanism
> to do lockless per CPU logging of binary data efficiently and
> reading that from a user daemon.

LKST's logging buffer is (much) simpler than relayfs. It is just a
linked per-CPU buffer.

If you are interested in this, please try LKST.


--
Masami HIRAMATSU

Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2005-01-18 11:46:36

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Tue, Jan 18, 2005 at 08:19:18PM +0900, Masami Hiramatsu wrote:
> Hello,
>
> I'm a developer of yet another kernel tracer, LKST. My co-developers and I
> are very glad to hear that LTT was merged into the -mm tree and to talk
> about kernel tracers on this ML, because we think that a kernel
> event tracer is useful for debugging Linux systems and improving kernel
> reliability.

I haven't looked at your code, but I would suggest you also post
it for review so that it can be evaluated in the same way
as other more noisy proposals.

Perhaps Andrew can test both for some time in MM like he used
to do for the various schedulers.

-Andi

2005-01-18 13:12:40

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

Andrew Morton wrote:

>Helge Hafting <[email protected]> wrote:
>
>
>>The USB block driver should know that 10s (or whatever) hasn't yet
>> passed, and simply
>> block any attempt to access block devices (or scan for them) knowing
>> that it will
>> not work yet, but any device will be there after the pause. A root mount
>> on USB will
>> then succeed at the _first_ try everytime, so no need for retries.
>>
>>
>
>Maybe a simple delay somewhere in the boot sequence would suffice? Boot
>with `mount_delay=10'.
>
>
>
Certainly the simplest solution, and it also solves a related
but rare problem: People booting linux from ROM long before
the disks have time to spin up.

There seems to be a disadvantage in that one must specify
this pause manually, but the admin has to select the root fs
somewhere anyway (lilo.conf) and may specify the delay at
the same time.

>But it sure would be nice to simply get this stuff right somehow. If the
>USB block driver knows that discovery is still in progress it should wait
>until it has completed. (I suggested that before, but wasn't 100% convinced
>by the answer).
>
>
Sure, if the USB core can know, then it should use the knowledge.
Or utilize a simple timeout if all it knows is that "common
storage devices appear on the bus up to 10s after powerup/reset".

Helge Hafting

2005-01-18 14:52:40

by Masami Hiramatsu

Subject: Re: [Lkst-develop] Re: 2.6.11-rc1-mm1

Hi,

Andi Kleen wrote:
> On Tue, Jan 18, 2005 at 08:19:18PM +0900, Masami Hiramatsu wrote:
>
>>Hello,
>>
>>I'm a developer of yet another kernel tracer, LKST. I and co-developers
>>are very glad to hear that LTT was merged into -mm tree and to talk
>>about the kernel tracer on this ML. Because we think that the kernel
>>event tracer is useful to debug Linux systems, and to improve the kernel
>>reliability.
>
>
> I haven't looked at your code, but I would suggest you also post
> it for review so that it can be evaluated in the same way
> as other more noisy proposals.
>
> Perhaps Andrew can test both for some time in MM like he used
> to do for the various schedulers.

Thanks for your advice.
The latest release package of LKST based on linux-2.6.9 can be
downloaded from
http://sourceforge.net/projects/lkst/

I'll release the LKST based on the latest kernel as soon as possible.

Regards,

--
Masami HIRAMATSU

Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2005-01-18 15:32:22

by Roman Zippel

Subject: Re: 2.6.11-rc1-mm1

Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> With that said, I hope we've agreed that we'll have a callback for
> letting relayfs clients know that they need to write the beginning of
> the buffer event. There won't be any associated reserve. Conversely,
> I hope it is not too much to ask to have an end-of-buffer callback.

There of course has to be some kind of end marker, but that's less
critical as it's not the active buffer anymore.

> Roman, of all people I've been more than happy to change my stuff following
> your recommendations. Do I have to list how far down relayfs has been
> stripped down?

Sorry, you misunderstood me. At the moment I'm only secondarily
interested in the API details, primarily I want to work out the details of
what exactly relayfs/ltt are supposed to do. One main question here I
can't answer yet is why you insist on multiple relayfs modes.
This is what I basically have in mind for the relay_write function:

	cpu = get_cpu();
	buffer = relay_get_buffer(chan, cpu);
	while(1) {
		offset = local_add_return(buffer->offset, length);
		if (likely(offset + length <= buffer->size))
			break;
		buffer = relay_switch_buffer(chan, buffer, offset);
	}
	memcpy(buffer->data + offset, data, length);
	put_cpu();

ltt_log_event should only be a few lines more (for writing header and
event data).
What I'd like to know now are the reasons why you need more than this.
It's not the amount of data and any timing requirements have to be done by
the caller. During processing you either take the events in the order they
were recorded (often that's good enough) or you sort them which is not
that difficult.
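
The reserve-then-copy loop above can be tried out in userspace with C11
atomics. The sketch below is an assumption-laden stand-in, not relayfs
code: struct channel, struct sub_buffer, channel_init() and the tiny
buffer sizes are made up here, and the add is taken to return the offset
before the addition (fetch-and-add), which is what the bounds check in
the snippet implies. It assumes one writer per channel, as the per-CPU
buffer plus get_cpu() does in the original, and it simply overwrites the
oldest sub-buffer on wrap.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 64  /* deliberately tiny so a switch is easy to trigger */
#define NR_SUBBUFS  4

struct sub_buffer {
    atomic_size_t offset;       /* next free byte; may overshoot SUBBUF_SIZE */
    char data[SUBBUF_SIZE];
};

struct channel {
    struct sub_buffer bufs[NR_SUBBUFS];
    size_t current;             /* index of the active sub-buffer */
    size_t switches;            /* number of buffer switches so far */
};

void channel_init(struct channel *chan)
{
    for (size_t i = 0; i < NR_SUBBUFS; i++)
        atomic_init(&chan->bufs[i].offset, 0);
    chan->current = 0;
    chan->switches = 0;
}

/* Activate the next sub-buffer in the ring and reset its offset.
 * Not safe against concurrent writers; see the lead-in. */
static struct sub_buffer *relay_switch_buffer(struct channel *chan,
                                              struct sub_buffer *old)
{
    size_t next = ((size_t)(old - chan->bufs) + 1) % NR_SUBBUFS;

    atomic_store(&chan->bufs[next].offset, 0);
    chan->current = next;
    chan->switches++;
    return &chan->bufs[next];
}

/* Reserve `length` bytes with one atomic add, switching buffers when the
 * reservation does not fit. Returns 0 on success, -1 if it can never fit. */
int relay_write(struct channel *chan, const void *data, size_t length)
{
    struct sub_buffer *buf;
    size_t offset;

    if (length > SUBBUF_SIZE)
        return -1;

    buf = &chan->bufs[chan->current];
    for (;;) {
        offset = atomic_fetch_add(&buf->offset, length); /* old offset */
        if (offset + length <= SUBBUF_SIZE)
            break;
        buf = relay_switch_buffer(chan, buf);
    }
    memcpy(buf->data + offset, data, length);
    return 0;
}
```

Note that a failed reservation leaves the old sub-buffer's offset past
its real end, so a reader cannot tell where the valid data stops from
the offset alone. That is one reason some kind of end marker is needed.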

> You ask what compromises can be found from both sides to obtain a
> single implementation. I have looked at this, and given how
> stripped down it has become, anything less from relayfs will make
> it useless for LTT. IOW, I would have to reimplement a buffering
> scheme within LTT outside of relayfs.

I know you don't want to touch the topic of kernel debugging, but its
requirements greatly overlap with what you want to do with ltt, e.g. one
needs very often information about scheduling events as many kernel
processes rely more and more on kernel threads. The only real requirement
for kernel debugging is low runtime overhead, which you certainly like to
have as well. So what exactly are these requirements and why can't there
be a reasonable alternative?

bye, Roman

2005-01-18 16:24:06

by Karim Yaghmour

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Thomas,

Thomas Gleixner wrote:
> Yes, I did already start cleaning
>
> cat ../broken-out/ltt* | patch -p1 -R

:D

If it gives you a warm and fuzzy feeling to have the last
cheap-shot, then I'm all for it, it is of no consequence anyway.
And _please_ don't forget to answer this very email with
something of the same substance.

For my part I consider that I've invested a substantial amount
of time in responding to both your conceptual and practical
feedback, as the archives clearly show.

That being said, I have to thank you for making sure that all
the obvious questions have been asked. I now have more than a
dozen archive links of my answers to those. They'll sure come in
handy when writing an FAQ.

Thanks again,

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 16:33:47

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Tom Zanussi wrote:
> I have to disagree. Awhile back, if you remember, I posted a patch to
> the LTT daemon that would monitor the trace stream in real time, and
> process it using an embedded Perl interpreter, no less:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2
>
> It didn't seem to have any problems keeping up with the trace stream
> even though it was monitoring all LTT event types (and a couple of
> others - custom events injected using kprobes) and not doing any
> filtering in the kernel, through kernel compiles, normal X traffic,
> etc. I don't know what volume of event traffic would cause this model
> to break down, but I think it shows that at least some level of
> non-trivial live processing is possible...

Good Point.

My bad. Thanks for bringing this up. Obviously this didn't get as
much attention as it should've had the last time it was posted,
especially as it allows very easy scripting of filtering in userspace.
That email you refer to is pretty loaded and I'm sure those who
are interested will dig through it. But in the interest of helping
everyone get a rapid understanding of what it does and how it does it,
can you break it down into a short description, possibly with a
diagram? I'm sure many will find this very interesting.

Thanks,

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-18 18:53:15

by Tom Zanussi

Subject: Re: 2.6.11-rc1-mm1

Karim Yaghmour writes:
>
> Tom Zanussi wrote:
> > I have to disagree. Awhile back, if you remember, I posted a patch to
> > the LTT daemon that would monitor the trace stream in real time, and
> > process it using an embedded Perl interpreter, no less:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2
> >
> > It didn't seem to have any problems keeping up with the trace stream
> > even though it was monitoring all LTT event types (and a couple of
> > others - custom events injected using kprobes) and not doing any
> > filtering in the kernel, through kernel compiles, normal X traffic,
> > etc. I don't know what volume of event traffic would cause this model
> > to break down, but I think it shows that at least some level of
> > non-trivial live processing is possible...
>
> Good Point.
>
> My bad. Thanks for bringing this up. Obviously this didn't get as
> much attention as it should've had the last time it was posted,
> especially as it allows very easy scripting of filtering in userspace.
> That email you refer to is pretty loaded and I'm sure those who
> are interested will dig through it. But in the interest of helping
> everyone get a rapid understanding of what it does and how it does it,
> can you break it down into a short description, possibly with a
> diagram? I'm sure many will find this very interesting.

It's so simple it doesn't really deserve a diagram, which I'm pretty
bad at anyway...

Basically all it does is loop around the received buffer, reading each
event and sending it off to a handler. In this case the handler
massages the data into a form that allows it to be passed to the Perl
interpreter as arguments to a Perl function that in turn acts as
callback handler in the Perl interpreter.

At that point, the Perl callback can do whatever it wants with the
data - save events matching a certain pid and discard everything else,
keep running counts or time totals e.g. total syscall counts for each
pid, function call tracing (if you dynamically instrumented function
call entry/exit with kprobes for example), etc, etc, etc. Probably
even more useful is the ability to monitor the event stream looking
for sporadically occurring events, again under the control of the Perl
interpreter, so your criteria for deciding what an 'important event'
is can be arbitrarily complex and incorporate past history. It also
means that you don't have to save anything at all to disk until you
detect your specified condition (which makes tracing for days or weeks
on end more practical), at which point you can dump out the currently
mapped buffer containing the last bufsize number of events most likely
to be of interest anyway.

Perl makes this kind of quick and dirty processing extremely easy and
it has a lot of powerful language features such as nested hashes built
in, which is why I chose it, but you could of course avoid the extra
layer and the interpreter and do your filtering in straight C, or
create a binding for any language you want.
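
For the "straight C" variant, the loop-and-dispatch described above
might look roughly like the sketch below. The record layout (struct
event_header) is invented for illustration and is not the LTT trace
format; process_buffer(), count_matching() and append_event() are
likewise hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout; the real LTT trace format differs. */
struct event_header {
    uint16_t type;      /* event type id */
    uint16_t size;      /* total record size in bytes, header included */
    uint32_t pid;       /* process the event is attributed to */
};

typedef void (*event_handler)(const struct event_header *hdr,
                              const uint8_t *payload, void *ctx);

/* Walk one received buffer and hand every complete record to the
 * handler. Returns the number of records dispatched. */
size_t process_buffer(const uint8_t *buf, size_t len,
                      event_handler handler, void *ctx)
{
    size_t pos = 0, count = 0;

    while (len - pos >= sizeof(struct event_header)) {
        struct event_header hdr;

        memcpy(&hdr, buf + pos, sizeof hdr);    /* avoid unaligned reads */
        if (hdr.size < sizeof hdr || hdr.size > len - pos)
            break;                              /* truncated or corrupt */
        handler(&hdr, buf + pos + sizeof hdr, ctx);
        pos += hdr.size;
        count++;
    }
    return count;
}

/* Example filter: count events of one type for one pid, drop the rest. */
struct pid_filter {
    uint32_t pid;
    uint16_t type;
    unsigned long matches;
};

void count_matching(const struct event_header *hdr,
                    const uint8_t *payload, void *ctx)
{
    struct pid_filter *f = ctx;

    (void)payload;
    if (hdr->pid == f->pid && hdr->type == f->type)
        f->matches++;
}

/* Demo helper: append one record at pos and return the new length. */
size_t append_event(uint8_t *buf, size_t pos, uint16_t type, uint32_t pid,
                    const void *payload, uint16_t payload_len)
{
    struct event_header hdr = { type, (uint16_t)(sizeof hdr + payload_len), pid };

    memcpy(buf + pos, &hdr, sizeof hdr);
    if (payload_len)
        memcpy(buf + pos + sizeof hdr, payload, payload_len);
    return pos + hdr.size;
}
```

Any per-event policy (running totals, history-dependent triggers, "dump
the buffer when X happens") would live in the handler and its context,
which is the role the Perl callback plays in the patch.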

IMHO being able to do most of the filtering in user space like this
opens up a lot of avenues for not only one-off problem determination
hacks, but a proliferation of more substantial tools, considering how
easy it is to put together applications using for instance the copious
number of Perl modules available.

Tom

>
> Thanks,
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [email protected] || 1-866-677-4546

--
Regards,

Tom Zanussi <[email protected]>
IBM Linux Technology Center/RAS

2005-01-19 00:44:20

by Werner Almesberger

Subject: Re: [PATCH] Wait and retry mounting root device (revised)

William Park wrote:
> The problem at hand is that USB key drive (which is my immediate
> concern) takes 5sec to show up. So, it's much better approach than
> 'initrd'.

I'm a little biased, but I disagree ;-) The main problems with initrd
seem to be that it adds at least one more moving part, and that most
initrd-making procedures give you something non-interactive that
hardly interacts with the outside world. Lo and behold, nobody likes
sudden silent failure of a complex and opaque subsystem, particularly
if it happens to be vitally important.

I think initrds could be greatly improved by including a BusyBox in
their failure paths (plus a way to manually enter the BusyBox, in case
apparent success still means failure). That way, you can actually try
to fix things if there are problems.

Another issue is configuration data that has to exist in the initrd,
yielding a possibly complex initrd construction process that has to
follow each configuration change. Also there, an initrd could be able
to try to access the regular file system to access such information,
possibly combined with caching and heuristics. (I realize that this
isn't trivial and bears a high risk of intractable failure paths, but
I also think that it's worth exploring this direction.)

Regarding the delayed mount problem, I think some retry procedure may
be the best possible band-aid for a while. While it would be desirable
for the USB subsystem (etc.) to just block until the device is ready,
this doesn't work so well if the presence of the device can't be
predicted at that point, e.g. if a "devfs" (udev, etc.) name has to be
looked up first.

I'm not sure I understand Al's concern with devices popping up in the
middle of the loop. For all practical purposes, mounting the root file
system has a single target anyway, so it can't really compete with
anything else. Automatically selected alternative roots can make
sense, but that's sufficiently policy-ish that I think it would be
better kept in an initrd, where instrumentation is more naturally
added than in the kernel.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2005-01-19 07:14:40

by Werner Almesberger

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

From all I've heard and seen of LTT (and I have to admit that most
of it comes from reading this thread, not from reading the code),
I have the impression that it may try to be a bit too specialized,
and thus might miss opportunities for synergy.

You must be getting tired of people trying to redesign things from
scratch, but maybe you'll humor me anyway ;-)

Karim Yaghmour wrote:
> If you really want to define layers, then there are actually four
> layers:
> 1- hooking mechanism
> 2- event definition / registration
> 3- event management infrastructure
> 4- transport mechanism

For 1, kprobes would seem largely sufficient. In cases where you
don't have a usable attachment point (e.g. in the middle of a
function and you need access to variables with unknown location),
you can add lightweight instrumentation that arranges the code
flow suitably. [1, 2]

2 and 3 should be the main domain of LTT, with 2 sitting on top
of kprobes. kprobes currently doesn't have a nice way for
describing handlers, but that can be fixed [3]. But you probably
don't need a "nice" interface right now, but might be satisfied
with one that works and is fast (?)

From the discussion, it seems that the management is partially
done by relayfs. I find this a little strange. E.g. instead of
filtering events, you may just not generate them in the first
place, e.g. by not placing a probe, or by filtering in LTT,
before submitting the event.

Timestamps may be fine either way. Restoring sequence should be
a task user-space can handle: in the worst case, you'd have to
read and merge from #cpus streams. Seeking works in that context,
too.
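
As a rough illustration of the "read and merge from #cpus streams"
case, a user-space merge could look like the sketch below. It assumes
each per-CPU stream is already in timestamp order and that timestamps
are comparable across CPUs; struct trace_event, struct cpu_stream and
merge_streams() are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct trace_event {
    uint64_t timestamp;
    int cpu;
    int id;
};

/* One per-CPU stream, already sorted by timestamp. */
struct cpu_stream {
    const struct trace_event *events;
    size_t count;
    size_t next;        /* index of the next unread event */
};

/* Merge the streams into `out` in global timestamp order.
 * Ties go to the lower stream index. Returns the number written. */
size_t merge_streams(struct cpu_stream *streams, size_t nstreams,
                     struct trace_event *out, size_t out_cap)
{
    size_t written = 0;

    while (written < out_cap) {
        struct cpu_stream *best = NULL;

        /* Pick the stream whose next event is the oldest. */
        for (size_t i = 0; i < nstreams; i++) {
            struct cpu_stream *s = &streams[i];

            if (s->next == s->count)
                continue;       /* stream exhausted */
            if (!best || s->events[s->next].timestamp <
                         best->events[best->next].timestamp)
                best = s;
        }
        if (!best)
            break;              /* all streams exhausted */
        out[written++] = best->events[best->next++];
    }
    return written;
}
```

The linear scan costs one comparison per CPU per event; with many CPUs a
small heap keyed on the next timestamp would be the usual replacement.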

Last but not least, 4 should be simple. Particularly since you're
worried about extreme speeds, there should be as little
processing as you can afford. If you need to seek efficiently
(do you, really ?), you may not even want message boundaries at
that level.

Something that isn't entirely clear to me is if you also need to
aggregate information in buffers. E.g. by updating a record until
it has been retrieved by user space, or by updating a record
when there is no space to create a new one. Such functionality
would add complexity and needs tight synchronization with the
transport.

[1] I've seen the argument that kprobes aren't portable. This
strikes me as highly questionable. Even if an architecture
doesn't have a trap instruction (or equivalent code sequence)
that is at least as short as the shortest instruction, you
can always fall back to adding instrumentation [2]. Also, if
you know where your basic blocks are, you may be able to
use traps that span multiple instructions. I recall that
things of this kind are already planned for kprobes.

[2] See the "reliable markers" of umlsim from umlsim.sf.net.
Implementation: cd umlsim/lib; make; tail -50 markers_kernel.h
Examples: cd umlsim/sim/tests; cat sbug.marker
They're basically extra-light markup in the source code.
Works on ia32, but I haven't found a way to get the assembler
to cooperate for amd64, yet.

[3] I've already solved this problem in umlsim: there, I have a
Perl/C-like scripting language that allows handlers to do
pretty much anything they want. Of course, kprobes would
want pre-compiled C code, not some scripts, but I think the
design could be developed in a direction that would allow
both. Will take a while, but since I'll eventually have to
rewrite the "microcode" anyway, ...

So my comments are basically as follows:

1) kprobes seems like a suitable and elegant mechanism for
placing all the hooks LTT needs, so I think that it would
be better to build on this basis, and extend it where
necessary, than to build yet another specialized variant
in parallel.
2) LTT should do what it is good at, and not have to worry
about the rest (i.e. supporting infrastructure).
3) relayfs should be lean and fast, as you intend it to be, so
that non-LTT tracing or fnord debugging fnord code may find
it useful, too.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2005-01-19 11:11:55

by Christoph Hellwig

Subject: Re: 2.6.11-rc1-mm1

On Sun, Jan 16, 2005 at 02:30:33PM -0600, Tom Zanussi wrote:
> This would allow an application to write trace events of its own to a
> trace stream for instance.

I don't think this is a good idea. Userspace could as well easily write
its trace into shared memory segments.

> Also, I added a user-requested 'feature'
> whereby write()s on a relayfs channel would be sent to a callback that
> could be used to interpret 'out-of-band' commands sent from the
> userspace application.

Now write as a control channel makes lots of sense, but I'd encapsulate
that differently. Basically a net ctl file for each stream (and get
rid of ioctl in favour of this one while we're at it)

2005-01-19 11:14:32

by Christoph Hellwig

Subject: Re: 2.6.11-rc1-mm1

On Sun, Jan 16, 2005 at 01:05:19PM -0600, Tom Zanussi wrote:
> One of the things that uses these functions to read from a channel
> from within the kernel is the relayfs code that implements read(2), so
> taking them away means you wouldn't be able to use read() on a relayfs
> file.

Removing them from the public API is different from disallowing the
read operation.

> That wouldn't matter for ltt since it mmaps the file, but there
> are existing users of relayfs that do use relayfs this way. In fact,
> most of the bug reports I've gotten are from people using it in this
> mode. That doesn't mean though that it's necessarily the right thing
> for relayfs or these users to be doing if they have suitable
> alternatives for passing lower-volume messages in this way. As others
> have mentioned, that seems to be the major question - should relayfs
> concentrate on being solely a high-speed data relay mechanism or
> should it try to be more, as it currently is implemented?

I'd say let it do one thing well, that is high-volume data transfer.

> If the
> former, then I wonder if you need a filesystem at all - all you have
> is a collection of mmappable buffers and the only thing the filesystem
> provides is the namespace. Removing read()/write() and filesystem
> support would of course greatly simplify the code; I'd like to hear
> from any existing users though and see what they'd be missing.

What else would manage the namespace?

2005-01-19 16:57:06

by Tom Zanussi

Subject: Re: 2.6.11-rc1-mm1

Christoph Hellwig wrote:
> On Sun, Jan 16, 2005 at 01:05:19PM -0600, Tom Zanussi wrote:
>
>>One of the things that uses these functions to read from a channel
>>from within the kernel is the relayfs code that implements read(2), so
>>taking them away means you wouldn't be able to use read() on a relayfs
>>file.
>
>
> Removing them from the public API is different from disallowing the
> read operation.
>

Right, but we were planning on removing all that code in the interest of
stripping relayfs down to its bare minimum as a high-speed data
transfer mechanism.

>
>>That wouldn't matter for ltt since it mmaps the file, but there
>>are existing users of relayfs that do use relayfs this way. In fact,
>>most of the bug reports I've gotten are from people using it in this
>>mode. That doesn't mean though that it's necessarily the right thing
>>for relayfs or these users to be doing if they have suitable
>>alternatives for passing lower-volume messages in this way. As others
>>have mentioned, that seems to be the major question - should relayfs
>>concentrate on being solely a high-speed data relay mechanism or
>>should it try to be more, as it currently is implemented?
>
>
> I'd say let it do one thing well, that is high-volume data transfer.

Yes, I think that's the one thing everyone's agreed on.

>
>
>>If the
>>former, then I wonder if you need a filesystem at all - all you have
>>is a collection of mmappable buffers and the only thing the filesystem
>>provides is the namespace. Removing read()/write() and filesystem
>>support would of course greatly simplify the code; I'd like to hear
>>from any existing users though and see what they'd be missing.
>
>
> What else would manage the namespace?

I have to confess I haven't had the time to look at it in detail, but I
previously suggested that we might be able to recover the read()
operations by providing them in userspace on top of the mmapped relayfs
buffer, using FUSE. If we did that, our FUSE filesystem could also
provide the namespace, I assume.

Anyway, I don't think I've seen any objections in principle to the
filesystem part of relayfs, so maybe it's not an issue - any other
suggestions would be welcome, of course...

Tom

2005-01-19 17:30:45

by Karim Yaghmour

Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Werner Almesberger wrote:
> From all I've heard and seen of LTT (and I have to admit that most
> of it comes from reading this thread, not from reading the code),

Might I add that this is part of the problem ... No personal
offence intended, but there's been _A LOT_ of things said about
LTT that were based on third-hand account and no direct contact
with the toolset/code. And part of the problem is that _many_
people on this list, and elsewhere, have done some form of
tracing or another as part of their development, so they all
have their idea of how this is best done. Yet, while such
experience can help provide additional ideas to LTT's development,
it also often requires re-explaining to every new suggestor why we
added features he couldn't imagine would be useful to any of
his/her own tracing needs ... Sometimes I wish my interests lay
in some arcane feature that few had ever played with ;)

IOW, while I don't discount anybody else's experience with tracing,
please give us at least the benefit of the doubt by actually:
a) Looking at the code
b) Looking at the mailing list archives
c) Asking us questions directly related to the code

> I have the impression that it may try to be a bit too specialized,
> and thus might miss opportunities for synergy.

Bear with me on this one ...

> You must be getting tired of people trying to redesign things from
> scratch, but maybe you'll humor me anyway ;-)

Hey, from you Werner I'll take anything. It's always a pleasure
talking with you :)

> Karim Yaghmour wrote:
>
>>If you really want to define layers, then there are actually four
>>layers:
>>1- hooking mechanism
>>2- event definition / registration
>>3- event management infrastructure
>>4- transport mechanism
>
>
> For 1, kprobes would seem largely sufficient. In cases where you
> don't have a usable attachment point (e.g. in the middle of a
> function and you need access to variables with unknown location),
> you can add lightweight instrumentation that arranges the code
> flow suitably. [1, 2]

Let me say outright, as I said to Andi early on in the sister thread,
that I have no problems with having the trace points being fed by
kprobes. In fact, in 2000, way back before kprobes even existed, LTT
was already interfacing with DProbes for dynamic insertion of trace
points.

... There I said it ... now watch me have to repeat this yet again
later on ... :/

However, kprobes is not magic:
a) Like I said to Andi:
> As far as kprobes go, then you still need to have some form or another
> of marking the code for key events, unless you keep maintaining a set
> of kprobes-able points separately, which really makes it unusable for
> the rest of us, as the users of LTT have discovered over time (having
> to create a new patch for every new kernel that comes out.)

b) Like I said to Andrew back in July:
> I've double-checked what I already knew about kprobes and have looked again
> at the site and the patch, and unless there's some feature of kprobes I don't
> know about that allows using something else than the debug interrupt to add
> hooks,
...
> Generating new interrupts is simply unacceptable for LTT's functionality.
> Not to mention that it breaks LTT because tracing something will generate
> events of its own, which will generate tracing events of their own ...
> recursion.

Ok, you can argue about the recursion thing with an "if()", but you'll
have to admit that like in the case I described to Roman:
> ... Say you're getting
> 2MB/s of data (which is not unrealistic on a loaded system.) That means
> that if I'm tracing for 2 days, I've got 345GB of data (~7.5GB/hour).
IOW, something like 200,000 events/s (average of 10 bytes/event). Do I
really need to explain that 200,000 traps/interrupts per second is
not something you want ... ?
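
The figures above can be re-derived with a few lines of C. The helpers
below are hypothetical and assume decimal units (1MB = 1,000,000 bytes):
2MB/s over two days comes to 345.6GB, one hour is 7.2GB (a little under
the rounded ~7.5GB/hour quoted), and 10-byte events give 200,000
events/s.

```c
#include <stdint.h>

/* Bytes produced over `seconds` at a sustained rate. */
uint64_t trace_bytes(uint64_t bytes_per_sec, uint64_t seconds)
{
    return bytes_per_sec * seconds;
}

/* Events per second for a given data rate and average event size.
 * Returns 0 for a zero event size instead of dividing by zero. */
uint64_t events_per_sec(uint64_t bytes_per_sec, uint64_t avg_event_bytes)
{
    if (avg_event_bytes == 0)
        return 0;
    return bytes_per_sec / avg_event_bytes;
}
```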

But don't despair, like I said to Andi:
> So lately I've been thinking that there may be a middle-ground here
> where everyone could be happy. Define three states for the hooks:
> disabled, static, marker. The third one just adds some info into
> System.map for allowing the automation of the insertion of kprobes
> hooks (though you would still need the debugging info to find the
> values of the variables that you want to log.) Hence, you get to
> choose which type of poison you prefer. For my part, I think the
> noop/early-check should be sufficient to get better performance from
> the existing hook-set.
I have received very little feedback on this suggestion, though I
really think it's worth entertaining, especially with your mention
of uml-sim markers further below.
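
To make the three states concrete, here is one way such a hook could be
spelled as a macro. This is only a sketch of the idea: TRACE_HOOK,
TRACE_HOOK_MODE, struct hook_site and log_event() are invented names,
and the "marker" branch merely records the site in a local descriptor as
a stand-in for whatever would really be exported to System.map.

```c
#include <stdbool.h>

/* Hypothetical build-time switch; none of these names come from LTT. */
#define HOOK_DISABLED 0
#define HOOK_STATIC   1
#define HOOK_MARKER   2

#ifndef TRACE_HOOK_MODE
#define TRACE_HOOK_MODE HOOK_STATIC
#endif

static bool tracing_enabled;        /* runtime on/off switch */
static unsigned long events_logged; /* stands in for the real logging path */

static void log_event(int event_id, int arg)
{
    (void)event_id;
    (void)arg;
    events_logged++;
}

#if TRACE_HOOK_MODE == HOOK_STATIC
/* Static hook: one cheap early check, then the call. */
#define TRACE_HOOK(id, arg) \
    do { if (tracing_enabled) log_event((id), (arg)); } while (0)
#elif TRACE_HOOK_MODE == HOOK_MARKER
/* Marker: no logging code, only a description of the site where a
 * dynamic probe could later be attached. */
struct hook_site { const char *name; const char *file; int line; };
#define TRACE_HOOK(id, arg) \
    do { \
        static const struct hook_site site = { #id, __FILE__, __LINE__ }; \
        (void)site; (void)(arg); \
    } while (0)
#else
/* Disabled: the hook compiles away entirely. */
#define TRACE_HOOK(id, arg) do { } while (0)
#endif

/* A call site looks the same in all three modes. */
int schedule_demo(int prev, int next)
{
    TRACE_HOOK(1, next);
    return prev + next;     /* placeholder for the real work */
}
```

In the default static mode the hook costs a single flag test while
tracing is off, which is the noop/early-check behaviour mentioned above.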

As for the location of ltt trace points, then they are very rarely
at function boundaries. Here's a classic:
	prepare_arch_switch(rq, next);
	ltt_ev_schedchange(prev, next);
	prev = context_switch(rq, prev, next);

> 2 and 3 should be the main domain of LTT, with 2 sitting on top
> of kprobes. kprobes currently doesn't have a nice way for
> describing handlers, but that can be fixed [3]. But you probably
> don't need a "nice" interface right now, but might be satisfied
> with one that works and is fast (?)

The functions have been there for DProbes for 5 years:
	int ltt_create_event(char *event_type,
			     char *event_desc,
			     int format_type,
			     char *format_data)
	int ltt_log_raw_event(int event_id, int event_size, void *event_data)

> From the discussion, it seems that the management is partially
> done by relayfs. I find this a little strange. E.g. instead of
> filtering events, you may just not generate them in the first
> place, e.g. by not placing a probe, or by filtering in LTT,
> before submitting the event.

Like I said to Andi:
> ... For one thing, the current
> ltt hooks aren't as fast as they should be (i.e. we check whether
> the tracing is enabled for a certain event way too far in the code-path.)
> This should be rather simple to fix.
And I've already got the code snippet to fix this ready.

> Timestamps may be fine either way. Restoring sequence should be
> a task user-space can handle: in the worst case, you'd have to
> read and merge from #cpus streams. Seeking works in that context,
> too.
>
> Last but not least, 4 should be simple. Particularly since you're
> worried about extreme speeds, there should be as little
> processing as you can afford. If you need to seek efficiently
> (do you, really ?), you may not even want message boundaries at
> that level.

Like I said to Roman:
> Removing this data would require more data for each event to
> be logged, and require parsing through the trace before reading it in
> order to obtain markers allowing random access. This wouldn't be so
> bad if we were expecting users to use LTT sporadically for very short
> periods of time. However, given ltt's target audience (i.e. need to
> run traces for hours, maybe days, weeks), traces would rapidly become
> useless because while plowing through a few hundred KBs of data and
> allocating RAM for building internal structures as you go is fine,
> plowing through tens of GBs of data, possibly hundreds, requires that
> you come up with a format that won't require unreasonable resources
> from your system, while incurring negligible runtime costs for generating
> it. We believe the format we currently have achieves the right balance
> here.

What we've agreed with Roman is that relayfs won't write anything at
the boundaries. Its clients will provide it with callbacks to be
invoked at buffer boundaries. When invoked, said callbacks can add
whatever they feel is important to the buffer, relayfs doesn't care.
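
A minimal userspace sketch of that arrangement follows. It is not the
relayfs API: struct relay_callbacks, relay_switch() and the demo tags
are made up for illustration. The transport only calls the client at the
boundaries and advances the offset by whatever the client wrote; the
end callback gets whatever room is left, with nothing reserved for it.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 32
#define NR_BUFS  2

/* Client-supplied callbacks. Each may write up to `room` bytes at `dst`
 * and returns how many bytes it wrote. Either pointer may be NULL. */
struct relay_callbacks {
    size_t (*buffer_start)(char *dst, size_t room, void *priv);
    size_t (*buffer_end)(char *dst, size_t room, void *priv);
};

struct relay_buf {
    char data[BUF_SIZE];
    size_t offset;
};

struct relay_chan {
    struct relay_buf bufs[NR_BUFS];
    size_t active;
    const struct relay_callbacks *cb;
    void *priv;
};

/* Close the active buffer and open the next one. The transport writes
 * nothing itself; it only invokes the client at the two boundaries. */
void relay_switch(struct relay_chan *chan)
{
    struct relay_buf *old = &chan->bufs[chan->active];
    struct relay_buf *new_buf;
    size_t room, n;

    if (chan->cb && chan->cb->buffer_end) {
        room = BUF_SIZE - old->offset;
        n = chan->cb->buffer_end(old->data + old->offset, room, chan->priv);
        old->offset += (n > room) ? room : n;
    }

    chan->active = (chan->active + 1) % NR_BUFS;
    new_buf = &chan->bufs[chan->active];
    new_buf->offset = 0;

    if (chan->cb && chan->cb->buffer_start) {
        n = chan->cb->buffer_start(new_buf->data, BUF_SIZE, chan->priv);
        new_buf->offset = (n > BUF_SIZE) ? BUF_SIZE : n;
    }
}

/* Demo client: writes a short tag if it fits, otherwise nothing. */
static size_t write_tag(char *dst, size_t room, const char *tag)
{
    size_t n = strlen(tag);

    if (n > room)
        return 0;
    memcpy(dst, tag, n);
    return n;
}

size_t demo_start(char *dst, size_t room, void *priv)
{
    (void)priv;
    return write_tag(dst, room, "<start>");
}

size_t demo_end(char *dst, size_t room, void *priv)
{
    (void)priv;
    return write_tag(dst, room, "<end>");
}

const struct relay_callbacks demo_callbacks = { demo_start, demo_end };
```

A tracer like LTT would put its own start-of-buffer and end-of-buffer
events where the demo writes its tags.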

> Something that isn't entirely clear to me is if you also need to
> aggregate information in buffers. E.g. by updating a record until
> it has been retrieved by user space, or by updating a record
> when there is no space to create a new one. Such functionality
> would add complexity and needs tight synchronization with the
> transport.

If I understand you correctly, you are talking about the fact that
the transport layer's management of the buffers is synchronized
with some user-space entity that consumes the buffers produced
and talks back to relayfs (albeit indirectly) to let it know that
said buffers are now available? If so, then that's why I suggested
elsewhere that we have two modes for relayfs: managed and adhoc.
In the former, you have the required mechanics for what I just
described. In the latter, you have a very basic buffering scheme
that cares nothing about user-space synchronization.

> [1] I've seen the argument that kprobes aren't portable. This
> strikes me as highly questionable. Even if an architecture
> doesn't have a trap instruction (or equivalent code sequence)
> that is at least as short as the shortest instruction, you
> can always fall back to adding instrumentation [2]. Also, if
> you know where your basic blocks are, you may be able to
> use traps that span multiple instructions. I recall that
> things of this kind are already planned for kprobes.

I have nothing against kprobes. People keep referring to it as if
it magically made all the related problems go away, and it doesn't.
See above.

> [2] See the "reliable markers" of umlsim from umlsim.sf.net.
> Implementation: cd umlsim/lib; make; tail -50 markers_kernel.h
> Examples: cd umlsim/sim/tests; cat sbug.marker
> They're basically extra-light markup in the source code.
> Works on ia32, but I haven't found a way to get the assembler
> to cooperate for amd64, yet.

Nothing precludes us from moving in this direction once something is
in the kernel, it's all currently hidden away in a .h, and it would
be the same with this.

> [3] I've already solved this problem in umlsim: there, I have a
> Perl/C-like scripting language that allows handlers to do
> pretty much anything they want. Of course, kprobes would
> want pre-compiled C code, not some scripts, but I think the
> design could be developed in a direction that would allow
> both. Will take a while, but since I'll eventually have to
> rewrite the "microcode" anyway, ...

Like I said, nothing precludes us ...

> So my comments are basically as follows:
>
> 1) kprobes seems like a suitable and elegant mechanism for
> placing all the hooks LTT needs, so I think that it would
> be better to build on this basis, and extend it where
> necessary, than to build yet another specialized variant
> in parallel.

Whichever way you look at this, you need to mark the code. What's
in the .h is something we can tweak ad-nauseam.

> 2) LTT should do what it is good at, and not have to worry
> about the rest (i.e. supporting infrastructure).

I'm guessing that when you're talking about "supporting
infrastructure" you are referring to the trace statements. If so,
please see above. Also note that without the existing marker set
LTT is useless to its users (application developers, sysadmins,
power users, etc.)

> 3) relayfs should be lean and fast, as you intend it to be, so
> that non-LTT tracing or fnord debugging fnord code may find
> it useful, too.

relayfs has already been used for many non-LTT-related purposes. Ask
Hubertus or Jamal, to name a few.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-19 20:11:48

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH] Wait and retry mounting root device (revised)

On Tue, Jan 18, 2005 at 09:02:14AM +0100, Andries Brouwer wrote:
>
> Suppose we have kernel command line options
> rootdev=, rootpttype=, root=, rootfstype=, rootwait=
> telling the kernel what device is the root device,
> what type of partition table it has,
> on which partition the root filesystem lives,
> what type of filesystem it has,

might as well add rootuuid= for those fs which support it.

--
Frank

2005-01-19 23:23:41

by Marcos D. Marado Torres

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 14 Jan 2005, Barry K. Nathan wrote:

> This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> the computer off, on some computers. Here's the RH Bugzilla report where
> most of the discussion took place:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761

This is the same bug I've talked here:
http://lkml.org/lkml/2005/1/11/88

This only happens with -mm and not with vanilla sources.

I'm reporting about this issue in an ASUS M3N laptop with Debian.

Best regards,
Mind Booster Noori

> In the Fedora kernels it turned out to be due to kexec. I'll see if I
> can narrow it down further.
>
> -Barry K. Nathan <[email protected]>

- --
/* *************************************************************** */
Marcos Daniel Marado Torres AKA Mind Booster Noori
http://student.dei.uc.pt/~marado - [email protected]
() Join the ASCII ribbon campaign against html email, Microsoft
/\ attachments and Software patents. They endanger the World.
Sign a petition against patents: http://petition.eurolinux.org
/* *************************************************************** */
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Made with pgp4pine 1.76

iD8DBQFB7ufzmNlq8m+oD34RAmsIAKDM55tzy957YqEXtNkz9l2O3O7V1ACeKXQB
v2LuSPMWch9A7NQApq6Bm8c=
=F7on
-----END PGP SIGNATURE-----

2005-01-20 00:01:14

by Barry K. Nathan

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

On Wed, Jan 19, 2005 at 11:06:10PM +0000, Marcos D. Marado Torres wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Fri, 14 Jan 2005, Barry K. Nathan wrote:
>
> >This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> >users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> >the computer off, on some computers. Here's the RH Bugzilla report where
> >most of the discussion took place:
> >
> >https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761
>
> This is the same bug I've talked here:
> http://lkml.org/lkml/2005/1/11/88

FWIW the RH Bugzilla bug is (unfortunately) discussing several different
similar but not identical bugs, as far as I can tell.

> This only happens with -mm and not with vanilla sources.
>
> I'm reporting about this issue in an ASUS M3N laptop with Debian.
>
> Best regards,
> Mind Booster Noori

FWIW my report against -mm (where I narrowed it down to one of the kexec
patches in particular) is here:
http://bugme.osdl.org/show_bug.cgi?id=4041

-Barry K. Nathan <[email protected]>

2005-01-20 18:14:28

by Daniel Drake

[permalink] [raw]
Subject: [PATCH] Configurable delay before mounting root device

Adds a boot parameter which can be used to specify a delay (in seconds) before
the root device is decoded/discovered/mounted.

Example usage for 10 second delay:

rootdelay=10

Useful for usb-storage devices which no longer make their partitions
immediately available, and for other storage devices which require some
"spin-up" time.

Signed-off-by: Daniel Drake <[email protected]>


Attachments:
rootdelay-boot-param.patch (932.00 B)

2005-01-20 20:25:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Configurable delay before mounting root device

Daniel Drake <[email protected]> wrote:
>
> + if (root_delay) {
> + printk(KERN_INFO "Waiting %dsec before mounting root device...\n",
> + root_delay);
> + ssleep(root_delay);
> + }

Totally sad, but it's hard to see how that could break anything.

You owe me an update to Documentation/kernel-parameters.txt ;)
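
For readers following the patch, here is a minimal userspace sketch of how
a "rootdelay=N" option could be picked out of a command line. It is not the
code from the attachment: parse_rootdelay() is a hypothetical helper, and
the kernel itself would register the option through its own boot-parameter
machinery rather than scan the string like this.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for boot-parameter parsing.
 * Returns the delay in seconds requested by "rootdelay=N", or 0 if the
 * option is absent or malformed.
 */
static unsigned int parse_rootdelay(const char *cmdline)
{
	static const char key[] = "rootdelay=";
	const size_t key_len = sizeof(key) - 1;
	const char *p = cmdline;

	assert(cmdline != NULL);

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept the key at the start of a word. */
		if (p == cmdline || p[-1] == ' ') {
			char *end;
			unsigned long val = strtoul(p + key_len, &end, 10);

			/* Need at least one digit and a word boundary after it. */
			if (end != p + key_len && (*end == '\0' || *end == ' '))
				return (unsigned int)val;
		}
		p += key_len;
	}
	return 0;
}
```

A caller would then sleep for the returned number of seconds before trying
to mount the root device, which is all the quoted hunk does.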

2005-01-20 21:42:23

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

[ 3rd try. Apologies to Karim, Thomas, and Roman, who apparently also
received my previous attempts. For some reason, one of my upstream
DNS servers decided to send me highly bogus MX records. ]

Karim Yaghmour wrote:
> Might I add that this is part of the problem ... No personal
> offence intended, but there's been _A LOT_ of things said about
> LTT that were based on third-hand account and no direct contact
> with the toolset/code.

Sigh, yes, guilty as charged ...

At least today, I have a good excuse: my cable modem died, and I
couldn't possibly have downloaded things to look at :)

> > As far as kprobes go, then you still need to have some form or another
> > of marking the code for key events, unless you keep maintaining a set
> > of kprobes-able points separately, which really makes it unusable for
> > the rest of us, as the users of LTT have discovered over time (having
> > to create a new patch for every new kernel that comes out.)

Yes, I think you will need some set of "pads" in the code, where you
can attach probes. I'm not sure how many, though. An alternative, at
least in some cases, would be to move such things into separate
functions, so that you could put the probe just at function entry.
Then add a comment that this function isn't supposed to be torn
apart without dire need.

> > Generating new interrupts is simply unacceptable for LTT's functionality.

Absolutely. If I remember correctly, this is in the process of being
addressed in kprobes. You basically have the following choices:

- if the probe target is an instruction long enough, replace it with
a jump or call (that's what I think the kprobes folks are working
on. I remember for sure that they were thinking about it.)
- if the probe target is in a basic block with enough room after the
target, see above (needs feedback from compiler or assembler)
- if all else fails, add some NOPs (i.e. the marker approach)
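
The three choices above amount to a small decision procedure. A rough
sketch in C, with all names hypothetical (this is not kprobes code) and the
sizes supplied by the caller:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three strategies listed above. */
enum sketch_probe_kind {
	SKETCH_PROBE_JUMP,       /* overwrite the target instruction itself */
	SKETCH_PROBE_BLOCK_JUMP, /* overwrite the target plus what follows it */
	SKETCH_PROBE_MARKER      /* fall back to NOP padding in the source */
};

/*
 * insn_len:      length of the instruction at the probe target
 * room_in_block: bytes from the target to the end of its basic block
 * jump_len:      length of the jump or call used to divert execution
 */
static enum sketch_probe_kind sketch_pick_probe(size_t insn_len,
						size_t room_in_block,
						size_t jump_len)
{
	assert(jump_len > 0);

	if (insn_len >= jump_len)
		return SKETCH_PROBE_JUMP;
	if (room_in_block >= jump_len)
		return SKETCH_PROBE_BLOCK_JUMP;
	return SKETCH_PROBE_MARKER;
}
```

The second case is the one that needs feedback from the compiler or
assembler, since only they know where the basic block ends.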

> I have received very little feedback on this suggestion,

Probably because everybody saw that it was good :-)

> As for the location of ltt trace points, then they are very rarely
> at function boundaries. Here's a classic:
> prepare_arch_switch(rq, next);
> ltt_ev_schedchange(prev, next);
> prev = context_switch(rq, prev, next);

Yes, in some cases, you don't have a choice but to add some marker.

> > Removing this data would require more data for each event to
> > be logged, and require parsing through the trace before reading it in
> > order to obtain markers allowing random access.

So you need seeking, even in the presence of fine-grained control
over what gets traced in the first place ? (As opposed to extracting
the interesting data from the full trace, given that the latter
shouldn't contain too much noise.)

> If I understand you correctly, you are talking about the fact that
> the transport layer's management of the buffers is synchronized
> with some user-space entity that consumes the buffers produced
> and talks back to relayfs (albeit indirectly) to let it know that
> said buffers are now available?

Or that they have been consumed. My question is just whether this
kind of aggregation is something you need.

> I have nothing against kprobes. People keep referring to it as if
> it magically made all the related problems go away, and it doesn't.

Yes, I know just too well :-) In umlsim, I have pretty much the
same problems, and the solutions aren't always nice. So far, I've
been lucky enough that I could almost always find a suitable
function entry to abuse.

However, since a kprobes-based mechanism is - in the worst case,
i.e. when needing markup - as good as direct calls to LTT, and gives
you a lot more flexibility if things aren't quite as hostile, I
think it makes sense to focus on such a solution.

> Nothing precludes us from moving in this direction once something is
> in the kernel, it's all currently hidden away in a .h, and it would
> be the same with this.

Yup, but you could move even more intelligence outside the kernel.
All you really need in the kernel is a place to put the probe,
plus some debugging information to tell you where you find the
data (the latter possibly combined with gently coercing the
compiler to put it at some accessible place).

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2005-01-20 22:58:12

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)


Werner Almesberger wrote:
> - if the probe target is an instruction long enough, replace it with
> a jump or call (that's what I think the kprobes folks are working
> on. I remember for sure that they were thinking about it.)

I heard about this years ago, but I don't know that anything came of
it. I suspect that this is not as simple as it looks and that the
only reliable way to do it is with a trap.

> Probably because everybody saw that it was good :-)

Great, thanks. That's what we'll aim for then. We've already got
the "disable" and "static" implemented, so now we need to figure
out how best to implement this tagging. IBM's kernel hooks
allowed the NOP solution, so I'm guessing it shouldn't be that
much of a stretch to extend it for marking up the code for kprobes
and friends. I don't know whether this code is still maintained or
not, but I'd like to hear input as to whether this is a good basis,
or whether you're thinking of something like your uml-sim hooks?

> So you need seeking, even in the presence of fine-grained control
> over what gets traced in the first place ? (As opposed to extracting
> the interesting data from the full trace, given that the latter
> shouldn't contain too much noise.)

The problem is that you don't necessarily know beforehand what's
the problem. So here's an actual example:

I had a client who had this box on which a task was always getting
picked up by the OOM killer. Try as they might, the development
team couldn't figure out which part of the code was causing this.
So we put LTT in there and in less than 5 minutes we found the
problem. It turned out that a user-space access to a memory-mapped
FPGA caused an unexpected FP interrupt to occur, and the application
found itself in a recursive signal handler. In this case there was
an application symptom, but it was a hardware problem.

This is just a simple example, but there are plenty of other
examples where a sysadmin will be experiencing some weird
hard to reproduce bugs on some of his systems and he'll spend
a considerable amount of time trying to guess what's happening.
This is especially complicated when there's no indication as to
what's the root of the problem. So at that point being able to
log everything and being able to rapidly browse through it is
critical.

Once you've done such a first trace you _may_ _possibly_ be
able to refine your search requirements and relog with that in
mind, but that's after the fact.

> Or that they have been consumed. My question is just whether this
> kind of aggregation is something you need.

Absolutely. If you're thinking about short traces of 100 kB or a few MB,
then a simpler scheme would be possible. But when we're talking
about GBs and 100s of GBs spanning days, there's got to be a managed
way of doing it.

>>I have nothing against kprobes. People keep referring to it as if
>>it magically made all the related problems go away, and it doesn't.
>
>
> Yes, I know just too well :-) In umlsim, I have pretty much the
> same problems, and the solutions aren't always nice. So far, I've
> been lucky enough that I could almost always find a suitable
> function entry to abuse.

Glad you acknowledge as much.

> However, since a kprobes-based mechanism is - in the worst case,
> i.e. when needing markup - as good as direct calls to LTT, and gives
> you a lot more flexibility if things aren't quite as hostile, I
> think it makes sense to focus on such a solution.

You certainly have a lot more experience than I do with that, so
I'd like to solicit your help. As above: what's the best way to
provide this in addition to the static and disable points?

> Yup, but you could move even more intelligence outside the kernel.
> All you really need in the kernel is a place to put the probe,
> plus some debugging information to tell you where you find the
> data (the latter possibly combined with gently coercing the
> compiler to put it at some accessible place).

Right, but then you end up with a mechanism with generalized hooks.
Actually there was a time when LTT was a driver and you could
either build it as a module or keep it built-in. However, when
we published patches to get LTT accepted in 2.5 we were told on
LKML to move LTT into kernel/ and avoid all this driver stuff.
Having it, or parts of it, in the kernel makes it much simpler
and much more likely that the existing ad-hoc tracing code
spreading across the sources be removed in exchange for a
single agreed upon way of doing things.

It must be said that like I had done with relayfs, the LTT patch
will go through a major redux and I will post the patches for
review like before on LKML.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-21 06:16:35

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


OK, I've finally come around to answering this ...

Roman Zippel wrote:
> Sorry, you misunderstood me. At the moment I'm only secondarily
> interested in the API details, primarily I want to work out the details of
> what exactly relayfs/ltt are supposed to do. One main question here I
> can't answer yet, why you insist on multiple relayfs modes.

I should have avoided earlier confusing the use of a certain type of
relayfs channel for a given purpose (i.e. LTT should not necessarily
depend on the managed mode.) I believe that there is a need for
more than one mode in relayfs independently of LTT. There are users
who want to be able to manage the data in a buffer (by manage I mean:
receive notification of important buffer events, be able to insert
important data at boundaries, etc.), and there are users who just
want to dump as much information as possible in as fast a way as
possible without having to deal with non-essential codepaths.

> This is what I basically have in mind for the relay_write function:
>
> 	cpu = get_cpu();
> 	buffer = relay_get_buffer(chan, cpu);
> 	while(1) {
> 		offset = local_add_return(buffer->offset, length);
> 		if (likely(offset + length <= buffer->size))
> 			break;
> 		buffer = relay_switch_buffer(chan, buffer, offset);
> 	}
> 	memcpy(buffer->data + offset, data, length);
> 	put_cpu();

looking at this code:

1) get_cpu() and put_cpu() won't do. You need to outright disable
interrupts because you may be called from an interrupt handler.

2) You assume that relayfs creates one buffer per cpu for each
channel. We think this is wrong. Relayfs should not need to care
about the number of CPUs, it's the clients' responsibility to
create as many channels as they see fit, whether it be one channel
per CPU or 10 channels per CPU or 1 channel per interrupt, etc.

3) I'm unclear about the need for local_add_return(), why not
just:
	if (likely(buffer->offset + length <= buffer->size))
In any case, here's what we do in relay_write():
	write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);
If there's any buffer switching required, that will be done in
relay_reserve. This has the added advantage that clients that
want to write directly to the buffer without using relay_write()
can do so by calling relay_reserve() and not care about required
buffer switching.

4) After securing the area, you simply go ahead and do a memcpy()
and leave. We think that this is insufficient. Here's what we
do:
	if (likely(write_pos != NULL)) {
		relay_write_direct(write_pos, data_ptr, count);
		relay_commit(rchan, write_pos, count, reserve_code, interrupting);
		*wrote_pos = write_pos;
The relay_write_direct() is basically a memcpy(). We also do
a relay_commit(). This actually effects the delivery of the
event. If, for example, there had been a buffer switch at the
previous relay_reserve(), then this call to relay_commit() will
generate a call to the client's deliver() callback function.
In the case of LTT, for example, this is how it knows that it's
got to notify the user-space daemon that there are buffers to
consume (i.e. write to disk.)
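
To make the reserve/commit split in 3) and 4) concrete, here is a minimal
userspace sketch. Everything in it is an assumption for illustration: the
sketch_* names, the two fixed-size sub-buffers and the callback signature
are not the relayfs API, and there is no locking. It only shows the shape:
reserve performs any buffer switch, and the matching commit is what
notifies the client that a full buffer is ready.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NBUFS   2
#define SKETCH_BUFSIZE 16

/* Hypothetical channel; layout and names are illustrative only. */
struct sketch_chan {
	char   data[SKETCH_NBUFS][SKETCH_BUFSIZE];
	size_t cur;       /* index of the sub-buffer being filled */
	size_t offset;    /* write offset inside that sub-buffer */
	int    delivered; /* number of full sub-buffers handed over */
	void (*deliver)(struct sketch_chan *chan, size_t full_buf);
};

/*
 * Reserve len bytes. If the current sub-buffer cannot hold them, switch to
 * the next one and report the switch through *switched so that the
 * matching commit can notify the consumer.
 */
static char *sketch_reserve(struct sketch_chan *chan, size_t len, int *switched)
{
	char *pos;

	*switched = 0;
	if (len > SKETCH_BUFSIZE)
		return NULL; /* can never fit */
	if (chan->offset + len > SKETCH_BUFSIZE) {
		chan->cur = (chan->cur + 1) % SKETCH_NBUFS;
		chan->offset = 0;
		*switched = 1;
	}
	pos = &chan->data[chan->cur][chan->offset];
	chan->offset += len;
	return pos;
}

/* Commit: if the reserve switched buffers, deliver the one just filled. */
static void sketch_commit(struct sketch_chan *chan, int switched)
{
	if (switched && chan->deliver) {
		size_t full = (chan->cur + SKETCH_NBUFS - 1) % SKETCH_NBUFS;

		chan->deliver(chan, full);
	}
}

/* Reserve, copy, commit: the shape of the relay_write() described above. */
static int sketch_write(struct sketch_chan *chan, const void *src, size_t len)
{
	int switched;
	char *pos = sketch_reserve(chan, len, &switched);

	if (pos == NULL)
		return -1;
	memcpy(pos, src, len);
	sketch_commit(chan, switched);
	return 0;
}

/* Example deliver callback: just counts notifications. */
static void sketch_count_deliver(struct sketch_chan *chan, size_t full_buf)
{
	(void)full_buf;
	chan->delivered++;
}
```

A client that wants to write in place would call sketch_reserve() and
sketch_commit() itself and skip sketch_write() entirely.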

> ltt_log_event should only be a few lines more (for writing header and
> event data).

Actually no, you don't want ltt_log_event using relay_write(),
for one thing because it can generate variable size events.
Instead, ltt_log_event does (basically):
	data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size);

	relay_lock_channel();
	relay_reserve();

	relay_write_direct(&event_id, sizeof(event_id));
	relay_write_direct(&time_delta, sizeof(time_delta));
	if (var_data) {
		relay_write_direct(var_data, var_data_len);
		data_size += var_data_len;
	}
	relay_write_direct(&data_size, sizeof(data_size));

	relay_commit();
	relay_unlock_channel();
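
As a self-contained illustration of that sequence, the sketch below logs a
variable-size event into a flat buffer. The field widths, the 64-byte
buffer and the sketch_* names are assumptions made for the example, not
LTT's actual event format, and locking is left out. Unlike the outline
above, it computes the full size before reserving so that the reservation
covers the variable part as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat trace buffer; a real channel is more involved. */
struct sketch_trace {
	unsigned char buf[64];
	size_t        used;
};

/* Copy len bytes to *pos and advance it (a relay_write_direct stand-in). */
static void sketch_put(unsigned char **pos, const void *src, size_t len)
{
	memcpy(*pos, src, len);
	*pos += len;
}

/*
 * Log one event: id, time delta, optional variable-length payload, then a
 * trailing total size. Returns the number of bytes written, or 0 if the
 * event does not fit.
 */
static size_t sketch_log_event(struct sketch_trace *t, uint8_t event_id,
			       uint32_t time_delta,
			       const void *var_data, uint16_t var_len)
{
	size_t total = sizeof(event_id) + sizeof(time_delta) +
		       sizeof(uint16_t) + (var_data ? var_len : 0);
	uint16_t total16;
	unsigned char *pos;

	assert(t->used <= sizeof(t->buf));
	if (total > sizeof(t->buf) - t->used)
		return 0;

	total16 = (uint16_t)total; /* safe: total <= 64 here */
	pos = t->buf + t->used;    /* the "reserve" step */
	t->used += total;

	sketch_put(&pos, &event_id, sizeof(event_id));
	sketch_put(&pos, &time_delta, sizeof(time_delta));
	if (var_data)
		sketch_put(&pos, var_data, var_len);
	sketch_put(&pos, &total16, sizeof(total16));
	return total;
}
```

No intermediate structure is filled in first: each field goes straight
into the reserved region.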

> What I'd like to know now are the reasons why you need more than this.

I hope the above explanation clarifies things.

> It's not the amount of data and any timing requirements have to be done by
> the caller. During processing you either take the events in the order they
> were recorded (often that's good enough) or you sort them which is not
> that difficult.

Ordering is a non-issue to be honest. Unless you've got some hardware
scope in there, it's almost impossible to pinpoint exactly when an
event occurred. There is no single line of code where an event occurs,
so it's all an educated guess anyway. You want things to resemble what
really happened in as much as possible though.

> I know you don't want to touch the topic of kernel debugging, but its
> requirements greatly overlap with what you want to do with ltt, e.g. one
> needs very often information about scheduling events as many kernel
> processes rely more and more on kernel threads. The only real requirement
> for kernel debugging is low runtime overhead, which you certainly like to
> have as well. So what exactly are these requirements and why can't there
> be a reasonable alternative?

ok, ok, ok, ok, ok, ok, OK!

You've hit it enough times on its head that I'll actually have to answer.

In terms of low runtime overhead, you are correct, the requirements overlap,
and I will agree to do my best to trim down LTT to make it useable for
kernel tracing without jeopardizing its existing purpose.

I'll start this separately in a "Ripping LTT apart" thread.

In regards to relayfs, I think that LTT should run on both modes
transparently. Unlike what I said before, no single mode should be tied
to LTT. If you want tracing with the ad-hoc mode, then fine, you should
be able to do that. There is merit in keeping both relayfs modes,
irrespective of what modes LTT uses. A review of the managed and adhoc
code should consider all clients, including LTT, as potential users of
both. Sure, we'll want to optimize the managed mode in as much as
possible, but its functionality stands on its own and is different from
that of the ad-hoc mode. The difference between these modes is akin to the
difference between GFP_KERNEL, GFP_ATOMIC, GFP_USER, etc.: same API,
different underlying functionality.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-21 15:18:16

by Daniel Drake

[permalink] [raw]
Subject: Re: [PATCH] Configurable delay before mounting root device

Andrew Morton wrote:
> You owe me an update to Documentation/kernel-parameters.txt ;)


Attachments:
rootdelay-boot-param.patch (1.75 kB)

2005-01-21 16:42:37

by William Park

[permalink] [raw]
Subject: Re: [PATCH] Configurable delay before mounting root device

On Thu, Jan 20, 2005 at 08:55:54PM +0000, Daniel Drake wrote:
> Adds a boot parameter which can be used to specify a delay (in seconds)
> before the root device is decoded/discovered/mounted.
>
> Example usage for 10 second delay:
>
> rootdelay=10
>
> Useful for usb-storage devices which no longer make their partitions
> immediately available, and for other storage devices which require some
> "spin-up" time.
>
> Signed-off-by: Daniel Drake <[email protected]>

Very concise. It's much better than 2.4 patch or its 2.6 adaptation (my
patch)...

--
William Park <[email protected]>, Toronto, Canada
Slackware Linux -- because I can type.

2005-01-21 22:28:15

by Roman Zippel

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1

Hi,

On Fri, 21 Jan 2005, Karim Yaghmour wrote:

> I should have avoided earlier confusing the use of a certain type of
> relayfs channel for a given purpose (i.e. LTT should not necessarily
> depend on the managed mode.) I believe that there is a need for
> more than one mode in relayfs independently of LTT. There are users
> who want to be able to manage the data in a buffer (by manage I mean:
> receive notification of important buffer events, be able to insert
> important data at boundaries, etc.), and there are users who just
> want to dump as much information as possible in as fast a way as
> possible without having to deal with non-essential codepaths.

Well, let's concentrate for a moment on the last thing and check later
if and how they fit into relayfs. Since ltt will be first main user, let's
optimize it for this.
Also since relayfs is intended for large, fast data transfers, per cpu
buffers are pretty much always required, so it would make sense to leave
this to relayfs (less to get wrong for the client).

> looking at this code:

I have to modify it a little (only the if (!buffer) part is new):

	cpu = get_cpu();
	buffer = relay_get_buffer(chan, cpu);
	while(1) {
		offset = local_add_return(buffer->offset, length);
		if (likely(offset + length <= buffer->size))
			break;
		buffer = relay_switch_buffer(chan, buffer, offset);
		if (!buffer) {
			put_cpu();
			return;
		}
	}
	memcpy(buffer->data + offset, data, length);
	put_cpu();

This has a very short fast path and I need very good reasons to change/add
anything here. OTOH the slow path with relay_switch_buffer() is less
critical and still leaves a lot of flexibility.
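
A userspace approximation of that fast path, for illustration only. It
uses a C11 fetch-and-add, which returns the offset before the addition, as
the start of the reserved region; that is an assumption about the intended
semantics, and the sketch_* names and sizes are made up. The slow path
here is deliberately naive and not safe against concurrent writers; in the
scheme above that synchronization would live in relay_switch_buffer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SUBBUF_SIZE 32
#define SKETCH_SUBBUFS     2

struct sketch_subbuf {
	atomic_size_t offset; /* next free byte in data[] */
	char          data[SKETCH_SUBBUF_SIZE];
};

/* Hypothetical per-CPU state: a few sub-buffers and the current index. */
struct sketch_percpu {
	struct sketch_subbuf sub[SKETCH_SUBBUFS];
	size_t cur;
	size_t switches;
};

static void sketch_percpu_init(struct sketch_percpu *pc)
{
	size_t i;

	for (i = 0; i < SKETCH_SUBBUFS; i++) {
		atomic_init(&pc->sub[i].offset, 0);
		memset(pc->sub[i].data, 0, SKETCH_SUBBUF_SIZE);
	}
	pc->cur = 0;
	pc->switches = 0;
}

/* Slow path: move on to the next sub-buffer and start it from zero. */
static struct sketch_subbuf *sketch_switch(struct sketch_percpu *pc)
{
	pc->cur = (pc->cur + 1) % SKETCH_SUBBUFS;
	atomic_store(&pc->sub[pc->cur].offset, 0);
	pc->switches++;
	return &pc->sub[pc->cur];
}

/* Fast path: one atomic add and one comparison when the data fits. */
static int sketch_fast_write(struct sketch_percpu *pc,
			     const void *src, size_t len)
{
	struct sketch_subbuf *buf = &pc->sub[pc->cur];
	size_t offset;

	if (len > SKETCH_SUBBUF_SIZE)
		return -1;
	for (;;) {
		offset = atomic_fetch_add(&buf->offset, len);
		if (offset + len <= SKETCH_SUBBUF_SIZE)
			break;
		buf = sketch_switch(pc);
	}
	memcpy(buf->data + offset, src, len);
	return 0;
}
```

The point of the design is visible in sketch_fast_write(): everything that
is rare or expensive is pushed behind the failed size check.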

> 1) get_cpu() and put_cpu() won't do. You need to outright disable
> interrupts because you may be called from an interrupt handler.

Look closer, it's already interrupt safe, the synchronization for the
buffer switch is left to relay_switch_buffer().

> 3) I'm unclear about the need for local_add_return(), why not
> just:
> if (likely(buffer->offset + length <= buffer->size))
> In any case, here's what we do in relay_write():
> write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);

Ok, let's take a closer look at the fast path of relay_write (via
relay_managed.c):

> rchan_get(rchan);

This is not needed, it's the responsibility of the client to keep a
reference to the channel. A synchronize_kernel() is enough to get rid of
current users of the channel on other cpus.

> relay_lock_channel(rchan, flags);

which becomes:

> FLAGS = 0;
> if (RCHAN->flags & RELAY_USAGE_SMP) local_irq_save(FLAGS);
> else spin_lock_irqsave(&(RCHAN)->mode.managed.lock, FLAGS);

This adds a conditional and is not really needed. Above shows how to make
it interrupt safe and if the client wants to reuse the same buffer, leave
the locking to the client.

> write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);

which becomes:

> if (rchan == NULL) ...

Is this really needed?

> if (slot_len >= rchan->buf_size) ...

You can leave it to caller to check for this, a BUG_ON should be enough
here.

> if (rchan->initialized == 0) ...

Does this really have to be in the fast path?

> if (in_progress_event_size(rchan)) ...

What's the point of this? You already disable interrupts, so how can
anything else be in progress?

> if (cur_write_pos(rchan) + slot_len > write_limit(rchan)) ...

Ok. This leads to the slow path and is not interesting right now.

> if (likely(write_pos != NULL)) {

After 7 conditions we finally have a valid write position (and that's
without ltt).

> relay_write_direct(write_pos, data_ptr, count);

If write_pos is just a normal memory pointer, why not also just use
memcpy?

> relay_commit(rchan, write_pos, count, reserve_code, interrupting);

which becomes:

> if (rchan == NULL)
> return;

Hopefully no comment needed.

> if (interrupting) ...

Same comment as above for in_progress_event_size().

> if (deliver) ...
> ...
> if (deliver && waitqueue_active(&rchan->mmap_read_wait))

Why is that hook needed here? Why can't this be done by the client?
A buffer switch notification can be done somewhere else.

> relay_unlock_channel(rchan, flags);
> rchan_put(rchan);

Same comment as above.

That's quite a lot of code with at least 14 conditions (or 13 conditions
too much) and this is just relayfs.

> The difference between these modes is akin to the
> difference between GFP_KERNEL, GFP_ATOMIC, GFP_USER, etc.: same API,
> different underlying functionality.

That's not always true, where performance matters we provide different
functions (e.g. spinlocks), so having an alternative version of
relay_write is a possibility (although I'd like to see the user first).

bye, Roman

2005-01-23 07:33:19

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Hello Roman,

Roman Zippel wrote:
> Well, let's concentrate for a moment on the last thing and check later
> if and how they fit into relayfs. Since ltt will be first main user, let's
> optimize it for this.
> Also since relayfs is intended for large, fast data transfers, per cpu
> buffers are pretty much always required, so it would make sense to leave
> this to relayfs (less to get wrong for the client).

But how does relayfs organize the namespace then? What if I have
multiple channels per CPU, each for a different type of data, will
all channels for the same CPU be under the same directory or will
each type of data have its own directory with one entry per CPU?
I don't have an answer to that, and I don't know that we should. Why
not just leave it to the client to organize his data as he wishes.
If we must assume that everyone will have at least one channel per
CPU, then why not provide helper functions built on top of very
basic functions instead of fixing the namespace in stone?
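
As a sketch of what such a helper might look like: a core primitive that
opens a single named channel and knows nothing about CPUs, plus a thin
per-CPU wrapper on top. The names, the "base/cpuN" naming scheme and the
fixed limits are all hypothetical; a client wanting a different layout
would simply not use the wrapper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_CPUS 4
#define SKETCH_NAME_LEN 32

/* Hypothetical channel handle. */
struct sketch_channel {
	char name[SKETCH_NAME_LEN];
	int  open;
};

/* Core primitive: one channel, one name, no notion of CPUs. */
static int sketch_channel_open(struct sketch_channel *c, const char *name)
{
	if (strlen(name) >= SKETCH_NAME_LEN)
		return -1;
	strcpy(c->name, name);
	c->open = 1;
	return 0;
}

/*
 * Helper layered on the primitive: open one channel per CPU, named
 * "<base>/cpu<N>". Returns the number of channels opened, or -1 on error.
 */
static int sketch_open_percpu(struct sketch_channel *chans, int ncpus,
			      const char *base)
{
	int cpu;

	if (ncpus < 0 || ncpus > SKETCH_MAX_CPUS)
		return -1;
	for (cpu = 0; cpu < ncpus; cpu++) {
		char name[SKETCH_NAME_LEN];
		int n = snprintf(name, sizeof(name), "%s/cpu%d", base, cpu);

		if (n < 0 || (size_t)n >= sizeof(name))
			return -1;
		if (sketch_channel_open(&chans[cpu], name) != 0)
			return -1;
	}
	return ncpus;
}
```

With this split the namespace policy sits in the helper, not in the core.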

> I have to modify it a little (only the if (!buffer) part is new):
>
> 	cpu = get_cpu();
> 	buffer = relay_get_buffer(chan, cpu);
> 	while(1) {
> 		offset = local_add_return(buffer->offset, length);
> 		if (likely(offset + length <= buffer->size))
> 			break;
> 		buffer = relay_switch_buffer(chan, buffer, offset);
> 		if (!buffer) {
> 			put_cpu();
> 			return;
> 		}
> 	}
> 	memcpy(buffer->data + offset, data, length);
> 	put_cpu();
>
> This has a very short fast path and I need very good reasons to change/add
> anything here. OTOH the slow path with relay_switch_buffer() is less
> critical and still leaves a lot of flexibility.

This is not good for any client that doesn't know beforehand the exact
size of their data units, as in the case of LTT. If LTT has to use this
code that means we are going to lose performance because we will need to
fill an intermediate data structure which will only be used for relay_write().
Instead of zero-copy, we would have an extra unnecessary copy. There has
got to be a way for clients to directly reserve and write as they wish.
Even Zach Brown recognized this in his tracepipe proposal, here's from
his patch:
+ * - let caller reserve space and get a pointer into buf

>>1) get_cpu() and put_cpu() won't do. You need to outright disable
>>interrupts because you may be called from an interrupt handler.
>
>
> Look closer, it's already interrupt safe, the synchronization for the
> buffer switch is left to relay_switch_buffer().

Sorry, I'm still missing something. What exactly does local_add_return()
do? I assume this code has got to be interrupt safe? Something like:
	#define local_add_return(OFFSET, LEN) \
	do { \
		... \
		local_irq_save(); \
		OFFSET += LEN; \
		local_irq_restore(); \
		... \
	} while(0)

I'm assuming local_irq_XXX because we were told by quite a few people
in the related thread to avoid atomic ops because they are more expensive
on most CPUs than cli/sti.

Also how does relay_get_buffer() operate? What if I'm writing an event
from within a system call and I'm about to switch buffers and get
an interrupt at the if(likely(...))? Isn't relay_get_buffer() going to
return the same pointer as the one obtained for the syscall, and aren't
both cases now going to effect relay_switch_buffer(), one of which will
be superfluous?

> This adds a conditional and is not really needed. Above shows how to make
> it interrupt safe and if the client wants to reuse the same buffer, leave
> the locking to the client.

Fine, but how is the client going to be able to reuse the same buffer if
relayfs always assumes per-CPU buffer as you said above? This would be
solved if at its core relayfs' functions worked on single channels and
additional code provided helpers for making the SMP case very simple.

> That's quite a lot of code with at least 14 conditions (or 13 conditions
> too much) and this is just relayfs.

I believe Tom has refactored the code with your comments in mind, and has
something ready for review. I just want to clear up the above before we
make this final. Among other things, he just dropped all modes, and there's
only a basic relay_write() that closely resembles what you have above.

> That's not always true, where performance matters we provide different
> functions (e.g. spinlocks), so having an alternative version of
> relay_write is a possibility (although I'd like to see the user first).

Sure, see above in the case of LTT.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-23 07:43:41

by Karim Yaghmour

[permalink] [raw]
Subject: Re: 2.6.11-rc1-mm1


Karim Yaghmour wrote:
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.
> Even Zach Brown recognized this in his tracepipe proposal, here's from
> his patch:
> + * - let caller reserve space and get a pointer into buf

Actually, come to think of it, this code is not good for any client that
needs to fill complex data structures, whether they be fixed-size or not,
because it requires having a prepackaged structure already available.
Any client that wants to have zero-copying will want to write data
directly into the buffer instead of filling an intermediate buffer first.
And this requires being able to atomically reserve.
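To make the reserve-then-write idea concrete, here is a minimal userspace sketch. It is not relayfs code: struct demo_buf, demo_reserve() and demo_log_event() are made-up names, C11 atomic_fetch_add() stands in for whatever atomic reserve the kernel side ends up using, and buffer switching is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 4096

/* Hypothetical stand-in for a single relayfs buffer. */
struct demo_buf {
	atomic_size_t offset;           /* next free byte in data[] */
	unsigned char data[BUF_SIZE];
};

/*
 * Atomically claim len bytes and return a pointer to them, or NULL
 * if the buffer is full (this sketch does no buffer switch).
 */
static void *demo_reserve(struct demo_buf *b, size_t len)
{
	assert(len > 0);
	size_t start = atomic_fetch_add(&b->offset, len);

	if (start + len > BUF_SIZE)
		return NULL;
	return b->data + start;
}

/*
 * Variable-size event: the fields are written straight into the
 * reserved region, with no intermediate event structure.
 */
static int demo_log_event(struct demo_buf *b, uint16_t id, const char *msg)
{
	size_t msg_len = strlen(msg) + 1;
	unsigned char *p = demo_reserve(b, sizeof(id) + msg_len);

	if (!p)
		return -1;
	memcpy(p, &id, sizeof(id));
	memcpy(p + sizeof(id), msg, msg_len);
	return 0;
}
```

The client computes the event size, reserves once, and then fills in the fields in place, which is the zero-copy path described above.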

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-23 08:17:55

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Karim Yaghmour wrote:
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.
> Even Zach Brown recognized this in his tracepipe proposal, here's from
> his patch:
> + * - let caller reserve space and get a pointer into buf

Also, if the reserve is exported, then a client that chooses so, can
do something like:

local_irq_save();
relay_reserve();
write(); write(); write(); ...
local_irq_restore();

And therefore enforce in-order events if he so chooses.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546

2005-01-24 00:41:58

by Roman Zippel

Subject: Re: 2.6.11-rc1-mm1

Hi,

On Sun, 23 Jan 2005, Karim Yaghmour wrote:

> But how does relayfs organize the namespace then? What if I have
> multiple channels per CPU, each for a different type of data, will
> all channels for the same CPU be under the same directory or will
> each type of data have its own directory with one entry per CPU?

I'd say the latter, you already do this for ltt.

> I don't have an answer to that, and I don't know that we should. Why
> not just leave it to the client to organize his data as he wishes.
> If we must assume that everyone will have at least one channel per
> CPU, then why not provide helper functions built on top of very
> basic functions instead of fixing the namespace in stone?

How simple do you want these helper functions to be? Isn't something
like relay_create(path, num_chan) simple enough?
I don't think a directory structure is that bad, as it allows adding
more control files to the relay stream and still leaves the option to write
out all buffers into one file.

> > I have to modify it a little (only the if (!buffer) part is new):
> >
> > 	cpu = get_cpu();
> > 	buffer = relay_get_buffer(chan, cpu);
> > 	while(1) {
> > 		offset = local_add_return(buffer->offset, length);
> > 		if (likely(offset + length <= buffer->size))
> > 			break;
> > 		buffer = relay_switch_buffer(chan, buffer, offset);
> > 		if (!buffer) {
> > 			put_cpu();
> > 			return;
> > 		}
> > 	}
> > 	memcpy(buffer->data + offset, data, length);
> > 	put_cpu();
> >
> > This has a very short fast path and I need very good reasons to change/add
> > anything here. OTOH the slow path with relay_switch_buffer() is less
> > critical and still leaves a lot of flexibility.
>
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.

Ok, let's change it a little so it's more familiar. :)

void *relay_reserve(chan, length, cpu)
{
	buffer = relay_get_buffer(chan, cpu);
	while(1) {
		offset = local_add_return(buffer->offset, length);
		if (likely(offset + length <= buffer->size))
			return buffer->data + offset;
		buffer = relay_switch_buffer(chan, buffer, offset);
		if (!buffer)
			return NULL;
	}
}

All you have to do is put the call between get_cpu()/put_cpu().
The same is also possible as a macro, which allows you to jump directly out
of it to the failure code and avoid one test.
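For readers who want to step through the reserve loop, here is a userspace sketch of the same shape. Everything in it is an assumption made for illustration: the demo_* names, the fixed pool of two sub-buffers per CPU, and a switch routine that simply fails once the pool is used up. The cpu argument is passed in by hand where the kernel caller would sit between get_cpu()/put_cpu(), and plain additions replace the per-CPU atomic op because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS    2
#define DEMO_NR_SUBBUFS 2
#define DEMO_SUBBUF_SZ  64

struct demo_buffer {
	size_t offset;                  /* next free byte */
	size_t size;                    /* capacity of data[] */
	unsigned char data[DEMO_SUBBUF_SZ];
};

struct demo_chan {
	struct demo_buffer *buffer[DEMO_NR_CPUS];   /* current sub-buffer per CPU */
	struct demo_buffer pool[DEMO_NR_CPUS][DEMO_NR_SUBBUFS];
	unsigned int next[DEMO_NR_CPUS];            /* next unused sub-buffer */
};

/* Slow path: move this CPU to a fresh sub-buffer, or fail when none is left. */
static struct demo_buffer *demo_switch_buffer(struct demo_chan *chan, int cpu)
{
	struct demo_buffer *b;

	if (chan->next[cpu] >= DEMO_NR_SUBBUFS) {
		chan->buffer[cpu] = NULL;
		return NULL;
	}
	b = &chan->pool[cpu][chan->next[cpu]++];
	b->offset = 0;
	b->size = DEMO_SUBBUF_SZ;
	chan->buffer[cpu] = b;
	return b;
}

static void demo_chan_init(struct demo_chan *chan)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		chan->next[cpu] = 0;
		demo_switch_buffer(chan, cpu);
	}
}

/* Fast path: claim length bytes in the CPU's current sub-buffer. */
static void *demo_reserve(struct demo_chan *chan, size_t length, int cpu)
{
	struct demo_buffer *buffer = chan->buffer[cpu];

	assert(length > 0 && length <= DEMO_SUBBUF_SZ);
	while (buffer) {
		size_t offset = buffer->offset;     /* start of the claimed region */

		buffer->offset += length;
		if (offset + length <= buffer->size)
			return buffer->data + offset;
		buffer = demo_switch_buffer(chan, cpu);
	}
	return NULL;
}
```

The fast path is one addition and one comparison; all policy (what to do when the buffer is full) stays in the switch routine.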

> > Look closer, it's already interrupt safe, the synchronization for the
> > buffer switch is left to relay_switch_buffer().
>
> Sorry, I'm still missing something. What exactly does local_add_return()
> do? I assume this code has got to be interrupt safe? Something like:
> #define local_add_return(OFFSET, LEN) \
> do {\
> ...
> local_irq_save(); \
> OFFSET += LEN;
> local_irq_restore(); \
> ...
> } while(0);
>
> I'm assuming local_irq_XXX because we were told by quite a few people
> in the related thread to avoid atomic ops because they are more expensive
> on most CPUs than cli/sti.

That would be roughly the generic implementation, but it allows archs to
provide more efficient implementations in <asm/local.h>, e.g. i386 can use
xadd.
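A note on the offset arithmetic: mainline's local_add_return(i, l) takes the increment first and returns the value after the addition, so the start of the claimed region is the return value minus length. A small userspace sketch of that, with C11 atomics standing in for the per-CPU local_t op (demo_add_return() and demo_claim() are made-up names):

```c
#include <assert.h>
#include <stdatomic.h>

/* Returns the counter value after the addition, like an "add_return" op. */
static long demo_add_return(atomic_long *counter, long n)
{
	return atomic_fetch_add(counter, n) + n;
}

/* Claim length bytes and return the offset where the claimed region starts. */
static long demo_claim(atomic_long *offset, long length)
{
	assert(length > 0);
	return demo_add_return(offset, length) - length;
}
```

On x86 a fetch-and-add like this can compile down to a single xadd instruction, which is the kind of arch-specific shortcut meant above.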

> Also how does relay_get_buffer() operate?

#define relay_get_buffer(chan, cpu) chan->buffer[cpu]

> What if I'm writing an event
> from within a system call and I'm about to switch buffers and get
> an interrupt at the if(likely(...))? Isn't relay_get_buffer() going to
> return the same pointer as the one obtained for the syscall, and aren't
> both cases now going to effect relay_switch_buffer(), one of which will
> be superfluous?

The synchronization has to be done in relay_switch_buffer(), but catching
it there is still cheaper than doing it in the fast path.
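One possible way to catch the superfluous switch in the slow path, sketched in userspace C. This is only an assumption about how it could be done, not what relayfs does: each writer remembers which sub-buffer index it overflowed, and a compare-and-swap lets exactly one of the racing writers advance it. The names demo_chan and demo_switch() are made up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_chan {
	atomic_uint cur;        /* index of the active sub-buffer */
};

/*
 * seen is the index the writer observed when it overflowed.
 * Returns true if this caller performed the switch, false if
 * another writer (e.g. an interrupt handler) already did.
 */
static bool demo_switch(struct demo_chan *c, unsigned int seen,
			unsigned int nr_subbufs)
{
	unsigned int expected = seen;

	assert(nr_subbufs > 0);
	return atomic_compare_exchange_strong(&c->cur, &expected,
					      (seen + 1) % nr_subbufs);
}
```

The loser of the race just rereads the current index and retries its reserve, so the extra cost is paid only when a switch actually happens.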

> > This adds a conditional and is not really needed. Above shows how to make
> > it interrupt safe and if the client wants to reuse the same buffer, leave
> > the locking to the client.
>
> Fine, but how is the client going to be able to reuse the same buffer if
> relayfs always assumes per-CPU buffer as you said above? This would be
> solved if at its core relayfs' functions worked on single channels and
> additional code provided helpers for making the SMP case very simple.

What do you mean? Why not make the SMP case simple (less to get wrong)? The
client can still serialize everything with a simple spinlock.

> > That's quite a lot of code with at least 14 conditions (or 13 conditions
> > too much) and this is just relayfs.
>
> I believe Tom has refactored the code with your comments in mind, and has
> something ready for review. I just want to clear up the above before we
> make this final. Among other things, he just dropped all modes, and there's
> only a basic relay_write() that closely resembles what you have above.

Ok, great.
BTW I don't really expect the first version to be fully optimized (unless
you want to :) ), but once the basics are right, that can still be added
later.

bye, Roman

2005-01-25 08:22:40

by Karim Yaghmour

Subject: Re: 2.6.11-rc1-mm1


Roman Zippel wrote:
> Ok, great.
> BTW I don't really expect the first version to be fully optimized (unless
> you want to :) ), but once the basics are right, that can still be added
> later.

Agreed. Tom will post updated patches sometime this week. I'll follow up
with the LTT stuff separately as agreed.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [email protected] || 1-866-677-4546