2006-01-24 07:52:25

by Lee Revell

Subject: RCU latency regression in 2.6.16-rc1

I ported the latency tracer to 2.6.16 and got this 13ms latency within a
few hours. This is a regression from 2.6.15.

It appears that RCU can invoke ipv4_dst_destroy thousands of times in a
single batch.
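
For reference, here is roughly what the stock (non-preemptible) RCU softirq
path does with the queued callbacks: it walks the whole per-CPU "done" list
and invokes every callback back to back, and the only brake on it is the
maxbatch module parameter, which defaults to 10000. This is only a simplified
sketch approximating rcu_do_batch() in kernel/rcupdate.c (not the exact
source; it reuses the struct rcu_data / struct rcu_head layout and the
per-CPU rcu_tasklet from that file):

static int maxbatch = 10000;	/* batch limit, exposed as a module parameter */

/*
 * Simplified callback-invocation loop.  Every callback queued against the
 * previous grace period (here, one dst_rcu_free() per routing-cache entry)
 * is invoked back to back in softirq context.
 */
static void rcu_do_batch(struct rcu_data *rdp)
{
	struct rcu_head *next, *list;
	int count = 0;

	list = rdp->donelist;
	while (list) {
		next = list->next;
		list->func(list);	/* e.g. dst_rcu_free() -> ipv4_dst_destroy() */
		list = next;
		if (++count >= maxbatch)
			break;
	}
	rdp->donelist = list;
	if (!rdp->donelist)
		rdp->donetail = &rdp->donelist;
	else	/* leftovers: finish them in another softirq pass */
		tasklet_schedule(&__get_cpu_var(rcu_tasklet));
}

The routing cache can queue thousands of dst entries against a single grace
period, so nothing short of maxbatch stops this loop; that is where the ~13ms
of back-to-back dst_rcu_free()/ipv4_dst_destroy() calls in the trace below
comes from.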

preemption latency trace v1.1.5 on 2.6.16-rc1
--------------------------------------------------------------------
latency: 13805 us, #14352/14352, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: gtk-gnutella-3116 (uid:1000 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
<idle>-0 0d.s2 1us : __trace_start_sched_wakeup (try_to_wake_up)
<idle>-0 0d.s2 2us : __trace_start_sched_wakeup <<...>-3116> (73 0)
<idle>-0 0d.s. 3us : wake_up_process (process_timeout)
<idle>-0 0d.s. 4us : net_rx_action (__do_softirq)
<idle>-0 0d.s. 5us : process_backlog (net_rx_action)
<idle>-0 0d.s. 6us : netif_receive_skb (process_backlog)
<idle>-0 0d.s1 7us : packet_rcv_spkt (netif_receive_skb)
<idle>-0 0d.s1 7us : skb_clone (packet_rcv_spkt)
<idle>-0 0d.s1 8us : kmem_cache_alloc (skb_clone)
<idle>-0 0d.s1 9us : strlcpy (packet_rcv_spkt)
<idle>-0 0d.s2 10us : sk_run_filter (packet_rcv_spkt)
<idle>-0 0d.s1 12us : __kfree_skb (packet_rcv_spkt)
<idle>-0 0d.s1 12us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s1 13us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s1 14us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s1 14us : ip_rcv (netif_receive_skb)
<idle>-0 0d.s1 15us : ip_route_input (ip_rcv)
<idle>-0 0d.s1 16us : rt_hash_code (ip_route_input)
<idle>-0 0d.s1 17us : memset (ip_route_input)
<idle>-0 0d.s1 18us : fn_hash_lookup (ip_route_input)
<idle>-0 0d.s2 19us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s1 20us : fib_validate_source (ip_route_input)
<idle>-0 0d.s1 21us : memset (fib_validate_source)
<idle>-0 0d.s1 22us : fn_hash_lookup (fib_validate_source)
<idle>-0 0d.s1 23us : fn_hash_lookup (fib_validate_source)
<idle>-0 0d.s2 24us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s1 25us : __fib_res_prefsrc (fib_validate_source)
<idle>-0 0d.s1 26us : inet_select_addr (__fib_res_prefsrc)
<idle>-0 0d.s1 27us : dst_alloc (ip_route_input)
<idle>-0 0d.s1 28us : rt_garbage_collect (dst_alloc)
<idle>-0 0d.s1 28us+: kmem_cache_alloc (dst_alloc)
<idle>-0 0d.s1 31us : rt_hash_code (ip_route_input)
<idle>-0 0d.s1 32us : rt_intern_hash (ip_route_input)
<idle>-0 0d.s1 32us : local_bh_enable (rt_intern_hash)
<idle>-0 0d.s1 33us : ip_local_deliver (ip_rcv)
<idle>-0 0d.s2 34us : tcp_v4_rcv (ip_local_deliver)
<idle>-0 0d.s2 35us : __skb_checksum_complete (tcp_v4_rcv)
<idle>-0 0d.s2 36us+: skb_checksum (__skb_checksum_complete)
<idle>-0 0d.s3 39us : tcp_v4_do_rcv (tcp_v4_rcv)
<idle>-0 0d.s3 39us : tcp_rcv_state_process (tcp_v4_do_rcv)
<idle>-0 0d.s3 40us+: tcp_parse_options (tcp_rcv_state_process)
<idle>-0 0d.s3 44us : tcp_ack (tcp_rcv_state_process)
<idle>-0 0d.s3 46us+: tcp_sync_mss (tcp_ack)
<idle>-0 0d.s3 49us : __kfree_skb (tcp_ack)
<idle>-0 0d.s3 50us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s3 50us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s3 51us : kfree (skb_release_data)
<idle>-0 0d.s3 52us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s3 54us : tcp_ack_saw_tstamp (tcp_ack)
<idle>-0 0d.s3 54us+: tcp_rtt_estimator (tcp_ack_saw_tstamp)
<idle>-0 0d.s3 58us : tcp_sync_mss (tcp_rcv_state_process)
<idle>-0 0d.s3 59us : tcp_initialize_rcv_mss (tcp_rcv_state_process)
<idle>-0 0d.s3 60us : inet_sk_rebuild_header (tcp_rcv_state_process)
<idle>-0 0d.s3 61us : tcp_init_metrics (tcp_rcv_state_process)
<idle>-0 0d.s3 63us : tcp_init_cwnd (tcp_init_metrics)
<idle>-0 0d.s3 64us : tcp_init_congestion_control (tcp_rcv_state_process)
<idle>-0 0d.s3 65us : bictcp_init (tcp_init_congestion_control)
<idle>-0 0d.s3 66us+: tcp_init_buffer_space (tcp_rcv_state_process)
<idle>-0 0d.s3 69us : sock_def_wakeup (tcp_rcv_state_process)
<idle>-0 0d.s4 70us : __wake_up (sock_def_wakeup)
<idle>-0 0d.s5 71us : __wake_up_common (__wake_up)
<idle>-0 0d.s5 72us : default_wake_function (__wake_up_common)
<idle>-0 0d.s5 73us : try_to_wake_up (default_wake_function)
<idle>-0 0d.s6 74us : sched_clock (try_to_wake_up)
<idle>-0 0d.s6 75us : recalc_task_prio (try_to_wake_up)
<idle>-0 0d.s6 76us : effective_prio (recalc_task_prio)
<idle>-0 0d.s6 77us : try_to_wake_up <<...>-2854> (73 1)
<idle>-0 0d.s6 78us : enqueue_task (try_to_wake_up)
<idle>-0 0d.s6 78us : __trace_start_sched_wakeup (try_to_wake_up)
<idle>-0 0d.s3 80us : tcp_send_ack (tcp_rcv_state_process)
<idle>-0 0d.s3 81us : __alloc_skb (tcp_send_ack)
<idle>-0 0d.s3 82us : kmem_cache_alloc (__alloc_skb)
<idle>-0 0d.s3 82us : __kmalloc (__alloc_skb)
<idle>-0 0d.s3 84us : tcp_transmit_skb (tcp_send_ack)
<idle>-0 0d.s3 85us : __tcp_select_window (tcp_transmit_skb)
<idle>-0 0d.s3 86us : tcp_v4_send_check (tcp_transmit_skb)
<idle>-0 0d.s3 87us : ip_queue_xmit (tcp_transmit_skb)
<idle>-0 0d.s3 88us : ip_output (ip_queue_xmit)
<idle>-0 0d.s3 90us : local_bh_enable (ip_output)
<idle>-0 0d.s3 91us : dev_queue_xmit (ip_output)
<idle>-0 0d.s4 91us : pfifo_fast_enqueue (dev_queue_xmit)
<idle>-0 0d.s4 92us : qdisc_restart (dev_queue_xmit)
<idle>-0 0d.s4 93us : pfifo_fast_dequeue (qdisc_restart)
<idle>-0 0d.s4 94us : dev_queue_xmit_nit (qdisc_restart)
<idle>-0 0d.s5 95us : skb_clone (dev_queue_xmit_nit)
<idle>-0 0d.s5 95us : kmem_cache_alloc (skb_clone)
<idle>-0 0d.s5 97us : packet_rcv_spkt (dev_queue_xmit_nit)
<idle>-0 0d.s5 98us : strlcpy (packet_rcv_spkt)
<idle>-0 0d.s6 99us : sk_run_filter (packet_rcv_spkt)
<idle>-0 0d.s5 100us : __kfree_skb (packet_rcv_spkt)
<idle>-0 0d.s5 101us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s5 101us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s5 102us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s4 103us : rhine_start_tx (qdisc_restart)
<idle>-0 0d.s5 104us : ioread8 (rhine_start_tx)
<idle>-0 0d.s5 105us : iowrite8 (rhine_start_tx)
<idle>-0 0d.s5 106us : ioread8 (rhine_start_tx)
<idle>-0 0d.s4 108us : qdisc_restart (dev_queue_xmit)
<idle>-0 0d.s4 108us : pfifo_fast_dequeue (qdisc_restart)
<idle>-0 0d.s3 109us : local_bh_enable (dev_queue_xmit)
<idle>-0 0d.s3 110us : tcp_urg (tcp_rcv_state_process)
<idle>-0 0d.s3 111us : __kfree_skb (tcp_rcv_state_process)
<idle>-0 0d.s3 111us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s3 112us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s3 113us : kfree (skb_release_data)
<idle>-0 0d.s3 114us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.H3 116us : do_IRQ (c012ae3b b 0)
<idle>-0 0d.h. 117us : __do_IRQ (do_IRQ)
<idle>-0 0d.h1 117us+: mask_and_ack_8259A (__do_IRQ)
<idle>-0 0d.h. 122us : handle_IRQ_event (__do_IRQ)
<idle>-0 0d.h. 123us : usb_hcd_irq (handle_IRQ_event)
<idle>-0 0d.h. 124us : uhci_irq (usb_hcd_irq)
<idle>-0 0d.h. 125us : rhine_interrupt (handle_IRQ_event)
<idle>-0 0d.h. 125us : ioread16 (rhine_interrupt)
<idle>-0 0d.h. 127us : ioread8 (rhine_interrupt)
<idle>-0 0d.h. 128us : iowrite16 (rhine_interrupt)
<idle>-0 0d.h. 128us : ioread8 (rhine_interrupt)
<idle>-0 0d.h1 130us : raise_softirq_irqoff (rhine_interrupt)
<idle>-0 0d.h. 131us : ioread16 (rhine_interrupt)
<idle>-0 0d.h. 132us : ioread8 (rhine_interrupt)
<idle>-0 0d.h. 133us : via_driver_irq_handler (handle_IRQ_event)
<idle>-0 0d.h1 135us : note_interrupt (__do_IRQ)
<idle>-0 0d.h1 135us : end_8259A_irq (__do_IRQ)
<idle>-0 0d.h1 136us+: enable_8259A_irq (end_8259A_irq)
<idle>-0 0d.H3 138us : irq_exit (do_IRQ)
<idle>-0 0d.s3 139us < (2097760)
<idle>-0 0d.s3 139us : tcp_current_mss (tcp_rcv_state_process)
<idle>-0 0d.s3 140us : __tcp_push_pending_frames (tcp_rcv_state_process)
<idle>-0 0d.s3 141us : tcp_check_space (tcp_rcv_state_process)
<idle>-0 0d.s. 143us : tasklet_action (__do_softirq)
<idle>-0 0d.s. 143us : rcu_process_callbacks (tasklet_action)
<idle>-0 0d.s. 144us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 145us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 146us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 147us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 148us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 149us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 150us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 151us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 151us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 152us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 153us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 154us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 154us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 155us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 156us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 157us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 158us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 158us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 159us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 160us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 161us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 162us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 163us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 163us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 164us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 165us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 166us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 166us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 167us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 168us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 169us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 170us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 170us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 171us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 172us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 173us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 174us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 174us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 175us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 176us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 177us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 178us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 179us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 179us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 180us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 181us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 182us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 182us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 183us : ipv4_dst_destroy (dst_destroy)

[ etc - zillions of dst_rcu_free()s deleted ]

<idle>-0 0d.s. 13403us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 13403us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 13404us : ipv4_dst_destroy (dst_destroy)
<idle>-0 0d.s. 13405us : kmem_cache_free (dst_destroy)
<idle>-0 0d.s. 13406us : run_timer_softirq (__do_softirq)
<idle>-0 0d.s. 13407us : hrtimer_run_queues (run_timer_softirq)
<idle>-0 0d.s. 13408us : ktime_get_real (hrtimer_run_queues)
<idle>-0 0d.s. 13409us : getnstimeofday (ktime_get_real)
<idle>-0 0d.s. 13410us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 13410us+: get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 13413us : ktime_get (hrtimer_run_queues)
<idle>-0 0d.s. 13413us : ktime_get_ts (ktime_get)
<idle>-0 0d.s. 13414us : getnstimeofday (ktime_get_ts)
<idle>-0 0d.s. 13414us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 13415us : get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 13416us+: set_normalized_timespec (ktime_get_ts)
<idle>-0 0d.s. 13419us : net_tx_action (__do_softirq)
<idle>-0 0d.s. 13420us : __kfree_skb (net_tx_action)
<idle>-0 0d.s. 13421us : sock_wfree (__kfree_skb)
<idle>-0 0d.s. 13422us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s. 13423us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s. 13424us : kfree (skb_release_data)
<idle>-0 0d.s. 13425us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s. 13427us : __kfree_skb (net_tx_action)
<idle>-0 0d.s. 13428us : sock_wfree (__kfree_skb)
<idle>-0 0d.s. 13429us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s. 13430us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s. 13431us : kfree (skb_release_data)
<idle>-0 0d.s. 13432us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s. 13433us : net_rx_action (__do_softirq)
<idle>-0 0d.s. 13434us : process_backlog (net_rx_action)
<idle>-0 0d.s. 13435us+: netif_receive_skb (process_backlog)
<idle>-0 0d.s1 13437us : packet_rcv_spkt (netif_receive_skb)
<idle>-0 0d.s1 13438us : skb_clone (packet_rcv_spkt)
<idle>-0 0d.s1 13439us : kmem_cache_alloc (skb_clone)
<idle>-0 0d.s1 13440us : strlcpy (packet_rcv_spkt)
<idle>-0 0d.s2 13442us : sk_run_filter (packet_rcv_spkt)
<idle>-0 0d.s1 13444us : __kfree_skb (packet_rcv_spkt)
<idle>-0 0d.s1 13444us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s1 13445us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s1 13446us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s1 13446us : ip_rcv (netif_receive_skb)
<idle>-0 0d.s1 13448us : ip_route_input (ip_rcv)
<idle>-0 0d.s1 13449us+: rt_hash_code (ip_route_input)
<idle>-0 0d.s1 13451us : memset (ip_route_input)
<idle>-0 0d.s1 13452us+: fn_hash_lookup (ip_route_input)
<idle>-0 0d.s2 13454us+: fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s1 13457us : fib_validate_source (ip_route_input)
<idle>-0 0d.s1 13458us : memset (fib_validate_source)
<idle>-0 0d.s1 13459us+: fn_hash_lookup (fib_validate_source)
<idle>-0 0d.s1 13461us : fn_hash_lookup (fib_validate_source)
<idle>-0 0d.s2 13463us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s1 13464us : __fib_res_prefsrc (fib_validate_source)
<idle>-0 0d.s1 13465us : inet_select_addr (__fib_res_prefsrc)
<idle>-0 0d.s1 13466us : dst_alloc (ip_route_input)
<idle>-0 0d.s1 13467us+: kmem_cache_alloc (dst_alloc)
<idle>-0 0d.s1 13469us : rt_hash_code (ip_route_input)
<idle>-0 0d.s1 13470us : rt_intern_hash (ip_route_input)
<idle>-0 0d.s1 13471us : local_bh_enable (rt_intern_hash)
<idle>-0 0d.s1 13472us : ip_local_deliver (ip_rcv)
<idle>-0 0d.s2 13473us+: tcp_v4_rcv (ip_local_deliver)
<idle>-0 0d.s3 13477us : tcp_v4_do_rcv (tcp_v4_rcv)
<idle>-0 0d.s3 13477us : tcp_rcv_established (tcp_v4_do_rcv)
<idle>-0 0d.s3 13479us : __tcp_checksum_complete_user (tcp_rcv_established)
<idle>-0 0d.s3 13479us : __skb_checksum_complete (__tcp_checksum_complete_user)
<idle>-0 0d.s3 13480us+: skb_checksum (__skb_checksum_complete)
<idle>-0 0d.s3 13489us : tcp_event_data_recv (tcp_rcv_established)
<idle>-0 0d.s3 13491us : tcp_incr_quickack (tcp_event_data_recv)
<idle>-0 0d.s3 13492us : __tcp_ack_snd_check (tcp_rcv_established)
<idle>-0 0d.s3 13493us : tcp_send_ack (__tcp_ack_snd_check)
<idle>-0 0d.s3 13494us : __alloc_skb (tcp_send_ack)
<idle>-0 0d.s3 13494us : kmem_cache_alloc (__alloc_skb)
<idle>-0 0d.s3 13495us : __kmalloc (__alloc_skb)
<idle>-0 0d.s3 13496us : tcp_transmit_skb (tcp_send_ack)
<idle>-0 0d.s3 13498us : __tcp_select_window (tcp_transmit_skb)
<idle>-0 0d.s3 13499us : tcp_v4_send_check (tcp_transmit_skb)
<idle>-0 0d.s3 13500us : ip_queue_xmit (tcp_transmit_skb)
<idle>-0 0d.s3 13501us : ipv4_dst_check (ip_queue_xmit)
<idle>-0 0d.s3 13502us : memset (ip_queue_xmit)
<idle>-0 0d.s3 13504us : ip_route_output_flow (ip_queue_xmit)
<idle>-0 0d.s3 13505us : __ip_route_output_key (ip_route_output_flow)
<idle>-0 0d.s3 13506us : rt_hash_code (__ip_route_output_key)
<idle>-0 0d.s3 13507us : local_bh_enable (__ip_route_output_key)
<idle>-0 0d.s3 13508us : memset (__ip_route_output_key)
<idle>-0 0d.s3 13509us : ip_dev_find (__ip_route_output_key)
<idle>-0 0d.s3 13510us : memset (ip_dev_find)
<idle>-0 0d.s3 13511us : fn_hash_lookup (ip_dev_find)
<idle>-0 0d.s4 13512us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s3 13514us : fn_hash_lookup (__ip_route_output_key)
<idle>-0 0d.s3 13515us : fn_hash_lookup (__ip_route_output_key)
<idle>-0 0d.s4 13516us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s3 13518us+: fn_hash_select_default (__ip_route_output_key)
<idle>-0 0d.s3 13520us : dst_alloc (__ip_route_output_key)
<idle>-0 0d.s3 13521us+: kmem_cache_alloc (dst_alloc)
<idle>-0 0d.s3 13524us+: rt_set_nexthop (__ip_route_output_key)
<idle>-0 0d.s3 13526us : rt_hash_code (__ip_route_output_key)
<idle>-0 0d.s3 13527us : rt_intern_hash (__ip_route_output_key)
<idle>-0 0d.s4 13528us : arp_bind_neighbour (rt_intern_hash)
<idle>-0 0d.s4 13529us : neigh_lookup (arp_bind_neighbour)
<idle>-0 0d.s4 13530us+: arp_hash (neigh_lookup)
<idle>-0 0d.s5 13532us : memcmp (neigh_lookup)
<idle>-0 0d.s4 13534us : local_bh_enable (neigh_lookup)
<idle>-0 0d.s3 13535us+: local_bh_enable (rt_intern_hash)
<idle>-0 0d.s3 13537us : ip_output (ip_queue_xmit)
<idle>-0 0d.s3 13538us : neigh_resolve_output (ip_output)
<idle>-0 0d.s4 13540us+: eth_header (neigh_resolve_output)
<idle>-0 0d.H4 13543us : do_IRQ (c01f6117 0 0)
<idle>-0 0d.h. 13544us : __do_IRQ (do_IRQ)
<idle>-0 0d.h1 13545us+: mask_and_ack_8259A (__do_IRQ)
<idle>-0 0d.h. 13549us : handle_IRQ_event (__do_IRQ)
<idle>-0 0d.h. 13550us : timer_interrupt (handle_IRQ_event)
<idle>-0 0d.h1 13551us+: mark_offset_tsc (timer_interrupt)
<idle>-0 0d.h1 13557us : do_timer (timer_interrupt)
<idle>-0 0d.h1 13558us : update_process_times (timer_interrupt)
<idle>-0 0d.h1 13558us : account_system_time (update_process_times)
<idle>-0 0d.h1 13559us : acct_update_integrals (account_system_time)
<idle>-0 0d.h1 13560us : run_local_timers (update_process_times)
<idle>-0 0d.h1 13561us : raise_softirq (run_local_timers)
<idle>-0 0d.h1 13561us : rcu_pending (update_process_times)
<idle>-0 0d.h1 13562us : __rcu_pending (rcu_pending)
<idle>-0 0d.h1 13563us : __rcu_pending (rcu_pending)
<idle>-0 0d.h1 13563us : scheduler_tick (update_process_times)
<idle>-0 0d.h1 13564us : sched_clock (scheduler_tick)
<idle>-0 0d.h1 13565us : run_posix_cpu_timers (update_process_times)
<idle>-0 0d.h1 13566us : smp_local_timer_interrupt (timer_interrupt)
<idle>-0 0d.h1 13566us : profile_tick (smp_local_timer_interrupt)
<idle>-0 0d.h1 13567us : profile_hit (profile_tick)
<idle>-0 0d.h1 13568us : note_interrupt (__do_IRQ)
<idle>-0 0d.h1 13569us : end_8259A_irq (__do_IRQ)
<idle>-0 0d.h1 13569us : enable_8259A_irq (end_8259A_irq)
<idle>-0 0d.H4 13571us : irq_exit (do_IRQ)
<idle>-0 0d.s4 13572us < (2097760)
<idle>-0 0d.s3 13573us : local_bh_enable (neigh_resolve_output)
<idle>-0 0d.s3 13574us : dev_queue_xmit (neigh_resolve_output)
<idle>-0 0d.s4 13576us : pfifo_fast_enqueue (dev_queue_xmit)
<idle>-0 0d.s4 13577us : qdisc_restart (dev_queue_xmit)
<idle>-0 0d.s4 13578us : pfifo_fast_dequeue (qdisc_restart)
<idle>-0 0d.s4 13579us : dev_queue_xmit_nit (qdisc_restart)
<idle>-0 0d.s5 13579us : skb_clone (dev_queue_xmit_nit)
<idle>-0 0d.s5 13580us : kmem_cache_alloc (skb_clone)
<idle>-0 0d.s5 13581us : packet_rcv_spkt (dev_queue_xmit_nit)
<idle>-0 0d.s5 13582us : strlcpy (packet_rcv_spkt)
<idle>-0 0d.s6 13583us : sk_run_filter (packet_rcv_spkt)
<idle>-0 0d.s5 13584us : __kfree_skb (packet_rcv_spkt)
<idle>-0 0d.s5 13585us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s5 13585us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s5 13586us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s4 13587us : rhine_start_tx (qdisc_restart)
<idle>-0 0d.s5 13589us : ioread8 (rhine_start_tx)
<idle>-0 0d.s5 13590us : iowrite8 (rhine_start_tx)
<idle>-0 0d.s5 13591us+: ioread8 (rhine_start_tx)
<idle>-0 0d.s4 13593us : qdisc_restart (dev_queue_xmit)
<idle>-0 0d.s4 13594us : pfifo_fast_dequeue (qdisc_restart)
<idle>-0 0d.s3 13594us : local_bh_enable (dev_queue_xmit)
<idle>-0 0d.s3 13596us : sock_def_readable (tcp_rcv_established)
<idle>-0 0d.s4 13597us : __wake_up (sock_def_readable)
<idle>-0 0d.s5 13598us : __wake_up_common (__wake_up)
<idle>-0 0d.s5 13599us : ep_poll_callback (__wake_up_common)
<idle>-0 0d.s5 13601us : ep_poll_safewake (ep_poll_callback)
<idle>-0 0d.s5 13602us : __wake_up (ep_poll_safewake)
<idle>-0 0d.s6 13603us : __wake_up_common (__wake_up)
<idle>-0 0d.s6 13604us : default_wake_function (__wake_up_common)
<idle>-0 0d.s6 13605us+: try_to_wake_up (default_wake_function)
<idle>-0 0d.H5 13607us : do_IRQ (c0111d1e b 0)
<idle>-0 0d.h. 13608us : __do_IRQ (do_IRQ)
<idle>-0 0d.h1 13609us+: mask_and_ack_8259A (__do_IRQ)
<idle>-0 0d.h. 13614us : handle_IRQ_event (__do_IRQ)
<idle>-0 0d.h. 13615us : usb_hcd_irq (handle_IRQ_event)
<idle>-0 0d.h. 13616us : uhci_irq (usb_hcd_irq)
<idle>-0 0d.h. 13617us : rhine_interrupt (handle_IRQ_event)
<idle>-0 0d.h. 13618us : ioread16 (rhine_interrupt)
<idle>-0 0d.h. 13619us : ioread8 (rhine_interrupt)
<idle>-0 0d.h. 13621us : iowrite16 (rhine_interrupt)
<idle>-0 0d.h. 13621us : ioread8 (rhine_interrupt)
<idle>-0 0d.h1 13623us : raise_softirq_irqoff (rhine_interrupt)
<idle>-0 0d.h. 13624us : ioread16 (rhine_interrupt)
<idle>-0 0d.h. 13625us : ioread8 (rhine_interrupt)
<idle>-0 0d.h. 13627us+: via_driver_irq_handler (handle_IRQ_event)
<idle>-0 0d.h1 13629us : note_interrupt (__do_IRQ)
<idle>-0 0d.h1 13630us : end_8259A_irq (__do_IRQ)
<idle>-0 0d.h1 13631us : enable_8259A_irq (end_8259A_irq)
<idle>-0 0d.H5 13633us : irq_exit (do_IRQ)
<idle>-0 0d.s5 13633us+< (2097760)
<idle>-0 0d.s. 13635us : netif_receive_skb (process_backlog)
<idle>-0 0d.s1 13637us : packet_rcv_spkt (netif_receive_skb)
<idle>-0 0d.s1 13637us : skb_clone (packet_rcv_spkt)
<idle>-0 0d.s1 13638us : kmem_cache_alloc (skb_clone)
<idle>-0 0d.s1 13640us : strlcpy (packet_rcv_spkt)
<idle>-0 0d.s2 13640us : sk_run_filter (packet_rcv_spkt)
<idle>-0 0d.s1 13642us : __kfree_skb (packet_rcv_spkt)
<idle>-0 0d.s1 13642us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s1 13643us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s1 13644us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s1 13644us : ip_rcv (netif_receive_skb)
<idle>-0 0d.s1 13645us : ip_route_input (ip_rcv)
<idle>-0 0d.s1 13646us : rt_hash_code (ip_route_input)
<idle>-0 0d.s1 13647us : memset (ip_route_input)
<idle>-0 0d.s1 13648us : fn_hash_lookup (ip_route_input)
<idle>-0 0d.s2 13649us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s1 13650us : fib_validate_source (ip_route_input)
<idle>-0 0d.s1 13651us : memset (fib_validate_source)
<idle>-0 0d.s1 13652us : fn_hash_lookup (fib_validate_source)
<idle>-0 0d.s1 13653us : fn_hash_lookup (fib_validate_source)
<idle>-0 0d.s2 13654us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s1 13655us : __fib_res_prefsrc (fib_validate_source)
<idle>-0 0d.s1 13656us : inet_select_addr (__fib_res_prefsrc)
<idle>-0 0d.s1 13657us : dst_alloc (ip_route_input)
<idle>-0 0d.s1 13657us : kmem_cache_alloc (dst_alloc)
<idle>-0 0d.s1 13659us : rt_hash_code (ip_route_input)
<idle>-0 0d.s1 13660us : rt_intern_hash (ip_route_input)
<idle>-0 0d.s1 13661us : local_bh_enable (rt_intern_hash)
<idle>-0 0d.s1 13662us : ip_local_deliver (ip_rcv)
<idle>-0 0d.s2 13663us+: tcp_v4_rcv (ip_local_deliver)
<idle>-0 0d.s3 13665us : tcp_v4_do_rcv (tcp_v4_rcv)
<idle>-0 0d.s3 13666us : tcp_rcv_established (tcp_v4_do_rcv)
<idle>-0 0d.s3 13667us : __tcp_checksum_complete_user (tcp_rcv_established)
<idle>-0 0d.s3 13667us : __skb_checksum_complete (__tcp_checksum_complete_user)
<idle>-0 0d.s3 13668us : skb_checksum (__skb_checksum_complete)
<idle>-0 0d.s3 13670us : tcp_event_data_recv (tcp_rcv_established)
<idle>-0 0d.s3 13671us : __tcp_ack_snd_check (tcp_rcv_established)
<idle>-0 0d.s3 13672us : tcp_send_ack (__tcp_ack_snd_check)
<idle>-0 0d.s3 13672us : __alloc_skb (tcp_send_ack)
<idle>-0 0d.s3 13673us : kmem_cache_alloc (__alloc_skb)
<idle>-0 0d.s3 13674us : __kmalloc (__alloc_skb)
<idle>-0 0d.s3 13675us : tcp_transmit_skb (tcp_send_ack)
<idle>-0 0d.s3 13676us : __tcp_select_window (tcp_transmit_skb)
<idle>-0 0d.s3 13677us : tcp_v4_send_check (tcp_transmit_skb)
<idle>-0 0d.s3 13678us : ip_queue_xmit (tcp_transmit_skb)
<idle>-0 0d.s3 13679us : ipv4_dst_check (ip_queue_xmit)
<idle>-0 0d.s3 13680us : memset (ip_queue_xmit)
<idle>-0 0d.s3 13681us : ip_route_output_flow (ip_queue_xmit)
<idle>-0 0d.s3 13682us : __ip_route_output_key (ip_route_output_flow)
<idle>-0 0d.s3 13682us : rt_hash_code (__ip_route_output_key)
<idle>-0 0d.s3 13683us : local_bh_enable (__ip_route_output_key)
<idle>-0 0d.s3 13684us : memset (__ip_route_output_key)
<idle>-0 0d.s3 13685us : ip_dev_find (__ip_route_output_key)
<idle>-0 0d.s3 13686us : memset (ip_dev_find)
<idle>-0 0d.s3 13687us : fn_hash_lookup (ip_dev_find)
<idle>-0 0d.s4 13688us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s3 13689us : fn_hash_lookup (__ip_route_output_key)
<idle>-0 0d.s3 13690us : fn_hash_lookup (__ip_route_output_key)
<idle>-0 0d.s4 13691us : fib_semantic_match (fn_hash_lookup)
<idle>-0 0d.s3 13692us : fn_hash_select_default (__ip_route_output_key)
<idle>-0 0d.s3 13693us : dst_alloc (__ip_route_output_key)
<idle>-0 0d.s3 13694us+: kmem_cache_alloc (dst_alloc)
<idle>-0 0d.s3 13696us : rt_set_nexthop (__ip_route_output_key)
<idle>-0 0d.s3 13697us : rt_hash_code (__ip_route_output_key)
<idle>-0 0d.s3 13698us : rt_intern_hash (__ip_route_output_key)
<idle>-0 0d.s4 13698us : arp_bind_neighbour (rt_intern_hash)
<idle>-0 0d.s4 13699us : neigh_lookup (arp_bind_neighbour)
<idle>-0 0d.s4 13700us : arp_hash (neigh_lookup)
<idle>-0 0d.s5 13701us : memcmp (neigh_lookup)
<idle>-0 0d.s4 13702us : local_bh_enable (neigh_lookup)
<idle>-0 0d.s3 13703us : local_bh_enable (rt_intern_hash)
<idle>-0 0d.s3 13704us : ip_output (ip_queue_xmit)
<idle>-0 0d.s3 13705us : neigh_resolve_output (ip_output)
<idle>-0 0d.s4 13706us : eth_header (neigh_resolve_output)
<idle>-0 0d.s3 13707us : local_bh_enable (neigh_resolve_output)
<idle>-0 0d.s3 13708us : dev_queue_xmit (neigh_resolve_output)
<idle>-0 0d.s4 13709us : pfifo_fast_enqueue (dev_queue_xmit)
<idle>-0 0d.s4 13710us : qdisc_restart (dev_queue_xmit)
<idle>-0 0d.s4 13710us : pfifo_fast_dequeue (qdisc_restart)
<idle>-0 0d.s4 13711us : dev_queue_xmit_nit (qdisc_restart)
<idle>-0 0d.s5 13712us : skb_clone (dev_queue_xmit_nit)
<idle>-0 0d.s5 13713us : kmem_cache_alloc (skb_clone)
<idle>-0 0d.s5 13714us : packet_rcv_spkt (dev_queue_xmit_nit)
<idle>-0 0d.s5 13715us : strlcpy (packet_rcv_spkt)
<idle>-0 0d.s6 13716us : sk_run_filter (packet_rcv_spkt)
<idle>-0 0d.s5 13717us : __kfree_skb (packet_rcv_spkt)
<idle>-0 0d.s5 13718us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s5 13718us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s5 13719us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s4 13720us : rhine_start_tx (qdisc_restart)
<idle>-0 0d.s5 13721us : ioread8 (rhine_start_tx)
<idle>-0 0d.s5 13722us : iowrite8 (rhine_start_tx)
<idle>-0 0d.s5 13723us : ioread8 (rhine_start_tx)
<idle>-0 0d.s4 13725us : qdisc_restart (dev_queue_xmit)
<idle>-0 0d.s4 13725us : pfifo_fast_dequeue (qdisc_restart)
<idle>-0 0d.s3 13726us : local_bh_enable (dev_queue_xmit)
<idle>-0 0d.s3 13727us : sock_def_readable (tcp_rcv_established)
<idle>-0 0d.s4 13728us : __wake_up (sock_def_readable)
<idle>-0 0d.s5 13729us : __wake_up_common (__wake_up)
<idle>-0 0d.s5 13730us : ep_poll_callback (__wake_up_common)
<idle>-0 0d.s5 13732us : ep_poll_safewake (ep_poll_callback)
<idle>-0 0d.s5 13733us : __wake_up (ep_poll_safewake)
<idle>-0 0d.s6 13733us : __wake_up_common (__wake_up)
<idle>-0 0d.s6 13734us : default_wake_function (__wake_up_common)
<idle>-0 0d.s6 13735us+: try_to_wake_up (default_wake_function)
<idle>-0 0d.H5 13737us : do_IRQ (c0111d1e b 0)
<idle>-0 0d.h. 13738us : __do_IRQ (do_IRQ)
<idle>-0 0d.h1 13739us+: mask_and_ack_8259A (__do_IRQ)
<idle>-0 0d.h. 13743us : handle_IRQ_event (__do_IRQ)
<idle>-0 0d.h. 13744us : usb_hcd_irq (handle_IRQ_event)
<idle>-0 0d.h. 13745us : uhci_irq (usb_hcd_irq)
<idle>-0 0d.h. 13746us : rhine_interrupt (handle_IRQ_event)
<idle>-0 0d.h. 13746us : ioread16 (rhine_interrupt)
<idle>-0 0d.h. 13748us : ioread8 (rhine_interrupt)
<idle>-0 0d.h. 13749us : iowrite16 (rhine_interrupt)
<idle>-0 0d.h. 13750us : ioread8 (rhine_interrupt)
<idle>-0 0d.h1 13751us : raise_softirq_irqoff (rhine_interrupt)
<idle>-0 0d.h. 13752us : ioread16 (rhine_interrupt)
<idle>-0 0d.h. 13753us : ioread8 (rhine_interrupt)
<idle>-0 0d.h. 13754us : via_driver_irq_handler (handle_IRQ_event)
<idle>-0 0d.h1 13756us : note_interrupt (__do_IRQ)
<idle>-0 0d.h1 13756us : end_8259A_irq (__do_IRQ)
<idle>-0 0d.h1 13757us : enable_8259A_irq (end_8259A_irq)
<idle>-0 0d.H5 13759us : irq_exit (do_IRQ)
<idle>-0 0d.s5 13760us+< (2097760)
<idle>-0 0d.s. 13762us : tasklet_action (__do_softirq)
<idle>-0 0d.s. 13763us : rcu_process_callbacks (tasklet_action)
<idle>-0 0d.s. 13764us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 13765us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 13766us : run_timer_softirq (__do_softirq)
<idle>-0 0d.s. 13766us : hrtimer_run_queues (run_timer_softirq)
<idle>-0 0d.s. 13767us : ktime_get_real (hrtimer_run_queues)
<idle>-0 0d.s. 13768us : getnstimeofday (ktime_get_real)
<idle>-0 0d.s. 13768us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 13769us : get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 13770us : ktime_get (hrtimer_run_queues)
<idle>-0 0d.s. 13771us : ktime_get_ts (ktime_get)
<idle>-0 0d.s. 13771us : getnstimeofday (ktime_get_ts)
<idle>-0 0d.s. 13772us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 13773us : get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 13773us : set_normalized_timespec (ktime_get_ts)
<idle>-0 0d.s. 13775us : net_tx_action (__do_softirq)
<idle>-0 0d.s. 13775us : __kfree_skb (net_tx_action)
<idle>-0 0d.s. 13776us : sock_wfree (__kfree_skb)
<idle>-0 0d.s. 13777us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s. 13778us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s. 13778us : kfree (skb_release_data)
<idle>-0 0d.s. 13779us : kmem_cache_free (kfree_skbmem)
<idle>-0 0d.s. 13780us : __kfree_skb (net_tx_action)
<idle>-0 0d.s. 13780us : sock_wfree (__kfree_skb)
<idle>-0 0d.s. 13781us : kfree_skbmem (__kfree_skb)
<idle>-0 0d.s. 13782us : skb_release_data (kfree_skbmem)
<idle>-0 0d.s. 13782us : kfree (skb_release_data)
<idle>-0 0d.s. 13783us+: kmem_cache_free (kfree_skbmem)
<idle>-0 0dn.1 13785us < (2097760)
<idle>-0 0dn.. 13787us : schedule (cpu_idle)
<idle>-0 0dn.. 13788us : stop_trace (schedule)
<idle>-0 0dn.. 13789us : profile_hit (schedule)
<idle>-0 0dn.1 13789us+: sched_clock (schedule)
<idle>-0 0dn.2 13791us : recalc_task_prio (schedule)
<idle>-0 0dn.2 13793us : effective_prio (recalc_task_prio)
<idle>-0 0dn.2 13794us : requeue_task (schedule)
<...>-3116 0d..2 13797us+: __switch_to (schedule)
<...>-3116 0d..2 13801us : schedule <<idle>-0> (8c 73)
<...>-3116 0d..1 13802us : trace_stop_sched_switched (schedule)
<...>-3116 0d..2 13802us : trace_stop_sched_switched <<...>-3116> (73 0)
<...>-3116 0d..2 13804us : schedule (schedule)

Lee



2006-01-24 07:56:18

by Ingo Molnar

Subject: Re: RCU latency regression in 2.6.16-rc1


* Lee Revell <[email protected]> wrote:

> I ported the latency tracer to 2.6.16 and got this 13ms latency within
> a few hours. This is a regression from 2.6.15.
>
> It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> a single batch.

could you try the PREEMPT_RCU patch below?

Ingo

--------
The resulting patch compiles in the following configurations, and has
been tested on x86 and ppc64:

o CONFIG_RCU_TORTURE_TEST:

   o CONFIG_RCU_STATS & CONFIG_PREEMPT & CONFIG_PREEMPT_RCU

   o !CONFIG_RCU_STATS & CONFIG_PREEMPT & CONFIG_PREEMPT_RCU

   o !CONFIG_RCU_STATS & CONFIG_PREEMPT & !CONFIG_PREEMPT_RCU

   o !CONFIG_RCU_STATS & !CONFIG_PREEMPT & !CONFIG_PREEMPT_RCU

And ditto with !CONFIG_RCU_TORTURE_TEST. Note that one cannot do
CONFIG_RCU_STATS with !CONFIG_PREEMPT_RCU, since CONFIG_RCU_STATS is
currently only capable of working with CONFIG_PREEMPT_RCU's internal
data structures. One for the todo list. In principle, one could do
CONFIG_PREEMPT_RCU & !CONFIG_PREEMPT, but there is little point in having
preemptible RCU read-side critical sections in a non-preemptible kernel.
I will make this work later (but only for performance comparison purposes)
when I have the heavyweight instructions cleaned out of rcu_read_lock()
and rcu_read_unlock().

And yes, "classic RCU" does pass the torture test. Though I got a bit
of a scare with a broken version of CONFIG_RCU_TORTURE_TEST. ;-)

Thanx, Paul

Signed-off-by: <[email protected]>

Signed-off-by: Ingo Molnar <[email protected]>

Documentation/RCU/proc.txt | 207 +++++++++++++++++
fs/proc/proc_misc.c | 48 ++++
include/linux/rcupdate.h | 29 ++
include/linux/sched.h | 5
kernel/Kconfig.preempt | 24 ++
kernel/rcupdate.c | 533 +++++++++++++++++++++++++++++++++++++++++----
6 files changed, 799 insertions(+), 47 deletions(-)

Index: linux-rt.q/Documentation/RCU/proc.txt
===================================================================
--- /dev/null
+++ linux-rt.q/Documentation/RCU/proc.txt
@@ -0,0 +1,207 @@
+/proc Filesystem Entries for RCU
+
+
+CONFIG_RCU_STATS
+
+The CONFIG_RCU_STATS config option is available only in conjunction with
+CONFIG_PREEMPT_RCU. It makes four /proc entries available, namely: rcuctrs,
+rcuptrs, rcugp, and rcustats.
+
+/proc/rcuctrs
+
+ CPU last cur
+ 0 1 1
+ 1 1 1
+ 2 1 1
+ 3 0 2
+ ggp = 230725
+
+This displays the number of processes that started RCU read-side critical
+sections on each CPU. In absence of preemption, the "last" and "cur"
+counts for a given CPU will always sum to one. Therefore, in the example
+output above, each CPU has started one RCU read-side critical section
+that was later preempted. The "last" column counts RCU read-side critical
+sections that started prior to the last counter flip, while the "cur"
+column counts critical sections that started after the last counter flip.
+
+The "ggp" count is a count of the number of counter flips since boot.
+Since this is shown as an odd number, the "cur" counts are stored in
+the zero-th element of each of the per-CPU arrays, and the "last" counts
+are stored in the first element of each of the per-CPU arrays.
+
+
+/proc/rcuptrs
+
+ nl=c04c7160/c04c7960 nt=c04c72d0
+ wl=c04c7168/c04c794c wt=c04c72bc dl=c04c7170/00000000 dt=c04c7170
+
+This displays the head and tail of each of CONFIG_PREEMPT_RCU's three
+callback lists. This will soon change to display this on a per-CPU
+basis, since each CPU will soon have its own set of callback lists.
+In the example above, the "next" list header is located at hex address
+0xc04c7160, the first element on the list at hex address 0xc04c7960,
+and the last element on the list at hex address 0xc04c72d0. The "wl="
+and "wt=" output is similar for the "wait" list, and the "dl=" and "dt="
+output for the "done" list. The "done" list is normally emptied very
+quickly after being filled, so will usually be empty as shown above.
+Note that the tail pointer points into the list header in this case.
+
+Callbacks are placed in the "next" list by call_rcu(), moved to the
+"wait" list after the next counter flip, and moved to the "done" list
+on the counter flip after that. Once on the "done" list, the callbacks
+are invoked.
+
+
+/proc/rcugp
+
+ oldggp=241419 newggp=241421
+
+This entry invokes synchronize_rcu() and prints out the number of counter
+flips since boot before and after the synchronize_rcu(). These two
+numbers will always differ by at least two. Unless RCU is broken. ;-)
+
+
+/proc/rcustats
+
+ ggp=242416 lgp=242416 sr=0 rcc=396233
+ na=2090938 nl=9 wa=2090929 wl=9 dl=0 dr=2090920 di=2090920
+ rtf1=22230730 rtf2=20139162 rtf3=242416 rtfe1=2085911 rtfe2=5657 rtfe3=19896746
+
+The quantities printed are as follows:
+
+o "ggp=": The number of flips since boot.
+
+o "lgp=": The number of flips sensed by the local structure since
+ boot. This will soon be per-CPU.
+
+o "sr=": The number of explicit calls to synchronize_rcu().
+ Except that this is currently broken, so always reads as zero.
+ It is likely to be removed...
+
+o "rcc=": The number of calls to rcu_check_callbacks().
+
+o "na=": The number of callbacks that call_rcu() has registered
+ since boot.
+
+o "nl=": The number of callbacks currently on the "next" list.
+
+o "wa=": The number of callbacks that have moved to the "wait"
+ list since boot.
+
+o "wl=": The number of callbacks currently on the "wait" list.
+
+o "da=": The number of callbacks that have been moved to the
+ "done" list since boot.
+
+o "dl=": The number of callbacks currently on the "done" list.
+
+o "dr=": The number of callbacks that have been removed from the
+ "done" list since boot.
+
+o "di=": The number of callbacks that have been invoked after being
+ removed from the "done" list.
+
+o "rtf1=": The number of attempts to flip the counters.
+
+o "rtf2=": The number of attempts to flip the counters that successfully
+ acquired the fliplock.
+
+o "rtf3=": The number of successful counter flips.
+
+o "rtfe1=": The number of attempts to flip the counters that failed
+ due to the lock being held by someone else.
+
+o "rtfe2=": The number of attempts to flip the counters that were
+ abandoned due to someone else doing the job for us.
+
+o "rtfe3=": The number of attempts to flip the counters that failed
+ due to some task still being in an RCU read-side critical section
+ starting from before the last successful counter flip.
+
+
+CONFIG_RCU_TORTURE_TEST
+
+The CONFIG_RCU_TORTURE_TEST config option is available for all RCU
+implementations. It makes three /proc entries available, namely: rcutw,
+rcutr, and rcuts.
+
+
+/proc/rcutw
+
+Reading this entry starts a new torture test, or ends an earlier one
+if one is already in progress (in other words, there can be only one
+writer at a time). This sleeps uninterruptibly, so be sure to run
+it in the background. One could argue that it would be good to have
+multiple writers, but Linux uses RCU heavily enough that you will get
+write-side contention whether you want it or not. If you want additional
+write-side contention, repeatedly create and destroy several large file
+trees in parallel. Or use some other RCU-protected update.
+
+
+/proc/rcutr
+
+Reading this entry starts a new torture reader, which runs until sent
+a signal (e.g., control-C). If testing an RCU implementation with
+preemptible read-side critical sections, make sure to spawn at least
+two /proc/rcutr instances for each CPU.
+
+
+/proc/rcuts
+
+Displays the current state of the torture test:
+
+ ggp = 20961
+ rtc: c04496f4 ver: 8734 tfle: 0 rta: 8734 rtaf: 0 rtf: 8715
+ Reader Pipe: 88024120 63914 0 0 0 0 0 0 0 0 0
+ Reader Batch: 88024097 63937 0 0 0 0 0 0 0 0
+ Free-Block Circulation: 8733 8731 8729 8727 8725 8723 8721 8719 8717 8715 0
+
+The entries are as follows:
+
+o "ggp": The number of counter flips (or batches) since boot.
+
+o "rtc": The hexadecimal address of the structure currently visible
+ to readers.
+
+o "ver": The number of times since boot that the rcutw writer task
+ has changed the structure visible to readers.
+
+o "tfle": If non-zero, indicates that the "torture freelist"
+ containing structures to be placed into the "rtc" area is empty.
+ This condition is important, since it can fool you into thinking
+ that RCU is working when it is not. :-/
+
+o "rta": Number of structures allocated from the torture freelist.
+
+o "rtaf": Number of allocations from the torture freelist that have
+ failed due to the list being empty.
+
+o "rtf": Number of frees into the torture freelist.
+
+o "Reader Pipe": Histogram of "ages" of structures seen by readers.
+ If any entries past the first two are non-zero, RCU is broken.
+ And /proc/rcuts prints "!!!" to make sure you notice. The age
+ of a newly allocated structure is zero, it becomes one when
+ removed from reader visibility, and is incremented once per
+ grace period subsequently -- and is freed after passing through
+ (RCU_TORTURE_PIPE_LEN-2) grace periods.
+
+ The output displayed above was taken from a correctly working
+ RCU. If you want to see what it looks like when broken, break
+ it yourself. ;-)
+
+o "Reader Batch": Another histogram of "ages" of structures seen
+ by readers, but in terms of counter flips (or batches) rather
+ than in terms of grace periods. The legal number of non-zero
+ entries is again two. The reason for this separate view is
+ that it is easier to get the third entry to show up in the
+ "Reader Batch" list than in the "Reader Pipe" list.
+
+o "Free-Block Circulation": Shows the number of torture structures
+ that have reached a given point in the pipeline. The first element
+ should closely correspond to the number of structures allocated,
+ the second to the number that have been removed from reader view,
+ and all but the last remaining to the corresponding number of
+ passes through a grace period. The last entry should be zero,
+ as it is only incremented if a torture structure's counter
+ somehow gets incremented farther than it should.
Index: linux-rt.q/fs/proc/proc_misc.c
===================================================================
--- linux-rt.q.orig/fs/proc/proc_misc.c
+++ linux-rt.q/fs/proc/proc_misc.c
@@ -563,6 +563,48 @@ void create_seq_entry(char *name, mode_t
entry->proc_fops = f;
}

+#ifdef CONFIG_RCU_STATS
+int rcu_read_proc(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len;
+ extern int rcu_read_proc_data(char *page);
+
+ len = rcu_read_proc_data(page);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
+int rcu_read_proc_gp(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len;
+ extern int rcu_read_proc_gp_data(char *page);
+
+ len = rcu_read_proc_gp_data(page);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
+int rcu_read_proc_ptrs(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len;
+ extern int rcu_read_proc_ptrs_data(char *page);
+
+ len = rcu_read_proc_ptrs_data(page);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
+int rcu_read_proc_ctrs(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len;
+ extern int rcu_read_proc_ctrs_data(char *page);
+
+ len = rcu_read_proc_ctrs_data(page);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+#endif /* #ifdef CONFIG_RCU_STATS */
+
void __init proc_misc_init(void)
{
struct proc_dir_entry *entry;
@@ -585,6 +627,12 @@ void __init proc_misc_init(void)
{"cmdline", cmdline_read_proc},
{"locks", locks_read_proc},
{"execdomains", execdomains_read_proc},
+#ifdef CONFIG_RCU_STATS
+ {"rcustats", rcu_read_proc},
+ {"rcugp", rcu_read_proc_gp},
+ {"rcuptrs", rcu_read_proc_ptrs},
+ {"rcuctrs", rcu_read_proc_ctrs},
+#endif /* #ifdef CONFIG_RCU_STATS */
{NULL,}
};
for (p = simple_ones; p->name; p++)
Index: linux-rt.q/include/linux/rcupdate.h
===================================================================
--- linux-rt.q.orig/include/linux/rcupdate.h
+++ linux-rt.q/include/linux/rcupdate.h
@@ -59,6 +59,7 @@ struct rcu_head {
} while (0)


+#ifndef CONFIG_PREEMPT_RCU

/* Global control variables for rcupdate callback mechanism. */
struct rcu_ctrlblk {
@@ -185,14 +186,26 @@ static inline int rcu_pending(int cpu)
*
* It is illegal to block while in an RCU read-side critical section.
*/
-#define rcu_read_lock() preempt_disable()
+#define rcu_read_lock preempt_disable

/**
* rcu_read_unlock - marks the end of an RCU read-side critical section.
*
* See rcu_read_lock() for more information.
*/
-#define rcu_read_unlock() preempt_enable()
+#define rcu_read_unlock preempt_enable
+
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+
+#define rcu_qsctr_inc(cpu)
+#define rcu_bh_qsctr_inc(cpu)
+#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
+
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
+extern int rcu_pending(int cpu);
+
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */

/*
* So where is rcu_write_lock()? It does not exist, as there is no
@@ -215,14 +228,22 @@ static inline int rcu_pending(int cpu)
* can use just rcu_read_lock().
*
*/
+#ifndef CONFIG_PREEMPT_RCU
#define rcu_read_lock_bh() local_bh_disable()
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+#define rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */

/*
* rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
*
* See rcu_read_lock_bh() for more information.
*/
+#ifndef CONFIG_PREEMPT_RCU
#define rcu_read_unlock_bh() local_bh_enable()
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+#define rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); }
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */

/**
* rcu_dereference - fetch an RCU-protected pointer in an
@@ -271,7 +292,11 @@ static inline int rcu_pending(int cpu)
* synchronize_kernel() API. In contrast, synchronize_rcu() only
* guarantees that rcu_read_lock() sections will have completed.
*/
+#ifndef CONFIG_PREEMPT_RCU
#define synchronize_sched() synchronize_rcu()
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+extern void synchronize_sched(void);
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */

extern void rcu_init(void);
extern void rcu_check_callbacks(int cpu, int user);
Index: linux-rt.q/include/linux/sched.h
===================================================================
--- linux-rt.q.orig/include/linux/sched.h
+++ linux-rt.q/include/linux/sched.h
@@ -711,6 +711,11 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;

+#ifdef CONFIG_PREEMPT_RCU
+ int rcu_read_lock_nesting;
+ atomic_t *rcu_flipctr1;
+ atomic_t *rcu_flipctr2;
+#endif
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
#endif
Index: linux-rt.q/kernel/Kconfig.preempt
===================================================================
--- linux-rt.q.orig/kernel/Kconfig.preempt
+++ linux-rt.q/kernel/Kconfig.preempt
@@ -63,3 +63,27 @@ config PREEMPT_BKL
Say Y here if you are building a kernel for a desktop system.
Say N if you are unsure.

+config PREEMPT_RCU
+ bool "Preemptible RCU"
+ default n
+ depends on PREEMPT
+ help
+ This option reduces the latency of the kernel by making certain
+ RCU sections preemptible. Normally RCU code is non-preemptible, if
+ this option is selected then read-only RCU sections become
+ preemptible. This helps latency, but may expose bugs due to
+ now-naive assumptions about each RCU read-side critical section
+ remaining on a given CPU through its execution.
+
+ Say N if you are unsure.
+
+config RCU_STATS
+ bool "/proc stats for preemptible RCU read-side critical sections"
+ depends on PREEMPT_RCU
+ default y
+ help
+ This option provides /proc stats to provide debugging info for
+ the preemptible realtime RCU implementation.
+
+ Say Y here if you want to see RCU stats in /proc
+ Say N if you are unsure.
Index: linux-rt.q/kernel/rcupdate.c
===================================================================
--- linux-rt.q.orig/kernel/rcupdate.c
+++ linux-rt.q/kernel/rcupdate.c
@@ -19,15 +19,15 @@
*
* Authors: Dipankar Sarma <[email protected]>
* Manfred Spraul <[email protected]>
+ * Paul E. McKenney <[email protected]> (PREEMPT_RCU)
*
* Based on the original work by Paul McKenney <[email protected]>
* and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
- * Papers:
- * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
- * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * Papers: http://www.rdrop.com/users/paulmck/RCU
*
* For detailed explanation of Read-Copy Update mechanism see -
- * http://lse.sourceforge.net/locking/rcupdate.html
+ * Documentation/RCU/ *.txt
*
*/
#include <linux/types.h>
@@ -47,6 +47,69 @@
#include <linux/rcupdate.h>
#include <linux/rcuref.h>
#include <linux/cpu.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include <linux/byteorder/swabb.h>
+
+struct rcu_synchronize {
+ struct rcu_head head;
+ struct completion completion;
+};
+
+/* Because of FASTCALL declaration of complete, we use this wrapper */
+static void wakeme_after_rcu(struct rcu_head *head)
+{
+ struct rcu_synchronize *rcu;
+
+ rcu = container_of(head, struct rcu_synchronize, head);
+ complete(&rcu->completion);
+}
+
+/**
+ * synchronize_rcu - wait until a grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full grace
+ * period has elapsed, in other words after all currently executing RCU
+ * read-side critical sections have completed. RCU read-side critical
+ * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
+ * and may be nested.
+ *
+ * If your read-side code is not protected by rcu_read_lock(), do -not-
+ * use synchronize_rcu().
+ */
+void synchronize_rcu(void)
+{
+ struct rcu_synchronize rcu;
+
+ init_completion(&rcu.completion);
+ /* Will wake me after RCU finished */
+ call_rcu(&rcu.head, wakeme_after_rcu);
+
+ /* Wait for it */
+ wait_for_completion(&rcu.completion);
+}
+
+#ifndef __HAVE_ARCH_CMPXCHG
+/*
+ * We use an array of spinlocks for the rcurefs -- similar to ones in sparc
+ * 32 bit atomic_t implementations, and a hash function similar to that
+ * for our refcounting needs.
+ * Can't help multiprocessors which donot have cmpxchg :(
+ */
+spinlock_t __rcuref_hash[RCUREF_HASH_SIZE];
+
+static inline void init_rcurefs(void)
+{
+ int i;
+
+ for (i = 0; i < RCUREF_HASH_SIZE; i++)
+ spin_lock_init(&__rcuref_hash[i]);
+}
+#else
+#define init_rcurefs() do { } while (0)
+#endif
+
+#ifndef CONFIG_PREEMPT_RCU

/* Definition for rcupdate control block. */
struct rcu_ctrlblk rcu_ctrlblk =
@@ -73,18 +136,6 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_d
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
static int maxbatch = 10000;

-#ifndef __HAVE_ARCH_CMPXCHG
-/*
- * We use an array of spinlocks for the rcurefs -- similar to ones in sparc
- * 32 bit atomic_t implementations, and a hash function similar to that
- * for our refcounting needs.
- * Can't help multiprocessors which donot have cmpxchg :(
- */
-
-spinlock_t __rcuref_hash[RCUREF_HASH_SIZE] = {
- [0 ... (RCUREF_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
-};
-#endif

/**
* call_rcu - Queue an RCU callback for invocation after a grace period.
@@ -506,48 +557,370 @@ static struct notifier_block __devinitda
void __init rcu_init(void)
{
sema_init(&rcu_barrier_sema, 1);
+ init_rcurefs();
rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
(void *)(long)smp_processor_id());
/* Register notifier for non-boot CPUs */
register_cpu_notifier(&rcu_nb);
}

-struct rcu_synchronize {
- struct rcu_head head;
- struct completion completion;
+/*
+ * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
+ */
+void synchronize_kernel(void)
+{
+ synchronize_rcu();
+}
+
+module_param(maxbatch, int, 0);
+EXPORT_SYMBOL(call_rcu); /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL(call_rcu_bh); /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL_GPL(synchronize_rcu);
+EXPORT_SYMBOL(synchronize_kernel); /* WARNING: GPL-only in April 2006. */
+
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+
+#ifndef CONFIG_PREEMPT_RT
+# define raw_spinlock_t spinlock_t
+# define RAW_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED
+# define raw_local_irq_save local_irq_save
+# define raw_local_irq_restore local_irq_restore
+#endif /* #ifndef CONFIG_PREEMPT_RT */
+
+struct rcu_data {
+ raw_spinlock_t lock;
+ long completed; /* Number of last completed batch. */
+ struct tasklet_struct rcu_tasklet;
+ struct rcu_head *nextlist;
+ struct rcu_head **nexttail;
+ struct rcu_head *waitlist;
+ struct rcu_head **waittail;
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+#ifdef CONFIG_RCU_STATS
+ long n_next_length;
+ long n_next_add;
+ long n_wait_length;
+ long n_wait_add;
+ long n_done_length;
+ long n_done_add;
+ long n_done_remove;
+ atomic_t n_done_invoked;
+ long n_rcu_check_callbacks;
+ atomic_t n_rcu_try_flip1;
+ long n_rcu_try_flip2;
+ long n_rcu_try_flip3;
+ atomic_t n_rcu_try_flip_e1;
+ long n_rcu_try_flip_e2;
+ long n_rcu_try_flip_e3;
+#endif /* #ifdef CONFIG_RCU_STATS */
+};
+struct rcu_ctrlblk {
+ raw_spinlock_t fliplock;
+ long completed; /* Number of last completed batch. */
};
+static struct rcu_data rcu_data;
+static struct rcu_ctrlblk rcu_ctrlblk = {
+ .fliplock = RAW_SPIN_LOCK_UNLOCKED,
+ .completed = 0,
+};
+static DEFINE_PER_CPU(atomic_t [2], rcu_flipctr) =
+ { ATOMIC_INIT(0), ATOMIC_INIT(0) };

-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head *head)
+/*
+ * Return the number of RCU batches processed thus far. Useful
+ * for debug and statistics.
+ */
+long rcu_batches_completed(void)
{
- struct rcu_synchronize *rcu;
+ return rcu_ctrlblk.completed;
+}

- rcu = container_of(head, struct rcu_synchronize, head);
- complete(&rcu->completion);
+void
+rcu_read_lock(void)
+{
+ int flipctr;
+ unsigned long oldirq;
+
+ raw_local_irq_save(oldirq);
+ if (current->rcu_read_lock_nesting++ == 0) {
+
+ /*
+ * Outermost nesting of rcu_read_lock(), so atomically
+ * increment the current counter for the current CPU.
+ */
+
+ flipctr = rcu_ctrlblk.completed & 0x1;
+ smp_read_barrier_depends();
+ current->rcu_flipctr1 = &(__get_cpu_var(rcu_flipctr)[flipctr]);
+ /* Can optimize to non-atomic on fastpath, but start simple. */
+ atomic_inc(current->rcu_flipctr1);
+ smp_mb__after_atomic_inc(); /* might optimize out... */
+ if (unlikely(flipctr != (rcu_ctrlblk.completed & 0x1))) {
+
+ /*
+ * We raced with grace-period processing (flip).
+ * Although we cannot be preempted here, there
+ * could be interrupts, ECC errors and the like,
+ * so just nail down both sides of the rcu_flipctr
+ * array for the duration of our RCU read-side
+ * critical section, preventing a second flip
+ * from racing with us. At some point, it would
+ * be safe to decrement one of the counters, but
+ * we have no way of knowing when that would be.
+ * So just decrement them both in rcu_read_unlock().
+ */
+
+ current->rcu_flipctr2 =
+ &(__get_cpu_var(rcu_flipctr)[!flipctr]);
+ /* Can again optimize to non-atomic on fastpath. */
+ atomic_inc(current->rcu_flipctr2);
+ smp_mb__after_atomic_inc(); /* might optimize out... */
+ }
+ }
+ raw_local_irq_restore(oldirq);
}

-/**
- * synchronize_rcu - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
+void
+rcu_read_unlock(void)
+{
+ unsigned long oldirq;
+
+ raw_local_irq_save(oldirq);
+ if (--current->rcu_read_lock_nesting == 0) {
+
+ /*
+ * Just atomically decrement whatever we incremented.
+ * Might later want to awaken some task waiting for the
+ * grace period to complete, but keep it simple for the
+ * moment.
+ */
+
+ smp_mb__before_atomic_dec();
+ atomic_dec(current->rcu_flipctr1);
+ current->rcu_flipctr1 = NULL;
+ if (unlikely(current->rcu_flipctr2 != NULL)) {
+ atomic_dec(current->rcu_flipctr2);
+ current->rcu_flipctr2 = NULL;
+ }
+ }
+ raw_local_irq_restore(oldirq);
+}
+
+static void
+__rcu_advance_callbacks(void)
+{
+
+ if (rcu_data.completed != rcu_ctrlblk.completed) {
+ if (rcu_data.waitlist != NULL) {
+ *rcu_data.donetail = rcu_data.waitlist;
+ rcu_data.donetail = rcu_data.waittail;
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_done_length += rcu_data.n_wait_length;
+ rcu_data.n_done_add += rcu_data.n_wait_length;
+ rcu_data.n_wait_length = 0;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ }
+ if (rcu_data.nextlist != NULL) {
+ rcu_data.waitlist = rcu_data.nextlist;
+ rcu_data.waittail = rcu_data.nexttail;
+ rcu_data.nextlist = NULL;
+ rcu_data.nexttail = &rcu_data.nextlist;
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_wait_length += rcu_data.n_next_length;
+ rcu_data.n_wait_add += rcu_data.n_next_length;
+ rcu_data.n_next_length = 0;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ } else {
+ rcu_data.waitlist = NULL;
+ rcu_data.waittail = &rcu_data.waitlist;
+ }
+ rcu_data.completed = rcu_ctrlblk.completed;
+ }
+}
+
+/*
+ * Attempt a single flip of the counters. Remember, a single flip does
+ * -not- constitute a grace period. Instead, the interval between
+ * a pair of consecutive flips is a grace period.
*
- * If your read-side code is not protected by rcu_read_lock(), do -not-
- * use synchronize_rcu().
+ * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
+ * on a large SMP, they might want to use a hierarchical organization of
+ * the per-CPU-counter pairs.
+ */
+static void
+rcu_try_flip(void)
+{
+ int cpu;
+ long flipctr;
+ unsigned long oldirq;
+
+ flipctr = rcu_ctrlblk.completed;
+#ifdef CONFIG_RCU_STATS
+ atomic_inc(&rcu_data.n_rcu_try_flip1);
+#endif /* #ifdef CONFIG_RCU_STATS */
+ if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
+#ifdef CONFIG_RCU_STATS
+ atomic_inc(&rcu_data.n_rcu_try_flip_e1);
+#endif /* #ifdef CONFIG_RCU_STATS */
+ return;
+ }
+ if (unlikely(flipctr != rcu_ctrlblk.completed)) {
+
+ /* Our work is done! ;-) */
+
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_rcu_try_flip_e2++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+ return;
+ }
+ flipctr &= 0x1;
+
+ /*
+ * Check for completion of all RCU read-side critical sections
+ * that started prior to the previous flip.
+ */
+
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_rcu_try_flip2++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ for_each_cpu(cpu) {
+ if (atomic_read(&per_cpu(rcu_flipctr, cpu)[!flipctr]) != 0) {
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_rcu_try_flip_e3++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+ return;
+ }
+ }
+
+ /* Do the flip. */
+
+ smp_mb();
+ rcu_ctrlblk.completed++;
+
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_rcu_try_flip3++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+}
+
+void
+rcu_check_callbacks(int cpu, int user)
+{
+ unsigned long oldirq;
+
+ if (rcu_ctrlblk.completed == rcu_data.completed) {
+ rcu_try_flip();
+ if (rcu_ctrlblk.completed == rcu_data.completed) {
+ return;
+ }
+ }
+ spin_lock_irqsave(&rcu_data.lock, oldirq);
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_rcu_check_callbacks++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ __rcu_advance_callbacks();
+ if (rcu_data.donelist == NULL) {
+ spin_unlock_irqrestore(&rcu_data.lock, oldirq);
+ } else {
+ spin_unlock_irqrestore(&rcu_data.lock, oldirq);
+ tasklet_schedule(&rcu_data.rcu_tasklet);
+ }
+}
+
+static
+void rcu_process_callbacks(unsigned long data)
+{
+ unsigned long flags;
+ struct rcu_head *next, *list;
+
+ spin_lock_irqsave(&rcu_data.lock, flags);
+ list = rcu_data.donelist;
+ if (list == NULL) {
+ spin_unlock_irqrestore(&rcu_data.lock, flags);
+ return;
+ }
+ rcu_data.donelist = NULL;
+ rcu_data.donetail = &rcu_data.donelist;
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_done_remove += rcu_data.n_done_length;
+ rcu_data.n_done_length = 0;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ spin_unlock_irqrestore(&rcu_data.lock, flags);
+ while (list) {
+ next = list->next;
+ list->func(list);
+ list = next;
+#ifdef CONFIG_RCU_STATS
+ atomic_inc(&rcu_data.n_done_invoked);
+#endif /* #ifdef CONFIG_RCU_STATS */
+ }
+}
+
+void fastcall
+call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
+{
+ unsigned long flags;
+
+ head->func = func;
+ head->next = NULL;
+ spin_lock_irqsave(&rcu_data.lock, flags);
+ __rcu_advance_callbacks();
+ *rcu_data.nexttail = head;
+ rcu_data.nexttail = &head->next;
+#ifdef CONFIG_RCU_STATS
+ rcu_data.n_next_add++;
+ rcu_data.n_next_length++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+ spin_unlock_irqrestore(&rcu_data.lock, flags);
+}
+
+/*
+ * Crude hack, reduces but does not eliminate possibility of failure.
+ * Needs to wait for all CPUs to pass through a -voluntary- context
+ * switch to eliminate possibility of failure. (Maybe just crank
+ * priority down...)
*/
-void synchronize_rcu(void)
+void
+synchronize_sched(void)
{
- struct rcu_synchronize rcu;
+ cpumask_t oldmask;
+ int cpu;

- init_completion(&rcu.completion);
- /* Will wake me after RCU finished */
- call_rcu(&rcu.head, wakeme_after_rcu);
+ if (sched_getaffinity(0, &oldmask) < 0) {
+ oldmask = cpu_possible_map;
+ }
+ for_each_cpu(cpu) {
+ sched_setaffinity(0, cpumask_of_cpu(cpu));
+ schedule();
+ }
+ sched_setaffinity(0, oldmask);
+}

- /* Wait for it */
- wait_for_completion(&rcu.completion);
+int
+rcu_pending(int cpu)
+{
+ return (rcu_data.donelist != NULL ||
+ rcu_data.waitlist != NULL ||
+ rcu_data.nextlist != NULL);
+}
+
+void __init rcu_init(void)
+{
+ init_rcurefs();
+/*&&&&*/printk("WARNING: experimental RCU implementation.\n");
+ rcu_data.lock = RAW_SPIN_LOCK_UNLOCKED;
+ rcu_data.completed = 0;
+ rcu_data.nextlist = NULL;
+ rcu_data.nexttail = &rcu_data.nextlist;
+ rcu_data.waitlist = NULL;
+ rcu_data.waittail = &rcu_data.waitlist;
+ rcu_data.donelist = NULL;
+ rcu_data.donetail = &rcu_data.donelist;
+ tasklet_init(&rcu_data.rcu_tasklet, rcu_process_callbacks, 0UL);
}

/*
@@ -558,9 +931,79 @@ void synchronize_kernel(void)
synchronize_rcu();
}

-module_param(maxbatch, int, 0);
+#ifdef CONFIG_RCU_STATS
+int rcu_read_proc_data(char *page)
+{
+ return sprintf(page,
+ "ggp=%ld lgp=%ld rcc=%ld\n"
+ "na=%ld nl=%ld wa=%ld wl=%ld da=%ld dl=%ld dr=%ld di=%d\n"
+ "rtf1=%d rtf2=%ld rtf3=%ld rtfe1=%d rtfe2=%ld rtfe3=%ld\n",
+
+ rcu_ctrlblk.completed,
+ rcu_data.completed,
+ rcu_data.n_rcu_check_callbacks,
+
+ rcu_data.n_next_add,
+ rcu_data.n_next_length,
+ rcu_data.n_wait_add,
+ rcu_data.n_wait_length,
+ rcu_data.n_done_add,
+ rcu_data.n_done_length,
+ rcu_data.n_done_remove,
+ atomic_read(&rcu_data.n_done_invoked),
+
+ atomic_read(&rcu_data.n_rcu_try_flip1),
+ rcu_data.n_rcu_try_flip2,
+ rcu_data.n_rcu_try_flip3,
+ atomic_read(&rcu_data.n_rcu_try_flip_e1),
+ rcu_data.n_rcu_try_flip_e2,
+ rcu_data.n_rcu_try_flip_e3);
+}
+
+int rcu_read_proc_gp_data(char *page)
+{
+ long oldgp = rcu_ctrlblk.completed;
+
+ synchronize_rcu();
+ return sprintf(page, "oldggp=%ld newggp=%ld\n",
+ oldgp, rcu_ctrlblk.completed);
+}
+
+int rcu_read_proc_ptrs_data(char *page)
+{
+ return sprintf(page,
+ "nl=%p/%p nt=%p\n wl=%p/%p wt=%p dl=%p/%p dt=%p\n",
+ &rcu_data.nextlist, rcu_data.nextlist, rcu_data.nexttail,
+ &rcu_data.waitlist, rcu_data.waitlist, rcu_data.waittail,
+ &rcu_data.donelist, rcu_data.donelist, rcu_data.donetail
+ );
+}
+
+int rcu_read_proc_ctrs_data(char *page)
+{
+ int cnt = 0;
+ int cpu;
+ int f = rcu_data.completed & 0x1;
+
+ cnt += sprintf(&page[cnt], "CPU last cur\n");
+ for_each_cpu(cpu) {
+ cnt += sprintf(&page[cnt], "%3d %4d %3d\n",
+ cpu,
+ atomic_read(&per_cpu(rcu_flipctr, cpu)[!f]),
+ atomic_read(&per_cpu(rcu_flipctr, cpu)[f]));
+ }
+ cnt += sprintf(&page[cnt], "ggp = %ld\n", rcu_data.completed);
+ return (cnt);
+}
+
+#endif /* #ifdef CONFIG_RCU_STATS */
+
+EXPORT_SYMBOL(call_rcu); /* WARNING: GPL-only in April 2006. */
EXPORT_SYMBOL_GPL(rcu_batches_completed);
-EXPORT_SYMBOL(call_rcu); /* WARNING: GPL-only in April 2006. */
-EXPORT_SYMBOL(call_rcu_bh); /* WARNING: GPL-only in April 2006. */
EXPORT_SYMBOL_GPL(synchronize_rcu);
-EXPORT_SYMBOL(synchronize_kernel); /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL_GPL(synchronize_sched);
+EXPORT_SYMBOL(rcu_read_lock); /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL(rcu_read_unlock); /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL(synchronize_kernel); /* WARNING: Removal in April 2006. */
+
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */
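
For reference, the read side that this implementation protects is written
exactly as with classic RCU. A minimal illustrative example (not part of
the patch; "struct foo", global_foo and foo_reclaim() are made-up names):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;
};

static struct foo *global_foo;

static void foo_reclaim(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static int foo_read(void)
{
	int val;

	rcu_read_lock();	/* bumps this CPU's rcu_flipctr[] entry */
	val = rcu_dereference(global_foo)->data;
	rcu_read_unlock();	/* drops whatever counter(s) were bumped */
	return val;
}

static void foo_update(struct foo *newp)
{
	struct foo *old = global_foo;

	rcu_assign_pointer(global_foo, newp);
	if (old != NULL)
		call_rcu(&old->rcu, foo_reclaim);
	/* old is freed only after a full pair of counter flips,
	 * i.e. after a grace period has elapsed. */
}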

2006-01-24 07:58:06

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > a few hours. This is a regression from 2.6.15.
> >
> > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > a single batch.
>
> could you try the PREEMPT_RCU patch below?
>


Sure. If it works do you see this making it in 2.6.16? Otherwise we
still would have a regression...

Lee


2006-01-24 08:01:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1


* Lee Revell <[email protected]> wrote:

> On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > * Lee Revell <[email protected]> wrote:
> >
> > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > a few hours. This is a regression from 2.6.15.
> > >
> > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > a single batch.
> >
> > could you try the PREEMPT_RCU patch below?
>
> Sure. If it works do you see this making it in 2.6.16? Otherwise we
> still would have a regression...

nope, that likely won't make v2.6.16, which is frozen already.

Ingo

2006-01-24 08:03:56

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, 2006-01-24 at 09:01 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > > * Lee Revell <[email protected]> wrote:
> > >
> > > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > > a few hours. This is a regression from 2.6.15.
> > > >
> > > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > > a single batch.
> > >
> > > could you try the PREEMPT_RCU patch below?
> >
> > Sure. If it works do you see this making it in 2.6.16? Otherwise we
> > still would have a regression...
>
> nope, that likely wont make v2.6.16, which is frozen already.
>

Well, for the last latency regression I found, I was told "I wish you'd
caught this at 2.6.15-rc1, something could have been done". Now I've
found another one at the -rc1 stage, and there's still nothing that can
be done?

Lee

2006-01-24 08:08:03

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, 2006-01-24 at 09:01 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > > * Lee Revell <[email protected]> wrote:
> > >
> > > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > > a few hours. This is a regression from 2.6.15.
> > > >
> > > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > > a single batch.
> > >
> > > could you try the PREEMPT_RCU patch below?
> >
> > Sure. If it works do you see this making it in 2.6.16? Otherwise we
> > still would have a regression...
>
> nope, that likely wont make v2.6.16, which is frozen already.
>

How about just lowering maxbatch to 1000?

Lee

2006-01-24 08:10:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

* Lee Revell <[email protected]> wrote:

> On Tue, 2006-01-24 at 09:01 +0100, Ingo Molnar wrote:
> > * Lee Revell <[email protected]> wrote:
> >
> > > On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > > > * Lee Revell <[email protected]> wrote:
> > > >
> > > > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > > > a few hours. This is a regression from 2.6.15.
> > > > >
> > > > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > > > a single batch.
> > > >
> > > > could you try the PREEMPT_RCU patch below?
> > >
> > > Sure. If it works do you see this making it in 2.6.16? Otherwise we
> > > still would have a regression...
> >
> > nope, that likely wont make v2.6.16, which is frozen already.
> >
>
> Well, the last latency regression I found, I was told "I wish you'd
> caught this at 2.6.15-rc1, something could have been done". Now I've
> found another one at the -rc1 stage, and there's still nothing that
> can be done?

i did not say that - it should clearly be fixed for 2.6.16. But
nevertheless we should do PREEMPT_RCU in v2.6.17, because RCU related
latencies are hard to fix in general.

Ingo

2006-01-24 08:12:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1


* Lee Revell <[email protected]> wrote:

> On Tue, 2006-01-24 at 09:01 +0100, Ingo Molnar wrote:
> > * Lee Revell <[email protected]> wrote:
> >
> > > On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > > > * Lee Revell <[email protected]> wrote:
> > > >
> > > > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > > > a few hours. This is a regression from 2.6.15.
> > > > >
> > > > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > > > a single batch.
> > > >
> > > > could you try the PREEMPT_RCU patch below?
> > >
> > > Sure. If it works do you see this making it in 2.6.16? Otherwise we
> > > still would have a regression...
> >
> > nope, that likely wont make v2.6.16, which is frozen already.
> >
>
> How about just lowering maxbatch to 1000?

does that fix the latency for you? I think "maxbatch=1000" should work
as a boot parameter too.
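
(for the built-in kernel/rcupdate.c the boot-line form would presumably
need the object-name prefix, i.e. something like

	rcupdate.maxbatch=1000

on the kernel command line - untested, though.)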

Ingo

2006-01-24 08:15:30

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, 2006-01-24 at 09:13 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > On Tue, 2006-01-24 at 09:01 +0100, Ingo Molnar wrote:
> > > * Lee Revell <[email protected]> wrote:
> > >
> > > > On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > > > > * Lee Revell <[email protected]> wrote:
> > > > >
> > > > > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > > > > a few hours. This is a regression from 2.6.15.
> > > > > >
> > > > > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > > > > a single batch.
> > > > >
> > > > > could you try the PREEMPT_RCU patch below?
> > > >
> > > > Sure. If it works do you see this making it in 2.6.16? Otherwise we
> > > > still would have a regression...
> > >
> > > nope, that likely wont make v2.6.16, which is frozen already.
> > >
> >
> > How about just lowering maxbatch to 1000?
>
> does that fix the latency for you? I think "maxbatch=1000" should work
> as a boot parameter too.
>

Have not tested yet but it appears that will reduce it substantially:

$ grep "dst_destroy (dst_rcu_free)" /proc/latency_trace | wc -l
3027

This implies the latency would be reduced to ~4ms (roughly 1000/3027 of
the 13.8ms trace), still not great but it will be overshadowed by
rt_run_flush/rt_garbage_collect.

Lee

2006-01-24 09:17:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, Jan 24, 2006 at 03:15:27AM -0500, Lee Revell wrote:
> On Tue, 2006-01-24 at 09:13 +0100, Ingo Molnar wrote:
> > * Lee Revell <[email protected]> wrote:
> >
> > > On Tue, 2006-01-24 at 09:01 +0100, Ingo Molnar wrote:
> > > > * Lee Revell <[email protected]> wrote:
> > > >
> > > > > On Tue, 2006-01-24 at 08:56 +0100, Ingo Molnar wrote:
> > > > > > * Lee Revell <[email protected]> wrote:
> > > > > >
> > > > > > > I ported the latency tracer to 2.6.16 and got this 13ms latency within
> > > > > > > a few hours. This is a regression from 2.6.15.
> > > > > > >
> > > > > > > It appears that RCU can invoke ipv4_dst_destroy thousands of times in
> > > > > > > a single batch.
> > > > > >
> > > > > > could you try the PREEMPT_RCU patch below?
> > > > >
> > > > > Sure. If it works do you see this making it in 2.6.16? Otherwise we
> > > > > still would have a regression...
> > > >
> > > > nope, that likely wont make v2.6.16, which is frozen already.
> > > >
> > >
> > > How about just lowering maxbatch to 1000?
> >
> > does that fix the latency for you? I think "maxbatch=1000" should work
> > as a boot parameter too.
> >
>
> Have not tested yet but it appears that will reduce it substantially:
>
> $ grep "dst_destroy (dst_rcu_free)" /proc/latency_trace | wc -l
> 3027
>
> This implies the latency would be reduced to ~4ms, still not great but
> it will be overshadowed by rt_run_flush/rt_garbage_collect.

The other patch to try would be Dipankar Sarma's patch at:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113657112726596&w=2

This patch was primarily designed to reduce memory overhead, but given
that it tends to reduce batch size, it should also reduce latency.

Thanx, Paul

2006-01-24 09:23:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1


* Paul E. McKenney <[email protected]> wrote:

> > Have not tested yet but it appears that will reduce it substantially:
> >
> > $ grep "dst_destroy (dst_rcu_free)" /proc/latency_trace | wc -l
> > 3027
> >
> > This implies the latency would be reduced to ~4ms, still not great but
> > it will be overshadowed by rt_run_flush/rt_garbage_collect.
>
> The other patch to try would be Dipankar Sarma's patch at:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113657112726596&w=2
>
> This patch was primarily designed to reduce memory overhead, but given
> that it tends to reduce batch size, it should also reduce latency.

if this solves Lee's problem, i think we should apply this as a fix, and
get it into v2.6.16. The patch looks straightforward and correct to me.

Ingo

2006-01-24 09:44:20

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, 2006-01-24 at 10:23 +0100, Ingo Molnar wrote:
> * Paul E. McKenney <[email protected]> wrote:
>
> > > Have not tested yet but it appears that will reduce it substantially:
> > >
> > > $ grep "dst_destroy (dst_rcu_free)" /proc/latency_trace | wc -l
> > > 3027
> > >
> > > This implies the latency would be reduced to ~4ms, still not great but
> > > it will be overshadowed by rt_run_flush/rt_garbage_collect.
> >
> > The other patch to try would be Dipankar Sarma's patch at:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=113657112726596&w=2
> >
> > This patch was primarily designed to reduce memory overhead, but given
> > that it tends to reduce batch size, it should also reduce latency.
>
> if this solves Lee's problem, i think we should apply this as a fix, and
> get it into v2.6.16. The patch looks straightforward and correct to me.
>

Does not compile:

CC kernel/rcupdate.o
kernel/rcupdate.c:76: warning: 'struct rcu_state' declared inside parameter list
kernel/rcupdate.c:76: warning: its scope is only this definition or declaration, which is probably not what you want
kernel/rcupdate.c: In function 'call_rcu':
kernel/rcupdate.c:113: error: 'rcu_state' undeclared (first use in this function)
kernel/rcupdate.c:113: error: (Each undeclared identifier is reported only once
kernel/rcupdate.c:113: error: for each function it appears in.)
kernel/rcupdate.c: In function 'call_rcu_bh':
kernel/rcupdate.c:155: error: 'rcu_bh_state' undeclared (first use in this function)
make[1]: *** [kernel/rcupdate.o] Error 1
make: *** [kernel] Error 2

Lee

2006-01-24 16:29:18

by Dipankar Sarma

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, Jan 24, 2006 at 04:44:15AM -0500, Lee Revell wrote:
> On Tue, 2006-01-24 at 10:23 +0100, Ingo Molnar wrote:
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > The other patch to try would be Dipankar Sarma's patch at:
> > >
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=113657112726596&w=2
> > >
> > > This patch was primarily designed to reduce memory overhead, but given
> > > that it tends to reduce batch size, it should also reduce latency.
> >
> > if this solves Lee's problem, i think we should apply this as a fix, and
> > get it into v2.6.16. The patch looks straightforward and correct to me.
> >
>
> Does not compile:
>
> CC kernel/rcupdate.o
> kernel/rcupdate.c:76: warning: 'struct rcu_state' declared inside parameter list

My original patch was against a much older kernel.
I will send out a more up-to-date patch as soon as I am done with some
testing.

Thanks
Dipankar

2006-01-24 16:57:52

by Dipankar Sarma

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, Jan 24, 2006 at 02:52:18AM -0500, Lee Revell wrote:
> I ported the latency tracer to 2.6.16 and got this 13ms latency within a
> few hours. This is a regression from 2.6.15.
>
> It appears that RCU can invoke ipv4_dst_destroy thousands of times in a
> single batch.
>
> <idle>-0 0d.s. 143us : tasklet_action (__do_softirq)
> <idle>-0 0d.s. 143us : rcu_process_callbacks (tasklet_action)
> <idle>-0 0d.s. 144us : __rcu_process_callbacks (rcu_process_callbacks)
> <idle>-0 0d.s. 145us : __rcu_process_callbacks (rcu_process_callbacks)
> <idle>-0 0d.s. 146us : dst_rcu_free (__rcu_process_callbacks)
> <idle>-0 0d.s. 147us : dst_destroy (dst_rcu_free)
> <idle>-0 0d.s. 148us : ipv4_dst_destroy (dst_destroy)
> <idle>-0 0d.s. 149us : kmem_cache_free (dst_destroy)
>
> [ etc - zillions of dst_rcu_free()s deleted ]

Are these predominantly coming from rt_run_flush()? I had an old
patch that frees one hash chain per RCU callback. However, the
cost of each RCU callback goes up here, so I am not sure whether
that helps. At least with shorter RCU callbacks, you get a chance
for preemption after the tasklet softirq has happened a few
times. Can you describe what you are running on your system?
My old rt-flush-list patch is included below for reference.

Thanks
Dipankar



Reduce the number of RCU callbacks by flushing one hash chain
at a time. This is intended to reduce softirq overhead during
frequent flushing.


net/ipv4/route.c | 19 +++++++++++++------
1 files changed, 13 insertions(+), 6 deletions(-)

diff -puN net/ipv4/route.c~rcu-rt-flush-list net/ipv4/route.c
--- linux-2.6.0-test2-rcu/net/ipv4/route.c~rcu-rt-flush-list 2003-08-13 21:46:35.000000000 +0530
+++ linux-2.6.0-test2-rcu-dipankar/net/ipv4/route.c 2003-08-27 14:17:54.000000000 +0530
@@ -521,13 +521,22 @@ static void rt_check_expire(unsigned lon
mod_timer(&rt_periodic_timer, now + ip_rt_gc_interval);
}

+static void rt_list_free(struct rtable *rth)
+{
+ struct rtable *next;
+ for (; rth; rth = next) {
+ next = rth->u.rt_next;
+ dst_free(&rth->u.dst);
+ }
+}
+
/* This can run from both BH and non-BH contexts, the latter
* in the case of a forced flush event.
*/
static void rt_run_flush(unsigned long dummy)
{
int i;
- struct rtable *rth, *next;
+ struct rtable *rth;

rt_deadline = 0;

@@ -539,11 +548,9 @@ static void rt_run_flush(unsigned long d
if (rth)
rt_hash_table[i].chain = NULL;
spin_unlock_bh(&rt_hash_table[i].lock);
-
- for (; rth; rth = next) {
- next = rth->u.rt_next;
- rt_free(rth);
- }
+ if (rth)
+ call_rcu_bh(&rth->u.dst.rcu_head,
+ (void (*)(void *))rt_list_free, rth);
}
}


_

2006-01-24 21:38:32

by Dipankar Sarma

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Tue, Jan 24, 2006 at 09:58:46PM +0530, Dipankar Sarma wrote:
> On Tue, Jan 24, 2006 at 04:44:15AM -0500, Lee Revell wrote:
> > On Tue, 2006-01-24 at 10:23 +0100, Ingo Molnar wrote:
> > > * Paul E. McKenney <[email protected]> wrote:
> > >
> > > > This patch was primarily designed to reduce memory overhead, but given
> > > > that it tends to reduce batch size, it should also reduce latency.
> > >
> > > if this solves Lee's problem, i think we should apply this as a fix, and
> > > get it into v2.6.16. The patch looks straightforward and correct to me.
> > >
> >
> > Does not compile:
> >
> > CC kernel/rcupdate.o
> > kernel/rcupdate.c:76: warning: 'struct rcu_state' declared inside parameter list
>
> My original patch was against a much older kernel.
> I will send out a more uptodate patch as soon as I am done with some
> testing.

Here is an updated version of that patch against 2.6.16-rc1. I have
sanity-tested it on ppc64 and x86_64 using dbench and kernbench.
I have also tested it for OOM situations - an open()/close() loop on
my x86_64 that earlier used to hit the file limit when the batch limit
was set to 10 - and found no problem. This patch does set the default
RCU batch limit to 10 and changes it only when there is an RCU flood.
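
With the module_param() declarations in the patch, the new defaults should
also be overridable on the kernel command line; for the built-in
kernel/rcupdate.c that would presumably (untested) be the module-prefixed
form:

	rcupdate.blimit=N rcupdate.qhimark=N rcupdate.qlowmark=N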

Any latency measurement results would be greatly appreciated.

Thanks
Dipankar


This patch adds new tunables for the RCU queue and finished batches.
There are two types of controls - the number of completed RCU updates
invoked in a batch (blimit) and monitoring for a high rate of
incoming RCUs on a cpu (qhimark, qlowmark). By default,
the per-cpu batch limit is set to a small value. If
the incoming RCU rate exceeds the high watermark, we do two things -
force a quiescent state on all cpus and raise the batch limit
of the CPU to INT_MAX. An effectively unlimited batch forces all
finished RCUs to be processed in one shot. Once the incoming
queued RCUs fall below the low watermark, the batch limit
is set back to the default.

Signed-off-by: Dipankar Sarma <[email protected]>


include/linux/rcupdate.h | 6 +++
kernel/rcupdate.c | 73 +++++++++++++++++++++++++++++++++++------------
2 files changed, 60 insertions(+), 19 deletions(-)

diff -puN include/linux/rcupdate.h~rcu-batch-tuning include/linux/rcupdate.h
--- linux-2.6.16-rc1-rcu/include/linux/rcupdate.h~rcu-batch-tuning 2006-01-25 00:09:54.000000000 +0530
+++ linux-2.6.16-rc1-rcu-dipankar/include/linux/rcupdate.h 2006-01-25 01:07:39.000000000 +0530
@@ -98,13 +98,17 @@ struct rcu_data {
long batch; /* Batch # for current RCU batch */
struct rcu_head *nxtlist;
struct rcu_head **nxttail;
- long count; /* # of queued items */
+ long qlen; /* # of queued callbacks */
struct rcu_head *curlist;
struct rcu_head **curtail;
struct rcu_head *donelist;
struct rcu_head **donetail;
+ long blimit; /* Upper limit on a processed batch */
int cpu;
struct rcu_head barrier;
+#ifdef CONFIG_SMP
+ long last_rs_qlen; /* qlen during the last resched */
+#endif
};

DECLARE_PER_CPU(struct rcu_data, rcu_data);
diff -puN kernel/rcupdate.c~rcu-batch-tuning kernel/rcupdate.c
--- linux-2.6.16-rc1-rcu/kernel/rcupdate.c~rcu-batch-tuning 2006-01-25 00:09:54.000000000 +0530
+++ linux-2.6.16-rc1-rcu-dipankar/kernel/rcupdate.c 2006-01-25 01:20:55.000000000 +0530
@@ -67,7 +67,36 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_d

/* Fake initialization required by compiler */
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int blimit = 10;
+static int qhimark = 10000;
+static int qlowmark = 100;
+#ifdef CONFIG_SMP
+static int rsinterval = 1000;
+#endif
+
+static atomic_t rcu_barrier_cpu_count;
+static struct semaphore rcu_barrier_sema;
+static struct completion rcu_barrier_completion;
+
+#ifdef CONFIG_SMP
+static inline void force_quiescent_state(struct rcu_data *rdp,
+ struct rcu_ctrlblk *rcp)
+{
+ int cpu;
+ set_need_resched();
+ if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
+ rdp->last_rs_qlen = rdp->qlen;
+ for_each_cpu_mask(cpu, rcp->cpumask)
+ smp_send_reschedule(cpu);
+ }
+}
+#else
+static inline void force_quiescent_state(struct rcu_data *rdp,
+ struct rcu_ctrlblk *rcp)
+{
+ set_need_resched();
+}
+#endif

/**
* call_rcu - Queue an RCU callback for invocation after a grace period.
@@ -92,17 +121,13 @@ void fastcall call_rcu(struct rcu_head *
rdp = &__get_cpu_var(rcu_data);
*rdp->nxttail = head;
rdp->nxttail = &head->next;
-
- if (unlikely(++rdp->count > 10000))
- set_need_resched();
-
+ if (unlikely(++rdp->qlen > qhimark)) {
+ rdp->blimit = INT_MAX;
+ force_quiescent_state(rdp, &rcu_ctrlblk);
+ }
local_irq_restore(flags);
}

-static atomic_t rcu_barrier_cpu_count;
-static struct semaphore rcu_barrier_sema;
-static struct completion rcu_barrier_completion;
-
/**
* call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
* @head: structure to be used for queueing the RCU updates.
@@ -131,12 +156,12 @@ void fastcall call_rcu_bh(struct rcu_hea
rdp = &__get_cpu_var(rcu_bh_data);
*rdp->nxttail = head;
rdp->nxttail = &head->next;
- rdp->count++;
-/*
- * Should we directly call rcu_do_batch() here ?
- * if (unlikely(rdp->count > 10000))
- * rcu_do_batch(rdp);
- */
+
+ if (unlikely(++rdp->qlen > qhimark)) {
+ rdp->blimit = INT_MAX;
+ force_quiescent_state(rdp, &rcu_bh_ctrlblk);
+ }
+
local_irq_restore(flags);
}

@@ -199,10 +224,12 @@ static void rcu_do_batch(struct rcu_data
next = rdp->donelist = list->next;
list->func(list);
list = next;
- rdp->count--;
- if (++count >= maxbatch)
+ rdp->qlen--;
+ if (++count >= rdp->blimit)
break;
}
+ if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
+ rdp->blimit = blimit;
if (!rdp->donelist)
rdp->donetail = &rdp->donelist;
else
@@ -473,6 +500,11 @@ static void rcu_init_percpu_data(int cpu
rdp->quiescbatch = rcp->completed;
rdp->qs_pending = 0;
rdp->cpu = cpu;
+ rdp->qlen = 0;
+#ifdef CONFIG_SMP
+ rdp->last_rs_qlen = 0;
+#endif
+ rdp->blimit = blimit;
}

static void __devinit rcu_online_cpu(int cpu)
@@ -567,7 +599,12 @@ void synchronize_kernel(void)
synchronize_rcu();
}

-module_param(maxbatch, int, 0);
+module_param(blimit, int, 0);
+module_param(qhimark, int, 0);
+module_param(qlowmark, int, 0);
+#ifdef CONFIG_SMP
+module_param(rsinterval, int, 0);
+#endif
EXPORT_SYMBOL_GPL(rcu_batches_completed);
EXPORT_SYMBOL(call_rcu); /* WARNING: GPL-only in April 2006. */
EXPORT_SYMBOL(call_rcu_bh); /* WARNING: GPL-only in April 2006. */

_

2006-01-25 21:28:28

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Wed, 2006-01-25 at 03:08 +0530, Dipankar Sarma wrote:
> On Tue, Jan 24, 2006 at 09:58:46PM +0530, Dipankar Sarma wrote:
> > On Tue, Jan 24, 2006 at 04:44:15AM -0500, Lee Revell wrote:
> > > On Tue, 2006-01-24 at 10:23 +0100, Ingo Molnar wrote:
> > > > * Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > > This patch was primarily designed to reduce memory overhead, but given
> > > > > that it tends to reduce batch size, it should also reduce latency.
> > > >
> > > > if this solves Lee's problem, i think we should apply this as a fix, and
> > > > get it into v2.6.16. The patch looks straightforward and correct to me.
> > > >
> > >
> > > Does not compile:
> > >
> > > CC kernel/rcupdate.o
> > > kernel/rcupdate.c:76: warning: 'struct rcu_state' declared inside parameter list
> >
> > My original patch was against a much older kernel.
> > I will send out a more uptodate patch as soon as I am done with some
> > testing.
>
> Here is an updated version of that patch against 2.6.16-rc1. I have
> sanity-tested it on ppc64 and x86_64 using dbench and kernbench.
> I have also tested this for OOM situations - open()/close() in
> a tight loop in my x86_64 which earlier used to reach file limit
> if I set batch limit to 10 and found no problem. This patch does set
> default RCU batch limit to 10 and changes it only when there is an RCU
> flood.

OK, this seems to work. I can't tell yet whether it helps the latency
I reported, but rt_run_flush still produces terrible latencies.

Ingo, should I try the softirq preemption patch + Dipankar's patch +
latency tracing patch?

preemption latency trace v1.1.5 on 2.6.16-rc1
--------------------------------------------------------------------
latency: 7418 us, #6397/6397, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: gmplayer-8638 (uid:1000 nice:-20 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
Xorg-2154 0d.s2 1us : __trace_start_sched_wakeup (try_to_wake_up)
Xorg-2154 0d.s2 2us : __trace_start_sched_wakeup <<...>-8638> (64 0)
Xorg-2154 0d.s. 3us+: wake_up_process (hrtimer_run_queues)
Xorg-2154 0d.s. 6us : rt_secret_rebuild (run_timer_softirq)
Xorg-2154 0d.s. 7us : rt_cache_flush (rt_secret_rebuild)
Xorg-2154 0d.s1 7us : del_timer (rt_cache_flush)
Xorg-2154 0d.s. 9us : local_bh_enable (rt_cache_flush)
Xorg-2154 0d.s. 10us : rt_run_flush (rt_cache_flush)
Xorg-2154 0d.s. 11us : get_random_bytes (rt_run_flush)
Xorg-2154 0d.s. 12us : extract_entropy (get_random_bytes)
Xorg-2154 0d.s. 13us : xfer_secondary_pool (extract_entropy)
Xorg-2154 0d.s. 15us : extract_entropy (xfer_secondary_pool)
Xorg-2154 0d.s. 16us : xfer_secondary_pool (extract_entropy)
Xorg-2154 0d.s. 17us+: account (extract_entropy)
Xorg-2154 0d.s. 19us : extract_buf (extract_entropy)
Xorg-2154 0d.s. 20us : sha_init (extract_buf)
Xorg-2154 0d.s. 21us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 29us+: __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 32us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 39us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 40us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 47us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 48us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 55us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 56us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 63us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 64us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 71us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 73us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 79us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 81us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 87us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 81us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 87us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 89us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 91us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 98us : __add_entropy_words (xfer_secondary_pool)
Xorg-2154 0d.s. 100us : credit_entropy_store (xfer_secondary_pool)
Xorg-2154 0d.s. 101us : account (extract_entropy)
Xorg-2154 0d.s1 102us : __wake_up (account)
Xorg-2154 0d.s2 103us : __wake_up_common (__wake_up)
Xorg-2154 0d.s. 104us : extract_buf (extract_entropy)
Xorg-2154 0d.s. 105us : sha_init (extract_buf)
Xorg-2154 0d.s. 106us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 113us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 114us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 121us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 122us : __add_entropy_words (extract_buf)
Xorg-2154 0d.s. 124us+: sha_transform (extract_buf)
Xorg-2154 0d.s. 132us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 133us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 134us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 135us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 136us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 137us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 139us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 140us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 141us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 142us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 143us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 145us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 146us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 148us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 149us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 150us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 151us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 152us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 154us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 155us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 156us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 157us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 158us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 159us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 161us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 162us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 163us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 164us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 166us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 167us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 168us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 169us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 170us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 172us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 173us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 174us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 175us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 176us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 178us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 179us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 180us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 181us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 182us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 184us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 185us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 186us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 187us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 188us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 190us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 191us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 192us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 193us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 194us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 195us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 197us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 198us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 199us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 200us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 201us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 202us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 204us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 205us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 206us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 207us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 208us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 210us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 211us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 212us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 213us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 215us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 216us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 217us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 218us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 219us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 220us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 222us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 223us : call_rcu_bh (rt_run_flush)

[ zillions of these deleted ]

Xorg-2154 0d.s. 7335us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7336us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7337us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7339us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7340us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7341us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7342us : call_rcu_bh (rt_run_flush)
Xorg-2154 0d.s. 7343us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7344us : local_bh_enable (rt_run_flush)
Xorg-2154 0d.s. 7346us : mod_timer (rt_secret_rebuild)
Xorg-2154 0d.s. 7346us : __mod_timer (mod_timer)
Xorg-2154 0d.s. 7347us : lock_timer_base (__mod_timer)
Xorg-2154 0d.s1 7348us+: internal_add_timer (__mod_timer)
Xorg-2154 0d.s. 7350us : run_timer_softirq (__do_softirq)
Xorg-2154 0d.s. 7351us : hrtimer_run_queues (run_timer_softirq)
Xorg-2154 0d.s. 7352us : ktime_get_real (hrtimer_run_queues)
Xorg-2154 0d.s. 7352us : getnstimeofday (ktime_get_real)
Xorg-2154 0d.s. 7353us : do_gettimeofday (getnstimeofday)
Xorg-2154 0d.s. 7353us : get_offset_tsc (do_gettimeofday)
Xorg-2154 0d.s. 7355us : ktime_get (hrtimer_run_queues)
Xorg-2154 0d.s. 7355us : ktime_get_ts (ktime_get)
Xorg-2154 0d.s. 7356us : getnstimeofday (ktime_get_ts)
Xorg-2154 0d.s. 7357us : do_gettimeofday (getnstimeofday)
Xorg-2154 0d.s. 7357us : get_offset_tsc (do_gettimeofday)
Xorg-2154 0d.s. 7358us : set_normalized_timespec (ktime_get_ts)
Xorg-2154 0d.s1 7359us : __remove_hrtimer (hrtimer_run_queues)
Xorg-2154 0d.s1 7360us : rb_next (__remove_hrtimer)
Xorg-2154 0d.s1 7361us : rb_erase (__remove_hrtimer)
Xorg-2154 0d.s. 7362us : it_real_fn (hrtimer_run_queues)
Xorg-2154 0d.s. 7363us : send_group_sig_info (it_real_fn)
Xorg-2154 0d.s1 7364us : group_send_sig_info (send_group_sig_info)
Xorg-2154 0d.s1 7365us : check_kill_permission (group_send_sig_info)
Xorg-2154 0d.s1 7367us : dummy_task_kill (check_kill_permission)
Xorg-2154 0d.s2 7369us : __group_send_sig_info (group_send_sig_info)
Xorg-2154 0d.s2 7369us : handle_stop_signal (__group_send_sig_info)
Xorg-2154 0d.s2 7371us : sig_ignored (__group_send_sig_info)
Xorg-2154 0d.s2 7372us : send_signal (__group_send_sig_info)
Xorg-2154 0d.s2 7373us : __sigqueue_alloc (send_signal)
Xorg-2154 0d.s2 7374us+: kmem_cache_alloc (__sigqueue_alloc)
Xorg-2154 0d.s2 7377us : __group_complete_signal (__group_send_sig_info)
Xorg-2154 0d.s2 7378us : task_curr (__group_complete_signal)
Xorg-2154 0d.s2 7379us : signal_wake_up (__group_complete_signal)
Xorg-2154 0d.s2 7380us : wake_up_state (signal_wake_up)
Xorg-2154 0d.s2 7380us : try_to_wake_up (wake_up_state)
Xorg-2154 0d.s2 7381us : wake_up_state (signal_wake_up)
Xorg-2154 0d.s. 7382us : hrtimer_forward (it_real_fn)
Xorg-2154 0d.s. 7383us : ktime_get (hrtimer_forward)
Xorg-2154 0d.s. 7383us : ktime_get_ts (ktime_get)
Xorg-2154 0d.s. 7384us : getnstimeofday (ktime_get_ts)
Xorg-2154 0d.s. 7385us : do_gettimeofday (getnstimeofday)
Xorg-2154 0d.s. 7385us : get_offset_tsc (do_gettimeofday)
Xorg-2154 0d.s. 7386us : set_normalized_timespec (ktime_get_ts)
Xorg-2154 0d.s1 7388us : enqueue_hrtimer (hrtimer_run_queues)
Xorg-2154 0d.s1 7389us : rb_insert_color (enqueue_hrtimer)
Xorg-2154 0d.s. 7390us : tasklet_action (__do_softirq)
Xorg-2154 0d.s. 7391us : rcu_process_callbacks (tasklet_action)
Xorg-2154 0d.s. 7392us : __rcu_process_callbacks (rcu_process_callbacks)
Xorg-2154 0d.s. 7393us : __rcu_process_callbacks (rcu_process_callbacks)
Xorg-2154 0d.s1 7394us+: rcu_start_batch (__rcu_process_callbacks)
Xorg-2154 0dn.. 7397us : schedule (work_resched)
Xorg-2154 0dn.. 7397us : stop_trace (schedule)
Xorg-2154 0dn.. 7398us : profile_hit (schedule)
Xorg-2154 0dn.1 7399us+: sched_clock (schedule)
Xorg-2154 0dn.2 7402us : recalc_task_prio (schedule)
Xorg-2154 0dn.2 7403us : effective_prio (recalc_task_prio)
Xorg-2154 0dn.2 7404us+: requeue_task (schedule)
<...>-8638 0d..2 7410us+: __switch_to (schedule)
<...>-8638 0d..2 7414us : schedule <Xorg-2154> (74 64)
<...>-8638 0d..1 7414us : trace_stop_sched_switched (schedule)
<...>-8638 0d..2 7415us : trace_stop_sched_switched <<...>-8638> (64 0)
<...>-8638 0d..2 7417us : schedule (schedule)


2006-01-25 22:56:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1


* Lee Revell <[email protected]> wrote:

> > Here is an updated version of that patch against 2.6.16-rc1. I have
> > sanity-tested it on ppc64 and x86_64 using dbench and kernbench.
> > I have also tested this for OOM situations - open()/close() in
> > a tight loop in my x86_64 which earlier used to reach file limit
> > if I set batch limit to 10 and found no problem. This patch does set
> > default RCU batch limit to 10 and changes it only when there is an RCU
> > flood.
>
> OK this seems to work, I can't tell yet whether it help the latency I
> reported, but rt_run_flush still produces terrible latencies.
>
> Ingo, should I try the softirq preemption patch + Dipankar's patch +
> latency tracing patch?

yes, that would be a nice test. (I'm busy now with mutex stuff to be
able to do a working softirq-preemption patch, but i sent you my current
patches off-list - if you want to give it a shot. Be warned though,
there will likely be quite some merging work to do, so it's definitely
not for the faint-hearted.)

Ingo

2006-01-25 23:14:00

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Wed, 2006-01-25 at 23:56 +0100, Ingo Molnar wrote:
>
> yes, that would be a nice test. (I'm busy now with mutex stuff to be
> able to do a working softirq-preemption patch, but i sent you my
> current patches off-list - if you want to give it a shot. Be warned
> though, there will likely be quite some merging work to do, so it's
> definitely not for the faint hearted.)
>

OK, I probably won't have time to test it this week either.

In the meantime can anyone explain briefly why such a heavy fix is
needed? It seems like it would be simpler to make the route cache
flushing operate in batches of 100 routes, rather than invalidating the
whole thing in one shot. This does seem to be the only softirq that
regularly runs for much more than 1ms.

Would this require major surgery on the networking subsystem?
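
Just to make that concrete, the kind of thing I have in mind is roughly
the sketch below - completely untested and purely illustrative.
RT_FLUSH_CHUNK and rt_flush_next are invented names (and the chunking is
per hash bucket here rather than exactly 100 routes); rt_hash_mask,
rt_hash_table, rt_hash_lock_addr(), rt_free() and rt_flush_timer are the
existing net/ipv4/route.c symbols, and the resume index would need
protection against a new flush starting while a chunked one is still in
progress:

#define RT_FLUSH_CHUNK	100		/* hash buckets emptied per timer run */

static unsigned int rt_flush_next;	/* where the previous run left off */

static void rt_run_flush(unsigned long dummy)
{
	int n;
	struct rtable *rth, *next;

	for (n = 0; n < RT_FLUSH_CHUNK && rt_flush_next <= rt_hash_mask;
	     n++, rt_flush_next++) {
		spin_lock_bh(rt_hash_lock_addr(rt_flush_next));
		rth = rt_hash_table[rt_flush_next].chain;
		if (rth)
			rt_hash_table[rt_flush_next].chain = NULL;
		spin_unlock_bh(rt_hash_lock_addr(rt_flush_next));

		for (; rth; rth = next) {
			next = rth->u.rt_next;
			rt_free(rth);
		}
	}

	if (rt_flush_next <= rt_hash_mask)
		mod_timer(&rt_flush_timer, jiffies + 1); /* finish the rest later */
	else
		rt_flush_next = 0;			 /* whole table flushed */
}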

Lee

2006-01-26 19:18:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Wed, Jan 25, 2006 at 04:28:25PM -0500, Lee Revell wrote:
> On Wed, 2006-01-25 at 03:08 +0530, Dipankar Sarma wrote:
> > On Tue, Jan 24, 2006 at 09:58:46PM +0530, Dipankar Sarma wrote:
> > > On Tue, Jan 24, 2006 at 04:44:15AM -0500, Lee Revell wrote:
> > > > On Tue, 2006-01-24 at 10:23 +0100, Ingo Molnar wrote:
> > > > > * Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > > This patch was primarily designed to reduce memory overhead, but given
> > > > > > that it tends to reduce batch size, it should also reduce latency.
> > > > >
> > > > > if this solves Lee's problem, i think we should apply this as a fix, and
> > > > > get it into v2.6.16. The patch looks straightforward and correct to me.
> > > > >
> > > >
> > > > Does not compile:
> > > >
> > > > CC kernel/rcupdate.o
> > > > kernel/rcupdate.c:76: warning: 'struct rcu_state' declared inside parameter list
> > >
> > > My original patch was against a much older kernel.
> > > I will send out a more uptodate patch as soon as I am done with some
> > > testing.
> >
> > Here is an updated version of that patch against 2.6.16-rc1. I have
> > sanity-tested it on ppc64 and x86_64 using dbench and kernbench.
> > I have also tested this for OOM situations - open()/close() in
> > a tight loop in my x86_64 which earlier used to reach file limit
> > if I set batch limit to 10 and found no problem. This patch does set
> > default RCU batch limit to 10 and changes it only when there is an RCU
> > flood.
>
> OK this seems to work, I can't tell yet whether it help the latency I
> reported, but rt_run_flush still produces terrible latencies.
>
> Ingo, should I try the softirq preemption patch + Dipankar's patch +
> latency tracing patch?
>
> preemption latency trace v1.1.5 on 2.6.16-rc1
> --------------------------------------------------------------------
> latency: 7418 us, #6397/6397, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
> -----------------
> | task: gmplayer-8638 (uid:1000 nice:-20 policy:0 rt_prio:0)
> -----------------
>
> _------=> CPU#
> / _-----=> irqs-off
> | / _----=> need-resched
> || / _---=> hardirq/softirq
> ||| / _--=> preempt-depth
> |||| /
> ||||| delay
> cmd pid ||||| time | caller
> \ / ||||| \ | /
> Xorg-2154 0d.s2 1us : __trace_start_sched_wakeup (try_to_wake_up)
> Xorg-2154 0d.s2 2us : __trace_start_sched_wakeup <<...>-8638> (64 0)
> Xorg-2154 0d.s. 3us+: wake_up_process (hrtimer_run_queues)
> Xorg-2154 0d.s. 6us : rt_secret_rebuild (run_timer_softirq)
> Xorg-2154 0d.s. 7us : rt_cache_flush (rt_secret_rebuild)
> Xorg-2154 0d.s1 7us : del_timer (rt_cache_flush)
> Xorg-2154 0d.s. 9us : local_bh_enable (rt_cache_flush)
> Xorg-2154 0d.s. 10us : rt_run_flush (rt_cache_flush)
> Xorg-2154 0d.s. 11us : get_random_bytes (rt_run_flush)
> Xorg-2154 0d.s. 12us : extract_entropy (get_random_bytes)
> Xorg-2154 0d.s. 13us : xfer_secondary_pool (extract_entropy)
> Xorg-2154 0d.s. 15us : extract_entropy (xfer_secondary_pool)
> Xorg-2154 0d.s. 16us : xfer_secondary_pool (extract_entropy)
> Xorg-2154 0d.s. 17us+: account (extract_entropy)
> Xorg-2154 0d.s. 19us : extract_buf (extract_entropy)
> Xorg-2154 0d.s. 20us : sha_init (extract_buf)
> Xorg-2154 0d.s. 21us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 29us+: __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 32us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 39us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 40us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 47us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 48us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 55us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 56us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 63us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 64us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 71us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 73us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 79us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 81us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 87us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 81us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 87us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 89us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 91us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 98us : __add_entropy_words (xfer_secondary_pool)
> Xorg-2154 0d.s. 100us : credit_entropy_store (xfer_secondary_pool)
> Xorg-2154 0d.s. 101us : account (extract_entropy)
> Xorg-2154 0d.s1 102us : __wake_up (account)
> Xorg-2154 0d.s2 103us : __wake_up_common (__wake_up)
> Xorg-2154 0d.s. 104us : extract_buf (extract_entropy)
> Xorg-2154 0d.s. 105us : sha_init (extract_buf)
> Xorg-2154 0d.s. 106us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 113us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 114us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 121us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 122us : __add_entropy_words (extract_buf)
> Xorg-2154 0d.s. 124us+: sha_transform (extract_buf)
> Xorg-2154 0d.s. 132us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 133us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 134us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 135us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 136us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 137us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 139us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 140us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 141us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 142us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 143us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 145us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 146us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 148us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 149us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 150us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 151us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 152us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 154us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 155us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 156us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 157us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 158us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 159us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 161us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 162us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 163us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 164us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 166us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 167us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 168us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 169us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 170us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 172us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 173us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 174us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 175us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 176us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 178us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 179us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 180us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 181us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 182us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 184us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 185us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 186us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 187us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 188us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 190us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 191us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 192us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 193us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 194us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 195us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 197us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 198us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 199us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 200us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 201us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 202us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 204us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 205us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 206us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 207us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 208us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 210us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 211us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 212us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 213us : call_rcu_bh (rt_run_flush)
> Xorg-2154 0d.s. 215us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 216us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 217us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 218us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 219us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 220us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 222us : local_bh_enable (rt_run_flush)
> Xorg-2154 0d.s. 223us : call_rcu_bh (rt_run_flush)
>
> [ zillions of these deleted ]
>
> Xorg-2154 0d.s. 7335us : local_bh_enable (rt_run_flush)

Dipankar's latest patch should hopefully address this problem.

Could you please give it a spin when you get a chance?

Thanx, Paul

2006-01-27 18:55:29

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Thu, 2006-01-26 at 11:18 -0800, Paul E. McKenney wrote:
> > Xorg-2154 0d.s. 213us : call_rcu_bh (rt_run_flush)
> > Xorg-2154 0d.s. 215us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 216us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 217us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 218us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 219us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 220us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 222us : local_bh_enable (rt_run_flush)
> > Xorg-2154 0d.s. 223us : call_rcu_bh (rt_run_flush)
> >
> > [ zillions of these deleted ]
> >
> > Xorg-2154 0d.s. 7335us : local_bh_enable (rt_run_flush)
>
> Dipankar's latest patch should hopefully address this problem.
>
> Could you please give it a spin when you get a chance?

Nope, no improvement at all; furthermore, the machine locked up once
under heavy disk activity.

I just got an 8ms+ latency from rt_run_flush that looks basically
identical to the above. It's still flushing routes in huge batches:

$ grep 'call_rcu_bh (rt_run_flush)' /proc/latency_trace | wc -l
2738

Lee

2006-01-28 17:03:36

by Dipankar Sarma

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Fri, Jan 27, 2006 at 01:55:22PM -0500, Lee Revell wrote:
> On Thu, 2006-01-26 at 11:18 -0800, Paul E. McKenney wrote:
> > > Xorg-2154 0d.s. 213us : call_rcu_bh (rt_run_flush)
> > > Xorg-2154 0d.s. 215us : local_bh_enable (rt_run_flush)
> > > Xorg-2154 0d.s. 222us : local_bh_enable (rt_run_flush)
> > > Xorg-2154 0d.s. 223us : call_rcu_bh (rt_run_flush)
> > >
> > > [ zillions of these deleted ]
> > >
> > > Xorg-2154 0d.s. 7335us : local_bh_enable (rt_run_flush)
> >
> > Dipankar's latest patch should hopefully address this problem.
> >
> > Could you please give it a spin when you get a chance?
>
> Nope, no improvement at all, furthermore, the machine locked up once
> under heavy disk activity.
>
> I just got an 8ms+ latency from rt_run_flush that looks basically
> identical to the above. It's still flushing routes in huge batches:

I am not surprised that the earlier patch doesn't help your
test. Once you reach the high watermark, the "desperation mode"
latency can be fairly bad since the RCU batch size is pretty
big.

How about trying out the patch included below? It doesn't reduce the
amount of work done from softirq context, but it decreases the
*number of RCUs* generated during rt_run_flush() by using
one RCU per hash chain. Can you check if this makes any
difference?

Thanks
Dipankar


Reduce the number of RCU callbacks by flushing one hash chain
at a time. This is intended to reduce RCU overhead during
frequent flushing.

Signed-off-by: Dipankar Sarma <[email protected]>
---


net/ipv4/route.c | 18 +++++++++++++-----
1 files changed, 13 insertions(+), 5 deletions(-)

diff -puN net/ipv4/route.c~rcu-rt-flush-list net/ipv4/route.c
--- linux-2.6.16-rc1-rcu/net/ipv4/route.c~rcu-rt-flush-list 2006-01-28 20:34:10.000000000 +0530
+++ linux-2.6.16-rc1-rcu-dipankar/net/ipv4/route.c 2006-01-28 21:30:27.000000000 +0530
@@ -670,13 +670,23 @@ static void rt_check_expire(unsigned lon
mod_timer(&rt_periodic_timer, jiffies + ip_rt_gc_interval);
}

+static void rt_list_free(struct rcu_head *head)
+{
+ struct rtable *next, *rth;
+ rth = container_of(head, struct rtable, u.dst.rcu_head);
+ for (; rth; rth = next) {
+ next = rth->u.rt_next;
+ dst_free(&rth->u.dst);
+ }
+}
+
/* This can run from both BH and non-BH contexts, the latter
* in the case of a forced flush event.
*/
static void rt_run_flush(unsigned long dummy)
{
int i;
- struct rtable *rth, *next;
+ struct rtable *rth;

rt_deadline = 0;

@@ -689,10 +699,8 @@ static void rt_run_flush(unsigned long d
rt_hash_table[i].chain = NULL;
spin_unlock_bh(rt_hash_lock_addr(i));

- for (; rth; rth = next) {
- next = rth->u.rt_next;
- rt_free(rth);
- }
+ if (rth)
+ call_rcu_bh(&rth->u.dst.rcu_head, rt_list_free);
}
}


_

2006-01-28 18:00:09

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sat, 2006-01-28 at 22:33 +0530, Dipankar Sarma wrote:
> On Fri, Jan 27, 2006 at 01:55:22PM -0500, Lee Revell wrote:
> > On Thu, 2006-01-26 at 11:18 -0800, Paul E. McKenney wrote:
> > > > Xorg-2154 0d.s. 213us : call_rcu_bh (rt_run_flush)
> > > > Xorg-2154 0d.s. 215us : local_bh_enable (rt_run_flush)
> > > > Xorg-2154 0d.s. 222us : local_bh_enable (rt_run_flush)
> > > > Xorg-2154 0d.s. 223us : call_rcu_bh (rt_run_flush)
> > > >
> > > > [ zillions of these deleted ]
> > > >
> > > > Xorg-2154 0d.s. 7335us : local_bh_enable (rt_run_flush)
> > >
> > > Dipankar's latest patch should hopefully address this problem.
> > >
> > > Could you please give it a spin when you get a chance?
> >
> > Nope, no improvement at all, furthermore, the machine locked up once
> > under heavy disk activity.
> >
> > I just got an 8ms+ latency from rt_run_flush that looks basically
> > identical to the above. It's still flushing routes in huge batches:
>
> I am not supprised that the earlier patch doesn't help your
> test. Once you reach the high watermark, the "desperation mode"
> latency can be fairly bad since the RCU batch size is pretty
> big.
>
> How about trying out the patch included below ? It doesn't reduce
> amount of work done from softirq context, but decreases the
> *number of RCUs* generated during rt_run_flush() by using
> one RCU per hash chain. Can you check if this makes any
> difference ?
>
> Thanks
> Dipankar
>
>
> Reduce the number of RCU callbacks by flushing one hash chain
> at a time. This is intended to reduce RCU overhead during
> frequent flushing.
>
> Signed-off-by: Dipankar Sarma <[email protected]>

OK, now we are making progress. I've been running my Gnutella client
for half an hour and the worst latency is only ~1ms, in what looks like
a closely related code path -- this time due to holding a spinlock
(rt_hash_lock_addr() in rt_check_expire) rather than merely being in
softirq context as in the previous case. Whether 1ms is too long to be
holding a spinlock can be addressed later; this is a significant
improvement.

preemption latency trace v1.1.5 on 2.6.16-rc1
--------------------------------------------------------------------
latency: 1036 us, #1001/1001, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: Xorg-2221 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
<idle>-0 0d.s4 1us : __trace_start_sched_wakeup (try_to_wake_up)
<idle>-0 0d.s4 1us : __trace_start_sched_wakeup <<...>-2221> (73 0)
<idle>-0 0d.s2 2us : wake_up_state (signal_wake_up)
<idle>-0 0d.s. 3us : hrtimer_forward (it_real_fn)
<idle>-0 0d.s. 4us : ktime_get (hrtimer_forward)
<idle>-0 0d.s. 5us : ktime_get_ts (ktime_get)
<idle>-0 0d.s. 5us : getnstimeofday (ktime_get_ts)
<idle>-0 0d.s. 6us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 6us : get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 7us : set_normalized_timespec (ktime_get_ts)
<idle>-0 0d.s1 8us : enqueue_hrtimer (hrtimer_run_queues)
<idle>-0 0d.s1 10us+: rb_insert_color (enqueue_hrtimer)
<idle>-0 0d.s. 12us+: rt_check_expire (run_timer_softirq)
<idle>-0 0d.s1 14us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 16us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 17us : call_rcu_bh (rt_check_expire)
<idle>-0 0d.s1 18us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 20us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 21us : call_rcu_bh (rt_check_expire)
<idle>-0 0d.s1 22us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 23us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 24us : call_rcu_bh (rt_check_expire)
<idle>-0 0d.s1 25us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 26us : call_rcu_bh (rt_check_expire)
<idle>-0 0d.s1 27us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 27us : call_rcu_bh (rt_check_expire)

[ etc ]

<idle>-0 0d.s1 995us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 996us : call_rcu_bh (rt_check_expire)
<idle>-0 0d.s1 997us : rt_may_expire (rt_check_expire)
<idle>-0 0d.s1 998us : call_rcu_bh (rt_check_expire)
<idle>-0 0d.s. 999us : mod_timer (rt_check_expire)
<idle>-0 0d.s. 1000us : __mod_timer (mod_timer)
<idle>-0 0d.s. 1001us : lock_timer_base (__mod_timer)
<idle>-0 0d.s1 1001us : internal_add_timer (__mod_timer)
<idle>-0 0d.s. 1003us : run_timer_softirq (__do_softirq)
<idle>-0 0d.s. 1004us : hrtimer_run_queues (run_timer_softirq)
<idle>-0 0d.s. 1005us : ktime_get_real (hrtimer_run_queues)
<idle>-0 0d.s. 1005us : getnstimeofday (ktime_get_real)
<idle>-0 0d.s. 1006us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 1007us : get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 1008us : ktime_get (hrtimer_run_queues)
<idle>-0 0d.s. 1008us : ktime_get_ts (ktime_get)
<idle>-0 0d.s. 1009us : getnstimeofday (ktime_get_ts)
<idle>-0 0d.s. 1010us : do_gettimeofday (getnstimeofday)
<idle>-0 0d.s. 1010us : get_offset_tsc (do_gettimeofday)
<idle>-0 0d.s. 1011us : set_normalized_timespec (ktime_get_ts)
<idle>-0 0d.s. 1013us : tasklet_action (__do_softirq)
<idle>-0 0d.s. 1013us : rcu_process_callbacks (tasklet_action)
<idle>-0 0d.s. 1014us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s. 1015us : __rcu_process_callbacks (rcu_process_callbacks)
<idle>-0 0d.s1 1016us+: rcu_start_batch (__rcu_process_callbacks)
<idle>-0 0dn.1 1018us < (2097760)
<idle>-0 0dn.. 1019us : schedule (cpu_idle)
<idle>-0 0dn.. 1020us : stop_trace (schedule)
<idle>-0 0dn.. 1020us : profile_hit (schedule)
<idle>-0 0dn.1 1021us+: sched_clock (schedule)
<idle>-0 0dn.2 1023us : recalc_task_prio (schedule)
<idle>-0 0dn.2 1024us : effective_prio (recalc_task_prio)
<idle>-0 0dn.2 1025us : requeue_task (schedule)
<...>-2221 0d..2 1029us+: __switch_to (schedule)
<...>-2221 0d..2 1031us : schedule <<idle>-0> (8c 73)
<...>-2221 0d..1 1032us : trace_stop_sched_switched (schedule)
<...>-2221 0d..2 1033us+: trace_stop_sched_switched <<...>-2221> (73 0)
<...>-2221 0d..2 1035us : schedule (schedule)


2006-01-28 18:51:27

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sat, 2006-01-28 at 13:00 -0500, Lee Revell wrote:
> OK, now we are making progress.

I spoke too soon, it's not fixed:

preemption latency trace v1.1.5 on 2.6.16-rc1
--------------------------------------------------------------------
latency: 4183 us, #3676/3676, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: Xorg-2221 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
evolutio-2877 0d.h6 1us : __trace_start_sched_wakeup (try_to_wake_up)
evolutio-2877 0d.h6 2us : __trace_start_sched_wakeup <<...>-2221> (73 0)
evolutio-2877 0d.h4 3us : wake_up_state (signal_wake_up)
evolutio-2877 0d.h. 4us : __wake_up (mousedev_notify_readers)
evolutio-2877 0d.h1 5us : __wake_up_common (__wake_up)
evolutio-2877 0d.h. 7us+: usb_submit_urb (hid_irq_in)
evolutio-2877 0d.h. 10us : hcd_submit_urb (usb_submit_urb)
evolutio-2877 0d.h1 11us : usb_get_dev (hcd_submit_urb)
evolutio-2877 0d.h1 11us : get_device (usb_get_dev)
evolutio-2877 0d.h1 12us : kobject_get (get_device)
evolutio-2877 0d.h1 13us : kref_get (kobject_get)
evolutio-2877 0d.h. 14us : usb_get_urb (hcd_submit_urb)
evolutio-2877 0d.h. 14us : kref_get (usb_get_urb)
evolutio-2877 0d.h. 16us : uhci_urb_enqueue (hcd_submit_urb)
evolutio-2877 0d.h1 17us : kmem_cache_alloc (uhci_urb_enqueue)
evolutio-2877 0d.h1 18us : usb_check_bandwidth (uhci_urb_enqueue)
evolutio-2877 0d.h1 19us : usb_calc_bus_time (usb_check_bandwidth)
evolutio-2877 0d.h1 20us : uhci_submit_common (uhci_urb_enqueue)
evolutio-2877 0d.h1 21us : uhci_alloc_td (uhci_submit_common)
evolutio-2877 0d.h1 22us+: dma_pool_alloc (uhci_alloc_td)
evolutio-2877 0d.h1 24us : uhci_alloc_qh (uhci_submit_common)
evolutio-2877 0d.h1 25us : dma_pool_alloc (uhci_alloc_qh)
evolutio-2877 0d.h1 27us : uhci_insert_tds_in_qh (uhci_submit_common)
evolutio-2877 0d.h1 28us : uhci_insert_qh (uhci_submit_common)
evolutio-2877 0d.h1 29us : usb_claim_bandwidth (uhci_urb_enqueue)
evolutio-2877 0d.h. 30us : usb_free_urb (usb_hcd_giveback_urb)
evolutio-2877 0d.h. 31us : kref_put (usb_free_urb)
evolutio-2877 0d.h1 32us : __wake_up (uhci_scan_schedule)
evolutio-2877 0d.h2 32us : __wake_up_common (__wake_up)
evolutio-2877 0d.h. 34us : rhine_interrupt (handle_IRQ_event)
evolutio-2877 0d.h. 35us : ioread16 (rhine_interrupt)
evolutio-2877 0d.h. 37us : ioread8 (rhine_interrupt)
evolutio-2877 0d.h. 39us+: via_driver_irq_handler (handle_IRQ_event)
evolutio-2877 0d.h1 41us : note_interrupt (__do_IRQ)
evolutio-2877 0d.h1 42us : end_8259A_irq (__do_IRQ)
evolutio-2877 0d.H. 45us : irq_exit (do_IRQ)
evolutio-2877 0d.s. 46us < (608)
evolutio-2877 0d.s. 47us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 48us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 49us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 51us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 52us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 53us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 54us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 55us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 56us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 57us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 59us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 60us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 61us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 62us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 63us : call_rcu_bh (rt_run_flush)
evolutio-2877 0d.s. 64us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 66us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 67us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 68us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 69us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 70us : call_rcu_bh (rt_run_flush)
evolutio-2877 0d.s. 72us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 73us : call_rcu_bh (rt_run_flush)
evolutio-2877 0d.s. 74us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 75us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 76us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 77us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 78us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 79us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 81us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 82us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 83us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 84us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 85us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 86us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 88us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 89us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 90us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 91us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 92us : call_rcu_bh (rt_run_flush)
evolutio-2877 0d.s. 93us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 94us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 95us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 97us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 98us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 99us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 100us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 101us : local_bh_enable (rt_run_flush)

[ etc ]

evolutio-2877 0d.s. 4079us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4080us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4081us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4082us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4083us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4085us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4086us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4087us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4088us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4089us : local_bh_enable (rt_run_flush)
evolutio-2877 0d.s. 4091us : mod_timer (rt_secret_rebuild)
evolutio-2877 0d.s. 4092us : __mod_timer (mod_timer)
evolutio-2877 0d.s. 4092us : lock_timer_base (__mod_timer)
evolutio-2877 0d.s1 4093us+: internal_add_timer (__mod_timer)
evolutio-2877 0d.s. 4096us : process_timeout (run_timer_softirq)
evolutio-2877 0d.s. 4097us : wake_up_process (process_timeout)
evolutio-2877 0d.s. 4097us : try_to_wake_up (wake_up_process)
evolutio-2877 0d.s1 4099us : sched_clock (try_to_wake_up)
evolutio-2877 0d.s1 4099us : recalc_task_prio (try_to_wake_up)
evolutio-2877 0d.s1 4101us : effective_prio (recalc_task_prio)
evolutio-2877 0d.s1 4102us : try_to_wake_up <<...>-2847> (73 2)
evolutio-2877 0d.s1 4102us : enqueue_task (try_to_wake_up)
evolutio-2877 0d.s1 4104us : __trace_start_sched_wakeup (try_to_wake_up)
evolutio-2877 0d.H1 4106us : do_IRQ (c01123e4 0 0)
evolutio-2877 0d.h. 4106us : __do_IRQ (do_IRQ)
evolutio-2877 0d.h1 4107us+: mask_and_ack_8259A (__do_IRQ)
evolutio-2877 0d.h. 4111us : handle_IRQ_event (__do_IRQ)
evolutio-2877 0d.h. 4112us : timer_interrupt (handle_IRQ_event)
evolutio-2877 0d.h1 4112us+: mark_offset_tsc (timer_interrupt)
evolutio-2877 0d.h1 4119us : do_timer (timer_interrupt)
evolutio-2877 0d.h1 4120us : update_process_times (timer_interrupt)
evolutio-2877 0d.h1 4120us : account_system_time (update_process_times)
evolutio-2877 0d.h1 4121us : acct_update_integrals (account_system_time)
evolutio-2877 0d.h1 4122us : run_local_timers (update_process_times)
evolutio-2877 0d.h1 4122us : raise_softirq (run_local_timers)
evolutio-2877 0d.h1 4123us : rcu_pending (update_process_times)
evolutio-2877 0d.h1 4123us : __rcu_pending (rcu_pending)
evolutio-2877 0d.h1 4124us : __rcu_pending (rcu_pending)
evolutio-2877 0d.h1 4125us : rcu_check_callbacks (update_process_times)
evolutio-2877 0d.h1 4125us : idle_cpu (rcu_check_callbacks)
evolutio-2877 0d.h1 4126us : scheduler_tick (update_process_times)
evolutio-2877 0d.h1 4127us : sched_clock (scheduler_tick)
evolutio-2877 0d.h1 4128us : run_posix_cpu_timers (update_process_times)
evolutio-2877 0d.h1 4129us : smp_local_timer_interrupt (timer_interrupt)
evolutio-2877 0d.h1 4129us : profile_tick (smp_local_timer_interrupt)
evolutio-2877 0d.h1 4130us : profile_hit (profile_tick)
evolutio-2877 0d.h1 4131us : note_interrupt (__do_IRQ)
evolutio-2877 0d.h1 4132us : end_8259A_irq (__do_IRQ)
evolutio-2877 0d.h1 4132us : enable_8259A_irq (end_8259A_irq)
evolutio-2877 0d.H1 4134us : irq_exit (do_IRQ)
evolutio-2877 0d.s1 4135us < (608)
evolutio-2877 0d.s. 4136us+: wake_up_process (process_timeout)
evolutio-2877 0d.s. 4138us : i8042_timer_func (run_timer_softirq)
evolutio-2877 0d.s. 4139us : i8042_interrupt (i8042_timer_func)
evolutio-2877 0d.s. 4139us : mod_timer (i8042_interrupt)
evolutio-2877 0d.s. 4140us : __mod_timer (mod_timer)
evolutio-2877 0d.s. 4141us : lock_timer_base (__mod_timer)
evolutio-2877 0d.s1 4141us+: internal_add_timer (__mod_timer)
evolutio-2877 0d.s. 4145us : run_timer_softirq (__do_softirq)
evolutio-2877 0d.s. 4145us : hrtimer_run_queues (run_timer_softirq)
evolutio-2877 0d.s. 4146us : ktime_get_real (hrtimer_run_queues)
evolutio-2877 0d.s. 4147us : getnstimeofday (ktime_get_real)
evolutio-2877 0d.s. 4147us : do_gettimeofday (getnstimeofday)
evolutio-2877 0d.s. 4148us : get_offset_tsc (do_gettimeofday)
evolutio-2877 0d.s. 4149us : ktime_get (hrtimer_run_queues)
evolutio-2877 0d.s. 4150us : ktime_get_ts (ktime_get)
evolutio-2877 0d.s. 4151us : getnstimeofday (ktime_get_ts)
evolutio-2877 0d.s. 4151us : do_gettimeofday (getnstimeofday)
evolutio-2877 0d.s. 4152us : get_offset_tsc (do_gettimeofday)
evolutio-2877 0d.s. 4153us : set_normalized_timespec (ktime_get_ts)
evolutio-2877 0d.s. 4154us : tasklet_action (__do_softirq)
evolutio-2877 0d.s. 4155us : rcu_process_callbacks (tasklet_action)
evolutio-2877 0d.s. 4156us : __rcu_process_callbacks (rcu_process_callbacks)
evolutio-2877 0d.s. 4157us : __rcu_process_callbacks (rcu_process_callbacks)
evolutio-2877 0d.s1 4158us+: rcu_start_batch (__rcu_process_callbacks)
evolutio-2877 0dn.. 4160us : preempt_schedule_irq (need_resched)
evolutio-2877 0dn.. 4161us : schedule (preempt_schedule_irq)
evolutio-2877 0dn.. 4162us : stop_trace (schedule)
evolutio-2877 0dn.. 4163us : profile_hit (schedule)
evolutio-2877 0dn.1 4164us+: sched_clock (schedule)
evolutio-2877 0dn.2 4167us : recalc_task_prio (schedule)
evolutio-2877 0dn.2 4168us : effective_prio (recalc_task_prio)
evolutio-2877 0dn.2 4169us+: requeue_task (schedule)
<...>-2221 0d..2 4174us+: __switch_to (schedule)
<...>-2221 0d..2 4178us : schedule <evolutio-2877> (7d 73)
<...>-2221 0d..1 4179us : trace_stop_sched_switched (schedule)
<...>-2221 0d..2 4180us : trace_stop_sched_switched <<...>-2221> (73 0)
<...>-2221 0d..2 4182us : schedule (schedule)

Lee


2006-01-28 19:34:42

by Dipankar Sarma

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sat, Jan 28, 2006 at 01:51:23PM -0500, Lee Revell wrote:
> On Sat, 2006-01-28 at 13:00 -0500, Lee Revell wrote:
> > OK, now we are making progress.
>
> I spoke too soon, it's not fixed:
>
> preemption latency trace v1.1.5 on 2.6.16-rc1
> --------------------------------------------------------------------
> latency: 4183 us, #3676/3676, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
> -----------------
> evolutio-2877 0d.s. 97us : local_bh_enable (rt_run_flush)
> evolutio-2877 0d.s. 98us : local_bh_enable (rt_run_flush)
> evolutio-2877 0d.s. 99us : local_bh_enable (rt_run_flush)
> evolutio-2877 0d.s. 100us : local_bh_enable (rt_run_flush)
> evolutio-2877 0d.s. 101us : local_bh_enable (rt_run_flush)
>
> [ etc ]
>
> evolutio-2877 0d.s. 4079us : local_bh_enable (rt_run_flush)
> evolutio-2877 0d.s. 4080us : local_bh_enable (rt_run_flush)

I am not sure if I am interpreting the latency trace right,
but it seems that there is a difference between the problem
you were seeing earlier and the one you are seeing now.

In one of your earlier traces, I saw -

<idle>-0 0d.s. 182us : dst_destroy (dst_rcu_free)
<idle>-0 0d.s. 183us : ipv4_dst_destroy (dst_destroy)

[ etc - zillions of dst_rcu_free()s deleted ]

<idle>-0 0d.s. 13403us : dst_rcu_free (__rcu_process_callbacks)
<idle>-0 0d.s. 13403us : dst_destroy (dst_rcu_free)

This points to latency increase caused by lots and lots of
RCU callbacks doing dst_rcu_free(). Do you still see those ?

Your new trace shows that we are held up in rt_run_flush().
I guess we need to investigate why we spend so much time in rt_run_flush() --
whether it is because of a big route table or because of the lock acquisitions.

Thanks
Dipankar

2006-01-28 19:46:54

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sun, 2006-01-29 at 01:04 +0530, Dipankar Sarma wrote:
> On Sat, Jan 28, 2006 at 01:51:23PM -0500, Lee Revell wrote:
> > On Sat, 2006-01-28 at 13:00 -0500, Lee Revell wrote:
> > > OK, now we are making progress.
> >
> > I spoke too soon, it's not fixed:
> >
> > preemption latency trace v1.1.5 on 2.6.16-rc1
> > --------------------------------------------------------------------
> > latency: 4183 us, #3676/3676, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
> > -----------------
> > evolutio-2877 0d.s. 97us : local_bh_enable (rt_run_flush)
> > evolutio-2877 0d.s. 98us : local_bh_enable (rt_run_flush)
> > evolutio-2877 0d.s. 99us : local_bh_enable (rt_run_flush)
> > evolutio-2877 0d.s. 100us : local_bh_enable (rt_run_flush)
> > evolutio-2877 0d.s. 101us : local_bh_enable (rt_run_flush)
> >
> > [ etc ]
> >
> > evolutio-2877 0d.s. 4079us : local_bh_enable (rt_run_flush)
> > evolutio-2877 0d.s. 4080us : local_bh_enable (rt_run_flush)
>
> I am not sure if I am interpreting the latency trace right,
> but it seems that there is a difference between the problem
> you were seeing earlier and now.
>
> In one of your earlier traces, I saw -
>
> <idle>-0 0d.s. 182us : dst_destroy (dst_rcu_free)
> <idle>-0 0d.s. 183us : ipv4_dst_destroy (dst_destroy)
>
> [ etc - zillions of dst_rcu_free()s deleted ]
>
> <idle>-0 0d.s. 13403us : dst_rcu_free (__rcu_process_callbacks)
> <idle>-0 0d.s. 13403us : dst_destroy (dst_rcu_free)
>
> This points to latency increase caused by lots and lots of
> RCU callbacks doing dst_rcu_free(). Do you still see those ?
>
> Your new trace shows that we are held up in in rt_run_flush().
> I guess we need to investigate why we spend so much time in rt_run_flush(),
> because of a big route table or the lock acquisitions.

Yes, you are right, these are two different problems. Paul suggested that
this patch might help with the rt_run_flush() problem.

It will take a while to determine whether it fixes the dst_rcu_free()
problem as I was never able to reproduce it on demand.

Lee

2006-01-28 19:52:09

by Eric Dumazet

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

Dipankar Sarma wrote:
> On Sat, Jan 28, 2006 at 01:51:23PM -0500, Lee Revell wrote:
>> On Sat, 2006-01-28 at 13:00 -0500, Lee Revell wrote:
>>> OK, now we are making progress.
>> I spoke too soon, it's not fixed:
>>
>> preemption latency trace v1.1.5 on 2.6.16-rc1
>> --------------------------------------------------------------------
>> latency: 4183 us, #3676/3676, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
>> -----------------
>> evolutio-2877 0d.s. 97us : local_bh_enable (rt_run_flush)
>> evolutio-2877 0d.s. 98us : local_bh_enable (rt_run_flush)
>> evolutio-2877 0d.s. 99us : local_bh_enable (rt_run_flush)
>> evolutio-2877 0d.s. 100us : local_bh_enable (rt_run_flush)
>> evolutio-2877 0d.s. 101us : local_bh_enable (rt_run_flush)
>>
>> [ etc ]
>>
>> evolutio-2877 0d.s. 4079us : local_bh_enable (rt_run_flush)
>> evolutio-2877 0d.s. 4080us : local_bh_enable (rt_run_flush)
>
> I am not sure if I am interpreting the latency trace right,
> but it seems that there is a difference between the problem
> you were seeing earlier and now.
>
> In one of your earlier traces, I saw -
>
> <idle>-0 0d.s. 182us : dst_destroy (dst_rcu_free)
> <idle>-0 0d.s. 183us : ipv4_dst_destroy (dst_destroy)
>
> [ etc - zillions of dst_rcu_free()s deleted ]
>
> <idle>-0 0d.s. 13403us : dst_rcu_free (__rcu_process_callbacks)
> <idle>-0 0d.s. 13403us : dst_destroy (dst_rcu_free)
>
> This points to latency increase caused by lots and lots of
> RCU callbacks doing dst_rcu_free(). Do you still see those ?
>
> Your new trace shows that we are held up in in rt_run_flush().
> I guess we need to investigate why we spend so much time in rt_run_flush(),
> because of a big route table or the lock acquisitions.

Some machines have millions of entries in their route cache.

I suspect we cannot queue all of them (or even only the hash heads, as in your
previous patch) through RCU. Latencies and/or OOM can occur.

What can be done is:

in rt_run_flush(), allocate a new empty hash table, and exchange the hash tables.

Then wait a quiescent/grace RCU period (maybe the exact term is not this one,
sorry, I'm not an RCU expert).

Then free all the entries from the old hash table (directly of course, no need
for an RCU grace period), and free the hash table.

As the hash table can be huge, we might need to allocate it at boot time, just in
case a flush is needed (it usually is :) ). If we choose dynamic allocation
and this allocation fails, then fall back to what is done today.

Eric
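
A minimal sketch of the table-swap idea Eric describes, in kernel-style C
(rt_spare_table, rt_flush_by_table_swap and the preallocation are hypothetical;
the real route.c locking, the resize question and the RCU-bh flavour used by
the route-cache readers are glossed over, and a context that may block is
assumed):

	/* Illustrative sketch only -- not the actual net/ipv4/route.c code. */
	static struct rt_hash_bucket *rt_hash_table;	/* current table */
	static struct rt_hash_bucket *rt_spare_table;	/* preallocated spare */

	static void rt_flush_by_table_swap(void)
	{
		struct rt_hash_bucket *old = rt_hash_table;
		int i;

		/* Publish the empty spare table; new lookups miss and refill it. */
		rcu_assign_pointer(rt_hash_table, rt_spare_table);

		/* Wait until no reader can still be walking the old table.
		 * (Route-cache readers run under rcu_read_lock_bh(), so the
		 * real code would need the matching flavour, or call_rcu_bh.) */
		synchronize_rcu();

		/* Old entries can now be freed directly -- no per-entry RCU. */
		for (i = 0; i <= rt_hash_mask; i++) {
			struct rtable *rth, *next;

			for (rth = old[i].chain; rth; rth = next) {
				next = rth->u.rt_next;
				dst_free(&rth->u.dst);
			}
			old[i].chain = NULL;
		}
		rt_spare_table = old;	/* recycle for the next flush */
	}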

2006-01-29 07:42:34

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sat, 2006-01-28 at 20:52 +0100, Eric Dumazet wrote:
> > Your new trace shows that we are held up in in rt_run_flush().
> > I guess we need to investigate why we spend so much time in rt_run_flush(),
> > because of a big route table or the lock acquisitions.
>
> Some machines have millions of entries in their route cache.
>
> I suspect we cannot queue all them (or only hash heads as your
> previous patch) by RCU. Latencies and/or OOM can occur.
>
> What can be done is :
>
> in rt_run_flush(), allocate a new empty hash table, and exchange the
> hash tables.
>
> Then wait a quiescent/grace RCU period (may be the exact term is not
> this one, sorry, I'm not RCU expert)
>
> Then free all the entries from the old hash table (direclty of course,
> no need for RCU grace period), and free the hash table.
>
> As the hash table can be huge, we might need allocate it at boot time,
> just in case a flush is needed (it usually is :) ). If we choose
> dynamic allocation and this allocation fails, then fallback to what is
> done today.
>

No problem, I'm not a networking expert...

Ingo's response to these traces was that softirq preemption, which
simply offloads all softirq processing to softirqd and has been tested
in the -rt patchset for over a year, is the easiest solution. Any
thoughts on that? Personally, I'd rather fix the very few problematic
softirqs than take such a drastic step - this softirq appears to be one
of the last obstacles to meeting a 1ms soft RT constraint with the
mainline kernel.

Thanks for looking at this; I'd be glad to test any patches...

Lee


2006-01-29 07:51:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1


* Lee Revell <[email protected]> wrote:

> On Sat, 2006-01-28 at 20:52 +0100, Eric Dumazet wrote:
> > > Your new trace shows that we are held up in in rt_run_flush().
> > > I guess we need to investigate why we spend so much time in rt_run_flush(),
> > > because of a big route table or the lock acquisitions.
> >
> > Some machines have millions of entries in their route cache.
> >
> > I suspect we cannot queue all them (or only hash heads as your
> > previous patch) by RCU. Latencies and/or OOM can occur.
> >
> > What can be done is :
> >
> > in rt_run_flush(), allocate a new empty hash table, and exchange the
> > hash tables.
> >
> > Then wait a quiescent/grace RCU period (may be the exact term is not
> > this one, sorry, I'm not RCU expert)
> >
> > Then free all the entries from the old hash table (direclty of course,
> > no need for RCU grace period), and free the hash table.
> >
> > As the hash table can be huge, we might need allocate it at boot time,
> > just in case a flush is needed (it usually is :) ). If we choose
> > dynamic allocation and this allocation fails, then fallback to what is
> > done today.
> >
>
> No problem, I'm not a networking expert...
>
> Ingo's response to these traces was that softirq preemption, which
> simply offloads all softirq processing to softirqd and has been tested
> in the -rt patchset for over a year, is the easiest solution. Any
> thoughts on that? Personally, I'd rather fix the very few problematic
> softirqs, than take such a drastic step - this softirq appears to be
> one of the last obstacles to being able to meet a 1ms soft RT
> constraint with the mainline kernel.

well, softirq preemption is not really a drastic step - its biggest
problem is that it cannot be included in v2.6.16 ;-) But i agree that if
a solution can be found to break up a latency path, that is preferred.

Ingo

2006-01-29 08:21:46

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sun, 2006-01-29 at 08:51 +0100, Ingo Molnar wrote:
>
> well, softirq preemption is not really a drastic step - its biggest
> problem is that it cannot be included in v2.6.16 ;-) But i agree that
> if a solution can be found to break up a latency path, that is
> preferred.
>

Agreed. It's only drastic in the context of my target (meeting a 1ms
soft RT constraint), as this happens to be one of the only code paths
getting in the way. Also I'd like to be able to go down to 1ms without
requiring a custom kernel config...

Lee

2006-01-30 04:37:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sat, Jan 28, 2006 at 08:52:02PM +0100, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Sat, Jan 28, 2006 at 01:51:23PM -0500, Lee Revell wrote:
> >>On Sat, 2006-01-28 at 13:00 -0500, Lee Revell wrote:
> >>>OK, now we are making progress.
> >>I spoke too soon, it's not fixed:
> >>
> >>preemption latency trace v1.1.5 on 2.6.16-rc1
> >>--------------------------------------------------------------------
> >> latency: 4183 us, #3676/3676, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
> >> -----------------
> >>evolutio-2877 0d.s. 97us : local_bh_enable (rt_run_flush)
> >>evolutio-2877 0d.s. 98us : local_bh_enable (rt_run_flush)
> >>evolutio-2877 0d.s. 99us : local_bh_enable (rt_run_flush)
> >>evolutio-2877 0d.s. 100us : local_bh_enable (rt_run_flush)
> >>evolutio-2877 0d.s. 101us : local_bh_enable (rt_run_flush)
> >>
> >>[ etc ]
> >>
> >>evolutio-2877 0d.s. 4079us : local_bh_enable (rt_run_flush)
> >>evolutio-2877 0d.s. 4080us : local_bh_enable (rt_run_flush)
> >
> >I am not sure if I am interpreting the latency trace right,
> >but it seems that there is a difference between the problem
> >you were seeing earlier and now.
> >
> >In one of your earlier traces, I saw -
> >
> > <idle>-0 0d.s. 182us : dst_destroy (dst_rcu_free)
> > <idle>-0 0d.s. 183us : ipv4_dst_destroy (dst_destroy)
> >
> >[ etc - zillions of dst_rcu_free()s deleted ]
> >
> > <idle>-0 0d.s. 13403us : dst_rcu_free (__rcu_process_callbacks)
> > <idle>-0 0d.s. 13403us : dst_destroy (dst_rcu_free)
> >
> >This points to latency increase caused by lots and lots of
> >RCU callbacks doing dst_rcu_free(). Do you still see those ?
> >
> >Your new trace shows that we are held up in in rt_run_flush().
> >I guess we need to investigate why we spend so much time in rt_run_flush(),
> >because of a big route table or the lock acquisitions.
>
> Some machines have millions of entries in their route cache.
>
> I suspect we cannot queue all them (or only hash heads as your previous
> patch) by RCU. Latencies and/or OOM can occur.
>
> What can be done is :
>
> in rt_run_flush(), allocate a new empty hash table, and exchange the hash
> tables.
>
> Then wait a quiescent/grace RCU period (may be the exact term is not this
> one, sorry, I'm not RCU expert)
>
> Then free all the entries from the old hash table (direclty of course, no
> need for RCU grace period), and free the hash table.
>
> As the hash table can be huge, we might need allocate it at boot time, just
> in case a flush is needed (it usually is :) ). If we choose dynamic
> allocation and this allocation fails, then fallback to what is done today.

Interesting approach!

If I remember correctly, the point of all of this is to perturb the hash
function periodically in order to avoid DoS attacks. It will likely
be necessary to avoid a big performance hit during the transition.
One way of doing this, given your two-table scheme, would be to:

o Allocate both tables at boot time, as you suggest above.

o Keep the following additional state:

o Pointer to the table that is the current table.

o First valid index (fvl) into the current table -- all
indexes below the fvl correspond to hash buckets that
have been transferred into the non-current table.
In the normal case where the tables are not being
switched, fvl==-1.

(To make the RCU searches work without requiring
tons of explicit memory barriers, there needs to
be a separate fvl for each of the tables.)

o Parameters defining the hash functions for the current
table and for the non-current table.

o When it is time to switch tables, start removing the entries
in hash bucket #fvl of the current table. Optionally put them
into the non-current table (or just let them be added as they
are needed). Only remove a limited number of entries (or,
alternatively, stop removing them after a limited amount of
time).

When the current hash bucket has been completely emptied,
increment fvl, and, if we have not already hit the limit,
continue on the new hash bucket.

When fvl runs off the end of the table, you are done with
the switch. Update the pointer to reference the other
table. Important -- do -not- start another switch until
a grace period has elapsed!!! Otherwise, you will end
up fatally confusing slow readers.

o When searching, if the hash function gives a value less
than fvl, search the non-current table.

If the hash function gives a value equal to fvl, search
the current table, and, if not found, search the non-current
table.

If the hash function gives a value greater than fvl, search
only the current table. (It may also be necessary to search
the non-current table to allow for races with fvl update.)

Does this seem reasonable?

Thanx, Paul
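
A rough reader-side sketch of the lookup rule described above, in C with
hypothetical names (rt_htable, hash_param, rt_hash(), chain_lookup(), cur_tbl
and old_tbl are placeholders; reference counting, the per-table fvl detail and
memory ordering are omitted -- Paul's later pseudocode refines all of this):

	/* Illustrative sketch only -- caller is assumed to hold rcu_read_lock_bh(). */
	struct rt_htable {
		int fvl;			/* first valid bucket, -1 when no switch */
		struct hash_param params;	/* hash parameters for this table */
		struct list_head buckets[0];
	};

	static struct rt_htable *cur_tbl;	/* "current" table */
	static struct rt_htable *old_tbl;	/* "non-current" table */

	static struct rtable *rt_lookup(const struct flowi *key)
	{
		struct rt_htable *t = rcu_dereference(cur_tbl);
		unsigned int h = rt_hash(key, &t->params);
		struct rtable *rth = NULL;

		/* Buckets at or above fvl still live in the current table. */
		if (t->fvl < 0 || h >= (unsigned int)t->fvl)
			rth = chain_lookup(&t->buckets[h], key);

		/* During a switch, fall back to the non-current table; this
		 * covers hash < fvl, hash == fvl, and races on fvl updates. */
		if (!rth && t->fvl >= 0) {
			struct rt_htable *o = rcu_dereference(old_tbl);

			h = rt_hash(key, &o->params);	/* its own hash params */
			rth = chain_lookup(&o->buckets[h], key);
		}
		return rth;
	}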

2006-01-30 04:55:39

by Eric Dumazet

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

Paul E. McKenney wrote:
> On Sat, Jan 28, 2006 at 08:52:02PM +0100, Eric Dumazet wrote:
>> Dipankar Sarma wrote:
>>> On Sat, Jan 28, 2006 at 01:51:23PM -0500, Lee Revell wrote:
>>>> On Sat, 2006-01-28 at 13:00 -0500, Lee Revell wrote:
>>>>> OK, now we are making progress.
>>>> I spoke too soon, it's not fixed:
>>>>
>>>> preemption latency trace v1.1.5 on 2.6.16-rc1
>>>> --------------------------------------------------------------------
>>>> latency: 4183 us, #3676/3676, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
>>>> -----------------
>>>> evolutio-2877 0d.s. 97us : local_bh_enable (rt_run_flush)
>>>> evolutio-2877 0d.s. 98us : local_bh_enable (rt_run_flush)
>>>> evolutio-2877 0d.s. 99us : local_bh_enable (rt_run_flush)
>>>> evolutio-2877 0d.s. 100us : local_bh_enable (rt_run_flush)
>>>> evolutio-2877 0d.s. 101us : local_bh_enable (rt_run_flush)
>>>>
>>>> [ etc ]
>>>>
>>>> evolutio-2877 0d.s. 4079us : local_bh_enable (rt_run_flush)
>>>> evolutio-2877 0d.s. 4080us : local_bh_enable (rt_run_flush)
>>> I am not sure if I am interpreting the latency trace right,
>>> but it seems that there is a difference between the problem
>>> you were seeing earlier and now.
>>>
>>> In one of your earlier traces, I saw -
>>>
>>> <idle>-0 0d.s. 182us : dst_destroy (dst_rcu_free)
>>> <idle>-0 0d.s. 183us : ipv4_dst_destroy (dst_destroy)
>>>
>>> [ etc - zillions of dst_rcu_free()s deleted ]
>>>
>>> <idle>-0 0d.s. 13403us : dst_rcu_free (__rcu_process_callbacks)
>>> <idle>-0 0d.s. 13403us : dst_destroy (dst_rcu_free)
>>>
>>> This points to latency increase caused by lots and lots of
>>> RCU callbacks doing dst_rcu_free(). Do you still see those ?
>>>
>>> Your new trace shows that we are held up in in rt_run_flush().
>>> I guess we need to investigate why we spend so much time in rt_run_flush(),
>>> because of a big route table or the lock acquisitions.
>> Some machines have millions of entries in their route cache.
>>
>> I suspect we cannot queue all them (or only hash heads as your previous
>> patch) by RCU. Latencies and/or OOM can occur.
>>
>> What can be done is :
>>
>> in rt_run_flush(), allocate a new empty hash table, and exchange the hash
>> tables.
>>
>> Then wait a quiescent/grace RCU period (may be the exact term is not this
>> one, sorry, I'm not RCU expert)
>>
>> Then free all the entries from the old hash table (direclty of course, no
>> need for RCU grace period), and free the hash table.
>>
>> As the hash table can be huge, we might need allocate it at boot time, just
>> in case a flush is needed (it usually is :) ). If we choose dynamic
>> allocation and this allocation fails, then fallback to what is done today.
>
> Interesting approach!
>
> If I remember correctly, the point of all of this is to perturb the hash
> function periodically in order to avoid DoS attacks. It will likely
> be necessary to avoid a big performance hit during the transition.
> One way of doing this, given your two-table scheme, would be to:
>
> o Allocate both tables at boot time, as you suggest above.
>
> o Keep the following additional state:
>
> o Pointer to the table that is the current table.
>
> o First valid index (fvl) into the current table -- all
> indexes below the fvl correspond to hash buckets that
> have been transferred into the non-current table.
> In the normal case where the tables are not being
> switched, fvl==-1.
>
> (To make the RCU searches work without requiring
> tons of explicit memory barriers, there needs to
> be a separate fvl for each of the tables.)
>
> o Parameters defining the hash functions for the current
> table and for the non-current table.
>
> o When it is time to switch tables, start removing the entries
> in hash bucket #fvl of the current table. Optionally put them
> into the non-current table (or just let them be added as they
> are needed. Only remove a limited number of entries (or,
> alternatively, stop removing them after a limited amount of
> time).
>
> When the current hash bucket has been completely emptied,
> increment fvl, and, if we have not already hit the limit,
> continue on the new hash bucket.
>
> When fvl runs off the end of the table, you are done with
> the switch. Update the pointer to reference the other
> table. Important -- do -not- start another switch until
> a grace period has elapsed!!! Otherwise, you will end
> up fatally confusing slow readers.
>
> o When searching, if the hash function gives a value less
> than fvl, search the non-current table.
>
> If the hash function gives a value equal to fvl, search
> the current table, and, if not found, search the non-current
> table.
>
> If the hash function gives a value greater than fvl, search
> only the current table. (It may also be necessary to search
> the non-current table to allow for races with fvl update.)
>
> Does this seem reasonable?
>
> Thanx, Paul

Well, if as a bonus we are able to expand the size of the hash table, it could
be very, very good: as of today, the boot-time sizing of this hash table is
somewhat problematic.

If the size is expanded by a factor of 2 (or a power of two), can your proposal
work?

Eric

2006-01-30 05:12:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Mon, Jan 30, 2006 at 05:55:37AM +0100, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> >On Sat, Jan 28, 2006 at 08:52:02PM +0100, Eric Dumazet wrote:

[ . . . ]

> >>Some machines have millions of entries in their route cache.
> >>
> >>I suspect we cannot queue all them (or only hash heads as your previous
> >>patch) by RCU. Latencies and/or OOM can occur.
> >>
> >>What can be done is :
> >>
> >>in rt_run_flush(), allocate a new empty hash table, and exchange the hash
> >>tables.
> >>
> >>Then wait a quiescent/grace RCU period (may be the exact term is not this
> >>one, sorry, I'm not RCU expert)
> >>
> >>Then free all the entries from the old hash table (direclty of course, no
> >>need for RCU grace period), and free the hash table.
> >>
> >>As the hash table can be huge, we might need allocate it at boot time,
> >>just in case a flush is needed (it usually is :) ). If we choose dynamic
> >>allocation and this allocation fails, then fallback to what is done today.
> >
> >Interesting approach!
> >
> >If I remember correctly, the point of all of this is to perturb the hash
> >function periodically in order to avoid DoS attacks. It will likely
> >be necessary to avoid a big performance hit during the transition.
> >One way of doing this, given your two-table scheme, would be to:
> >
> >o Allocate both tables at boot time, as you suggest above.
> >
> >o Keep the following additional state:
> >
> > o Pointer to the table that is the current table.
> >
> > o First valid index (fvl) into the current table -- all
> > indexes below the fvl correspond to hash buckets that
> > have been transferred into the non-current table.
> > In the normal case where the tables are not being
> > switched, fvl==-1.
> >
> > (To make the RCU searches work without requiring
> > tons of explicit memory barriers, there needs to
> > be a separate fvl for each of the tables.)
> >
> > o Parameters defining the hash functions for the current
> > table and for the non-current table.
> >
> >o When it is time to switch tables, start removing the entries
> > in hash bucket #fvl of the current table. Optionally put them
> > into the non-current table (or just let them be added as they
> > are needed. Only remove a limited number of entries (or,
> > alternatively, stop removing them after a limited amount of
> > time).
> >
> > When the current hash bucket has been completely emptied,
> > increment fvl, and, if we have not already hit the limit,
> > continue on the new hash bucket.
> >
> > When fvl runs off the end of the table, you are done with
> > the switch. Update the pointer to reference the other
> > table. Important -- do -not- start another switch until
> > a grace period has elapsed!!! Otherwise, you will end
> > up fatally confusing slow readers.
> >
> >o When searching, if the hash function gives a value less
> > than fvl, search the non-current table.
> >
> > If the hash function gives a value equal to fvl, search
> > the current table, and, if not found, search the non-current
> > table.
> >
> > If the hash function gives a value greater than fvl, search
> > only the current table. (It may also be necessary to search
> > the non-current table to allow for races with fvl update.)
> >
> >Does this seem reasonable?
> >
> > Thanx, Paul
>
> Well, if as a bonus we are able to expand the size of the hash table, it
> could be very very good : As of today, the boot time sizing of this hash
> table is somewhat problematic.
>
> If the size is expanded by a 2 factor (or a power of too), can your
> proposal works ?

Yep!!!

Add the following:

o Add a size variable for each of the tables. It works best
if the per-table state is stored with the table itself, for
example:

struct hashtbl {
	int size;
	int fvl;
	struct hash_param params;
	struct list_head buckets[0];
};

o When switching tables, allocate a new one of the desired size
and free up the non-current one. (But remember to wait at least
one grace period after the last switch before starting this!!!)

o Compute hash parameters suitable for the new table size.

o Continue as before.

Note that you are not restricted to power-of-two expansion -- the
hash parameters should handle any desired difference, and in fact
handle contraction as well as expansion.

Thanx, Paul
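
As a concrete note on the buckets[0] idiom above, the table and its per-table
state would typically be allocated in one shot, roughly like this
(compute_hash_params() is a placeholder, and a really large table would want
vmalloc() or boot-time allocation, as Eric pointed out):

	/* Illustrative sketch only -- allocate an n-bucket table plus its state. */
	static struct hashtbl *alloc_hashtbl(int n)
	{
		struct hashtbl *t;
		int i;

		t = kmalloc(sizeof(*t) + n * sizeof(struct list_head), GFP_KERNEL);
		if (!t)
			return NULL;

		t->size = n;
		t->fvl = -1;		/* no switch in progress */
		compute_hash_params(&t->params, n);	/* placeholder: derive params for n buckets */
		for (i = 0; i < n; i++)
			INIT_LIST_HEAD(&t->buckets[i]);
		return t;
	}

Nothing here requires n to be a power of two, which is what allows arbitrary
expansion or contraction of the table.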

2006-01-30 05:53:32

by David Miller

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

From: "Paul E. McKenney" <[email protected]>
Date: Sun, 29 Jan 2006 21:11:56 -0800

> > If the size is expanded by a 2 factor (or a power of too), can your
> > proposal works ?
>
> Yep!!!
>
> Add the following:

This all sounds very exciting and promising. I'll try to find some
spare cycles tomorrow to cook something up.

2006-01-30 10:01:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sun, Jan 29, 2006 at 09:52:44PM -0800, David S. Miller wrote:
> From: "Paul E. McKenney" <[email protected]>
> Date: Sun, 29 Jan 2006 21:11:56 -0800
>
> > > If the size is expanded by a 2 factor (or a power of too), can your
> > > proposal works ?
> >
> > Yep!!!
> >
> > Add the following:
>
> This all sounds very exciting and promising. I'll try to find some
> spare cycles tomorrow to cook something up.

Cool! My earlier description did not get rid of all the explicit
memory barriers. The following updated pseudocode does so.

I will of course be happy to review what you come up with!!!
(And I will be interested to see if this still makes sense after
sleeping on it...)

Improvements welcome as always!!!

Thanx, Paul

------------------------------------------------------------------------

Data Structures

o Yes, the names suck, but that is why I usually am happy to
have other people come up with names...

o Having separate fvbupd and fvbrdr removes the need for
(hopefully all) explicit memory barriers.

struct hashtbl {
	int nbuckets;
	int fvbupd;	/* updater's first valid bucket. */
	int fvbrdr;	/* readers' first valid bucket. */
	struct hash_param params;
	struct list_head buckets[0];
};

/* Both fvbupd and fvbrdr need to be initialized to -1. */

struct hashtbl *current;
struct hashtbl *not_current;

Switch Pseudocode: assumes that the switcher can block; if not,
then it needs to be made a state machine driven by call_rcu().

o Wait until at least one grace period has elapsed since the
previous switch. Easiest to just put "synchronize_rcu()"
up front, but can take advantage of naturally occurring
grace periods that elapsed since the last switch if desired.

o Initialize the not-current array, including the hash params.
If the new array is to be different in size, allocate the
new one and free the not-current array.

o Use rcu_assign_pointer() to point not_current to (you guessed
it!) the not-current array.

o Set current->fvbupd = current->fvbrdr = 0;

o Wait for a grace period (synchronize_rcu() if blocking OK,
otherwise call_rcu()...). The delay guarantees that readers
are aware that things might be disappearing before we actually
start making them disappear.

o Loop until all buckets of current are emptied:

o current->fvbrdr = current->fvbupd

No memory barriers needed because all buckets
preceding current->fvbupd were emptied at least
one grace period ago, so any readers that were
active when any of those buckets were non-empty
have now completed.

o Remove elements from current->bucket[current->fvbupd]
until the bucket is empty or until we use up our
allotted quota of time and/or elements.

o If we empty the bucket, we of course increment
current->fvbupd, and if there is time/elements
left, could continue on the next bucket.

o Wait for a grace period.

-Really- big arrays on really big SMP machines might want
to have multiple CPUs processing buckets in parallel. In theory,
this is not hard, but the fvbrdr update gets messier.

And yes, there needs to be some sort of update-side locking.
Per-bucket locking seems like it should work -- one would
hold the bucket lock for the current array throughout,
and acquire and release the bucket lock for the non-current
array on a per-element basis. One perhaps better way is to
simply remove the elements from current, and let the readers
insert them into not_current only upon cache miss. Getting
this right requires more insight into the networking code
than I currently have.

o Use rcu_assign_pointer() to set current to not_current.

o Note: leave not_current pointing to the now-current array
to allow slow readers to complete successfully. Alternatively,
one could wait for a grace period, then set not_current to
NULL (but does not seem worth it).

Table lookup (am assuming that this does rcu_read_unlock() before
returning, but that only makes sense if you have some sort of lock
on the element being returned -- if no such lock, the rcu_read_unlock()
primitives must be removed -- the caller must rcu_read_unlock() instead):

o rcu_read_lock();

o c = rcu_dereference(current);

o Compute hash based on c->params

o If hash >= c->fvbrdr, do lookup in c->bucket[hash].
If found, rcu_read_unlock() and return it.

o If c->fvbrdr is not -1, look the item up in the non-current
table:

o c = rcu_dereference(not_current);

o Compute hash based on c->params

o Do lookup in c->bucket[hash]

Note that c->fvbrdr -could- transition to zero here, but
if it does, we are guaranteed that nothing will actually
be removed from the current table until we complete.

And note that we need the non-current lookup even if the
switch is not populating the non-current array, since the
readers would be doing so upon cache miss, right?

o rcu_read_unlock()

o return what is found or complain appropriately if nothing found.
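
A condensed update-side sketch of the Switch Pseudocode section above, assuming
a blocking (process) context; cur_tbl and old_tbl stand in for current and
not_current, drain_bucket() is a placeholder, and the time/element quota and
per-bucket locking are omitted:

	/* Illustrative sketch only -- 'new' is freshly allocated, empty,
	 * and already carries its own hash parameters. */
	static void switch_tables(struct hashtbl *new)
	{
		struct hashtbl *cur = cur_tbl;

		synchronize_rcu();		/* >= one GP since the previous switch */

		rcu_assign_pointer(old_tbl, new);	/* publish the empty new table */
		cur->fvbupd = 0;
		cur->fvbrdr = 0;
		synchronize_rcu();		/* readers now expect entries to vanish */

		while (cur->fvbupd < cur->nbuckets) {
			cur->fvbrdr = cur->fvbupd;	/* buckets below this emptied a GP ago */
			drain_bucket(cur, cur->fvbupd);	/* placeholder: empty one bucket (real code
							 * would stop after a time/element quota) */
			cur->fvbupd++;
			synchronize_rcu();		/* let slow readers catch up */
		}

		rcu_assign_pointer(cur_tbl, new);	/* the new table becomes current */
		/* old_tbl is left pointing at the now-current table for slow readers. */
	}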

2006-02-12 00:45:15

by Lee Revell

[permalink] [raw]
Subject: Re: RCU latency regression in 2.6.16-rc1

On Sun, 2006-01-29 at 21:11 -0800, Paul E. McKenney wrote:
> > Well, if as a bonus we are able to expand the size of the hash table, it could
> > be very, very good: as of today, the boot-time sizing of this hash table is
> > somewhat problematic.
> >
> > If the size is expanded by a factor of 2 (or a power of two), can your proposal
> > work?
>
> Yep!!!
>
> Add the following:
>
> o Add a size variable for each of the tables. It works best
> if the per-table state is stored with the table itself, for
> example:
>
> struct hashtbl {
> int size;
> int fvl;
> struct hash_param params;
> struct list_head buckets[0];
> };
>
> o When switching tables, allocate a new one of the desired size
> and free up the non-current one. (But remember to wait at least
> one grace period after the last switch before starting this!!!)
>
> o Compute hash parameters suitable for the new table size.
>
> o Continue as before.
>
> Note that you are not restricted to power-of-two expansion -- the
> hash parameters should handle any desired difference, and in fact
> handle contraction as well as expansion.

I'd be glad to test any patches whenever someone gets around to working
on this.

Lee