hi,
Hubert Feyrer wrote:
> Re: http://software.newsforge.com/article.pl?sid=04/12/27/1243207&from=rss
I've been talking with the author of the article and looking into some of
the issues here.
Hubert Feyrer wrote:
> I've talked to the author of the article, and he apparently really didn't
> set PTHREAD_CONCURRENCY (because it's not documented in NetBSD 2.0);
> he wants to write an addendum, so we can look forward to an update.
setting PTHREAD_CONCURRENCY=2 on our 2-CPU boxes actually reduces the
throughput of the super-smack select-key test slightly: we go from being
I/O-bound (CPUs about 40% busy) to CPU-bound (CPUs nearly 100% busy),
but with no benefit to the application.
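for what it's worth, the same hint can in principle be given from inside a
program: POSIX defines pthread_setconcurrency() for exactly this.  whether
NetBSD's libpthread ties that call to the same knob as the
PTHREAD_CONCURRENCY environment variable is an assumption on my part, so
treat the following as a sketch only:

/*
 * minimal sketch: request two "virtual CPUs" worth of concurrency
 * from inside a program.  POSIX specifies pthread_setconcurrency()
 * as a hint only; whether NetBSD's libpthread honors it the same
 * way as the PTHREAD_CONCURRENCY environment variable is an
 * assumption.  build with: cc -o conc conc.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        int error;

        error = pthread_setconcurrency(2);
        if (error != 0)
                fprintf(stderr, "pthread_setconcurrency: %s\n",
                    strerror(error));
        printf("concurrency hint is now %d\n", pthread_getconcurrency());
        return 0;
}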
I ran the test under a profiling kernel, and here's the output:
PTHREAD_CONCURRENCY=1
167 p3:~ # time super-smack /usr/share/smacks/select-key.smack 10 10000
Query Barrel Report for client smacker1
connect: max=59ms min=1ms avg= 15ms from 10 clients
Query_type num_queries max_time min_time q_per_s
select_index 200000 0 0 9636.62
6.305u 7.354s 0:21.13 64.6% 0+0k 0+0io 0pf+0w
top during the run:
CPU0 states: 13.9% user, 0.0% nice, 18.8% system, 0.0% interrupt, 67.3% idle
CPU1 states: 29.7% user, 0.0% nice, 23.8% system, 0.0% interrupt, 46.5% idle
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
after startup:
2397 mysql 18 0 324M 18M sigwai/0 0:00 1.15% 0.49% mysqld
after run:
2397 mysql 18 0 324M 21M sigwai/0 0:10 18.04% 16.89% mysqld
% cumulative self self total
time seconds seconds calls us/call us/call name
72.47 40.04 40.04 idle_loop
10.95 46.09 6.05 7330 825.38 825.38 Xspllower
2.50 47.47 1.38 3738039 0.37 1.14 soreceive
1.45 48.27 0.80 6342632 0.13 0.13 _kernel_lock
1.09 48.87 0.60 4153108 0.14 0.17 sodopendfree
1.05 49.45 0.58 5172723 0.11 1.63 syscall_plain
0.85 49.92 0.47 5805505 0.08 0.08 i386_copyin
0.71 50.31 0.39 4413531 0.09 0.09 _kernel_lock_acquire_count
0.70 50.70 0.39 syscall1
PTHREAD_CONCURRENCY=2
165 p3:~ # time super-smack /usr/share/smacks/select-key.smack 10 10000
Query Barrel Report for client smacker1
connect: max=12ms min=1ms avg= 7ms from 10 clients
Query_type num_queries max_time min_time q_per_s
select_index 200000 1 0 9347.77
8.054u 14.502s 0:21.96 102.6% 0+0k 0+2io 0pf+0w
top during the run:
CPU0 states: 45.3% user, 0.0% nice, 45.3% system, 0.0% interrupt, 9.5% idle
CPU1 states: 33.8% user, 0.0% nice, 63.2% system, 0.0% interrupt, 3.0% idle
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
after startup:
1976 mysql 18 0 324M 18M sigwai/0 0:03 0.00% 0.00% mysqld
after run:
1976 mysql 18 0 324M 21M sigwai/0 1:06 18.32% 18.26% mysqld
% cumulative self self total
time seconds seconds calls ms/call ms/call name
55.13 23.80 23.80 2955 8.05 8.05 Xspllower
7.25 26.93 3.13 6579165 0.00 0.00 _kernel_lock
4.56 28.90 1.97 4355165 0.00 0.00 _kernel_lock_acquire_count
4.01 30.63 1.73 3685574 0.00 0.00 soreceive
2.90 31.88 1.25 4160068 0.00 0.00 sodopendfree
2.48 32.95 1.07 5623640 0.00 0.00 i486_copyout
2.36 33.97 1.02 4147578 0.00 0.01 syscall_plain
1.92 34.80 0.83 5662118 0.00 0.00 i386_copyin
1.67 35.52 0.72 idle_loop
unfortunately, most of the time is being spent with clock interrupts blocked,
and with the i386 implementation of splx(), that means we can't tell where
it's all going: the profiling ticks that pile up while interrupts are blocked
are all charged to Xspllower when the priority level finally drops, not to
the code that was actually running.  I tried to make an inline-assembly
version of splx() a year or two ago, but it got ugly and I lost motivation.
anyone want to pick that up?
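in case anyone does, here's roughly the shape I had in mind, minus the
assembly (a sketch only; the field and macro names are from memory and may
not match the tree exactly):

/*
 * sketch: inline the common splx() case and only pay for a function
 * call (Xspllower) when there are interrupts pending at the new,
 * lower priority level.  ci_ilevel, ci_ipending and IUNMASK() are
 * written from memory here and may not match the tree exactly.
 */
static __inline void
splx_fastpath(int ncpl)
{
        struct cpu_info *ci = curcpu();

        if (ci->ci_ipending & IUNMASK(ci, ncpl))
                Xspllower(ncpl);        /* slow path: run pending work */
        else
                ci->ci_ilevel = ncpl;   /* fast path: just drop the IPL */
}

even a C fast path like this would keep the no-pending-work case out of the
function call; the full inline-assembly version only matters if the
remaining overhead is still significant.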
Hubert Feyrer wrote:
> A different question is why NetBSD performed so badly in the 10M-rows test
> (http://www.newsforge.com/blob.pl?id=82311a9e7896a841032c395f270d6a0f).
> According to the article it's not a softdep or atime issue.
this one appears to be mostly due to our crappy page-replacement algorithm.
these sysctl settings more than doubled the throughput for tony:
vm.filemin=1
vm.filemax=20
vm.anonmin=80
vm.anonmax=99
this has the effect of keeping the mysql cache (anonymous memory) resident
instead of letting file-cache pages push it out.
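these can be set with "sysctl -w" at runtime, or put in /etc/sysctl.conf so
they survive a reboot.  for completeness, the same thing can be done from a
program via sysctlbyname(3) (error handling abbreviated; needs root):

/*
 * sketch: apply the page-replacement tuning programmatically via
 * sysctlbyname(3) instead of sysctl(8).  must be run as root.
 */
#include <sys/param.h>
#include <sys/sysctl.h>
#include <err.h>

static void
set_vm_knob(const char *name, int val)
{
        if (sysctlbyname(name, NULL, NULL, &val, sizeof(val)) == -1)
                err(1, "sysctlbyname %s", name);
}

int
main(void)
{
        set_vm_knob("vm.filemin", 1);
        set_vm_knob("vm.filemax", 20);
        set_vm_knob("vm.anonmin", 80);
        set_vm_knob("vm.anonmax", 99);
        return 0;
}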
I tried a couple of additional changes to see if we could improve this
further:
(1) the loaning-for-read() stuff that we've discussed lately on tech-kern
(2) another optimization that I haven't mentioned before, where we zero
pages that are being overwritten with write() instead of reading them
from disk.  (we have to initialize the page somehow before unlocking it
so that other threads can't see its old contents; the core of the idea
is sketched after this list.)
(this only took about half an hour to implement, so it's basically a
backup plan in case loaning-for-write() turns out to be too hard or too
risky for the 2.1 and 3.0 releases.)
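the core of the zeroing trick looks roughly like this (a sketch of the
getpages path, not the actual diff; PGO_OVERWRITE is used as an
illustrative flag name, read_page_from_disk() is a placeholder, and the
locking is omitted):

/*
 * sketch of the write()-overwrite optimization: when the caller
 * promises that the whole page will be overwritten before it's
 * unlocked, initialize it by zeroing instead of paging the old
 * contents in from disk.  PGO_OVERWRITE and read_page_from_disk()
 * are illustrative names, not the actual patch.
 */
if ((flags & PGO_OVERWRITE) != 0) {
        /* zero the page so other threads can't see stale contents */
        uvm_pagezero(pg);
        pg->flags &= ~PG_FAKE;
} else {
        /* normal path: read the old contents in from disk */
        error = read_page_from_disk(pg);
}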
I measured all four combinations of these two changes for both the "prepare"
and "run" phases of sysbench.  the results were:

options                         prepare phase   run phase
                                (minutes)       (transactions/sec)
domaploanobj=0 chuq_dozero=0    52:40           5.56
domaploanobj=0 chuq_dozero=1    44:37           5.59
domaploanobj=1 chuq_dozero=0    52:57           6.84
domaploanobj=1 chuq_dozero=1    48:18           6.83
zeroing instead of reading for overwrites helps during prepare
and has no effect during run. loaning-for-read(), on the other hand,
helps during run, but hurts during prepare.
I also ran this under a profiling kernel, and here were the results:
prepare
% cumulative self self total
time seconds seconds calls ms/call ms/call name
44.13 263.91 263.91 475763 0.55 0.56 genfs_putpages
13.68 345.74 81.83 2438342 0.03 0.03 i386_copyin
7.11 388.24 42.50 379417126 0.00 0.00 pmap_clear_attrs
5.19 419.29 31.05 1704374 0.02 0.02 memset
2.74 435.68 16.39 380800608 0.00 0.00 pmap_tlb_shootnow
2.40 450.04 14.36 121860 0.12 0.12 Xspllower
2.28 463.69 13.65 1705310 0.01 0.01 i486_copyout
0.97 469.51 5.82 8968996 0.00 0.00 lockmgr
run
% cumulative self self total
time seconds seconds calls ms/call ms/call name
34.52 370.95 370.95 709745 0.52 0.52 genfs_putpages
9.74 475.64 104.69 15372348 0.01 0.01 i386_copyin
8.30 564.79 89.15 mpidle
6.67 636.50 71.71 16511526 0.00 0.00 i486_copyout
5.51 695.74 59.24 529689671 0.00 0.00 pmap_clear_attrs
4.86 747.99 52.25 227478 0.23 0.23 Xspllower
3.21 782.47 34.48 1941078 0.02 0.02 memset
2.17 805.80 23.33 532201987 0.00 0.00 pmap_tlb_shootnow
0.95 816.02 10.22 6513348 0.00 0.01 soreceive
0.85 825.19 9.17 13135256 0.00 0.00 lockmgr
0.84 834.24 9.05 18082701 0.00 0.00 uvm_pagelookup
0.78 842.61 8.37 3175090 0.00 0.02 genfs_getpages
0.69 850.04 7.43 3482626 0.00 0.12 uvm_fault
0.59 856.38 6.34 1595193 0.00 0.01 pmap_remove_ptes
0.50 861.74 5.36 13264280 0.00 0.03 syscall_plain
0.49 866.98 5.24 14308126 0.00 0.00 pvtree_SPLAY_MINMAX
0.48 872.18 5.20 23920121 0.00 0.00 pvtree_SPLAY
the database file's vnode has a bunch of pages in memory, and
genfs_putpages() has to look at all of them whenever mysqld does an fsync().
we're clearly hurt by our inability to do unbuffered I/O.
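by "unbuffered I/O" I mean something like the O_DIRECT open(2) flag that
some other systems provide, where data moves directly between the user's
buffer and the disk without going through the page cache, so there are no
cached pages for fsync() to walk.  we have no equivalent yet; this is just
to illustrate the interface:

/*
 * illustration only: O_DIRECT-style unbuffered I/O as found on some
 * other systems (we have no equivalent).  such systems typically
 * require the buffer, file offset and length to be block-aligned.
 */
#include <fcntl.h>

int
open_unbuffered(const char *path)
{
        return open(path, O_RDWR | O_DIRECT);
}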
I also ran the super-smack select-key test with the same combinations
of the two optimizations, but all the runs had the same throughput.
the kernel profiling data for this benchmark was:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
26.31 9.69 9.69 2384 4.06 4.06 Xspllower
13.79 14.77 5.08 3800113 0.00 0.00 soreceive
5.43 16.77 2.00 6006860 0.00 0.00 i386_copyin
5.21 18.69 1.92 5404173 0.00 0.01 syscall_plain
4.97 20.52 1.83 4200256 0.00 0.00 uipc_usrreq
4.72 22.26 1.74 4400953 0.00 0.00 uiomove
4.66 23.98 1.72 syscall1
3.31 25.20 1.22 4628997 0.00 0.00 i486_copyout
3.10 26.34 1.14 400053 0.00 0.01 sosend
2.86 27.39 1.05 Xsyscall
2.74 28.40 1.01 4200177 0.00 0.00 sodopendfree
again the i386 splx() gets in the way.  but since this time we're
just as interested in single-CPU performance, I ran it again on a macppc,
where the splx() implementation doesn't distort the profile so much:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
19.61 2.34 2.34 15445156 0.00 0.00 splraise
19.28 4.64 2.30 15444019 0.00 0.00 splx
9.72 5.80 1.16 s_sctrap
9.56 6.94 1.14 439 2.60 2.60 cpu_idlespin
5.70 7.62 0.68 3800116 0.00 0.00 soreceive
3.94 8.09 0.47 4200262 0.00 0.00 uipc_usrreq
3.86 8.55 0.46 5405542 0.00 0.00 syscall_plain
3.52 8.97 0.42 5219607 0.00 0.00 memcpy
3.02 9.33 0.36 4609360 0.00 0.00 copyout
2.68 9.65 0.32 4401865 0.00 0.00 uiomove
2.26 9.92 0.27 3802081 0.00 0.00 dofileread
1.76 10.13 0.21 400437 0.00 0.00 enable_fpu
and here are the call-graph entries for splraise() and splx():

0.00 0.00 6779/15445156 uvm_lock_fpageq [92]
0.03 0.00 200502/15445156 preempt [30]
0.03 0.00 200969/15445156 wakeup <cycle 1> [42]
0.03 0.00 201039/15445156 ltsleep <cycle 1> [35]
0.15 0.00 1000108/15445156 sosend [14]
0.18 0.00 1200108/15445156 m_gethdr [22]
0.64 0.00 4200181/15445156 sodopendfree [10]
1.27 0.00 8400357/15445156 soreceive [6]
[8] 19.6 2.34 0.00 15445156 splraise [8]
0.00 0.00 2589/16282037 decr_intr [76]
0.03 0.02 200502/16282037 preempt [30]
0.15 0.08 1000108/16282037 sosend [14]
0.18 0.09 1200108/16282037 m_gethdr [22]
0.63 0.33 4200181/16282037 sodopendfree [10]
1.26 0.65 8400357/16282037 soreceive [6]
[9] 19.3 2.30 0.00 15444019 splx <cycle 1> [9]
546 do_pending_int <cycle 1> [185]
so it's not a definitive answer, but it's nearly certain that
the socket code is a major bottleneck for super-smack select-key.
-Chuck