Discussion: 10M row test in that MySQL benchmark
Hubert Feyrer
2005-02-09 21:16:07 UTC
Re: http://software.newsforge.com/article.pl?sid=04/12/27/1243207&from=rss

I've talked to the author of the article, and he apparently really didn't
set PTHREAD_CONCURRENCY (because it isn't documented in NetBSD 2.0);
he wants to write an addendum, so we can look forward to an update.

A different question is why NetBSD performed so badly in the 10M-row test
(http://www.newsforge.com/blob.pl?id=82311a9e7896a841032c395f270d6a0f).
According to the article it's not a softdep or atime issue.

Any guesses?


- Hubert
--
NetBSD - Free AND Open! (And of course secure, portable, yadda yadda)
Chuck Silvers
2005-02-25 03:44:38 UTC
hi,
Post by Hubert Feyrer
Re: http://software.newsforge.com/article.pl?sid=04/12/27/1243207&from=rss
I've been talking with the author of this article and looking into
the issues here some.
Post by Hubert Feyrer
I've talked to the author of the article, and he apparently really didn't
set PTHREAD_CONCURRENCY (because it isn't documented in NetBSD 2.0);
he wants to write an addendum, so we can look forward to an update.
setting PTHREAD_CONCURRENCY=2 on our 2-CPU boxes actually reduces the
throughput of the super-smack select-key test slightly:
we go from being I/O-bound (CPUs about 40% busy) to being CPU-bound
(obviously with CPUs 100% busy), but with no benefit to the application.
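
for reference, the variable just has to be in mysqld's environment when the
server starts; a minimal sketch (the mysqld_safe path is an assumption for a
pkgsrc install, adjust for your setup):

# start mysqld with the libpthread concurrency hint set
env PTHREAD_CONCURRENCY=2 /usr/pkg/bin/mysqld_safe --user=mysql &

# then drive it with super-smack as before
time super-smack /usr/share/smacks/select-key.smack 10 10000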

I ran the test under a profiling kernel, and here's the output:

PTHREAD_CONCURRENCY=1

167 p3:~ # time super-smack /usr/share/smacks/select-key.smack 10 10000
Query Barrel Report for client smacker1
connect: max=59ms min=1ms avg= 15ms from 10 clients
Query_type num_queries max_time min_time q_per_s
select_index 200000 0 0 9636.62
6.305u 7.354s 0:21.13 64.6% 0+0k 0+0io 0pf+0w

top during the run:
CPU0 states: 13.9% user, 0.0% nice, 18.8% system, 0.0% interrupt, 67.3% idle
CPU1 states: 29.7% user, 0.0% nice, 23.8% system, 0.0% interrupt, 46.5% idle

PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
after startup:
2397 mysql 18 0 324M 18M sigwai/0 0:00 1.15% 0.49% mysqld
after run:
2397 mysql 18 0 324M 21M sigwai/0 0:10 18.04% 16.89% mysqld

% cumulative self self total
time seconds seconds calls us/call us/call name
72.47 40.04 40.04 idle_loop
10.95 46.09 6.05 7330 825.38 825.38 Xspllower
2.50 47.47 1.38 3738039 0.37 1.14 soreceive
1.45 48.27 0.80 6342632 0.13 0.13 _kernel_lock
1.09 48.87 0.60 4153108 0.14 0.17 sodopendfree
1.05 49.45 0.58 5172723 0.11 1.63 syscall_plain
0.85 49.92 0.47 5805505 0.08 0.08 i386_copyin
0.71 50.31 0.39 4413531 0.09 0.09 _kernel_lock_acquire_count
0.70 50.70 0.39 syscall1




PTHREAD_CONCURRENCY=2

165 p3:~ # time super-smack /usr/share/smacks/select-key.smack 10 10000
Query Barrel Report for client smacker1
connect: max=12ms min=1ms avg= 7ms from 10 clients
Query_type num_queries max_time min_time q_per_s
select_index 200000 1 0 9347.77
8.054u 14.502s 0:21.96 102.6% 0+0k 0+2io 0pf+0w

top during the run:
CPU0 states: 45.3% user, 0.0% nice, 45.3% system, 0.0% interrupt, 9.5% idle
CPU1 states: 33.8% user, 0.0% nice, 63.2% system, 0.0% interrupt, 3.0% idle

PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
after startup:
1976 mysql 18 0 324M 18M sigwai/0 0:03 0.00% 0.00% mysqld
after run:
1976 mysql 18 0 324M 21M sigwai/0 1:06 18.32% 18.26% mysqld

% cumulative self self total
time seconds seconds calls ms/call ms/call name
55.13 23.80 23.80 2955 8.05 8.05 Xspllower
7.25 26.93 3.13 6579165 0.00 0.00 _kernel_lock
4.56 28.90 1.97 4355165 0.00 0.00 _kernel_lock_acquire_count
4.01 30.63 1.73 3685574 0.00 0.00 soreceive
2.90 31.88 1.25 4160068 0.00 0.00 sodopendfree
2.48 32.95 1.07 5623640 0.00 0.00 i486_copyout
2.36 33.97 1.02 4147578 0.00 0.01 syscall_plain
1.92 34.80 0.83 5662118 0.00 0.00 i386_copyin
1.67 35.52 0.72 idle_loop



unfortunately, most of the time is being spent with clock interrupts blocked,
and with the i386 implementation of splx(), that means that we can't tell
where it's all going. I tried to make an inline-assembly version of splx()
a year or two ago, but it got ugly and I lost motivation. anyone want to
pick that up?
Post by Hubert Feyrer
A different question is why NetBSD performed so badly in the 10M-row test
(http://www.newsforge.com/blob.pl?id=82311a9e7896a841032c395f270d6a0f).
According to the article it's not a softdep or atime issue.
this one appears to be mostly due to our crappy page-replacement algorithm.
these sysctl settings more than doubled the throughput for tony:

vm.filemin=1
vm.filemax=20
vm.anonmin=80
vm.anonmax=99

this has the effect of keeping the mysql cache in memory.
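
a minimal sketch of applying these, in case anyone wants to try the same
thing (sysctl -w sets them at runtime; /etc/sysctl.conf makes them persistent):

# apply at runtime, as root
sysctl -w vm.filemin=1
sysctl -w vm.filemax=20
sysctl -w vm.anonmin=80
sysctl -w vm.anonmax=99

# or put the same name=value lines into /etc/sysctl.conf so they are
# applied again at boot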



I tried a couple of additional changes to see if we could improve this further:

(1) the loaning-for-read() stuff that we've discussed lately on tech-kern

(2) another optimization that I haven't mentioned before, where we zero
pages that are being overwritten with write() instead of reading them
from disk (we have to initialize the page somehow before unlocking it,
so that other threads can't see the page's old contents). this only took
about half an hour to implement, so it's basically a backup plan in case
loaning-for-write() turns out to be too hard or too risky for the 2.1
and 3.0 releases.


I measured all four combinations of this for both the "prepare" and "run"
phases of sysbench. the results were:

options                          prepare phase    run phase
                                 (minutes)        (transactions/sec)
domaploanobj=0 chuq_dozero=0     52:40            5.56
domaploanobj=0 chuq_dozero=1     44:37            5.59
domaploanobj=1 chuq_dozero=0     52:57            6.84
domaploanobj=1 chuq_dozero=1     48:18            6.83


zeroing instead of reading for overwrites helps during prepare
and has no effect during run. loaning-for-read(), on the other hand,
helps during run, but hurts during prepare.
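
for anyone who wants to reproduce this: the two phases correspond to
sysbench's OLTP test, invoked roughly as below. this is only a sketch,
with flag names as in sysbench 0.4; the table size and thread count here
are assumptions based on the 10M-row article, not values stated in this
thread.

# "prepare" phase: create and populate the test table
sysbench --test=oltp --mysql-user=root --oltp-table-size=10000000 prepare

# "run" phase: execute the OLTP transaction mix
sysbench --test=oltp --mysql-user=root --oltp-table-size=10000000 \
    --num-threads=10 run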



I also ran this under a profiling kernel, and here were the results:

prepare
% cumulative self self total
time seconds seconds calls ms/call ms/call name
44.13 263.91 263.91 475763 0.55 0.56 genfs_putpages
13.68 345.74 81.83 2438342 0.03 0.03 i386_copyin
7.11 388.24 42.50 379417126 0.00 0.00 pmap_clear_attrs
5.19 419.29 31.05 1704374 0.02 0.02 memset
2.74 435.68 16.39 380800608 0.00 0.00 pmap_tlb_shootnow
2.40 450.04 14.36 121860 0.12 0.12 Xspllower
2.28 463.69 13.65 1705310 0.01 0.01 i486_copyout
0.97 469.51 5.82 8968996 0.00 0.00 lockmgr

run
% cumulative self self total
time seconds seconds calls ms/call ms/call name
34.52 370.95 370.95 709745 0.52 0.52 genfs_putpages
9.74 475.64 104.69 15372348 0.01 0.01 i386_copyin
8.30 564.79 89.15 mpidle
6.67 636.50 71.71 16511526 0.00 0.00 i486_copyout
5.51 695.74 59.24 529689671 0.00 0.00 pmap_clear_attrs
4.86 747.99 52.25 227478 0.23 0.23 Xspllower
3.21 782.47 34.48 1941078 0.02 0.02 memset
2.17 805.80 23.33 532201987 0.00 0.00 pmap_tlb_shootnow
0.95 816.02 10.22 6513348 0.00 0.01 soreceive
0.85 825.19 9.17 13135256 0.00 0.00 lockmgr
0.84 834.24 9.05 18082701 0.00 0.00 uvm_pagelookup
0.78 842.61 8.37 3175090 0.00 0.02 genfs_getpages
0.69 850.04 7.43 3482626 0.00 0.12 uvm_fault
0.59 856.38 6.34 1595193 0.00 0.01 pmap_remove_ptes
0.50 861.74 5.36 13264280 0.00 0.03 syscall_plain
0.49 866.98 5.24 14308126 0.00 0.00 pvtree_SPLAY_MINMAX
0.48 872.18 5.20 23920121 0.00 0.00 pvtree_SPLAY


the database file's vnode has a large number of pages in memory, and
genfs_putpages() has to look at all of them when mysqld does an fsync().
we're clearly hurt by our inability to do unbuffered I/O.



I also ran the super-smack select-key test with the same combinations
of the two optimizations, but all the runs had the same throughput.

the kernel profiling data for this benchmark was:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
26.31 9.69 9.69 2384 4.06 4.06 Xspllower
13.79 14.77 5.08 3800113 0.00 0.00 soreceive
5.43 16.77 2.00 6006860 0.00 0.00 i386_copyin
5.21 18.69 1.92 5404173 0.00 0.01 syscall_plain
4.97 20.52 1.83 4200256 0.00 0.00 uipc_usrreq
4.72 22.26 1.74 4400953 0.00 0.00 uiomove
4.66 23.98 1.72 syscall1
3.31 25.20 1.22 4628997 0.00 0.00 i486_copyout
3.10 26.34 1.14 400053 0.00 0.01 sosend
2.86 27.39 1.05 Xsyscall
2.74 28.40 1.01 4200177 0.00 0.00 sodopendfree


again the i386 splx() gets in the way. but since this time we're
just as interested in single-CPU performance, I ran it again on a macppc,
where the splx() implementation doesn't get in the way so much:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
19.61 2.34 2.34 15445156 0.00 0.00 splraise
19.28 4.64 2.30 15444019 0.00 0.00 splx
9.72 5.80 1.16 s_sctrap
9.56 6.94 1.14 439 2.60 2.60 cpu_idlespin
5.70 7.62 0.68 3800116 0.00 0.00 soreceive
3.94 8.09 0.47 4200262 0.00 0.00 uipc_usrreq
3.86 8.55 0.46 5405542 0.00 0.00 syscall_plain
3.52 8.97 0.42 5219607 0.00 0.00 memcpy
3.02 9.33 0.36 4609360 0.00 0.00 copyout
2.68 9.65 0.32 4401865 0.00 0.00 uiomove
2.26 9.92 0.27 3802081 0.00 0.00 dofileread
1.76 10.13 0.21 400437 0.00 0.00 enable_fpu

0.00 0.00 6779/15445156 uvm_lock_fpageq [92]
0.03 0.00 200502/15445156 preempt [30]
0.03 0.00 200969/15445156 wakeup <cycle 1> [42]
0.03 0.00 201039/15445156 ltsleep <cycle 1> [35]
0.15 0.00 1000108/15445156 sosend [14]
0.18 0.00 1200108/15445156 m_gethdr [22]
0.64 0.00 4200181/15445156 sodopendfree [10]
1.27 0.00 8400357/15445156 soreceive [6]
[8] 19.6 2.34 0.00 15445156 splraise [8]

0.00 0.00 2589/16282037 decr_intr [76]
0.03 0.02 200502/16282037 preempt [30]
0.15 0.08 1000108/16282037 sosend [14]
0.18 0.09 1200108/16282037 m_gethdr [22]
0.63 0.33 4200181/16282037 sodopendfree [10]
1.26 0.65 8400357/16282037 soreceive [6]
[9] 19.3 2.30 0.00 15444019 splx <cycle 1> [9]
546 do_pending_int <cycle 1> [185]


so it's not a definitive answer, but it's nearly certain that
the socket code is a major bottleneck for super-smack select-key.


-Chuck
YAMAMOTO Takashi
2005-02-25 04:58:14 UTC
hi,
Post by Chuck Silvers
unfortunately, most of the time is being spent with clock interrupts blocked,
this reminds me of PR/26108.
can you try the patch in the PR?

YAMAMOTO Takashi
Chuck Silvers
2005-02-25 15:44:39 UTC
Post by YAMAMOTO Takashi
hi,
Post by Chuck Silvers
unfortunately, most of the time is being spent with clock interrupts blocked,
this reminds me of PR/26108.
can you try the patch in the PR?
even without that patch, I don't see much difference in the clock drift
when a 2 CPU machine is idle vs. when I have both CPUs in the kernel
(and thus one would be waiting for the big lock). to generate the latter
load, I ran 5 copies of "dd if=/dev/zero of=/dev/null bs=100m count=10000"
simultaneously.

so I'm not sure how to reproduce the problem.
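
a sketch of one way to generate that load and sample the drift, in case
someone else wants to try it (the dd command is the one from above; the NTP
server name is a placeholder, and ntpdate -q is just one convenient way to
sample the clock offset):

# start 5 copies of the dd load in the background
for i in 1 2 3 4 5; do
	dd if=/dev/zero of=/dev/null bs=100m count=10000 &
done

# sample the offset against an NTP server before, during and after the run
ntpdate -q ntp.example.org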

-Chuck
YAMAMOTO Takashi
2005-02-26 02:46:27 UTC
Post by Chuck Silvers
Post by YAMAMOTO Takashi
Post by Chuck Silvers
unfortunately, most of the time is being spent with clock interrupts blocked,
this reminds me of PR/26108.
can you try the patch in the PR?
even without that patch, I don't see much difference in the clock drift
when a 2 CPU machine is idle vs. when I have both CPUs in the kernel
(and thus one would be waiting for the big lock). to generate the latter
load, I ran 5 copies of "dd if=/dev/zero of=/dev/null bs=100m count=10000"
simultaneously.
so I'm not sure how to reproduce the problem.
i forgot how to reproduce it.
(apparently i should have noted it in the PR :-)

i think it was something like the following, which involves
frequent context switches, and thus biglock contention in mi_switch:
dd if=/dev/zero bs=1|dd bs=1|dd bs=1| ... |dd bs=1 of=/dev/null

YAMAMOTO Takashi
Chuck Silvers
2005-03-28 15:18:54 UTC
Post by YAMAMOTO Takashi
Post by Chuck Silvers
Post by YAMAMOTO Takashi
Post by Chuck Silvers
unfortunately, most of the time is being spent with clock interrupts blocked,
this reminds me of PR/26108.
can you try the patch in the PR?
even without that patch, I don't see much difference in the clock drift
when a 2 CPU machine is idle vs. when I have both CPUs in the kernel
(and thus one would be waiting for the big lock). to generate the latter
load, I ran 5 copies of "dd if=/dev/zero of=/dev/null bs=100m count=10000"
simultaneously.
so I'm not sure how to reproduce the problem.
i forgot how to reproduce it.
(apparently i should have noted it in the PR :-)
i think it was something like the following, which involves
frequent context switches, and thus biglock contention in mi_switch:
dd if=/dev/zero bs=1|dd bs=1|dd bs=1| ... |dd bs=1 of=/dev/null
I tried this too just now, and it didn't seem to make any difference.
my test machine loses about 10 microseconds per second both when that
dd pipeline is running and when the machine is idle.

-Chuck
Mike Cheponis
2005-03-28 21:11:35 UTC
Post by Chuck Silvers
I tried this too just now, and it didn't seem to make any difference.
my test machine loses about 10 microseconds per second both when that
dd pipeline is running and when the machine is idle.
This represents a clock source stability of -10 PPM; that is indeed
extraordinary for a non-compensated crystal-controlled clock source.

Typical crystals show +/- 40 PPM short-term performance (not taking aging
or temperature into account, which would push it even higher).

Good sources, such as digitally compensated oscillators, can achieve
+/- 0.3 PPM, but I have never seen those used as computer clocks.

In short, 10 PPM seems mighty good!
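
(To put that in everyday terms: 10 us of drift per second is
10/1,000,000 = 10 PPM, or roughly 0.86 s of drift per day, whereas a
typical +/- 40 PPM crystal could drift by up to about 3.5 s per day.)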

-Mike
YAMAMOTO Takashi
2005-03-31 01:16:06 UTC
Post by Chuck Silvers
I tried this too just now, and it didn't seem to make any difference.
my test machine loses about 10 microseconds per second both when that
dd pipeline is running and when the machine is idle.
how about killing a large process which holds many swap slots?
it seems that, without the patch, my machine can lose a few hundred ms
when killing a 1GB process.
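
a sketch of what that repro might look like (perl here is just a convenient
way to dirty about 1GB of anonymous memory; the process only ends up holding
swap slots if there's enough memory pressure to page it out first):

# dirty ~1GB of anonymous memory, then kill the process once much of it
# has been pushed out to swap (check with top or vmstat)
perl -e '$x = "x" x (1024 * 1024 * 1024); sleep 600' &
bigpid=$!
# ... generate memory pressure and wait for the pages to reach swap ...
kill $bigpid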

anyway, it's clear that it's bad to spin at IPL_HIGH.

YAMAMOTO Takashi
