Diamond Notes

Just another WordPress weblog

Archive for the 'Benchmarking' Category

Mark Callaghan — Scaling InnoDB/MySQL @ UC

Goals

Make Innodb scale on big servers

  • fix bottlenecks on big SMP server
  • utilize server with many disk
  • support thousands of database connections
  • handle corruption in mormroy and on disk
  • make query plans perdictable
  • support thousands of tables and accounts

Mentioned a problem with SANs?

Desirable Features

  • throughput scales with number of CPU cores
  • efficient support for 128 GB buffer cache
  • performance for servers with man and or remote disk
  • recover from corruption

CPU Problems

  • Mutex implementations favor portability over performance
  • Mutex hotspots
    • buffer cache
    • memory allocation
    • transaction log
    • adaptive hash latch

Symtops of CPU problems

  • adaptive hash latch contention
    • SHOW INNODB STATUS displays the session that holds the adaptive hash latch
    • a background thread logs SHOW INNODB STATUS into the error log when there is along lock wait
  • excessive mutex contention
  • server has many queries, is slow and is not IO bound
  • on linux, vmstat will report a lot of idle time

Workarounds for CPU problems

  • Upgrade
    • MySQL 5.0.30 has import improvements
    • MySQL 5.0.54 fixes a bug that causes some threads to miss wakeup signals
  • experiment with innodb_thread_concurrency to limit the number of threads that run concurrently

Making the InnoDB RW-mutex fast

  • Use atomic ops to change internal state (replacing the Innodb spin lock mutex)
  • use separate events to wake readers and writers

Work in progress for CPU problems

  • change the innodb mutex to use atomic ops
  • change the Innodb read-write mutex to use atomic ops

Performance Tests

Eight-CPU core server. Data is from the sbtest table generated by sysbench. Data is cached (key buffer, InnoDB buffer cache).

It was amazing the difference that these patches make. The scalability for these servers (number of sessions) was almost linear.

Memory

To support a 128GB buffer cache:

  • data structures must scale
    • walking a list with 8M page entries might be slow
  • Resources need to be split
    • more than one mutex might be needed for the buffer cache and LRU chain
  • Detection of corruption is more important
    • memory will be corrupted by software and hardware bugs

Detect and correct corruption

Features in InnoDB

  • page checksums to detect corruption that occurs after a page has been written to disk
  • doublewrite buffer to correct partial page writes that occur during a server crash

Innodb crashes when it reads a page with an invalid checksum

if this is a page for a secondary index, than the index can be rebuilt

Disk performance

  • Innodb uses one background thread to process prefetch requests.
  • Innodb uses one background thread to flush dirty buffer cache pages to disk. This is fine as long as writes go to the OS buffer cache. Otherwise, writes may be slow
  • The background IO threads assume a server with one disk and don’t run fast enough when there is work to be done

Fixes

Google patch adds support for multiple background IO threads.

The Google patch will soon have an option to tune the maximum rate of background IO

Connections

  • need to support thousands of connections, but not will to use one thread per connection
  • MySQL 6 separates threads from connections

Query Plans

  • innodb uses sampling to gather stats for the optimizer
  • stats are not stored on disk
  • stats are collected the first time a table is used after startup and after many rows has been modified

stats can be off between slaves with identical dataset. This happens (sometimes) except if the is a unique key

Overall a good talk. Mark got off several times on things that, frankly, if you weren’t deeply involved in the Innodb code you wouldn’t understand (code snippets). However, it sounds like they are solving some important problems. I am really looking forward to these being put into the “official” MySQL code.

1 comment

Server Testing Stumper

I blogged several times about some testing I have been doing recently. Well, I have run across something that just doesn’t make sense. I checked kernels and I/O schedulers and worked out what seems to do the best (2.6.22 kernel with deadline for those who are curious — I will post results soon). I had done all this with two servers that were identical and had 8 gb of RAM each. I then moved the 8 gb of RAM from one server to the other so one server had 16 gb of RAM. I then re-ran the last test with the only change being I raised the innodb buffer variable from four gb to eight gb.

I would have expected if there was any change it would be faster. I ran querybench five times and averaged the results as I had done with previous test. It was almost exactly 1500 qps slower than the previous run.

Anyone have any thoughts? I have run the test several times. I am rerunning the tests while closely examining the output of some of normal profile tools — top and iostat. It appears to me that the runs are io-bound if anything. The iowait percentage runs between the 25% - 40+ %. Top shows memory usage isn’t excessive (30% or 4.8 GB) and CPU percentage never tops 100% (so it doesn’t even max out one core of eight total cores).

My hunch is that the server is trying to fill up the buffer and IO is just bogging down the system. If I had a larger input file (I have 135 megs total of queries I am feeding into querybench) it might level out after “warming up” more.

My current formulae:

  • I split the input queries into two files to warm the server up with the first set of queries
  • I time both runs and then record the results of the second part (which is always higher than the warm up run)

Results show that . However I could be really wrong :)

Feedback would be very appreciated!!

2 comments

Benchmark Updates

As I blogged about recently I was going to have a chance to do some testing.  Last week we started benchmarking our two new servers.  I am running a variety of tests on them and I will begin publishing the results shortly.  This has been the first time I have seriously used the query processing programs for testing.  So far, it has performed fairly well.  Initially, I had intended to use the EVMS (Enterprise Volume Management System) to perform rollbacks of the database files after each test.  However, due to complexity and the overhead that this adds (and we don’t use EVMS in production) I decided to go with a standard setup.  I just wrote a script to stop MySQL, delete the database files, and copy the baseline files from a storage directory.  Then just restarting MySQL and I have a fresh setup.  Yes, it takes longer, but I script the entire run anyways so a little extra time is no big deal.

I guess due to learning a few things during my SAN testing last year the progress has been much faster this time are.  I should have the first set of test results up in a few days..by the end of the week at the latest.

No comments

New Benchmarks Coming Soon — Your Chance to Participate!

I am going to be getting a new master/slave pair of production servers. I will have between a one and two weeks before they have to go into production.  This will be a great chance to do a little benchmarking. I am going to begin writing up what I plan on testing in the next few days. I would love to here your suggestions about what to test!!

Here is the specs as I know them:

8 gigs of RAM

4 or 6 disks — 10k or 15k drives on a RAID controller with 256MB write-back cache.

4 or 8 cores (cpu) — most likely eight though

debian etch

MySQL 5.0.45

Things I want to test:

write-back turned on vs write-back turned off on RAID controller

xfs/reiserfs/ext3 on the RAID partition

I/O schedular (debian defaults to CFQ — want to test deadline also)

innodb-flush-log-at-trx-commit = 0 vs 1

I will be using our QPP (query processing programs) tools to run tests with production data and production client requests.

I very likely will not have time to run all the tests I want. However, the more tests the merrier. Please comment with your test ideas.

7 comments

New Querysniffer Release

Hey everyone. A kind user pointed out a timing error in the querysniffer code. Ian fixed it and I fiddled around and managed to upload the new release to the server. If you are using querysniffer you should update to the new version as you are getting incorrect timing results currently.

Downloads are available at http://www.paragon-cs.com/queryprogs

No comments

XFS Snapshots with LVM2

As I wrote in my last post, there is an issue that bit me using XFS data partitions and trying to take a normal LVM snapshot of them while running MySQL. As part of the upgrade from (MySQL) 4.1 to 5.0 I dumped the entire dataset and imported it into a new server running 5.0. This server had an XFS data partitioning as previous testing had shown that it is faster than reiserfs for our database servers. This same server had our application “pointed” to it (after stopping the application and shutting down the old replication so there was no database activity). In essence it became the new master. I then had to reconfigure the old master with the MySQL 5.0 binaries, delete the data in the data partition and re-sync it with a copy of the data from the new master.

This is a fairly standard procedure and is done using the lvcreate command to create a snapshot of the data partition and then mounting the new snapshot and rsyncing it to the old master’s data partition. I have done this dozens of times in the past without ever having a problem — but this was the first time with an XFS filesystem in use on the data partition on the running server.

This time I copied over the data and tried to start up the new server. It didn’t work. Innodb complained bitterly and then the server just died. I don’t have the exact error but it didn’t start. This was actually the first of the set of servers I had to upgrade, so I went on and did the same procedure on all the other servers. Each one worked exactly as expected. However, I had resierfs on each of the other data partitions.

It was about six in the morning by this time. I began doing the upgrades at 10:00 pm so I suppose I wasn’t exactly sharp at this point. I performed the entire procedure again with exactly the same results. At this point I had to stop as our traffic was beginning to build up and the snapshot/rsync process creates too much load on the servers to do it during normal hours.

I googled around. People suggested that with XFS partitions you use the xfs_freeze command to halt all disk activity and clear out the XFS transaction logs. It seems that LVM doesn’t do this and it was what was causing the inconsistent view of the data. Basically you did this:

xfs_freeze -f

lvcreate -s ……..

xfs_freeze -u

So you freeze the partition, create the snapshot and then unfreeze it so activity continue. However, googling some more, I found that this doesn’t work with LVM2 (the current stable version of LVM).  Something has changed with LVM in the newer version and now when you issue the “xfs_freeze -f” comand and than the lvcreate command — lvcreate actually hangs. It won’t create the snapshot until you run the “xfs_freeze -u” command which “unfreezes” the filesystem.

I don’t know what changed with LVM but I really wish that this wasn’t the case. To get the copy of the data partition that would actually work I had to briefly shut down the MySQL server and then run the lvcreate command.  Then, after the snapshot was created, I could restart the server.  Of course the server was down for about 30 seconds.  But, because there was no XFS filesystem activity going on when the snapshot was made, it worked perfectly.  However this means I have to revert filesystems back to reiserfs until this issue with LVM is resolved.  I suspect it will be quite some time :(

9 comments

Query Processing Programs — Version 1.0!!!

Hey everyone. I just finished uploaded the latest versions of the three QPP programs. They are all working well and I consider them “version 1.0″. I will be updating the qpp pages as soon as possible with a tutorial on how all these programs work together.

The basic idea is that you use the querysniffer program to gather production data from your servers as needed without stopping/starting the MySQL server. Queryparse is actually used to control querysniffer and push the gathered data into a database from which you can use simple php/html pages to view the statistics.

Querybench is used to benchmark/stress-test a test server with production data (gathered, once again, with querysniffer). You can even set it up so that the database can be “rolled back” to its initial state after a run.

All in all, I am very pleased. I have fulfilled all the goals I set out to achieve when I began this little journey a few months ago. It will be very useful in my work and I hope others find it useful as well.

I would like to publicly thank Ian Kilgore who helped far above and beyond anything required. He is one of the system administrators where I work and is the genius programmer behind this stuff. I just had the idea and kind of directed things along. The only “programming” I have done is working on the documentation a bit and developing the web pages for the display of the statistics.

Feedback can be sent to “bmurphy” at “paragon-cs.com”.

Now what do I with all this free time I have left over *cough*?

No comments

mysql-parallel-dump test

I just wrapped up a test of the new mysql-parallel-dump and mysql-parallel-restore that Baron Schwartz has released as part of his toolkit. The toolkit is available from http://mysqltoolkit.sourceforge.net

I tested on an 8-core (dual processor/quad-core) server with 8 gigabytes of RAM. I baselined with the normal mysqldump command (no options) and then benchmarked variations of both the mysqldump command and the mysqlimport command against the mysql-parallel-dump and mysql-parallel-restore command.

The database tested was under two gigs in size so it wasn’t an ideal test but I was just trying to get an idea of how I can expect it to operate.

The results:

Test One: Parallel Dump & Restore

rm -rf /home/kmurphy/pdump/default/*time mysql-parallel-dump –databases pdump –tab –basedir=/home/kmurphy/pdump

mysql -e ‘drop database pdump’

mysql -e ‘create database pdump’
  run time
run one 0m22.166s
run two 0m22.348s
run three 0m21.625s
run four 0m22.723s
run five 0m16.744s
run six 0m24.603s
run seven 0m16.647s
run eight 0m21.449s
run nine 0m18.590s
run ten 0m21.975s
average
20.887s
time mysql-parallel-restore –tab /home/kmurphy/pdumpmysql -e ‘reset master’
  run time
run one 4m3.048s
run two 4m24.352s
run three 4m25.403s
run four 4m8.525s
run five 4m18.319s
run six 4m7.668s
run seven 4m32.558s
run eight 4m29.318s
run nine 4m10.424s
run ten 4m53.700s
average
4m21.332s

Test Two: Parallel Dump & Restore w/Chunk Size of 100,000

rm -rf /home/kmurphy/pdump/default/*time mysql-parallel-dump –databases pdump –tab –chunksize=100000basedir=/home/kmurphy/pdump

mysql -e ‘drop database pdump’

mysql -e ‘create database pdump’
  run time
run one 0m16.207s
run two 0m14.899s
run three 0m14.988s
run four 0m15.331s
run five 0m15.112s
run six 0m16.124s
run seven 0m15.902s
run eight 0m14.763s
run nine 0m14.903s
run ten 0m15.261s
average
15.394s
time mysql-parallel-restore –tab /home/kmurphy/pdumpmysql -e ‘reset master’
  run time
run one 4m27.080s
run two 4m43.456s
run three 4m32.554s
run four 4m27.986s
run five 5m3.333s
run six 4m48.599s
run seven 4m14.181s
run eight 5m7.383s
run nine 4m44.915s
run ten 5m3.911s
average
4 min 39.34s

Test Three: Default Dump & Restore

time mysqldump –databases pdump > /data/backup/testing/normal/dumptest.sql
  run time
run one 0m38.788s
run two 0m43.511s
run three 0m39.406s
run four 0m41.640s
run five 0m40.168s
run six 0m39.696s
run seven 0m43.360s
run eight 0m43.546s
run nine 0m43.370s
run ten 0m42.890s
average
41.638s
time mysql < /data/backup/testing/normal/opt/dumptest.sqlmysql -e ‘reset master’
  run time
run one 9m33.572s
run two 9m32.136s
run three 9m28.766s
run four 9m32.804s
run five 9m31.367s
run six 9m32.018s
run seven 9m33.421s
run eight 9m30.748s
run nine 9m34.836s
run ten 9m29.171s
average
9m31.884s

Test Four: Optimized standard dump with grouped insert statements and restore

time mysqldump –opt -e –databases pdump > /home/kmurphy/dump/dumptest.sql
  run time
run one 0m43.553s
run two 0m44.514s
run three 0m39.177s
run four 0m41.644s
run five 0m41.356s
run six 0m40.577s
run seven 0m39.982s
run eight 0m39.553s
run nine 0m41.970s
run ten 0m41.008s
average
41.333s
time mysql < /home/kmurphy/dump/dumptest.sqlmysql -e ‘reset master’
  run time
run one 9m17.282s
run two 9m28.965s
run three 9m29.265s
run four 9m35.659s
run five 9m31.757s
run six 9m32.542s
run seven 9m27.294s
run eight 9m28.672s
run nine 9m32.159s
run ten 9m27.542s
average  
9m29.114s

Test Five: Optimized mysqldump and restore w/tab delimited files

time mysqldump -Q -T /home/kmurphy/dump/pdump  –opt –single-transaction pdumpcat /home/kmurphy/dump/pdump/*.sql > /home/kmurphy/dump/pdump/xyz.sql

time mysql pdump < /home/kmurphy/dump/pdump/xyz.sql
  run time
run one 38.188s
run two 33.321s
run three 33.663s
run four 33.096s
run five 33.268s
run six 33.203s
run seven 33.352s
run eight 33.073s
run nine 37.682s
run ten 33.196s
average
34.204s
time mysqlimport pdump /home/kmurphy/dump/pdump/*.txtrm /home/kmurphy/dump/pdump/xyz.sql

mysql -e 'reset master'
run time
run one 5.348s + 6m10.926s
run two 4.905s + 6m11+366s
run three 6.770s + 6m11.556s
run four 5.859s + 6m5.699s
run five 6.066s + 6m10.076s
run six 6.478s + 6m6.743s
run seven 5.422s + 6m7.818s
run eight 6.055s + 6m14.252s
run nine 5.016s + 6m11.458s
run ten 6.730s + 6m16.509s
average
6m16.505s

Test Six: Optimized mysqldump and restore w/tab delimited files and extended inserts

time mysqldump -Q -T /home/kmurphy/dump/pdump  --opt --single-transaction -e pdump
run time
run one 33.164s
run two 33.463s
run three 33.239s
run four 33.182s
run five 33.140s
run six 33.596s
run seven 33.115s
run eight 33.063s
average
33.245s
cat /home/kmurphy/dump/pdump/*.sql > /home/kmurphy/dump/pdump/xyz.sqltime
time mysql pdump < /home/kmurphy/dump/pdump/xyz.sql
time mysqlimport pdump /home/kmurphy/dump/pdump/*.txtrm /home/kmurphy/dump/pdump/xyz.sql
mysql -e 'reset master'
run time
run one 5.846s + 6m12.939s
run two 5.195s + 6m10.240s
run three 6.221s + 6m9.272s
run four 5.431s + 6m10.620s
run five 6.297s + 6m15.142s
run six 5.740s + 6m12.571s
run seven 6.551s + 6m15.873s
run eight 6.293s + 6m11.707s
average
6m18.242s

The results are quite impressive. The mysql-parallel-dump with chunksize option that I ran didn’t seem to make any difference. Don’t be fooled as this was on a smaller data set. If you have a multi-gigabyte table with millions of rows in innodb format this could make a huge difference. Don’t underestimate it.

The clear winner is the mysql-parallel-dump paired with mysql-parallel-restore. The parallel import/dump over a normal mysqldump w/various options is around twice as fast. Using mysqlimport with mysqldump brings the time difference down to around two minutes but it is still fairly significant. With a larger dataset I think the difference would become more pronounced.

In short, you should strongly consider using the parallel dump and restore tool for your backup needs if you are currently using mysqldump.

2 comments

Interesting New Software

So for quite some time I have been looking around for a solution to a problem. We are developing a tool in-house at iContact that will allow actual production queries to be run against a database. This actually almost complete. Thursday we actually did the initial testing. I am responsible for setting up a system that will allow a database to be rolled back to a start point. That way we can populate a database — run queries against it and then roll back to re-run the test after configuration changes.

Snapshots with rollback would allow this. There is no common Linux software that allows this. I must admit to a little envy about Solaris as it does allow snapshots with ZFS. I honestly just wish all the developers could get together and figure out some way to port this to Linux. But .. that is another story. I tried working with unionfs. It is just to clunky for what I want to do. It was designed really to work with “live cd” based projects (like Knoppix) and isn’t really suited to my needs.

My CTO pointed me to some software yesterday that I read about some time ago (about two years ago if I remember). I had lost track of it and forgotten about it. The software doesn’t appear to be in active development. The last release was 2-26-06. However, it does appear to do everything I want and then some.

The software is called EVMS (Enterprise Volume Management System) and looks like it was developed in-house at IBM back in 2001. Evidently IBM use to support the project in a fairly big way but it has been somewhat orphaned of late. A few of the developers still work on it but not very much.

The software works on top of underlying filesystem technologies such as LVM and provide a single unified way of working with them. In addition it adds some capabilities of its own (including the afore-mentioned snapshot capabilities). EVMS has a command line interface, a ncurses interface and a gtk+ GUI. Covers it all. Unfortunately the installation looks like it is a little complex. I will be getting with the sys admins Monday to see if I can get some time next week for getting it set up on the test system.

The EVMS website is: http://evms.sourceforge.net

Anyone out there have experience with this software? Would love to hear some feedback.

7 comments

Mysqlpdump Test

I should be sleeping. But I am playing :). I have fairly serious dump of a couple of servers that will probably occur late this week. I have been tweaking various things and testing various configurations to get the best dump and restore speeds. Then, in the last couple of days, I ran across this by Monty Taylor:

http://mysql-ha.com/2007/09/12/mysqlpdump-for-parallel-dumps/

which links to this site:

http://www.fr3nd.net/projects/mysqlpdump/

It is a python wrapper script around mysqldump that executes the dumps using multiple threads — from what I see it each thread dumps an individual table.
I executed some tests on a slave that we run backups and a small DB that is right at 900M in size. It was fairly eye opening:

1:37 — 917M mysqlpdump.py -u root -ppassword -v -d /var/backup/directory -i database
2:23 — 915M mysqldump database -u root -ppassword > /var/backup/directory/database.sql

So, on this small database, it made a significant % speed increase. This was with the default 5 threads. Next I increased the thread count to 10, 15 and then 20. The results are:
1:50 — mysqlpdump.py -u root -ppassword -v -t 10 -d /var/backup/directory -i database
1:40 — mysqlpdump.py -u root -ppassword -v -t 15 -d /var/backup/directory -i database
1:59 — mysqlpdump.py -u root -ppassword -v -t 20 -d /var/backup/directory -i database

While these aren’t controlled test I think it is clear that the mysqlpdump utility script is faster than the mysqldump command. Speed was best with a thread count of five. I don’t have an explanation for this other than the possibility that my server (a dual-xeon) is really just simulating a higher thread count. I don’t know. Not as much a hardware person as I might like. Even so, just with default settings it is a nice little speed-up.

As for the restore - Brian Aker says in this post: http://krow.livejournal.com/519655.html that there is an option now to mysqlimport to specify a number of threads..the reverse of this script. Or, as an alternative, I could use the mysqlprestore script developed by Monty Taylor after he looked at mysqlpdump. Not sure which I will use, but it seems that both will do the job.
Cool stuff. Hoping something gets directly integrated into mysqldump soon. Or, even better, an online backup that didn’t disrupt things :)

4 comments

Next Page »