Archive for the 'Benchmarking' Category
Mark Callaghan — Scaling InnoDB/MySQL @ UC
Goals
Make Innodb scale on big servers
- fix bottlenecks on big SMP server
- utilize server with many disk
- support thousands of database connections
- handle corruption in mormroy and on disk
- make query plans perdictable
- support thousands of tables and accounts
Mentioned a problem with SANs?
Desirable Features
- throughput scales with number of CPU cores
- efficient support for 128 GB buffer cache
- performance for servers with man and or remote disk
- recover from corruption
CPU Problems
- Mutex implementations favor portability over performance
- Mutex hotspots
- buffer cache
- memory allocation
- transaction log
- adaptive hash latch
Symtops of CPU problems
- adaptive hash latch contention
- SHOW INNODB STATUS displays the session that holds the adaptive hash latch
- a background thread logs SHOW INNODB STATUS into the error log when there is along lock wait
- excessive mutex contention
- server has many queries, is slow and is not IO bound
- on linux, vmstat will report a lot of idle time
Workarounds for CPU problems
- Upgrade
- MySQL 5.0.30 has import improvements
- MySQL 5.0.54 fixes a bug that causes some threads to miss wakeup signals
- experiment with innodb_thread_concurrency to limit the number of threads that run concurrently
Making the InnoDB RW-mutex fast
- Use atomic ops to change internal state (replacing the Innodb spin lock mutex)
- use separate events to wake readers and writers
Work in progress for CPU problems
- change the innodb mutex to use atomic ops
- change the Innodb read-write mutex to use atomic ops
Performance Tests
Eight-CPU core server. Data is from the sbtest table generated by sysbench. Data is cached (key buffer, InnoDB buffer cache).
It was amazing the difference that these patches make. The scalability for these servers (number of sessions) was almost linear.
Memory
To support a 128GB buffer cache:
- data structures must scale
- walking a list with 8M page entries might be slow
- Resources need to be split
- more than one mutex might be needed for the buffer cache and LRU chain
- Detection of corruption is more important
- memory will be corrupted by software and hardware bugs
Detect and correct corruption
Features in InnoDB
- page checksums to detect corruption that occurs after a page has been written to disk
- doublewrite buffer to correct partial page writes that occur during a server crash
Innodb crashes when it reads a page with an invalid checksum
if this is a page for a secondary index, than the index can be rebuilt
Disk performance
- Innodb uses one background thread to process prefetch requests.
- Innodb uses one background thread to flush dirty buffer cache pages to disk. This is fine as long as writes go to the OS buffer cache. Otherwise, writes may be slow
- The background IO threads assume a server with one disk and don’t run fast enough when there is work to be done
Fixes
Google patch adds support for multiple background IO threads.
The Google patch will soon have an option to tune the maximum rate of background IO
Connections
- need to support thousands of connections, but not will to use one thread per connection
- MySQL 6 separates threads from connections
Query Plans
- innodb uses sampling to gather stats for the optimizer
- stats are not stored on disk
- stats are collected the first time a table is used after startup and after many rows has been modified
stats can be off between slaves with identical dataset. This happens (sometimes) except if the is a unique key
Overall a good talk. Mark got off several times on things that, frankly, if you weren’t deeply involved in the Innodb code you wouldn’t understand (code snippets). However, it sounds like they are solving some important problems. I am really looking forward to these being put into the “official” MySQL code.
1 commentServer Testing Stumper
I blogged several times about some testing I have been doing recently. Well, I have run across something that just doesn’t make sense. I checked kernels and I/O schedulers and worked out what seems to do the best (2.6.22 kernel with deadline for those who are curious — I will post results soon). I had done all this with two servers that were identical and had 8 gb of RAM each. I then moved the 8 gb of RAM from one server to the other so one server had 16 gb of RAM. I then re-ran the last test with the only change being I raised the innodb buffer variable from four gb to eight gb.
I would have expected if there was any change it would be faster. I ran querybench five times and averaged the results as I had done with previous test. It was almost exactly 1500 qps slower than the previous run.
Anyone have any thoughts? I have run the test several times. I am rerunning the tests while closely examining the output of some of normal profile tools — top and iostat. It appears to me that the runs are io-bound if anything. The iowait percentage runs between the 25% - 40+ %. Top shows memory usage isn’t excessive (30% or 4.8 GB) and CPU percentage never tops 100% (so it doesn’t even max out one core of eight total cores).
My hunch is that the server is trying to fill up the buffer and IO is just bogging down the system. If I had a larger input file (I have 135 megs total of queries I am feeding into querybench) it might level out after “warming up” more.
My current formulae:
- I split the input queries into two files to warm the server up with the first set of queries
- I time both runs and then record the results of the second part (which is always higher than the warm up run)
Results show that . However I could be really wrong
Feedback would be very appreciated!!
2 commentsBenchmark Updates
As I blogged about recently I was going to have a chance to do some testing. Last week we started benchmarking our two new servers. I am running a variety of tests on them and I will begin publishing the results shortly. This has been the first time I have seriously used the query processing programs for testing. So far, it has performed fairly well. Initially, I had intended to use the EVMS (Enterprise Volume Management System) to perform rollbacks of the database files after each test. However, due to complexity and the overhead that this adds (and we don’t use EVMS in production) I decided to go with a standard setup. I just wrote a script to stop MySQL, delete the database files, and copy the baseline files from a storage directory. Then just restarting MySQL and I have a fresh setup. Yes, it takes longer, but I script the entire run anyways so a little extra time is no big deal.
I guess due to learning a few things during my SAN testing last year the progress has been much faster this time are. I should have the first set of test results up in a few days..by the end of the week at the latest.
No commentsNew Benchmarks Coming Soon — Your Chance to Participate!
I am going to be getting a new master/slave pair of production servers. I will have between a one and two weeks before they have to go into production. This will be a great chance to do a little benchmarking. I am going to begin writing up what I plan on testing in the next few days. I would love to here your suggestions about what to test!!
Here is the specs as I know them:
8 gigs of RAM
4 or 6 disks — 10k or 15k drives on a RAID controller with 256MB write-back cache.
4 or 8 cores (cpu) — most likely eight though
debian etch
MySQL 5.0.45
Things I want to test:
write-back turned on vs write-back turned off on RAID controller
xfs/reiserfs/ext3 on the RAID partition
I/O schedular (debian defaults to CFQ — want to test deadline also)
innodb-flush-log-at-trx-commit = 0 vs 1
I will be using our QPP (query processing programs) tools to run tests with production data and production client requests.
I very likely will not have time to run all the tests I want. However, the more tests the merrier. Please comment with your test ideas.
7 commentsNew Querysniffer Release
Hey everyone. A kind user pointed out a timing error in the querysniffer code. Ian fixed it and I fiddled around and managed to upload the new release to the server. If you are using querysniffer you should update to the new version as you are getting incorrect timing results currently.
Downloads are available at http://www.paragon-cs.com/queryprogs
No commentsXFS Snapshots with LVM2
As I wrote in my last post, there is an issue that bit me using XFS data partitions and trying to take a normal LVM snapshot of them while running MySQL. As part of the upgrade from (MySQL) 4.1 to 5.0 I dumped the entire dataset and imported it into a new server running 5.0. This server had an XFS data partitioning as previous testing had shown that it is faster than reiserfs for our database servers. This same server had our application “pointed” to it (after stopping the application and shutting down the old replication so there was no database activity). In essence it became the new master. I then had to reconfigure the old master with the MySQL 5.0 binaries, delete the data in the data partition and re-sync it with a copy of the data from the new master.
This is a fairly standard procedure and is done using the lvcreate command to create a snapshot of the data partition and then mounting the new snapshot and rsyncing it to the old master’s data partition. I have done this dozens of times in the past without ever having a problem — but this was the first time with an XFS filesystem in use on the data partition on the running server.
This time I copied over the data and tried to start up the new server. It didn’t work. Innodb complained bitterly and then the server just died. I don’t have the exact error but it didn’t start. This was actually the first of the set of servers I had to upgrade, so I went on and did the same procedure on all the other servers. Each one worked exactly as expected. However, I had resierfs on each of the other data partitions.
It was about six in the morning by this time. I began doing the upgrades at 10:00 pm so I suppose I wasn’t exactly sharp at this point. I performed the entire procedure again with exactly the same results. At this point I had to stop as our traffic was beginning to build up and the snapshot/rsync process creates too much load on the servers to do it during normal hours.
I googled around. People suggested that with XFS partitions you use the xfs_freeze command to halt all disk activity and clear out the XFS transaction logs. It seems that LVM doesn’t do this and it was what was causing the inconsistent view of the data. Basically you did this:
xfs_freeze -f
lvcreate -s ……..
xfs_freeze -u
So you freeze the partition, create the snapshot and then unfreeze it so activity continue. However, googling some more, I found that this doesn’t work with LVM2 (the current stable version of LVM). Something has changed with LVM in the newer version and now when you issue the “xfs_freeze -f” comand and than the lvcreate command — lvcreate actually hangs. It won’t create the snapshot until you run the “xfs_freeze -u” command which “unfreezes” the filesystem.
I don’t know what changed with LVM but I really wish that this wasn’t the case. To get the copy of the data partition that would actually work I had to briefly shut down the MySQL server and then run the lvcreate command. Then, after the snapshot was created, I could restart the server. Of course the server was down for about 30 seconds. But, because there was no XFS filesystem activity going on when the snapshot was made, it worked perfectly. However this means I have to revert filesystems back to reiserfs until this issue with LVM is resolved. I suspect it will be quite some time ![]()
Query Processing Programs — Version 1.0!!!
Hey everyone. I just finished uploaded the latest versions of the three QPP programs. They are all working well and I consider them “version 1.0″. I will be updating the qpp pages as soon as possible with a tutorial on how all these programs work together.
The basic idea is that you use the querysniffer program to gather production data from your servers as needed without stopping/starting the MySQL server. Queryparse is actually used to control querysniffer and push the gathered data into a database from which you can use simple php/html pages to view the statistics.
Querybench is used to benchmark/stress-test a test server with production data (gathered, once again, with querysniffer). You can even set it up so that the database can be “rolled back” to its initial state after a run.
All in all, I am very pleased. I have fulfilled all the goals I set out to achieve when I began this little journey a few months ago. It will be very useful in my work and I hope others find it useful as well.
I would like to publicly thank Ian Kilgore who helped far above and beyond anything required. He is one of the system administrators where I work and is the genius programmer behind this stuff. I just had the idea and kind of directed things along. The only “programming” I have done is working on the documentation a bit and developing the web pages for the display of the statistics.
Feedback can be sent to “bmurphy” at “paragon-cs.com”.
Now what do I with all this free time I have left over *cough*?
No commentsmysql-parallel-dump test
I just wrapped up a test of the new mysql-parallel-dump and mysql-parallel-restore that Baron Schwartz has released as part of his toolkit. The toolkit is available from http://mysqltoolkit.sourceforge.net
I tested on an 8-core (dual processor/quad-core) server with 8 gigabytes of RAM. I baselined with the normal mysqldump command (no options) and then benchmarked variations of both the mysqldump command and the mysqlimport command against the mysql-parallel-dump and mysql-parallel-restore command.
The database tested was under two gigs in size so it wasn’t an ideal test but I was just trying to get an idea of how I can expect it to operate.
The results:
Test One: Parallel Dump & Restore
rm -rf /home/kmurphy/pdump/default/*time mysql-parallel-dump –databases pdump –tab –basedir=/home/kmurphy/pdump mysql -e ‘drop database pdump’ mysql -e ‘create database pdump’
| run time | |
| run one | 0m22.166s |
| run two | 0m22.348s |
| run three | 0m21.625s |
| run four | 0m22.723s |
| run five | 0m16.744s |
| run six | 0m24.603s |
| run seven | 0m16.647s |
| run eight | 0m21.449s |
| run nine | 0m18.590s |
| run ten | 0m21.975s |
| average | |
| 20.887s |
time mysql-parallel-restore –tab /home/kmurphy/pdumpmysql -e ‘reset master’
| run time | |
| run one | 4m3.048s |
| run two | 4m24.352s |
| run three | 4m25.403s |
| run four | 4m8.525s |
| run five | 4m18.319s |
| run six | 4m7.668s |
| run seven | 4m32.558s |
| run eight | 4m29.318s |
| run nine | 4m10.424s |
| run ten | 4m53.700s |
| average | |
| 4m21.332s |
Test Two: Parallel Dump & Restore w/Chunk Size of 100,000
rm -rf /home/kmurphy/pdump/default/*time mysql-parallel-dump –databases pdump –tab –chunksize=100000 –basedir=/home/kmurphy/pdump mysql -e ‘drop database pdump’ mysql -e ‘create database pdump’
| run time | |
| run one | 0m16.207s |
| run two | 0m14.899s |
| run three | 0m14.988s |
| run four | 0m15.331s |
| run five | 0m15.112s |
| run six | 0m16.124s |
| run seven | 0m15.902s |
| run eight | 0m14.763s |
| run nine | 0m14.903s |
| run ten | 0m15.261s |
| average | |
| 15.394s |
time mysql-parallel-restore –tab /home/kmurphy/pdumpmysql -e ‘reset master’
| run time | |
| run one | 4m27.080s |
| run two | 4m43.456s |
| run three | 4m32.554s |
| run four | 4m27.986s |
| run five | 5m3.333s |
| run six | 4m48.599s |
| run seven | 4m14.181s |
| run eight | 5m7.383s |
| run nine | 4m44.915s |
| run ten | 5m3.911s |
| average | |
| 4 min 39.34s |
Test Three: Default Dump & Restore
time mysqldump –databases pdump > /data/backup/testing/normal/dumptest.sql
| run time | |
| run one | 0m38.788s |
| run two | 0m43.511s |
| run three | 0m39.406s |
| run four | 0m41.640s |
| run five | 0m40.168s |
| run six | 0m39.696s |
| run seven | 0m43.360s |
| run eight | 0m43.546s |
| run nine | 0m43.370s |
| run ten | 0m42.890s |
| average | |
| 41.638s |
time mysql < /data/backup/testing/normal/opt/dumptest.sqlmysql -e ‘reset master’
| run time | |
| run one | 9m33.572s |
| run two | 9m32.136s |
| run three | 9m28.766s |
| run four | 9m32.804s |
| run five | 9m31.367s |
| run six | 9m32.018s |
| run seven | 9m33.421s |
| run eight | 9m30.748s |
| run nine | 9m34.836s |
| run ten | 9m29.171s |
| average | |
| 9m31.884s |
Test Four: Optimized standard dump with grouped insert statements and restore
time mysqldump –opt -e –databases pdump > /home/kmurphy/dump/dumptest.sql
| run time | |
| run one | 0m43.553s |
| run two | 0m44.514s |
| run three | 0m39.177s |
| run four | 0m41.644s |
| run five | 0m41.356s |
| run six | 0m40.577s |
| run seven | 0m39.982s |
| run eight | 0m39.553s |
| run nine | 0m41.970s |
| run ten | 0m41.008s |
| average | |
| 41.333s |
time mysql < /home/kmurphy/dump/dumptest.sqlmysql -e ‘reset master’
| run time | |
| run one | 9m17.282s |
| run two | 9m28.965s |
| run three | 9m29.265s |
| run four | 9m35.659s |
| run five | 9m31.757s |
| run six | 9m32.542s |
| run seven | 9m27.294s |
| run eight | 9m28.672s |
| run nine | 9m32.159s |
| run ten | 9m27.542s |
| average | |
| 9m29.114s |
Test Five: Optimized mysqldump and restore w/tab delimited files
time mysqldump -Q -T /home/kmurphy/dump/pdump –opt –single-transaction pdumpcat /home/kmurphy/dump/pdump/*.sql > /home/kmurphy/dump/pdump/xyz.sql time mysql pdump < /home/kmurphy/dump/pdump/xyz.sql
| run time | |
| run one | 38.188s |
| run two | 33.321s |
| run three | 33.663s |
| run four | 33.096s |
| run five | 33.268s |
| run six | 33.203s |
| run seven | 33.352s |
| run eight | 33.073s |
| run nine | 37.682s |
| run ten | 33.196s |
| average | |
| 34.204s |
time mysqlimport pdump /home/kmurphy/dump/pdump/*.txtrm /home/kmurphy/dump/pdump/xyz.sql mysql -e 'reset master'
| run time | |
| run one | 5.348s + 6m10.926s |
| run two | 4.905s + 6m11+366s |
| run three | 6.770s + 6m11.556s |
| run four | 5.859s + 6m5.699s |
| run five | 6.066s + 6m10.076s |
| run six | 6.478s + 6m6.743s |
| run seven | 5.422s + 6m7.818s |
| run eight | 6.055s + 6m14.252s |
| run nine | 5.016s + 6m11.458s |
| run ten | 6.730s + 6m16.509s |
| average | |
| 6m16.505s |
Test Six: Optimized mysqldump and restore w/tab delimited files and extended inserts
time mysqldump -Q -T /home/kmurphy/dump/pdump --opt --single-transaction -e pdump
| run time | |
| run one | 33.164s |
| run two | 33.463s |
| run three | 33.239s |
| run four | 33.182s |
| run five | 33.140s |
| run six | 33.596s |
| run seven | 33.115s |
| run eight | 33.063s |
| average | |
| 33.245s |
cat /home/kmurphy/dump/pdump/*.sql > /home/kmurphy/dump/pdump/xyz.sqltime
time mysql pdump < /home/kmurphy/dump/pdump/xyz.sql
time mysqlimport pdump /home/kmurphy/dump/pdump/*.txtrm /home/kmurphy/dump/pdump/xyz.sql
mysql -e 'reset master'
| run time | |
| run one | 5.846s + 6m12.939s |
| run two | 5.195s + 6m10.240s |
| run three | 6.221s + 6m9.272s |
| run four | 5.431s + 6m10.620s |
| run five | 6.297s + 6m15.142s |
| run six | 5.740s + 6m12.571s |
| run seven | 6.551s + 6m15.873s |
| run eight | 6.293s + 6m11.707s |
| average | |
| 6m18.242s |
The results are quite impressive. The mysql-parallel-dump with chunksize option that I ran didn’t seem to make any difference. Don’t be fooled as this was on a smaller data set. If you have a multi-gigabyte table with millions of rows in innodb format this could make a huge difference. Don’t underestimate it.
The clear winner is the mysql-parallel-dump paired with mysql-parallel-restore. The parallel import/dump over a normal mysqldump w/various options is around twice as fast. Using mysqlimport with mysqldump brings the time difference down to around two minutes but it is still fairly significant. With a larger dataset I think the difference would become more pronounced.
In short, you should strongly consider using the parallel dump and restore tool for your backup needs if you are currently using mysqldump.
2 commentsInteresting New Software
So for quite some time I have been looking around for a solution to a problem. We are developing a tool in-house at iContact that will allow actual production queries to be run against a database. This actually almost complete. Thursday we actually did the initial testing. I am responsible for setting up a system that will allow a database to be rolled back to a start point. That way we can populate a database — run queries against it and then roll back to re-run the test after configuration changes.
Snapshots with rollback would allow this. There is no common Linux software that allows this. I must admit to a little envy about Solaris as it does allow snapshots with ZFS. I honestly just wish all the developers could get together and figure out some way to port this to Linux. But .. that is another story. I tried working with unionfs. It is just to clunky for what I want to do. It was designed really to work with “live cd” based projects (like Knoppix) and isn’t really suited to my needs.
My CTO pointed me to some software yesterday that I read about some time ago (about two years ago if I remember). I had lost track of it and forgotten about it. The software doesn’t appear to be in active development. The last release was 2-26-06. However, it does appear to do everything I want and then some.
The software is called EVMS (Enterprise Volume Management System) and looks like it was developed in-house at IBM back in 2001. Evidently IBM use to support the project in a fairly big way but it has been somewhat orphaned of late. A few of the developers still work on it but not very much.
The software works on top of underlying filesystem technologies such as LVM and provide a single unified way of working with them. In addition it adds some capabilities of its own (including the afore-mentioned snapshot capabilities). EVMS has a command line interface, a ncurses interface and a gtk+ GUI. Covers it all. Unfortunately the installation looks like it is a little complex. I will be getting with the sys admins Monday to see if I can get some time next week for getting it set up on the test system.
The EVMS website is: http://evms.sourceforge.net
Anyone out there have experience with this software? Would love to hear some feedback.
7 commentsMysqlpdump Test
I should be sleeping. But I am playing :). I have fairly serious dump of a couple of servers that will probably occur late this week. I have been tweaking various things and testing various configurations to get the best dump and restore speeds. Then, in the last couple of days, I ran across this by Monty Taylor:
http://mysql-ha.com/2007/09/12/mysqlpdump-for-parallel-dumps/
which links to this site:
http://www.fr3nd.net/projects/mysqlpdump/
It is a python wrapper script around mysqldump that executes the dumps using multiple threads — from what I see it each thread dumps an individual table.
I executed some tests on a slave that we run backups and a small DB that is right at 900M in size. It was fairly eye opening:
1:37 — 917M mysqlpdump.py -u root -ppassword -v -d /var/backup/directory -i database
2:23 — 915M mysqldump database -u root -ppassword > /var/backup/directory/database.sql
So, on this small database, it made a significant % speed increase. This was with the default 5 threads. Next I increased the thread count to 10, 15 and then 20. The results are:
1:50 — mysqlpdump.py -u root -ppassword -v -t 10 -d /var/backup/directory -i database
1:40 — mysqlpdump.py -u root -ppassword -v -t 15 -d /var/backup/directory -i database
1:59 — mysqlpdump.py -u root -ppassword -v -t 20 -d /var/backup/directory -i database
While these aren’t controlled test I think it is clear that the mysqlpdump utility script is faster than the mysqldump command. Speed was best with a thread count of five. I don’t have an explanation for this other than the possibility that my server (a dual-xeon) is really just simulating a higher thread count. I don’t know. Not as much a hardware person as I might like. Even so, just with default settings it is a nice little speed-up.
As for the restore - Brian Aker says in this post: http://krow.livejournal.com/519655.html that there is an option now to mysqlimport to specify a number of threads..the reverse of this script. Or, as an alternative, I could use the mysqlprestore script developed by Monty Taylor after he looked at mysqlpdump. Not sure which I will use, but it seems that both will do the job.
Cool stuff. Hoping something gets directly integrated into mysqldump soon. Or, even better, an online backup that didn’t disrupt things ![]()