Diamond Notes

Just another WordPress weblog

A Long Time Coming

So I recently came across a situation at work where I need to copy almost 150 million rows of data from a MyISAM table to an Innodb table. Here is the problem that I started thinking of approximately 130 million rows (and ten or eleven hours) into the transfer: what happens if the query gets aborted?

Having had large transfer die once before I know from experience it isn’t pretty. The way I ended up stopping the rollback the first time was to kill the mysqld daemon and rip all the data of the database out and restore the database after restarting the server. I could be so heavy handed during this recovery because it was a test server. Not something that you could do in production.

If I had to cancel the query I ran late last week it would have taken days for the transaction to rollback. So how could I keep a rollback like this from happening– or at least minimize it?

Here is what I ran:

INSERT INTO new_table SELECT * FROM old_table

Of course this selects everything from one table and copies it to a new table. What came to mind when I started thinking about this was:

INSERT INTO new_table SELECT * FROM old_table LIMIT X OFFSET Y

I posted a question to the mysql mailing list and got similar comments. I thought I might be on the right track.

So this seems to work. BUT…I am not convinced that it would. This is similar to the replication issue that Baron Schwartz has talked about this here:

http://www.xaprb.com/blog/2007/11/08/how-mysql-replication-got-out-of-sync/

Baron was having trouble because an ORDER BY….LIMIT statement was returning non-deterministic results. I think something similar could go on here.

First, let me say that I am in no way a “SQL expert”. I don’t pretend to be one. However, follow me for a minute. And feel free to comment :). When you execute a SELECT such as the one above against a database, the server is in no way constrained to give you the results in a deterministic order. When you run the above select statement multiple times with different offsets (to “chunk” the output so it’s easier to roll back) you aren’t guaranteed to get the proper results. At least that is what I understand.

I may be mistaken :). Be gentle if I am. Remember, I am not a SQL guru.

Another recommendation was to do something like the following:

INSERT INTO new_table SELECT * FROM old_table WHERE user_id mod [num_of_passes] = 0, 1, 2 . . . (num_of_passes - 1)

I believe that user_id would have to be a unique key. This query would return completely consistent results. I am not sure what you would do if you didn’t have a unique key to run the WHERE clause against.

Just some food for thought :)

****added…as Paul pointed out this is valid only as long as no inserts/updates/deletes are performed on the SELECTed table.  I was operating under that assumption (the table I was reading from was not being written too) but forgot to mention it.  Thanks Paul!!

7 Comments so far

  1. Paul November 20th, 2007 9:19 am

    don’t you still have to worry about the poor soul who gets updated from mod(id) of 2 to become mod(id) of 0, after the 0’s have been transferred, but before the 2’s have. how do you control for changes to the id over the lifetime of the transfer??

    if there are no updates to the source table, you are good to go methinks.
    but if there are updates/inserts/deletes - still problems…

  2. ryan November 20th, 2007 10:09 am

    As long as your old_table contains at least one unique key (row_id, primary key, or even a compound PK) you can order by that and it will be deterministic and avoid replication problems noted by xapbr. If your tables don’t have any unique keys at all, you’ve got bigger issues than replication worries. :-)

  3. Andy November 20th, 2007 10:33 am

    The mod idea sounds like a good idea, and I am not even sure it would have to be limited to unique keys - if it hit two records with the same key, it seems like it would just mod the second one and insert it also (if the mod return value is correct).

  4. […] started this as a response to Keith Murphy’s post at http://www.paragon-cs.com/wordpress/?p=54, but it got long, so it deserves its own post. The basic context is figuring out how not to cause […]

  5. Sheeri November 21st, 2007 6:29 pm

    ORDER BY has the problem that you add sorting to the list of things MySQL will need to do, which in many cases adds a lot of work, and who wants a temporary table for all that?

    LIMIT has the problem that it processes all the results until the end of the limit. That’s fine when you have LIMIT 0,1000 but you end up duplicating a lot of work and taking a lot more time than a mere ALTER TABLE when you have LIMIT 100000,1000 followed by LIMIT 101000,1000 followed by…..etc.

    Also, in response to Paul’s comment and your edit, a process cannot obtain a write lock on a MyISAM table if there’s already a read lock on the table. Any process trying to write to the table would be stuck waiting and waiting and waiting until the transaction finished (successful or otherwise) — yet another reason to use the mysqldump method, see the post I wrote that I reference below.

    There might be writes in between the time the read lock is released from the failed attempt and the time a new read lock is acquired from the 2nd attempt to put data into the new table….

    I wrote a much longer answer to how I’d do it, and decided to post it. it’s at http://sheeri.net/archives/246 which is also a pingback now….

  6. Xaprb November 21st, 2007 7:08 pm

    Hi Keith,

    I’d do it in “chunks” with a table lock, if needed.

    The problem with SELECT… LIMIT X OFFSET Y is that as you get further and further into the table, you’re still scanning through and discarding Y rows, which is really an N^2 algorithm. So, 130 million^2 rows later…

    It’s not quite that bad because you’re not doing 1 row at a time, but you see what I mean.

    The “chunking” algorithms I built into the parallel dump and checksum tools could handle this easily for many cases. Basically, say you want to do 1 million rows at a time. Find the minimum and maximum PK values (very cheap query) and divide the difference by 130 to get approximately the spread between the min/max PK of each chunk. Now begin with the min value, add the spread, and you have the first WHERE clause. Start there and add another spread, and you get the next.

    I currently have code written that’ll work for any numerical PK, including date/time types.

    It would be pretty easy to slap the modules together and make a new tool for this, or you could just do it manually :-)

  7. Technocation, Inc. December 18th, 2007 6:31 pm

    Picking Up Where You Left Off………

Leave a reply