Updating Large Datasets in MySQL

I’ve been building a PHP script to import an XML file every 24 hours into a MySQL database using a CRON job.
The table contains almost 150,000 rows of data, each with a unique ID (primary key).

The goal is to check if each ID exists. If so, update the record, otherwise insert it as a new row. But it’s also important that the script doesn’t run for too long and tie up valuable server resources.

The worst possible method would be to do a SELECT for every row in the XML file to check if the ID already exists in the database, and then perform an INSERT or UPDATE accordingly. That would mean running 2 queries for each row, a total of 300,000 queries.
I immediately dismissed that option.

The first method I actually tried was MySQL’s “REPLACE INTO” option, ie:
REPLACE INTO `table` (`id`,`name`) VALUES (25,'John Smith');
This tries to insert the row into the table, but if the unique ID already exists then it deletes the row first and performs an INSERT.

The initial import took over an hour to process 150,000 rows, but I expected the update to run faster. It didn’t.
The script took a total of 4720 seconds (~78 minutes) to complete. Far too long.

The second method I tried was the “INSERT…ON DUPLICATE KEY” option:
INSERT INTO `table` (`id`,`name`) VALUES (25,'John Smith') ON DUPLICATE KEY UPDATE `id`=25,`name`='John Smith'
At first glance this looks very longwinded, however my PHP script dynamically builds the SQL using an array of fields and values so it wasn’t too hard to put together.

I ran the new script and was pleasantly surprised to see the execution time drop to 203 seconds (~4 minutes). This is much more reasonable. The boost in speed most likely comes from having to perform less operations (INSERT->DELETE->INSERT vs INSERT->UPDATE).

Once again here are those stats:
REPLACE INTO: 4720 Seconds
INSERT…ON DUPLICATE KEY: 203 Seconds

You can leave a response, or trackback from your own site.

Leave a Reply

Powered by WordPress