PHP - Optimising MySQL queries
From LXF Wiki
|Table of contents|
Practical PHP Programming
(Original version written by Paul Hudson for Linux Format magazine issue 45.)
We finish a two-part mini-series into accelerating your scripts with a look at how to optimise your MySQL queries...
Last tutorial we looked at the basics of optimisation, and also went into quite some depth regarding how to make your PHP scripts run faster. However, where's the point in making your scripts run faster if the database that drives them crawls? This issue, without further ado, we're going to be looking at various ways to make your SQL run faster. This is a big area, so you're strongly advised to read up on the suggestions given here to see whether they will work in your given situation.
There are three basic routes to take when it comes to optimising your SQL: change your queries, change your table schema, or change your server configuration. We'll be looking at all three, starting with queries.
One of the many advantages to SQL is that it is a true fourth-generation language - you tell it what you want, and it does all the hard work for you. As such, the actual /implementation/ of your query requests is left entirely down to your database server, which means it might not necessarily be doing things in the most optimised way. In order to reach the largest audience, we'll be covering MySQL techniques here, but many of the same rules will apply elsewhere.
MySQL rule #1: Select as little data as possible. Consider the following two queries:
SELECT * FROM MyTable WHERE Username = 'Hudzilla'; SELECT ID FROM MyTable WHERE Username = 'Hudzilla' LIMIT 1;
MyTable, we shall assume, contains thousands of rows, with one for each user. The programmer is looking to extract the ID of a user based upon a username, and above are two ways it can be done: select every field from each row and match Username against "Hudzilla", or to select just the ID field and match Username against "Hudzilla", and only return one row. I made a simple benchmark to test the exact speed difference between the queries - with 200,000 users (a large number to make the point clear), the first query took 1.31 seconds to execute, whereas the second query took 0.66 seconds to execute.
The reason for the big speed up for the second query is two-fold: firstly, we're not bothering to return anything other than ID, whereas there could be dozens of fields being returned in the first query. Selecting as little data as possible means using "SELECT *" rarely if ever. Secondly, and most importantly, the LIMIT 1 at the end of the second query will force MySQL to only return one row and, as we don't have any complicated ordering going on, MySQL will simply stop searching as soon as it finds the first row with Username "Hudzilla". That's quite clearly superior to the first solution, where MySQL will keep on searching the database to find other Hudzilla rows - this is rarely the desired situation.
MySQL rule #2: Let MySQL tell you how it would execute your query. If you append EXPLAIN before any SELECT query, MySQL will output a detailed report on how it would have handled the query, how many rows it would have returned, etc - this is a great way to spot problems. The faster of the two queries from above, when used with EXPLAIN, produces the following output:
+--------+------+---------------+------+---------+------+--------+------------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +--------+------+---------------+------+---------+------+--------+------------+ | lxf44a | ALL | NULL | NULL | NULL | NULL | 199999 | where used | +--------+------+---------------+------+---------+------+--------+------------+
The important parts are "possible_keys", "rows", and "Extra". Possible keys shows "NULL", which means MySQL cannot run the query using indexes - we'll be solving that later. Rows shows how many rows MySQL would have had to search through, and "Extra" shows what extra functionality it would have used to perform the search - here we can see it would have filtered the search using the where clause.
MySQL rule #3: Let MySQL tell you about queries that take a long time. Perhaps you didn't realise this, but simply by enabling one option in your MySQL configuration file, MySQL will keep a log of all the queries it considers slow (usually taking longer than ten seconds to execute), as well as how many rows the slow query returned.
To activate the slow query log, simply start up MySQL with the option --log-slow-queries = /var/log/wherever/you/want/it. You can optionally also use --log-long-format, which will cause MySQL to consider all queries that don't use indexes as slow queries - this can be a great help if you're trying to add indexes, but you're not sure where they're needed.
MySQL rule #4: Perform joins carefully. Joining data from two tables in one query is a very powerful technique - particularly when combined with normalisation. However, if not done right -- even by a small margin -- you can incur a serious speed hit. Joining data properly in your queries is a complicated and long-winded thing to explain, and we're almost out of the room, so the best I can do is point you to an excellent reference on the topic, the MySQL manual:
MySQL rule #5: "OPTIMIZE TABLE /table/" is a crucial command to achieve maximum performance. Once you've implemented everything else here, run optimize table on your tables every so often - it does a lot of housework to make sure the table is in an optimal state for querying, and will make your queries faster.
Changing your schemata
The second way to speed up your MySQL performance is to alter the layout ("schema") of your tables to be optimised for performance. This can be split into three sub-sections: data types, indexing, and normalisation, ordered by difficulty. Changing your database schema can be considered the same as changing your algorithm - very often you think your spending hours tweaking things to get speed-ups, only to find you could double the speed by starting from scratch.
Choosing the right data type for your fields can provide a substantial speed boost, but may or may not be good when it comes to space. For the "select as little data as possible" test above, here's the SQL schema I used:
CREATE TABLE lxf44 (ID INT NOT NULL PRIMARY KEY AUTO_INCREMENT, Username VARCHAR(255), Age INT, JoinDate INT, Homepage VARCHAR(255), Location VARCHAR(255), FaveColour VARCHAR(255), Password VARCHAR(255), PassRemind VARCHAR(255));
It's not a complicated schema, as you can see. However, there are a number of ways it can be optimised. Firstly, "INT" is a vastly over-used data type - it means "store any number between -2147483648 and 2147483647". Now, while it might be possible that vitamin pills and a healthy diet will help you live long, but it's not really likely that we'll ever use all of that range for our age field. Nevertheless, /MySQL/ doesn't know that, so we need to be slightly less vague about our Age range. The "TINYINT" data type stores values from -128 to 127, which is probably enough despite it holding minus values. The "TINYINT UNSIGNED" data type stores values from 0 to 255, which is definitely enough for the forseeable future, and takes up much less room than a full INT.
Size vs Speed
Hard drives are cheap - you can buy hundreds of gigabytes of space for 200 pounds and still have change, so saving space isn't generally an issue. Processors are an entirely different matter, though - adding CPU power is still expensive, so everything we can do to cut down on CPU useage is a good thing, even /if/ that means taking up more hard drive space, right?
Consider this new schema, a modified version of the previous one:
CREATE TABLE lxf44a (ID INT NOT NULL PRIMARY KEY AUTO_INCREMENT, Username CHAR(255), Age TINYINT UNSIGNED, JoinDate INT, Homepage CHAR(255), Location CHAR(255), FaveColour CHAR(255), Password CHAR(255), PassRemind CHAR(255))
As you can see, we're now using TINYINT UNSIGNED for our Age field, but the key difference is that we're using CHAR(255) rather than VARCHAR(255). Both hold up to 255 characters of text, but the difference is that VARCHAR only uses as much as space as there are characters for on a row-by-row basis. Variable row sizing might sound good at first - after all, why leave unused space to waste? The problem lies in the fact that, because VARCHARs can vary in size, MySQL needs to calculate the length of each VARCHAR field in a row so it knows how far to jump to get to the next record. Compare that to using CHAR as the data type, where MySQL can just add 255 (using the above example) to get to the end of one field and the start of another.
Optimum column types
Another important consideration to keep in mind is that keeping character data short is a great space saver - in the above table I used CHAR(255) for everything, but you should consider each column carefully. A helpful tool in this regard is the ANALYSE() procedure - not the same as "ANALYZE TABLE /table/", which is like a miniature /OPTIMIZE TABLE/. Use the ANALYSE() procedure like this:
SELECT * FROM lxf44a PROCEDURE ANALYSE();
That will execute the query "SELECT * FROM lxf44a", then pass the results into the ANALYSE() procedure. ANALYSE()'s results are fairly complicated, but the important column is "Optimal_fieldtype" - this tells you the best field to use based upon the results returned by the query.
While all the above will help you get a big speed boost, the real key to maximum performance in MySQL is mastering indexes. An index in database terms is the same as an index in a book. If you were to take any book, say Red Hat Linux 9 Unleashed for example, how long would it take you to find something without the index, and how long with?
Indexing your data
You create an index using the command "ALTER TABLE... CREATE INDEX name(column)" - the name is just a name for your reference, but the column should be the name of the row you want to construct an index of. We're working with username here, so we should use ALTER TABLE lxf44a ADD INDEX idxusername(username). It will take a little time to execute, and also subsequent inserts/updates will be a mite slower because of the need to update the index, but it's worth it - the faster of the two queries from earlier on now takes 0.00 seconds. Yes, you read that right - it executes so quickly MySQL doesn't even register the time it took, which is a great deal faster than the 0.66 seconds the next fastest query took, and even faster than the 1.31 seconds the slowest query took.
You can add indexes to as many columns as you want - you can even index every column if you like. However, there are trade-offs, as per usual - indexes take up space, so the more indexes you have, the larger your table will be. Secondly, each time you change a row, you also need to change the index - if you index every column you'll need to update several indexes to reflect the changes. If you have a table which is written to exponentially less than it is read from, you can go ahead and try adding as many indexes as you think necessary.
We're not finished yet, though - remember, optimisation is the act of improving code, and even though we seem to be running at maximum performance, there's a great deal more we can do.
One key area open to optimisation is data duplication. Very often data is duplicated across rows, for example as row might contain Person A works at Company A, whose address is BigLongAddressHere. Person B also works at Company A, whose address is still BigLongAddressHere. Person C also works at.... etc. In this situation there are many people who work at the same place, but the company address is duplicated in the row of each staff member. What happens if the company changes address? Worse, think of all the drive space that's being wasted needlessly!
This can be corrected through normalisation: a person work at a company, but the address of the company relates to the company, not to the person. Therefore, it's better to create a new table for companies, put the companies in there (with no duplicates) and the addresses in there too. Now you just add a field to the employees table to link each employee to a company.
We covered normalisation in more depth back in the Practical PHP article in LXF31, but there are also several helpful resources online (just try Google) if you want to learn more.
Configuring your server
No matter how fast your queries are, if you're running cheap hardware or, worse, poorly configured hardware, your performance will always be stunted. The only thing worse that not having the very latest kit available for your servers is actually /having/ some big iron and not having it configured properly!
There are several easy tricks to make sure your system is working at maximum capacity. Note that we do not have the space here to print any form of "Optimise Linux" tutorial - a topic we haven't really covered in the magazine since about issue thirty-six.
There are four steps to database server nirvana: upgrade your hardware, upgrade MySQL, tweak MySQL, and finally keep a watch over your server.
Upgrade your hardware
This might sound like an obvious way to make your code faster, and indeed it is: upgrading your system is likely to make the biggest immediate difference. Attitudes on what you should upgrade are mixed, with much of the conventional wisdom not really holding true any more. For example, MySQL recommend upgrading your RAM, then your hard drives, then your CPU. That may well be true in some circumstances, however the majority of us are likely to find the biggest speed difference by upgrading in the opposite order: CPU, hard drives, RAM.
The reason for this is because the "RAM first" recommendation is based upon the concept that you're likely to be filling up your RAM with your MySQL data. If you are indeed using up all your RAM with your databases, then upgrading your RAM will be a huge help. However, on the smaller of my two servers I have 512MB of RAM, and my databases (large as they are) only take up 200 - upgrading RAM won't change things a whit for me. Naturally if you're running gigantic databases, your priorities will be different!
Upgrading your hard disk helps when committing data to disk and also reading data into RAM. If you can upgrade to a SCSI drive, or, better if you can stripe your data across several disks, you'll see a noticeable speed up for your server. At the very least, consider using hdparn to tweak your settings. On IDE drives we test, I generally use *hdparm -m16 -d1 -u1 -c1 /dev/hda*. NOTE: Using hdparm can damage or destroy your hardware - read the manual before use.
If your RAM isn't full and your hard disk aren't a problem - because they're fast enough or because they're not hit often - then it is CPU power that will count the most, and in my experience I've found sheer CPU horsepower is the key bottleneck. MySQL uses a lot of 64-bit integer code for its calculations, so if you can afford an Opteron or even an Athlon 64, then you'll find it makes a substantial difference.
I'll make this nice and clear so there's no confusion: if you're not using MySQL 4.0.14 or later, you should be. The reason for this, other than for stability and feature enhancements, is that the 4.x series includes a query cache, which stores results for SELECT queries in RAM so that the next time the same query comes in the results can be served up in no times at all. We covered the query cache in some depth in the MySQL 4 review in LXF42, and the basic result is: if you have queries that return the same results all the time or even some of the time, the query cache will raise your performance substantially.
Increase your buffers
MySQL has a variety of buffers that hold information about tables and indexes, and by increasing the size of these buffers you can often gain substantial boosts. Naturally the biggest boosts occur when changing from very small amounts, but if you find that one buffer has used all its memory, the chances are you'll get at least some performance boost by giving it some more.
To increase the size of your buffers, you need to edit your query cache. Key areas to increase are your key buffer, sort buffer, and query cache. Give your query cache as much as it needs - it's a big space chewer. If you're wondering how you can tell how well your buffers are being used, you need to check the next tip...
Keeping tabs on your server
There are two key commands you should use regularly, as they give you detailed information about the running of your machine. Firstly, just type "STATUS;" at the MySQL prompt to get basic information - how many questions have come in, how many were considered slow queries, how many queries per second, etc. Secondly, type "SHOW STATUS;" to get more detailed information - how much of your query cache is being used, how often MySQL has had to trim the cache due to a RAM shortage, how often your key buffer has been read and how much space it has left, etc. Check SHOW STATUS once a week or so to make sure your buffers aren't overflowing.
RAM is cheap, so there's no harm giving your buffers all they can take and then some. Using the query cache has no impact on performance, which means if you have the RAM you can just allocate 1GB to the query cache to watch your performance shoot up.
There are a variety of ways you can optimise your MySQL usage for maximum performance, and you can generally spend a lot of time ekeing out every last bit of speed. However, the chances are you'll get the best benefit, at least when it comes down to speed improvement vs your time simply by giving using indexes on columns you query by, and giving your buffers and query cache lots of RAM. If you really want to get every last drop of speed without upgrading your hardware, then there's more than enough above to keep you going for quite some time.
If you really want to take your performance to the max, consider purchasing the second edition of Paul Dubois' book, /MySQL/ (New Riders, 2003), as he goes into a great deal of depth on optimisation, including much more detail on the various caches and buffers you can configure.
Optimisation is a black art, yes, but it's lots of fun, and really helps test your coding skills. Very often it's a skill you learn just by doing, so practise, practise, practise!
To NULL or to NOT NULL
As NULL means literally "no value" and not 0 or an empty string, MySQL needs to store one extra bit of data in each field that can have a NULL value to store whether that variable is NULL or not. If you know a field isn't going to be NULL (because you're always going to provide a value, or use a default value), then declare the field as NOT NULL - save the space, and get a little extra performance, too.