PHP - Validating data
From LXF Wiki
(Original version written by Paul Hudson for LXF issue 66.)
The web is full of bad guys - how can you know who to trust? Come with me if you want to live!
Sending private information over the web is as secure as posting it on a postcard - it gets there sure enough, but world + dog can read it en route. That said, more data is stolen by people hacking into databases as opposed to people sniffing data packets, so it's crucial that the data is kept in a secure state.
If we consider a website that requires visitors to log in before they can post messages to a forum, there are a number of factors we need to look at: getting the data to the server securely, ensuring it is legitimate, verifying the credentials, and issuing the visitor with access rights. The first of these is handled transparently by SSL and so we needn't concern ourselves with it. However, the other three can be summarised in three succinct steps:
- 1: Validate: ensure that the input is what you expected, is of the correct length, and is correctly formatted.
- 2: Authenticate: verify what the user provided against what you have in your database.
- 3: Allocate: when you have established that they should have permissions, issue them with an access token that gives them access rights to their account.
If you follow the advice over the next four pages, you cannot fail. Or, if you do fail, you can't blame us!
While blogs came to prominence over the last year or so, they brought with them the scourge of comment spam. Two factors made comment spam possible: i) Google ranks sites partly by the number of links pointing to them (its PageRank system), and ii) most blogs use off-the-shelf software. As a result, scripts were written that look for the "Post comment" fields in a blog and fill them full of links to the spammers' sites - Google then thinks these sites are popular, and ranks them highly.
This technique - an automated form of Googlebombing - has been a plague for a long time, but there's now a "nofollow" option that can be placed on links to have Google and other search engines ignore them - perfect for solving the problem. However, it doesn't actually stop comment spam, it merely makes it less desirable: it's still very possible that the person posting a comment to your site isn't a person at all!
Other things we need to be looking for are bad input - either people trying to type extraordinarily long usernames, or people specifically trying to enter bad input for malicious purposes - but we also need to ensure it comes from a real person, and that they haven't tried to spoof the referer (note: we're aware that the correct spelling is "referrer", but the HTTP specification spells the header "referer" - hurrah for incompetence!)
So, first up: how can we prove the person who submitted our form is a real person? What we're looking for is what Alan Turing called an imitation game, now called a Turing test: we interrogate the entity submitting the form, and - because computers are currently not able to pass themselves off as humans sufficiently well - we should know whether it's a computer or a human at the end of the wire.
A popular method right now is the Turing test image, where you show the user some text in a picture and ask them to type it in. This works moderately well, but has its downsides: as an image, it cannot be scaled up for people who have sight problems (and their screen reader cannot read out the image contents - it's designed to be unreadable by computer, remember!), and also it's invisible for people who browse the web with images turned off.
Instead, we're going to do something text-based: we're going to ask someone to write in the answer of a simple sum, eg nine plus three. We don't want to discourage people from submitting comments by asking them things that require any real work, so using the numbers 0 to 10 and sticking only to addition is a smart move!
In PHP, we're looking at something like this:
<?php
// Build the list of number words; each array key equals the number it names
$numbers[] = "zero";
$numbers[] = "one";
$numbers[] = "two";
$numbers[] = "three";
$numbers[] = "four";
$numbers[] = "five";
$numbers[] = "six";
$numbers[] = "seven";
$numbers[] = "eight";
$numbers[] = "nine";
$numbers[] = "ten";
// array_rand() returns a random key, ie a number from 0 to 10
$add_one = array_rand($numbers);
$add_two = array_rand($numbers);
?>
<form method="POST" action="first.php">
Some text: <input name="Comment" type="TEXT" /><br /><br />
Please answer this simple question: what is <?php echo $numbers[$add_one]; ?> plus <?php echo $numbers[$add_two]; ?>?<br />
The answer is <input type="TEXT" name="CommentSumAnswer" /> (please write in numbers, eg 19)<br /><br />
<input type="SUBMIT" value="Add comment" />
<input type="HIDDEN" name="CommentSum" value="<?php echo md5(sha1($add_one + $add_two)); ?>" />
</form>
In that code, we have an array of eleven numbers - 0 to 10 - and we pick two randomly with array_rand() and add them together. The form gets submitted with the user's comment, their answer to the question, plus a coded version of the correct answer. SHA1 and MD5 are both hashing algorithms that generate a fixed-length string that can be used to validate another string. That is, setting an HTML field, "CommentSum", to the value of md5(sha1($add_one + $add_two)) will add the two numbers together (say they were 7 and 6, giving 13), then hash the result with the SHA1 algorithm, then hash the SHA1 output with MD5. We then transmit that to the server, and hash the user's answer in the same way - our two answers should match.
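The receiving end of that form might look something like the sketch below. The helper names hash_answer() and answer_is_correct() are invented for illustration; only the field names "CommentSumAnswer" and "CommentSum" come from the form above.

```php
<?php
// Hypothetical helper: hash an answer exactly as the form hashed the sum.
function hash_answer($answer)
{
    // Force the answer to a whole number first, so " 13 " and "13" match.
    return md5(sha1((string) (int) $answer));
}

// Hypothetical helper: compare the user's answer with the hash sent by the form.
function answer_is_correct($user_answer, $expected_hash)
{
    return hash_answer($user_answer) === $expected_hash;
}

// In first.php this would be called roughly like so:
// if (answer_is_correct($_POST['CommentSumAnswer'], $_POST['CommentSum'])) { ... }
```

Casting to an integer before hashing means stray spaces around the number do not cause a correct answer to be rejected.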
Yes, this is security through obscurity, but it's a surprisingly common trick - often a magic number is rolled into the mix, eg $add_one + $add_two + 0xBEEFEEBABE or 0xC0FFEE.
However, that's the least of our problems: have you spotted the fatal flaw in the current plan? Well, consider this scenario: a spammer visits our site personally to find out why his mass-spamming engine hasn't managed to fill our site with spam. There he sees the question, "what is five plus six" and also sees the hash value that tells the server what the answer should be, "c8180c19e5a1278cddf5248331ef7fa5". What's to stop him submitting the form by hand, sending "11" and "c8180c19e5a1278cddf5248331ef7fa5" each time? The answer is "nothing" - the server can't tell that the spammer is forcing the question and providing the set answer, so that renders our code easy to break.
The solution is quite simple: along with the question and answer, we provide a field marked "time" that records the time the form was shown to the user. Our answer hash should then be built from $add_one + $add_two + $time. When the form gets submitted, we check that the answer is correct, but also that the post was submitted within 30 minutes of the user being shown the form - we essentially make the hash answer valid for only half an hour.
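A minimal sketch of the timed check follows, under a few assumptions of our own: the names FORM_SECRET, FORM_LIFETIME, make_sum_hash() and sum_is_valid() are hypothetical, the secret plays the role of the "magic number" mentioned earlier, and the values are joined as a string rather than literally added (with plain addition, an answer of 12 at one second would hash the same as 11 a second later).

```php
<?php
// Assumption: a value known only to the server, like the "magic number" trick.
define('FORM_SECRET', 'change-me-to-something-private');
// The hash stays valid for 30 minutes, as described in the text.
define('FORM_LIFETIME', 30 * 60);

// Hypothetical helper: build the hash from the sum, the form time and the secret.
function make_sum_hash($sum, $time)
{
    return md5(sha1($sum . ':' . $time . ':' . FORM_SECRET));
}

// Hypothetical helper: true only if the answer matches and the form is fresh.
function sum_is_valid($answer, $time, $hash, $now)
{
    $time = (int) $time;
    // Reject forms that are too old, or that claim to come from the future.
    if ($time > $now || $now - $time > FORM_LIFETIME) {
        return false;
    }
    return make_sum_hash((int) $answer, $time) === $hash;
}
```

When showing the form you would output time() in a hidden "time" field next to make_sum_hash($add_one + $add_two, $time); on submission, pass time() as $now.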
Once we know that our input is from a real person, we can eliminate potentially bad data - this is much easier! Our goal here is:
- Trim all data to the size we're expecting
- Substitute reserved characters with their HTML equivalent
- Escape quotes
We can do all that with just one line of code, like this:
$mytext = addslashes(htmlspecialchars(substr(trim($mytext), 0, 50)));
The "50" part is arbitrary: you will want to change that to the length you're looking for.
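If you clean several fields, the three steps could be wrapped in a small function. This is only a sketch: clean_input() is a made-up name, and passing ENT_NOQUOTES is our choice so that htmlspecialchars() handles &, < and > while addslashes() is left to escape the quotes.

```php
<?php
// Hypothetical helper: trim to size, substitute HTML characters, escape quotes.
function clean_input($text, $max_length = 50)
{
    // 1) Strip surrounding whitespace and cut to the expected length.
    $text = substr(trim($text), 0, $max_length);
    // 2) Replace &, < and > with their HTML entities (quotes are left alone).
    $text = htmlspecialchars($text, ENT_NOQUOTES);
    // 3) Put a backslash in front of quotes and backslashes.
    return addslashes($text);
}

echo clean_input("  <b>it's</b>  "), "\n"; // prints &lt;b&gt;it\'s&lt;/b&gt;
```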
The final task is checking the referer, which again is quite simple because PHP provides it as a string in $_SERVER['HTTP_REFERER']. As mentioned, this is easy to spoof, and also some firewalls are configured to strip referer information out of requests being sent.
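As a rough illustration, the check could compare the host part of the referer with the host you expect your form to live on. The function name referer_matches_host() and the example hosts are assumptions, and because the header can be spoofed or stripped, treat a match as a hint rather than proof.

```php
<?php
// Hypothetical helper: does the referer point at the host we expect?
function referer_matches_host($referer, $expected_host)
{
    // A missing or empty referer cannot be checked.
    if (!is_string($referer) || $referer === '') {
        return false;
    }
    // parse_url() returns null or false when there is no usable host.
    $host = parse_url($referer, PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $expected_host) === 0;
}

// Typical call, guarding against the header being absent:
// $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
// $ok = referer_matches_host($referer, 'www.example.com');
```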
Once we have our login information, we can run it against our database to see whether this user is authorised. Of course, our database passwords aren't stored in plain text, right? Instead, we hash them with SHA1 so that even if our database gets broken into, the passwords will still be safe. This can be the easiest part of the entire process:
- When users set their password, SHA1 it and store it away
- When the user logs in next time, SHA1 their input and compare it against the stored copy
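Those two steps come down to very little code. In this sketch the function names are invented, and where the stored hash comes from (your database) is left out.

```php
<?php
// Hypothetical helper: what to store when a user sets their password.
function hash_password($password)
{
    return sha1($password);
}

// Hypothetical helper: compare a login attempt against the stored hash.
function password_matches($input, $stored_hash)
{
    return sha1($input) === $stored_hash;
}

$stored = hash_password('open sesame');          // save this in the database
var_dump(password_matches('open sesame', $stored)); // bool(true)
var_dump(password_matches('open says me', $stored)); // bool(false)
```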
So, yes, that's easy - but is it the most secure option? No. Is it even the most user-friendly option? Again, no. However, it is the easiest option if you're lazy: it puts password stealing out of reach of nearly everyone in the world, and only has two downsides:
- Users are required to enter their full password.
- Users cannot get a password reminder if they forget it
The first downside sounds like it is entirely negated by SSL because it sends the password encrypted over the wire. However, if you've ever wanted to check your email while at a conference or while sitting in an Internet café while abroad, you will know that SSL only secures half the equation - it stops people from reading the password when it's going over the web, but it doesn't stop people from reading the password if they have installed a local key logger. Paranoia rocks!
The second downside springs from the fact that SHA1 is a hash algorithm that performs one-way encryption. That is, you cannot get the plain text input from the SHA1 output. As a result, if a user forgets their password you cannot simply "decrypt" the SHA1 hash and send it to them.
Is there a solution here? Absolutely, but it takes our "easiest part of the entire process" and turns it into something that requires some mathematical analysis. [Somewhere in the distance you can hear the dull thud of a thousand LXF magazines being placed back onto their shelves at the book store.] Symmetric encryption is nothing to be afraid of, but there are a few terms you need to know:
- Block cipher - your source text gets split up into chunks when encrypted, and this decides how each chunk is handled.
- Ciphertext - this is the encrypted version of your source text.
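- Initialization Vector (IV) - this is a random value, created afresh for each encryption, that ensures the same input does not produce the same ciphertext twice. It does not have to be secret, but you need the same IV again to decrypt.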
- Key - this is the secret value that, combined with your IV, encrypts your data.
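On top of that, we also have a choice of encryption algorithms and key sizes - we will be using Rijndael, the algorithm behind the Advanced Encryption Standard (AES), but there are others to choose from. Note that the 256 in mcrypt's MCRYPT_RIJNDAEL_256 is the block size; AES proper is Rijndael with a 128-bit block (MCRYPT_RIJNDAEL_128).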
The complete encryption and decryption process looks like this:
- 1) Select an algorithm and block cipher
- 2) Create an IV
- 3) Create a key
- 4) Initialise the algorithm with the IV and key
- 5) Encrypt
- 6) Unload the algorithm, IV, and key
- 7) Reload the algorithm, IV, and key
- 8) Decrypt
- 9) Unload the algorithm, IV, and key
In code it's a little harder to read, but if you're still reading then I imagine nothing will scare you off!
<?php
$plaintext = "This is very important data";
$plainkey = "There's nowt as queer as folk";

// Steps 1-3: pick the algorithm and mode, then create the IV and key
$td = mcrypt_module_open(MCRYPT_RIJNDAEL_256, '', MCRYPT_MODE_CFB, '');
$iv = mcrypt_create_iv(mcrypt_enc_get_iv_size($td), MCRYPT_RAND);
$ks = mcrypt_enc_get_key_size($td);
$key = substr(sha1($plainkey), 0, $ks);

// Steps 4-6: initialise, encrypt, unload
mcrypt_generic_init($td, $key, $iv);
$ciphertext = mcrypt_generic($td, $plaintext);
mcrypt_generic_deinit($td);

// Steps 7-9: reload, decrypt, unload
mcrypt_generic_init($td, $key, $iv);
$decrypted = mdecrypt_generic($td, $ciphertext);
mcrypt_generic_deinit($td);
mcrypt_module_close($td);

echo <<<EOT
Input was "$plaintext"
Key text was "$plainkey"
IV was $iv
Key was $key
Ciphertext was $ciphertext
Decrypted was "$decrypted"
EOT;
?>
Do run that script before trying to understand it, if only for the fact that it's reassuring to see it print out working data before committing its mechanism to memory!
First up, we create $plaintext and $plainkey to hold the data we want to encrypt and the secret encryption string respectively. What you choose as your key is important, but don't worry about getting it to be any particular length - as you can see in the script, it gets passed into sha1() so that it uses more characters, then trimmed to the length of the key that the algorithm accepts.
The call to mcrypt_module_open() takes an algorithm as the first parameter and the block type as the third parameter - leave parameters two and four blank. As you can see in the code, the first parameter is where you select the algorithm you want - Rijndael is used here, but you can use others such as MCRYPT_SERPENT_256 or MCRYPT_TWOFISH_256. That said, you should keep in mind that no one will get fired for using Rijndael, as it is the algorithm behind AES, the recommended encryption standard. While the other two are very strong pieces of work - Twofish was Bruce Schneier's entry in the AES competition, and Serpent is generally regarded as the most conservative design of the three (and the slowest!) - you should only need them if AES doesn't fit your needs for some reason.
Next up, we create our IV using mcrypt_create_iv(). The first parameter takes the size of the IV to create (provided through the return value of mcrypt_enc_get_iv_size()), and the second is the source of randomness it should use - MCRYPT_RAND uses the system random number generator, while MCRYPT_DEV_RANDOM and MCRYPT_DEV_URANDOM read from /dev/random and /dev/urandom. The IV gets mixed in with your data as it is encrypted, so that the output looks more like white noise and identical input does not give identical output. After generating the key, we place it and the IV into the algorithm using mcrypt_generic_init(), then use mcrypt_generic() to actually perform the encryption.
In between encryption and decryption, you can see we use mcrypt_generic_deinit() to free up the resources - without this, decryption simply will not work (you're welcome to try it!) At the end of the script, mcrypt_module_close() is called to free all the resources up before the script terminates. Finally, all the variables are printed out so you can see exactly what has gone on - I would print the output here, but the encrypted text uses some pretty wacky characters that simply won't make it through the print process!
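The mcrypt extension was deprecated in PHP 7.1 and removed from PHP 7.2, so the script above will not run on a current installation. As a rough equivalent, and not the method this article was written around, here is the same encrypt-then-decrypt round trip using the OpenSSL extension. The choice of 'aes-256-cfb' and of hashing the key text with SHA-256 to get a 32-byte key are our assumptions.

```php
<?php
$plaintext = "This is very important data";
$plainkey  = "There's nowt as queer as folk";

// Assumption: AES with a 256-bit key in CFB mode, to mirror the mode used above.
$cipher = 'aes-256-cfb';

// Create a fresh random IV of the size this cipher expects.
$iv = random_bytes(openssl_cipher_iv_length($cipher));

// Derive a 32-byte binary key from the key text.
$key = hash('sha256', $plainkey, true);

// Encrypt, then decrypt again with the same key and IV.
$ciphertext = openssl_encrypt($plaintext, $cipher, $key, OPENSSL_RAW_DATA, $iv);
$decrypted  = openssl_decrypt($ciphertext, $cipher, $key, OPENSSL_RAW_DATA, $iv);

// The ciphertext is binary, so show it as hex to keep it printable.
echo 'Ciphertext (hex) was ', bin2hex($ciphertext), "\n";
echo 'Decrypted was "', $decrypted, "\"\n";
```

To decrypt later you need to store the IV alongside the ciphertext; only the key has to stay secret.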
Now we have working encryption we can store complete passwords in our database without losing the ability to decrypt them for checking. However, that is only a stepping-stone towards our goal: in order to fully secure our users they should only type in some of their password. For example, it would be a smart move to ask users to enter letters 2, 5, and 3 from their password one time, then 1, 4, and 5 the next time, etc. You then decrypt the password and check individual characters. This doesn't make your system foolproof (the universe is /always/ giving us better fools), but it does make it a great deal stronger: in order to be able to piece together the full password, a hacker must monitor several login attempts and store both the request (in order to know which letters the user is providing) and also the keys being typed.
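Checking individual letters could be sketched like this. Both function names are hypothetical, positions are counted from 1 as in the example above, and $password stands for the password after you have decrypted it.

```php
<?php
// Hypothetical helper: choose which letters to ask for (positions start at 1).
function pick_positions($password_length, $count = 3)
{
    $positions = range(1, $password_length);
    shuffle($positions);
    return array_slice($positions, 0, $count);
}

// Hypothetical helper: $answers maps a position to the letter the user typed.
function letters_match($password, array $answers)
{
    foreach ($answers as $position => $letter) {
        $index = $position - 1;
        // Fail if the position does not exist or the letter is wrong.
        if (!isset($password[$index]) || $password[$index] !== $letter) {
            return false;
        }
    }
    return true;
}

// For the password "secret", letters 2, 5 and 3 are "e", "e" and "c".
var_dump(letters_match('secret', array(2 => 'e', 5 => 'e', 3 => 'c'))); // bool(true)
```

Remember to keep the chosen positions on the server (in the session, say) so that the user cannot pick which letters they are asked for.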
We're at the last phase now, which should be a blessed relief for you, but don't let your brain switch off yet - this is the most important part of the whole operation. You see, it's at this point that we grant site access to our users - the point at which we hand them the keys to our Ferrari, as it were. This is dangerous because if a malicious user gets an authenticated user to relinquish their access privileges (either through social engineering or some technical wizardry) then they can cause havoc on your system and also leave you with a very irate customer.
So, what we're looking at here is how to allocate security privileges to a user so that it becomes incredibly hard to subvert those rights. The most common attack around right now is called session fixation, and is actually quite cunning. If you have investigated how PHP's sessions work, you'll know it sets a value called PHPSESSID with a random value like this one: 5p6oail4fcjie309su6dkoc6o4. That value gets stored in a cookie on the user's local machine and gets sent to the server with each request so that PHP knows which session to load - anyone able to guess the session ID of a valid user will be able to get their access rights.
Being both random and long makes the session ID incredibly hard to predict, however it turns out that you don't actually need to guess the ID at all - you can just pass a pre-generated value and PHP will use that. For example, <A HREF="http://www.somesite.com/foobar.php"> would link to foobar.php and allow PHP to generate a random session ID, but <A HREF="http://www.somesite.com/foobar.php?PHPSESSID=evilhaxxor"> would link to the site and use the session ID "evilhaxxor". J Random Villain need simply wait for someone to click their link and Mitnick is their uncle.
As per usual, there is a solution waiting in the wings. To foil our attacker, we can do one of two things:
- 1) Tie the session to a particular characteristic of the user that created it
- 2) Regenerate the session ID on privilege elevation
Of course, for maximum security we can do both, and that's what we're going to be looking at right now. Tying the session to a user can be done in a number of ways, but surprisingly the easiest way is also probably the best: we take the user agent of the visitor, and store it in the session. This then gets compared each time the page loads, to ensure this is the same person on the session. Step-by-step, we get this:
- Villain sets link with known session ID
- User clicks link, gets known session ID
- Site notes that user browses with Internet Explorer, stores in session
- User logs into their account
- Villain goes to site with same session
- Site notes that villain browses with Firefox, and cuts him off
- User carries on browsing innocently
Of course, more than 80% of the world uses Internet Explorer right now, however this doesn't make our solution less valid. For example, we tried IE out on a Windows XP box, and its user agent was "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" - that's pretty unique! If the villain doesn't have exactly the same versions of Windows, IE, and .NET as the user, then the user agent will be different.
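The user agent check could look roughly like the sketch below. The function name and the 'user_agent' session key are made up for this example, and the session is passed in as an array so the logic can be tried without a web server.

```php
<?php
// Hypothetical helper: tie a session to the browser that first used it.
function session_matches_agent(array &$session, $user_agent)
{
    // First visit on this session: remember the browser and carry on.
    if (!isset($session['user_agent'])) {
        $session['user_agent'] = $user_agent;
        return true;
    }
    // Later visits: the browser must be the same one we remembered.
    return $session['user_agent'] === $user_agent;
}

// On a real page, after session_start(), you might do something like:
// if (!session_matches_agent($_SESSION, $_SERVER['HTTP_USER_AGENT'])) {
//     session_destroy();
//     exit;
// }
```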
Regenerating the session ID is equally easy to do, thanks to the function session_regenerate_id(). By "privilege elevation" I mean "whenever your user acquires access rights". When they click the evilhaxxor link through to your site, they will be guests by default - they might not be able to post messages, buy things, etc. To do that, they must log in, which usually entails the entry of a username and password. When we authenticate them, they have their privileges elevated - they can do all the things they would expect to do with their account. It's at that precise moment that we should regenerate the session ID, because if they are the victims of session fixation then the villain will be left with the old, guest session, and our new user will get a fresh, clean, and safe session.
Using session_regenerate_id() is simple: just call it with no parameters, and ignore the return value. It automatically copies across the data from the old session to the new, and sends out a cookie to the user with the new session ID. The one problem that may catch you out is that last part: it sends out a new cookie, which means you /must/ call it before you send any HTML content, otherwise it won't work. Calling this function is very fast - if you were particularly paranoid you could change the session ID each page!
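Put together, the moment of privilege elevation might be sketched as follows. The function name log_user_in() and the session keys are assumptions, and it presumes you have already checked the username and password.

```php
<?php
// Hypothetical helper: call this once the user's credentials have been verified.
function log_user_in($username)
{
    // Make sure a session is running before we try to change its ID.
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start();
    }
    // Swap the (possibly fixated) guest session ID for a fresh one.
    session_regenerate_id();
    // Only now record the elevated privileges in the session.
    $_SESSION['username']  = $username;
    $_SESSION['logged_in'] = true;
}
```

As with session_start(), call this before any output is sent, because the new session ID goes out in a cookie header.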