PHP - Tidy extension

From LXF Wiki

Practical PHP

(Original version written by Paul Hudson for LXF issue 64.)


Tidying up after yourself is easier than you thought...


The Tidy PHP extension is one of the least understood out there, despite its name giving away what it does. Part of the problem is that it was being written alongside development of PHP 5, and so saw substantial revision before it was finalised - if you used an early version of Tidy, you can be guaranteed you need to relearn it now.

Another part of the problem is that few people see any need for Tidy at all: it smartens up the HTML output of your PHP scripts - it pretty-prints it. Consider this piece of PHP code:

<?php 
$monkeys = array('Minky','Manky','Stinky','Woopsy','Fuzzy','Scuzzy','Bubbles'); // define monkey array
$num_monkeys = count($monkeys);	// count names in array		 
for ($i = 0; $i < $num_monkeys;  $i++) echo $monkeys[$i]. "<br />"; // loop names
?>

This functioning code fragment loops through an array holding the names of monkeys, printing out the names separated by HTML line breaks. While this might display just fine on-screen, anyone who clicks "View Source" in their browser will see a mess:

Minky<br />Manky<br />Stinky<br />Woopsy<br />Fuzzy<br />Scuzzy<br />Bubbles<br />

By comparison, the use of tidy on the above code gives the following result:

 Minky<br />
 Manky<br />
 Stinky<br />
 Woopsy<br />
 Fuzzy<br />
 Scuzzy<br />
 Bubbles<br />

In the raw output, the names are separated by HTML line breaks that look fine in a web browser, but there are no textual line breaks to make the source easier to read. "Aha!" you say, "I didn't want those dirty pirates stealing my code anyway!" Perhaps. However, this is an Open Source world: I've learnt a huge amount about HTML (probably more than is safe for any one person to know) by examining the code of others, and I think it's important to help others learn, too. Sure, they /could/ decipher our HTML if they were sufficiently talented, but a little help from Tidy will make their lives much easier, while making us look like benevolent code geniuses. Everyone's a winner!

Spring cleaning

There are dozens of options for Tidy that allow you to be really specific with your needs. However, these are largely irrelevant for our purposes: we're just interested in the tidying aspect of Tidy, as opposed to the formatting and restyling aspect. As such, our basic Tidy script is very simple:

1. Create a Tidy object
2. Tell it to clean up and repair our HTML
3. Print out the results

By default, Tidy works to HTML 3.2: it will top and tail output with the appropriate <html> and <body> tags, normalise the case of tags, and remove any HTML elements that aren't in the standard. In order to tidy, we need something messy, so create a file called input.html with this content:

<TITLE>Linux Format</TITLE>

<statement type="true">Linux Format is a great magazine.</STATEMENT>  Particularly that Paul Hudson - he's <B>the best programmer around!

There are a few things wrong with that HTML sample:

* It's missing <html>, <head>, <body>, </body>, </head>, and </html>
* <TITLE> is in uppercase; properly formatted HTML tags are lowercase
* <statement> is an illegal HTML tag.  It also is terminated by </STATEMENT>,
  which is a different case.
* <B> should be lowercase, and also isn't terminated.
* Clearly if anything is a true statement, it's me being the best programmer around!

Anyway, we can run that HTML file through Tidy with just the tiniest smidge of PHP:

<?php
	$tidy = new tidy("input.html");
	$tidy->cleanRepair();
	echo $tidy;
?>

First up, we create an instance of the tidy class, passing into the constructor the filename we want to work with. Then, the cleanRepair() method is called, which will fix our code. Finally, the object is echoed - if this sounds alien to you, you should dig up your back issues and read the PHP tutorial from LXF49 where the __toString() magic function was first discussed. Here, that function is being used to automatically print out the tidied content - it's much smarter than having to write echo $tidy->getContent() or something similar.

That code will output the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Linux Format</title>
</head>
<body>
Linux Format is a great magazine. Particularly that Paul Hudson -
he's <b>the best programmer around!</b>
</body>
</html>

So, it now has a doctype header, all the missing tags, normalised character case, plus no more illegal HTML elements.

We can accomplish the same thing using strings rather than files by using the parseString() function, like this:

<?php
	$input = file_get_contents("input.html");
	$tidy = new tidy();
	$tidy->parseString($input);
	$tidy->cleanRepair();
	echo $tidy;
?>

This time the input text gets sucked into the $input string then passed through Tidy. Of course, that string could also have been created through heredoc, or generated through some complex instructions - Tidy handles any string input you can throw at it.
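As a rough illustration of the heredoc route, the sketch below builds the messy markup directly in the script rather than reading it from disk. It assumes the tidy extension is loaded, and the sample markup is made up for the example.

```php
<?php
// Sketch only: assumes the tidy extension is loaded.
// The markup below is invented sample input, not taken from input.html.
$input = <<<HTML
<TITLE>Heredoc test</TITLE>
<P>Built in a string, <B>never saved to disk
HTML;

$tidy = new tidy();
$tidy->parseString($input);   // parse the string instead of a file
$tidy->cleanRepair();         // fix tag case and close the open tags

$output = (string) $tidy;     // same text that "echo $tidy" would print
echo $output;
```

Casting the object to a string is just another way of triggering the same conversion that echo relies on, which is handy if you want to store the result rather than print it.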

Playing with the options

The basic output has three annoyances: it's only HTML 3.2, it doesn't indent the HTML, and it wraps lines it thinks are too long. All of these - and many more options - can be tweaked in Tidy by setting options before parsing the input. Specifically, we can get started by fixing our three annoyances. This is done by passing in an array of options where the keys are the options you want to set, and the values are, well, the values. Simple, really. This array then needs to be passed into the Tidy constructor as the second parameter.

Try this code out:

<?php
	$options = array();
	$options["indent"] = true;
	$options["output-xhtml"] = true;
	$options["wrap"] = 0;

	$tidy = new tidy("input.html", $options);
	$tidy->cleanRepair();
	echo $tidy;
?>

Setting indent to true indents the code, setting output-xhtml to true forces XHTML-compliant output, and setting wrap to 0 forces Tidy to not wrap lines no matter how many characters they are. If you do want it to wrap, but want to specify your own value in number of characters, just replace the 0.

This time, the output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      Linux Format
    </title>
  </head>
  <body>
    Linux Format is a great magazine. Particularly that Paul Hudson - he's <b>the best programmer around!</b>
  </body>
</html>

As you can see, we now have smartly indented XHTML, with all the code errors fixed, and - most importantly - no line wrapping. Perfect! If the tinkerer deep within you is hungry for more to play with, try these extra options out:

  • word-2000. Got someone who generates their HTML using Microsoft Word? It produces nightmarishly bad code, but setting this to true will fix it.
  • fix-backslash. Some people, particularly Windows users, like writing URLs with backslashes rather than forward slashes. This insanity can be corrected by setting fix-backslash to true.
  • show-body-only. If you want Tidy to fix all the HTML without topping and tailing it with HTML header and footer bumf, set this option to true. This is most commonly used if you want to store the content in a database, then serve it up with a header and footer later in the process.

As the PHP Tidy extension is based directly on the Tidy library used elsewhere, you can use any of the settings listed in the main Tidy manual at http://tidy.sourceforge.net/docs/quickref.html. The manual also lists all the default options, which should give you a good idea of what output you can expect.
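To give a feel for how these options combine, here is a minimal sketch that cleans up a fragment for storage using show-body-only, together with a custom wrap width. It assumes the tidy extension is loaded; the fragment and the value 60 are picked purely for illustration.

```php
<?php
// Sketch only: assumes the tidy extension is loaded.
// $fragment stands in for content you might want to store in a database.
$fragment = '<P>Some user-submitted text with <B>an unclosed tag';

$options = array(
    'show-body-only' => true,  // no doctype, <html>, <head> or <body> wrappers
    'output-xhtml'   => true,  // XHTML-style output, as in the example above
    'wrap'           => 60,    // wrap at 60 characters instead of never (0)
);

$tidy = new tidy();
$tidy->parseString($fragment, $options);  // options go in as the second parameter
$tidy->cleanRepair();

$clean = (string) $tidy;
echo $clean;
```

The options array works the same way whether it is passed to the constructor or to parseString(), so you can reuse one array across both styles.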


Checking for errors

Apart from just fixing your HTML, you might also like to know where you went wrong so you can learn and avoid the problem next time. Again, this is really easy to do with Tidy because it has the errorBuffer variable that stores a list of all the problems in your code. Here's how that looks in PHP:

<?php
	$tidy = new tidy("input.html");
	$tidy->cleanRepair();
	if ($tidy->errorBuffer) {
		echo "Your HTML had errors!\n";
		echo $tidy->errorBuffer;
	} else {
		echo $tidy;
	}
?>

Run that script, and you'll get output like this:

Your HTML had errors!
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Error: <statement> is not recognized!
line 3 column 1 - Warning: discarding unexpected <statement>
line 1 column 1 - Warning: plain text isn't allowed in <head> elements
line 3 column 60 - Warning: discarding unexpected </statement>
line 3 column 111 - Warning: missing </b>

As you can see, Tidy gives us the line and column of all the errors, plus a description of the problem. That code works because errorBuffer is blank if no issues were found, making the conditional statement fail.
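If you want to do more than print the buffer, you can split it up yourself. The sketch below uses a hypothetical helper, parse_tidy_errors(), which is not part of the extension, and it assumes every line follows the "line N column N - Level: message" shape shown in the output above; other Tidy versions may word things differently. It runs on a sample string so you can try it without Tidy installed.

```php
<?php
/**
 * Split a Tidy error buffer into structured entries.
 * parse_tidy_errors() is a hypothetical helper, not part of the tidy extension.
 */
function parse_tidy_errors(string $buffer): array
{
    $entries = array();
    foreach (explode("\n", trim($buffer)) as $line) {
        // Assumed shape: "line 3 column 1 - Warning: some message"
        if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', trim($line), $m)) {
            $entries[] = array(
                'line'    => (int) $m[1],
                'column'  => (int) $m[2],
                'level'   => $m[3],
                'message' => $m[4],
            );
        }
    }
    return $entries;
}

// Sample lines copied from the output above; in a real script pass $tidy->errorBuffer.
$sample = "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
        . "line 3 column 1 - Error: <statement> is not recognized!\n"
        . "line 3 column 111 - Warning: missing </b>";

// Report only the real errors, skipping the warnings.
foreach (parse_tidy_errors($sample) as $entry) {
    if ($entry['level'] === 'Error') {
        echo "Line {$entry['line']}: {$entry['message']}\n";
    }
}
```

Lines that don't match the assumed shape are silently skipped, so the helper degrades to an empty list rather than failing if the format changes.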


Tidying everything

The last Tidy-related topic I want to look at is how to pipe all your output through Tidy automatically, skipping all the tidy class nonsense. This is done through the output buffering system, and, although you're bored of me saying this, it's very easy to do. We covered output buffering back in LXF46 (I can hear the back issues phone ringing already...!), and you may well recall that a normal buffered page looks something like this:

<?php
	ob_start();
	echo "Some content";
	ob_end_flush();
?>

We can add Tidy to that just by changing the ob_start() line so that it calls the Tidy output buffering handler. Dropping that into our existing script gives the following:

<?php
	ob_start('ob_tidyhandler');

	$tidy = new tidy("input.html");
	$tidy->cleanRepair();

	if ($tidy->errorBuffer) {
		echo "Your HTML had errors!\n";
		echo $tidy->errorBuffer;
	} else {
		echo $tidy;
	}

	ob_end_flush();
?>

However, that won't work as you might think. The problem is that we're running Tidy inside Tidy, and one of the errors in our input is a missing <!DOCTYPE> declaration. If we leave that in, Tidy sees it and gets confused the second time around, so what we really need to do is pass $tidy->errorBuffer through the htmlentities() function before printing it out. That will convert < and > to &lt; and &gt; respectively, thus making it safe for output.
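One way the corrected script might look is sketched below. It assumes the tidy extension is loaded, and to stay self-contained it parses an inline string standing in for input.html; the <pre> wrapper is my own addition to keep the report readable.

```php
<?php
// Sketch only: assumes the tidy extension is loaded.
// $input stands in for the contents of input.html.
$input = "<TITLE>Linux Format</TITLE>\n\n"
       . '<statement type="true">Linux Format is a great magazine.</STATEMENT> '
       . "He's <B>the best programmer around!";

ob_start('ob_tidyhandler');   // everything echoed below gets tidied on the way out

$tidy = new tidy();
$tidy->parseString($input);
$tidy->cleanRepair();

if ($tidy->errorBuffer) {
    echo "<p>Your HTML had errors!</p>\n";
    // Escape < and > so the outer Tidy pass sees the report as text, not markup.
    $report = htmlentities($tidy->errorBuffer);
    echo "<pre>$report</pre>\n";
} else {
    echo $tidy;
}

ob_end_flush();
```

With the angle brackets escaped, the outer pass has no stray <!DOCTYPE> or <statement> to trip over.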

That pretty much wraps up our coverage of Tidy - it is remarkably simple to learn, and even easier to use. Once you switch to using the output buffer system rather than using objects, it's essentially transparent - it's no work for you, and you're benefiting budding hackers worldwide!


Non sequitur

Back in LXF50 we briefly looked at SimpleXML, and how it made XML simple from PHP 5 onwards. You probably read it, tried it once, and forgot it. Well, as I promised it would be a mish-mash month in terms of the topics covered, I want to show you something in SimpleXML that should leave you quite impressed. It's nothing new, mind you - it's been around about five years now, but thanks to the snail-like speed of XML development most people still don't know very much about it. To what am I referring? XPath, of course!

Once you get addicted to the speed and power of SQL, it's a little painful to go back to boring old XML. Sure, it works anywhere you put it, but it's also clumsy to parse and hard to search through. XPath changes that by allowing you to filter your XML using quite intelligent queries.

As with Tidy, we need a subject to demonstrate how this works, so save this text as an XML file in your scripts directory:

<squirrels>
<squirrel>
<name>Pinky</name>
<age>3</age>
<colour>Reddish</colour>
</squirrel>
<squirrel>
<name>Chip</name>
<age>2</age>
<colour>Greyish</colour>
</squirrel>
<squirrel>
<name>Perky</name>
<age>5</age>
<colour>Reddish greyish</colour>
</squirrel>
<squirrel>
<name>Dale</name>
<age>2</age>
<colour>Reddish</colour>
</squirrel>
<squirrel>
<name>Nick</name>
<age>35</age>
<colour>Grey</colour>
</squirrel>
</squirrels>

So our XML file has a group of squirrels, each of which has a name, an age, and a colour. Note the stacking of the elements: the parent element is <squirrels>, then there are lots of <squirrel> elements, each of which contains <name>, <age>, and <colour>. This is important, because XPath's precision relies upon you telling it exactly what kinds of element you're looking for.

For example, if we want to pick up all <name> elements that belong to squirrels, we could specify an XPath search like this: /squirrels/squirrel/name. That is, "get all squirrels elements, then all squirrel elements inside, then all the name elements inside". This is done with the xpath() function, which takes that XPath search string as its only parameter, and returns an array of matches. In code, that is:

<?php
	$xml = simplexml_load_file("squirrels.xml");
	$names = $xml->xpath('/squirrels/squirrel/name');
	foreach ($names as $name) {
		echo "Found a squirrel called $name!\n";
	}
?>

The output from that script is this:

Found a squirrel called Pinky!
Found a squirrel called Chip!
Found a squirrel called Perky!
Found a squirrel called Dale!
Found a squirrel called Nick!

The foreach loop iterates over the return value from xpath(), which with that query is an array of SimpleXMLElement objects, one per <name>, each of which prints as its text content. However, what do you do if you want to get the names /and/ ages of the squirrels? If we amend the query to /squirrels/squirrel you'll see - xpath() returns an array of squirrel objects that come complete with all their child elements as variables. So, we could amend our code to this:

<?php
	$xml = simplexml_load_file("squirrels.xml");
	$squirrels = $xml->xpath('/squirrels/squirrel');

	foreach ($squirrels as $squirrel) {
		echo "\nFound a squirrel\n";
		echo " Name: {$squirrel->name}\n";
		echo " Age: {$squirrel->age}\n";
	}
?>

We can get even more vague by asking for any <name> element regardless of its ancestry, by using //name. The two slashes are important - it's not just a typo! Using this method, any <name> in the XML will get returned regardless of whether it's a squirrel name or a chipmunk name, or anything else that matches the name. So, in code, we have:

$names = $xml->xpath('//name');
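To see the difference between the two styles of query, here is a small self-contained sketch. The <animals> document and the chipmunk in it are invented for the example, and it uses simplexml_load_string() so there is no file to create.

```php
<?php
// Made-up data: one <name> under a squirrel, one under a chipmunk.
$xml = simplexml_load_string(
    '<animals>'
    . '<squirrel><name>Pinky</name></squirrel>'
    . '<chipmunk><name>Alvin</name></chipmunk>'
    . '</animals>'
);

// strval turns each matched element into its text content.
$all = array_map('strval', $xml->xpath('//name'));
$squirrelsOnly = array_map('strval', $xml->xpath('/animals/squirrel/name'));

echo implode(', ', $all), "\n";            // prints "Pinky, Alvin"
echo implode(', ', $squirrelsOnly), "\n";  // prints "Pinky"
```

The //name query picks up both animals because it ignores ancestry, while the full path only matches names that sit directly under a squirrel.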


From generic to specific

"Vague" works to a degree: as it stands, we can grab info on all the squirrels then filter through it with PHP. However, that's equivalent to running a "SELECT *" SQL query then using PHP to sift through it - it's much better to do it inside the query. We can do this with XPath too: rather than asking for every squirrel, we can specify exact values or a particular range we're interested in.

This is done using square brackets wherever you want a query, and you can use standard operators like <, >=, or =. For example:

$names = $xml->xpath('/squirrels/squirrel[age<=3]/name');

Hopefully what that XPath query does should be immediately obvious: it returns the names of all squirrels aged 3 or under. We can extend this to show all squirrels that have age <= 3 or age > 5 by joining two queries with the | (union) operator, like this:

$names = $xml->xpath('/squirrels/squirrel[age<=3]/name|/squirrels/squirrel[age>5]/name');

The little | does kinda disappear on that line, but it is there and is crucial to the operation of the query. Alternatively, we can query the results of a query by simply having more than one set of square brackets, like this:

$names = $xml->xpath('//squirrel[age<=3][name="Pinky"]/name');

So that time we use a wildcard to match all squirrel elements, then filter it on age (leaving Pinky, Chip, and Dale), then finally filter it only for squirrels called Pinky. I don't know about you, but I'm sick of squirrels - fortunately that's XPath explained! As we've seen, XPath really does transform XML from a static data source into a filterable, queryable storage system that, while still a long way away from being an SQL-like database, is still a huge improvement.
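If you'd like to check those filters without creating squirrels.xml, the sketch below runs them against a cut-down copy of the data (names and ages only) held in a string. The names_for() helper is hypothetical, written just for this example, and the query using "or" inside a single set of square brackets is an alternative of mine to the | version, not something taken from the text above.

```php
<?php
// Cut-down copy of the squirrel data: names and ages only.
$xml = simplexml_load_string(<<<XML
<squirrels>
<squirrel><name>Pinky</name><age>3</age></squirrel>
<squirrel><name>Chip</name><age>2</age></squirrel>
<squirrel><name>Perky</name><age>5</age></squirrel>
<squirrel><name>Dale</name><age>2</age></squirrel>
<squirrel><name>Nick</name><age>35</age></squirrel>
</squirrels>
XML
);

// Hypothetical helper: run a query and return the matches as plain strings.
function names_for(SimpleXMLElement $xml, string $query): array
{
    return array_map('strval', $xml->xpath($query));
}

$young  = names_for($xml, '/squirrels/squirrel[age<=3]/name');
$union  = names_for($xml, '/squirrels/squirrel[age<=3]/name|/squirrels/squirrel[age>5]/name');
$either = names_for($xml, '/squirrels/squirrel[age<=3 or age>5]/name');
$pinky  = names_for($xml, '//squirrel[age<=3][name="Pinky"]/name');

echo implode(', ', $young), "\n";   // prints "Pinky, Chip, Dale"
echo implode(', ', $union), "\n";   // prints "Pinky, Chip, Dale, Nick"
echo implode(', ', $either), "\n";  // prints "Pinky, Chip, Dale, Nick"
echo implode(', ', $pinky), "\n";   // prints "Pinky"
```

Perky, aged 5, falls through both conditions, which is a quick way to confirm the comparisons are being done numerically rather than as text.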