25 Mar 2013

I needed to strip some DOM nodes out of an HTML file. I would have used sed, but some of the tags span multiple lines, and sed and regexes don’t really understand HTML/XML; they get confused by nested tags of the same type. In the end I decided to use PHP’s built-in DOMDocument class. It is fairly strict: it throws warnings, and can fail to load, if the HTML isn’t well formed, so first I ran the file through PHP’s tidy extension. This isn’t installed by default, but on Debian/Ubuntu you can add it with:

sudo apt-get install php5-tidy

So first fix the malformed HTML:

<?php
$html = file_get_contents("myfile.html");
$config = array(
	'indent'         => true,
	'output-xhtml'   => true,
	'wrap'           => 0);
$tidy = tidy_parse_string($html, $config, 'UTF8');
$tidy->cleanRepair();

//And then load it into DOMDocument:

$doc = new DOMDocument();
$doc->loadHTML(tidy_get_output($tidy)); // pass the repaired markup as a string
?>
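
If you want to see exactly what DOMDocument objects to before deciding whether the tidy pass is worth it, the parser's complaints can be collected instead of printed as warnings. This is only a rough sketch, assuming the DOM extension is enabled; the sample markup and the $broken, $loaded and $errors names are made up for illustration:

```php
<?php
// Hypothetical sample input: the <p> elements are never closed
// and <foo> is not an HTML tag.
$broken = '<html><body><p>one<p>two<foo>three</foo></body></html>';

// Ask libxml to store parse errors instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($broken);

// Grab whatever was recorded, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with line, column and message properties.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

What gets reported (if anything) depends on the libxml version PHP was built against, so treat the list as a hint rather than a verdict.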

Then it’s just a matter of ripping out the tags you don’t want. Note how we iterate over the $nodes variable: getElementsByTagName() returns a live collection, so as nodes are removed from the document they also disappear from the collection. If you’re planning on removing the nodes (as I am), it MUST be done this way. A foreach will do some odd stuff, probably stopping after the first node, and an index-based for-loop will have you missing every other node. Instead, keep removing the first item in the collection until the collection is empty:

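<?php
// getElementsByTagName() returns a live list: removing a node shrinks it.
$nodes = $doc->getElementsByTagName("script");
while ($nodes->length > 0) {
    // Always take item 0; the next match moves into its place after removal.
    remove_node($nodes->item(0));
}

// Detach a node from its parent.
// Objects are passed by handle, so no & is needed on the parameter.
function remove_node($node) {
    $pnode = $node->parentNode;
    remove_children($node);
    $pnode->removeChild($node);
}

// Recursively empty a node. removeChild() already detaches the whole
// subtree, so this step is optional.
function remove_children($node) {
    while ($node->firstChild) {
        remove_children($node->firstChild);
        $node->removeChild($node->firstChild);
    }
}
?>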
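
To see the live-collection problem for yourself, here is a small comparison. It is a sketch rather than part of the script above: the sample markup and variable names are invented, and it assumes iterator_to_array() can take a snapshot of the node list (DOMNodeList is traversable, so this should hold). The index-based loop should leave two of the four script tags behind, while looping over a snapshot removes them all:

```php
<?php
// Hypothetical sample document with four <script> elements.
$html = '<html><body><p>keep</p>'
      . '<script>a()</script><script>b()</script>'
      . '<script>c()</script><script>d()</script>'
      . '</body></html>';

// 1) Index-based for-loop over the live list: the list shrinks while
//    $i grows, so every other node is skipped.
$doc = new DOMDocument();
$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('script');
for ($i = 0; $i < $nodes->length; $i++) {
    $node = $nodes->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterFor = $doc->getElementsByTagName('script')->length;

// 2) Copy the live list into a plain array first, then remove.
//    The array does not change while we loop over it.
$doc2 = new DOMDocument();
$doc2->loadHTML($html);
$snapshot = iterator_to_array($doc2->getElementsByTagName('script'));
foreach ($snapshot as $node) {
    $node->parentNode->removeChild($node);
}
$leftAfterSnapshot = $doc2->getElementsByTagName('script')->length;

echo "left after for-loop: $leftAfterFor\n";
echo "left after snapshot: $leftAfterSnapshot\n";

// Serialise the cleaned document back to HTML.
$cleaned = $doc2->saveHTML();
echo $cleaned;
```

The snapshot approach is just an alternative to the while loop; both avoid indexing into a list that is changing underneath you.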
