I needed to strip some DOM nodes out of an HTML file. I would have used sed, but some of the tags span multiple lines, and sed and regexes don’t really understand HTML/XML: they get confused as soon as you have nested tags of the same type. In the end I decided to use PHP’s built-in DOMDocument class. It is fairly strict and throws warnings if the HTML isn’t well formed, so first I ran the file through PHP’s tidy extension. This isn’t installed by default, but you can add it with:
sudo apt-get install php5-tidy
So first fix the malformed HTML:
<?php
$html = file_get_contents("myfile.html");
$config = array(
    'indent' => true,
    'output-xhtml' => true,
    'wrap' => 0,
);
$tidy = tidy_parse_string($html, $config, 'UTF8');
$tidy->cleanRepair();
//And then load it into DOMDocument:
$doc = new DOMDocument();
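// loadHTML() expects a string, so pass the repaired markup rather than the tidy object.
$doc->loadHTML(tidy_get_output($tidy));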
?>
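If tidy isn’t available, a possible alternative (not what I used here) is to let libxml collect its parse complaints instead of printing them as PHP warnings. This is only a minimal sketch: the sample markup in $html is made up for illustration, and which problems get reported depends on the libxml2 version installed.

```php
<?php
// Hypothetical input: deliberately sloppy markup (the <p> is never closed).
$html = '<html><body><p>Hello <b>world</b></body></html>';

// Collect libxml parse problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Each entry is a LibXMLError object with level, line and message properties.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
```

The parser repairs what it can, so $doc is still usable afterwards. Tidy remains the more thorough clean-up if you also want consistent output.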
Then it’s just a matter of ripping out the tags you don’t want. Note how we iterate over the $nodes variable: it MUST be done this way if you plan to remove the nodes (as I do), because the list returned by getElementsByTagName() is live, so nodes disappear from it as soon as they are removed from the document. A foreach will do odd things (probably terminate after the first node), and a forward for-loop will skip every other node. Instead, keep removing the first item in the list until the list is empty:
<?php
$nodes = $doc->getElementsByTagName("script");
while ($nodes->length > 0) {
    $node = $nodes->item(0);
    remove_node($node);
}
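// Detach $node from its parent. Objects are passed by handle in PHP,
// so no reference (&) is needed.
function remove_node($node) {
    $pnode = $node->parentNode;
    // Strictly optional: removeChild() below already detaches the whole subtree.
    remove_children($node);
    $pnode->removeChild($node);
}
// Recursively remove every descendant of $node, deepest first.
function remove_children($node) {
    while ($node->firstChild) {
        // Empty out the first child before removing it.
        if ($node->firstChild->firstChild) {
            remove_children($node->firstChild);
        }
        $node->removeChild($node->firstChild);
    }
}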
?>
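To see the live-list problem in isolation, here is a small sketch that runs a forward for-loop against a test document, then tries a different workaround from the while-loop above: copying the list into a plain array with iterator_to_array() before removing anything. The build_doc() helper and its markup are made up for this illustration.

```php
<?php
// Hypothetical helper: builds a small document containing four <script> elements.
function build_doc(): DOMDocument {
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head><script>var a;</script><script>var b;</script></head>'
        . '<body><p>Keep me</p><script>var c;</script><script>var d;</script></body></html>'
    );
    return $doc;
}

// Forward for-loop over the live list: every removal shifts the remaining
// items down by one while $i still moves up, so nodes get skipped.
$doc = build_doc();
$nodes = $doc->getElementsByTagName('script');
for ($i = 0; $i < $nodes->length; $i++) {
    $node = $nodes->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterForLoop = $doc->getElementsByTagName('script')->length;

// Snapshot approach: the array is a fixed copy, so removals do not affect it.
$doc = build_doc();
$snapshot = iterator_to_array($doc->getElementsByTagName('script'));
foreach ($snapshot as $node) {
    $node->parentNode->removeChild($node);
}
$leftAfterSnapshot = $doc->getElementsByTagName('script')->length;

echo "for-loop left: $leftAfterForLoop, snapshot left: $leftAfterSnapshot\n";
```

With four script elements, the forward loop removes the first and third and leaves the other two behind, while the snapshot version removes all of them. Whether you prefer the snapshot or the while-loop is mostly a matter of taste; the snapshot costs one extra array.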