I got an interesting question on IRC recently. How can you highlight some words/word parts in an HTML document?
The Challenge
- Wrap given words in text content with a span.
- Add a class to the span depending on the word.
- Do not touch elements, attributes, comments or processing instructions.
- Do it case-insensitively.
- Do it the safe way.
Select The Text Content
Well, this is the easy part. Get a FluentDOM object, find the part of the document to edit, and select all text nodes in it.
$fd = FluentDOM($html, 'html')
  ->find('/html/body')
  ->find('descendant-or-self::text()');
I used two XPath expressions because these are two separate steps, which lets me split them apart later. A single expression could use the abbreviated syntax for the axis, shortening it to "/html/body//text()".
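To illustrate what the single-expression form selects, here is a minimal sketch using PHP's plain DOM extension (DOMDocument and DOMXPath) instead of FluentDOM. The sample markup is invented for this example.

```php
<?php
// Sample markup (an assumption for illustration): the title, the class
// attribute and the comment all contain text that must not be touched.
$html = '<html><head><title>Skip me</title></head>'
    . '<body><p class="word">A word<!-- word --> and <b>more</b></p></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Abbreviated syntax: all text nodes below the body element.
$texts = array();
foreach ($xpath->query('/html/body//text()') as $textNode) {
    $texts[] = $textNode->nodeValue;
}
// $texts now holds 'A word', ' and ' and 'more'.
```

Only text nodes are returned, so the comment and the attribute value never reach the later steps.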
Loop
FluentDOM provides an "each()" method that expects a callback as its argument. The callback is executed for each node (in this case, each text node). The first argument of the callback is the node itself.
$fd->each(
  function ($node) use ($check, $highlights) {
    ...
  }
);
Prepare The Words
$highlights = array(
  'word' => 'classNameOne',
  'word_two' => 'classNameTwo'
);
I need to check each node against the words and split it at the words. It is a text value now, so the tool of choice is PCRE. To build a pattern from the words, I sort them by length first (longest first), then loop over them, escape them and concatenate them. The sorting is important if one word is part of another.
uksort(
  $highlights,
  function ($stringOne, $stringTwo) {
    $lengthOne = strlen($stringOne);
    $lengthTwo = strlen($stringTwo);
    if ($lengthOne > $lengthTwo) {
      return -1;
    } elseif ($lengthOne < $lengthTwo) {
      return 1;
    } else {
      return strcmp($stringOne, $stringTwo);
    }
  }
);
$check = '';
foreach ($highlights as $string => $class) {
  $check .= '|'.preg_quote(strtolower($string));
}
$check = '(('.substr($check, 1).'))iS';
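Why the sorting matters can be seen in a small sketch. PCRE tries the alternatives from left to right, so a shorter word listed first wins over a longer word that starts with it. The word list and the $buildPattern helper are made up for this example.

```php
<?php
// Hypothetical word list: 'word' is the beginning of 'wordpress'.
$words = array('word', 'wordpress');

// Hypothetical helper: escape the words and join them into an alternation,
// using the same "(" ... ")" delimiters as the pattern above.
$buildPattern = function (array $words) {
    return '((' . implode('|', array_map('preg_quote', $words)) . '))i';
};

// Shortest word first: the match stops after 'Word'.
preg_match($buildPattern($words), 'WordPress', $shortFirst);

// Longest word first: the whole word is matched.
usort($words, function ($a, $b) { return strlen($b) - strlen($a); });
preg_match($buildPattern($words), 'WordPress', $longFirst);

// $shortFirst[0] is 'Word', $longFirst[0] is 'WordPress'.
```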
Check And Divide
This pattern can now be used to check the text as well as to divide it. A direct replace would be a bad idea, because I need to insert a new element node (the span). Creating nodes using the DOM functions takes care of any special characters.
if (preg_match($check, $node->nodeValue)) {
  $parts = preg_split(
    $check, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE
  );
  ...
}
The option PREG_SPLIT_DELIM_CAPTURE puts the submatch into the $parts array, too. So it is possible to loop over all parts in their original order.
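A short sketch of what the split returns, using a hand-written pattern in the same shape as $check and an invented sample text:

```php
<?php
// Same shape as the generated pattern: longest word first, one capture group.
$check = '((word_two|word))i';

$parts = preg_split(
    $check, 'A Word and word_two here', -1, PREG_SPLIT_DELIM_CAPTURE
);
// $parts: 'A ', 'Word', ' and ', 'word_two', ' here'

// A text that starts with a word yields an empty first part ...
$leading = preg_split($check, 'word!', -1, PREG_SPLIT_DELIM_CAPTURE);
// $leading: '', 'word', '!'

// ... which PREG_SPLIT_NO_EMPTY would drop, if that is preferred.
$compact = preg_split(
    $check, 'word!', -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
// $compact: 'word', '!'
```

The matched words keep their original case in the result, which is why the lookup in the next step has to lowercase them first.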
To Wrap Or Not To Wrap
The $parts array contains the words as well as the text around them in separate strings. Each word needs a span with the matching class; all other parts become separate text nodes.
$items = array();
foreach ($parts as $part) {
  $string = strtolower($part);
  if (isset($highlights[$string])) {
    $span = $node
      ->ownerDocument
      ->createElement('span');
    $items[] = FluentDOM($span)
      ->addClass($highlights[$string])
      ->text($part)
      ->item(0);
  } else {
    $items[] = $node
      ->ownerDocument
      ->createTextNode($part);
  }
}
You now see the reason why I used lowercase versions of the words for keys in the $highlights array. It is easy to check if the $part is a word and get the class for the span.
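The same decision can be sketched with the plain DOM extension only. The $createNode helper is hypothetical and not part of FluentDOM. It also shows that text passed through createTextNode() is escaped on output.

```php
<?php
$highlights = array('word' => 'classNameOne');
$document = new DOMDocument();

// Hypothetical helper: a span for a known word, a plain text node otherwise.
$createNode = function ($part) use ($highlights, $document) {
    $key = strtolower($part);
    if (!isset($highlights[$key])) {
        return $document->createTextNode($part);
    }
    $span = $document->createElement('span');
    $span->setAttribute('class', $highlights[$key]);
    $span->appendChild($document->createTextNode($part));
    return $span;
};

// The lookup is lowercase, the output keeps the original case.
$wrapped = $document->saveXML($createNode('WORD'));
// <span class="classNameOne">WORD</span>

// Special characters in the surrounding text are escaped by the serializer.
$plain = $document->saveXML($createNode(' R&D <b> '));
// ' R&amp;D &lt;b&gt; '
```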
Replace The Text
The last step is easy again: replace the node with the list of created ones.
FluentDOM($node)->replaceWith($items);
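For comparison, here is one way to do the replacement without FluentDOM, as a sketch under the assumption that the new nodes are collected in a DOMDocumentFragment first. The sample document and the hand-built $items list stand in for the results of the earlier steps.

```php
<?php
$document = new DOMDocument();
$document->loadXML('<p>A word here</p>');
$node = $document->documentElement->firstChild; // the text node to replace

// Stand-in for the nodes created by the wrapping loop.
$span = $document->createElement('span');
$span->setAttribute('class', 'classNameOne');
$span->appendChild($document->createTextNode('word'));
$items = array(
    $document->createTextNode('A '),
    $span,
    $document->createTextNode(' here'),
);

// Collect the nodes in a fragment and swap it in for the text node.
$fragment = $document->createDocumentFragment();
foreach ($items as $item) {
    $fragment->appendChild($item);
}
$node->parentNode->replaceChild($fragment, $node);

$result = $document->saveXML($document->documentElement);
// <p>A <span class="classNameOne">word</span> here</p>
```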
More
This is the basic solution. It requires PHP 5.3 or later because it uses anonymous functions, but I also created another version that defines a class. You can find the full source of the class example in the FluentDOM SVN at svn://svn.fluentdom.org in examples/tasks/highlightWords.php or on Gist.
If you were to do this to, for example, highlight words that were searched for through a search engine, I would recommend doing all of this in JavaScript rather than post-processing in PHP.
It will make it easier to cache the PHP response, and IMHO a bit more reliable.
This would be a basic function of the result page, so it should not depend on JavaScript.
To expect JavaScript to be disabled is the biggest failure nowadays.
I expect it to be selectively enabled.
Here is how you can do it with Lucene and Solr. It can support highlighting phrases, wildcards, boolean queries, etc.
http://sigabrt.blogspot.com/2010/04/highlighting-query-in-entire-html.html
@Anonymous: For my part, I use the NoScript Firefox plugin to ensure a minimum of anonymity while I am surfing through the net (which, by the way, is not too seldom), or I use the Lynx command-line browser in an SSH session to find a quick answer to a question. So I have JavaScript disabled by default or, in the case of Lynx, not available.
On the other hand, in my opinion it is a shame that I have to activate JavaScript to browse a plain web page that just returns plain text, a search result for example. Did everyone forget about 'seamless degradation' and responsible, lightweight web design?