Did you ever try to scrape content from an HTML document using regular expressions? That is a bad idea (read here why!).
With FluentDOM it is easy:
Get all links
Just create a FluentDOM instance from the HTML string, find all links using XPath and map the nodes to an array.
<?php
require('FluentDOM/FluentDOM.php');
$html = file_get_contents('http://www.papaya-cms.com/');
$links = FluentDOM($html, 'html')
  ->find('//a[@href]')
  ->map(
    function ($node) {
      // collect the href attribute of each matched <a> element
      return $node->getAttribute('href');
    }
  );
var_dump($links);
?>
Expand local URLs
Need to edit the links? It is pretty much the same:
<?php
require('FluentDOM/FluentDOM.php');
$url = 'http://www.papaya-cms.com/';
$html = file_get_contents($url);
$fd = FluentDOM($html, 'html')
  ->find('//a[@href]')
  ->each(
    function ($node) use ($url) {
      $item = FluentDOM($node);
      // prepend the base url to every href that has no scheme (e.g. "http://")
      if (!preg_match('(^[a-zA-Z]+://)', $item->attr('href'))) {
        $item->attr('href', $url.$item->attr('href'));
      }
    }
  );
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>