Screen scraping your way into RSS
Author : Dennis Pallett
Introduction
RSS is one of the hottest technologies
at the moment, and even big web publishers (such as the New York
Times) are getting into RSS. However, there are still a
lot of websites that do not have RSS feeds.
If you still want to be able to check those websites in your
favourite aggregator, you need to create your own RSS feed for
them. This can be done automatically with PHP, using a
method called screen scraping. Screen scraping is usually
frowned upon, as it's mostly used to steal content from other
websites.
I personally believe that in this case, to automatically
generate an RSS feed, screen scraping is not a bad thing. Now,
on to the code!
Getting the content
For this article, we'll use PHPit as an example,
despite the fact that PHPit already has RSS feeds.
We want to generate an RSS feed from the content listed on the
front page (http://www.phpit.net). The first step in
screen scraping is getting the complete page. In PHP this can be
done very easily by using implode("", file("[the url here]")),
if your web host allows it (file() can only open URLs when the
allow_url_fopen setting is enabled). If you can't use file(), you'll have
to use a different method of getting the page, e.g. the cURL library.
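As a minimal sketch of that fallback, the helper below tries PHP's URL-aware file functions first and switches to cURL when they are unavailable. The function name fetch_page() and the 15-second timeout are assumptions made for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper (not from the original article): fetch a page as a string.
function fetch_page(string $url): string
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled.
    if (ini_get('allow_url_fopen')) {
        $data = @file_get_contents($url);
        if ($data !== false) {
            return $data;
        }
    }

    // Fall back to the cURL extension if it is loaded.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed timeout, in seconds
        $data = curl_exec($ch);
        if ($data !== false) {
            return $data;
        }
    }

    throw new RuntimeException("Could not fetch $url");
}
```

Calling fetch_page("http://www.phpit.net/") would then stand in for the implode("", file(...)) call, whichever method the host supports.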
Now that we have the content available, we can parse it for the
content using some regular expressions. The key to screen
scraping is looking for patterns that match the content, e.g.
are all the content items wrapped in <div>'s or something
else? If you can successfully discover a pattern, then you can
use preg_match_all() to get all the content items.
For PHPit, the pattern that matches the content is <div
class="contentitem">[Content Here]</div>. You
can verify this yourself by going to the main page of PHPit and
viewing the source.
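To see what preg_match_all() returns for a pattern like this, here is a small sketch run against made-up sample markup rather than the real PHPit page. The helper name extract_items() is hypothetical, and the pattern uses .*? with the s modifier, which is one common way to match lazily across newlines.

```php
<?php
// Hypothetical helper: return the inner HTML of every content item.
function extract_items(string $html): array
{
    // .*? is lazy, so each match stops at the first closing </div>;
    // the s modifier lets . match newlines as well.
    preg_match_all('/<div class="contentitem">(.*?)<\/div>/s', $html, $matches);

    // $matches[0] holds the full matches, $matches[1] the captured group.
    return $matches[1];
}

// Made-up sample markup, standing in for a downloaded page.
$sample = <<<HTML
<div class="contentitem">
<h3><a href="/first">First article</a></h3>
</div>
<div class="contentitem">
<h3><a href="/second">Second article</a></h3>
</div>
HTML;

$items = extract_items($sample);
echo count($items), " items found\n"; // prints "2 items found"
```

Note that a pattern like this stops at the first </div> it meets, so it only works when the content items do not contain nested <div> elements.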
Now that we have a pattern, we can get all the content items. The
next step is to retrieve the individual pieces of information, i.e. URL,
title, author, and text. This can be done by using some more regular
expressions and str_replace() on each content item.
By now we have the following code:

    <?php
    // Get page
    $url = "http://www.phpit.net/";
    $data = implode("", file($url));

    // Get content items. [^`]*? lazily matches any characters,
    // newlines included, as long as there is no backtick among them.
    preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/',
        $data, $matches);

As mentioned, the next step is to retrieve
the individual pieces of information, but first let's make a start on
our feed by sending the appropriate header (text/xml) and
printing the channel information.

    // Begin feed
    header("Content-Type: text/xml; charset=ISO-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    ?>
    <rss version="2.0"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
        xmlns:admin="http://webns.net/mvcb/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <channel>
        <title>PHPit Latest Content</title>
        <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
        <link>http://www.phpit.net</link>
        <language>en-us</language>

Now it's time to loop through the items and print
their RSS XML. For each item, we get all the
information we need by using more regular expressions and
preg_match(). After that, the RSS for the item is printed.

    <?php
    // Loop through each content item
    foreach ($matches[0] as $match) {
        // First, get title
        preg_match('/">([^`]*?)<\/a><\/h3>/', $match, $temp);
        $title = $temp[1];
        $title = strip_tags($title);
        $title = trim($title);

        // Second, get url
        preg_match('/<a href="([^`]*?)">/', $match, $temp);
        $url = $temp[1];
        $url = trim($url);

        // Third, get text
        preg_match('/<p>([^`]*?)<span class="byline">/', $match, $temp);
        $text = $temp[1];
        $text = trim($text);

        // Fourth, and finally, get author
        preg_match('/<span class="byline">By ([^`]*?)<\/span>/', $match, $temp);
        $author = $temp[1];
        $author = trim($author);

        // Echo RSS XML
        echo "<item>\n";
        echo "\t<title>" . strip_tags($title) . "</title>\n";
        echo "\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
        echo "\t<description>" . strip_tags($text) . "</description>\n";
        echo "\t<content:encoded><![CDATA[\n";
        echo $text . "\n";
        echo "\t]]></content:encoded>\n";
        echo "\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
        echo "</item>\n";
    }
    ?>

And finally, the RSS file is closed off.

    </channel>
    </rss>

That's all. If you put
all the code together, like in the demo script, then you'll have
a working RSS feed.
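One caveat worth keeping in mind: strip_tags() removes tags but leaves characters such as & untouched, and an unescaped & is not valid inside XML text. The sketch below shows one possible way to build an item with its values escaped using htmlspecialchars(). The function name rss_item() and the sample values are made up for illustration and are not part of the demo script.

```php
<?php
// Hypothetical helper: build one <item> element with its text escaped for XML.
function rss_item(string $title, string $link, string $text, string $author): string
{
    // Escape &, <, > and quotes so they cannot break the XML.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'ISO-8859-1');
    };

    return "<item>\n"
        . "\t<title>" . $esc(strip_tags($title)) . "</title>\n"
        . "\t<link>" . $esc($link) . "</link>\n"
        . "\t<description>" . $esc(strip_tags($text)) . "</description>\n"
        . "\t<dc:creator>" . $esc(strip_tags($author)) . "</dc:creator>\n"
        . "</item>\n";
}

// Made-up values for illustration only.
echo rss_item(
    'Forms & Validation',
    'http://www.phpit.net/example?a=1&b=2',
    '<p>Tips & tricks</p>',
    'A. Writer'
);
```

With these sample values the title is emitted as Forms &amp; Validation, which an XML parser reads back as the original text.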
Conclusion
In this tutorial I have shown you how
to create an RSS feed for a website that does not have an RSS
feed of its own yet. Though the regular expressions are different
for each website, the principle is exactly the same.
One thing I should mention is that you shouldn't immediately
screen scrape a website's content. E-mail the owners first and ask about an RSS
feed. Who knows, they might set one up themselves, and that
would be even better.