
Screen scraping your way into RSS

Author : Dennis Pallett

Introduction

RSS is one of the hottest technologies at the moment, and even
big web publishers (such as the New York Times) are getting into
RSS as well. However, there are still a lot of websites that do
not have RSS feeds.

If you still want to be able to check those websites in your
favourite aggregator, you need to create your own RSS feed for
those websites. This can be done automatically with PHP, using a
method called screen scraping. Screen scraping is usually
frowned upon, as it's mostly used to steal content from other
websites.

I personally believe that in this case, automatically
generating an RSS feed, screen scraping is not a bad thing. Now,
on to the code!

Getting the content

For this article, we'll use PHPit as an example,
despite the fact that PHPit already has RSS feeds.

We'll want to generate an RSS feed from the content listed on the
front page (http://www.phpit.net). The first step in
screen scraping is getting the complete page. In PHP this can be
done very easily, by using implode("", file("[the url here]")),
if your web host allows it. If you can't use file(), you'll have
to use a different method of getting the page, e.g. the cURL library.
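
The fetch step with a cURL fallback could be sketched as follows. This is a minimal sketch, not the article's original code: fetch_page() is a hypothetical helper name, and the timeout and redirect settings are assumptions you may want to change.

```php
<?php
// Hypothetical helper (not from the original article): fetch a page as one string.
function fetch_page(string $url): string
{
    // file_get_contents() can only open remote URLs when allow_url_fopen is enabled.
    if (ini_get('allow_url_fopen')) {
        $data = @file_get_contents($url);
        if ($data !== false) {
            return $data;
        }
    }

    // Otherwise fall back to the cURL extension, if it is installed.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects (assumed to be wanted)
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed 10 second limit
        $data = curl_exec($ch);
        if ($data !== false) {
            return $data;
        }
    }

    throw new RuntimeException("Could not fetch $url");
}

// Usage (performs a network request, so it is left commented out here):
// $data = fetch_page("http://www.phpit.net/");
```

Trying file_get_contents() first keeps the common case short; the cURL branch only runs when the host has disabled URL access for the file functions or the first attempt failed.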

Now that we have the content available, we can parse it for the
content using some regular expressions. The key to screen
scraping is looking for patterns that match the content, e.g.
are all the content items wrapped in <div>'s or something
else? If you can successfully discover a pattern, then you can
use preg_match_all() to get all the content items.
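
As an illustration of pattern matching with preg_match_all(), here is a small sketch on made-up markup. The "entry" class name and the sample HTML are assumptions for the example, not PHPit's real markup.

```php
<?php
// Made-up sample markup; a real page would come from the fetch step.
$html = <<<HTML
<div class="entry"><h3><a href="/a">First</a></h3></div>
<div class="entry"><h3><a href="/b">Second</a></h3></div>
HTML;

// Using '#' as the delimiter avoids escaping the '/' in closing tags.
// The 's' modifier lets '.' match newlines inside an item.
$count = preg_match_all('#<div class="entry">(.*?)</div>#s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured inner HTML.
foreach ($matches[1] as $inner) {
    echo trim($inner), "\n";
}
```

Note that a non-greedy match like this stops at the first closing </div>, so it only works when the content items do not contain nested <div> elements.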

For PHPit, the pattern that matches the content is <div
class="contentitem">[Content Here]</div>. You
can verify this yourself by going to the main page of PHPit and
viewing the source.

Now that we have a match, we can get all the content items. The
next step is to retrieve the individual information, i.e. url,
title, author and text. This can be done by using some more regular
expressions and str_replace() on each content item.

By now we have the following code:

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
// [^`] matches any character except a backtick, including newlines
preg_match_all("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);

Like I said, the next step is to retrieve
the individual information, but first let's make a start on
our feed by setting the appropriate header (text/xml) and
printing the channel information.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
    <title>PHPit Latest Content</title>
    <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
    <link>http://www.phpit.net</link>
    <language>en-us</language>

Now it's time to loop through the items and print
their RSS XML. We loop through each item and get all the
information we need, using more regular expressions and
preg_match(). After that, the RSS for the item is printed.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title
    preg_match("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
    $title = $temp[1];
    $title = strip_tags($title);
    $title = trim($title);

    // Second, get url
    preg_match("/<a href=\"([^`]*?)\">/", $match, $temp);
    $url = $temp[1];
    $url = trim($url);

    // Third, get text
    preg_match("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
    $text = $temp[1];
    $text = trim($text);

    // Fourth, and finally, get author
    preg_match("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
    $author = $temp[1];
    $author = trim($author);

    // Echo RSS XML
    echo "<item>\n";
    echo "\t<title>" . strip_tags($title) . "</title>\n";
    echo "\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
    echo "\t<description>" . strip_tags($text) . "</description>\n";
    echo "\t<content:encoded><![CDATA[\n";
    echo $text . "\n";
    echo "\t]]></content:encoded>\n";
    echo "\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "</item>\n";
}
?>

And finally, the RSS file is closed off.

</channel>
</rss>

That's all. If you put
all the code together, like in the demo script, then you'll have
a working RSS feed.
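
One caveat with printing scraped text into XML: strip_tags() removes markup but leaves characters such as & unescaped, which can make the feed invalid. A possible way to harden the item output is sketched below. xml_escape() and rss_item() are hypothetical helper names, www.example.com is a placeholder, and the ISO-8859-1 charset is assumed to match the feed header.

```php
<?php
// Hypothetical helper: turn scraped HTML text into safe XML element content.
function xml_escape(string $text): string
{
    // Drop tags, decode HTML entities, then re-escape for XML.
    $plain = html_entity_decode(strip_tags($text), ENT_QUOTES, 'ISO-8859-1');
    return htmlspecialchars($plain, ENT_XML1 | ENT_QUOTES, 'ISO-8859-1');
}

// Hypothetical helper: build one <item>; $html goes into a CDATA section.
function rss_item(string $title, string $link, string $html): string
{
    // A literal "]]>" would end the CDATA section early, so split it in two.
    $cdata = str_replace(']]>', ']]]]><![CDATA[>', $html);

    return "<item>\n"
         . "\t<title>" . xml_escape($title) . "</title>\n"
         . "\t<link>" . xml_escape($link) . "</link>\n"
         . "\t<content:encoded><![CDATA[" . $cdata . "]]></content:encoded>\n"
         . "</item>\n";
}

// Made-up values for illustration.
echo rss_item('Tips &amp; Tricks', 'http://www.example.com/?a=1&b=2', '<p>Hello</p>');
```

Decoding entities before escaping avoids double-encoding text that already contains entities such as &amp; in the scraped HTML.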

Conclusion

In this tutorial I have shown you how to create an RSS feed
from a website that does not have an RSS feed of its own yet.
Though the regular expressions are different for each website,
the principle is exactly the same.

One thing I should mention is that you shouldn't immediately
screen scrape a website's content. E-mail them first about an RSS
feed. Who knows, they might set one up themselves, and that
would be even better.
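
To illustrate how the same principle carries over to other sites, the sketch below keeps the scraping logic fixed and passes the site-specific patterns in as parameters. scrape_items() is a hypothetical helper, and the markup and patterns belong to an imaginary site.

```php
<?php
// Hypothetical helper: split $html into items, then pull named fields from each item.
function scrape_items(string $html, string $itemPattern, array $fieldPatterns): array
{
    $items = [];
    preg_match_all($itemPattern, $html, $blocks);

    // $blocks[1] holds the first capture group of every matched item.
    foreach ($blocks[1] ?? [] as $block) {
        $item = [];
        foreach ($fieldPatterns as $name => $pattern) {
            $found = preg_match($pattern, $block, $m) === 1;
            // Missing fields become empty strings instead of raising notices.
            $item[$name] = $found ? trim(strip_tags($m[1])) : '';
        }
        $items[] = $item;
    }

    return $items;
}

// Made-up markup and patterns for an imaginary site.
$html = '<li class="post"><a href="/p/1">One</a></li>'
      . '<li class="post"><a href="/p/2">Two</a></li>';

$items = scrape_items($html, '#<li class="post">(.*?)</li>#s', [
    'title' => '#<a [^>]*>(.*?)</a>#s',
    'url'   => '#href="([^"]*)"#',
]);

print_r($items);
```

Only the pattern arguments change from one site to the next; the loop that turns the resulting array into RSS items can stay the same.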

2007 articlesreader.com - All Rights Reserved