<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Trypticon: Tag html</title>
    <link>http://trypticon.org/articles/tag/html?tag=html</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>If it ain't broke, break it.</description>
    <item>
      <title>RSS from HTML</title>
      <description>&lt;p&gt;From time to time I need to watch a bunch of links on a web page that doesn&amp;#8217;t have an RSS feed.&lt;/p&gt;

&lt;p&gt;I use RSS to automate Azureus, so this becomes a major problem for sites that list .torrent files but offer no feed.&lt;/p&gt;

&lt;h2&gt;Requirements&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Perl (duh.)&lt;/li&gt;
&lt;li&gt;LWP::UserAgent (gentoo ebuild: dev-perl/libwww-perl)&lt;/li&gt;
&lt;li&gt;XML::RSS (gentoo ebuild: dev-perl/XML-RSS)&lt;/li&gt;
&lt;li&gt;URI::Escape (gentoo ebuild: dev-perl/URI)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Code&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/perl

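use strict;
use warnings;

use LWP::UserAgent;  # dev-perl/libwww-perl
use XML::RSS;        # dev-perl/XML-RSS
use URI;             # dev-perl/URI, needed for new_abs below
use URI::Escape;     # dev-perl/URI, provides uri_unescape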

# This is where the data will be fetched from.
my $url = uri_unescape($ENV{'QUERY_STRING'});

# Create the feed...
my $rss = XML::RSS-&amp;gt;new( version =&amp;gt; '1.0' );

# Do a query on the URL...
my $ua = LWP::UserAgent-&amp;gt;new;
$ua-&amp;gt;timeout(10);
$ua-&amp;gt;env_proxy;
my $response = $ua-&amp;gt;get($url);
die $response-&amp;gt;status_line unless $response-&amp;gt;is_success;
my $content = $response-&amp;gt;content;

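# Get the basic metadata for the page as a whole.
# /i accepts upper-case TITLE tags; /s lets the title span several lines.
my $title = $url;  # fall back to the URL if the page has no title
if ($content =~ m/&amp;lt;title&amp;gt;\s*(.*?)\s*&amp;lt;\/title&amp;gt;/is) {
    $title = $1;
}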
#XXX: Later, see if it's worth doing the same for the description.

# This is the channel information.  Modify this as you please.
$rss-&amp;gt;channel(
    'title'        =&amp;gt; $title,
    'link'         =&amp;gt; $url,
#    'description'  =&amp;gt; "...",
);


# Now, find every link in the page.  This is where it gets fun...
# (/s lets the link text span several lines.)
while ($content =~ m/&amp;lt;a\s[^&amp;gt;]*?href\s*=\s*"(.*?)"[^&amp;gt;]*?&amp;gt;(.*?)&amp;lt;\/a&amp;gt;/gis) {
    my $link = $1;
    my $title = $2;

    # The link needs to be non-encoded, and absolute.
    $link = URI-&amp;gt;new_abs($link, $url)-&amp;gt;as_string;

    # Keep only .torrent links; the dot is escaped so it matches literally.
    next unless $link =~ m/\.torrent$/;

    $rss-&amp;gt;add_item(
        'title'        =&amp;gt; $title,
        'link'         =&amp;gt; $link,
    );
}

# Done, generate the RSS as a string.
print "Content-Type: application/rss+xml\n\n";
print $rss-&amp;gt;as_string;
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Usage&lt;/h2&gt;

&lt;p&gt;Put the script somewhere (e.g. /cgi-bin/rssify) and access it like this:&lt;/p&gt;

&lt;p&gt;http://example.com/cgi-bin/rssify?http://someothersite.com/myfiles/&lt;/p&gt;

&lt;p&gt;The script will then fetch http://someothersite.com/myfiles/ and turn it into an RSS 1.0 feed.&lt;/p&gt;
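
&lt;p&gt;If the target URL carries a query string of its own, it is probably safer to percent-encode it before appending it, since the script runs uri_unescape over the whole query string. A minimal sketch, assuming the script lives at the example location above; the ?page=2 target is a hypothetical illustration, not a real page:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape;

# Assumed location of the script, as in the example above.
my $script = 'http://example.com/cgi-bin/rssify';

# Hypothetical target page that has a query string of its own.
my $target = 'http://someothersite.com/myfiles/?page=2';

# Percent-encode the target so it travels as a single query string.
my $feed_url = $script . '?' . uri_escape($target);
print "$feed_url\n";

# The uri_unescape call in the script should give back the original target.
my $decoded = uri_unescape(uri_escape($target));
print "round trip ok\n" if $decoded eq $target;
&lt;/code&gt;&lt;/pre&gt;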

&lt;p&gt;The output is fairly simple. The feed title comes from the page title, and the feed link is the URL you gave it. The text of each link becomes the title of an item, and the full URL of the link becomes the link for that item. I didn&amp;#8217;t bother with dates, as dates don&amp;#8217;t make sense.&lt;/p&gt;

&lt;h2&gt;Caveats&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The script is currently far from secure, so do not put it on a live web site.&lt;/strong&gt; An attacker can pretty much enter any URL</description>
      <pubDate>Mon, 23 May 2005 10:00:00 +1000</pubDate>
      <guid isPermaLink="false">urn:uuid:de24a195072ea3afcbd5062322f91e9e</guid>
      <author>Trejkaz</author>
      <link>http://trypticon.org/articles/2005/05/23/rss-from-html</link>
      <category>syndication</category>
      <category>html</category>
      <category>rss</category>
      <category>software</category>
    </item>
  </channel>
</rss>
