A long time ago, on a domain not my own, I wrote a PHP parser that handled pretty much all versions of RSS/RDF XML feeds. I released it for free, and a few people used it. It wasn’t pretty code, but it was small, and it was easy to use, and it didn’t care what sort of feeds you threw at it: it just chewed them up and spat out links. In fact, it didn’t even care if those feeds validated perfectly.

Today I’m releasing the first ever upgrade of that script.

HuddledParser is still written in PHP (tested in PHP 4.x), but now it is object-oriented, and it handles ATOM feeds as well. (tada!) Now, just to be clear: this is still not a fancy-shmancy feed formatter. I made it object-oriented because it saves me some code and a bunch of nasty global variables.

HuddledParser doesn’t try to present all the data that is in a feed; that’s not its purpose. Rather, it creates a headline summary from a feed by grabbing the title, link and summary for the feed and for each entry (regardless of feed format), and putting them into a list for you. You have the option of specifying a maximum number of entries to parse, and even a maximum length for the “summaries”, so you can cut feeds like LockerGnome’s full content feeds down to a size that you can feel comfortable putting in a sidebar on your web page. And you never have to worry about other sites’ feeds breaking your layout, because all the HTML is stripped from their content (even if they specify a CDATA section).
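To make the "regardless of feed format" part concrete, here is a rough sketch of how titles and links can be pulled out of both RSS/RDF items and ATOM entries with PHP's built-in XML parser. It is not HuddledParser's actual code: the function name extractHeadlines() and its arguments are made up for illustration, and it uses closures, so it needs PHP 5.3 or later rather than the PHP 4 the script was tested on.

```php
<?php
// Sketch only: extractHeadlines() is a hypothetical name, not HuddledParser's API.
function extractHeadlines($xml, $maxEntries = 10) {
    $entries = array();   // finished entries
    $current = null;      // entry being built, or null when outside one
    $field   = null;      // 'title' or 'link' while inside that element

    $parser = xml_parser_create();
    // Case folding is on by default, so element and attribute names arrive upper-cased.
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$current, &$field) {
            if ($name === 'ITEM' || $name === 'ENTRY') {
                // RSS/RDF <item> or ATOM <entry>: start a new entry.
                $current = array('title' => '', 'link' => '');
            } elseif ($current !== null && $name === 'TITLE') {
                $field = 'title';
            } elseif ($current !== null && $name === 'LINK') {
                if (isset($attrs['HREF'])) {
                    // ATOM: the URL is an attribute; keep only the first "alternate" link.
                    $rel = isset($attrs['REL']) ? $attrs['REL'] : 'alternate';
                    if ($rel === 'alternate' && $current['link'] === '') {
                        $current['link'] = $attrs['HREF'];
                    }
                } else {
                    // RSS/RDF: the URL is the element's text.
                    $field = 'link';
                }
            }
        },
        function ($p, $name) use (&$entries, &$current, &$field, $maxEntries) {
            if (($name === 'ITEM' || $name === 'ENTRY') && $current !== null) {
                if (count($entries) < $maxEntries) {
                    $entries[] = array_map('trim', $current);
                }
                $current = null;
            }
            $field = null;
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$current, &$field) {
            if ($current !== null && $field !== null) {
                $current[$field] .= $data;   // text may arrive in several chunks
            }
        }
    );
    // The return value of xml_parse() is ignored on purpose: whatever was
    // collected before a parse error is still returned.
    xml_parse($parser, $xml, true);
    return $entries;
}
```

Each entry comes back as an array with 'title' and 'link' keys, which you could then loop over to print a list. Taking only the "alternate" link is a choice made for this sketch, so that ATOM entries carrying several link elements end up with a single URL.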

Feel free to grab the source and play with it. Please let me know if you have any problems, or encounter any feeds that it cannot parse.

Before I go into details on how to use this script, let me say this: there’s still one thing I’d like this to do that it doesn’t do now. Currently, when you call ParseFeed(...) you get text back which corresponds to the selected information. What I’d like to do is make it so that you could call ParseFeed(...) multiple times, and it would nest all that data inside a list of lists. Then, when you wanted to output the feeds, you could call ToHTML(...), or you could enumerate through the nested lists yourself (say, if you wanted to create DHTML menus or combo-boxes or something). But that will wait until the next version.

Basically, it works like this:


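include "HuddledParser2.php";
// At most 10 entries per feed; don't print the summaries under each title.
$feedParser = new HuddledParser(10, false);
// Fetch the feed (or its cached copy, stored under the name "HMO") and print the list.
print $feedParser->ParseFeed("http://www.HuddledMasses.org/feed/rss2/", "HMO");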
 

The two required arguments for the HuddledParser constructor are the maximum number of entries to fetch per-feed, and whether or not to show the “summary” of the entries (the summary will still be inserted as the “title” attribute of each link, but it won’t clutter your page up). The two required arguments for the ParseFeed function are the URL of the feed, and a unique name for that feed (for the cache file). There are several optional arguments for each:

HuddledParser( $maxEntries, $showSummary, $cacheFolder, $cacheTime, $titleLimit, $summaryLimit )

  • maxEntries – A number. The maximum number of entries to retrieve
  • showSummary – A boolean (true or false). Whether or not to print out the summaries in divs under each entry title.
  • cacheFolder – A path. Defaults to an “xmlcache/” subfolder of the folder it’s called from. This must be writeable, so we can store our cache files there.
  • cacheTime – A number. The number of seconds before we consider our cache “old” and refetch the feed. Defaults to 3600 (one hour).
  • titleLimit – A number. The maximum length (in characters) of titles. Note: we do not cut words off, so this is approximate. Defaults to -1, no limit.
  • summaryLimit – A number. The maximum length (in characters) of summaries. Note: we do not cut words off, so this is approximate. Defaults to -1, no limit.
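The two length limits, together with the HTML stripping mentioned earlier, can be pictured with something like the sketch below. It is an illustration only, not the script's own code: plainSummary() is a hypothetical name, and it assumes that "we do not cut words off" means the word straddling the limit is allowed to finish.

```php
<?php
// Hypothetical helper for illustration; not taken from HuddledParser's source.
function plainSummary($content, $limit = -1) {
    // Feed content is often entity-escaped HTML, so decode first, then strip tags.
    $text = strip_tags(html_entity_decode($content, ENT_QUOTES, 'UTF-8'));
    // Collapse the whitespace left behind by the removed markup.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    // -1 means "no limit". strlen() counts bytes, so the limit is only
    // approximate for multi-byte text.
    if ($limit < 0 || strlen($text) <= $limit) {
        return $text;
    }
    // Assumption: finish the word that straddles the limit instead of cutting it.
    $end = strpos($text, ' ', $limit);
    return $end === false ? $text : substr($text, 0, $end) . '...';
}
```

The result is plain text, so if you print it into a page yourself it should still go through htmlspecialchars() first.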

ParseFeed( $url, $cacheName, $title, $summary, $maxEntries, $cacheTime )

  • url – A URL. The URL of the XML feed to parse.
  • cacheName – A file name. The name of a file to store the cache in. This must be unique to each feed.
  • title – Text. Allows you to override the title of the feed
  • summary – Text. Allows you to override the summary of the feed contents
  • maxEntries – A number. Override (for this feed) the maximum number of entries to fetch
  • cacheTime – A number. Override (for this feed) the number of seconds to consider our cache “fresh”
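For anyone wondering how cacheName and cacheTime fit together, the idea is roughly the one sketched here. Treat it as an assumption about the general technique rather than a copy of what the script does: the function names cacheIsFresh() and fetchWithCache() are invented for this example, and it relies on file_put_contents(), which needs PHP 5.

```php
<?php
// Hypothetical helpers illustrating a time-based file cache; the names are
// assumptions and do not come from HuddledParser itself.
function cacheIsFresh($cacheFile, $cacheTime = 3600) {
    if (!file_exists($cacheFile)) {
        return false;
    }
    clearstatcache();   // make sure filemtime() is not served from PHP's stat cache
    return (time() - filemtime($cacheFile)) < $cacheTime;
}

function fetchWithCache($url, $cacheFile, $cacheTime = 3600) {
    if (cacheIsFresh($cacheFile, $cacheTime)) {
        return file_get_contents($cacheFile);
    }
    // Cache is missing or old: refetch the feed.
    $xml = @file_get_contents($url);
    if ($xml === false) {
        // The feed is unreachable: fall back to the stale copy if there is one.
        return file_exists($cacheFile) ? file_get_contents($cacheFile) : false;
    }
    file_put_contents($cacheFile, $xml);   // the cache folder must be writeable
    return $xml;
}
```

Because every feed is written to the file named by its cache name, two feeds sharing a name would overwrite each other, which is why the name has to be unique.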

Edit: May 11, 2006

This script is now also available as a WordPress plugin

12 Responses to “HuddledParser 2.0”

  • Jan van Gessel says:

    Great little parser.
    It has saved me a lot of work and headaches as this one handles the atom feeds as well (gmail/blogger etc.) in contrast to most others.
    I’m using it from my private page to view my feeds.
    I hope you don’t mind that I build a little on it.
    I have adjusted it a little to use cURL instead of fopen.
    This in turn allowed me to add some more code to view authenticated pages (gmail in my case) simply by supplying it with a url like: https://gmailusername:password@gmail.google.com/gmail/feed/atom/.
    The only other thing I changed was moving the $showSummary boolean to the parseFeed function, as for some reason I didn’t get summaries with the original setup.
    If you are interested I’ll be happy to send you the changed code.
    Thanks and good luck
    Jan

  • Well, I admit I hadn’t tried it before, mainly because I’m just using this on web-pages, and I don’t have any private feeds that I want to make public [;)] but I just tried it on my gmail atom feed, and it works fine with https://user:password@host using fopen.

  • Joel, I would suggest you use fsockopen instead of fopen – just because fsockopen supports a connection timeout to the remote url. This way if a feed is unavailable you can handle it.

    Other than that it sounds like a nice little package. I will have to download it and see if it will suit my needs better than the current parser I use, lastRSS (which also uses fopen but is easily modified to use fsockopen).

  • Joel, one more thing – to guarantee uniqueness on your saved cache file name you can try this:

    $cacheName = $this->cacheFolder . '/xmlcache_' . md5($url);

    using the md5 hash on the url will pretty much guarantee that you get a unique file name for each cached feed.

  • eugene says:

    any thoughts on allowing for enclosure fetching?

  • Jon B says:

    I have been looking around for a PHP script that can parse feeds but preferably liberally – from what you say this seems to do that (Magpie doesn’t). However you only parse for summaries and my aim is to build a complete server side personal RSS aggregator (the current ones about don’t ‘float my boat’, I have ‘needs’).

    Any ideas? is this suitable with modifications?

    Thanks

  • Jaykul says:

    Of course you could use this and modify it … it’s really just a simple example of how to use PHP’s xml parsing ;) The only reason it’s limited to what it is, is that I’m just not interested in competing with existing server-side aggregators; all I wanted was something I could use on my site.

  • d3v says:

    is there a live demo of this? would really like to see this at work.

  • ben says:

    is there a n0ob proof manual for the installation of this script? would be great if that script could be used by other people than nerds. thank you very much for your reply.

    Ben

  • Hello.

    Thank you for HuddledParser. I’ve been trying forever to figure out how to pull the link element from Blogger and HuddledParser is doing it, except Blogger’s feed includes two link elements and it’s adding both to the href attribute.

    Example: href=“https://www.blogger.com/atom/8090863/109380437787746685http://onlytheweb.blogspot.com/2004/08/quick-reply-in-opera-mail.html” title=”“>Quick Reply in Opera Mail

    How can I remove the first link?

    Thanks.