Well, I’ve decided to refer to my SOM Recommender by the initialism SOMR (which, for the sake of argument, I pronounce “sommer”: like “summer” but with an o), and I’ve been working on it for about a month. It’s been a busy month outside of this project, with end-of-year stuff at work and, of course, the Christmas holiday with family, but I’m basically on schedule despite that.

A Better Project Site

Rather than just keep everything here in my blog, I created a Trac website so that those of you who are interested can easily track my progress. It’s at somr.jaykul.org. I’m hosting that site on my home server (on residential cable), so if you have problems accessing it, it’s probably just momentarily offline. Trac has a timeline that lets you easily see the Subversion check-ins, as well as my progress against the milestones I set out in my project proposal.

The SOMR site also lets you browse the source in the Subversion tree, and it has a wiki, which I’ll be updating more frequently with progress and ideas that come to me as I’m working. I’ll still post updates here once a month or so for those who are less interested in the details.

Current Progress

The WebDownloader and scraper are done, and I’ve created the database schema and saved a bunch of SQL scripts so I can recreate it later as part of the installer.

I also created a SomrDataSet class to handle the interface to the database storage, and a small set of tests for each of these items to validate that they work.

Problems with Tests

At this point, most of these tests serve more as examples of how to use the classes than as comprehensive tests, so I have in mind to get some coverage measurement going as well, to make sure the tests reach 100% code coverage going forward. That will be awfully difficult with SomrDataSet.Designer.cs, though, which has quite a lot of generated code in it that I may not really need.

The Windows RSS Platform

I’ve also discovered the new Windows RSS Platform, which is part of IE7 and, as a result, is built into Vista. I’ve created a simple test case to see how it works, and it’s easy to use and fairly slick. It seems like the absolute best way to parse the recent feed, because it will continuously download the feed in the background even when SOMR isn’t running.

Using the RSS platform would mean that SOMR itself would never have to download the feed for recent items at start up, and would be guaranteed a larger number of items to evaluate, since the RSS Platform service can retrieve the feed as often as every five minutes even when SOMR isn’t running.

However, it’s probably not the best way to handle downloading user or URL pages, as downloading through the RSS Platform seems to require adding the feed to the collection and then invoking the download method. Considering the number of feeds we’d be processing (a feed for each URL we find in the recent feed), it seems like a bad idea to add them to the RSS Platform collection, since we don’t want to be downloading thousands of feeds on a regular basis.

Although it might be simplest to parse all the feeds through the Windows RSS “normalizer,” I’m not entirely convinced. There are basically two feeds I have to deal with on del.icio.us: the recent feed and the URL feeds. Even if I’m getting the recent feed through the platform, it might be worth handling the URL feeds myself.
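To give a rough idea of what “handling the URL feeds myself” could involve, here’s a minimal sketch in Python (not the language SOMR is written in) that pulls the title, link, and tags out of a feed using only the standard library. Everything in it is an assumption for illustration: the sample XML is made up, the parse_items function is hypothetical, and the real del.icio.us feeds may well use different element names or XML namespaces.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed for illustration only; a real del.icio.us feed
# may use different elements and namespaces.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example URL feed</title>
    <item>
      <title>First bookmark</title>
      <link>http://example.com/a</link>
      <category>python</category>
      <category>rss</category>
    </item>
    <item>
      <title>Second bookmark</title>
      <link>http://example.com/b</link>
      <category>rss</category>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per <item> with its title, link, and list of tags."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Treat each <category> element as one tag (an assumption).
            "tags": [c.text for c in item.findall("category") if c.text],
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["link"], entry["tags"])
```

Parsing by hand like this keeps the thousands of URL feeds out of the RSS Platform collection, at the cost of having to deal with feed quirks that the normalizer would otherwise smooth over.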

Issues:

How much data do I need to validly map a URL?

Can I really tell if a URL is interesting when it’s only been bookmarked by 2 people (unless I knew those people were “like me”)?

Is it possible it might be interesting later?

That is, if I test a URL initially when it’s only been bookmarked by 2 people, and it fails to be interesting based on keyword tags, should I retest it when it’s been bookmarked by 5 or 10 people? How about when it’s been bookmarked by 25 or 50?
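One way to make that question concrete is to treat the bookmark counts as thresholds and only retest a URL when it has crossed a new one since it was last tested. This is just a sketch in Python under that assumption: the threshold values are the counts mentioned above, the function names are hypothetical, and whether retesting is worthwhile at all is still an open question.

```python
import bisect

# Bookmark counts at which a URL would be (re)tested. These are the
# counts floated in the text, not tuned values.
RETEST_THRESHOLDS = [2, 5, 10, 25, 50]


def threshold_level(bookmark_count, thresholds=RETEST_THRESHOLDS):
    """Return how many thresholds the given bookmark count has reached."""
    return bisect.bisect_right(thresholds, bookmark_count)


def should_retest(count_at_last_test, current_count,
                  thresholds=RETEST_THRESHOLDS):
    """True if the URL has crossed a new threshold since its last test."""
    return (threshold_level(current_count, thresholds)
            > threshold_level(count_at_last_test, thresholds))


if __name__ == "__main__":
    # Tested at 2 bookmarks, now at 4: still below the next threshold (5).
    print(should_retest(2, 4))
    # Tested at 2 bookmarks, now at 5: crossed a new threshold.
    print(should_retest(2, 5))
```

Spacing the thresholds further apart as the count grows means a popular URL gets retested only a handful of times, rather than on every new bookmark.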
