Huddled Masses

Masters Thesis Proposal: An SOM Classifier

By Joel 'Jaykul' Bennett on 14-Jun-2006

The basic idea here is to build a classification system based on self-organizing map (SOM) algorithms that can pick “interesting” articles from sites like delicious, diigo, magnolia, and lilisto (I have a partial list of possible sites here).

There are currently several parts to this idea, and it’s quite possible that it could provide material for several project-length experiments.

The first question

Can I build a classifier that rates documents by how closely they match your interests, by placing them in a self-organizing map that uses keywords to position each document? I have already built an algorithm that applies GHSOM (a growing hierarchical SOM) to the relationships between the keywords applied to documents, so the task here is mainly to see how useful this information is for mapping additional documents and for classifying them by interest.

The next steps:

  1. Apply the algorithm to individual documents and see where they are placed.
  2. Determine the area of the map that represents the user’s interests (either by inference from having them rate documents, or by directly “circling” on the map their area(s) of interest).
  3. Rate documents by their (multi-dimensional) proximity to these areas.
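The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the GHSOM implementation mentioned earlier: the vocabulary, the tiny 2x2 map and its weights, and the function names `keyword_vector`, `best_matching_unit`, and `interest_score` are all made up for the example, and proximity is measured with a simple Gaussian falloff over grid distance.

```python
import math

# Hypothetical keyword vocabulary and a tiny, already-trained 2x2 map.
# The weights are invented for illustration only.
VOCABULARY = ["python", "powershell", "recipes", "gardening"]

SOM = {
    (0, 0): [0.9, 0.8, 0.0, 0.0],
    (0, 1): [0.7, 0.1, 0.1, 0.0],
    (1, 0): [0.0, 0.1, 0.8, 0.3],
    (1, 1): [0.0, 0.0, 0.3, 0.9],
}


def keyword_vector(keywords, vocabulary=VOCABULARY):
    """Binary bag-of-keywords vector; keywords outside the vocabulary are ignored."""
    present = set(keywords)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def best_matching_unit(vector, som=SOM):
    """Step 1: grid position of the unit whose weights are closest (Euclidean)."""
    return min(som, key=lambda pos: math.dist(vector, som[pos]))


def interest_score(position, interest_centers, radius=1.0):
    """Step 3: score in (0, 1] that decays with grid distance to the nearest
    interest center. The centers stand in for the areas found in step 2."""
    nearest = min(math.dist(position, center) for center in interest_centers)
    return math.exp(-(nearest ** 2) / (2 * radius ** 2))


# Usage: the user has "circled" the unit at (0, 0) as an area of interest.
interests = [(0, 0)]
for tags in (["python", "powershell"], ["gardening"]):
    unit = best_matching_unit(keyword_vector(tags))
    print(tags, unit, round(interest_score(unit, interests), 3))
```

A real map would have many more units and a vocabulary that keeps growing, so the `radius` parameter (an assumption here) would need tuning against documents the user has actually rated.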

The second question

Is this method more effective with keywords generated by actual people than with machine-generated keywords? There are many existing document summarization and keyword extraction algorithms, and even commercial products (e.g., brevity intellexer). One or more of these could be run on each document to extract keywords, instead of using the human-generated keywords available on delicious et al. This would let the algorithm analyze “any” document and would reduce its dependency on the websites mentioned earlier. Although this seems important, it may be of limited use: the intent is to classify interesting documents from an incoming “stream” of documents, and currently my “stream” comes from the same sites the keywords come from.
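To make "machine-generated keywords" concrete, here is a deliberately naive stand-in: a frequency-based extractor. It is not one of the summarization algorithms or products mentioned above, and the stopword list and the `extract_keywords` name are assumptions for the example; it only shows the shape of the input and output such a tool would need to provide.

```python
import re
from collections import Counter

# A very small, hypothetical stopword list; a real one would be much longer.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "for",
    "on", "with", "that", "this", "it", "as", "are", "be",
})


def extract_keywords(text, limit=5):
    """Return the `limit` most frequent non-stopword terms in `text`.

    Ties are broken alphabetically so the result is deterministic.
    """
    words = re.findall(r"[a-z][a-z0-9'-]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


# Usage: these keywords could be fed to the map in place of human tags.
print(extract_keywords(
    "The map places documents on the map; documents near the map center score high.",
    limit=2,
))
```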

The next steps:

  1. Create a collection of documents with their human-generated keywords
  2. Run machine summarizing algorithms on these documents
  3. Compare the resulting mappings for relevance (what is the metric here?)
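The metric is an open question. One candidate, offered only as a sketch, is neighborhood agreement: for each document, take its k nearest neighbors on the map built from human keywords and on the map built from machine keywords, and measure how much the two neighbor sets overlap. The function names and the choice of Jaccard overlap are assumptions for the example, and this measures whether the two mappings agree with each other, not whether either one is relevant to the user.

```python
import math


def nearest_neighbors(doc_id, positions, k):
    """Ids of the k documents closest to doc_id on the map (ties broken by id)."""
    origin = positions[doc_id]
    others = [d for d in positions if d != doc_id]
    others.sort(key=lambda d: (math.dist(origin, positions[d]), d))
    return set(others[:k])


def neighborhood_agreement(human_positions, machine_positions, k=2):
    """Mean Jaccard overlap of each document's k nearest neighbors in two mappings.

    Both arguments map a document id to its (row, column) position on a map.
    Returns 1.0 when every document keeps the same neighbors, 0.0 when none do.
    """
    shared = sorted(set(human_positions) & set(machine_positions))
    if len(shared) < 2:
        raise ValueError("need at least two documents present in both mappings")
    human = {d: human_positions[d] for d in shared}
    machine = {d: machine_positions[d] for d in shared}
    total = 0.0
    for doc_id in shared:
        a = nearest_neighbors(doc_id, human, k)
        b = nearest_neighbors(doc_id, machine, k)
        total += len(a & b) / len(a | b)
    return total / len(shared)
```

Because the two maps are trained separately, their coordinates are not directly comparable, which is why this sketch compares neighbor sets rather than raw positions.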

Additional questions

The most important open question (to me) is whether this idea is original enough to work as a thesis at RIT (as opposed to becoming a project). If it’s not, I’m leaning toward working on a different project which is somewhat more interesting to me.

However, there are several other open questions:

  • Is GHSOM better than a non-hierarchical growing SOM, or even a simple SOM algorithm, for this task? (Instinctively, it seems that the key requirement is that the map size must be inferred, and thus that a growing algorithm is required, but the hierarchy may be unnecessary.)
  • How does this system, which uses free keywords (anything can be a keyword, including the user-name of the person who creates the keywords), compare to a system with a fixed set of categories? It seems that the classifier would have much less adapting to do when categories are limited, since in the current system new keywords are constantly being added to the database and the algorithm must infer a user’s interest in these new keywords.
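For the first of these questions, a fixed-size SOM is the natural baseline. The sketch below is a plain textbook-style SOM in pure Python, not GHSOM and not my existing implementation; the decay schedule, default parameters, and the names `train_som` and `quantization_error` are assumptions for the example. Note that `rows` and `cols` must be chosen up front, which is exactly the limitation a growing algorithm is meant to remove.

```python
import math
import random


def train_som(samples, rows, cols, epochs=50, learning_rate=0.5, seed=0):
    """Train a fixed-size rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    initial_radius = max(rows, cols) / 2
    for epoch in range(epochs):
        # Learning rate and neighborhood radius both shrink linearly over time.
        decay = 1.0 - epoch / epochs
        rate = learning_rate * decay
        radius = max(initial_radius * decay, 0.5)
        for sample in rng.sample(samples, len(samples)):  # shuffled pass
            bmu = min(weights, key=lambda pos: math.dist(sample, weights[pos]))
            for pos, w in weights.items():
                # Units near the best-matching unit on the grid move the most.
                influence = math.exp(-math.dist(pos, bmu) ** 2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (sample[i] - w[i])
    return weights


def quantization_error(samples, weights):
    """Mean distance from each sample to its closest unit (lower is better)."""
    return sum(min(math.dist(s, w) for w in weights.values())
               for s in samples) / len(samples)
```

Comparing `quantization_error` across map sizes, or against a growing variant trained on the same keyword vectors, would be one rough way to see whether the added machinery pays for itself.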

Posted in Recommender | Tagged Development, Personal, Recommender

Copyright © 2012 Joel Bennett.
