Friday, January 27, 2012

Real Time Search Of Live Content On Limited Memory Lossy Dynamic Data WebSites

Degradata Search = Degrading Data Search, or
Real Time Search Of Degrading Data From WebSites Exhibiting Rapid Relevance Decay, Such As Image Boards, Chatrooms, Blogs, and Status Updates.

Use cases:
Users seek to discover ongoing live conversations in which their college, city, area code, or names are being discussed.

Firms seek to discover live conversations about their brand to engage their users.

Reporters seek to take the public pulse on a particular subject as research for a news story.

The primary rapid-decay data source I'd like to mine is 4chan, but other examples include Tinychat, Craigslist, Facebook, Twitter, Quora, Reddit, and Hacker News.  It is important to note that 4chan, Tinychat, and Craigslist exhibit a different form of data decay than the remaining examples: their data eventually disappears.  Not so with Twitter, Facebook, and Reddit, which experience only relevance decay without the added complication of data disappearance.


4chan acts like a limited-memory queue of bumpable message board threads, with images and optional usernames, that 404 as they fall off the queue.  A search result for a thread on 4chan is only useful within the window in which the thread has active contributors.  One of the boards on 4chan particularly ripe to have its data mined is /soc/, which often sees threads in which people link up in their area by sharing contact info, area codes, and self-descriptions.  This and other opportunities for interaction have elevated refreshing the /soc/ board to a modern cultural pastime for the under-30 crowd.  Its users would benefit from real-time alerts when specific terms or usernames appear, the ability to see all current occurrences of a search term across the site, and a daily digest of discussions involving those terms to keep them current.  The effect would be to extend the social aspects of the site into a slightly more formalized follow network of sorts: alerts act as follows, a "find related alerts" function provides discovery, and users can selectively make their profile and a subset of their alerts public.
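To make the "limited memory queue" idea concrete, here is a rough sketch in Python.  It is only a toy model, not how 4chan is actually implemented: the BumpBoard class, its capacity, and the sample posts are all made up for illustration.  It shows the two behaviors that matter for search: a new post bumps its thread to the top, and a thread pushed past the capacity is gone for good, so a search only ever sees what is live right now.

```python
from collections import OrderedDict


class BumpBoard:
    """Toy model of a board: a fixed-capacity queue of threads ordered by last bump.

    Hypothetical illustration only; names and capacity are assumptions.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # thread_id -> list of post texts; the most recently bumped thread is last
        self.threads = OrderedDict()

    def post(self, thread_id, text):
        """Add a post, bump the thread, and return ids of threads that fell off."""
        self.threads.setdefault(thread_id, []).append(text)
        self.threads.move_to_end(thread_id)
        expired = []
        while len(self.threads) > self.capacity:
            old_id, _ = self.threads.popitem(last=False)  # least recently bumped
            expired.append(old_id)
        return expired

    def search(self, term):
        """Return ids of live threads containing the term (case-insensitive)."""
        needle = term.lower()
        return [
            thread_id
            for thread_id, posts in self.threads.items()
            if any(needle in post.lower() for post in posts)
        ]


board = BumpBoard(capacity=2)
board.post("a", "anyone from 415 around?")
board.post("b", "rate my setup")
board.post("a", "415 here")               # bumps thread "a" above "b"
expired = board.post("c", "new thread")   # board is full, so "b" falls off
print(expired)              # ['b']
print(board.search("415"))  # ['a']
```

An alert is then just a saved search term checked against each new post as it arrives, and the expired list is the signal to drop stale results from any index built on top.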

The topic of a particular thread is not always simple to identify.  Threads can be derailed on purpose (ask Spiderman), though more commonly they simply devolve according to how contributors choose to interact within them.  A naive Bayes classifier applied at the thread level to return a hypothetical topic ought to be good enough for this purpose, so from here on I proceed on the assumption that this hypothesis holds.
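As a rough sketch of what thread-level classification could look like, here is a small multinomial naive Bayes classifier with add-one smoothing in plain Python.  The topic labels ("meetup", "rating") and the training snippets are invented for illustration, and a real version would need far more labeled threads; the point is only that all posts in a thread are concatenated and scored as one document.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesTopics:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.doc_counts = Counter()              # topic -> training threads seen
        self.vocab = set()

    def train(self, topic, thread_text):
        words = tokenize(thread_text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, thread_text):
        """Return the topic with the highest log posterior for the thread."""
        words = tokenize(thread_text)
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic in self.doc_counts:
            score = math.log(self.doc_counts[topic] / total_docs)  # log prior
            total_words = sum(self.word_counts[topic].values())
            for word in words:
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Hypothetical training data, one string per labeled thread.
nb = NaiveBayesTopics()
nb.train("meetup", "anyone in the 415 area code want to hang out this weekend")
nb.train("meetup", "looking for people near chicago, post your area code and contact")
nb.train("rating", "rate me out of ten, be honest")
nb.train("rating", "rate my outfit and haircut")

# A thread is scored as the concatenation of its posts.
thread = " ".join(["any 212 people here?", "post your area code", "who wants to hang out"])
topic = nb.classify(thread)
print(topic)  # meetup
```

Because the whole thread is scored at once, a thread that has been derailed will drift toward whatever its later posts are about, which is arguably the behavior wanted for a live search.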

The pain point it solves: tbc
