Lazyweb assignment #1 for search engine geeks

Nov. 10th, 2004 | 02:04 am
mood: geeky
music: Orbital - P.E.T.R.O.L.

Given: an arbitrary list of URIs

Find: a Google search string which will return ~90% of those URIs within the first N results

Extra credit:
Optimize for shortest possible (or practical) search string
configurable options and parameters
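
To make the goal concrete, here is a rough sketch of how a candidate search string could be scored against the list. It assumes a hypothetical `search` callable that takes a query and returns a ranked list of URIs; that callable, the function names, and the default values are illustrative assumptions, not part of any real Google API.

```python
def coverage(query, target_uris, search, n=100):
    """Fraction of target_uris found in the first n results for query.

    `search` is a hypothetical callable (an assumption for this sketch):
    it takes a query string and returns a ranked list of URIs.
    """
    targets = set(target_uris)
    if not targets:
        return 0.0
    top_results = set(search(query)[:n])
    return len(targets & top_results) / len(targets)


def meets_goal(query, target_uris, search, n=100, threshold=0.9):
    """True if the query returns at least `threshold` of the URIs in the top n."""
    return coverage(query, target_uris, search, n) >= threshold


def best_query(candidates, target_uris, search, n=100):
    """Pick the highest-coverage candidate, preferring shorter strings on ties."""
    return max(
        candidates,
        key=lambda q: (coverage(q, target_uris, search, n), -len(q)),
    )
```

The tie-break on length in `best_query` is one way to read the "shortest possible search string" extra credit; `n` and `threshold` stand in for the configurable parameters.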


Comments {3}

truth without proof

(no subject)

from: chronicfreetime
date: Nov. 10th, 2004 02:13 am (UTC)

I've actually worked on this problem for a single URL, as an approach to clustering. I think N will have to be very large if the pages are not closely related.


Triple Entendre

(no subject)

from: triple_entendre
date: Nov. 11th, 2004 02:28 am (UTC)

Yeah, I assume that if there are very unusual pages in the set, you'd have to lower the target percentage (given as 90% above), so the less-similar pages would effectively be ignored.

My particular motivation amounts to sort of a reverse-google: these are the pages I'd get in a highly optimized search for what I'm looking for; show me how I got there. Result: the google equivalent of an iTunes smart playlist for [whatever it is that grouping of pages means].

But it's also an interesting question because sometimes the optimal search string for a problem tells you something interesting about the problem domain.


truth without proof

(no subject)

from: chronicfreetime
date: Nov. 11th, 2004 10:52 am (UTC)

It's one of the first things I want to implement, assuming they hire me. The "find similar" button does a lousy job at present; I'd like it to use the content more than the URL.

Feeding the top 5-10 words by TF-IDF back into a query should work pretty well, and it's very easy to compute from the index they already have.
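
A minimal sketch of that idea, assuming plain whitespace tokenization and a small in-memory corpus in place of a real search index. The function names and the smoothed IDF formula are choices made for this example, not a description of any actual engine.

```python
import math
from collections import Counter


def top_tfidf_terms(target_doc, corpus, k=10):
    """Return the k terms of target_doc with the highest TF-IDF scores.

    Assumptions for this sketch: documents are plain strings, tokens are
    lowercase whitespace-separated words, and IDF uses add-one smoothing.
    """
    target = target_doc.lower().split()
    if not target:
        return []

    # Document frequency: number of corpus documents containing each term.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))

    n_docs = len(corpus)
    scores = {}
    for term, count in Counter(target).items():
        tf = count / len(target)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def build_query(terms):
    """Join the selected terms into a plain search string."""
    return " ".join(terms)
```

Calling `build_query(top_tfidf_terms(page_text, corpus, k=5))` would give a candidate "find similar" query for one page; a real index would already hold the term and document frequencies, so nothing would need recounting.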
