A semester of search

My grad school class this semester is the Project Course, where the whole semester is spent on a group project.  No tests, no other grading besides the project, which is actually what I expected more of when starting grad work at Marist (that’s a different post though).  The project domain is search.

We have to build an application with an integrated search engine tuned for a specific problem.  Our group’s problem is a user-driven online restaurant review site.  Our canonical example is searching for “boston seafood”, which should return all the posts that a human would pick given the same task.  That means “the best lobster in bean town” counts as a hit that you’d want.  Guess what: SQL LIKE clauses and regexes aren’t going to cut it here.

But that’s ok, we don’t have to do everything from scratch.  We’re expected to base our solution on Lucene, which is a search SDK.  You build custom indexer, analyzer, and searcher classes from the Lucene base classes, and feed it documents.  Lucene does the heavy lifting of building the inverted index and scoring the results based on the rules, weights, and policies you’ve given it.  A project like this is pretty open-ended, as you can always make it better given more time and more interesting analysis tricks.
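For anyone who hasn’t met an inverted index before, here is a minimal sketch of the idea in plain Python.  This is not Lucene code and not our project code; the function names (`tokenize`, `build_index`, `search`) and the sample posts are made up for illustration, and the tokenizer is a crude stand-in for a real analyzer.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample posts.
docs = {
    1: "Great seafood in Boston, right by the harbor",
    2: "The best lobster in bean town",
    3: "Boston has decent burgers too",
}
index = build_index(docs)
hits = search(index, "boston seafood")
```

With these sample posts, only post 1 matches; the lobster post is missed because neither query word appears in it literally.  Closing that gap is the job of the analysis step.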

The whole team is making nice progress, so for the last two weeks I’ve been able to focus squarely on the Lucene integration code itself.  Pass one got some basic queries working in Lucene.  Pass two was earlier this week, when scoring started to be useful.  Pass three will be tonight, where I’ll start to integrate synonym support so that lobster is understood as a type of seafood, and burgers are understood to be American food.  I’ll have to think about how to make sure crab cakes don’t show up in the dessert category, though maybe we just need a hybrid seafood dessert category.
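One possible shape for that synonym pass, sketched in plain Python rather than Lucene: expand terms at analysis time, and rewrite known multi-word phrases before splitting into tokens so that “crab cakes” never reaches the per-word “cakes” rule.  The `HYPERNYMS` and `PHRASES` tables and both function names are hypothetical, hand-built for this example only.

```python
import re

# Hypothetical hand-built tables; a real project would curate or load these.
PHRASES = {"bean town": "boston", "crab cakes": "crab_cakes"}
HYPERNYMS = {
    "lobster": {"seafood"},
    "crab_cakes": {"seafood"},
    "burgers": {"american"},
    "cake": {"dessert"},
    "cakes": {"dessert"},
}

def analyze(text):
    """Return the set of index terms for a post, including broader categories."""
    text = text.lower()
    # Rewrite multi-word phrases first so their parts are not expanded separately.
    for phrase, replacement in PHRASES.items():
        text = text.replace(phrase, replacement)
    terms = set()
    for token in re.findall(r"[a-z0-9_]+", text):
        terms.add(token)
        terms |= HYPERNYMS.get(token, set())
    return terms

def matches(post, query):
    """True if every query word is among the post's expanded terms."""
    query_terms = set(re.findall(r"[a-z0-9_]+", query.lower()))
    return query_terms <= analyze(post)
```

Here `matches("The best lobster in bean town", "boston seafood")` is true, while `analyze("Amazing crab cakes downtown")` picks up “seafood” but not “dessert”.  The expansion is deliberately one-directional: a search for “seafood” finds lobster posts, but a search for “lobster” does not drag in every seafood post.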

A few interesting lessons have come out of the work so far.  First, search is way harder than most people think.  While Lucene gives you lots of knobs and levers to tweak how documents are ranked, the results of that tweaking aren’t always what you expect.  It’s sort of like moving furniture by throwing bowling balls at it: you may get things close, but you do a lot of collateral damage in the process.  Recently I was attempting to boost scores based on terms showing up in the subject of posts, which completely overwhelmed our post rating scoring, making low quality posts show up at the top of the list.
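The subject boost problem is easy to reproduce with toy numbers.  The sketch below is plain Python with a made-up additive scoring formula (term hits plus rating); it is not Lucene’s scoring and not our actual weights, just an illustration of how one knob can swamp another.

```python
def score(post, term, subject_boost):
    """Toy score: body hits + boosted subject hits + the post's rating."""
    body_hits = post["body"].lower().split().count(term)
    subject_hits = post["subject"].lower().split().count(term)
    relevance = body_hits + subject_boost * subject_hits
    return relevance + post["rating"]

def rank(posts, term, subject_boost):
    """Return post ids ordered from highest to lowest toy score."""
    ordered = sorted(posts, key=lambda p: score(p, term, subject_boost), reverse=True)
    return [p["id"] for p in ordered]

# Hypothetical posts: one poorly rated with the term in its subject,
# one highly rated with the term only in its body.
posts = [
    {"id": "low", "subject": "Seafood night", "body": "It was fine", "rating": 1.0},
    {"id": "high", "subject": "Dinner at the wharf", "body": "Excellent seafood", "rating": 5.0},
]

modest = rank(posts, "seafood", 1.0)       # rating still decides the order
aggressive = rank(posts, "seafood", 10.0)  # the subject boost swamps the rating
```

With a boost of 1.0 the well-rated post stays on top; at 10.0 the poorly rated post jumps ahead purely because of its subject line.  Keeping the boost on the same scale as the rating, or capping it, is one way to stop that from happening.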

You also notice when people are using search badly, or more specifically using bad search.  Using SQL LIKE clauses is not search, it’s grep.  Unfortunately most PHP sites do that because they don’t have anything better (Lucene has been ported to a lot of language environments, but PHP is not one of them).  The Gentoo wikis fall into this category.

Finally, you realize that Google’s scoring, while good in general, may not actually be what you want for your problem domain.  The fact that the word seafood shows up three times in a post doesn’t make it a better post, but default scoring gives it a boost based on the number of times relevant terms show up.  Badger, Badger, Badger, while being non-kosher, shouldn’t be scored highly in our results, even if we had a category fully dedicated to badgers and mushrooms.
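To see how much the term frequency function matters, here is a small plain-Python comparison of three ways to turn a raw count into a score contribution.  The function names and sample texts are invented for this sketch, and none of it is meant to reproduce any particular engine’s formula.

```python
import math

def tf_raw(count):
    """Every repetition counts in full."""
    return float(count)

def tf_sqrt(count):
    """Repetitions still help, but with diminishing returns."""
    return math.sqrt(count)

def tf_binary(count):
    """Only presence matters; repetition earns nothing."""
    return 1.0 if count > 0 else 0.0

def score(text, term, tf):
    """Score a text for one term using the given term frequency function."""
    count = text.lower().split().count(term)
    return tf(count)

# Hypothetical posts.
spam = "Badger badger badger"
review = "The badger stew was superb"
```

Under `tf_raw` the repetitive post scores three times as high as the real review; under `tf_binary` the two tie, which leaves room for something like the post rating to break the tie.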