Tag Archives: grad school

A semester of search

My grad school class this semester is the Project Course, where the whole semester is spent on a group project.  No tests, no other grading besides the project, which is actually what I expected more of when starting grad work at Marist (that’s a different post though).  The project domain is search.

We have to build an application with an integrated search engine tuned for a specific problem.  Our group problem is an user driven online restaurant review site.  Our canonical example is searching for “boston seafood”, which should return all the posts that a human would, given the same tasks.  That means “the best lobster in bean town” counts as a hit that you’d want.  Guess what, SQL like clauses and regex’s aren’t going to cut it here.

But that’s ok, we don’t have to do everything from scratch.  We’re expected to base our solution on Lucene, which is a search SDK.  You build custom indexer, analyzer, and searcher classes from the Lucene base classes, and feed it documents.  Lucene does the heavily lifting of building the inverted index, and scoring the results based on the rules, weights, and policies you’ve given it.  A project like this is pretty open ended, as you can always make it better given more time, and more interesting analysis tricks.

The whole team is making nice progress, so for the last two weeks I’ve been able to focus squarely on Lucene integration code itself.  Pass one got some basic queries working in Lucene.  Pass two was earlier this week, when scoring started to be useful.  Pass three will be tonight, where I’ll start to integrate synonym support so that lobster is understood as a type of seafood, burgers are understood to be american food.  Though I’ll have to think about how to make sure crab cakes don’t show up in the desert category, though maybe we just need a hybrid seafood desert category.

A few interesting lessons have come out of the work so far.  First, search is way harder than most people think.  While Lucene gives you lots of nobs and levers to tweak how documents are ranked, the results of those tweaking aren’t always what you think.  It’s sort of like moving furniture by throwing bowling balls at it, you may get things close, but you do a lot of collatoral damage in the process.  Recently I was attempting to boost scores based on terms showing up in the subject of posts, which completely overwhelmed our post rating scoring, making low quality posts show up at the top of the list.

You also notice when people are using search badly, or more specifically using bad search.  Using SQL Like clauses is not search, it’s grep.  Unfortunately most php sites do that because they don’t have anything better (Lucene has been ported to a lot of language environments, php is not one of them).  The gentoo wikis fall into this category.

Finally, you realize that google’s scoring, while good in general, may not actually be what you want for your problem domain.  The fact that the word seafood shows up 3 times in a post doesn’t make it a better post, but default scoring gives it a boost based on the number of times relevant terms show up.  Badger, Badger, Badger, while being non kosher, shouldn’t be scored highly in our results, even if we had a category fully dedicated to badgers and mushrooms.

Attempting to Study… badly

I’m attempting to study for my Automata final, which is going slowly, and not something I’m able to keep enthusiasm up about.

  • I only need a 51 on the final to get the requisite B in the class, and given that every exam prior had a 15 point curve, I probably only really need a 36. A 57 or better (pre curve) translates into an A-.
  • The PDA and Turing Machine portions of the exam are well in hand. I feel comfortable converting back and forth between them and a language definition. I personally find generating Turing Machines a lot of fun, and really interesting ways to make your brain work. They just take time, and I’ve had no issue on time in previous exams.
  • Chomsky hierarchy of language is pretty much just a table and a vend diagram to remember. Pretty confident in that.
  • Pumping lemma for PDAs. I’m sure this is going to be the part of the exam I blow, but it will probably only be one question. The pumping lemma proof on the last exam accounted for the majority of points I lost for that test.
  • Halting problem proof. I ran through this prior to class the other day without notes, and I think I can reproduce it at will. About to go attempt to do that now again.
  • Life of Alan Turing bonus question (a staple of Prof. Hayes’ final exams), I’ve got enough fun facts memorized to do ok on that.
  • Computability. We did a bit on the partial recursive functions for the last class. I’m not sure which portions are actually askable. Possibly an enumeration of the 3 basis functions (Z, S, P), or the 3 operations (composition, primitive recursion, unbounded minimalization). We were already told we weren’t going to have to actually do the mechanics of these, given their level of horridness, and being introduced on the last day of class.

Guess it’s time to start playing with proofs from memory, and reread Alan Turing’s life synopsis a few more times.

Grad School

For the past 3.5 years, grad school has just been something that I do, and will be doing forever. I’m enrolled in the Computer Science Masters degree program at Marist College, taking 1 class a semester, part time, at nights. As registration was coming up for this spring, I emailed the adviser for the program about where I stood, and what classes I still need to take.

Turns out I’ve only got 2 required, and 3 electives left (so Spring 2009 is my graduation date if nothing goes wrong). Which means I’m more than 1/2 through the program. So for the first time since it started, the end is closer than the beginning. While I’ve enjoyed some part of every class that I’ve taken so far, and have learned something in every class, things like Curling had to take a back seat this year given how booked the rest of my life is right now. Perhaps, post grad school, I could actually join the curling league in Norfolk. 🙂

End of term potpourri

Last night was the exam for my grad class. It went quite well. The 2 hour test took me about 45 minutes to finish, then I went over the test 3 times, mostly to kill time and not miff everyone else in the room for leaving so early. The test was pretty straight forward, and while I’m sure I missed a few points here and there (given that I always do, I’m just not that careful about anything), I think the grade will be pretty good. That means I get back about 6 hours / week of time for the next 3 months, which is always appreciated (and needed, due to the wedding). It also means that I am now ~ 1/2 way through grad school. As long as I can get the last UG prereq dropped (which shouldn’t be too much of a challenge), I’ve got 6 more classes (only 1 is a required class), and the network lab that I’ve continued to not figure out when I have time for.

I’m currently on a plane to Wisconsin to visit my friend Nick, one of my closest friends from college. It was supposed to be his graduation this week, but his thesis took a little more time than anticipated. Such is life. It actually makes the visit much nicer, as there won’t be quite so many different pulls on Nick’s time while I’m there, and hopefully it means He, Heather (his fiancee), and I will get much chill time together.

My team of interns start arriving on Monday. I’m really looking forward to having 4 bright students, and a hard problem, and seeing what happens with a little direction. It’s also something new, having never led an intern team before, and new is always good.

Wedding planning is coming along. Susan and I finally have some food options, and that is a very good thing. 🙂 We had a good talk with Steve Warner, who will be officiating the wedding last weekend. Steve is a family friend from as far back as I can remember, a really funny guy, and a justice of the peace in Granville, VT, where I was born, grew up, met Susan for the first time, and where we’ll get married. That last thought just occurred to me last night, and think very clearly has to be part of the material for the wedding itself. I also seem to have fixed whatever Google didn’t like about the wedding website, so now “dague wedding” or “tveekrem wedding” is showing up correctly as the first hit. Given the uniqueness of both of our names, you are almost going to have to try to not find the site. 😉

The whole ppp over sprint phone thing, has me rather excited. It means I’ll have to play with getting IMAP up and running on my home box, as while the through put seems pretty good (140 kbs range on a long wget), the latency is very high (min 350 ms to google.com). My normal method of sshing into my home box to do mutt in screen isn’t going to cut it over that link.

Lastly I just finished listening to How to Make Love the Bruce Cambell Way, which was awesome, and started listening to The Salmon of Doubt, the set of Douglas Adams post mortem essays that were pulled together a few years back. Really, really, good.

Gnome Sort

While search the NIST Algorithms Dictionary for a simple sort algorithm that would be easy to write in CUSP asm, I came across gnome sort, which I’d never seen before.

Gnome Sort is based on the technique used by the standard Dutch Garden Gnome (Du.: tuinkabouter). Here is how a garden gnome sorts a line of flower pots. Basically, he looks at the flower pot next to him and the previous one; if they are in the right order he steps one pot forward, otherwise he swaps them and steps one pot backwards. Boundary conditions: if there is no previous pot, he steps forwards; if there is no pot next to him, he is done.

Awesome! The implementation is 23 instructions in CUSP asm for an arbitrary segment of continous memory.