egilh project #11: Intranet Search
What: First some background info: I like to know how things work. I started picking apart broken kitchen appliances when I was a kid, trying to figure out what was wrong. I had a hard time buying spare parts, though, as repair centers wanted to do the job themselves to earn money. These days I check out How Stuff Works once in a while to learn something new.
It's the same interest in knowing what goes on behind the scenes that got me started programming as well. My neighbor convinced me to spend the few bucks I had earned picking strawberries during my summer vacation on a computer. I started using Lotus and WordPerfect, constantly wondering how the stuff worked. I got my hands on Turbo Pascal and taught myself programming from the excellent online help and examples.
Fast forward 10 years: AltaVista launched their search service, and I said to myself: how do they do that? So, during compiles and installations of Windows 95, I built a fully working crawler, indexer and web search server. I had limited resources but wanted good performance, so I decided to use NTFS as a DB. FAT had a horrible linked list to find files, but the "new" NTFS used B-trees to do very fast file-name-based lookups.
Each page I crawled got a unique ID and was added to a fixed-size record file with the title of the page, abstract, URL, date etc. For each unique word in the crawled page I added the ID to a word file. Each word had its own file in a directory system to avoid too many files in one directory. The IDs of pages using the word 'Hello', for example, were saved in "\h\e\llo.dat". Pages that used 'Hi' were stored in "\h\i.dat".
One-word queries were lightning fast, as I just had to open a flat file of 32-bit IDs to get the list of pages containing that word. AND/OR queries were simple as well, as I just had to intersect (AND) or join (OR) the arrays. Then again, it didn't have any of the fancy-schmancy Google relevance logic. It was a lot of fun to make and to watch the progress of the crawler as it found its way to the various internal web servers at Microsoft.
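
For the curious, here is a minimal sketch of the word-file idea in Python. It is not the original code: the function names (word_path, add_page, pages_with, query_and, query_or) are made up for illustration, the rule for very short words is my guess based on the 'Hello' and 'Hi' examples, and little-endian unsigned 32-bit IDs are an assumption. It also assumes words contain only characters that are legal in file names.

```python
import struct
import tempfile
from pathlib import Path


def word_path(root: Path, word: str) -> Path:
    """Map a word to its ID file, e.g. 'Hello' -> root/h/e/llo.dat, 'Hi' -> root/h/i.dat."""
    w = word.lower()
    if not w:
        raise ValueError("empty word")
    # Assumption: up to two leading letters become directories,
    # always leaving at least one letter for the file name.
    depth = min(2, len(w) - 1)
    return root.joinpath(*w[:depth], w[depth:] + ".dat")


def add_page(root: Path, page_id: int, words) -> None:
    """Append page_id to the file of every unique word on the page."""
    for word in {w.lower() for w in words}:
        path = word_path(root, word)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("ab") as f:
            f.write(struct.pack("<I", page_id))  # one 32-bit ID per entry


def pages_with(root: Path, word: str) -> set:
    """One-word query: read the flat file of 32-bit IDs."""
    path = word_path(root, word)
    if not path.exists():
        return set()
    return {pid for (pid,) in struct.iter_unpack("<I", path.read_bytes())}


def query_and(root: Path, words) -> set:
    """Pages containing every word: intersect the ID lists."""
    sets = [pages_with(root, w) for w in words]
    return set.intersection(*sets) if sets else set()


def query_or(root: Path, words) -> set:
    """Pages containing any word: join the ID lists."""
    return set().union(*(pages_with(root, w) for w in words))


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())  # throwaway index directory
    add_page(root, 1, ["Hello", "world"])
    add_page(root, 2, ["hello", "Hi"])
    add_page(root, 3, ["Hi", "world"])
    print(sorted(query_and(root, ["hello", "hi"])))  # [2]
    print(sorted(query_or(root, ["hello", "hi"])))   # [1, 2, 3]
```

The sketch uses Python sets for brevity. With sorted ID arrays, as in a flat file that only ever gets larger IDs appended, AND and OR can also be done as a single merge pass without loading everything into a hash set.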
When: mid-90s
- The Microsoft web site is -huge-. I don't remember how, but my crawler ended up on the two internal mirrors of the public web site and it got "stuck" indexing the public web servers only. I can only imagine how large the site is now, 10 years later.
- Sockets on Windows
- Work smarter, not harder and you can accomplish incredible things
- Parsing badly formed HTML pages
- HTML can be a pain to parse in C++. Long live Perl for parsing strange stuff and XPath for dealing with XML/XHTML