My Search Engine
I've read a lot about and worked with many of the major search engines and
directories, including Google, Inktomi, AltaVista, Yahoo and AlltheWeb,
just to name a few. That interest has led me to want to write my own web
crawler and search engine. I'll be writing this engine in the very
versatile Perl programming language and posting the results on this page
so I can gloat over my not-so-brilliance. I am using this page as a
framework for the project.
Parts of my Search Engine
The Web Crawler
- The web crawler will go out and collect information such as the URL, page
content, IP address, and whatever else becomes necessary. I will be using
DMOZ as a starting point because they seem to be the friendliest towards
spider agents.
- I'm considering sending this spider out cloaked. Using either LWP or
Socket in Perl, I can make its requests look like ordinary browser
requests, which will help me avoid people who dislike spidering. The
spider will be aggressive in that it won't look at or obey robots.txt.
A rough sketch of this follows below.
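Here's a minimal sketch of that cloaked fetch, using LWP::UserAgent with a
browser-style User-Agent string and Socket's gethostbyname to grab the IP.
The seed URL and agent string are just placeholders, and a plain
LWP::UserAgent (unlike LWP::RobotUA) never consults robots.txt, which
suits the aggressive approach.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use Socket qw(inet_ntoa);
    use URI;

    # The agent string makes the request look like an ordinary browser;
    # plain LWP::UserAgent never reads robots.txt.
    my $ua = LWP::UserAgent->new(
        agent   => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        timeout => 10,
    );

    my $url      = 'http://dmoz.org/';    # placeholder seed page
    my $response = $ua->get($url);
    die 'Fetch failed: ', $response->status_line, "\n"
        unless $response->is_success;

    # Resolve the host's IP address via Socket.
    my $packed = gethostbyname( URI->new($url)->host );
    my $ip     = defined $packed ? inet_ntoa($packed) : 'unresolved';

    # The three fields the crawler collects so far.
    printf "URL: %s\nIP: %s\nContent: %d bytes\n",
        $url, $ip, length $response->content;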
The Index
- My index will start at 10,000 sites and, depending on scalability and
storage requirements, grow to an as-yet-undefined number. I intend to
index as many different domains as possible, since some individual domains
contain thousands of pages themselves. A sketch of the index structure
follows below.
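For the index itself, something like a simple inverted index (each word
maps to the pages containing it) is the obvious shape. This is only a toy
sketch under my own assumptions: the tokenizer is crude and Storable is
just one persistence option; at 10,000+ sites a DB_File tie or a real
database would make more sense.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    # Toy inverted index: word => { url => number of occurrences }.
    my %index;

    sub add_page {
        my ($url, $content) = @_;
        # Crude tokenizer: lowercased runs of three or more letters.
        for my $word ($content =~ /\b([a-z]{3,})\b/gi) {
            $index{ lc $word }{$url}++;
        }
    }

    add_page('http://example.com/',    'Perl search engine written in Perl');
    add_page('http://example.org/faq', 'search engine questions');

    nstore \%index, 'index.db';        # write the index to disk
    my $loaded = retrieve('index.db'); # ...and read it back

    # Which pages mention "perl", and how often?
    my $postings = $loaded->{perl} || {};
    print "$_ => $postings->{$_}\n" for sort keys %$postings;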
The SERPs Generator
- Results will be generated by applying some algorithm to the index. My
thought is that only page content will be used to evaluate pages at first,
with a possible random-walk link popularity system, plus some sort of
grouping into categories, added later on. A sketch of the random-walk idea
follows below.
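The random-walk idea is essentially PageRank: a page's score is the
probability that a surfer clicking random links ends up on it. Here's a
toy sketch with a made-up three-page link graph and the usual 0.85 damping
factor; real data would come out of the index above.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Made-up link graph: page => list of pages it links to.
    my %links = (
        'a.com' => [ 'b.com', 'c.com' ],
        'b.com' => [ 'c.com' ],
        'c.com' => [ 'a.com' ],
    );

    my @pages = keys %links;
    my $n     = scalar @pages;
    my $damp  = 0.85;                  # standard damping factor

    # Start every page with an equal share of the probability.
    my %rank;
    $rank{$_} = 1 / $n for @pages;

    # Iterate until the scores settle (50 passes is plenty here).
    for (1 .. 50) {
        my %next;
        $next{$_} = (1 - $damp) / $n for @pages;
        for my $page (@pages) {
            my @out   = @{ $links{$page} };
            my $share = $rank{$page} / @out;   # split rank over outlinks
            $next{$_} += $damp * $share for @out;
        }
        %rank = %next;
    }

    printf "%s: %.4f\n", $_, $rank{$_}
        for sort { $rank{$b} <=> $rank{$a} } @pages;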