Optimizing for Search Indexes

Paul Kovacs
Computer Information Systems, Robert Morris University
Moon Township, PA 15108, USA

ABSTRACT

The expansion of Internet technology is challenging information and instructional technology educators to develop courses related to the teaching of Web site design and development. Such courses should follow all of the recommended web site design practices, such as site organization, site navigation, page design, text design, graphic design, and accessibility considerations. Whether the purpose of a Web site is to run and promote a business, provide a service, disseminate information, or just establish a web presence, an extremely important additional topic in a Web design course is assuring that the target audience will find the site. Search indexes are essential starting points for users seeking Web locations or documents, and getting a site listed in a search index can be an invaluable tool for reaching a target market. This paper discusses the working components of crawler-based search indexes and provides optimization techniques that can be included in Web design and development courses.

Keywords: web design, search indexes, optimizing HTML tags, world wide web indexing

1. INTRODUCTION

The public's enthusiasm for and business acceptance of the Internet are unparalleled in the history of information technology. The growth of the Internet has exceeded all previous records for both the brief time of acceptance and the number of businesses embracing the new technology. The new millennium was a crossroads for the Internet: over 50 percent of all American homes and more than 90 percent of all college students had Internet access (Lehnert 2001). This expansion of Internet technology is challenging information and instructional technology educators to develop courses related to the teaching of Web site design and development. Such courses should follow all of the recommended web site design practices, such as site organization, site navigation, page design, text design, graphic design, and accessibility considerations.

Whether the purpose of a Web site is to run and promote a business, provide a service, disseminate information, or just establish a web presence, an extremely important additional topic in a Web design course is assuring that the target audience will find the site. There are a number of ways to accomplish this, including traditional media advertising, announcements in Usenet newsgroups, e-mail lists, Listserv postings, reciprocal link agreements, and personal recommendations. However, according to a survey by the Georgia Institute of Technology (Pitkow and Kehoe 1997), more than 86% of all people locate a Web site through search indexes, as opposed to printed media (62.47%), friends (56.92%), TV (30.20%), email signatures (31.23%), Usenet newsgroups (32.75%), or other methods. A more recent Nielsen/NetRatings survey reported at searchwatch.com (Sullivan, 2002) indicates that nine out of ten web surfers utilize a search index when they need to find a Web site, and a Consumer Daily Question Study (Felke, 2003) found that search indexes were the most utilized method of obtaining information and finding sites.

In the context of the World Wide Web and indexing, the term "search index" can describe both crawler-based search indexes and human-edited directories. Crawler-based search indexes, such as Excite, index words or terms in Web documents automatically by traversing, or crawling, the web.
Conversely, directories such as Yahoo classify locations or Web documents into an arbitrary subject classification scheme, or taxonomy, produced by human beings who visit and consider websites for inclusion. This paper discusses the components of crawler-based search indexes and provides optimization techniques for Web design and development courses. The paper does not discuss human-edited search directories.

2. COMPONENTS OF A CRAWLER-BASED SEARCH INDEX

Crawler-based search indexes have three major components:

* Spider
* Index
* Search form

The first component is the spider, also called the crawler, robot, or worm. These names are somewhat deceptive because they give the impression of undesirable software such as a virus. This is not true; spiders are not destructive. The spider simply visits a web page, reads it, and then follows links to other pages within the site. http://www.jafsoft.com/searchengines/webots.html provides a list of spiders for the major search indexes (Fotheringham 2003).

The second component is the index: the database, catalogue, or dataset that contains a copy of every web page the spider locates. If a web page is changed, the new information updates the index. It can take a while for new or altered pages to be added to the index, so a web page may be "spidered" but not yet "indexed." Until the page is actually inserted into the database, it is not available to those searching for it.

The third component is the search form, which allows for the entry of keywords or phrases. When a search form is submitted, a special program analyzes the search query and then searches the database, or index, for pages that contain the keyword or keyword phrase. Next, it analyzes each of the relevant pages to determine how important the keyword or phrase is on that page. A search index can contain hundreds of millions of web pages in its database, so in order to deliver relevant results, special ranking algorithms calculate the order of the search results for the keyword or keyword phrase the searcher enters into the search form.

Search indexes such as Excite (http://www.excite.com), AltaVista (http://www.altavista.com), and HotBot (http://www.hotbot.com) index Web pages by starting from a historical list of URLs, especially of documents with many links elsewhere, such as "What's New" pages, server lists, and the most popular sites on the Web. The spider may then decide to parse a document and insert it into the database. Some spiders index only the home page of a Web site, while others index all of the links in the site. The Web page's title, body, and other elements all play a role in the indexing. Some indexes also parse the META tag or other special hidden tags.

3. SELECTING KEYWORDS AND PHRASES

As mentioned in the preceding section, when searching for a Web location or document, searchers enter various keywords in the search field of a search form. The search index then looks for Web pages that match those keywords. It follows that the starting point of search index optimization is selecting crucial keywords (one word) or keyword phrases (more than one word) that describe the Web site or page. Assemble a list of relevant keywords containing 10 to 25 words or phrases arranged in order of importance. Do not include keywords that have nothing to do with the Web site, and do not repeat keywords. In assembling the list, think of what a user would enter when searching for the Web site; a hypothetical site selling handcrafted birdhouses, for example, might list phrases such as "cedar birdhouse," "handcrafted birdhouse," and "custom birdhouse."
If there are not enough words to describe the site, a thesaurus may be helpful. The keywords selected should be neither too general nor too restricted. Keywords that are too general can result in a very large number of sites found; conversely, keywords that are too restricted may reduce the number of sites found, but few searchers will ever use them. Several Web resources for selecting a keyword list are:

* Overture's Search Term Suggestion Tool
  http://inventory.overture.com/d/searchinventory/suggestion/
* Word Tracker
  http://www.wordtracker.com/
* Lexical FreeNet Connected thesaurus
  http://www.lexfn.com/
* iBoost Keyword Generator
  http://www.iboost.com/tools/keyword_generator.htm
* Ranks.nl Keyword Density, Placement, Prominence Analyzer
  http://www.ranks.nl/tools/spider.html

Keyword density refers to the ratio of a keyword or phrase to the total number of words on a Web page. The keyword density should be between 1% and 7%; that is, there should be 1 to 7 occurrences of the keyword for every 100 words on the Web page. For example, a page containing 500 words of text should use a given keyword between 5 and 35 times. A Web developer can count the keywords manually or can acquire a software product to do the count. Two software products that will analyze keyword density are:

* WebPositionGold by FirstPlace Software
  http://www.webposition.com/
* GRKda Keyword Density Analyzer by GRSoftware
  http://www.grsoftware.net/search_engines/software/grkda.html

4. OPTIMIZING HTML TAGS

A search index does not view a Web page as a human does; rather, it reads only the text-based code contained in the HTML tags. Thus, optimization for a search index is really optimization of selected HTML tags. Of course, in order to accomplish this, it is necessary to have some understanding of HTML. Many Web design courses use WYSIWYG (what you see is what you get) web authoring software such as Microsoft FrontPage or Macromedia Dreamweaver. Although this software allows for Rapid Application Development (RAD), it can also hide the HTML from the Web designer. Furthermore, some Web authoring programs produce HTML code that is full of incorrect or unnecessary tags. As a result, when optimizing a page for a search index, it may be necessary to rewrite parts of the code itself. Section 6, CLASSROOM IMPLEMENTATION, provides an example of how optimization techniques are incorporated in a course at Robert Morris University. This course teaches HTML and uses HTML, not web authoring software, for the development of a Web site. Students who complete the class may then want to learn and use a web authoring tool. This approach assures that the student fully understands the HTML language and, if later using Web authoring software, will be able to rewrite HTML code if necessary.

The title tag

The title tag is given significant weight by most crawler-based search indexes, and its contents are commonly displayed as the linked heading of a search result. The title should therefore be a brief, readable phrase that incorporates the page's most important keywords rather than a generic label such as "Home Page."
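For example, the head section of a home page for the hypothetical birdhouse site might be coded as follows (the keyword and description values are illustrative only). The keyword, description, and robots META tags shown here are the hidden tags referred to in section 2 and are used in the course described in section 6:

    <head>
      <!-- The title leads with the page's most important keyword phrase -->
      <title>Handcrafted Cedar Birdhouses - Custom Birdhouse Designs</title>
      <!-- Hidden META tags that some indexes parse -->
      <meta name="keywords" content="handcrafted birdhouses, cedar birdhouse, custom birdhouse">
      <meta name="description" content="Handcrafted cedar birdhouses built to order in custom designs.">
      <!-- index,follow asks a spider to index this page and follow its links -->
      <meta name="robots" content="index,follow">
    </head>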
Image alt tag and description

A spider cannot interpret the content of a graphic; it reads only the text associated with it. Every significant image should therefore include an alt attribute containing a brief description of the graphic that incorporates relevant keywords.
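Continuing the hypothetical birdhouse example, an optimized image tag might be coded:

    <!-- The alt text describes the graphic using a relevant keyword phrase -->
    <img src="cedar_birdhouse.jpg" alt="Handcrafted cedar birdhouse with copper roof" width="300" height="200">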
5. ADDITIONAL WEB DESIGN CONSIDERATIONS
This section focuses on what to avoid when designing with indexing spiders in mind.
Little or no body text or content
When there is little or no body text on a Web page, some crawlers will take this to mean that there is no content on the website worth indexing.
Large file size
This refers to the total size of the HTML code. If a Web page contains a great amount of text, the file size can become very large, and an indexing spider may give up if the page takes too long to read. This is another reason why it is a good idea to break up large amounts of text into separate pages, as illustrated below.
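As a hypothetical sketch, a lengthy catalogue could be split into separate pages joined by ordinary text links, each page remaining small enough for a spider to read quickly:

    <!-- index page: small file, with plain links to the content pages -->
    <h1>Birdhouse Catalogue</h1>
    <ul>
      <li><a href="cedar_birdhouses.html">Cedar Birdhouses</a></li>
      <li><a href="pine_birdhouses.html">Pine Birdhouses</a></li>
      <li><a href="custom_designs.html">Custom Designs</a></li>
    </ul>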
URL redirection
When a Web site's URL is re-directed to another Web host, a spider may not be able to follow the re-direct to the destination page and therefore may not index the page. If a re-direction is necessary, it is essential to make the site's index page a spider-optimized doorway page that contains the re-direct.
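One common way to code such a doorway page is with a META refresh (the destination URL below is hypothetical); the page carries an optimized title and readable body text for the spider, while visitors are forwarded to the new host after a short delay:

    <html>
    <head>
      <title>Handcrafted Cedar Birdhouses - We Have Moved</title>
      <!-- forward the visitor to the new host after 5 seconds -->
      <meta http-equiv="refresh" content="5; url=http://www.example-newhost.com/birdhouses/">
    </head>
    <body>
      <h1>Handcrafted Cedar Birdhouses</h1>
      <p>Our site has moved. If you are not forwarded automatically, please follow
      <a href="http://www.example-newhost.com/birdhouses/">this link</a>.</p>
    </body>
    </html>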
JavaScript/Cascading Style Sheet
To a crawler, JavaScript and style definitions are a jumble of code that it does not understand, and they can have an adverse effect on indexing. Although both are widely used in site design, they are something to consider if there is difficulty getting a site indexed. It is recommended to place the script and the styles in separate text files external to the Web page and then link them into the page.
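The linkage is straightforward. Assuming external files named styles.css and rollover.js (hypothetical names), the page's head section contains only two short references in place of the code itself:

    <head>
      <title>Handcrafted Cedar Birdhouses</title>
      <!-- styles moved out of the page into an external file -->
      <link rel="stylesheet" type="text/css" href="styles.css">
      <!-- script moved out of the page into an external file -->
      <script type="text/javascript" src="rollover.js"></script>
    </head>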
Flash
In Web design, Flash can be a fantastic element. However, spiders cannot read Flash content. If Flash is used, one consideration is to create a crawler-optimized entrance page that contains no Flash.
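Such an entrance page can be plain HTML with an optimized title and body text, offering the visitor an ordinary link into the Flash version of the site (file name hypothetical):

    <body>
      <h1>Handcrafted Cedar Birdhouses</h1>
      <p>Welcome to our catalogue of handcrafted cedar birdhouses, built to order
      in custom designs.</p>
      <!-- a plain text link that both visitors and spiders can follow -->
      <p><a href="flash_home.html">Enter the full multimedia site</a></p>
    </body>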
Fancy navigation
A fancy navigation structure can make it difficult for a spider to traverse the other pages of the website.
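A common safeguard, sketched below, is to repeat the navigation as plain text links at the bottom of each page so that a spider always has an ordinary path to follow:

    <!-- plain text links duplicating the fancy navigation -->
    <p>
      <a href="index.html">Home</a> |
      <a href="catalogue.html">Catalogue</a> |
      <a href="about.html">About Us</a> |
      <a href="contact.html">Contact</a>
    </p>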
Free Web hosting
A free hosting service will often place the website on the same server as many other websites, all of them sharing the same IP address. If a Web site has the same IP address as sites already in an indexing database, the index may not accept the site, or the site may be slow or inaccessible because another website on the server is consuming all of the resources. Furthermore, if a Web site has the same IP address as a site banned from an indexing service, the indexing service may ban that site as well. It is worthwhile to acquire a unique IP address and a reliable hosting service.
6. CLASSROOM IMPLEMENTATION
The search index optimization topics previously presented are implemented in an undergraduate and a graduate course at Robert Morris University titled HTML/Internet. The first half of the course (based on a 14-week semester) uses a traditional lecture, discussion, and hands-on format; the remaining class time follows a project-based format.
The lecture/discussion/hands-on portion teaches the student HTML syntax. It is textbook driven and covers the anatomy of a Web page as well as the tags necessary for hyperlinks, colors, graphics, tables, and forms. Also discussed are Cascading Style Sheets and elementary JavaScript concepts. Students receive a lecture on each tag and are then required to complete an exercise at the end of the chapter.
The second portion of the class is devoted to having each student apply the major principles of the systems development life cycle (SDLC) to develop a personal web site and post it to a Web server. The site requires a home page and at least four hyperlinks to content pages.
Before the students begin working on their personal sites, they are given an overview of search indexes, including the popular search indexes and the components of a search index as discussed previously in this paper. Additionally, a class is devoted to visiting various search indexes and determining the steps involved in submitting a site to each particular index. With this accomplished, each student is required to create a list of keywords and phrases for his or her site, using the Web resources for selecting a keyword list mentioned previously in this paper (3. SELECTING KEYWORDS AND PHRASES). The instructor reviews this keyword list; once it is approved, the student may create the page layout design for the home page. Items such as site color scheme, graphics, text, and navigation considerations are determined so that the page is appealing and compelling to a target audience. Additionally, in order to assure that the page is optimized to work with a crawler-based search index, selected keywords from the student's list are incorporated, with attention given to the title tag, header tags, and image tags. Also included are the keyword, description, and robots META tags. When the page is completed, the instructor checks the code of each student's home page to assure that all tags are optimized, and the student submits the page to either WebPositionGold or the GRKda Keyword Density Analyzer to check the keyword density.
The points discussed in section 5, ADDITIONAL WEB DESIGN CONSIDERATIONS, of this paper are also addressed. For example, if students wish to use JavaScript or cascading style sheets in their sites, they must place the script and the styles in separate text files external to the Web page and then link them into the page. Additionally, if a student wants to incorporate Flash, a crawler-optimized entrance page without Flash must be included.
The remaining pages of the site are then constructed and checked for optimization, as was the home page. When the site is completed, it is posted to the server at Robert Morris University or to a hosting service of the student’s choice.
7. CONCLUSION
This paper discusses design concepts related to optimizing a Web site for crawler-based search indexes in order to attract worldwide interest, or "hits," to the website. After attracting new visitors, the next step is to keep those visitors coming back. One way to assure this is to update the Web site on a regular basis. Another way is to keep up to date with Web services. Finally, be sure to include an e-mail address on all Web pages and encourage visitors to send any comments or suggestions.
8. REFERENCES
All About Search Engine Robots and Spiders, 2002, [Online]. http://www.searchtools.com/robots
Felke, Terry A., 2003, Web Developer Foundations. Scott Jones Publishers, El Granada, CA.
Fotheringham, John A., 2003, Search Engine Robots. [Online]. http://www.jafsoft.com/searchengines/webbots.html/
Lehnert, W., 2001, Web 101: Making the Net Work for You. Addison Wesley, Reading, MA.
Marckini, Fredrick W., 2001, Search Engine Positioning. Republic of Texas Press.
Nobles, Robin and Susan O'Neil, 2001, Streetwise Maximize Web Site Traffic: Build Web Site Traffic Fast and Free by Optimizing Search Engine Placement. Adams Media Corporation.
Pitkow, J. and C. Kehoe, 1997, “Results from the Seventh WWW User Survey”, [Online]. http://www.gvu.gatech.edu/user_surveys/survey-1997-04/bulleted/use_bullets.html
Sekhar, Chandra, 2001, Internet Marketing and Search Engine Positioning - A "Do It Yourself" Guide. Southern Pub Group.