Finding the hidden Internet

by David Stuart

Created on: April 14, 2007   Last Updated: April 02, 2011

The term 'hidden internet' refers to information that is available on the web but has not been indexed by a search engine. It is also referred to as the 'deep web', the 'invisible web' or the 'hidden web' (the terms 'web' and 'internet' are often used interchangeably, despite referring to different things). As each search engine indexes a different portion of the web, the 'hidden web' is relative to the search engine(s) being used, although the term may also be used more broadly to refer to those parts of the web not indexed by any of the major search engines.

There are four main reasons why portions of the web may not be included within a search engine's database of web pages: the structure of the web, the design of web sites, the capabilities of web crawlers, and the choices made by the search engines.

• The structure of the web: Web crawlers (which collect information from web pages for search engines) find web pages by following the links from other web pages. If a web page (or group of pages) is not linked to by any page the crawler is aware of, the crawler cannot find it.

• The design of web sites: Whilst web crawlers are capable of finding the links embedded in HTML, they are not necessarily able to collect the links in sites built with more complicated technologies, e.g., Flash and Java. They are also unable to look beyond the password-protected sections of a web site, or to reach those sections that are only returned when a user fills in a query box.

• The capabilities of the web crawler: Different search engines have crawlers of different capabilities. Whilst some can only index HTML files, others are capable of indexing files in other formats. PDFs, for example, were once considered part of the invisible web but are now indexed by search engines such as Google.

• Choices made by search engines: Web crawlers do not crawl and index every page they can. The search engines are commercial organisations and do not consider crawling every web page to be in their financial interests, as the majority of their users are happy with the number of pages already crawled. The major search engines also usually choose to respect the robots.txt protocol, which allows web site owners to tell crawlers which pages they do not want crawled.
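
The first and last of these reasons can be illustrated with a small sketch. It is not how any particular search engine works: the toy web, the URLs, the robots.txt rules and the crawler name "toy-crawler" are all assumptions made up for the example. The crawler starts from one known page, follows links, and skips anything that robots.txt disallows, so both the unlinked page and the disallowed page end up 'hidden'.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical toy web: each page maps to the pages it links to.
TOY_WEB = {
    "http://example.com/": [
        "http://example.com/about",
        "http://example.com/private/report",
    ],
    "http://example.com/about": ["http://example.com/"],
    "http://example.com/private/report": [],
    # No other page links to this one, so a crawler cannot discover it.
    "http://example.com/orphan": ["http://example.com/"],
}

# Hypothetical robots.txt for the toy site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(seed, web, robots_lines, agent="toy-crawler"):
    """Breadth-first crawl from `seed`, respecting the robots.txt rules."""
    rules = RobotFileParser()
    rules.parse(robots_lines)

    seen = {seed}
    queue = deque([seed])
    indexed = []
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # the site owner asked crawlers to stay out
        indexed.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


indexed = crawl("http://example.com/", TOY_WEB, ROBOTS_TXT.splitlines())
hidden = sorted(set(TOY_WEB) - set(indexed))
print("indexed:", indexed)
print("hidden: ", hidden)
```

A real crawler would fetch pages over the network and extract links from the HTML; the in-memory dictionary stands in for that step so the sketch stays self-contained.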

Some web sites have been designed specifically to search the invisible web, e.g., www.pipl.com, which has indexed information from databases containing information about individuals. However, so many databases are available that such sites are by no means exhaustive, and most users have generally found the breadth of the generic search engines more appropriate.

The future for the discovery of information from the invisible web looks rosier, as the big four search engines have now all agreed to the sitemaps protocol (www.sitemaps.org). This allows webmasters to inform search engines of the pages on their site (rather than just those they do not want crawled) and should enable more pages to be found by the search engines. Whilst search engines will still not index the whole of the web, the use of a meta-search engine (a search engine that searches more than one search engine on your behalf) will provide access to more data than ever before.
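
As a rough illustration of what a sitemap gives a crawler, the sketch below reads the page addresses out of a small sitemap file. The XML namespace is the one published at www.sitemaps.org; the example URLs and the function name `sitemap_urls` are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap for a toy site. The second page need not be linked
# from anywhere: listing it here is enough to tell a crawler it exists.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2011-04-02</lastmod>
  </url>
  <url>
    <loc>http://example.com/orphan</loc>
  </url>
</urlset>
"""

# Namespace used by sitemap files, mapped to a short prefix for lookups.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the page addresses listed in a sitemap document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(sitemap_urls(SITEMAP_XML))
```

Listing a page in a sitemap only tells the search engine that the page exists; whether it is then crawled and indexed remains the search engine's choice.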

A selection of meta-search engines:

• www.mamma.com
• www.dogpile.com
• www.surfwax.com
• www.aftervote.com
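
The idea behind a meta-search engine can be sketched in a few lines. This is only one simple way of combining results, not a description of how any of the sites listed above work: the two 'engines' are hypothetical functions returning made-up results, and the merge simply interleaves their ranked lists and drops duplicates.

```python
from itertools import zip_longest


def engine_a(query):
    """Hypothetical search engine: returns made-up ranked results."""
    return ["http://example.com/a", "http://example.com/b"]


def engine_b(query):
    """Hypothetical search engine covering a different part of the web."""
    return ["http://example.com/b", "http://example.com/c", "http://example.com/d"]


def meta_search(query, engines):
    """Interleave each engine's ranked results, keeping the first copy of each URL."""
    ranked_lists = [engine(query) for engine in engines]
    merged = []
    seen = set()
    for same_rank in zip_longest(*ranked_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


results = meta_search("hidden internet", [engine_a, engine_b])
print(results)
```

Because each engine has indexed a different portion of the web, the merged list contains pages that neither engine would have returned on its own.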
