The term 'hidden internet' refers to information that is available on the web but has not been indexed by a search engine. It is also referred to as the 'deep web', the 'invisible web' or the 'hidden web' (the terms 'web' and 'internet' are often used interchangeably despite having distinct differences). As each search engine indexes a different portion of the web, the 'hidden web' is relative to the search engine(s) being used, although the term may also be used more broadly to refer to those parts of the web not indexed by any of the major search engines.
There are many different reasons why portions of the web may not be included within a search engine's database of web pages: due to the structure of the web, due to the design of web sites, due to the capabilities of web crawlers, and due to the choices made by the search engines.
* The structure of the web: Web crawlers (which collect information from web pages for search engines) find web pages by following the links from other web pages. If a web page (or group of pages) is not linked to by a page the crawler is aware of, the web crawler cannot find it.
* The design of web sites: Whilst web crawlers are capable of finding the links embedded within HTML, they are not necessarily able to collect the links in sites written in more complicated formats, e.g., Flash and Java. They are also unable to look beyond the password-protected sections of a web site and those sections that require the user to fill in a query box.
* The capabilities of the web crawler: Different search engines have crawlers of different capabilities. Whilst some can only index HTML files, others are capable of indexing files in other formats, e.g., PDFs were once considered part of the invisible web but are now indexed by search engines such as Google.
* Choices made by search engines: Web crawlers do not crawl and index every page they can. The search engines are commercial organisations and do not consider crawling every web page to be in their financial interests, as the majority of their users are happy with the number of pages that are crawled. The major search engines also usually choose to respect the robots.txt protocol, which allows web site owners to inform crawlers of any pages they do not want indexed.
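Two of the reasons above, unlinked pages and robots.txt exclusions, can be illustrated with a minimal sketch. It assumes a made-up site on example.com held in memory as a dictionary of links, a made-up robots.txt file and a hypothetical crawler name (ExampleBot); no real search engine is claimed to work exactly this way. The robots.txt rules are read with Python's standard urllib.robotparser module.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "http://example.com/": [
        "http://example.com/about",
        "http://example.com/private/report",
    ],
    "http://example.com/about": ["http://example.com/"],
    "http://example.com/private/report": [],
    "http://example.com/orphan": [],  # exists, but no page links to it
}

# Hypothetical robots.txt asking all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(seed, links, robots_txt, agent="ExampleBot"):
    """Follow links breadth-first from seed, skipping disallowed pages."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    indexed = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # the site owner asked crawlers not to fetch this page
        indexed.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


indexed = crawl("http://example.com/", LINKS, ROBOTS_TXT)
hidden = sorted(set(LINKS) - set(indexed))
print("Indexed:", indexed)
print("Hidden: ", hidden)
```

In this sketch the orphan page stays hidden because nothing links to it, and the report stays hidden because the site owner excluded it, even though both are publicly available.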
Whilst certain web sites have been designed specifically for interaction with the invisible web, e.g., www.pipl.com, which has indexed information from databases containing information about individuals, the number of databases available means that such sites are by no means exhaustive, and most users have found the breadth of the generic search engines more appropriate.
The future for discovery of information from the invisible web looks rosier, as the big four search engines have now all agreed to the sitemaps protocol (www.sitemaps.org), which allows webmasters to inform search engines of the pages on their site (rather than just those they do not want crawled) and should enable more pages to be found by the search engines. Whilst search engines will still not index the whole of the web, the use of a meta-search engine (a search engine that searches more than one search engine on your behalf) will provide access to more data than ever before.
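As a rough illustration of the sitemaps protocol, the sketch below builds a minimal sitemap file listing pages that a crawler might not otherwise find. The page addresses and the build_sitemap function are made up for the example; the XML namespace is the one published at www.sitemaps.org, and only the required loc element is included.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps protocol (www.sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML string with one <url> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, including one that no other page links to.
sitemap_xml = build_sitemap([
    "http://example.com/",
    "http://example.com/orphan",
])
print(sitemap_xml)
```

A webmaster would typically save this output as a file on the site so that search engines can read it; whether a listed page is then actually indexed remains the search engine's choice.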
A selection of meta-search engines:
-www.mamma.com
-www.dogpile.com
-www.surfwax.com
-www.aftervote.com
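The idea behind a meta-search engine can be sketched as merging the result lists of several engines and removing duplicates. The engine names and results below are invented for the example, and real meta-search engines such as those listed above may combine and rank results quite differently.

```python
def merge_results(results_by_engine):
    """Combine result lists, keeping first-seen order and noting which engines found each URL."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            merged.setdefault(url, []).append(engine)
    return merged


# Hypothetical results: each engine has indexed a different portion of the web.
results = {
    "engine_a": ["http://example.com/", "http://example.com/about"],
    "engine_b": ["http://example.com/about", "http://example.org/deep-page"],
}

merged = merge_results(results)
for url, engines in merged.items():
    print(url, "found by", ", ".join(engines))
```

Because each engine indexes a different portion of the web, the merged list covers pages that are hidden from any single engine used on its own.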
Learn more about this author, David Stuart.