In an era where an estimated 1.36 * 10^11 pieces of A4 paper, or 2 percent of the Amazon rainforest, would be needed to make a single hard copy of the internet, it is extremely difficult to know exactly how to find the information you are looking for. Fortunately for us, there are plenty of tools to help us find the way. Probably the most important such tool is the search engine. But how does such a tool guide us to the information we seek in a fraction of a second?
Text by: Pepijn Wissing
Search engines have three major functions: crawling, building an index, and providing us with a ranked list of websites that the program has determined to be relevant for the search query entered by a user. To understand what crawling is, imagine the World Wide Web to be a network of stops in a big city subway system. Each stop is a unique document: usually a web page, but sometimes a PDF, JPG or some other file. To determine how all of these stops are connected, the search engine “crawls” through the entire city and finds all of the stops along the way. To do that, it uses the best path available: links.
The link structure of the web serves to bind all of the pages together. Links allow the search engines’ automated robots, called “crawlers” or “spiders,” to reach the many billions of connected documents on the web. Once the engines find these pages, they decipher the code of each page and make a complete list of everything on it: the page title, images, keywords, and any other pages it links to, at a bare minimum. Creating this list is called indexing. Modern crawlers may also cache a copy of the whole page and look for additional information, such as the page layout, where the advertising units are, and where the links sit on the page (featured prominently in the article text, or hidden in the footer?).
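To make the "list of everything on the page" a little more concrete, here is a minimal sketch in Python that pulls the title, the outgoing links and the image descriptions out of one page. It uses only the standard library; the names PageSummary, summarize and SAMPLE_PAGE are made up for this illustration, and a real crawler would record far more than this.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, outgoing links and image descriptions of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.image_alts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # Every link is a candidate for the crawler's to-do list.
            self.links.append(attrs["href"])
        elif tag == "img":
            self.image_alts.append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Return a small dictionary describing one HTML page."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "image_alts": parser.image_alts,
    }


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """
<html><head><title>Cookie recipes</title></head>
<body>
  <img src="cookie.jpg" alt="A plate of cookies">
  <a href="/gluten-free">Gluten free</a>
  <a href="/contact">Contact</a>
</body></html>
"""

summary = summarize(SAMPLE_PAGE)
print(summary)
```

The links collected here are exactly what the crawler needs to decide where to go next.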
The crawler then adds all the new links it found to a list of places to crawl next, in addition to re-crawling sites to see if anything has changed: it is a never-ending process. Any site that is linked to from another site already indexed, or any site that manually asked to be indexed, will eventually be crawled – some sites more frequently than others and some to greater depth. If the site is huge and content is hidden many clicks away from the home page, the crawler bots may eventually give up. There are ways to ask search engines not to index a site, though this is rarely used to block an entire website.
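The "list of places to crawl next" can be sketched as a simple queue. The Python example below is a toy under stated assumptions: TOY_WEB is an invented, in-memory stand-in for the web, crawl is a hypothetical function name, and max_depth is our own simplification of "giving up" on content that is too many clicks away. Real crawlers also re-visit pages and schedule sites at different frequencies, which this sketch leaves out.

```python
from collections import deque

# Hypothetical miniature web: each page maps to the pages it links to.
TOY_WEB = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post-1", "post-2"],
    "post-1": ["post-2"],
    "post-2": ["archive"],
    "archive": ["deep-page"],
    "deep-page": [],
}


def crawl(start, get_links, max_depth):
    """Breadth-first crawl that stops following links max_depth clicks from start."""
    seen = {start}
    frontier = deque([(start, 0)])  # the list of places to crawl next
    order = []
    while frontier:
        page, depth = frontier.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # too many clicks from the start page: give up here
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


visited = crawl("home", lambda page: TOY_WEB.get(page, []), max_depth=3)
print(visited)
```

With a depth limit of three clicks, the page called "deep-page" is never reached, which is the toy version of content hidden too far from the home page.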
Do you have too much time on your hands and are you interested in building your very own web crawler using PHP? A full tutorial can be found here!
Even the spiders cannot go everywhere though. There are still parts of the internet that are essentially invisible to search engines: the so-called “deep web”. Specialized engines to discover the deep web exist; some of them are even freely available. “DeepPeep”, for example, aims to enter the invisible web through forms that query databases and web services for information. However, there are also those on the deep web that have good reason to not want to be found: if you know where to look, you can find pages where almost anything is traded, up to and including highly illegal contraband.
You would be forgiven for thinking that indexing is an easy step. Imagine trying to make a list of all the books you own, their authors and their number of pages. Going through each book to find the information you are looking for would be equivalent to crawling your bookcase; the list you are writing is the index. But now imagine it is not just your bookcase or even your parents’ library, but every library in the world. That would be a small-scale version of what Google does.
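One common way to organize such a list so that it can be searched quickly is an inverted index: instead of storing "which words are in this page", it stores "which pages contain this word". The Python sketch below is an illustration only; DOCUMENTS, tokenize, build_index and lookup are names invented here, and we are not claiming that any particular search engine stores its index this way.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for crawled pages.
DOCUMENTS = {
    "page-1": "Gluten free cookie recipes",
    "page-2": "Chocolate cookie recipes with nuts",
    "page-3": "Subway maps of big cities",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(DOCUMENTS)
print(lookup(INDEX, "cookie recipes"))
```

Looking up a word is now a single dictionary access, no matter how many documents were indexed, which is why the expensive work is done once while indexing rather than again for every search.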
To accomplish the monumental task of holding billions of pages that can be accessed in a fraction of a second, the search engine companies have constructed data centers all over the world. These monstrous storage facilities hold thousands of machines processing large quantities of information rapidly. When a person performs a search at any of the major engines, they demand results instantaneously. Consumers will always gravitate to the search engine that delivers them their results most promptly, so the engines work hard to provide answers as fast as possible.
A look inside one of Google’s data centers.
While a big part of their jobs involves indexing the Internet, search engines are in essence answer machines. When a person performs an online search, the search engine searches its database of billions of documents and does two things: first, it returns only those results that are relevant or useful to the searcher’s query; second, it ranks those results according to the popularity of the websites serving the information.
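The two steps described above can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the PAGES data is invented, relevance is reduced to "contains every query word", and popularity is reduced to a made-up count of inbound links. Real engines weigh far more signals than this.

```python
# Hypothetical pages: the text and the inbound link counts are invented.
PAGES = [
    {"url": "a.example/cookies", "text": "gluten free cookie recipes", "inbound_links": 12},
    {"url": "b.example/baking", "text": "cookie recipes for beginners", "inbound_links": 40},
    {"url": "c.example/metro", "text": "subway maps of big cities", "inbound_links": 95},
]


def search(pages, query):
    """Filter pages by relevance, then order them by popularity."""
    terms = query.lower().split()
    # Step 1 (relevance): keep only pages that contain every query term.
    relevant = [p for p in pages if all(t in p["text"].split() for t in terms)]
    # Step 2 (popularity): the most linked-to pages come first.
    ranked = sorted(relevant, key=lambda p: p["inbound_links"], reverse=True)
    return [p["url"] for p in ranked]


results = search(PAGES, "cookie recipes")
print(results)
```

Note that the most popular page overall, the one about subway maps, does not appear at all: popularity only orders the pages that survived the relevance step.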
So how do search engines determine relevance and popularity? To a search engine, relevance means more than finding a page that contains the right words. In the early days of the web, search engines did not do much more than this simple step, which made search results of limited value. Over the years, smart engineers have devised better ways to match results to searchers’ queries. Today, hundreds of factors influence relevance, and we will discuss the most important of these.
Popularity and relevance are not determined manually. Instead, the engines employ algorithms to sort the wheat from the chaff (relevance), and then to rank the wheat in order of quality (popularity). These algorithms are search engine specific and are not known to the public, but it will not come as a surprise that researchers have been putting thousands of hours into building series of web pages that differ on as little as a single font setting, to determine which algorithm favors which features of a web page.
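Although the algorithms in use today are secret, one well-known published idea for measuring popularity is PageRank: a page is important if important pages link to it. The Python sketch below is a simplified version of that idea, not a description of what any engine currently runs. The function name pagerank, the damping value of 0.85, the number of iterations and the TOY_LINKS graph are all assumptions chosen for the example, and the sketch assumes every linked page also appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    links maps each page to the list of pages it links to; every target
    is assumed to appear as a key as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page receives a small base score...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # ...plus an equal share of the score of each page linking to it.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph, invented for this example.
TOY_LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

scores = pagerank(TOY_LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph, page "c" ends up on top because three other pages link to it, while page "d", which nobody links to, ends up at the bottom.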
An example of particularly bad SEO, called “keyword stuffing”.
Search Engine Optimization
Considering that almost 60% of all searchers end up clicking the first result, one can imagine that the success of a company depends on the ease with which potential customers can find it on the web, which is where the search engines play a big role. In fact, many companies actively perform Search Engine Optimization (SEO): the dark art of manipulating the company’s web pages to make search engines prefer them to those of a rival company.
Search engines such as Google and Bing even provide SEO tips! For example, Google’s SEO tips include the following guideline: “Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link.
Create a useful, information-rich site, and write pages that clearly and accurately describe your content. Make sure that your <title> elements and ALT attributes are descriptive and accurate.”
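Part of this guideline is easy to check automatically. The Python sketch below tests one page for a non-empty <title> element and for images without ALT text. It is a minimal illustration using the standard library; GuidelineChecker, check_page and the two sample pages are hypothetical, and "descriptive and accurate" is something only a human reader can really judge.

```python
from html.parser import HTMLParser


class GuidelineChecker(HTMLParser):
    """Records the page title and any images that lack ALT text."""

    def __init__(self):
        super().__init__()
        self.title_text = ""
        self.images_without_alt = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            self.images_without_alt.append(attrs.get("src") or "(no src)")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def check_page(html):
    """Return a list of human-readable problems found on the page."""
    checker = GuidelineChecker()
    checker.feed(html)
    checker.close()
    problems = []
    if not checker.title_text.strip():
        problems.append("missing or empty <title>")
    for src in checker.images_without_alt:
        problems.append("image without ALT text: " + src)
    return problems


# Hypothetical pages, invented for this example.
GOOD_PAGE = (
    "<html><head><title>Cookie recipes</title></head>"
    '<body><img src="c.jpg" alt="Cookies"></body></html>'
)
BAD_PAGE = (
    "<html><head><title></title></head>"
    '<body><img src="c.jpg"></body></html>'
)

print(check_page(GOOD_PAGE))
print(check_page(BAD_PAGE))
```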
What is next?
Of course, there is still a lot to be improved upon. Probably the most promising up-and-coming feature for improving search engines is semantics: the meaning and type of the content a page contains. Let us provide a simple example. Right now, when you search for gluten free cookies, you might find a regular cookie with a bit of text that says “this cookie is not gluten free”. In a world with semantics, you could search for cookie recipes and then remove regular flour from your list of ingredients. Then, you could remove any recipes with nuts and, if you wish, you could even narrow it down to only recipes with a review score of better than 4/5 and a preparation time of less than 30 minutes. The best part is, this is not even fiction anymore: Google.com’s search tools already offer an ingredients filter, and more!
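Once a page describes its content as structured data instead of plain text, the cookie search above becomes a simple filter. The Python sketch below shows the idea; the RECIPES list, its field names and the function filter_recipes are all invented for this illustration and do not reflect how any search engine actually stores recipes.

```python
# Hypothetical structured recipe data, invented for this example.
RECIPES = [
    {"name": "Classic cookie", "ingredients": {"flour", "butter", "sugar"},
     "rating": 4.6, "minutes": 25},
    {"name": "Almond cookie", "ingredients": {"almond flour", "almonds", "sugar"},
     "rating": 4.8, "minutes": 20},
    {"name": "Oat cookie", "ingredients": {"oats", "banana", "raisins"},
     "rating": 4.4, "minutes": 15},
    {"name": "Rice cookie", "ingredients": {"rice flour", "butter", "sugar"},
     "rating": 3.9, "minutes": 40},
]


def filter_recipes(recipes, excluded, min_rating, max_minutes):
    """Keep recipes without excluded ingredients, rated above min_rating
    and ready in less than max_minutes."""
    return [
        r["name"]
        for r in recipes
        if not (r["ingredients"] & excluded)
        and r["rating"] > min_rating
        and r["minutes"] < max_minutes
    ]


matches = filter_recipes(
    RECIPES,
    excluded={"flour", "almonds"},  # no regular flour, no nuts
    min_rating=4.0,
    max_minutes=30,
)
print(matches)
```

A plain keyword search cannot do this reliably, because the words "gluten free" on a page say nothing about whether they describe the recipe or its opposite.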