Definition of Spidering and Web Crawlers

Spiders & Web Crawlers: What You Need to Know to Protect Website Data


Spiders are programs (or automated scripts) that 'crawl' through the Web looking for data. Spiders travel through website URLs and can pull data from web pages, such as email addresses. Spiders are also used to feed information found on websites to search engines.

Spiders, which are also referred to as 'web crawlers,' search the Web, and not all of them are friendly in their intent.

Spammers Spider Websites to Collect Information

Google, Yahoo!, and other search engines are not the only ones interested in crawling websites; scammers and spammers are, too.

Spammers use spiders and other automated tools to find email addresses on websites (a practice often referred to as 'harvesting') and then use them to build spam lists.

Spiders are also a tool search engines use to learn more about your website, but left unchecked, a website without instructions (or 'permissions') on how it may be crawled can present major information-security risks. Spiders travel by following links, and they are very adept at finding links to databases, program files, and other information to which you may not want them to have access.

Webmasters can view logs to see what spiders and other robots have visited their sites. This information helps webmasters know who is indexing their site, and how often.

This information is useful because it allows webmasters to fine-tune their SEO and update their robots.txt files to prohibit certain robots from crawling their site in the future.
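As a sketch of how a webmaster might spot crawler visits, the snippet below scans access-log lines for well-known bot user agents. It assumes the common "combined" log format, where the user-agent string is the last quoted field; the sample lines and the list of bot names are illustrative, not exhaustive.

```python
# Identify search-engine crawler visits in web server access logs.
# Assumes the "combined" log format: the user-agent is the final
# double-quoted field on each line. Sample lines are made up.
import re

log_lines = [
    '1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '5.6.7.8 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '9.9.9.9 - - [10/Oct/2023:13:57:12 +0000] "GET /contact HTTP/1.1" 200 256 "-" "Bingbot/2.0 (+http://www.bing.com/bingbot.htm)"',
]

# User-agent substrings used by some well-known search-engine bots.
KNOWN_BOTS = ("Googlebot", "Bingbot", "Slurp", "DuckDuckBot")

def bot_visits(lines):
    """Return the bot name for each log line whose user-agent matches a known bot."""
    visits = []
    for line in lines:
        # Grab the final double-quoted field (the user-agent string).
        match = re.search(r'"([^"]*)"\s*$', line)
        if not match:
            continue
        agent = match.group(1)
        for bot in KNOWN_BOTS:
            if bot in agent:
                visits.append(bot)
    return visits

print(bot_visits(log_lines))  # ['Googlebot', 'Bingbot']
```

A real report would also group visits by bot and date to show how often each crawler returns.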

Tips on Protecting Your Website From Unwanted Robot Crawlers

There is a fairly simple way to keep unwanted crawlers out of your website. Even if you are not concerned about malicious spiders crawling your site (obfuscating email addresses will not protect you from most crawlers), you still need to provide search engines with important instructions.

All websites should have a file in the root directory called robots.txt. This file lets you tell web crawlers, if they are search engines, which parts of your site you want them to look at and index (unless a specific page's metadata marks it as no-index).
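A minimal robots.txt might look like the sketch below. The directory name, bot name, and domain are hypothetical placeholders; substitute your own paths and the actual user-agent names of the crawlers you want to control.

```text
# robots.txt — placed at the site root, e.g. https://www.example.com/robots.txt

# All crawlers may index the site, except a private directory.
User-agent: *
Disallow: /private/

# Block one specific (hypothetical) crawler from the entire site.
User-agent: BadBot
Disallow: /

# Point search engines at your sitemap.
Sitemap: https://www.example.com/sitemap.xml
```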

Just as you can tell wanted crawlers where you want them to browse, you can also tell them where they may not go and even block specific crawlers from your entire website.
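To see how a well-behaved crawler honors those allow/deny rules, here is a small sketch using Python's standard `urllib.robotparser`. The rules are parsed from an in-memory string so the example runs without network access; a real crawler would point the parser at a live robots.txt URL and call `read()`. `BadBot` and `example.com` are hypothetical.

```python
# How a cooperative crawler consults robots.txt before fetching a page,
# using Python's standard library. Rules are supplied inline for the demo.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Generic crawlers may fetch public pages but not the private directory.
print(parser.can_fetch("*", "https://example.com/index.html"))      # True
print(parser.can_fetch("*", "https://example.com/private/db.sql"))  # False

# The specifically named crawler is blocked from the entire site.
print(parser.can_fetch("BadBot", "https://example.com/index.html")) # False
```

Note that this check is voluntary: the parser only tells a crawler what the site asks; nothing forces a hostile bot to consult it.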

It is important to bear in mind that a well-constructed robots.txt file has tremendous value for search engines and can even be a key element in improving your website's performance, but some robot crawlers will still ignore your instructions. For this reason, it is important to keep all of your software, plugins, and apps up to date at all times.

Related Articles and Information

Due to the prevalence of information harvesting used for nefarious (spam) purposes, legislation was passed in 2003 to make certain practices illegal. These consumer protection laws fall under the CAN-SPAM Act of 2003.

It is important that you take the time to read up on the CAN-SPAM Act if your business engages in any mass mailing or information harvesting.

You can find out more about anti-spam laws, how to deal with spammers, and what you as a business owner may not do by reading the following articles: