Definition of Spidering and Web Crawlers

Spiders & Web Crawlers: What You Need to Know to Protect Website Data

Spiders, also known as web crawlers, are automated programs or scripts that "crawl" the Web looking for data. They move from URL to URL and can pull data such as email addresses from web pages. Search engines also use them to collect the information they index. Be aware, though, that not all spiders are friendly in their intent.

How Spammers Use Spiders

Google, Yahoo! and other search engines are not the only ones interested in crawling websites—so are scammers and spammers.

Spammers use spiders and other automated tools to find email addresses on websites (a practice often referred to as "harvesting") and compile them into spam lists.

Search engines also use spiders to find out more about your website. Without instructions or "permissions" on how to crawl your site, however, a spider can present a major information security risk. Spiders travel by following links and are very adept at finding links to databases, program files, and other information you may not want them to access.
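To make the link-following behavior concrete, here is a minimal Python sketch of how a spider might collect the links on a single page. The page content, the example.com URLs, and the names LinkCollector and extract_links are all hypothetical and chosen only for illustration; a real crawler would download pages over HTTP and repeat this step for every link it finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real spider would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <a href="/about.html">About</a>
  <a href="backup/site.sql">Old backup</a>
  <a href="https://example.org/partner">Partner</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return every link found in the given HTML, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


links = extract_links(SAMPLE_HTML, "https://example.com/index.html")
for link in links:
    print(link)
```

Note that the forgotten link to a database backup is discovered just as easily as the link to the About page, which is why anything linked from a public page should be treated as reachable by spiders.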

Webmasters can view server logs to see which spiders and other robots have visited their sites. This information shows webmasters who is indexing their site and how often, and it lets them fine-tune their SEO and update their robots.txt file to prohibit certain robots from crawling the site in the future.
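As a rough illustration of that kind of log review, the Python sketch below counts visits by user agents that look like crawlers. It assumes the log uses the common "combined" format, where the user agent is the last quoted field on each line; the sample lines, the "EmailHarvesterBot" agent, and the keyword list are made up for this example. Keep in mind that user agent strings can be faked, so this is a first look rather than proof of who visited.

```python
import re
from collections import Counter

# Assumption: the user agent is the last double-quoted field on the line,
# as in the widely used "combined" access log format.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Hypothetical keywords that often appear in crawler user agents.
CRAWLER_HINTS = ("bot", "crawler", "spider")

# Made-up sample log lines for illustration only.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2023:13:55:40 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:14:02:11 +0000] "GET /contact.html HTTP/1.1" '
    '200 2048 "-" "EmailHarvesterBot/1.0"',
    '198.51.100.4 - - [10/Oct/2023:14:05:02 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/118.0"',
]


def count_crawlers(lines):
    """Count requests per user agent for agents that look like crawlers."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue  # Skip lines that do not fit the assumed format.
        user_agent = match.group(1)
        if any(hint in user_agent.lower() for hint in CRAWLER_HINTS):
            counts[user_agent] += 1
    return counts


counts = count_crawlers(SAMPLE_LOG)
for user_agent, hits in counts.most_common():
    print(hits, user_agent)
```

An unfamiliar agent that requests pages such as a contact page, where email addresses are likely to appear, is a reasonable candidate for blocking.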

Tips on Protecting Your Website From Unwanted Robot Crawlers

Even if you are not concerned about malicious spiders crawling your site, you still need to provide search engines with important instructions. Every website should have a file called robots.txt in its root directory. This file lets you tell web crawlers, such as those run by search engines, where you want them to look when indexing pages.

Just as you can tell wanted crawlers where you want them to browse, you can also tell them where they may not go, and even block specific crawlers from your entire website.
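The sketch below shows what such rules can look like and how a well-behaved crawler would interpret them, using Python's standard urllib.robotparser module. The crawler name "BadBot", the /admin/ and /backups/ paths, and the example.com URLs are hypothetical; substitute the agents and directories that apply to your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block one crawler from the whole site,
# and keep every other crawler out of two private directories.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /backups/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent, url):
    """Return True if the rules above permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


checks = [
    ("Googlebot", "https://example.com/index.html"),
    ("Googlebot", "https://example.com/admin/login"),
    ("BadBot", "https://example.com/index.html"),
]
for user_agent, url in checks:
    print(user_agent, url, is_allowed(user_agent, url))
```

Checking rules this way before publishing them can catch a mistyped path that would otherwise hide pages you want indexed. The check only shows what a compliant crawler would do; it does not stop a crawler that chooses to ignore the file.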

A well-constructed robots.txt file has tremendous value for search engines and could be a key element in improving your website's performance. Bear in mind, however, that the file only provides instructions, and some robot crawlers will still ignore them. That's why it's important to always keep all your software, plugins, and apps up to date.

Related Articles and Information

Due to the prevalence of information harvesting for malicious purposes, legislation was passed in 2003 to make certain practices illegal. These consumer protection laws fall under the CAN-SPAM Act of 2003.

Take the time to read up on the CAN-SPAM Act if your business engages in any mass mailing or information harvesting. You can find out more about anti-spam laws, how to deal with spammers, and what you as a business owner may not do, by reading the following articles: