
How to use search engine robots


Overview
This article shows how to configure and use search engine crawlers with either a robots.txt file or an .htaccess file.

Before proceeding

This procedure requires FTP access to your Nexcess server. For details about how to use FileZilla, a popular FTP client, refer to How to use FileZilla.  

Search engine robots in robots.txt 

Locating 

  1. Using your preferred FTP client, navigate to your site's directory and open the /html directory.
  2. Within this directory, locate the robots.txt file. If you cannot locate this file, create a new text file named robots.txt.

Adding functions  

The following sections describe the formatting for allowing or disallowing crawlers to access specific folders on your website.

Attention: Search engine crawlers do not scan the robots.txt file each time they crawl your site, so changes to your robots.txt file might not be read by the search engine for up to a week.

Blocking search-engine crawlers 

If you are performing development work on your site and would prefer that search engines such as Google or Bing not crawl it, you can block your site from search engines.

  1. The first line of the robots.txt file will be User-agent: followed by the name of the search engine you want to block.
  2. On the next line, type Disallow: followed by the folders and files you want to block the bot from crawling.
  3. The example below shows what blocking Googlebot from the /photos folder would look like:

User-agent: googlebot
Disallow: /photos

Allowing search engines to crawl specific folders of your site

If you would like search engine crawlers to access only specific folders of your site, configure an allow rule.

  1. The first line of the robots.txt file will be User-agent:, followed by the name of the search engine crawler.
  2. On the next line, type Allow: followed by the name of the folder you would like to allow the bot to crawl.
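For example, a robots.txt entry that allows Googlebot to crawl only a /blog folder (the folder name here is only an illustration) while disallowing the rest of the site might look like this:

User-agent: googlebot
Allow: /blog
Disallow: /

The Disallow: / line blocks everything by default; the Allow: line then carves out the one folder the bot may crawl.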

Adding crawl delays for search-engine robots

If your site is experiencing a large amount of traffic that appears to be caused by multiple search engine crawlers visiting your site simultaneously, configure a search engine crawler delay.

Attention: Adding a crawl delay to your robots.txt file is considered a non-standard entry, and some search engines do not abide by this rule. You will need to check with the specific search engine you want to delay for specific details.

  1. The first line of your robots.txt addition will be User-agent: followed by the name of the search engine.
  2. The second line will be Crawl-delay: followed by a number between 1 and 30. This number sets the delay, in seconds, between successive requests from the crawler. If your site is being crawled by multiple bots simultaneously, consider adding a crawl delay of 10 seconds or more.
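For example, to ask Bingbot to wait 10 seconds between requests, the robots.txt entry would look like this:

User-agent: bingbot
Crawl-delay: 10

As the Attention note above explains, this directive is non-standard, so verify that the crawler you are targeting honors it.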

The following table is a list of search engines and their corresponding bot names:

Search Engine        Bot Name
Google               googlebot
Bing                 bingbot
Baidu                baiduspider
MSN                  msnbot
Yandex.ru            yandex
All search engines   *


For example, to block Googlebot from viewing your /photos folder, add the following lines to your robots.txt file:

User-agent: googlebot
Disallow: /photos

Search engine robots in .htaccess

Depending on the way your website is configured, your robots.txt file might not properly work with search engine crawlers. You can make changes to your .htaccess file instead.

Locating .htaccess

  1. Using your preferred FTP client, navigate to your site's directory and open the /html directory.
  2. Within this directory, locate the .htaccess file. If the file does not exist, create a text file with the name .htaccess.

Adding functions  

  1. Once you have located or created the .htaccess file, open the file in your preferred text editor.
  2. If you are creating a new .htaccess file, the first line should be: RewriteEngine On
  3. If the .htaccess file already exists and you are editing it, ensure that the RewriteEngine On line appears at the top of the file.
  4. The following line reads the visitor's user agent name and matches it against the value provided in the .htaccess file. Regular visitors will not match this condition and therefore will not be blocked. Replace [crawler] with the name of the search engine bot.

    RewriteCond %{HTTP_USER_AGENT} ^[crawler]$ [NC]

  5. The final line tells the server what to do with a request that matches; in this example, the bot receives a 403 Forbidden response.
  6. RewriteRule .* - [R=403,L]

For example, to block Yandex from crawling any pages of your site, the .htaccess file will look something like this:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yandex$ [NC]
RewriteRule .* - [R=403,L]

Adding crawl delays for search engine robots

If adding a crawl delay to the robots.txt file was unsuccessful, add the following to your .htaccess file:

SetEnvIf User-Agent [botname] GoAway=1
Order allow,deny
Allow from all
Deny from env=GoAway

  • The first line checks the visitor's user agent, where [botname] is the name of the bot:

SetEnvIf User-Agent [botname] GoAway=1

  • The second and third lines allow all traffic that does not match the first line:

Order allow,deny
Allow from all

  • The fourth line denies all traffic that matches the GoAway variable.

Deny from env=GoAway

 

For 24-hour assistance any day of the year, contact our Support Team by email or through the Client Portal.
