2015-03-14 | Algorithm language

Most webmasters want search engines to index as many of their pages as possible, but there are some pages we would rather keep out of the index, such as landing pages, password-protected pages, and private pages. Search engine web crawlers are often called "spiders" because they crawl along the links of the web, and they get almost everywhere: I once discovered, to my speechless surprise, that Google's image search spider had even indexed my users' personal avatars. That kind of enthusiasm left me not knowing whether to laugh or cry. This is why a site's root directory often holds a file called "robots.txt". "Robot" here means a web robot, that is, a search spider. The text in this file tells search engines which directories, which pages, or which image formats you do not want indexed.

Let us walk through an example file line by line:

The first line: "# Forbid admin page"

The first character "#" marks a comment. You can write anything after it without affecting how spiders crawl. Its main purpose is to remind yourself what the next piece of code is for.

The second line of code: "user-agent:*"

User-agent means "user agent". You may have seen this term in your website logs; the "UA string" of a mobile browser is the same thing. In robots.txt you can read it as "the identity of the visitor (the search spider)". The identifiers of common search spiders are:

1. Google spiders: Googlebot, Googlebot-Mobile, Googlebot-Image, Mediapartners-Google, Adsbot-Google

2. Baidu spider: Baiduspider

3. Yahoo spiders: Yahoo! Slurp, Yahoo! Slurp China (Yahoo China spider)

4. Youdao spiders: YodaoBot, YoudaoBot, YodaoBot-Image

5. Soso spiders: Sosospider, Sosoimagespider

6. Microsoft (Bing and MSN) spiders: bingbot, msnbot, msnbot-media

7. Sogou spiders: Sogou web spider, Sogou Orion spider, Sogou-Test-Spider
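To make the list above a little more concrete, here is a minimal Python sketch that labels a User-Agent string from a website log with the search engine it claims to come from. The function name `identify_spider`, the token table, and the sample strings are my own illustrative assumptions, not part of any library; the tokens are simply taken from the list above.

```python
# Lowercased identifier fragments from the list above, mapped to the engine.
# This table is illustrative and far from complete.
SPIDER_TOKENS = {
    "googlebot": "Google",
    "mediapartners-google": "Google",
    "adsbot-google": "Google",
    "baiduspider": "Baidu",
    "slurp": "Yahoo",
    "yodaobot": "Youdao",
    "youdaobot": "Youdao",
    "sosospider": "Soso",
    "sosoimagespider": "Soso",
    "bingbot": "Microsoft",
    "msnbot": "Microsoft",
    "sogou": "Sogou",
}


def identify_spider(user_agent):
    """Return the engine name a User-Agent string claims, or None."""
    ua = user_agent.lower()
    for token, engine in SPIDER_TOKENS.items():
        if token in ua:
            return engine
    return None


# Sample log entries (assumed for illustration).
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
]
for ua in samples:
    print(identify_spider(ua), "<-", ua)
```

Keep in mind that a User-Agent string is only what the visitor says about itself, so this is a rough label rather than proof of identity.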

In "user-agent:*", the "*" is a wildcard meaning "all". It tells every search engine: "Pay attention! This is addressed to you!" The rules that follow apply to all spiders. If a block instead starts with "user-agent: Baiduspider", the rules that follow apply only to Baidu, and other spiders are not bound by them.

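The way a spider picks its block can be tried out with Python's standard `urllib.robotparser` module. This is only a sketch: the file content, the example.com URLs, and the helper name `load_rules` are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt with one block for everyone and one for Baidu only.
ROBOTS_TXT = """\
# Rules for every spider
User-agent: *
Disallow: /admin/

# Rules for Baidu's spider only
User-agent: Baiduspider
Disallow: /bbs/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Googlebot has no block of its own, so it follows the "*" block.
print(rules.can_fetch("Googlebot", "https://example.com/admin/login.php"))
print(rules.can_fetch("Googlebot", "https://example.com/bbs/topic-1.html"))

# Baiduspider has its own block and follows that one instead.
print(rules.can_fetch("Baiduspider", "https://example.com/bbs/topic-1.html"))
print(rules.can_fetch("Baiduspider", "https://example.com/admin/login.php"))
```

One thing this makes visible: a spider that finds a block addressed to it by name uses that block and skips the "*" block, so in this made-up file Baiduspider is not kept out of /admin/. If you want a rule to hold for a named spider too, repeat it inside that spider's block.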
The third line of code: "Disallow:/admin/"

Disallow means "not allowed". Here it means that spiders may not crawl the pages under "your website address/admin/". If I also don't want search engines to index the bbs directory on my site, I can add "Disallow:/bbs/". For multiple directories, write one directory per line. If you don't want any part of your site crawled, for example while it is closed for debugging or in closed beta, write "Disallow:/". In addition, "Disallow:/wp*" blocks every directory in the site root whose name starts with wp, such as wp-content and wp-includes. Because rules match by path prefix, "Disallow:/wp" has the same effect.
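The three cases in this paragraph (several directories, the whole site, and a name prefix) can be checked with a small sketch built on Python's standard `urllib.robotparser`. The helper `allowed`, the example.com host, and the sample paths are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="Googlebot"):
    """Return True if `agent` may crawl `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


# One directory per line.
DIRECTORIES = "User-agent: *\nDisallow: /admin/\nDisallow: /bbs/\n"
# Block the entire site, e.g. during a closed beta.
WHOLE_SITE = "User-agent: *\nDisallow: /\n"
# Prefix match: covers wp-content, wp-includes and anything else starting with /wp.
WP_PREFIX = "User-agent: *\nDisallow: /wp\n"

print(allowed(DIRECTORIES, "/admin/login.php"))
print(allowed(DIRECTORIES, "/bbs/thread/1"))
print(allowed(DIRECTORIES, "/about.html"))
print(allowed(WHOLE_SITE, "/about.html"))
print(allowed(WP_PREFIX, "/wp-content/themes/style.css"))
print(allowed(WP_PREFIX, "/blog/"))
```

I used "Disallow: /wp" without the trailing "*" here because, as far as I know, `urllib.robotparser` only does plain prefix matching and does not interpret "*" inside a path.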

The fourth line of code: "Disallow:/*.jpg$"

This line tells search engines not to index any image file ending in .jpg. Here "*" matches any sequence of characters and "$" marks the end of the URL. If I also don't want .png images on my site indexed, I can add "Disallow:/*.png$". For multiple file formats, write one format per line.
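To see how "*" and "$" behave, here is a minimal sketch that turns a Disallow pattern into a regular expression and tests a few paths against it. The function names and sample paths are my own assumptions, and real spiders may differ in details; this only illustrates the matching idea described above.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    "*" matches any run of characters; a trailing "$" anchors the end of the URL.
    Without "$" the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def is_blocked(path, patterns):
    """Return True if any Disallow pattern matches the start of `path`."""
    return any(robots_pattern_to_regex(p).match(path) for p in patterns)


DISALLOW = ["/*.jpg$", "/*.png$"]

for path in ["/images/photo.jpg", "/logo.png",
             "/images/photo.jpg?size=large", "/index.html"]:
    print(path, "->", "blocked" if is_blocked(path, DISALLOW) else "allowed")
```

Note the third sample path: because of the "$", a .jpg URL followed by a query string no longer ends in ".jpg", so this pattern does not cover it.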

To sum up

1. When you finish writing, save the file as "robots.txt" (it must have exactly this name) and upload it to your site's root directory using FTP.

2. Having more pages indexed is not always better. Search engines compare the similarity of pages on the web, and two pages that are too similar dilute each other's weight. The comparison is made not only across different websites but also between different pages of the same site. On a personal blog, for example, the author archive and the home page have almost identical content, so we can simply block spiders from the author archive pages. You can likewise decide, as appropriate, whether spiders may crawl your date archives, category archives, and search pages.
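Putting both points together, the file can also be generated by a short script instead of being typed by hand. The sketch below is only an illustration: the helper names are made up, the "/author/" path is a hypothetical archive location that depends on your blog's URL structure, and it writes into a temporary folder standing in for the site root. Uploading by FTP is left out.

```python
import tempfile
from pathlib import Path


def build_robots_txt(disallowed, user_agent="*", comment=None):
    """Build robots.txt text: optional comment, one User-agent line, one Disallow per path."""
    lines = []
    if comment:
        lines.append(f"# {comment}")
    lines.append(f"User-agent: {user_agent}")
    lines.extend(f"Disallow: {path}" for path in disallowed)
    return "\n".join(lines) + "\n"


def write_robots_txt(site_root, text):
    """Save the text as robots.txt (the name is fixed) inside the site root folder."""
    target = Path(site_root) / "robots.txt"
    target.write_text(text, encoding="utf-8")
    return target


# The example walked through above, plus a hypothetical author archive path.
text = build_robots_txt(
    ["/admin/", "/*.jpg$", "/author/"],
    comment="Forbid admin page",
)

# A temporary folder stands in for the local copy of the site root.
with tempfile.TemporaryDirectory() as site_root:
    saved = write_robots_txt(site_root, text)
    print(saved.name)
    print(saved.read_text(encoding="utf-8"))
```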


This article is protected by copyright and must not be reproduced without permission. If you need to reprint it, please contact the author to obtain authorization.

When reprinting, please credit the source: Baiyuan's Blog, https://wangbaiyuan.cn/en/how-to-write-robts-txt-2.html
