What is robots.txt and how to use it correctly?
Let's start with a brief definition: robots.txt is a text file hosted on the server that tells indexing robots how to interact with the site. Its main function is to grant or deny robots access to files and folders on the site.
Here are some common robots.txt configurations, which I'll explain in more detail below:
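# Access to the entire site.
User-agent: *
Disallow:
# No access to the site.
User-agent: *
Disallow: /
# One folder excluded ("folder" is a placeholder name).
User-agent: *
Disallow: /folder/
# One subpage excluded ("file.html" is a placeholder name).
User-agent: *
Disallow: /file.html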
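The four configurations listed above can be tried out without touching a live site. The following is a minimal sketch using Python's standard urllib.robotparser module; the paths "/folder/" and "/file.html" and the robot name "ExampleBot" are placeholders assumed for illustration, not names from a real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Placeholder folder and file names, assumed for illustration only.
CONFIGS = {
    "access to the entire site": "User-agent: *\nDisallow:",
    "no access to the site": "User-agent: *\nDisallow: /",
    "one folder excluded": "User-agent: *\nDisallow: /folder/",
    "one subpage excluded": "User-agent: *\nDisallow: /file.html",
}

for name, text in CONFIGS.items():
    results = {
        path: is_allowed(text, path)
        for path in ("/", "/folder/a.html", "/file.html")
    }
    print(name, results)
```

Each printed dictionary shows, for three sample paths, whether a robot following the file would be allowed to fetch them.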
Why do we need to know what robots.txt is?
- Not knowing what robots.txt is and using it incorrectly can negatively affect your site's ranking.
- The robots.txt file controls how indexing robots view your site.
- Robots.txt is mentioned in several guides provided by Google itself.
- This file and indexing robots are fundamental elements that affect the operation of all search engines.
Internet indexing robots
The first thing an indexing robot does when it visits your site is look at the robots.txt file. Why? The robot wants to know whether it has permission to access a given page or file. If the robots.txt file allows entry, the robot continues its work. If not, it leaves the blocked page or file alone. So if you have any instructions for indexing robots, robots.txt is the right place for them.
Note: There are two important things every webmaster should do when it comes to a robots.txt file:
- determine if a robots.txt file exists at all
- if it exists, make sure it doesn't harm the site's position in search engines
How to check if a site has a robots.txt file?
You can check robots.txt from any web browser. The file must be placed in the root folder of the site, so to find out whether a site has one, just add "/robots.txt" to the end of the domain name, for example https://example.com/robots.txt (example.com stands in for your own domain).
If the file exists, the browser will display its content (a blank page if the file is empty). If it doesn't exist, you'll get a 404 error.
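The same check can be scripted. The sketch below builds the robots.txt address for any page of a site and, in a separate helper, fetches it while treating a 404 as "no file". The function names and the example.com address are assumptions made for illustration; the fetch helper is defined but not called here, because it needs network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address in the root folder of the page's site."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url: str, timeout: float = 10.0):
    """Return the robots.txt text, or None if the server answers 404."""
    try:
        with urllib.request.urlopen(robots_url(page_url), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # the site has no robots.txt
        raise


print(robots_url("https://example.com/blog/post.html"))
```

An empty string returned by fetch_robots would mean the file exists but is empty, while None would mean it is missing.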
Do we need a robots.txt file?
Now that you know what robots.txt is, you may find that you don't need it on your site at all.
Reasons why a robots.txt file should be on your site:
- You have data that you do not want to share with search engines.
- You use paid links or ads that require special instructions for crawlers.
- You want only authoritative bots like Googlebot to visit your site.
- You create a site and modify it "live", so you don't want robots to index the unfinished version.
- Robots.txt will help you follow the guidelines that Google publishes.
Reasons why you don't need a robots.txt file on your site:
- By not having a robots.txt file, you eliminate potential errors that could negatively affect your site's position in search engines.
- You don't have any files you want to hide from the search engine.
In this regard, if you do not have a robots.txt file, search engines have full access to your site. This is, of course, a normal and common occurrence, so there is nothing to worry about.
How to create a robots.txt file?
Creating a robots.txt file is child's play.
Such a file is plain text, which means you can write it in Notepad or any other text editor and save it under the name robots.txt. You can look at it this way: you're not creating a robots.txt file, you're just writing a simple note.
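If you prefer to generate the file from a script rather than type it by hand, a few lines are enough. This is only a sketch: the rules shown (allow everything) and the temporary directory standing in for the site's root folder are assumptions for the demo.

```python
import tempfile
from pathlib import Path

# Allow-all rules; the content is an illustrative assumption.
RULES = "User-agent: *\nDisallow:\n"


def write_robots(site_root: Path, rules: str = RULES) -> Path:
    """Write `rules` to <site_root>/robots.txt and return the file path."""
    target = site_root / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target


# Demo in a throwaway directory standing in for the site's root folder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_robots(Path(tmp))
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```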
Robots.txt instructions and their importance
Now that you know what robots.txt is, you need to learn how to use it properly.
- The 'User-agent' line defines which indexing robots the instructions that follow are addressed to. This can be done in two ways. If you want to address all robots, use "*" (an asterisk).
- User-agent: * - this notation says: "Every robot should follow these instructions." If you want to address a specific robot, for example Googlebot, the notation is as follows:
- User-agent: Googlebot - this line says: "These instructions apply only to Googlebot."
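To see how the two kinds of 'User-agent' lines interact, here is a small sketch that parses a file with one group for Googlebot and one for everyone else, again using Python's standard urllib.robotparser. The "/drafts/" folder and the "OtherBot" name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Googlebot may crawl everything except /drafts/,
# while every other robot is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_home = parser.can_fetch("Googlebot", "/index.html")
googlebot_drafts = parser.can_fetch("Googlebot", "/drafts/a.html")
otherbot_home = parser.can_fetch("OtherBot", "/index.html")

print(googlebot_home, googlebot_drafts, otherbot_home)
```

A robot uses the group that names it and falls back to the "*" group only when no group mentions it by name.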
The 'Disallow' instruction is used to prevent robots from entering specified folders or files. This means that if you don't want Google, for example, to index the images on your site, you can put them all in one folder and exclude it.
How do you do it? Let's say you've already moved all your photos to a folder called "photos". Now you need to tell the robots not to visit this folder.
Here's what the robots.txt file should contain in this situation:
User-agent: *
Disallow: /photos
The two lines above will keep robots away from the photos folder.
Note: If you forget the "/" and the path after the Disallow statement, leaving just "Disallow:", the robot will read the first line and then the second (i.e. "Disallow:"). What will happen? The robot will feel like a fish in water, because you have forbidden it ... nothing. It will go on to index all pages and files.
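The difference between a complete 'Disallow' line and an empty one is easy to verify. The sketch below compares the two variants for the photos folder with Python's standard urllib.robotparser; the file name "bike.jpg" and the robot name "ExampleBot" are assumed for the demo.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str) -> bool:
    """Report whether a generic robot may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", path)


BLOCKING = "User-agent: *\nDisallow: /photos"  # folder is excluded
EMPTY = "User-agent: *\nDisallow:"             # nothing is excluded

print(can_fetch(BLOCKING, "/photos/bike.jpg"))  # blocked
print(can_fetch(EMPTY, "/photos/bike.jpg"))     # not blocked
print(can_fetch(BLOCKING, "/index.html"))       # rest of the site stays open
```

Note that the value is matched as a path prefix, so "Disallow: /photos" also covers any other path that begins with "/photos".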
The 'Allow' instruction tells the robot that it may view a file inside a folder that is blocked by the 'Disallow' command. Not every indexing robot understands this instruction, but Googlebot is one that does. To illustrate this, let's return to the previous example.
We have saved all photos in one folder called "photos" and blocked full access to its contents with "Disallow: /photos". However, after a while we decided that we wanted to make just one photo from the "photos" folder available to the search engine. The 'Allow' instruction lets us tell Googlebot that although we have blocked access to the folder, it can still go into it and index one photo named, for example, "bike.jpg". To do that, we need an instruction that looks like this:
User-agent: Googlebot
Disallow: /photos
Allow: /photos/bike.jpg
These instructions tell Googlebot that it may fetch the "bike.jpg" file in the excluded "photos" folder.
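How an 'Allow' rule wins over a broader 'Disallow' rule can be sketched in a few lines. The function below is a simplified illustration of the "most specific rule wins" precedence that Google documents for Googlebot: the longest matching path prefix decides, and 'Allow' wins a tie. It is an assumption-laden sketch, not Googlebot's actual implementation, and it ignores wildcards. Parsers that apply rules strictly in file order, which is how Python's standard urllib.robotparser appears to behave, may give a different answer for the same file.

```python
def is_allowed(rules, path):
    """Decide access for one user-agent group using longest-prefix matching.

    `rules` is a list of (directive, prefix) pairs, e.g. ("Disallow", "/photos").
    """
    best_len = -1
    allowed = True  # no matching rule means access is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # A longer prefix is more specific; on a tie, Allow wins.
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


# Rules for the photos example; "cat.jpg" is a hypothetical second photo.
RULES = [("Disallow", "/photos"), ("Allow", "/photos/bike.jpg")]

print(is_allowed(RULES, "/photos/bike.jpg"))
print(is_allowed(RULES, "/photos/cat.jpg"))
print(is_allowed(RULES, "/index.html"))
```

Only the single permitted photo is reachable inside the folder, while everything outside it stays open.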
How to determine which pages to block?
Now that we know how to use robots.txt properly, we probably want to use it for something. So what types of pages should be excluded from indexing?
- Pages that display search results.
- Pages that are generated automatically.
- Low-value pages whose content you cannot avoid publishing; it's best to simply exclude them.
- Pages whose content comes from partner databases or otherwise does not originate on your site.
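Once you have decided which page types to exclude, the matching rules can be generated from a list. In the sketch below the path prefixes "/search", "/tag/" and "/partner-feed/" are hypothetical stand-ins for search results, automatically generated pages and partner content; substitute the paths your own site uses. The result is then checked with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical path prefixes; replace them with your site's real ones.
BLOCKED_PREFIXES = ["/search", "/tag/", "/partner-feed/"]


def build_robots(prefixes, agent="*"):
    """Return robots.txt text with one Disallow line per path prefix."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {prefix}" for prefix in prefixes]
    return "\n".join(lines) + "\n"


robots_txt = build_robots(BLOCKED_PREFIXES)
print(robots_txt)

# Sanity check: excluded pages are blocked, ordinary pages are not.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("ExampleBot", "/search?q=bike"))
print(parser.can_fetch("ExampleBot", "/about.html"))
```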
What is robots.txt and how to use it? - summary
If you decide you need a robots.txt file, don't forget to upload it to the root directory of your site. You also need to make sure it's configured correctly. You can check whether the robots.txt file is correct in Google Search Console.