What is robots.txt and how to use it correctly?
Let's start with a brief definition: robots.txt is a text file hosted on the server that governs how indexing robots interact with the site. Its main function is to grant or deny those robots access to files and folders on the site.
Here are some common robots.txt configurations, which I'll go into in more detail below:
# Allow access to the entire site.
User-agent: *
Disallow:
# No access to the site.
User-agent: *
Disallow: /
# One folder excluded.
User-agent: *
Disallow: /folder/
# One file excluded.
User-agent: *
Disallow: /file.html
Why do we need to know what robots.txt is?
- Not knowing what robots.txt is and using it incorrectly can negatively impact your site's rankings.
- The robots.txt file controls how crawlers crawl your site.
- Robots.txt is mentioned in several guides provided by Google itself.
- This file and indexing bots are fundamental elements that affect how all search engines work.
Bots that crawl the Web
The first thing such a robot does when it visits your site is look into the robots.txt file. For what purpose? The robot wants to know whether it is allowed to access a given page or file. If the robots.txt file allows entry, it will carry on with its work. If not, it will leave the site. So if you have any instructions for indexing robots, robots.txt is the right place for them.
Note: There are two important things every webmaster should do when it comes to the robots.txt file:
- check whether a robots.txt file exists at all
- if it exists, make sure it doesn't hurt your site's position in search engines
How do I check if a site has a robots.txt file?
Robots.txt can be inspected from any web browser. The file must be placed in the root folder of the site, which makes it easy to check whether a given site has one. Just add 'robots.txt' to the end of the domain name, as shown in the example below:
www.domain.pl/robots.txt
If the file exists (even if it's empty), the browser will display its contents. If it doesn't exist, you'll get a 404 error.
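For example, visiting that address on a site might display something like this (the contents below are purely hypothetical):
User-agent: *
Disallow: /admin/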
Do we need a robots.txt file?
Knowing what robots.txt is doesn't automatically mean you need one on your site at all.
Reasons why you should have a robots.txt file on your site:
- You have data you don't want to share with search engines.
- You are using paid links or ads that require special instructions for crawlers.
- You want only authoritative bots like Googlebot to visit your site (see the sketch after this list).
- You are building the site and modifying it "live", so you don't want robots to index the unfinished version.
- Robots.txt will help you follow the guidelines posted by Google.
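For instance, to let only Googlebot in and keep all other robots out (the third reason above), a minimal sketch could look like the one below. Keep in mind that only well-behaved bots honor robots.txt; it is not an access-control mechanism.
# Googlebot may crawl everything.
User-agent: Googlebot
Disallow:
# All other robots are kept out.
User-agent: *
Disallow: /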
Reasons why a robots.txt file doesn't have to be on your site:
- By not having a robots.txt file, you eliminate potential errors in it that could negatively affect your site's position in search engines.
- You don't have files that you want to hide from the search engine.
So if you don't have a robots.txt file, search engines have full access to your site. This is perfectly normal and common, so there is nothing to worry about.
How do I create a robots.txt file?
Creating a robots.txt file is child's play.
Such a file is a plain text file, which means you can use the simplest notepad on your system or any other text editor. So you can look at it this way: you're not creating a robots.txt file, you're just writing a simple note.
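As a quick sketch, you could type the lines below into any editor and save them as robots.txt in your site's root folder (the /drafts/ folder is a made-up example):
# Allow all robots everywhere except a hypothetical /drafts/ folder.
User-agent: *
Disallow: /drafts/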
Robots.txt instructions and their importance
Now that you know what robots.txt is, you need to learn how to use it correctly.
User-agent
User-agent:
# or
User-agent: *
# or
User-agent: Googlebot
Description:
- The User-agent line specifies which indexing robots the instructions are addressed to. This can be done in two ways. If you want to address all robots, add "*" (an asterisk).
- User-agent: * - this notation says: "Every robot should follow these instructions." If you want to address a specific robot, for example Googlebot, the notation looks like this:
- User-agent: Googlebot - this line says "These instructions apply only to Googlebot".
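To see both forms working together, here is a short sketch (the /private/ folder is a made-up example): Googlebot gets its own group of rules, while every other robot falls under the general one.
# Googlebot only: stay out of a hypothetical /private/ folder.
User-agent: Googlebot
Disallow: /private/
# All other robots: full access.
User-agent: *
Disallow: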
Disallow
The 'Disallow' instruction is used to prevent robots from entering specified folders or files. This means that if you don't want, for example, Google to index images on your site, you put them all in one folder and exclude it.
How do you do it? Let's say you've already moved all your photos to a folder called "photos". Now you need to tell the robots not to visit this folder for indexing.
This is what the robots.txt file looks like in this situation:
User-agent: *
Disallow: /photos
The above two lines of text in the robots.txt file will keep the robots away from the photos folder.
Note: If you forget the "/" sign after the Disallow statement, like this...
User-agent: *
Disallow:
...then the indexing robot will visit your site, look at the first line, then read the second one (i.e. "Disallow:"). What happens next? The robot will feel like a fish in water, because you forbade it... nothing. So it will start indexing all pages and files.
Allow
Only a few indexing robots understand this particular instruction; Googlebot is one of them.
Allow:
The 'Allow' statement lets you tell the robot that it may view a file inside a folder that is otherwise locked with the 'Disallow' command. To illustrate this, let's go back to the previous example.
User-agent: *
Disallow: /photos
We have saved all the photos in one folder called "photos", and with "Disallow: /photos" we have blocked full access to its contents. However, after a while we decide that we want to make just one photo from the "photos" folder available to the search engine. The "Allow" statement lets us tell Googlebot that even though we have blocked access to the folder, it can still look inside it and index a photo named, for example, "bike.jpg". So we need to write an instruction for it, which will look like this:
User-agent: *
Disallow: /photos
Allow: /photos/bike.jpg
Such instructions tell Googlebot that it can find and index "bike.jpg" in the otherwise excluded "photos" folder.
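In the same way (again a hypothetical sketch), you can re-open an entire subfolder rather than a single file:
User-agent: *
Disallow: /photos
Allow: /photos/public/
Googlebot resolves such conflicts in favor of the more specific (longer) rule, so everything under /photos/public/ remains crawlable.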
How do I determine which pages to block?
Now that we know how to use robots.txt properly, we probably want to put it to work. So what types of pages should be excluded from indexing?
- Pages that display search results.
- Pages that are automatically generated.
- Low-value pages that inevitably accumulate on a site; it's best to simply exclude them.
- Pages whose content is generated from partner databases, or any content that does not originate on your site (see the sketch after this list).
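A robots.txt covering these cases might look like the sketch below (all the paths are made-up examples):
# Hypothetical illustration of the exclusions listed above.
User-agent: *
Disallow: /search/
Disallow: /auto-generated/
Disallow: /partner-data/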
What is robots.txt and how do I use it? - summary
Don't forget to upload the robots.txt file to the root directory (if you need one, of course). You also need to make sure it's configured correctly. You can check whether your robots.txt file is correct with the tester in Google Search Console. Instructions on how to do this can be found here.