Last updated on September 21, 2020 by Dan Nanni
The modern Internet is infested with malicious robots and crawlers, such as malware bots, spambots, and content scrapers, that surreptitiously scan your website, for example to detect potential vulnerabilities, harvest email addresses, or simply steal content. Many of these robots can be identified by their signature "user-agent" string.
As a first line of defense, you could try to block malicious bots from accessing your website by blacklisting their user-agents in the robots.txt file. Unfortunately, this works only for well-behaved robots that are designed to obey robots.txt. Many malicious bots simply ignore robots.txt and scan your website at will.
An alternative way to block particular robots is to configure your web server so that it refuses to serve content to requests with certain user-agent strings. This post explains how to block certain user-agents on an nginx web server. I assume that you already have an nginx web server up and running.
To configure a user-agent block list, open the nginx configuration file of your website, where the server section is defined. This file can be found in different places depending on your nginx setup or Linux distribution (e.g., /etc/nginx/nginx.conf, /etc/nginx/sites-enabled/<your-site>, /usr/local/nginx/conf/nginx.conf, /etc/nginx/conf.d/<your-site>).
server {
    listen 80 default_server;
    server_name xmodulo.com;
    root /usr/share/nginx/html;
    ....
}
Once you open the config file with the server section, add the following if statement(s) somewhere inside the section.
server {
    listen 80 default_server;
    server_name xmodulo.com;
    root /usr/share/nginx/html;

    # case sensitive matching
    if ($http_user_agent ~ (Antivirx|Arian)) {
        return 403;
    }

    # case insensitive matching
    if ($http_user_agent ~* (netcrawl|npbot|malicious)) {
        return 403;
    }
    ....
}
As you can guess, these if statements match the user-agent string against regular expressions, and return a 403 HTTP status code when a match is found. $http_user_agent is a variable that contains the user-agent string of an HTTP request. The ~ operator does case-sensitive matching against the user-agent string, while the ~* operator does case-insensitive matching. Within the regular expression, | means alternation (logical OR), so you can list as many user-agent keywords as you like in an if statement and block them all.
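Before touching the server, it can help to preview which user-agent strings the two example patterns would catch. The following is only a rough sketch: it uses grep -E as a stand-in for nginx's regular expression engine (nginx uses PCRE, so the two agree for plain keyword alternations like these but not necessarily for more complex patterns), and the function name ua_blocked is made up for illustration.

```shell
#!/bin/sh
# ua_blocked AGENT: exit 0 if AGENT would match either example rule above.
# Hypothetical helper; grep -E only approximates nginx's PCRE matching.
ua_blocked() {
    # case-sensitive rule, mirrors: $http_user_agent ~ (Antivirx|Arian)
    printf '%s\n' "$1" | grep -Eq '(Antivirx|Arian)' && return 0
    # case-insensitive rule, mirrors: $http_user_agent ~* (netcrawl|npbot|malicious)
    printf '%s\n' "$1" | grep -Eiq '(netcrawl|npbot|malicious)' && return 0
    return 1
}

# Example:
#   ua_blocked "MALICIOUS bot" && echo "would be blocked"
```

Note that a lowercase "antivirx" is not caught by the first rule, because the ~ operator is case-sensitive.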
After modifying the configuration file, you must reload nginx to activate the blocking:
$ sudo /path/to/nginx -s reload
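A syntax error in the edited file will make the reload fail, so one cautious option is to run nginx's built-in configuration test (the -t option) first and reload only if it passes. The wrapper below is a minimal sketch of that idea; safe_reload is a hypothetical name, and /path/to/nginx remains a placeholder for your actual binary.

```shell
#!/bin/sh
# safe_reload CMD...: run "CMD -t" (config test) and, only if that
# succeeds, run "CMD -s reload". Hypothetical helper for illustration.
safe_reload() {
    "$@" -t && "$@" -s reload
}

# Example (the path is a placeholder, as in the command above):
#   safe_reload sudo /path/to/nginx
```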
You can test user-agent blocking by using wget or curl with the --user-agent option.
$ wget --user-agent "malicious bot" http://<nginx-ip-address>
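If you prefer to see only the HTTP status code, curl can print it directly. The sketch below wraps that in two small helpers; status_for_agent and verdict are hypothetical names, and <nginx-ip-address> is the same placeholder as in the wget command above. It assumes curl is installed.

```shell
#!/bin/sh
# status_for_agent URL AGENT: print the HTTP status code returned for a
# request sent with the given user-agent string (curl prints 000 when
# no response is received).
status_for_agent() {
    curl -s -o /dev/null -w '%{http_code}' -A "$2" "$1"
}

# verdict CODE: translate a status code into a human-readable result.
verdict() {
    case "$1" in
        403) echo "blocked" ;;
        000) echo "no response" ;;
        *)   echo "allowed ($1)" ;;
    esac
}

# Example:
#   verdict "$(status_for_agent "http://<nginx-ip-address>" "malicious bot")"
```

A 403 here only tells you that the request was refused; if other access rules are configured on the server, they can produce the same status.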
So far, I have shown how to block HTTP requests with a few user-agents in nginx. What if you have many different types of crawling bots to block?
Since the user-agent blacklist can grow very big, it is not a good idea to put them all inside your nginx's server section. Instead, you can create a separate file which lists all blocked user agents. For example, let's create /etc/nginx/useragent.rules, and define a map with all blocked user agents in the following format.
$ sudo vi /etc/nginx/useragent.rules
map $http_user_agent $badagent {
    default 0;
    ~*malicious 1;
    ~*backdoor 1;
    ~*netcrawler 1;
    ~Antivirx 1;
    ~Arian 1;
    ~webbandit 1;
}
Similar to the earlier setup, the ~* operator matches a keyword as a case-insensitive regular expression, while the ~ operator matches it as a case-sensitive regular expression. The line that says default 0 means that any user-agent not listed in the file will be allowed.
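When the list grows long, you may find it easier to keep the keywords in a plain text file, one per line, and generate the map from it. The following is a minimal sketch under a few assumptions: make_ua_map and keywords.txt are hypothetical names, every keyword is treated as a case-insensitive match (~*), and keywords are assumed to contain no double quotes or regular expression metacharacters.

```shell
#!/bin/sh
# make_ua_map: read keywords from stdin (one per line) and print an nginx
# map block that sets $badagent to 1 for any matching user-agent.
# Hypothetical helper; every keyword becomes a case-insensitive (~*) rule.
make_ua_map() {
    echo 'map $http_user_agent $badagent {'
    echo '    default 0;'
    while IFS= read -r keyword || [ -n "$keyword" ]; do
        # skip blank lines and lines starting with '#'
        case "$keyword" in ''|'#'*) continue ;; esac
        # quote the pattern so keywords containing spaces stay valid
        printf '    "~*%s" 1;\n' "$keyword"
    done
    echo '}'
}

# Example (keywords.txt is an assumed file name):
#   make_ua_map < keywords.txt | sudo tee /etc/nginx/useragent.rules
```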
Next, open the nginx configuration file of your website which contains the http section, and add the following line somewhere inside the http section.
http {
    .....
    include /etc/nginx/useragent.rules;
}
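Note that the map directive is valid only in the http context, not inside a server section, which is why the include statement goes inside the http section. Placing it before the server sections that use $badagent keeps the configuration easy to follow.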
Now open the nginx configuration file where your server section is defined, and add the following if statement:
server {
    ....
    if ($badagent) {
        return 403;
    }
    ....
}
Finally, reload nginx.
$ sudo /path/to/nginx -s reload
Now any user-agent which contains a keyword listed in /etc/nginx/useragent.rules will be automatically banned by nginx.
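To see which user-agents are actually being turned away, you can summarize the 403 responses in the access log. This sketch assumes the log uses nginx's default "combined" format, where the user-agent is the last quoted field; blocked_agents is a hypothetical name, and the log path in the example is a common default that may differ on your system.

```shell
#!/bin/sh
# blocked_agents: read access-log lines on stdin and print a count per
# user-agent for requests answered with 403, most frequent first.
# Assumes the "combined" log format; hypothetical helper for illustration.
blocked_agents() {
    # Splitting on double quotes: field 3 is " STATUS BYTES ",
    # field 6 is the user-agent string.
    awk -F'"' '{ split($3, f, " "); if (f[1] == "403") print $6 }' |
        sort | uniq -c | sort -rn
}

# Example (the log path is an assumption):
#   sudo cat /var/log/nginx/access.log | blocked_agents
```

Keep in mind that the count includes every 403 response, not only those produced by the user-agent rules.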
Please note that this article is published by Xmodulo.com under a Creative Commons Attribution-ShareAlike 3.0 Unported License. If you would like to use the whole or any part of this article, you need to cite this web page at Xmodulo.com as the original source.