How do bots work?
Why are bots considered unwanted traffic?
We’re living in a world of bots. However, unlike humans, bots are not welcome visitors to our websites.
The negative impact of bots can be categorized as:
- Wasting server resources, thus slowing down response times for human visitors
- Scraping website content that was meant to be unique
- Faking human-generated content, as in customer reviews
- Performing various malicious actions
In eCommerce, this goes even further:
- Price monitoring
- Faking orders
- Stock level monitoring
How do bots work?
Technically speaking, a bot needs to visit your website and open the desired page, which means it needs a browser of some kind. That browser identifies itself through the User-Agent header (regular browsers send a string naming, for example, Mozilla Firefox or Safari), and it connects from an IP address.
However, unlike a human, a bot needs to open thousands and thousands of pages on your site, and it needs to do so without leaving detectable traces or, more precisely, in a way that its traces cannot be distinguished from those of a human visitor.
Let’s see what kind of resources bots need in order to perform such scraping tasks.
IP address pool
In the simplest case, a bot runs on a single server and uses a single IP address. As you can guess, this type of bot is relatively easy to detect, and it can be stopped by blocking that IP address.
More complex bots have access to a number of IP addresses, in most cases through a proxy network. The more IPs in this network, the more difficult it is to detect and block such a bot. However, IP addresses are not free, so this requires investment from the bot owner.
More recently, solutions have appeared that give bots access to the IP addresses of residential Internet users. These are very difficult to detect and even more difficult to block, but they come at a very high cost.
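To illustrate why a single-IP bot is easy to catch while a proxy pool is not, here is a minimal sketch of a per-IP sliding-window counter. The class name, the thresholds (100 requests per 60 seconds), and the documentation-range IP addresses are all assumptions made for this example, not values taken from any particular product.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Flags an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip, now):
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


def count_flagged(limiter, ips, requests=1000, interval=0.01):
    """Replay evenly spaced requests, rotating through ips; count flagged ones."""
    flagged = 0
    for i in range(requests):
        ip = ips[i % len(ips)]
        if limiter.is_suspicious(ip, now=i * interval):
            flagged += 1
    return flagged


# Hypothetical traffic: 1000 requests in 10 seconds.
single_ip = count_flagged(IpRateLimiter(), ["203.0.113.7"])
pool = [f"198.51.100.{n}" for n in range(1, 51)]  # 50 proxy addresses
proxy_pool = count_flagged(IpRateLimiter(), pool)
print(single_ip, proxy_pool)  # 900 0
```

With one address, every request after the first 100 is flagged. Spread across 50 addresses, each IP sends only 20 requests and stays under the threshold, which is why IP-based blocking alone gets weaker as the pool grows.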
User Agent
No matter what browser the bot uses for scraping, it can set the User-Agent to any value it likes. This means that the bot can easily change its signature, making it more difficult to detect.
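The User-Agent is just a request header supplied by the client, so a server cannot treat it as proof of identity. The sketch below uses Python's standard `urllib.request` to build (but not send) requests that claim to come from different browsers. The URL and the User-Agent strings are illustrative assumptions.

```python
import urllib.request

# Hypothetical URL and User-Agent strings, for illustration only.
URL = "https://example.com/products"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
    "Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


def build_request(url, user_agent):
    """Build (but do not send) a request that claims the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


requests_built = [build_request(URL, ua) for ua in USER_AGENTS]
for req in requests_built:
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

The same client produces two different browser signatures with a one-line change, which is why User-Agent filtering only stops the least sophisticated bots.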
Browser type
No matter what IP address and User-Agent it uses, a bot needs some kind of browser.
There are many options available, but they fall into two main categories:
- HTTP Client – a very simple browser that uses very few system resources but can scrape only the plain HTML returned by the server, since it does not execute JavaScript. With this kind of browser, it is very difficult for a bot to mimic human behavior.
- Web Driver – can be considered a full browser: it renders the complete Web page, including graphics and CSS styling, and executes JavaScript. This sort of browser is very good at mimicking human behavior, but it requires a large amount of system resources (RAM, CPU power) and is very vulnerable to changes in the structure of the scraped pages.
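The difference between the two categories can be sketched with a page whose content is filled in by JavaScript. The example below parses a hypothetical page with Python's standard `html.parser`, which, like any plain HTTP client, reads the markup but never runs the script. The page content and the `TextById` helper are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the price is filled in by JavaScript after load.
PAGE = """
<html><body>
  <h1>Blue Widget</h1>
  <span id="price"></span>
  <script>document.getElementById("price").textContent = "19.99";</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collects the text directly inside the element with a given id.

    Simplified on purpose: nested elements are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = TextById("price")
parser.feed(PAGE)
print(repr(parser.text))  # '' because the script was never executed
```

A Web Driver would run the script and see the price. The HTTP client sees an empty element, so content rendered on the client side stays out of its reach.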
Simulating human behaviour
In recent years, anti-bot protection has taken a different course. Instead of catching bots by their IP address (which gets more and more difficult because of the ever-increasing pool of IPs) or by their User-Agent (which is easy enough to fake), protection tools use complex JavaScript solutions to analyze the visitor’s behavior on the page. For example:
- Humans are expected to spend time on the page, scrolling up and down or moving the mouse
- Humans are not expected to jump from one page to another within seconds
- Visitors might be presented with a Captcha that needs to be solved in order to continue reading the page
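The first two signals above can be sketched as a simple server-side heuristic. This is only an illustration under stated assumptions: the `PageView` fields, the two-second threshold, and the rule itself are hypothetical, and real anti-bot products combine many more signals.

```python
from dataclasses import dataclass


@dataclass
class PageView:
    seconds_on_page: float
    scroll_events: int
    mouse_moves: int


def looks_automated(views, min_seconds=2.0):
    """Hypothetical rule: flag a session only if every page view was
    shorter than min_seconds and showed no scrolling or mouse movement."""
    if not views:
        return False
    for view in views:
        interacted = view.scroll_events > 0 or view.mouse_moves > 0
        if view.seconds_on_page >= min_seconds or interacted:
            return False
    return True


# Made-up sessions for illustration.
human = [PageView(34.0, 12, 240), PageView(8.5, 3, 57)]
bot = [PageView(0.4, 0, 0), PageView(0.3, 0, 0), PageView(0.5, 0, 0)]
print(looks_automated(human), looks_automated(bot))  # False True
```

A session flagged this way would not necessarily be blocked outright. It could instead be shown a Captcha, as in the third point above.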