Hello everyone, I'm programmer Fishskin. Two days ago I ran a mock interview with a candidate who has two years of work experience, and since he did well, I improvised and shared with him a business scenario problem we ran into recently. The problem is as follows:
Didn't we recently build our interview question site, Mianshiya (Interview Duck)? Its question data is exactly the kind of content crawlers go after.
You can watch the video for the full discussion of the problem: /video/BV1b142187Tb
Below I'll share a summary of ways to prevent crawlers with you directly. There are 10 in total, and the last one is quite unusual.
How can I prevent my website from being crawled?
1. Robots protocol and terms of use
robots.txt is a file placed in the root directory of a website that tells search engine crawlers which parts of the site the owner does not want crawled.
For example, you can add the following rules to robots.txt to disallow crawling of a specific directory or file:
User-agent: *
Disallow: /private/
Disallow: /important/
While most compliant crawlers will follow these rules, malicious crawlers may simply ignore them, so robots.txt alone cannot deter all crawlers. It is, however, the first line of protection, serving as a statement of intent and a deterrent.
You can also explicitly prohibit crawlers from scraping data in the website's terms of service or user agreement and state that violations will be treated as unlawful. If a malicious crawler later scrapes the site's content and causes damage, the agreement can serve as one piece of evidence that the terms were violated.
2. Restrictions on access to data
Rather than exposing all data outright, require users to log in or provide an API key to access specific data. You can also add authentication for key content, for example with OAuth 2.0 or JWT (JSON Web Tokens), so that only authorized users can access sensitive data, which effectively blocks unauthorized crawlers from obtaining it.
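As a minimal sketch of this idea, here is a servlet filter that rejects requests lacking a valid credential. The header name X-API-KEY, the hard-coded key set, and the use of the Servlet API (4.0+) are all assumptions for illustration; a real system would validate a JWT or OAuth token issued at login instead.

import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Set;

// Sketch: reject requests to protected paths unless they carry a valid API key.
// "X-API-KEY" and the in-memory key set are placeholders; swap in JWT/OAuth validation in practice.
public class ApiKeyFilter implements Filter {

    private static final Set<String> VALID_KEYS = Set.of("demo-key-123"); // placeholder key

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String apiKey = request.getHeader("X-API-KEY");
        if (apiKey == null || !VALID_KEYS.contains(apiKey)) {
            // No valid credential: refuse to expose the data.
            response.sendError(HttpServletResponse.SC_UNAUTHORIZED, "login or API key required");
            return;
        }
        chain.doFilter(req, res); // authorized, continue to the real handler
    }
}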
3. Access-frequency statistics and blocking
Caching tools such as Redis (distributed cache) or Caffeine (local cache) can be used to record the number of requests per IP or client and to set thresholds that limit how often a single IP address may access the site. When anomalous traffic is detected, the system can automatically block the IP address or apply other policies.
Note that although a plain Map could also count request frequency, its entries keep accumulating and the memory it occupies keeps growing, so a data structure that never releases resources on its own is not recommended. If you must keep the counters in memory, use Caffeine, a caching library with an eviction mechanism.
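Here is a minimal sketch of per-IP counting with Caffeine. Entries expire automatically, so memory does not grow without bound the way a plain Map would. The one-minute window and the 180-requests threshold are illustrative values I picked, not recommendations from the article.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a per-IP request counter backed by Caffeine's expiring cache.
public class IpRateLimiter {

    private static final long LIMIT_PER_MINUTE = 180; // assumed threshold

    private final Cache<String, AtomicLong> counters = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(1)) // the window resets itself; idle IPs are evicted
            .maximumSize(100_000)                    // hard cap as a safety net
            .build();

    /** Returns true if the request should be allowed, false if the IP exceeded the limit. */
    public boolean tryAcquire(String ip) {
        AtomicLong count = counters.get(ip, k -> new AtomicLong());
        return count.incrementAndGet() <= LIMIT_PER_MINUTE;
    }
}

With Redis instead of Caffeine, the same idea is usually implemented with INCR on a per-IP key plus EXPIRE for the time window, which also shares the counters across multiple server instances.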
4. Multi-level processing strategy
To avoid "punishing by mistake", you can set up a more flexible multi-level policy instead of immediately blocking a suspected crawler's client. For example, when abnormal traffic is detected, issue a warning first; if the crawling continues, take harsher measures such as temporarily blocking the IP address; and if it resumes after the block is lifted, apply penalties such as a permanent ban.
The specific policy can be tailored to your situation. Don't make it overly complex, and don't let it add extra load to the system.
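A minimal sketch of such an escalation policy follows. The thresholds, action names, and in-memory map are assumptions for illustration; a real system would persist violation counts and ban state (for example in Redis) rather than in process memory.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an escalating (multi-level) penalty policy: warn, then temp ban, then permanent ban.
public class CrawlerPenaltyPolicy {

    public enum Action { WARN, TEMP_BAN, PERMANENT_BAN }

    private final Map<String, Integer> violations = new ConcurrentHashMap<>();

    /** Called each time abnormal traffic is detected for this IP; returns the action to take. */
    public Action onViolation(String ip) {
        int count = violations.merge(ip, 1, Integer::sum);
        if (count == 1) {
            return Action.WARN;          // first offence: just warn
        } else if (count == 2) {
            return Action.TEMP_BAN;      // second offence: block for a while
        }
        return Action.PERMANENT_BAN;     // repeat offender: block permanently
    }
}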
5. Automatic alerts + human intervention
You can implement automatic alerting: for example, when abnormal traffic or crawler behavior is detected, the system automatically sends a notification via an enterprise WeChat (WeCom) message, so that the site administrator can step in promptly to analyze and handle the crawler's requests.
I've shared this point before, and it applies to more than crawlers: a company's online systems should ideally have comprehensive alerting, covering interface errors, excessive CPU/memory usage, and so on.
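As a minimal sketch, the snippet below pushes an alert to a WeCom group robot through its webhook, using Java 11's built-in HttpClient. The webhook key is a placeholder you would replace with your own robot's key, and a real implementation should build the JSON payload with a proper JSON library rather than string concatenation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: send a plain-text alert message to a WeCom (enterprise WeChat) group robot webhook.
public class CrawlerAlerter {

    private static final String WEBHOOK =
            "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_ROBOT_KEY"; // placeholder key

    private final HttpClient client = HttpClient.newHttpClient();

    public void alert(String message) throws Exception {
        // Text-message payload format from the WeCom group-robot documentation.
        // Note: message should be JSON-escaped; use a JSON library in real code.
        String body = "{\"msgtype\":\"text\",\"text\":{\"content\":\"" + message + "\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(WEBHOOK))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}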
6. Crawler behavior analysis
Illegal crawlers generally behave differently from normal users; crawlers tend to follow specific access patterns. For example, a normal user looks at each question for a while, and for a different length of time each time, whereas a crawler typically accesses questions in a fixed order and at a fixed frequency, which is easy to recognize.
For example, a client that requests one question per second, in strictly ascending ID order, for hours on end is quite possibly a crawler.
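One simple behavioral signal is how regular a client's request timing is: humans browse with very irregular gaps, while a crawler firing at a fixed rate produces intervals with almost no variation. Below is a minimal sketch of that check; the window size and jitter threshold are assumptions, and you would keep one instance per client (for example in the same Caffeine cache used for counting).

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: flag a client whose inter-request intervals are suspiciously regular.
public class IntervalRegularityDetector {

    private static final int WINDOW = 20;            // look at the last 20 requests (assumed)
    private static final double MIN_STD_DEV_MS = 50; // assumed: humans rarely stay under 50 ms of jitter

    private final Deque<Long> timestamps = new ArrayDeque<>();

    /** Record a request time (epoch millis) and return true if the timing looks machine-generated. */
    public synchronized boolean looksLikeCrawler(long nowMillis) {
        timestamps.addLast(nowMillis);
        if (timestamps.size() > WINDOW) {
            timestamps.removeFirst();
        }
        if (timestamps.size() < WINDOW) {
            return false; // not enough data yet
        }
        // Standard deviation of the gaps between consecutive requests.
        long prev = -1;
        double sum = 0, sumSq = 0;
        int n = 0;
        for (long t : timestamps) {
            if (prev >= 0) {
                double gap = t - prev;
                sum += gap;
                sumSq += gap * gap;
                n++;
            }
            prev = t;
        }
        double mean = sum / n;
        double variance = sumSq / n - mean * mean;
        return Math.sqrt(Math.max(variance, 0)) < MIN_STD_DEV_MS;
    }
}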
7. Request header detection
Every request sent to the server carries request headers, and crawler requests can be intercepted by checking identifiers such as User-Agent and Referer in those headers.
Of course, this trick only stops beginners, because request headers are easy to forge: just open the browser's developer console, copy the headers from a normal request, and you can bypass the check.
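For completeness, here is a minimal sketch of such a header check. The keyword list is purely illustrative, and since headers are trivially forged, this should only ever be a cheap first-pass filter.

import java.util.List;

// Sketch: screen request headers for obvious script signatures.
public class HeaderChecker {

    private static final List<String> SUSPICIOUS_UA_KEYWORDS =
            List.of("python-requests", "scrapy", "curl", "httpclient"); // assumed keyword list

    /** Returns true if the headers look like they come from a script rather than a browser. */
    public static boolean isSuspicious(String userAgent, String referer) {
        if (userAgent == null || userAgent.isBlank()) {
            return true; // browsers always send a User-Agent
        }
        String ua = userAgent.toLowerCase();
        for (String keyword : SUSPICIOUS_UA_KEYWORDS) {
            if (ua.contains(keyword)) {
                return true;
            }
        }
        // Inner pages are normally reached via a link, so a missing Referer is a weak extra signal.
        return referer == null;
    }
}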
8. Proactively make data public
I remember learning in a college information security class that one way to prevent attacks is to make the attacker's cost exceed the benefit. For example, if a password is only valid for 10 minutes but takes 15 minutes to crack, nobody will bother cracking it.
Applied to the crawler scenario, our approach is to impose no restrictions at all and let everyone view our site's question data without even logging in! We also provide various question filtering and favoriting features. Most users just want to study, so there is no need for them to spend time crawling the data.
9. Traceability technology
Although the questions are public, some of the high-quality solutions we specifically asked experienced engineers to write are visible to members only. If a user grabs that data with a crawler, watch out! Generally speaking, as long as you are logged in to a website, there is a record of your visit, and if you leak content that is only visible after login, especially paid content, the site administrators will have a way to trace it back to you.
Commonly used traceability techniques include watermarking, blind watermarking, and so on. Our Mianshiya (Interview Duck) site itself requires WeChat login, and if you are a member there is certainly a payment record. These techniques not only help mark the source of data but also strengthen protection by making misused data traceable to its source.
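To make the watermarking idea concrete, here is a minimal sketch of one possible text "blind watermark": the viewing user's ID is encoded as invisible zero-width characters appended to the content served to that user, so leaked text can be traced back to an account. This is an illustrative technique of my own choosing, not necessarily the scheme any particular site uses, and it is easy to strip if the leaker knows to look for it.

// Sketch: embed and recover a user ID hidden in zero-width characters.
public class ZeroWidthWatermark {

    private static final char ZERO = '\u200B'; // zero-width space      -> bit 0
    private static final char ONE  = '\u200C'; // zero-width non-joiner -> bit 1

    /** Append the user ID, encoded as invisible characters, to the content. */
    public static String embed(String content, long userId) {
        StringBuilder hidden = new StringBuilder();
        for (char c : Long.toBinaryString(userId).toCharArray()) {
            hidden.append(c == '0' ? ZERO : ONE);
        }
        return content + hidden;
    }

    /** Recover the user ID from leaked text, or -1 if no watermark is found. */
    public static long extract(String content) {
        StringBuilder bits = new StringBuilder();
        for (char c : content.toCharArray()) {
            if (c == ZERO) bits.append('0');
            else if (c == ONE) bits.append('1');
        }
        return bits.length() == 0 ? -1 : Long.parseLong(bits.toString(), 2);
    }
}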
10. Spread legal awareness
Beyond the methods above, you can further restrict crawlers by using anti-crawler services, adding CAPTCHAs, attaching dynamic timestamps to requests, and so on. But keep in mind that there is no way to defend against crawlers perfectly, because you cannot restrict real users, and an attacker can simply simulate real users to get your data, for example by finding 10 users and having each fetch a few hundred questions.
So here is my final approach: spread legal awareness. You can post a clear legal statement on your website informing users that unauthorized crawling is illegal, which itself deters crawling. You can also publish videos and articles to raise legal awareness among fellow programmers. Crawling carries real risk: writing a crawler for your own learning is fine, but don't put pressure on other people's websites, or you may be suspected of damaging a computer system!