With the Double Eleven shopping carnival approaching, using crawler technology to check the historical price trend of a product seems like a reasonable thing for a programmer to do. After all, it is only for personal reference and involves no commercial use. Still, you must be cautious and careful when crawling, especially during data collection and use: be sure to comply with the relevant laws and regulations and with each platform's terms of use.
Whenever I explain crawlers to people, I always add a word of caution: "Be careful, careful, and careful again!" You want not only to avoid breaking the law, but also to avoid affecting the normal operation of the target website, staying rational and compliant throughout.
Product Acquisition
Okay, our first step is to go to the Jingdong (JD.com) search page and open the listing for the product we care about. For example, suppose what I am most interested in is the price and related data of graphics cards; I then search for graphics cards specifically, which gives us the relevant product data. As shown in the figure:
The job now is to find the request that returns the product information. This request may not be easy to spot, so be patient and go through the network requests one by one. I have already found it for you, so we can start writing our crawler script directly from it, with the goal of extracting the product links. You can get the Python code directly by right-clicking the request and choosing "Copy request as Python code".
As for online tools, there are plenty on the market that can convert captured requests into code. I won't list them all; pick whichever suits your needs.
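Before the snippets below, it helps to see roughly what a converted request looks like. The following setup is only a sketch: the parameter names, header values, and cookies are placeholders standing in for whatever your own browser capture contains, not working values.

    # Placeholder request setup -- the real values come from your own browser capture.
    keyword = '显卡'            # the search keyword ("graphics card" in this example)
    params = {
        'keyword': keyword,     # hypothetical parameter names; copy the real ones from DevTools
        'page': '1',
    }
    headers = {
        'User-Agent': 'Mozilla/5.0',      # use the User-Agent string from your own browser
        'Referer': 'https://www.jd.com/',
    }
    cookies = {
        # paste the cookies exported from your logged-in session here
    }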
You can write the full code yourself; here I only provide sample code for the key parts to help you understand the implementation. These are the key snippets I have put together:
import csv

import requests
from bs4 import BeautifulSoup


def sort_data(data, name):
    # Append the collected name/link pairs to a CSV file named after the search keyword.
    with open(name + '.csv', 'a', newline='', encoding='utf8') as f:
        writer = csv.writer(f)
        for i in data:
            writer.writerow((i['name'], i['link']))


# The search URL is omitted here; params, cookies and headers come from the captured request.
response = requests.get('/', params=params, cookies=cookies, headers=headers)

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all matching div tags
div_tags = soup.find_all('div', class_='p-name p-name-type-2')

# Loop over each div tag and extract its information
store = []
for div_tag in div_tags:
    # Check whether the span tag carries the "self-operated" (自营) badge
    self_operated_tag = div_tag.find('span', class_='p-tag')
    if self_operated_tag and '自营' in self_operated_tag.text:
        # Extract the graphics card name and link
        a_tag = div_tag.find('a', href=True)
        product_name = a_tag.find('em').get_text(strip=True)
        # Handle relative paths by joining them into a complete URL
        link = 'https:' + a_tag['href'] if a_tag['href'].startswith('//') else a_tag['href']
        store.append({
            'name': product_name,
            'link': link
        })
        # Print the results
        print("Name:", product_name)
        print("Link:", link)
    else:
        print("No 自营 badge or related information found.")

sort_data(store, keyword)
Here we focus only on self-operated products, because their quality is relatively guaranteed. To avoid the risk of being blocked for crawling too frequently, I store the crawled data in a CSV file for later use; firing frequent requests at the same website is not recommended and can easily get you blocked.
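One simple way to keep the request rate low (a common pattern, not something from the original script) is a small helper that sleeps a random interval before every request:

    import random
    import time

    import requests

    def polite_get(url, **kwargs):
        # Wait a few seconds before each request so we never hammer the site.
        time.sleep(random.uniform(2, 5))
        return requests.get(url, timeout=10, **kwargs)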
Below is an example of the data I crawled from a single page (see the screenshot). If you need to fetch data from more than one page, just adjust the paging parameters so that pagination works correctly; a rough loop for that is sketched below.
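A rough version of that loop might look like this, reusing the polite_get helper and sort_data from above. The search_url placeholder and the 'page' parameter name are assumptions on my part; confirm them against the real captured request.

    all_products = []
    search_url = '/'  # the search endpoint found in DevTools (omitted here)
    for page in range(1, 6):                 # first five result pages, as an example
        params['page'] = str(page)           # hypothetical paging parameter; check the real request
        response = polite_get(search_url, params=params, cookies=cookies, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        for div_tag in soup.find_all('div', class_='p-name p-name-type-2'):
            a_tag = div_tag.find('a', href=True)
            if a_tag and a_tag.find('em'):
                name = a_tag.find('em').get_text(strip=True)
                link = 'https:' + a_tag['href'] if a_tag['href'].startswith('//') else a_tag['href']
                all_products.append({'name': name, 'link': link})
    sort_data(all_products, keyword)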
That's right, I did not crawl the real-time price of each item, because our main goal this time is the historical price data. The historical-price crawl returns the latest price along the way anyhow, so the requirement is met without spending extra crawling time, and the current code covers both.
Next, we can move on to the other site and look at its data structure and how to crawl it.
Historical Price Crawl
After successfully fetching the data from the first website, we move on to crawling another one. First, to make sure we can actually retrieve the historical price information we need, we do some preliminary testing in the browser. By operating the page manually and analyzing the network requests, I identified the request interface that returns the historical price data.
After some testing and debugging, I found the correct request link. It is shown below for reference:
The plan is to crawl the historical price information for each product link one by one, so that the data is comprehensive and accurate. During crawling, however, I noticed that the request involves an encrypted part, which prevents us from accessing the complete price data directly. This encrypted content has to be decrypted or otherwise processed before we can extract the historical prices.
So before continuing with the crawl, we need to analyze and handle this encryption mechanism. Here is the encrypted part for reference:
In this request, what is sent is not the product link itself but an encrypted "code" parameter. The product link from the request above has simply gone through a conversion step. We do not need to worry much about that step; it is just an extra processing stage and has no substantial impact on the data acquisition itself.
We just need to obtain the "code" parameter in the expected way and use it correctly in the subsequent request. After some analysis and processing, the final code looks like this:
import re
from datetime import datetime

# requests, cookies and headers are reused from the earlier snippet
def get_history(itemid):
    # A bunch of code is omitted here
    params = {
        'ud': 'EAONJNRXWXSMTBKNNYL_1730899204',
        'reqid': '46db0db9f67129f31d1fca1f96ed4239',
        'checkCode': 'ada35e4f5d7c1c55403289ec49df69e3P9f1',
        'con': itemid,
    }
    data = {
        'checkCode': 'ada35e4f5d7c1c55403289ec49df69e3P9f1',
        'con': itemid,
    }
    # The domain of the price-history site is omitted here
    response = requests.post('http:///dm/', params=params, cookies=cookies, headers=headers, data=data, verify=False)
    # A bunch of code is omitted here
    code = response.json()
    params = {
        'code': code['code'],
        't': '',
        'ud': 'EAONJNRXWXSMTBKNNYL_1730899204',
        'reqid': '46db0db9f67129f31d1fca1f96ed4239',
    }
    response = requests.get('http:///dm/', params=params, cookies=cookies, headers=headers, verify=False)
    # Match dates and prices with a regular expression
    pattern = r"\((\d{4}),(\d{1,2}),(\d{1,2})\),([\d\.]+)"
    matches = re.findall(pattern, response.text)
    # Parse the dates and prices
    prices = []
    for match in matches:
        year, month, day, price = match
        date = datetime(int(year), int(month) + 1, int(day))  # the month starts from 0, so add 1
        prices.append((date, float(price)))
    # Find the lowest, highest and latest prices
    min_price = min(prices, key=lambda x: x[1])
    max_price = max(prices, key=lambda x: x[1])
    latest_price = prices[-1]
    # Print the results
    print(f"Lowest price: {min_price[1]}, date: {min_price[0].strftime('%Y-%m-%d')}")
    print(f"Highest price: {max_price[1]}, date: {max_price[0].strftime('%Y-%m-%d')}")
    print(f"Latest price: {latest_price[1]}, date: {latest_price[0].strftime('%Y-%m-%d')}")

get_history("/")  # the product link argument is omitted here
Finally, by analyzing the historical price data we have collected, we can make sensible buying decisions based on the price trend. Here is the end result:
What remains is code optimization. At this stage, the main goal is to demonstrate the basic idea of the implementation and verify that the functionality works. We have no intention of crawling the details of every product; that would neither match our actual needs nor be necessary in practice.
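As a simple illustration of the kind of buying judgment mentioned above (a sketch of my own, assuming get_history is adjusted to return its prices list of (date, price) tuples), you could compare the latest price with the historical range:

    def judge_price(prices):
        # prices: list of (date, price) tuples, as built inside get_history.
        lowest = min(p for _, p in prices)
        highest = max(p for _, p in prices)
        latest = prices[-1][1]
        # Where the latest price sits in the historical range: 0 = all-time low, 1 = all-time high.
        position = (latest - lowest) / (highest - lowest) if highest != lowest else 0.0
        if position <= 0.1:
            print(f"Latest price {latest} is close to the historical low of {lowest}: a good time to buy.")
        else:
            print(f"Latest price {latest} sits at {position:.0%} of the historical range: maybe wait.")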
Summary
Overall, crawler technology gives us access to rich data resources, but it must be used cautiously and rationally so that it brings convenience to our lives rather than trouble. I hope that in the upcoming Double Eleven shopping spree we can all seize the chance to buy the products we want, while also respecting the moral and legal bottom line and being responsible users of technology.
I'm Rain, a Java server-side developer exploring the mysteries of AI technology. I love technical communication and sharing, and I'm passionate about the open source community. I am also a Tencent Cloud Creative Star, an Alibaba Cloud Expert Blogger, a Huawei Cloud Enjoyment Expert, and a Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟