
Crawler case 2 - crawling videos, one of three methods: requests (part 1)

Popularity: 445 · 2024-09-09 21:44:05


Contents
  • Preamble
  • Crawler steps
    • Determine the URL and send the request
    • Getting response data
    • Parsing the response data
    • Save data
  • Full source code
  • Encouragement
  • Blog

Preamble

This article presents a video-crawling case: using the requests library to crawl videos from Haokan Video and save them locally. Follow-up articles on selenium and DrissionPage will also be published. Of course, there are more than these three ways to crawl videos; there are also frameworks such as Python's scrapy and Java's webmagic, among others.

Crawler steps

Determine the URL and send the request

Open the website we want to crawl and press F12 to open DevTools. Because the page uses lazy loading, we have to scroll down to load new videos; as we scroll, new packets appear, and these packets are most likely the source of the newly loaded videos. We can also search in ① in the figure below for content a video packet would contain, such as video file suffixes like mp4, m4s, or ts, and then filter the correct packet out of the results; this may take some experience.
[screenshot: DevTools network panel]
When we scroll down to trigger a refresh, ② loads a new packet; click on the packet and a window appears on the right. In ③, under Headers, we can see the URL we need to request, along with the cookies and some encrypted parameters.
The code is as follows:

import requests  # data request module

url = '/haokan/ui-web/video/feed?time=1723964149093&hk_nonce=915ae0476c308b550e98f6196331fd2a&hk_timestamp=1723964149&hk_sign=93837eec50add65f7ca64a95fb4eb8de&hk_token=aRYZdAVwdwNwCnwBcHNyAAkNAQA'  # request address
headers = {
    # UA spoofing
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0'
}
html = requests.get(url, headers=headers)

Getting response data

In the response we can see the JSON data, which contains the cover image address, title, video address, and so on. We only need the video name (title) and the video address (previewUrlHttp).
[screenshot: JSON response body]

response = html.json()

Parsing the response data

The JSON data is a dictionary, so we only need to index into its keys.

data = response['data']['apiData']  # fetch the video list
for li in data:
    video_name = li['title']  # video name
    video_url = li['previewUrlHttp']  # video address
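As a sketch of this parsing step, the key access can be wrapped in a small helper that skips incomplete entries. The `sample` dict below is hypothetical, merely mimicking the `data` → `apiData` layout seen in the screenshot:

```python
def extract_videos(payload):
    """Return (title, url) pairs from the feed JSON, skipping incomplete entries."""
    items = payload.get('data', {}).get('apiData', [])
    results = []
    for item in items:
        title = item.get('title')
        url = item.get('previewUrlHttp')
        if title and url:
            results.append((title, url))
    return results

# Hypothetical sample mimicking the real response layout:
sample = {'data': {'apiData': [
    {'title': 'demo video', 'previewUrlHttp': 'https://example.com/v.mp4'},
    {'title': 'broken entry'},  # missing previewUrlHttp, so it is skipped
]}}
print(extract_videos(sample))  # → [('demo video', 'https://example.com/v.mp4')]
```

Using `.get()` instead of direct indexing keeps the loop from crashing when the site occasionally returns entries without a video address.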

Save data

After getting the URL of the video, we just need to make another request to that URL to get the binary data, then save it locally.

video = requests.get(video_url, headers=headers).content  # request the video address to get the binary data
with open('./videos/' + video_name + '.mp4', 'wb') as f:  # save the video
    f.write(video)
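One practical caveat: video titles often contain characters that are illegal in file names (`/`, `:`, `?`, and so on), which would make `open()` fail. A small helper (hypothetical, not part of the original code) can sanitize the title first:

```python
import re

def safe_filename(title):
    """Replace characters that Windows/Linux forbid in file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'

print(safe_filename('Top 10: best/worst moments?'))  # → Top 10_ best_worst moments_
```

It would be used as `open('./videos/' + safe_filename(video_name) + '.mp4', 'wb')`.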

Full source code

import requests  # data request module
import os  # file management module

if not os.path.exists('./videos'):  # create the folder
    os.mkdir('./videos')
url = '/haokan/ui-web/video/feed?time=1723964149093&hk_nonce=915ae0476c308b550e98f6196331fd2a&hk_timestamp=1723964149&hk_sign=93837eec50add65f7ca64a95fb4eb8de&hk_token=aRYZdAVwdwNwCnwBcHNyAAkNAQA'  # request address
headers = {
    # UA spoofing
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0'
}
html = requests.get(url, headers=headers).json()
data = html['data']['apiData']  # fetch the video list
for li in data:
    video_name = li['title']  # video name
    video_url = li['previewUrlHttp']  # video address
    video = requests.get(video_url, headers=headers).content  # request the video address to get the binary data
    with open('./videos/' + video_name + '.mp4', 'wb') as f:  # save the video
        f.write(video)
        print(video_name + '.mp4')
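The script above reads each whole video into memory via `.content` before writing it out. For large files, requests' streaming mode is gentler on memory. A sketch under the same headers assumption (this is an alternative, not part of the original script):

```python
import requests

def download_video(url, path, headers, chunk_size=64 * 1024):
    """Stream a video to disk chunk by chunk instead of buffering it all in memory."""
    with requests.get(url, headers=headers, stream=True, timeout=30) as r:
        r.raise_for_status()  # fail loudly on HTTP errors
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```

With `stream=True`, the body is not downloaded until `iter_content` is consumed, so even multi-hundred-megabyte videos use only `chunk_size` bytes of memory at a time.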

For multi-page crawling, observe the packets more closely to find their pattern; in this case it also involves timestamp-based JS encryption of the request parameters.
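For instance, the `time` and `hk_timestamp` query parameters in the captured URL look like a millisecond and a second Unix timestamp respectively (an observation, not confirmed by the source). They could be regenerated as sketched below; the `hk_nonce` and `hk_sign` parameters, by contrast, would still require reversing the site's JS encryption:

```python
import time

def timestamp_params():
    """Rebuild the two time-based query parameters (the interpretation is a guess)."""
    now_ms = int(time.time() * 1000)
    return {
        'time': now_ms,                   # looks like milliseconds
        'hk_timestamp': now_ms // 1000,   # looks like seconds
    }

print(timestamp_params())
```

These could then be passed to `requests.get(url, params=timestamp_params(), ...)` when requesting each new page of the feed.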

Encouragement

Less is more. Slow is faster.

Blog

  • I am a penetration-testing enthusiast, and from time to time I post real-world penetration cases on my WeChat public account (laity's path to penetration testing). Interested readers are welcome to follow it so we can improve together.
    • I previously published an article on the public account about cracking WiFi with Kali; interested readers can check it out. I have also posted the corresponding instructional videos on Bilibili (uploader: laity1717).