I'm sure you all know about crawlers: scripts that crawl web pages for the information you need.
Unlike many crawler tutorials out there, this article breaks a crawler down into four parts, so that the purpose and usage of the code in each part is clear:
1. Get the source code
2. Extract the parts you need from the source code, based on the characteristics of the tags in the page
3. Work out how to automate crawling a series of web pages based on the site's page logic
4. Save data in xlsx and other formats
Let's talk about each step next
1. Get source code
There are a lot of libraries for fetching page source, but the most commonly used one is requests, which is what I use as well
The import method is: import requests
This part is very simple, just one line of code: response = requests.get(url, params=params, headers=headers)
url is the address of the web page you want to crawl
params contains the query parameters and is optional
headers is the request header, also optional, but these days you usually need it to get past simple anti-crawling checks; the key fields are User-Agent and cookies
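Putting that together, here is a minimal sketch; the URL, query parameter and header values are placeholders you would replace with those of the site you are actually crawling:

import requests

url = "https://example.com/list"           # placeholder URL
params = {"p": 1}                          # optional query parameters
headers = {
    "User-Agent": "Mozilla/5.0",           # pretend to be a normal browser
    "Cookie": "sessionid=xxx",             # fill in your own cookies if the site requires login
}

response = requests.get(url, params=params, headers=headers)
response.encoding = response.apparent_encoding   # avoid garbled text on non-UTF-8 pages
html = response.text                       # the page source as a string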
2. Extract content based on web page characteristics
Here I used BeautifulSoup.
The import method is: from bs4 import BeautifulSoup
The specific usage is: soup = BeautifulSoup(html, 'html.parser')
where 'html.parser' is Python's built-in parser, used to parse ordinary HTML documents.
This line parses the HTML content in the html variable into a BeautifulSoup object soup, so that BeautifulSoup's methods can then be used to conveniently traverse and manipulate the various parts of the HTML document.
As for filtering tags, I mainly used two BeautifulSoup functions, find and find_all, which look up elements by tag name and attribute conditions; the two differ slightly.
find is used to find the first element in the document that meets the specified conditions.
find_all is used to find all the elements in the document that match the conditions and returns a list.
For example: first_span = soup.find('span', class_='fl')
Here the find method finds the first span tag whose class attribute is 'fl'.
span_list = soup.find_all('span', class_='fl')
Here the find_all method finds all span tags whose class attribute is 'fl' and stores them in the span_list list.
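To make the difference concrete, here is a small self-contained sketch; the HTML snippet and the 'fl' class are just examples standing in for the real page:

from bs4 import BeautifulSoup

html = '<div><span class="fl">Doctor A</span><span class="fl">Doctor B</span></div>'
soup = BeautifulSoup(html, 'html.parser')

first_span = soup.find('span', class_='fl')          # first match only (or None)
print(first_span.get_text())                         # Doctor A

span_list = soup.find_all('span', class_='fl')       # every match, as a list
print([s.get_text() for s in span_list])             # ['Doctor A', 'Doctor B']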
3. Automated crawling
Many sites paginate with a query parameter such as ?p=1, ?p=2, and so on. Once you spot this kind of logic, you can write it into the script and loop over page numbers to crawl a whole series of pages automatically, as in the sketch below.
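Here is what that might look like, assuming a site whose page number is passed as ?p= (the URL, page range and 'fl' class are placeholders):

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/list"      # placeholder list page
headers = {"User-Agent": "Mozilla/5.0"}

for page in range(1, 6):                   # crawl pages 1 to 5
    response = requests.get(base_url, params={"p": page}, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for span in soup.find_all('span', class_='fl'):
        print(page, span.get_text(strip=True))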
4. Save data
First take a look at the code
from openpyxl import Workbook, load_workbook

def create_excel(name):
    wb = Workbook()
    ws = wb.active
    ws.title = name
    excel_headers = ["Disease Information", "Consultation Type", "Case url", "Doctor url", "Doctor Profile", "Doctor's Specialties", "Doctor's Quality of Service", "Doctor's Recommendations", "Doctor's Communication with Patients"]
    ws.append(excel_headers)
    wb.save(name + ".xlsx")

def write_back_excel(data, name):
    wb = load_workbook(name + ".xlsx")
    ws = wb.active
    ws.append(data)
    wb.save(name + ".xlsx")
Then I'll break it down line by line.
The first function, create_excel(name), creates an Excel file named name.xlsx and writes the header row.
wb = Workbook(): creates a new Workbook object, i.e. a new Excel file.
ws = wb.active: gets the currently active sheet, which is a Worksheet object.
ws.title = name: sets the name of the current sheet to the passed-in name parameter.
excel_headers: defines the fields in the header row of the Excel table, including "Disease Information", "Consultation Type" and so on.
ws.append(excel_headers): adds the header information to the first row.
wb.save(name + ".xlsx"): saves the Excel file; the file name is name.xlsx, where name is the function parameter.
The second function, write_back_excel(data, name), is used to write data into an already existing Excel file.
wb = load_workbook(name + ".xlsx"): uses the load_workbook function to open the existing Excel file name.xlsx.
ws = wb.active: gets the currently active worksheet object.
ws.append(data): appends data as a new row after the last row of the current worksheet.
wb.save(name + ".xlsx"): saves the modified Excel file.
I hope this simple crawler walkthrough is helpful to you!