
Data Acquisition Assignment IV

Gitee link: /wangzm7511/shu-ju/tree/master/operation4

1. Using Selenium to crawl stock data in practice

Requirements:

  • Become proficient in using Selenium to locate HTML elements, crawl Ajax-loaded web data, wait for HTML elements, and more.
  • Use the Selenium framework + MySQL database storage technology route to crawl the stock data of the three boards "CSI A-shares", "SSE A-shares", and "SZ A-shares".
  • Data source website: Oriental Wealth (web address)

Data storage format:

  • Output information: store the data in MySQL and output it in the format of the table below; the table headers should be named in English, for example:
    • Serial number: id
    • Stock code: bStockNo
    • Stock Name: bStockName
    • Latest Price: bLatestPrice

An example of the data format is shown below:

| Serial no. | Stock code | Stock name | Latest price | Change (%) | Change amount | Volume | Turnover | Amplitude | High | Low | Open | Previous close |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 688093 | N Shihua | 28.47 | 62.22% | 10.92 | 26.13 million | 700 million | 22.34 | 32.0 | 28.08 | 30.2 | 17.55 |
| 2 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Process:

1. Overview of crawler development

In this practical exercise, we use Selenium to simulate browser operations and crawl stock data from the Oriental Wealth website. The goal is to extract the stock information of the "CSI A-shares", "SSE A-shares" and "SZ A-shares" boards and store it in a MySQL database. With Selenium, data capture can be automated by simulating user actions on the web page, such as clicking and scrolling. The data crawled from the webpage is then stored in MySQL, which makes subsequent data analysis and presentation convenient.

2. Specific steps for data crawling

  • Determine the content and structure of the crawl

    • Before starting the crawl, we determine the data we need and the location of the relevant elements by analyzing the structure of the web page. Using the browser's developer tools (usually opened by pressing the F12 key) to inspect the HTML, we found that the relevant data sits inside a <tbody> tag, each stock occupies one <tr> element, and each field is in a separate <td> cell.
  • Element Localization and Data Extraction with Selenium

    • Use the Selenium framework to control Chrome, configured in headless mode (i.e. no browser window is displayed). Selenium methods such as find_element() and find_elements() locate the row for each stock and read the data in each column.
  • Crawling data from multiple boards

    • We set up a loop to crawl "CSI A-shares", "SSE A-shares" and "SZ A-shares". By clicking on different navigation tabs, we switch to the corresponding section and crawl the data on each page.
  • Pagination handling

    • Because each board contains a large amount of data, page turning is usually required to crawl it all. The crawler therefore adds a page-flip step: it locates the "next page" button and clicks it until all pages have been crawled (a sketch of this logic follows this list).
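
As a reference, here is a minimal sketch of the element location and pagination logic described above. It assumes the table id table_wrapper-table seen during the page analysis; the "next page" selector (a.next) and the disabled-state check are assumptions about the page and may need to be adapted:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def crawl_board_pages(browser, max_pages=5):
    """Read the stock table page by page by clicking the 'next page' button."""
    wait = WebDriverWait(browser, 10)
    for _ in range(max_pages):
        # Wait for the Ajax-rendered rows of the current page to appear
        rows = wait.until(EC.presence_of_all_elements_located(
            (By.XPATH, "//table[@id='table_wrapper-table']/tbody/tr")))
        for row in rows:
            # Each field sits in its own <td> cell of the row
            print([cell.text for cell in row.find_elements(By.XPATH, "./td")])
        # Locate the "next page" button; the selector is an assumption about the page
        next_btn = browser.find_element(By.CSS_SELECTOR, "a.next")
        if "disabled" in (next_btn.get_attribute("class") or ""):
            break  # reached the last page
        next_btn.click()
```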

3. Data storage

The crawled data needs to be stored in a MySQL database. First, a database named stocks was created, and three tables were created, one for each board. The fields of each table correspond one-to-one with the crawled data fields, making it straightforward to store the crawled rows.

MySQL Table Structure Examples

```sql
CREATE TABLE nav_hs_a_board (
    id INT PRIMARY KEY,
    stock_code VARCHAR(16),
    stock_name VARCHAR(32),
    latest_price VARCHAR(32),
    change_rate VARCHAR(32),
    price_change VARCHAR(32),
    volume VARCHAR(32),
    turnover VARCHAR(32),
    amplitude VARCHAR(32),
    highest_price VARCHAR(32),
    lowest_price VARCHAR(32),
    opening_price VARCHAR(32),
    previous_close VARCHAR(32)
);
```
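
For completeness, a small sketch of creating the per-board tables with pymysql; the table names for the SSE and SZ boards (nav_sh_a_board, nav_sz_a_board) are assumptions, and the column list is shortened to the first few fields:

```python
import pymysql

# Table names for the three boards; the last two names are assumptions
BOARD_TABLES = ["nav_hs_a_board", "nav_sh_a_board", "nav_sz_a_board"]

conn = pymysql.connect(host="localhost", user="root", passwd="password", charset="utf8mb4")
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS stocks CHARACTER SET utf8mb4")
    cursor.execute("USE stocks")
    for table in BOARD_TABLES:
        # Shortened column list; the full schema is shown in the SQL example above
        cursor.execute(f"""
            CREATE TABLE IF NOT EXISTS {table} (
                id INT PRIMARY KEY,
                stock_code VARCHAR(16),
                stock_name VARCHAR(32),
                latest_price VARCHAR(32)
            )
        """)
conn.commit()
conn.close()
```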

4. Challenges and solutions

  • Page load wait: some data takes time to appear after the page loads, so we use Selenium's explicit wait feature (WebDriverWait) to make sure the page elements are present before crawling the data (see the sketch below).
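
A short sketch of the explicit wait described above, assuming browser is the webdriver instance created in the crawler code below:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `browser` is the webdriver.Chrome instance created in the crawler code below
wait = WebDriverWait(browser, 10)  # wait up to 10 seconds
# Block until the Ajax-rendered table body is present in the DOM
wait.until(EC.presence_of_element_located(
    (By.XPATH, "//table[@id='table_wrapper-table']/tbody")))
```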

5. Crawler code implementation

The code contains the following main sections:

  • Initializes the Selenium browser driver.
  • Connect to the MySQL database and create a data table.
  • Crawl the stock data of Oriental Wealth via Selenium.
  • Inserts data into a MySQL database.

Below is a snippet of a simplified version of the code showing how to use Selenium for data crawling and storing to MySQL:

import pymysql
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the Chrome browser configuration (headless mode)
chrome_options = Options()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(options=chrome_options)

# Connect to the MySQL database
connection = pymysql.connect(host="localhost", user="root", passwd="password", db="stocks", charset="utf8mb4")
cursor = connection.cursor()

# Open the target page
browser.get("/center/#hs_a_board")

# Crawl the stock data
wait = WebDriverWait(browser, 10)
rows = browser.find_elements(By.XPATH, "//table[@id='table_wrapper-table']/tbody/tr")
for row in rows:
    stock_code = row.find_element(By.XPATH, "./td[2]/a").text
    stock_name = row.find_element(By.XPATH, "./td[3]/a").text
    latest_price = row.find_element(By.XPATH, "./td[5]/span").text
    # Insert the data into MySQL
    cursor.execute("INSERT INTO nav_hs_a_board (stock_code, stock_name, latest_price) VALUES (%s, %s, %s)",
                   (stock_code, stock_name, latest_price))
connection.commit()
connection.close()

Screenshots



Summary

We used Selenium to successfully crawl stock data from the Oriental Wealth website and store it in a MySQL database. The whole process involves web page structure analysis, element positioning, page flipping, data storage and more, making it a comprehensive exercise in crawler technology. In this way we can automate the collection of stock market data and lay the foundation for further data analysis. I hope this sharing is helpful to you!

2. Using Selenium to crawl course data from the Chinese MOOC website

Requirements:

  • Become proficient in using Selenium to locate HTML elements, simulate user logins, crawl Ajax-loaded web data, wait for HTML elements, and more.

  • Use the Selenium framework + MySQL database storage technology route to crawl course resource information from the Chinese MOOC website (course number, course name, school name, lead teacher, team members, number of participants, course progress, course introduction).

  • Candidate website: China MOOC network (web address)

  • Output information: store the data into MySQL with the following table structure:

    • Course number: id
    • Course title: cCourse
    • School name: cCollege
    • Lead teacher: cTeacher
    • Team members: cTeam
    • Number of participants: cCount
    • Course progress: cProcess
    • Course description: cBrief

Process:

1. Overview of project objectives and needs

In this practical exercise, we use Selenium to simulate browser operations and crawl course information from the Chinese MOOC website, including the course name, school name, lead teacher and other fields. The crawled data is stored in a MySQL database for subsequent analysis and presentation. To achieve this, we wrote an automation script with the following main parts:

  • Simulate the user login process
  • Crawling the course information page
  • Page Flip to Crawl for More Courses
  • Storing the crawled data into a MySQL database

2. Specific steps for data crawling

1 Simulate user login

  • Login page analysis
    We start by using the browser's developer tools (usually opened by pressing the F12 key) to analyze the structure of the login page and find the login button, account input box and password input box. With Selenium, we can simulate clicking the login button, entering the account and password, and completing the user login.
  • Selenium automated login
    Use Selenium to locate the login button and click it, then switch into the login iframe, enter the account and password, and finally click the login button to submit the login information (a sketch of this flow follows this list).
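
The following is a minimal sketch of this login flow, assuming the login form is rendered inside an iframe; the locator for the login entry button and the name/id attributes of the input fields are assumptions and will need to match the real page:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def login(browser, url, phone, password):
    """Open the page, switch into the login iframe and submit the credentials."""
    browser.get(url)
    wait = WebDriverWait(browser, 10)
    # Open the login dialog (locator text is an assumption)
    wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//div[contains(text(), 'Login')]"))).click()
    # The login form lives in an iframe, so switch into it before typing
    iframe = wait.until(EC.presence_of_element_located((By.TAG_NAME, "iframe")))
    browser.switch_to.frame(iframe)
    browser.find_element(By.NAME, "phone").send_keys(phone)        # account box (name assumed)
    browser.find_element(By.NAME, "password").send_keys(password)  # password box (name assumed)
    browser.find_element(By.ID, "submitBtn").click()               # submit button (id assumed)
    # Switch back to the main document after submitting
    browser.switch_to.default_content()
```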

2 Course Information Crawling

  • Open search page
    After a successful login, we navigated to the course search page and analyzed the structure of the page to find the elements that contained course information.
  • Crawling data from multiple pages
    We have implemented a page-crawling feature that automatically fetches the course information for each page by clicking the "Next" button.

3 Scrolling page loading

  • Since some of the course information only loads after the page is scrolled, we use Selenium to simulate scrolling before crawling each page so that all the course information is loaded and can be crawled completely (see the sketch below).
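
A small sketch of this scroll-to-load step, assuming the page keeps appending course cards until the document height stops growing; the pause length is arbitrary:

```python
import time

def scroll_to_bottom(browser, pause=1.0, max_rounds=10):
    """Scroll down repeatedly until the page height stops growing."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the lazily loaded content time to render
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
```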

4 Data storage

  • MySQL database creation and connection
    The crawled data needs to be stored in a MySQL database. We first created a database named stocks and then created a table named mooc in it to hold the course information.
  • Example of data table structure
    CREATE TABLE mooc (
        id INT AUTO_INCREMENT PRIMARY KEY,
        cCourse VARCHAR(255),
        cCollege VARCHAR(255),
        cTeacher VARCHAR(255),
        cTeam VARCHAR(255),
        cCount VARCHAR(50),
        cProcess VARCHAR(100),
        cBrief TEXT
    );
    
  • Whenever a course record is successfully crawled, it is inserted into the database (a sketch of the insert step is shown below).
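
A minimal sketch of the insert step, using a parameterized pymysql statement against the mooc table defined above; the course dict is assumed to hold the fields extracted by the parsing step:

```python
import pymysql

def store_course(connection, course):
    """Insert one crawled course record into the mooc table."""
    sql = ("INSERT INTO mooc (cCourse, cCollege, cTeacher, cTeam, cCount, cProcess, cBrief) "
           "VALUES (%s, %s, %s, %s, %s, %s, %s)")
    with connection.cursor() as cursor:
        cursor.execute(sql, (course["cCourse"], course["cCollege"], course["cTeacher"],
                             course["cTeam"], course["cCount"], course["cProcess"], course["cBrief"]))
    connection.commit()
```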

Code implementation

Below is the main part of the code that shows how to use Selenium for data crawling and storing it in MySQL:

import pymysql
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from lxml import etree
import time

class MoocScraper:
    def __init__(self):
        # Initializing the Browser Configuration
        try:
            chrome_options = Options()
            chrome_options.add_argument('--headless')
            self.browser = webdriver.Chrome(options=chrome_options)
            print("Browser initialized successfully")
        except Exception as e:
            print(f"Browser initialization failure: {e}")
        self.initialize_db()

    def initialize_db(self):
        # Initialize the database and create tables
        try:
            mydb = pymysql.connect(
                host="localhost",
                user="root",
                password="cryptographic",
                charset='utf8mb4'
            )
            with mydb.cursor() as cursor:
                cursor.execute("CREATE DATABASE IF NOT EXISTS stocks CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
                cursor.execute("USE stocks")
                cursor.execute(
                    """
                    CREATE TABLE IF NOT EXISTS mooc (
                        id INT AUTO_INCREMENT PRIMARY KEY,
                        cCourse VARCHAR(255),
                        cCollege VARCHAR(255),
                        cTeacher VARCHAR(255),
                        cTeam VARCHAR(255),
                        cCount VARCHAR(50),
                        cProcess VARCHAR(100),
                        cBrief TEXT
                    )
                    """
                )
            mydb.commit()
            print("Database initialization successful")
        except Exception as e:
            print(f"An error occurred while initializing the database: {e}")
        finally:
            if 'mydb' in locals():
                mydb.close()

    def login(self, url, phone, password):
        # Simulate the login process
        ...

    def scrape_courses(self, search_url):
        # Crawl for course information
        ...

    def parse_and_store(self, html):
        # Parsing the page and storing it in the database
        ...

if __name__ == "__main__":
    scraper = MoocScraper()
    login_url = "https:///"
    search_url = "https:///?search=%20#/"
    scraper.login(login_url, 'phone number', 'password')
    scraper.scrape_courses(search_url)
    scraper.browser.quit()
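
The login, crawling and parsing method bodies are elided above. For illustration only, here is a hedged sketch of what a parsing helper could look like using the lxml etree module imported in the code; the class names in the XPath expressions are purely illustrative and do not reflect the real page markup:

```python
from lxml import etree

def parse_courses(html):
    """Parse one search-result page into a list of course dicts (illustrative selectors)."""
    root = etree.HTML(html)
    courses = []
    # The card container class is an assumption, not the site's real markup
    for card in root.xpath("//div[@class='course-card']"):
        def first(xpath_expr):
            values = card.xpath(xpath_expr)
            return values[0].strip() if values else ""
        courses.append({
            "cCourse":  first(".//span[@class='name']/text()"),
            "cCollege": first(".//a[@class='school']/text()"),
            "cTeacher": first(".//a[@class='teacher']/text()"),
            "cTeam":    ", ".join(t.strip() for t in card.xpath(".//a[@class='teacher']/text()")),
            "cCount":   first(".//span[@class='enroll-count']/text()"),
            "cProcess": first(".//span[@class='process']/text()"),
            "cBrief":   first(".//p[@class='brief']/text()"),
        })
    return courses
```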

Challenges and solutions

  • Page load wait: some elements on the page load slowly. To ensure the crawler can grab all the data reliably, we combine Selenium's explicit waits with short pauses (time.sleep()) so that elements are fully loaded before they are read.

Screenshots


Summary

We used Selenium to crawl course data from the Chinese MOOC website and store it in a MySQL database. The whole process involves web page structure analysis, element positioning, page flipping, data storage and more, making it a comprehensive exercise in Selenium crawling technology. In this way we can automate the collection of online course information and lay the foundation for subsequent educational data analysis.

If you have any questions about the code implementation or want to learn more about using Selenium, feel free to talk to me in the comments section!

3. Huawei Cloud_Big Data Real-time Analysis and Processing Experiment Manual-Flume Log Collection Experiment

1 Flume Log Capture

1.1 Task 1: Python script to generate test data

1.2 Task 2: Configure Kafka

1.3 Task 3: Install the Flume Client

1.4 Task 4: Configure Flume to Capture Data