
Crawling information from dynamic web pages with C#: the Bilibili homepage


Contents
  • Introduction
  • Getting the HTML document
  • Parsing HTML documents
  • Testing
  • References

Introduction

Dynamic websites use JavaScript to fetch and render their data at runtime, so crawling them requires simulating browser behavior; otherwise the source code you get back is mostly empty. The crawling steps are as follows:

  • Use Selenium to get the rendered HTML document
  • Use HtmlAgilityPack to parse the HTML document

Create a new project and install the required libraries:

  • Selenium.WebDriver
  • HtmlAgilityPack
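
The snippets in this post assume roughly the following using directives at the top of the file (the standard namespaces of the two packages plus the BCL types used):

// Namespaces used by the code in this post.
using System;
using System.Collections.Generic;
using System.Threading;        // Thread.Sleep
using HtmlAgilityPack;         // HtmlDocument, HtmlNode
using OpenQA.Selenium;         // By, NoSuchElementException
using OpenQA.Selenium.Chrome;  // ChromeDriver, ChromeOptions

Recent Selenium.WebDriver releases can manage the matching chromedriver binary automatically; with older versions it may need to be installed separately (for example via the Selenium.WebDriver.ChromeDriver package).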

Getting the HTML document

There are two main things to keep in mind:

  • Set the browser startup options: headless mode, disable GPU acceleration, and set the window size at startup
  • Wait for the page to finish loading its dynamic content: here we simply wait 5 seconds; adjust the time as appropriate (an explicit-wait variant is sketched after the code below)

private static string GetHtml(string url)
{
    ChromeOptions options = new ChromeOptions();
    // Do not show the browser
    options.AddArgument("--headless");
    // GPU acceleration may cause Chrome to show a black screen and high CPU usage
    options.AddArgument("--nogpu");
    // Set Chrome's window size at startup
    options.AddArgument("--window-size=10,10");

    using (var driver = new ChromeDriver(options))
    {
        try
        {
            driver.Navigate().GoToUrl(url);
            // Wait for the page to finish loading its dynamic content
            Thread.Sleep(5000);
            // Return the rendered page source
            return driver.PageSource;
        }
        catch (NoSuchElementException)
        {
            Console.WriteLine("This element was not found");
            return string.Empty;
        }
    }
}
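
As noted in the list above, the fixed 5-second sleep is a blunt instrument. A minimal sketch of an explicit-wait variant follows; the method name GetHtmlWithWait, the readyXPath parameter, and the 15-second timeout are illustrative choices, and depending on the Selenium version WebDriverWait may require the Selenium.Support package (namespace OpenQA.Selenium.Support.UI).

// Sketch only: replace the fixed Thread.Sleep with an explicit wait that
// polls until at least one element matching readyXPath is present.
private static string GetHtmlWithWait(string url, string readyXPath)
{
    var options = new ChromeOptions();
    options.AddArgument("--headless");

    using (var driver = new ChromeDriver(options))
    {
        driver.Navigate().GoToUrl(url);

        // Poll for up to 15 seconds until the target element exists.
        var wait = new OpenQA.Selenium.Support.UI.WebDriverWait(driver, TimeSpan.FromSeconds(15));
        wait.Until(d => d.FindElements(By.XPath(readyXPath)).Count > 0);

        return driver.PageSource;
    }
}

For example, passing the video-list XPath used in the next section as readyXPath would return the source as soon as the video list has been rendered.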

Parsing HTML documents

Here is an example of crawling the video information from a Bilibili uploader's homepage: the title, link, and cover of each video.
Start by defining a class to hold the information:

class VideoInfo
{
    public string Title { get; set; }
    public string Href { get; set; }
    public string ImgUrl { get; set; }
}

Then define a parsing function that returns a list of video information:

private static List<VideoInfo> GetVideoInfos(string url)
{
    List<VideoInfo> videoInfos = new List<VideoInfo>();

    // Load the document
    var html = GetHtml(url);
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // Parse the document, first locating the tag that wraps the video list
    var xpath = "/html/body/div[2]/div[4]/div/div/div[1]/div[2]/div/div";
    var htmlNodes = htmlDoc.DocumentNode.SelectNodes(xpath);

    // Loop over its children and parse the video information
    foreach (var node in htmlNodes)
    {
        var titleNode = node.SelectSingleNode("a[2]");
        var imgNode = node.SelectSingleNode("a[1]/div[1]/picture/source[1]");

        var title = titleNode.InnerText;
        var href = titleNode.Attributes["href"].Value.Trim('/');
        var imgUrl = imgNode.Attributes["srcset"].Value.Split('@')[0].Trim('/');

        videoInfos.Add(new VideoInfo
        {
            Title = title,
            Href = href,
            ImgUrl = imgUrl
        });
    }

    return videoInfos;
}

The XPath of the video list tag is obtained with the browser's developer tools: right-click the tag in the Elements panel and choose Copy > Copy full XPath.
(Image: copying the full XPath in the browser developer tools)

When analyzing the HTML inside each node, the markup may be poorly formatted; running it through an online HTML code formatter first makes it easier to read.
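
One convenient way to do that (a hypothetical helper, not part of the original code) is to write a node's markup to a file and paste it into the formatter; HtmlAgilityPack exposes the raw markup through HtmlNode.OuterHtml:

// Hypothetical helper: write a node's raw HTML to a file so it can be
// inspected or run through a formatter. The name DumpNodeHtml is illustrative.
private static void DumpNodeHtml(HtmlNode node, string path)
{
    System.IO.File.WriteAllText(path, node.OuterHtml);
}

For example, DumpNodeHtml(htmlNodes[0], "node.html") saves the first video card for inspection.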

Testing

Take the Bilibili uploader StarPupil_Official as an example and crawl the video information:

static void Main(string[] args)
{
    var url = @"/401315430";
    var videoInfos = GetVideoInfos(url);
    foreach (var videoInfo in videoInfos)
    {
        Console.WriteLine(videoInfo.Title);
        Console.WriteLine(videoInfo.Href);
        Console.WriteLine(videoInfo.ImgUrl);
        Console.WriteLine();
    }
    Console.ReadKey();
}

The results are as follows:

Wait a minute, good sister.
/video/BV1uyxLeJEM9
/bfs/archive/

One bite? Your super sweet chili.
/video/BV1AQsDeiEn1
/bfs/archive/

This is only a demonstration of how to crawl a dynamic page. If you actually want a Bilibili uploader's video information, it is recommended to request the data directly from the API.
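
For completeness, the direct-API approach generally looks like the sketch below (using System.Net.Http and System.Threading.Tasks). The endpoint shown is a placeholder, since the real Bilibili API, its query parameters, and any required headers or request signing are not covered here, and GetVideoJsonAsync is an illustrative name.

// Sketch only: request JSON from an HTTP API instead of rendering the page.
// The URL below is a placeholder, not the real Bilibili endpoint.
private static async Task<string> GetVideoJsonAsync()
{
    using (var client = new HttpClient())
    {
        // Many APIs reject requests without a browser-like User-Agent.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0");
        return await client.GetStringAsync("https://api.example.com/videos?uid=401315430");
    }
}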

References

  • The Ultimate Guide to Web Crawling in C#
  • C# write a small crawler, the realization of crawling js loaded web pages
  • Html Agility Pack Documentation
  • [Continuously Updated] C# Selenium Common Operations Code