- Synopsis
- Getting the HTML document
- Parsing HTML documents
- Testing
- Reference articles
Synopsis
Websites with dynamic content use JavaScript to retrieve and render data on the fly. Crawling them requires simulating browser behavior; otherwise the source code you fetch is essentially empty. The crawling steps are:
- Use Selenium to get the rendered HTML document
- Use HtmlAgilityPack to parse the HTML document
Create a new project and install the required libraries:
- Selenium.WebDriver
- HtmlAgilityPack
Getting the HTML document
There are two main things to keep in mind:
- Set the browser startup options: headless mode, disabled GPU acceleration, and the window size at startup
- Wait for the page to finish loading dynamically: here we simply wait 5 seconds; adjust the time to suit the page
private static string GetHtml(string url)
{
    ChromeOptions options = new ChromeOptions();
    // Do not show the browser window
    options.AddArgument("--headless");
    // GPU acceleration may cause Chrome to show a black screen and high CPU usage
    options.AddArgument("--nogpu");
    // Set Chrome's window size on startup
    options.AddArgument("--window-size=10,10");
    using (var driver = new ChromeDriver(options))
    {
        try
        {
            driver.Navigate().GoToUrl(url);
            // Wait for the page to finish loading dynamically
            Thread.Sleep(5000);
            // Return the page source code
            return driver.PageSource;
        }
        catch (NoSuchElementException)
        {
            Console.WriteLine("This element was not found");
            return string.Empty;
        }
    }
}
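A fixed `Thread.Sleep(5000)` is simple but fragile: it wastes time on fast pages and can be too short on slow ones. A more robust alternative is an explicit wait that polls until the content you need actually appears. This is only a sketch: it assumes the Selenium.WebDriver and Selenium.Support NuGet packages, and the `div.video-list` CSS selector is a hypothetical placeholder for whatever element signals that the page has rendered.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI; // WebDriverWait lives in the Selenium.Support package

class WaitSketch
{
    static string GetHtmlWithWait(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);
            // Poll for up to 10 seconds until the (hypothetical) video list
            // container exists, instead of sleeping a fixed 5 seconds.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector("div.video-list")).Count > 0);
            return driver.PageSource;
        }
    }
}
```

`WebDriverWait.Until` rethrows a `WebDriverTimeoutException` if the condition never becomes true, so a page that fails to render surfaces as an error rather than as silently empty HTML.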
Parsing HTML documents
Here is an example of crawling the video information on a Bilibili uploader's homepage, such as each video's title, link, and cover image.
Start by defining a class to hold the information:
class VideoInfo
{
public string Title { get; set; }
public string Href { get; set; }
public string ImgUrl { get; set; }
}
Define a parsing function that returns a list of video information:
private static List<VideoInfo> GetVideoInfos(string url)
{
    List<VideoInfo> videoInfos = new List<VideoInfo>();
    // Load the document
    var html = GetHtml(url);
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    // Parse the document: first locate the video list tag
    var xpath = "/html/body/div[2]/div[4]/div/div/div[1]/div[2]/div/div";
    var htmlNodes = htmlDoc.DocumentNode.SelectNodes(xpath);
    // Loop over its children to parse each video's information
    foreach (var node in htmlNodes)
    {
        var titleNode = node.SelectSingleNode("a[2]");
        var imgNode = node.SelectSingleNode("a[1]/div[1]/picture/source[1]");
        var title = titleNode.InnerText;
        var href = titleNode.Attributes["href"].Value.Trim('/');
        var imgUrl = imgNode.Attributes["srcset"].Value.Split('@')[0].Trim('/');
        videoInfos.Add(new VideoInfo
        {
            Title = title,
            Href = href,
            ImgUrl = imgUrl
        });
    }
    return videoInfos;
}
The XPath of the video list tag is obtained with the browser's developer tools: right-click the tag in question and choose Copy full XPath.
When analyzing the code inside the child nodes, the HTML text may be messily formatted; you can run it through an online HTML code formatting tool before analyzing it.
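As a standalone illustration of the XPath calls used above, here is a minimal, self-contained HtmlAgilityPack example. It assumes the HtmlAgilityPack NuGet package, and the HTML fragment is made up to mimic the structure of a video card list:

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // Made-up fragment mimicking a list of video cards
        var html = @"<div id='list'>
            <div><a href='/video/BV1xx'>First video</a></div>
            <div><a href='/video/BV2yy'>Second video</a></div>
        </div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // An absolute path from 'Copy full XPath' also works, but a short
        // relative query like this one is less brittle to layout changes.
        var nodes = doc.DocumentNode.SelectNodes("//div[@id='list']/div");
        foreach (var node in nodes)
        {
            var a = node.SelectSingleNode("a");
            Console.WriteLine($"{a.InnerText} -> {a.Attributes["href"].Value}");
        }
    }
}
```

Note that `SelectNodes` returns `null` (not an empty collection) when nothing matches, so in real crawling code it is worth checking the result before iterating.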
Testing
Take the Bilibili uploader StarPupil_Official as an example and crawl the video information:
static void Main(string[] args)
{
    var url = @"/401315430";
    var videoInfos = GetVideoInfos(url);
    foreach (var videoInfo in videoInfos)
    {
        Console.WriteLine(videoInfo.Title);
        Console.WriteLine(videoInfo.Href);
        Console.WriteLine(videoInfo.ImgUrl);
        Console.WriteLine();
    }
    Console.ReadKey();
}
The results are as follows:
Wait a minute, good sister.
/video/BV1uyxLeJEM9
/bfs/archive/
One bite? Your super sweet chili.
/video/BV1AQsDeiEn1
/bfs/archive/
This is just a demonstration of how to crawl a dynamic page. If you actually want a Bilibili uploader's video information, it is recommended to request the data directly from the API.
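If you go the API route instead, the whole Selenium layer disappears: it becomes a plain HTTP GET plus JSON parsing. Here is a minimal sketch; the endpoint URL and the `list`/`title` JSON field names are placeholders for illustration, not the real Bilibili API:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class ApiDemo
{
    static async Task Main()
    {
        using var client = new HttpClient();
        // Placeholder endpoint -- substitute the actual API URL and parameters
        var json = await client.GetStringAsync("https://example.com/api/videos?uid=401315430");
        using var doc = JsonDocument.Parse(json);
        // Placeholder field names -- adjust to the API's actual response shape
        foreach (var item in doc.RootElement.GetProperty("list").EnumerateArray())
        {
            Console.WriteLine(item.GetProperty("title").GetString());
        }
    }
}
```

Besides being far faster than driving a headless browser, this returns structured data directly, so there is no XPath to break when the page layout changes.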
Reference articles
- The Ultimate Guide to Web Crawling in C#
- Writing a small crawler in C# that crawls JS-loaded web pages
- Html Agility Pack Documentation
- [Long Update] C# Selenium Common Operations Code