
Web Scraping (AngleSharp): Using AngleSharp to Parse HTML

There is a Python book on this topic: Web Scraping with Python.

I plan to implement all of its examples with .NET Core and third-party libraries.

This is part one, and it mainly uses AngleSharp: https://anglesharp.github.io/

(The chapter numbering of this article matches the book's.)

Sending HTTP Requests

In Python, the book sends the HTTP request using the standard library urllib.

In .NET Core you can use HttpClient; the corresponding C# code is as follows:

var client = new HttpClient();
HttpResponseMessage response = await client.GetAsync("http://pythonscraping.com/pages/page1.html");
response.EnsureSuccessStatusCode();
var responseBody = await response.Content.ReadAsStringAsync();
Console.WriteLine(responseBody);
return responseBody;

Or, more concisely:

var client = new HttpClient();
var responseBody = await client.GetStringAsync("http://pythonscraping.com/pages/page1.html");
Console.WriteLine(responseBody);

The response looks like this:
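If the companion page still matches the book, the response body is roughly the following (reconstructed from memory of pythonscraping.com/pages/page1.html, so details may differ):

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet...
</div>
</body>
</html>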

Parsing HTML with AngleSharp

In Python, you can parse the HTML source with libraries such as BeautifulSoup or MechanicalSoup.

In .NET Core you can work with HTML documents using libraries such as AngleSharp, Html Agility Pack, or DotnetSpider (a Chinese project that also supports element extraction).

Here I use AngleSharp first. AngleSharp's parser follows the official W3C specifications when parsing HTML, MathML, XML, SVG, and CSS, and it supports .NET Standard 1.0.

Installing AngleSharp

Install it via NuGet: https://www.nuget.org/packages/AngleSharp/

Install-Package AngleSharp

Or with the dotnet CLI:

dotnet add package AngleSharp

A Simple AngleSharp Example

The following example (1.2.2 in the book) prints the content of the page's h1 element.

The book does this in Python. Here is the .NET Core C# code:

public static async Task ReadWithAngleSharpAsync()
{
    var htmlSourceCode = await SendRequestWithHttpClientAsync();
    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSourceCode);

    Console.WriteLine($"Serializing the (original) document: {document.QuerySelector("h1").OuterHtml}");
    Console.WriteLine($"Serializing the (original) document: {document.QuerySelector("html > body > h1").OuterHtml}");
}

Here AngleSharp first creates a reusable HtmlParser (HTML parser), and that parser then parses the HTML source via parser.Parse(), or its asynchronous version parser.ParseAsync().

The parse result is an IHtmlDocument that holds the parsed DOM. (The original article shows a diagram of how DOM nodes map onto AngleSharp's classes; it depicts an older version, and the newer DOM model differs slightly, but the idea is the same…)

AngleSharp has many features, but the most important one is that it supports the querySelector() and querySelectorAll() methods, just like the DOM.

In this example, the page's HTML structure is very simple: the h1 sits directly under body (see the sample response body earlier).

So on the returned IHtmlDocument document, document.QuerySelector("h1").OuterHtml returns the h1's outer HTML. document.QuerySelector("html > body > h1").OuterHtml has exactly the same effect, because standard CSS selectors are fully supported.

QuerySelector() returns one element or none, much like LINQ's FirstOrDefault().
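For illustration, these two calls are equivalent ways to get the first match or null (a small sketch of my own, not from the original article):

var first = document.QuerySelector("h1");
var alsoFirst = document.QuerySelectorAll("h1").FirstOrDefault(); // needs using System.Linq;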

The output looks like this:
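Assuming page1.html still carries the book's sample markup, both lines should print the same element, something like:

Serializing the (original) document: <h1>An Interesting Title</h1>
Serializing the (original) document: <h1>An Interesting Title</h1>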

Handling Exceptions

After an HTTP request is sent, things can go wrong: the page may not exist (or the request may fail), the server may not exist, and so on.

In these situations the server returns an HTTP error status, perhaps 404 or 500. In all such cases HttpClient (GetStringAsync, or GetAsync followed by EnsureSuccessStatusCode) throws an HttpRequestException, which we can handle like this:

public static async Task ResponseWithErrorsAsync()
{
    try
    {
        var client = new HttpClient();
        var responseBody = await client.GetStringAsync("http://notexistwebsite");
        Console.WriteLine(responseBody);
    }
    catch (HttpRequestException e)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine("\nException Caught!");
        Console.WriteLine("Message :{0} ", e.Message);
    }
}

But even when the page is fetched successfully, its content may not be what we expect, and exceptions can still occur. For example, if the tag you are looking for does not exist, QuerySelector() returns null, and accessing a property on that null reference then throws a NullReferenceException.

You can either catch the NullReferenceException or guard against it in code:

public static async Task ReadNonExistTagAsync()
{
    var htmlSourceCode = await SendRequestWithHttpClientAsync();
    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSourceCode);

    var nonExistTag = document.QuerySelector("h8");
    Console.WriteLine(nonExistTag);
    Console.WriteLine($"nonExistTag is null: {nonExistTag is null}");

    try
    {
        Console.WriteLine(nonExistTag.QuerySelector("p").OuterHtml);
    }
    catch (NullReferenceException)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine("Tag was not found");
    }
}

A complete example:

public static async Task RunAllAsync()
{
    Console.ForegroundColor = ConsoleColor.Red;
    async Task<string> GetTitleAsync(string uri)
    {
        var httpClient = new HttpClient();
        try
        {
            var responseHtml = await httpClient.GetStringAsync(uri);
            var parser = new HtmlParser();
            var document = await parser.ParseAsync(responseHtml);
            var tagContent = document.QuerySelector("body > h8").TextContent;
            return tagContent;
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"{nameof(HttpRequestException)}:");
            Console.WriteLine("Message :{0} ", e.Message);
            return null;
        }
        catch (NullReferenceException)
        {
            Console.WriteLine($"{nameof(NullReferenceException)}:");
            Console.WriteLine("Tag was not found");
            return null;
        }
    }

    var title = await GetTitleAsync("http://www.pythonscraping.com/pages/page1.html");
    if (string.IsNullOrWhiteSpace(title))
    {
        Console.WriteLine("Title was not found");
    }
    else
    {
        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine(title);
    }
}

First, I wrapped the part that requests the HTML source code into a reusable method:

public static async Task<string> GetHtmlSourceCodeAsync(string uri)
{
    var httpClient = new HttpClient();
    try
    {
        var htmlSource = await httpClient.GetStringAsync(uri);
        return htmlSource;
    }
    catch (HttpRequestException e)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"{nameof(HttpRequestException)}: {e.Message}");
        return null;
    }
}

CSS is a blessing for web scrapers. The following two kinds of element may appear many times in a page:
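On warandpeace.html they look roughly like this (sampled from memory of the companion page, so the exact text may differ): character names are wrapped in green spans, quotations in red ones:

<span class="green">Anna Pavlovna Scherer</span>
<span class="red">Heavens! what a virulent attack!</span>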

We can use AngleSharp's QuerySelectorAll() method to find every matching element and return them as a collection.

public static async Task FindGreenClassAsync()
{
    const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);
        // "span.green" selects every span whose class list contains "green"
        var nameList = document.QuerySelectorAll("span.green");

        Console.WriteLine("Green names are:");
        Console.ForegroundColor = ConsoleColor.Green;
        foreach (var item in nameList)
        {
            Console.WriteLine(item.TextContent);
        }
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }
}

Very simple; it works just like standard DOM operations.

If you only need an element's text, use its TextContent property.

Another Example

1. Find all h1, h2, h3, h4, h5, and h6 elements in the page.

2. Find all span elements whose class is green or red.

public static async Task FindByAttributeAsync()
{
    const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);

        var headers = document.QuerySelectorAll("*")
            .Where(x => new[] { "h1", "h2", "h3", "h4", "h5", "h6" }.Contains(x.TagName.ToLower()));
        Console.WriteLine("Headers are:");
        PrintItemsText(headers);

        var greenAndRed = document.All
            .Where(x => x.TagName.ToLower() == "span" && (x.ClassList.Contains("green") || x.ClassList.Contains("red")));
        Console.WriteLine("Green and Red spans are:");
        PrintItemsText(greenAndRed);

        var thePrinces = document.QuerySelectorAll("*").Where(x => x.TextContent == "the prince");
        Console.WriteLine(thePrinces.Count());
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }

    void PrintItemsText(IEnumerable<IElement> elements)
    {
        foreach (var item in elements)
        {
            Console.WriteLine(item.TextContent);
        }
    }
}

Here you can see that QuerySelectorAll()'s result can be filtered with LINQ's Where method, which is very powerful.

The TagName property is simply the element's tag name.

There is also document.All; the All property is the collection of every element in the document, and it supports LINQ as well.

(This method also uses a local function.)

With both CSS selectors and LINQ available, extracting elements becomes much easier.

Navigating the Tree

A page's structure might look like this:
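The book illustrates this with page3.html, whose tree is roughly the following (my own sketch of the companion page; some details may differ):

html
  body
    div.wrapper
      h1
      div.content
      table#giftList
        tr (th, th, th, th)
        tr.gift (td, td with span.excitingNote, td, td with img)
        ... more rows
      div.footer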

There are a few concepts here:

Child tags and descendant tags.

A child tag is one level directly below its parent, while descendant tags are all tags at any level below the parent.

tr is a child of table; tr, th, td, and img are all descendants of table.

With AngleSharp, use the .Children property to find child tags; to find descendant tags, use a CSS selector.

Sibling tags

Use the .PreviousElementSibling property to get the previous sibling, and .NextElementSibling for the next one.

Parent tags

The .ParentElement property is the parent tag.

public static async Task FindDescendantAsync()
{
    const string url = "http://www.pythonscraping.com/pages/page3.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);

        var tableChildren = document.QuerySelector("table#giftList > tbody").Children;
        Console.WriteLine("Table's children are:");
        foreach (var child in tableChildren)
        {
            Console.WriteLine(child.LocalName);
        }

        var descendants = document.QuerySelectorAll("table#giftList > tbody *");
        Console.WriteLine("Table's descendants are:");
        foreach (var item in descendants)
        {
            Console.WriteLine(item.LocalName);
        }

        var siblings = document.QuerySelectorAll("table#giftList > tbody > tr").Select(x => x.NextElementSibling);
        Console.WriteLine("Table rows' next siblings are:");
        foreach (var item in siblings)
        {
            Console.WriteLine(item?.LocalName);
        }

        var parentSibling = document.All.SingleOrDefault(x => x.HasAttribute("src") && x.GetAttribute("src") == "../img/gifts/img1.jpg")
            ?.ParentElement.PreviousElementSibling;
        if (parentSibling != null)
        {
            Console.WriteLine($"Parent's previous sibling is: {parentSibling.TextContent}");
        }
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }
}

The result:

Using Regular Expressions

"If you have a problem and plan to solve it with regular expressions, now you have two problems."

Here is a handy site for testing regular expressions: https://www.regexpal.com/

AngleSharp supports finding elements via CSS selectors and filtering them with LINQ, and of course you can also bring in regular expressions in several ways for more complex matching.

I won't introduce regular expressions themselves; let's go straight to the example.

I want to find all the images on the page whose src starts with ../img/gifts/img, followed by digits, and ends with .jpg.

public static async Task FindByRegexAsync()
{
    const string url = "http://www.pythonscraping.com/pages/page3.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);

        // src must start with ../img/gifts/img, followed by digits, ending in .jpg
        var images = document.QuerySelectorAll("img")
            .Where(x => x.HasAttribute("src") && Regex.Match(x.Attributes["src"].Value, @"\.\./img/gifts/img\d+\.jpg").Success);
        foreach (var item in images)
        {
            Console.WriteLine(item.Attributes["src"].Value);
        }

        var elementsWith2Attributes = document.All.Where(x => x.Attributes.Length == 2);
        foreach (var item in elementsWith2Attributes)
        {
            Console.WriteLine(item.LocalName);
            foreach (var attr in item.Attributes)
            {
                Console.WriteLine($"\t{attr.Name} - {attr.Value}");
            }
        }
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }
}

There is really nothing difficult here.

But this example does show the attribute APIs: HasAttribute("xxx") checks whether an element has a given attribute, the .Attributes indexer retrieves an attribute, and .Attributes["xxx"].Value is its value.
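As a side note, IElement also exposes GetAttribute(), which returns the value directly. A tiny sketch, where item is one of the img elements from the loop above:

var src1 = item.Attributes["src"]?.Value;
var src2 = item.GetAttribute("src"); // same string, null when the attribute is missing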

And if you don't know regular expressions, a bit more LINQ filtering can usually get you the same result.

Traversing a Single Domain

These are just a few application examples, so let's go straight to the code.

Print all the hyperlink addresses in a page:

public static async Task TraversingASingleDomainAsync()
{
    var httpClient = new HttpClient();
    var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");

    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSource);
    var links = document.QuerySelectorAll("a");
    foreach (var link in links)
    {
        Console.WriteLine(link.Attributes["href"]?.Value);
    }
}

Find the hyperlinks that meet all of the following conditions:

  • they are inside the div whose id is bodyContent
  • the URL does not contain a colon
  • the URL starts with /wiki/

public static async Task FindSpecificLinksAsync()
{
    var httpClient = new HttpClient();
    var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");

    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSource);
    var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
    foreach (var link in links)
    {
        Console.WriteLine(link.Attributes["href"]?.Value);
    }
}

Randomly pick a link from the page, then call the method recursively, until we stop it manually:

private static async Task<IEnumerable<IElement>> GetLinksAsync(string uri)
{
    var httpClient = new HttpClient();
    var htmlSource = await httpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSource);

    var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
    return links;
}

public static async Task GetRandomNestedLinksAsync()
{
    var random = new Random();
    var links = (await GetLinksAsync("/wiki/Kevin_Bacon")).ToList();
    while (links.Any())
    {
        var newArticle = links[random.Next(0, links.Count)].Attributes["href"].Value;
        Console.WriteLine(newArticle);
        links = (await GetLinksAsync(newArticle)).ToList();
    }
}

Crawling an Entire Site

First, a few concepts to understand:

Surface web: the portion of the internet that search engines can crawl directly.

Its opposite is the deep web: roughly 90% of the internet is deep web.

The darknet (dark web / dark internet) is another beast entirely. It also runs on the existing network infrastructure, but uses Tor clients and a protocol layered on top of HTTP to provide a secure tunnel for exchanging information. The darknet can be scraped too, but that is beyond the scope of the book…

Compared with the darknet, the deep web is relatively easy to scrape.

Crawling an entire site has two benefits:

  • Generating a site map
  • Collecting data

Because of a site's scale and depth, many of the links we collect will be duplicates. We need to deduplicate them, which a Set-style collection such as HashSet handles well:

private static readonly HashSet<string> LinkSet = new HashSet<string>();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HtmlParser Parser = new HtmlParser();

public static async Task GetUniqueLinksAsync(string uri = "")
{
    var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
    var document = await Parser.ParseAsync(htmlSource);

    var links = document.QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success);

    foreach (var link in links)
    {
        if (!LinkSet.Contains(link.Attributes["href"].Value))
        {
            var newPage = link.Attributes["href"].Value;
            Console.WriteLine(newPage);
            LinkSet.Add(newPage);
            await GetUniqueLinksAsync(newPage);
        }
    }
}

(Watch the recursion depth, or the program can crash with a stack overflow.)
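If the recursion bites, one alternative is an explicit queue. This is a minimal sketch of my own, not code from the article (the method name is hypothetical; it reuses the HttpClient and Parser fields above):

public static async Task GetUniqueLinksIterativelyAsync(string startUri = "")
{
    var visited = new HashSet<string>();
    var queue = new Queue<string>();
    queue.Enqueue(startUri);

    while (queue.Count > 0)
    {
        var uri = queue.Dequeue();
        if (!visited.Add(uri))
        {
            continue; // already crawled
        }

        var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
        var document = await Parser.ParseAsync(htmlSource);

        var links = document.QuerySelectorAll("a")
            .Where(x => x.HasAttribute("href") && x.Attributes["href"].Value.StartsWith("/wiki/"));
        foreach (var link in links)
        {
            var newPage = link.Attributes["href"].Value;
            if (!visited.Contains(newPage))
            {
                Console.WriteLine(newPage);
                queue.Enqueue(newPage); // breadth-first instead of recursive depth-first
            }
        }
    }
}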

Collecting Data Across an Entire Site

This example is more complete: it also collects some text from each page and handles exceptions:

private static readonly HashSet<string> LinkSet = new HashSet<string>();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HtmlParser Parser = new HtmlParser();

public static async Task GetLinksWithInfoAsync(string uri = "")
{
    var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
    var document = await Parser.ParseAsync(htmlSource);

    try
    {
        var title = document.QuerySelector("h1").TextContent;
        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine(title);

        var contentElement = document.QuerySelector("#mw-content-text").QuerySelectorAll("p").FirstOrDefault();
        if (contentElement != null)
        {
            Console.WriteLine(contentElement.TextContent);
        }

        var alink = document.QuerySelector("#ca-edit").QuerySelectorAll("span a").SingleOrDefault(x => x.HasAttribute("href"))?.Attributes["href"].Value;
        Console.WriteLine(alink);
    }
    catch (NullReferenceException)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine("Cannot find the tag!");
    }

    var links = document.QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success).ToList();
    foreach (var link in links)
    {
        if (!LinkSet.Contains(link.Attributes["href"].Value))
        {
            var newPage = link.Attributes["href"].Value;
            Console.WriteLine(newPage);
            LinkSet.Add(newPage);
            await GetLinksWithInfoAsync(newPage);
        }
    }
}

Crawling Across the Internet: When You Don't Know How Deep the Water Is

First example: follow random external links:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using AngleSharp.Parser.Html;

namespace WebScrapingWithDotNetCore.Chapter03
{
    public class CrawlingAcrossInternet
    {
        private static readonly Random Random = new Random();
        private static readonly HttpClient HttpClient = new HttpClient();
        private static readonly HashSet<string> InternalLinks = new HashSet<string>();
        private static readonly HashSet<string> ExternalLinks = new HashSet<string>();
        private static readonly HtmlParser Parser = new HtmlParser();

        public static async Task FollowExternalOnlyAsync(string startingSite)
        {
            var externalLink = await GetRandomExternalLinkAsync(startingSite);
            if (externalLink != null)
            {
                Console.WriteLine($"External Links is: {externalLink}");
                await FollowExternalOnlyAsync(externalLink);
            }
            else
            {
                Console.WriteLine("Random External link is null, Crawling terminated.");
            }
        }

        private static async Task<string> GetRandomExternalLinkAsync(string startingPage)
        {
            try
            {
                var htmlSource = await HttpClient.GetStringAsync(startingPage);
                var externalLinks = (await GetExternalLinksAsync(htmlSource, SplitAddress(startingPage)[0])).ToList();
                if (externalLinks.Any())
                {
                    return externalLinks[Random.Next(0, externalLinks.Count)];
                }

                var internalLinks = (await GetInternalLinksAsync(htmlSource, startingPage)).ToList();
                if (internalLinks.Any())
                {
                    return await GetRandomExternalLinkAsync(internalLinks[Random.Next(0, internalLinks.Count)]);
                }

                return null;
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine($"Error requesting: {e.Message}");
                return null;
            }
        }

        private static string[] SplitAddress(string address)
        {
            var addressParts = address.Replace("http://", "").Replace("https://", "").Split("/");
            return addressParts;
        }

        private static async Task<IEnumerable<string>> GetInternalLinksAsync(string htmlSource, string includeUrl)
        {
            var document = await Parser.ParseAsync(htmlSource);
            var links = document.QuerySelectorAll("a")
                .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, $@"^(/|.*{includeUrl})").Success)
                .Select(x => x.Attributes["href"].Value);
            foreach (var link in links)
            {
                if (!string.IsNullOrEmpty(link) && !InternalLinks.Contains(link))
                {
                    InternalLinks.Add(link);
                }
            }
            return InternalLinks;
        }

        private static async Task<IEnumerable<string>> GetExternalLinksAsync(string htmlSource, string excludeUrl)
        {
            var document = await Parser.ParseAsync(htmlSource);

            var links = document.QuerySelectorAll("a")
                .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, $@"^(http|www)((?!{excludeUrl}).)*$").Success)
                .Select(x => x.Attributes["href"].Value);
            foreach (var link in links)
            {
                if (!string.IsNullOrEmpty(link) && !ExternalLinks.Contains(link))
                {
                    ExternalLinks.Add(link);
                }
            }
            return ExternalLinks;
        }

        private static readonly HashSet<string> AllExternalLinks = new HashSet<string>();
        private static readonly HashSet<string> AllInternalLinks = new HashSet<string>();

        public static async Task GetAllExternalLinksAsync(string siteUrl)
        {
            try
            {
                var htmlSource = await HttpClient.GetStringAsync(siteUrl);
                var internalLinks = await GetInternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
                var externalLinks = await GetExternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
                foreach (var link in externalLinks)
                {
                    if (!AllExternalLinks.Contains(link))
                    {
                        AllExternalLinks.Add(link);
                        Console.WriteLine(link);
                    }
                }

                foreach (var link in internalLinks)
                {
                    if (!AllInternalLinks.Contains(link))
                    {
                        Console.WriteLine($"The link is: {link}");
                        AllInternalLinks.Add(link);
                        await GetAllExternalLinksAsync(link);
                    }
                }
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine(e);
                Console.WriteLine($"Request error: {e.Message}");
            }
        }
    }
}

This program has a bug; you are welcome to fix it……

That's it for part one…. It mainly used AngleSharp, and AngleSharp can do far more than this; it is very powerful, so see its documentation for details.

Since the next part of the book uses Python's Scrapy, the next article will probably use DotnetSpider, a Chinese-made library….

The project's code is at: https://github.com/solenovex/Web-Scraping-With-.NET-Core
