Web Crawlers (Part 1)

2020-10-02

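1. Web Crawlers

1.1 Introduction to Web Crawlers

In the era of big data, collecting information is important work, and the amount of data on the internet is enormous. Gathering it purely by hand is slow and tedious, and it drives up the cost of collection. How to automatically and efficiently obtain the information we care about and put it to use is therefore an important question, and crawler technology exists to answer it.

A web crawler, also called a web robot, collects and organizes data from the internet automatically on people's behalf. It is a program or script that fetches World Wide Web content according to a set of rules, and it can collect the content of every page it is able to reach in order to obtain the relevant data.

Functionally, a crawler usually has three parts: data collection, processing, and storage. It starts from the URLs of one or more seed pages and extracts the URLs found on them. While fetching pages, it keeps extracting new URLs from the current page and adding them to a queue until a stop condition defined by the system is met.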
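The queue-driven loop described above can be sketched with the JDK alone. This is only an illustration under stated assumptions: the `LINKS` map is a hypothetical in-memory stand-in for "fetch a page and extract its URLs", the `example.com` addresses are made up, and the stop condition is a simple page limit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlFrontierSketch {

    // Hypothetical link graph standing in for the web: page URL -> URLs found on that page.
    static final Map<String, List<String>> LINKS = new HashMap<>();

    static {
        LINKS.put("http://example.com/", Arrays.asList("http://example.com/a", "http://example.com/b"));
        LINKS.put("http://example.com/a", Arrays.asList("http://example.com/b", "http://example.com/c"));
        LINKS.put("http://example.com/b", Arrays.asList("http://example.com/"));
        LINKS.put("http://example.com/c", Collections.<String>emptyList());
    }

    /** Visits pages breadth-first from the seed until the queue is empty or maxPages is reached. */
    static List<String> crawl(String seed, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();          // URLs already queued, to avoid loops
        List<String> visited = new ArrayList<>();

        frontier.add(seed);
        seen.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url); // a real crawler would download and process the page here
            for (String link : LINKS.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(link)) {
                    frontier.add(link); // enqueue only URLs not seen before
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/", 10));
    }
}
```

The `seen` set is what keeps the crawler from revisiting pages that link back to each other; without it the loop would only end through the page limit.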

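1.2 Why Learn Web Crawling

We now have a first idea of what a web crawler is, but why learn to build one? Knowing the goal makes the learning more effective. Here are four common reasons.

Building a search engine.

Once we can write a crawler, we can use it to collect information from the internet automatically and then store or process what it brings back. When we need to look something up, we only have to search the collected data, which gives us a private search engine.

Obtaining more data sources in the big-data era.

Big-data analysis and data mining need source data. We can get it from sites that publish statistics, or from literature and internal documents, but these channels often cannot meet our needs, and hunting for the data on the internet by hand takes too much effort. A crawler can fetch the content we are interested in automatically and bring it back as a data source for deeper analysis, from which we can extract more valuable information.

Doing search engine optimization (SEO) better.

To do their job well, SEO practitioners need a clear understanding of how search engines work, including how search engine crawlers work. Learning to write crawlers gives a deeper understanding of those crawlers, so that optimization work is based on knowing both sides.

Improving job prospects.

From an employment point of view, crawler engineering is a good option. According to the author, demand for crawler engineers is growing while few people are qualified for these positions, which makes it a role in short supply. As big data and artificial intelligence spread, crawler technology is expected to be applied more and more widely and to offer good room for growth.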

1.3 A First Crawler Program

1.3.1 Environment

JDK 1.8
IntelliJ IDEA
The Maven bundled with IDEA

1.3.2 Create the Project and Add Dependencies

<dependencies>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>

    <!-- Logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
    </dependency>
</dependencies>

1.3.3 Add log4j.properties

log4j.rootLogger=DEBUG,A1
log4j.logger.com.wgy = DEBUG

log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

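1.3.4 Write the Code

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * A first crawler program
 *
 * @author wgy
 */
public class CrawlerFirst {

    public static void main(String[] args) throws Exception {
        // 1. "Open the browser": create the HttpClient object
        //    (try-with-resources closes the client when we are done)
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {

            // 2. "Type the address": create an HttpGet object for the GET request
            HttpGet httpGet = new HttpGet("http://www.itcast.cn");

            // Setting a browser User-Agent can avoid 403 Forbidden responses and security blocks
            //String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36";
            //httpGet.setHeader("User-Agent", userAgent);

            // 3. "Press Enter": send the request with the HttpClient and receive the response
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {

                // 4. Parse the response and read the data
                // Check whether the status code is 200
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String content = EntityUtils.toString(entity, "UTF-8");
                    System.out.println(content);
                }
            }
        }
    }
}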

2. HttpClient

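A web crawler is a program that accesses resources on the network for us. We normally reach web pages over HTTP, and a crawler does the same thing in code: it uses the HTTP protocol to request pages.

Here we use HttpClient, an HTTP client library for Java, to fetch page data.

2.1 GET Requests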

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * GET request
 *
 * @author wgy
 */
public class HttpGetTest {

    public static void main(String[] args) {
        // Create the HttpGet object and set the URL to request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");

        // try-with-resources closes the response and then the client,
        // and avoids a NullPointerException when execute() fails before a response exists
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {

            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

2.2 GET Requests with Parameters

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * GET request with parameters
 *
 * @author wgy
 */
public class HttpGetParamTest {

    public static void main(String[] args) throws Exception {
        // The target address is http://yun.itheima.com/search?keys=Java
        // Create a URIBuilder for the base address
        URIBuilder uriBuilder = new URIBuilder("http://yun.itheima.com/search");
        // Set the query parameter
        uriBuilder.setParameter("keys", "Java");

        // Create the HttpGet object from the built URI
        HttpGet httpGet = new HttpGet(uriBuilder.build());

        System.out.println("Request: " + httpGet);

        // try-with-resources closes the response and then the client
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {

            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
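To see roughly what a builder such as `URIBuilder` has to do with parameters, here is a JDK-only sketch that appends percent-encoded query parameters to a base URL. It is not HttpClient's implementation; the class and method names are made up for illustration, and `URLEncoder` applies form encoding (a space becomes `+`), which may differ in detail from what `URIBuilder` emits.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {

    /** Appends the parameters to baseUrl as ?name=value&name=value, encoding names and values as UTF-8. */
    static String withParams(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            separator = '&'; // every parameter after the first is joined with '&'
        }
        return sb.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported by the JVM
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keys", "Java");
        System.out.println(withParams("http://yun.itheima.com/search", params));
    }
}
```

Non-ASCII values are the reason encoding matters: the search keyword 手机 used later in this article appears in the JD URL as `%E6%89%8B%E6%9C%BA`, which is its UTF-8 percent-encoded form.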

2.3 POST Requests

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * POST request
 *
 * @author wgy
 */
public class HttpPostTest {

    public static void main(String[] args) {
        // Create the HttpPost object and set the URL to request
        HttpPost httpPost = new HttpPost("http://www.itcast.cn");

        // try-with-resources closes the response and then the client
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpPost)) {

            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

2.4 POST Requests with Parameters

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * POST request with parameters
 *
 * @author wgy
 */
public class HttpPostParamTest {

    public static void main(String[] args) throws Exception {
        // Create the HttpPost object and set the URL to request
        HttpPost httpPost = new HttpPost("http://yun.itheima.com/search");

        // Declare a List that holds the form parameters
        List<NameValuePair> params = new ArrayList<>();

        // Same search as http://yun.itheima.com/search?keys=Java,
        // but here keys=Java is sent as a form field in the request body
        params.add(new BasicNameValuePair("keys", "Java"));

        // Create the form entity: the first argument is the form data, the second is the encoding
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");

        // Attach the form entity to the POST request
        httpPost.setEntity(formEntity);

        // try-with-resources closes the response and then the client
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpPost)) {

            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

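2.5 Connection Pool

If every request creates its own HttpClient, connections are set up and torn down over and over. A connection pool solves this by keeping connections open and reusing them.

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * HttpClient connection pool
 *
 * @author wgy
 */
public class HttpClientPoolTest {

    public static void main(String[] args) {
        // Create the pooling connection manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();

        // Maximum number of connections in the pool
        cm.setMaxTotal(100);

        // Maximum number of connections per route (per target host)
        cm.setDefaultMaxPerRoute(10);

        // Send requests through the pooling connection manager
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // Build an HttpClient backed by the shared connection manager,
        // so connections are leased from the pool instead of being opened for each request
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();

        HttpGet httpGet = new HttpGet("http://www.itcast.cn");

        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);

            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");

                System.out.println(content.length());
            }

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    // Closing the response releases its connection
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                // Do not close the HttpClient here: closing it would also shut down
                // the shared connection manager that other requests still use
                //httpClient.close();
            }
        }
    }
}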
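The idea behind pooling can be shown with a toy pool built on the JDK. This is a simplified sketch, not how `PoolingHttpClientConnectionManager` works internally: `Conn` is a hypothetical stand-in for an open connection, the pool is filled eagerly, and there is no per-route limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class SimplePoolSketch {

    /** Hypothetical stand-in for an expensive resource such as an open TCP connection. */
    static final class Conn {
        final int id;

        Conn(int id) {
            this.id = id;
        }
    }

    private final BlockingQueue<Conn> idle;

    /** Creates a pool holding at most maxTotal connections. */
    SimplePoolSketch(int maxTotal) {
        idle = new ArrayBlockingQueue<>(maxTotal);
        for (int i = 0; i < maxTotal; i++) {
            idle.add(new Conn(i));
        }
    }

    /** Returns an idle connection, or null if none becomes available within the timeout. */
    Conn lease(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Hands a connection back so that the next caller can reuse it. */
    void release(Conn conn) {
        idle.offer(conn);
    }

    public static void main(String[] args) {
        SimplePoolSketch pool = new SimplePoolSketch(1);
        Conn first = pool.lease(100);
        System.out.println("second lease while busy: " + pool.lease(50));
        pool.release(first);
        System.out.println("reused same connection: " + (pool.lease(100) == first));
    }
}
```

The bounded wait in `lease` plays a role comparable to a "time to obtain a connection from the pool" setting: when every connection is in use, a caller waits for one to be released rather than opening a new one.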

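2.6 Request Configuration

Sometimes, because of the network or the target server, a request takes longer to complete, so we need to set the timeouts ourselves.

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * Request configuration
 *
 * @author wgy
 */
public class HttpConfigTest {

    public static void main(String[] args) {
        // Create the HttpGet object and set the URL to request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");

        // Configure the request
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            // maximum time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500)   // maximum time to wait for a connection from the connection manager, in milliseconds
                .setSocketTimeout(10 * 1000)        // maximum time to wait for data while reading the response, in milliseconds
                .build();

        // Apply the configuration to the request
        httpGet.setConfig(config);

        // try-with-resources closes the response and then the client
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {

            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}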
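The difference between the connect timeout and the socket (read) timeout can be observed with plain JDK sockets. The sketch below is an assumption-laden illustration rather than anything HttpClient-specific: it opens a local server socket that accepts the TCP connection at the operating-system level but never sends a byte, so the connect step succeeds and the read step times out.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeoutSketch {

    /** Returns true if reading from a silent local server times out after readTimeoutMs. */
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the server never calls accept() or writes anything.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket socket = new Socket()) {

            // Connect timeout: how long we are willing to wait for the TCP connection itself.
            socket.connect(new InetSocketAddress(loopback, server.getLocalPort()), connectTimeoutMs);

            // Read timeout: how long a single read may block while waiting for data.
            socket.setSoTimeout(readTimeoutMs);
            socket.getInputStream().read();
            return false; // data (or end of stream) arrived, so no timeout occurred
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(1000, 200));
    }
}
```

A crawler usually wants both limits: a short connect timeout to give up quickly on unreachable hosts, and a longer read timeout to tolerate slow pages.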

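3. Jsoup

After fetching a page we still need to parse it. We could use string-processing utilities or regular expressions, but both carry a high development cost, so we want a tool built specifically for parsing HTML pages.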
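For contrast, here is what the hand-rolled regular-expression approach looks like for a single field, the page title. It is a minimal sketch with made-up class and method names, shown only to illustrate why this approach gets expensive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTitleSketch {

    // Matches <title ...>text</title>, ignoring case and allowing line breaks inside the text.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title element, or an empty string if there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        System.out.println(title("<html><head><title> Demo page </title></head></html>"));
    }
}
```

Even this small case needs care with case and whitespace, and it does nothing about nested tags, attributes, or HTML entities. Every further field needs another pattern, which is the development cost mentioned above.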

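3.1 Introduction to Jsoup

Jsoup is an HTML parser for Java that can parse a URL or HTML text directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Its main features are:

Parsing HTML from a URL, a file, or a string;
Finding and extracting data with DOM traversal or CSS selectors;
Manipulating HTML elements, attributes, and text.

3.2 Parsing with Jsoup

Jsoup dependencies: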

<dependencies>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>

    <!-- Logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
        <!--<scope>test</scope>-->
    </dependency>

    <!--Jsoup-->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>

    <!-- Testing -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>

    <!-- Utilities -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
</dependencies>

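3.2.1 Parsing a URL

Jsoup can take a URL directly. It sends the request, fetches the data, and wraps the result in a Document object.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;

import java.net.URL;

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Parse a URL
     *
     * @throws Exception
     */
    @Test
    public void testUrl() throws Exception {
        // Parse the URL: the first argument is the URL to request,
        // the second is the request timeout in milliseconds
        Document doc = Jsoup.parse(new URL("http://www.itcast.cn"), 1000);

        // Use the tag lookup to read the content of the title tag
        String title = doc.getElementsByTag("title").first().text();

        // Print it
        System.out.println(title);
    }
}

PS: Although Jsoup can replace HttpClient and send the request itself, it is rarely used that way. Real projects need multithreading, connection pooling, proxies, and so on, and jsoup's support for these is limited, so we normally use jsoup only as an HTML parsing tool.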

3.2.2 Parsing a String

First prepare the following HTML file (test.html):

<html>
    <head> 
        <title>传智播客官网-一样的教育,不一样的品质</title> 
    </head> 
    <body>
        <div class="city">
            <h3 id="city_bj">北京中心</h3>
            <fb:img src="/2018czgw/images/slogan.jpg" class="slogan"/>
            <div class="city_in">
                <div class="city_con" style="display: none;">
                    <ul>
                        <li id="test" class="class_a class_b">
                            <a href="http://www.itcast.cn" target="_blank">
                                <span class="s_name">北京</span>
                            </a>
                        </li>
                        <li>
                            <a href="http://sh.itcast.cn" target="_blank">
                                <span class="s_name">上海</span>
                            </a>
                        </li>
                        <li>
                            <a href="http://gz.itcast.cn" target="_blank">
                                <span abc="123" class="s_name">广州</span>
                            </a>
                        </li>
                        <ul>
                            <li>天津</li>
                        </ul>					
                    </ul>
                </div>
            </div>
        </div>
    </body>
</html>

Jsoup can take a string directly and wrap it in a Document object.

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Parse a string
     *
     * @throws Exception
     */
    @Test
    public void testString() throws Exception {
        // Read the file into a string with the commons-io utility class
        String content = FileUtils.readFileToString(new File("C:\\Users\\wgy\\Desktop\\test.html"), "UTF-8");

        // Parse the string
        Document doc = Jsoup.parse(content);

        String title = doc.getElementsByTag("title").first().text();

        System.out.println(title);
    }
}

3.2.3 Parsing a File

Jsoup can parse a file directly and wrap it in a Document object.

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Parse a file
     *
     * @throws Exception
     */
    @Test
    public void testFile() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(new File("C:\\Users\\wgy\\Desktop\\test.html"), "UTF-8");

        String title = doc.getElementsByTag("title").first().text();

        System.out.println(title);

    }
}

3.2.4 Traversing the Document with DOM Methods

3.2.4.1 Getting Elements

Get an element by id: getElementById
Get elements by tag: getElementsByTag
Get elements by class: getElementsByClass
Get elements by attribute: getElementsByAttribute

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Getting elements
     *
     * @throws Exception
     */
    @Test
    public void testDOM() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("C:\\Users\\wgy\\Desktop\\test.html"), "UTF-8");


        // Get elements
        // 1. Get an element by id: getElementById
        //Element element = doc.getElementById("city_bj");

        // 2. Get elements by tag: getElementsByTag
        //Element element = doc.getElementsByTag("span").first();

        // 3. Get elements by class: getElementsByClass
        //Element element = doc.getElementsByClass("class_a class_b").first();
        //Element element = doc.getElementsByClass("class_a").first();
        //Element element = doc.getElementsByClass("class_b").first();


        // 4. Get elements by attribute: getElementsByAttribute,
        //    or by attribute name and value: getElementsByAttributeValue
        //Element element = doc.getElementsByAttribute("abc").first();
        Element element = doc.getElementsByAttributeValue("href", "http://sh.itcast.cn").first();

        // Print the element's text
        System.out.println("Element content: " + element.text());
    }
}

3.2.4.2 Getting Data from an Element

Get the id from an element: id
Get the class name from an element: className
Get an attribute value from an element: attr
Get all attributes from an element: attributes
Get the text content from an element: text

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Getting data from an element
     *
     * @throws Exception
     */
    @Test
    public void testData() throws Exception {
        // Parse the file and get the Document
        Document doc = Jsoup.parse(new File("C:\\Users\\wgy\\Desktop\\test.html"), "UTF-8");

        // Get the element by id
        Element element = doc.getElementById("test");

        String str = "";

        // Get data from the element
        // 1. Get the id
        str = element.id();

        // 2. Get the class name
        str = element.className();
        //Set<String> classSet = element.classNames();
        //for (String s : classSet ) {
        //    System.out.println(s);
        //}

        // 3. Get an attribute value with attr
        //str = element.attr("id");
        str = element.attr("class");

        // 4. Get all attributes with attributes
        Attributes attributes = element.attributes();
        System.out.println(attributes.toString());

        // 5. Get the text content with text
        str = element.text();

        // Print the last value assigned to str (the text content)
        System.out.println("Data: " + str);

    }
}

3.2.5 Finding Elements with Selector Syntax

Jsoup supports a selector syntax similar to CSS (or jQuery), which makes lookups powerful and flexible. The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter within a specific element or chain selections.

The select method returns an Elements collection, which provides a set of methods for extracting and processing the results.

3.2.5.1 Selector Overview

tagname: find elements by tag, for example: span
#id: find elements by ID, for example: #city_bj
.class: find elements by class name, for example: .class_a
[attribute]: find elements by attribute, for example: [abc]
[attr=value]: find elements by attribute value, for example: [class=s_name]

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Selectors
     *
     * @throws Exception
     */
    @Test
    public void testSelector() throws Exception {

        // Parse the HTML file and get the Document object
        Document doc = Jsoup.parse(new File("C:\\Users\\wgy\\Desktop\\test.html"), "UTF-8");

        // tagname: find elements by tag, for example: span
        Elements elements = doc.select("span");
        //for (Element element : elements) {
        //    System.out.println(element.text());
        //}

        // #id: find elements by ID, for example: #city_bj
        //Element element = doc.select("#city_bj").first();

        // .class: find elements by class name, for example: .class_a
        //Element element = doc.select(".class_a").first();

        // [attribute]: find elements by attribute, for example: [abc]
        Element element = doc.select("[abc]").first();

        // [attr=value]: find elements by attribute value, for example: [class=s_name]
        Elements elements1 = doc.select("[class=s_name]");
        for (Element element1 : elements1) {
            System.out.println(element1.text());
        }


        // Print the result
        System.out.println("Result: " + element.text());
    }
}

3.2.5.2 Combining Selectors

el#id: element + ID, for example: h3#city_bj
el.class: element + class, for example: li.class_a
el[attr]: element + attribute name, for example: span[abc]
Any combination: for example: span[abc].s_name
ancestor child: find descendants of an element, for example: .city_con li finds every li under "city_con"
parent > child: find direct children of a parent, for example: .city_con > ul > li finds the ul elements that are direct children of city_con, then the li elements that are direct children of those ul elements
parent > *: find all direct children of a parent

/**
 * jsoup test
 *
 * @author wgy
 */
public class JsoupFirstTest {

    /**
     * Combining selectors
     *
     * @throws Exception
     */
    @Test
    public void testSelector2() throws Exception {
        // Parse the HTML file and get the Document object
        Document doc = Jsoup.parse(new File("C:\\Users\\wgy\\Desktop\\test.html"), "UTF-8");

        // el#id: element + ID, for example: h3#city_bj
        Element element = doc.select("h3#city_bj").first();

        // el.class: element + class, for example: li.class_a
        element = doc.select("li.class_a").first();

        // el[attr]: element + attribute name, for example: span[abc]
        element = doc.select("span[abc]").first();

        // Any combination, for example: span[abc].s_name
        element = doc.select("span[abc].s_name").first();

        // ancestor child: find descendants, for example: .city_con li finds every li under "city_con"
        Elements elements = doc.select(".city_con li");

        // parent > child: find direct children, for example:
        // .city_con > ul > li finds the ul directly under city_con, then the li directly under that ul
        elements = doc.select(".city_con > ul > li");

        // parent > *: find all direct children of a parent
        elements = doc.select(".city_con > ul > *");


        // Each assignment above overwrites the previous one, so only the last results are printed
        System.out.println("Element content: " + element.text());

        for (Element element1 : elements) {
            System.out.println("Iterated result: " + element1.text());
        }
    }
}

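4. Crawler Case Study

Having learned HttpClient and Jsoup, we know how to fetch data and how to parse it. Next comes a small exercise: crawling mobile-phone listings from JD (京东).

The main purpose is to practice HttpClient and Jsoup.

4.1 Requirements Analysis

First visit JD, search for 手机 (mobile phone), and analyze the results page. We will collect the following data for each product: image, price, title, and detail-page URL.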

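4.1.1 SPU and SKU

Beyond those four attributes, a single listing can cover several variants. The Apple phone in the search results, for example, comes in four variants, and we should collect each of them. That requires understanding the concepts of SPU and SKU.

SPU = Standard Product Unit

An SPU is the smallest unit for aggregating product information: a reusable, easily searchable set of standardized information that describes a product's characteristics. Put simply, products with the same attribute values and characteristics belong to one SPU.

For example, the Apple phone in the search results is one SPU, covering the red, dark gray, gold, and silver versions.

SKU = Stock Keeping Unit

An SKU is the unit in which stock movements are counted, such as a piece, a box, or a pallet. It is the smallest physically indivisible unit of inventory. How it is applied depends on the business type and management model, and it is used most widely for clothing and footwear.

For example, the Apple phone comes in several variants, and the red one is a single SKU.

The difference is also visible in the page source.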
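The one-to-many relationship between an SPU and its SKUs can be sketched as a simple grouping. The ids below are invented for illustration, and the class and method names are hypothetical; the crawler later in this article reads the real ids from the `data-spu` and `data-sku` attributes of the page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpuSkuSketch {

    /** Groups {spu, sku} pairs into a map from each SPU id to the SKU ids that belong to it. */
    static Map<Long, List<Long>> groupBySpu(long[][] spuSkuPairs) {
        Map<Long, List<Long>> bySpu = new LinkedHashMap<>();
        for (long[] pair : spuSkuPairs) {
            bySpu.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return bySpu;
    }

    public static void main(String[] args) {
        // Made-up ids: one phone model (SPU 100) in three colors, another (SPU 200) in one.
        long[][] pairs = {{100, 1001}, {100, 1002}, {100, 1003}, {200, 2001}};
        System.out.println(groupBySpu(pairs));
    }
}
```

In database terms this is why the table that follows stores both columns on every row: the `sku` identifies the concrete variant, and the `spu` ties the variants of one product together.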

4.2 Development Setup

4.2.1 Database Table

CREATE TABLE `jd_item` (
    `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'Primary key id',
    `spu` bigint(15) DEFAULT NULL COMMENT 'SPU id (product group)',
    `sku` bigint(15) DEFAULT NULL COMMENT 'SKU id (smallest product unit)',
    `title` varchar(100) DEFAULT NULL COMMENT 'Product title',
    `price` decimal(10,2) DEFAULT NULL COMMENT 'Product price',
    `pic` varchar(200) DEFAULT NULL COMMENT 'Product image',
    `url` varchar(200) DEFAULT NULL COMMENT 'Product detail page URL',
    `created` datetime DEFAULT NULL COMMENT 'Creation time',
    `updated` datetime DEFAULT NULL COMMENT 'Update time',
    PRIMARY KEY (`id`),
    KEY `sku` (`sku`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='JD product table';

4.2.2 Add Dependencies

We develop with Spring Boot, Spring Data JPA, and scheduled tasks. Create a Maven project and add the following dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>

    <groupId>com.wgy</groupId>
    <artifactId>crawler-jd</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>

        <!-- MySQL connector -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>

        <!-- HttpClient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>

        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

        <!-- Utilities -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
    </dependencies>
</project>

4.2.3 Add the Configuration File

Add an application.properties configuration file:

#DB Configuration:
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler
spring.datasource.username=root
spring.datasource.password=root

#JPA Configuration:
spring.jpa.database=MySQL
spring.jpa.show-sql=true

server.port=80

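4.3 Implementation

4.3.1 Write the POJO

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;
import java.util.Date;

/**
 * JD product entity
 *
 * @author wgy
 */
@Entity
@Table(name = "jd_item")
public class Item {
    // Primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // Standard product unit (product group)
    private Long spu;
    // Stock keeping unit (smallest product unit)
    private Long sku;
    // Product title
    private String title;
    // Product price
    private Double price;
    // Product image
    private String pic;
    // Product detail page URL
    private String url;
    // Creation time
    private Date created;
    // Update time
    private Date updated;
 
    //get/set/toString...
}

4.3.2 Write the DAO

import org.springframework.data.jpa.repository.JpaRepository;

/**
 * DAO interface
 *
 * @author wgy
 */
public interface ItemDao extends JpaRepository<Item, Long> {
}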

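4.3.3 Write the Service

The ItemService interface

import java.util.List;

/**
 * Service interface
 *
 * @author wgy
 */
public interface ItemService {

    /**
     * Save a product
     *
     * @param item
     */
    public void save(Item item);

    /**
     * Find products matching the given example
     *
     * @param item
     * @return
     */
    public List<Item> findAll(Item item);
}

The ItemServiceImpl implementation class

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

/**
 * Service implementation
 *
 * @author wgy
 */
@Service
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    @Override
    @Transactional
    public void save(Item item) {
        this.itemDao.save(item);
    }

    @Override
    public List<Item> findAll(Item item) {
        // Build the query condition from the non-null fields of the example object
        Example<Item> example = Example.of(item);

        // Run the query with that condition
        List<Item> list = this.itemDao.findAll(example);

        return list;
    }
}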

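4.3.4 Write the Bootstrap Class

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

/**
 * Bootstrap class
 *
 * @author wgy
 */
@SpringBootApplication
// Scheduled tasks must be enabled with this annotation before they can run
@EnableScheduling
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}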

4.3.5 Wrapping HttpClient

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

/**
 * HttpClient utility class
 *
 * @author wgy
 */
@Component
public class HttpUtils {

    // A browser User-Agent helps avoid 403 Forbidden responses and security blocks
    private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36";

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();

        // Maximum number of connections in the pool
        this.cm.setMaxTotal(100);

        // Maximum number of connections per route (per target host)
        this.cm.setDefaultMaxPerRoute(10);
    }

    /**
     * Download the page at the given address
     *
     * @param url
     * @return the page content, or an empty string on failure
     */
    public String doGetHtml(String url) {
        // Build an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();

        HttpGet httpGet = this.createGet(url);

        // try-with-resources closes the response; the client stays open because the pool is shared
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Only read the body when the status is 200 and the response has an entity
            if (response.getStatusLine().getStatusCode() == 200 && response.getEntity() != null) {
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Return an empty string on failure
        return "";
    }

    /**
     * Download an image
     *
     * @param url
     * @return the name of the saved image file, or an empty string on failure
     */
    public String doGetImage(String url) {
        // Build an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();

        HttpGet httpGet = this.createGet(url);

        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            if (response.getStatusLine().getStatusCode() == 200 && response.getEntity() != null) {
                // Take the file extension from the URL
                String extName = url.substring(url.lastIndexOf("."));

                // Generate a new, unique file name for the image
                String picName = UUID.randomUUID().toString() + extName;

                // Write the image to disk; try-with-resources closes the stream
                try (OutputStream outputStream = new FileOutputStream(new File("C:\\Users\\wgy\\Desktop\\images\\" + picName))) {
                    response.getEntity().writeTo(outputStream);
                }

                // Return the image file name
                return picName;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Return an empty string if the download failed
        return "";
    }

    /**
     * Create a GET request with the User-Agent header and the timeouts applied
     */
    private HttpGet createGet(String url) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", USER_AGENT);
        httpGet.setConfig(this.getConfig());
        return httpGet;
    }

    /**
     * Build the request configuration
     *
     * @return
     */
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            // maximum time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500)   // maximum time to wait for a connection from the pool, in milliseconds
                .setSocketTimeout(10000)            // maximum time to wait for data while reading, in milliseconds
                .build();

        return config;
    }
}
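One detail in `doGetImage` is worth a closer look: taking everything after the last dot of the URL as the file extension works for plain image URLs but would also pick up a query string. The following JDK-only sketch shows one more defensive way to derive the extension. The class, the method names, and the example URLs are hypothetical and not part of the original utility.

```java
import java.util.UUID;

class ImageNameSketch {

    /**
     * Returns the extension (including the dot) of the last path segment,
     * ignoring any query string or fragment, or "" if the segment has no dot.
     */
    static String extension(String url) {
        String path = url;
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut); // drop the query string
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut); // drop the fragment
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // Only a dot after the last slash belongs to the file name
        return dot > slash ? path.substring(dot) : "";
    }

    /** Builds a random, collision-resistant file name that keeps the original extension. */
    static String randomName(String url) {
        return UUID.randomUUID().toString() + extension(url);
    }

    public static void main(String[] args) {
        System.out.println(randomName("https://img.example.com/n1/abc.jpg?v=2"));
    }
}
```

The sketch assumes the URL has a path component after the host; a bare host such as `https://example.com` would need an extra check.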

4.3.6 Implement the Crawl

With a scheduled task we can fetch the latest data at regular intervals.
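The task in this section is scheduled with a fixed delay, meaning the wait is measured from the end of one run to the start of the next. Outside Spring, the same semantics are available from the JDK's `ScheduledExecutorService`. The sketch below is only an analogy with made-up class and method names, not how Spring implements `@Scheduled`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class FixedDelaySketch {

    /**
     * Runs a trivial task the given number of times, waiting delayMillis between
     * the end of one run and the start of the next. Returns true if all runs
     * completed within five seconds.
     */
    static boolean runWithFixedDelay(int runs, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(runs);
        try {
            // Each execution counts the latch down once; a real task would crawl here.
            scheduler.scheduleWithFixedDelay(done::countDown, 0, delayMillis, TimeUnit.MILLISECONDS);
            return done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow(); // stop further executions
        }
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runWithFixedDelay(3, 100));
    }
}
```

A fixed delay suits crawling because a slow run never overlaps with the next one; a fixed rate would instead start runs on a clock regardless of how long the previous run took.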

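import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

/**
 * Scheduled task: download JD mobile-phone product data
 *
 * @author wgy
 */
@Component
public class ItemTask {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private ItemService itemService;

    /**
     * Download task.
     * After a run finishes, wait 100 seconds before starting the next run.
     *
     * @throws Exception
     */
    @Scheduled(fixedDelay = 100 * 1000)
    public void itemTask() throws Exception {
        // The initial address to parse; the page number is appended to it
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&wq=%E6%89%8B%E6%9C%BA&s=51&click=0&page=";

        // Walk through the search result pages: page=1, 3, 5, 7, 9
        for (int i = 1; i < 10; i = i + 2) {
            String html = httpUtils.doGetHtml(url + i);

            // Parse the page, extract the product data, and store it
            this.parse(html);
        }

        System.out.println("Phone data crawl finished!");
    }

    /**
     * Parse the page, extract the product data, and store it
     *
     * @param html
     * @throws Exception
     */
    private void parse(String html) throws Exception {
        // Parse the HTML into a Document
        Document doc = Jsoup.parse(html);

        // Get the SPU entries
        Elements spuEles = doc.select("div#J_goodsList > ul > li");

        for (Element spuEle : spuEles) {
            // Get the spu (0 when the attribute is missing)
            long spu = Long.parseLong(StringUtils.isEmpty(spuEle.attr("data-spu")) ? "0" : spuEle.attr("data-spu"));

            // Get the SKU entries
            Elements skuEles = spuEle.select("li.ps-item");

            for (Element skuEle : skuEles) {
                // Get the sku; skip entries without one instead of failing on an empty string
                String skuText = skuEle.select("[data-sku]").attr("data-sku");
                if (StringUtils.isEmpty(skuText)) {
                    continue;
                }
                long sku = Long.parseLong(skuText);

                // Look up existing product data by sku
                Item item = new Item();
                item.setSku(sku);
                List<Item> list = this.itemService.findAll(item);

                if (list.size() > 0) {
                    // The product already exists, so skip it and move on to the next one
                    continue;
                }

                // Set the product's spu
                item.setSpu(spu);

                // Build the product detail page URL
                String itemUrl = "https://item.jd.com/" + sku + ".html";
                item.setUrl(itemUrl);

                // Download the product image (switching from the /n7/ to the /n1/ image path)
                String picUrl = "https:" + skuEle.select("img[data-sku]").first().attr("data-lazy-img");
                picUrl = picUrl.replace("/n7/", "/n1/");
                String picName = this.httpUtils.doGetImage(picUrl);
                item.setPic(picName);

                // Get the product price from the price endpoint, which returns JSON
                String priceJson = this.httpUtils.doGetHtml("https://p.3.cn/prices/mgets?skuIds=J_" + sku);
                double price = MAPPER.readTree(priceJson).get(0).get("p").asDouble();
                item.setPrice(price);

                // Get the product title from the detail page
                String itemInfo = this.httpUtils.doGetHtml(item.getUrl());
                String title = Jsoup.parse(itemInfo).select("div.sku-name").text();
                item.setTitle(title);

                item.setCreated(new Date());
                item.setUpdated(item.getCreated());

                // Save the product data to the database
                this.itemService.save(item);
            }
        }
    }
}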
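The paging loop in `itemTask` (`i = 1; i < 10; i = i + 2`) requests five result pages. As a small check of which URLs that produces, here is a standalone sketch; the base URL is a hypothetical placeholder that merely ends in `page=` like the real one, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class SearchPagesSketch {

    // Hypothetical base URL; like the article's search URL it ends with "page=".
    static final String BASE = "https://search.example.com/Search?keyword=phone&page=";

    /** Builds base + page for page = first, first + step, ... while page < endExclusive. */
    static List<String> pageUrls(String base, int first, int endExclusive, int step) {
        List<String> urls = new ArrayList<>();
        for (int page = first; page < endExclusive; page += step) {
            urls.add(base + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls(BASE, 1, 10, 2)) {
            System.out.println(url); // page=1, 3, 5, 7, 9
        }
    }
}
```

Building the list up front, rather than inside the download loop, also makes it easy to change the range or to hand the URLs to a queue later.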

Author: Wgy
Title: Web Crawlers (Part 1)
Link: https://wgy1993.gitee.io/archives/ceeb4255.html
License: This work is licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.