
WebCollector crawler: Java code template (with source code link)

Published: 2020-07-07 09:38:10  Source: Internet  Views: 684  Author: bx123  Category: Programming languages
package work;

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.contentextractor.News;
import cn.edu.hfut.dmic.webcollector.conf.Configuration;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.plugin.net.OkHttpRequester;
import db.JDBCHelper;
import okhttp3.Request;
import util.HtmlTools;

/**
 * Crawls news articles from news.cnhubei.com.
 * Uses the WebCollector 2.72 library.
 *
 * @author hu
 */
public class ChujiingNewstpl extends BreadthCrawler {

    // Seed URL
    public  String seedUrl="http://news.cnhubei.com/";
    // URL pattern of the content pages to collect
    public  String contentRegUrl="http://news\\.cnhubei\\.com/.*/p/.*?\\.html.*";

    // Number of threads
    public int threads_num=10;

    // Maximum number of pages fetched per iteration
    public int topn_num=10;

    // Crawl depth
    public static int levelnum=10;

    // Whether a stopped crawl can resume from where it left off
    public static boolean resumable=true;
    public int executeTime=20000;  //ms
    public static int MaxExecuteCount=2;
    public  int connectTimeout=50;
    public  int readTimeout=60;

    private String contentTable="news_content";

    @Override
    public void visit(Page page, CrawlDatums next) {
//        String url = page.url();

        if (page.matchUrl(contentRegUrl)) {

            //
            /*extract title and content of news by css selector*/
           // String title = page.select("div[id=Article]>h3").first().text();
           // String content = page.selectText("div#artibody");

            News n = null;
            try {
                n=ContentExtractor.getNewsByHtml(page.html());

                String title=n.getTitle();
                String content=n.getContent();

                content = Jsoup.clean(content, HtmlTools.getWhitelist());
                content=HtmlTools.stripNewLine(content);

                title=Jsoup.clean(title,Whitelist.none());
                title=title.trim();

                System.out.println(" get content :"+title );

                if(!title.isEmpty() && !content.isEmpty()) {
                    ChujiingNewstpl.dbHandler.update("insert into "+contentTable+"(title,content) values(?,?)",title,content);
                }
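            } catch (DuplicateKeyException e) {
                // Thrown by Spring when the insert violates a primary or unique key
                System.out.println(" duplicate item ");
            } catch (Exception e) {
                // Report any other extraction or database error and keep crawling
                System.out.println(e.getMessage());
            }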

        }
    }

    private static JdbcTemplate dbHandler;

    // Custom request plugin
    public  class MyRequester extends OkHttpRequester {

        String userAgent = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)";
//        String cookie = "name=abcdef";

        // Called before every request is sent, to build the request
        @Override
        public Request.Builder createRequestBuilder(CrawlDatum crawlDatum) {
            // This uses OkHttp's Request.Builder;
            // see the OkHttp documentation for how to modify the request headers
//            System.out.println("request with cookie: " + cookie);
            return super.createRequestBuilder(crawlDatum).header("User-Agent", userAgent);
                   // .header("Cookie", cookie);
        }
    }

    public ChujiingNewstpl(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);

        // Set the request plugin

        //setRequester(new MyRequester());
        /*start page*/

        this.addSeed(seedUrl);

        this.addRegex(contentRegUrl);

        this.addRegex("-.*\\.(jpg|png|gif|css|js|font).*");
        setThreads(threads_num);

        Configuration cnf=getConf();

        cnf.setTopN(topn_num);
//        cnf.setExecuteInterval(executeTime);
//        cnf.setConnectTimeout(connectTimeout);
//        cnf.setReadTimeout(readTimeout);

    }

    public static void main(String[] args) throws Exception {

        dbHandler=JDBCHelper.db();
        ChujiingNewstpl crawler = new ChujiingNewstpl("spiderdata"+java.io.File.separator+ChujiingNewstpl.class.getName(), true);
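        crawler.setResumable(resumable);

        // Maximum number of attempts for a failed fetch; set it before start(),
        // because start() runs the whole crawl before returning
        crawler.setMaxExecuteCount(MaxExecuteCount);

        crawler.start(levelnum);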

    }

}
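
The constructor above registers two URL rules with addRegex: the content page pattern and a rule prefixed with "-" that excludes static resources. The following is a minimal sketch of how such rules can be combined, written with java.util.regex only. It is not WebCollector's own implementation. The class UrlRuleSketch, its methods, and the sample URLs are made up for illustration, and the content pattern is written with its literal dots escaped. The sketch assumes that an exclusion rule takes precedence over an inclusion rule and that each pattern must match the whole URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of WebCollector or of the template.
public class UrlRuleSketch {

    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    // A leading "-" marks an exclusion rule; a leading "+" or no prefix marks an inclusion rule.
    public UrlRuleSketch addRegex(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else if (rule.startsWith("+")) {
            positive.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
        return this;
    }

    // A URL passes if no exclusion rule matches it and at least one inclusion rule does.
    public boolean accepts(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    // The two rules from the template's constructor, with the dots in the content pattern escaped.
    public static UrlRuleSketch templateRules() {
        return new UrlRuleSketch()
                .addRegex("http://news\\.cnhubei\\.com/.*/p/.*?\\.html.*")
                .addRegex("-.*\\.(jpg|png|gif|css|js|font).*");
    }

    public static void main(String[] args) {
        UrlRuleSketch rules = templateRules();
        // Made-up URLs: a content page, an image that also fits the content pattern, another site.
        String[] urls = {
            "http://news.cnhubei.com/content/2020-07/07/p/12345.html",
            "http://news.cnhubei.com/a/p/photo.html.jpg",
            "http://example.com/a/p/1.html"
        };
        for (String url : urls) {
            System.out.println(rules.accepts(url) + "  " + url);
        }
    }
}
```

Under these assumptions only the first sample URL is accepted. The second one fits the content pattern but is rejected by the exclusion rule, and the third one matches no inclusion rule.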

Source code: https://down.51cto.com/data/2461609
