This article shows how to crawl web page content with Java and write the results to an Excel file.
A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Crawlers are widely used by search engines and similar sites.
Functionally, a crawler has three parts: collection, processing, and storage.
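The three stages can be sketched in plain Java as follows. This is only a minimal illustration: the class `CrawlPipelineSketch`, its method names, and the sample `date,temperature` input are assumptions made for the example, and the collect step returns a fixed string instead of performing an HTTP request.

```java
import java.util.ArrayList;
import java.util.List;

final class CrawlPipelineSketch {

    // Stage 1, collect: a real crawler would download the page here.
    // A fixed string stands in for the response body (assumption).
    static String collect() {
        return "2020-01-01,9\n2020-01-02,11\n";
    }

    // Stage 2, process: split the raw text into one record per line.
    static List<String[]> process(String raw) {
        List<String[]> records = new ArrayList<>();
        for (String line : raw.split("\n")) {
            if (!line.isEmpty()) {
                records.add(line.split(","));
            }
        }
        return records;
    }

    // Stage 3, store: a real crawler would write to a file or database;
    // here the records are rendered as tab-separated text.
    static String store(List<String[]> records) {
        StringBuilder out = new StringBuilder();
        for (String[] record : records) {
            out.append(String.join("\t", record)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(store(process(collect())));
    }
}
```

Keeping the stages as separate methods means the storage step can be swapped (for example, for an Excel writer) without touching collection or parsing.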
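Crawlers broadly fall into three categories: distributed crawlers, Java crawlers, and non-Java crawlers.
Java crawlers can in turn be divided among three frameworks: Crawler4j, WebMagic, and WebCollector.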
Add the dependencies
<!-- JSON -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.12.0</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.47</version>
</dependency>
<!-- Excel -->
<dependency>
    <groupId>net.sourceforge.jexcelapi</groupId>
    <artifactId>jxl</artifactId>
    <version>2.6.12</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<!-- crawler -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
Create a Weather entity class
public class Weather {
    /** Date */
    private String date;
    /** Maximum temperature */
    private String maxTemperature;
    /** Minimum temperature */
    private String minTemperature;
    /** Daytime weather */
    private String dayTimeWeather;
    /** Nighttime weather */
    private String nightWeather;
    /** Wind direction */
    private String windDirection;
    /** Wind force */
    private String windPower;

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getMaxTemperature() {
        return maxTemperature;
    }

    public void setMaxTemperature(String maxTemperature) {
        this.maxTemperature = maxTemperature;
    }

    public String getMinTemperature() {
        return minTemperature;
    }

    public void setMinTemperature(String minTemperature) {
        this.minTemperature = minTemperature;
    }

    public String getDayTimeWeather() {
        return dayTimeWeather;
    }

    public void setDayTimeWeather(String dayTimeWeather) {
        this.dayTimeWeather = dayTimeWeather;
    }

    public String getNightWeather() {
        return nightWeather;
    }

    public void setNightWeather(String nightWeather) {
        this.nightWeather = nightWeather;
    }

    public String getWindDirection() {
        return windDirection;
    }

    public void setWindDirection(String windDirection) {
        this.windDirection = windDirection;
    }

    public String getWindPower() {
        return windPower;
    }

    public void setWindPower(String windPower) {
        this.windPower = windPower;
    }

    @Override
    public String toString() {
        return "Weather{" +
                "date='" + date + '\'' +
                ", maxTemperature='" + maxTemperature + '\'' +
                ", minTemperature='" + minTemperature + '\'' +
                ", dayTimeWeather='" + dayTimeWeather + '\'' +
                ", nightWeather='" + nightWeather + '\'' +
                ", windDirection='" + windDirection + '\'' +
                ", windPower='" + windPower + '\'' +
                '}';
    }
}
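When an entity like this is later written out as a spreadsheet row, the column order depends on how its fields are enumerated. The sketch below shows one way to keep that order fixed, using a `LinkedHashMap`, which iterates in insertion order (a plain `HashMap` gives no such guarantee). The class `WeatherColumns`, its `toRow` method, and the sample values are hypothetical and only illustrate the idea; only three of the seven fields are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WeatherColumns {

    // Builds a column-name -> value map whose iteration order is the
    // order of the put() calls, so every row lists its cells identically.
    static Map<String, String> toRow(String date, String maxTemperature, String minTemperature) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("date", date);
        row.put("maxTemperature", maxTemperature);
        row.put("minTemperature", minTemperature);
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row = toRow("2020-01-01", "9", "4"); // sample values (made up)
        List<String> header = new ArrayList<>(row.keySet());
        System.out.println(header);       // column names, in insertion order
        System.out.println(row.values()); // cell values, in the same order
    }
}
```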
Create a WeatherTest test class
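import java.io.FileOutputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.alibaba.fastjson.JSON;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFCellStyle;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.HorizontalAlignment;
import org.apache.poi.ss.usermodel.VerticalAlignment;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WeatherTest {
    public static void main(String[] args) throws IOException {
        // Crawl all 12 months of 2020 for Shanghai, print each record, then export to Excel.
        List<Weather> list = getInfo("http://www.tianqi234.com/2020shanghai/1yue.html", 12);
        for (Weather weather : list) {
            System.out.println(weather);
        }
        testHSSFWorkbook(list);
    }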
    // Crawls `month` monthly pages, starting from the given URL.
    // Pages after the first are always built from the hard-coded 2020 Shanghai path below.
    public static List<Weather> getInfo(String url, int month) {
        List<Weather> weatherList = new ArrayList<Weather>();
        for (int i = 1; i < month + 1; i++) {
            try {
                System.out.println("url:" + url);
                Document doc = Jsoup.connect(url).get();
                Elements table = doc.select(".graybox_cnt");
                Elements trList = table.select("tr");
                // Each removal shortens the list and shifts later rows left,
                // so the indices below are tuned to this page's layout.
                trList.remove(0);
                if (i > 1) {
                    // Later months drop one more leading row (apparently the header,
                    // which is kept only once and reused as the Excel header row).
                    trList.remove(0);
                    trList.remove(10);
                    trList.remove(10);
                    trList.remove(20);
                    trList.remove(20);
                    trList.remove(20);
                } else {
                    trList.remove(11);
                    trList.remove(11);
                    trList.remove(21);
                    trList.remove(21);
                    trList.remove(21);
                }
                for (Element tr : trList) {
                    Elements tdList = tr.select("td");
                    Elements aList = tdList.select("a"); // look for <a> tags
                    Weather weather = new Weather();
                    // The date is taken from the link text when the cell contains a link.
                    if (aList != null && aList.size() > 0) {
                        weather.setDate(aList.get(0).html());
                    } else {
                        weather.setDate(tdList.get(0).html());
                    }
                    weather.setMaxTemperature(tdList.get(1).html());
                    weather.setMinTemperature(tdList.get(2).html());
                    weather.setDayTimeWeather(tdList.get(3).html());
                    weather.setNightWeather(tdList.get(4).html());
                    weather.setWindDirection(tdList.get(5).html());
                    weather.setWindPower(tdList.get(6).html());
                    weatherList.add(weather);
                }
            } catch (IOException e) {
                // A failed page is logged and skipped; crawling continues with the next month.
                e.printStackTrace();
            }
            url = "http://www.tianqi234.com/2020shanghai/" + (i + 1) + "yue.html";
        }
        return weatherList;
    }
    public static void testHSSFWorkbook(List<Weather> list) throws IOException {
        HSSFWorkbook workbook = new HSSFWorkbook(); // create the workbook (HSSF writes the .xls format)
        HSSFSheet sheet = workbook.createSheet("Shanghai Weather 2020");
        HSSFRow row = sheet.createRow(0); // rows are numbered from 0
        HSSFCellStyle style = workbook.createCellStyle(); // cell style shared by all cells
        style.setAlignment(HorizontalAlignment.CENTER); // center horizontally
        style.setVerticalAlignment(VerticalAlignment.CENTER); // center vertically
        sheet.setDefaultColumnWidth(30);
        row.setHeightInPoints(25);
        // Header row: the first list entry holds the page's own header cells.
        // The map returned by getMap is hash-based, so columns do not follow field declaration order.
        Map<String, String> map = (Map<String, String>) getMap(list.get(0));
        int c = 0;
        for (String key : map.keySet()) {
            HSSFCell cell = row.createCell(c); // cells are numbered from 0
            cell.setCellValue(map.get(key));
            cell.setCellStyle(style);
            c++;
        }
        // Data rows: every entry after the header entry.
        for (int i = 1; i < list.size(); i++) {
            HSSFRow rowInfo = sheet.createRow(i);
            rowInfo.setHeightInPoints(30);
            Map<String, String> map1 = (Map<String, String>) getMap(list.get(i));
            int j = 0;
            for (String key : map1.keySet()) {
                HSSFCell cellInfo = rowInfo.createCell(j);
                cellInfo.setCellValue(map1.get(key));
                cellInfo.setCellStyle(style);
                j++;
            }
        }
        // HSSFWorkbook produces the binary .xls format, so the file gets an .xls extension.
        try (FileOutputStream out = new FileOutputStream("D:\\weather1.xls")) {
            workbook.write(out);
        }
        workbook.close();
    }
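    /**
     * Converts an object to a Map by serializing it to JSON and parsing it back.
     *
     * @param object the object to convert; must not be null
     * @return a map of property names to values, or an empty map if the conversion fails
     */
    public static Map<?, ?> getMap(Object object) {
        if (object == null) {
            throw new RuntimeException("Object is null; cannot convert to JSON");
        }
        Map<String, Object> map = new HashMap<>();
        try {
            map = (Map) JSON.parse(JSON.toJSONString(object));
        } catch (Exception e) {
            System.out.println("Failed to convert object to map");
        }
        return map;
    }
}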
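The chained `remove(index)` calls in `getInfo` work only because each call shifts every later element one position to the left, so the same index is reused to drop consecutive rows. The sketch below illustrates that behavior on a plain `ArrayList`, together with an alternative that filters by content using `removeIf`. The row labels and the `AD` marker are invented for the example and do not come from the crawled page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RemoveByIndexSketch {

    // Removes `count` consecutive elements starting at `index`.
    // After each remove(index) the next element slides into `index`,
    // so the index itself never changes.
    static List<String> removeRun(List<String> rows, int index, int count) {
        List<String> copy = new ArrayList<>(rows);
        for (int k = 0; k < count; k++) {
            copy.remove(index);
        }
        return copy;
    }

    // Alternative: drop rows by what they contain rather than where they are.
    static List<String> removeMarked(List<String> rows, String marker) {
        List<String> copy = new ArrayList<>(rows);
        copy.removeIf(row -> row.equals(marker));
        return copy;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("day1", "day2", "AD", "AD", "day3");
        System.out.println(removeRun(rows, 2, 2));    // [day1, day2, day3]
        System.out.println(removeMarked(rows, "AD")); // [day1, day2, day3]
    }
}
```

Filtering by content is less sensitive to layout changes on the page, since fixed indices silently remove the wrong rows as soon as the table gains or loses a line.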