How to Scrape World University Ranking Data with Python

Published: 2022-05-19 15:17:20 Source: Yisu Cloud Views: 191 Author: iii Category: Big Data

This article shows how to scrape world university ranking data with Python. Many readers may not be familiar with the technique yet, so the steps are laid out in order below.

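Data acquisition

The data source used here is Shanghai Jiao Tong University's ARWU (Academic Ranking of World Universities) website.

The site lists university scores and rankings for each year.

Inspecting the pages shows that the ranking is laid out as a table, so the read_html function in pandas is the most convenient way to fetch it: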

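import pandas as pd
url = 'http://www.shanghairanking.com/ARWU2019.html'  # example year; same URL pattern as the loop further down
table = pd.read_html(url)  # returns a list of DataFrames, one per table found on the page
college = table[0]         # the ranking is the first table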

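We also found that each university's country is shown only as a flag image rather than text, so it needs special handling: the country name is taken from the image's file name.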

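import re
from bs4 import BeautifulSoup

def get_country_name(html):
    # Collect the flag images inside the table cells and read each file name.
    soup = BeautifulSoup(html, 'lxml')
    countries = soup.select('td > a > img')
    pattern = re.compile(r'flag.*/(.*?)\.png')  # capture the file name between the last '/' and '.png'
    lst = []
    for i in countries:
        src = i['src']
        country = re.findall(pattern, src)[0]
        lst.append(country)
    return lst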
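
The extraction step in get_country_name can be checked without any network access. The following is a minimal sketch using only the standard library; the src values are hypothetical (the real site's image paths may differ), and country_from_src is an illustrative helper name, not part of the original article. Unlike the `[0]` indexing above, it returns None instead of raising when a path does not match.

```python
import re

# Same idea as the pattern in get_country_name, with the dot before "png" escaped.
FLAG_PATTERN = re.compile(r'flag.*/(.*?)\.png')


def country_from_src(src):
    """Return the file stem of a flag image path, or None if it does not match."""
    match = FLAG_PATTERN.search(src)
    return match.group(1) if match else None


# Hypothetical src values, made up for illustration.
samples = [
    'image/flag/UnitedStates.png',
    'image/flag/small/Japan.png',
    'image/logo.gif',
]
names = [country_from_src(s) for s in samples]
print(names)  # ['UnitedStates', 'Japan', None]
```

Returning None for non-matching paths lets the caller decide whether to skip the row or fill in a placeholder, which matters if a page contains images other than flags.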

Finally, we tidy up the data for each year: drop the columns we do not need, then add a year column and a few other fields.

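import pandas as pd
import requests

for i in range(2005, 2020):
    print('year', i)
    url = 'http://www.shanghairanking.com/ARWU%s.html' % i
    html = requests.get(url).content  # raw page, needed for the flag images
    table = pd.read_html(url)
    college = table[0]
    # Name the six columns; the integer-labelled ones are dropped on the next line.
    college.columns = ['world rank', 'university', 2, 3, 'score', 5]
    college.drop([2, 3, 5], axis=1, inplace=True)
    college['year'] = i
    # 1-based row position within this year's table.
    college['index_rank'] = college.index
    college['index_rank'] = college['index_rank'].astype(int) + 1
    college['country'] = get_country_name(html)  # helper defined above
    # Append every year to one file; write the header row only for the first year.
    college.to_csv(r'College.csv', mode='a', encoding='utf_8_sig', header=(i == 2005), index=False)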

This gives us the College.csv file, with every year's table appended to it.
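
The clean-up and append steps can also be tried offline before pointing the loop at the live site. The sketch below assumes a scraped table with six columns, as in the loop above, and uses made-up rows rather than real ARWU data; tidy_ranking and append_csv are hypothetical helper names introduced for illustration. It writes the header only when the target file is new or empty, which avoids repeated header rows when appending.

```python
import os
import tempfile

import pandas as pd


def tidy_ranking(raw, year, countries):
    """Apply the same clean-up as the scraping loop to one year's table."""
    college = raw.copy()
    college.columns = ['world rank', 'university', 2, 3, 'score', 5]
    college = college.drop(columns=[2, 3, 5])
    college['year'] = year
    college['index_rank'] = college.index + 1  # 1-based row position
    college['country'] = countries
    return college


def append_csv(frame, path):
    """Append to path, writing the header row only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    frame.to_csv(path, mode='a', encoding='utf_8_sig', header=is_new, index=False)


# Made-up rows standing in for one scraped table (not real ARWU data).
raw = pd.DataFrame([
    ['1', 'University A', 'x', 'y', 100.0, 'z'],
    ['2', 'University B', 'x', 'y', 75.5, 'z'],
])

path = os.path.join(tempfile.mkdtemp(), 'College.csv')
for year in (2018, 2019):
    append_csv(tidy_ranking(raw, year, ['CountryA', 'CountryB']), path)

result = pd.read_csv(path, encoding='utf_8_sig')
print(result)
```

Checking the file size instead of the loop variable keeps the header logic correct even when the script is rerun against an existing College.csv.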
