TensorFlow Tutorial: A Detailed Look at Text Classification

Published: 2020-09-15 09:15:24  Source: 腳本之家  Views: 159  Author: michaelgbw  Category: Development

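Preface

Caffe2 was released a few days ago with support for mobile devices. My understanding is that this means IoT hardware similar to microcontrollers rather than phones; imagine an iPhone 7 running a CNN, what a sight~

As someone who has only just started (or arguably has not even started), let's study TensorFlow properly, even though it is not as easy to pick up as Caffe. I won't go over TensorFlow's features in detail; in brief:

Based on Python, so code is quick to write and readable.
Supports CPU and GPU, and runs more smoothly on multi-GPU systems.
Code compilation is relatively efficient.
The community is growing quickly and is very active.
Can generate visualizations of the network topology and performance.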

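TensorFlow (tf) computation flow:

A TensorFlow program runs in two main steps: building the model and training it.

In the model-building phase we construct a graph (Graph) that describes the model. This is also where TensorFlow shines, because the graph can be visualized with TensorBoard: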

[Figure: a model graph rendered in TensorBoard]

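The graph looks something like this, a bit like a flowchart. I also recommend Google's TensorFlow Playground, which is a lot of fun.

Then comes the training phase. Nothing is computed while the model is being built; computation starts only when tf.Session.run() is called.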
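The split between building and running can be illustrated without TensorFlow at all. The sketch below is a toy in plain Python, not TensorFlow's API: the Node class and the placeholder, add, and mul helpers are hypothetical names invented for this illustration. Wiring the nodes together only records operations; the arithmetic happens when run() is called, which is roughly analogous to tf.Session.run() with a feed_dict.

```python
# Toy deferred-execution graph in plain Python (hypothetical; not TensorFlow's API).

class Node:
    """A graph node that remembers an operation and its inputs without computing."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op          # callable applied to the evaluated inputs
        self.inputs = inputs  # upstream Node objects
        self.name = name      # set only for placeholder nodes

    def run(self, feed_dict):
        # Placeholders take their value from feed_dict at run time
        if self.name is not None:
            return feed_dict[self.name]
        values = [node.run(feed_dict) for node in self.inputs]
        return self.op(*values)


def placeholder(name):
    return Node(op=None, name=name)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))


# Construction phase: nothing is computed here, the nodes are only wired together
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
y = add(mul(w, x), b)

# Execution phase: values flow through the graph only when run() is called
result = y.run({"x": 3.0, "w": 2.0, "b": 1.0})
print(result)  # 7.0
```

The same graph object y can be run again with different inputs, which is the reason placeholders exist in the real script as well.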

Text classification

Here is the full code first; we will then go through it piece by piece.

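# -*- coding: utf-8 -*-
# Note: this script uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# Under TensorFlow 2.x, replace the tensorflow import below with:
#   import tensorflow.compat.v1 as tf
#   tf.disable_v2_behavior()

import numpy as np
import tensorflow as tf
from collections import Counter
from sklearn.datasets import fetch_20newsgroups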

def get_word_2_index(vocab):
    # Map every distinct word to a unique integer index
    word2index = {}
    for i, word in enumerate(vocab):
        word2index[word] = i
    return word2index


def get_batch(df, i, batch_size):
    # Build the i-th batch: word-count vectors as inputs, one-hot vectors as labels
    batches = []
    results = []
    texts = df.data[i*batch_size : i*batch_size+batch_size]
    categories = df.target[i*batch_size : i*batch_size+batch_size]
    for text in texts:
        # One slot per vocabulary word; each slot counts occurrences in this text
        layer = np.zeros(total_words, dtype=float)
        for word in text.split(' '):
            layer[word2index[word.lower()]] += 1
        batches.append(layer)

    for category in categories:
        # One-hot label over the three classes
        y = np.zeros((3), dtype=float)
        if category == 0:
            y[0] = 1.
        elif category == 1:
            y[1] = 1.
        else:
            y[2] = 1.
        results.append(y)
    return np.array(batches), np.array(results)

def multilayer_perceptron(input_tensor, weights, biases):
    # Hidden layers use the ReLU activation
    layer_1_multiplication = tf.matmul(input_tensor, weights['h2'])
    layer_1_addition = tf.add(layer_1_multiplication, biases['b1'])
    layer_1 = tf.nn.relu(layer_1_addition)

    layer_2_multiplication = tf.matmul(layer_1, weights['h3'])
    layer_2_addition = tf.add(layer_2_multiplication, biases['b2'])
    layer_2 = tf.nn.relu(layer_2_addition)

    # Output layer: returns raw logits; softmax is applied inside the loss
    out_layer_multiplication = tf.matmul(layer_2, weights['out'])
    out_layer_addition = out_layer_multiplication + biases['out']
    return out_layer_addition

# main
# Fetch the data from sklearn.datasets
cate = ["comp.graphics","sci.space","rec.sport.baseball"]
newsgroups_train = fetch_20newsgroups(subset='train', categories=cate)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cate)

# Count every word in the training and test data to build the vocabulary
vocab = Counter()
for text in newsgroups_train.data:
    for word in text.split(' '):
        vocab[word.lower()] += 1

for text in newsgroups_test.data:
    for word in text.split(' '):
        vocab[word.lower()] += 1

total_words = len(vocab)
word2index = get_word_2_index(vocab)

n_hidden_1 = 100  # number of neurons in the first hidden layer
n_hidden_2 = 100  # number of neurons in the second hidden layer
n_input = total_words
n_classes = 3  # graphics, sci.space and baseball: three output units, one per class
# learning_rate was not defined in the original listing; 0.01 is an assumed value
learning_rate = 0.01
# Placeholders for the inputs and the labels
input_tensor = tf.placeholder(tf.float32, [None, n_input], name="input")
output_tensor = tf.placeholder(tf.float32, [None, n_classes], name="output")
# Weights and biases, initialized from a normal distribution
weights = {
    'h2': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h3': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

# Build the network; prediction holds the output logits
prediction = multilayer_perceptron(input_tensor, weights, biases)

# Define the loss and the optimizer; the loss applies softmax internally
# reduce_mean averages the loss over the batch
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=output_tensor))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

# Initialize all variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    training_epochs = 100
    display_step = 5
    batch_size = 1000
    # Training
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(newsgroups_train.data) / batch_size)
        for i in range(total_batch):
            batch_x, batch_y = get_batch(newsgroups_train, i, batch_size)
            c, _ = sess.run([loss, optimizer], feed_dict={input_tensor: batch_x, output_tensor: batch_y})
            # Accumulate the average loss over the epoch
            avg_cost += c / total_batch
        # Print the loss every 5 epochs
        if epoch % display_step == 0:
            print("Epoch:", '%d' % (epoch+1), "loss=", "{:.6f}".format(avg_cost))
    print("Finished!")

    # Test model
    correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(output_tensor, 1))
    # Compute the accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    total_test_data = len(newsgroups_test.target)
    batch_x_test, batch_y_test = get_batch(newsgroups_test, 0, total_test_data)
    print("Accuracy:", accuracy.eval({input_tensor: batch_x_test, output_tensor: batch_y_test}))

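Code walkthrough

We do not save the model here. Following the flow of the code, I will explain the functions used and the design choices. The code itself comes from an existing GitHub project; I am just learning from it~

Data loading: we fetch the data from sklearn.datasets. The dataset contains news texts in 20 categories, of which the script uses three, and we classify each text based on its individual words: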

# Count every word in the training and test data to build the vocabulary
vocab = Counter()
for text in newsgroups_train.data:
    for word in text.split(' '):
        vocab[word.lower()] += 1

for text in newsgroups_test.data:
    for word in text.split(' '):
        vocab[word.lower()] += 1

total_words = len(vocab)
word2index = get_word_2_index(vocab)

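The indices are then used to build vectors in the style of one-hot encoding. One-hot encoding, also called one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time. In get_batch below, the class labels are true one-hot vectors, while each text becomes a vector of per-word counts indexed through word2index.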

def get_batch(df, i, batch_size):
    # Build the i-th batch: word-count vectors as inputs, one-hot vectors as labels
    batches = []
    results = []
    texts = df.data[i*batch_size : i*batch_size+batch_size]
    categories = df.target[i*batch_size : i*batch_size+batch_size]
    for text in texts:
        # One slot per vocabulary word; each slot counts occurrences in this text
        layer = np.zeros(total_words, dtype=float)
        for word in text.split(' '):
            layer[word2index[word.lower()]] += 1
        batches.append(layer)

    for category in categories:
        # One-hot label over the three classes
        y = np.zeros((3), dtype=float)
        if category == 0:
            y[0] = 1.
        elif category == 1:
            y[1] = 1.
        else:
            y[2] = 1.
        results.append(y)
    return np.array(batches), np.array(results)

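This function slices out the range of data we ask for, that is, how many samples are used in one training step (a batch). When testing the model we feed the dictionary with a much larger batch, which is why the batch dimension has to be variable.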
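The two encodings produced by get_batch can be seen on a tiny example. The sketch below uses a hypothetical two-sentence corpus and made-up labels in place of the newsgroup data; text_to_counts and label_to_one_hot are illustrative helper names, not functions from the original script. It follows the same tokenization as the article (split on spaces, lowercase).

```python
import numpy as np

# Hypothetical toy corpus and labels, standing in for the newsgroup texts
texts = ["the rocket launch", "The pitcher threw the ball"]
labels = [1, 2]
n_classes = 3

# Build the vocabulary the same way as the article: split on spaces, lowercase
vocab = []
for text in texts:
    for word in text.split(' '):
        if word.lower() not in vocab:
            vocab.append(word.lower())
word2index = {word: i for i, word in enumerate(vocab)}


def text_to_counts(text):
    # One slot per vocabulary word; each slot holds that word's count in the text
    vector = np.zeros(len(word2index), dtype=float)
    for word in text.split(' '):
        vector[word2index[word.lower()]] += 1
    return vector


def label_to_one_hot(label):
    # Exactly one position is 1, all others are 0
    y = np.zeros(n_classes, dtype=float)
    y[label] = 1.0
    return y


x_batch = np.array([text_to_counts(t) for t in texts])
y_batch = np.array([label_to_one_hot(l) for l in labels])
print(x_batch.shape, y_batch.shape)  # (2, 6) (2, 3)
```

Indexing with y[label] = 1.0 is a compact equivalent of the if/elif chain in get_batch. The first dimension of both arrays is the batch size, which is the dimension left as None in the placeholders.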

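Building the neural network

A neural network is a computational model (a way of describing a system using machine language and mathematical concepts). Such systems learn by themselves and are trained rather than explicitly programmed. The figure below shows a traditional three-layer neural network:

[Figure: a traditional three-layer neural network]

In our network the hidden part is extended to two layers. Both layers do exactly the same thing; the output of hidden layer 1 is simply the input of hidden layer 2.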

weights = {
    'h2': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h3': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

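At the input layer you need to define how many nodes the first hidden layer will have. These nodes are also called features or neurons; in the figure above each circle represents one node.

Each node of the input layer corresponds to one word in the dataset (we will see how this works later).

Each node (neuron) is multiplied by a weight. Every node has a weight value, and during training the network adjusts these values to produce the correct output.

The input is multiplied by the weights and the bias is added, a bit like the linear regression y = Wx + b. The result is then passed through an activation function, which defines the final output of each node. There are many activation functions:

Rectified Linear Unit (ReLU): used for hidden-layer outputs
Sigmoid: used for hidden-layer outputs
Softmax: used for the output of multi-class classification networks
Linear: used for the output of regression networks (or binary classification problems)

Here the hidden layers use ReLU; in the past, most networks used the traditional sigmoid family of activations.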
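What a single hidden layer computes can be written out in a few lines of NumPy. This is a minimal sketch with made-up numbers; dense_relu is an illustrative name, and the shapes follow the article's convention of one sample per row, so the product is x @ W rather than W @ x.

```python
import numpy as np


def relu(z):
    # ReLU keeps positive values and replaces negative values with 0
    return np.maximum(z, 0.0)


def dense_relu(x, W, b):
    # x: (batch, n_in), W: (n_in, n_out), b: (n_out,)
    return relu(x @ W + b)


x = np.array([[1.0, 2.0]])          # one sample with two input features
W = np.array([[1.0, -1.0],
              [0.5, -2.0]])         # two inputs feeding two hidden units
b = np.array([0.0, 1.0])

h = dense_relu(x, W, b)
print(h)  # [[2. 0.]]
```

The pre-activation values are [2, -4]; ReLU passes the first unchanged and clips the second to zero, which is the sparse, one-sided behavior that distinguishes it from sigmoid.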

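[Figures omitted: plots of the sigmoid-family and ReLU activation functions]

As the figures show, the derivative of a sigmoid-type function starts near 0, rises briefly, and quickly approaches 0 again, which easily leads to the "vanishing gradient" problem; the derivative of ReLU does not suffer from this. Compared with sigmoid-type functions, the main differences are: 1) one-sided suppression, 2) a relatively wide excitation boundary, 3) sparse activation. This is close to how the human cortex works.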
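The difference in the derivatives can be checked numerically. The sketch below is an illustration in NumPy with arbitrarily chosen sample points; it evaluates the standard closed forms, sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) and the ReLU derivative (1 for positive inputs, 0 otherwise).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), which peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)


xs = np.array([-10.0, 0.0, 10.0])
sg = sigmoid_grad(xs)
rg = relu_grad(xs)
print(sg)  # tiny at both ends, 0.25 in the middle
print(rg)  # [0. 0. 1.]
```

Because the sigmoid derivative never exceeds 0.25 and is close to zero for large |x|, multiplying many such factors through a deep network shrinks the gradient, whereas ReLU passes a gradient of exactly 1 wherever the unit is active.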

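Why add a bias constant?

Take sigmoid as an example.

The weight w lets the sigmoid function adjust how steep it is. The figure below shows how the sigmoid curve changes as the weight changes:

[Figure: sigmoid curves for different values of the weight w]

As you can see, no matter how w changes, the function always passes through (0, 0.5). In practice, however, we may need the function to take some other value when x is close to 0.

By changing both the weight w and the bias b, a single neuron can produce many different outputs. And that is just one neuron; in a neural network, thousands upon thousands of neurons combine to produce complex output patterns.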
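The effect of the bias is easy to verify numerically. In this sketch the weight and bias values are arbitrary examples: with no bias, sigmoid(w * 0) is 0.5 for every w, and only the bias term moves the output at x = 0.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Without a bias, the output at x = 0 is 0.5 regardless of the weight
weights = [0.5, 1.0, 2.0]
at_zero_no_bias = [float(sigmoid(w * 0.0)) for w in weights]

# With a bias b, the output at x = 0 becomes sigmoid(b)
biases = [-2.0, 0.0, 2.0]
at_zero_with_bias = [float(sigmoid(1.0 * 0.0 + b)) for b in biases]

print(at_zero_no_bias)    # [0.5, 0.5, 0.5]
print(at_zero_with_bias)  # below 0.5, exactly 0.5, above 0.5
```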

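The values of the output layer are also multiplied by weights, and a bias is added as well, but the activation function is now different.

You want to label each text with a category, and the categories are mutually exclusive (a text cannot belong to two categories at once).

With that in mind, you use the Softmax function instead of the ReLU activation. It converts each output into a value between 0 and 1 and ensures that the values of all units sum to 1.

In this network the output layer has 3 neurons, corresponding to the three text categories.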
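The two properties of softmax mentioned above (each value lies between 0 and 1, and the values sum to 1) can be checked with a short NumPy sketch. The logits are made-up numbers for one sample with three classes; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np


def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


logits = np.array([[2.0, 1.0, -1.0]])  # raw outputs of the three output neurons
probs = softmax(logits)
print(probs, probs.sum())
```

The largest logit receives the largest probability, so taking the argmax of the logits or of the softmax output selects the same class.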

# Initialize all variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    training_epochs = 100
    display_step = 5
    batch_size = 1000
    # Training
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(newsgroups_train.data) / batch_size)
        for i in range(total_batch):
            batch_x, batch_y = get_batch(newsgroups_train, i, batch_size)
            c, _ = sess.run([loss, optimizer], feed_dict={input_tensor: batch_x, output_tensor: batch_y})
            # Accumulate the average loss over the epoch
            avg_cost += c / total_batch
        # Print the loss every 5 epochs
        if epoch % display_step == 0:
            print("Epoch:", '%d' % (epoch+1), "loss=", "{:.6f}".format(avg_cost))
    print("Finished!")

The parameter settings here:

training_epochs = 100  # train for 100 epochs
display_step = 5  # print the current loss every 5 epochs
batch_size = 1000  # number of training samples per batch
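One consequence of computing total_batch with int(len(data) / batch_size) is that a trailing partial batch is never used. The sketch below reproduces only that arithmetic; batch_ranges is an illustrative helper and the sample counts are hypothetical, not the size of the newsgroup training set.

```python
def batch_ranges(n_samples, batch_size):
    # Same arithmetic as the training loop: only full batches are counted
    total_batch = int(n_samples / batch_size)
    return [(i * batch_size, i * batch_size + batch_size) for i in range(total_batch)]


ranges = batch_ranges(2500, 1000)
print(ranges)  # [(0, 1000), (1000, 2000)]
```

With 2500 samples and a batch size of 1000, the last 500 samples are skipped in every epoch, and with fewer samples than batch_size the loop would not train at all, so a smaller batch size may be worth trying.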

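To know whether the network is learning, we need to compare the output value (Z) with the expected value (expected). How do we measure this difference (the loss)? There are many ways to do it.

Because this is a classification task, the best way to measure the loss is the cross-entropy error.

With TensorFlow you use tf.nn.softmax_cross_entropy_with_logits() to compute the cross-entropy error (this function applies the softmax activation itself) and tf.reduce_mean() to compute the mean error.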
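For intuition, the same quantity can be computed by hand in NumPy. This is a sketch of the standard formula (softmax followed by the negative log-probability of the true class, then a mean over the batch), not TensorFlow's implementation; softmax_cross_entropy is an illustrative name and the logits and labels are made up.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    # Numerically stable log-softmax, then pick out the true class via the one-hot labels
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    return -np.sum(labels * log_probs, axis=1)


logits = np.array([[5.0, 0.0, 0.0],    # confident and correct
                   [0.0, 0.0, 5.0]])   # confident and wrong
labels = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
loss = per_example.mean()  # plays the role of tf.reduce_mean
print(per_example, loss)
```

A confident correct prediction yields a loss close to 0, while a confident wrong prediction is penalized heavily, which is what makes cross-entropy a useful training signal for classification.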

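We want to find the best values for the weights and biases so as to minimize the output error (the difference between the value actually obtained and the correct value). To do that we use gradient descent, and more specifically stochastic gradient descent. The code uses the Adam optimizer, a variant of stochastic gradient descent with adaptive step sizes.

The corresponding code: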

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=output_tensor))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

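TensorFlow already wraps these complex algorithms in functions; we only need to choose the right one.

tf.train.AdamOptimizer(learning_rate).minimize(loss) is syntactic sugar that does two things:

compute_gradients(loss, <list of variables>): computes the gradients
apply_gradients(<list of (gradient, variable) pairs>): applies them to the variables

By default this method updates all trainable tf.Variables with new values, so we do not need to pass a list of variables.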
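The two-step structure can be sketched in NumPy on a small least-squares problem. This is an illustration only: the functions below are local stand-ins that borrow the names compute_gradients and apply_gradients, the update is plain gradient descent rather than Adam, and the data, learning rate, and step count are made up.

```python
import numpy as np


def loss_fn(w, x, y):
    # Half the mean squared error of a linear model x @ w
    return 0.5 * np.mean((x @ w - y) ** 2)


def compute_gradients(w, x, y):
    # Step 1: gradient of loss_fn with respect to w
    error = x @ w - y
    return x.T @ error / len(y)


def apply_gradients(w, grad, learning_rate):
    # Step 2: plain gradient-descent update (Adam would rescale the step per parameter)
    return w - learning_rate * grad


x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, -1.0, 1.0])   # generated by the weights [2, -1]
w = np.zeros(2)

initial_loss = loss_fn(w, x, y)
for _ in range(200):
    grad = compute_gradients(w, x, y)
    w = apply_gradients(w, grad, learning_rate=0.5)
final_loss = loss_fn(w, x, y)
print(initial_loss, final_loss, w)
```

Calling minimize(loss) in the TensorFlow script bundles both steps into a single training op, which is why the training loop only needs to run the optimizer node.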

Running the computation

Epoch: 0001 loss= 1133.908114347
Epoch: 0006 loss= 329.093700409
Epoch: 00011 loss= 111.876660109
Epoch: 00016 loss= 72.552971845
Epoch: 00021 loss= 16.673050320
........
Finished!
Accuracy: 0.81

Accuracy: 0.81 means the model classified 81% of the test texts correctly. Adjusting the parameters or adding more data (not done in this article) will change this figure.
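The accuracy computation in the script (argmax of the predictions compared with argmax of the one-hot labels, then averaged) can be mirrored in NumPy. The logits and labels below are made-up values for four samples, used only to show the mechanics.

```python
import numpy as np

# Hypothetical model outputs (logits) for four test samples and three classes
logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 0.2, 1.5],
                   [0.0, 3.0, 0.5],
                   [1.0, 0.9, 0.8]])
# One-hot true labels for the same samples
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)

# A prediction is correct when the highest-scoring class matches the true class
correct = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)
accuracy = correct.astype(float).mean()
print(accuracy)  # 0.75
```

Three of the four predictions match the labels, so the accuracy is 0.75; the script's 0.81 is the same ratio computed over the whole test set.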

Wrapping up

That's it! We used a neural network to build a model that classifies texts into different categories. Using a GPU or distributed TF can improve training speed and efficiency~

