<strike id="05sgw"></strike>
<th id="05sgw"><delect id="05sgw"><dfn id="05sgw"></dfn></delect></th>
<strike id="05sgw"></strike>

<center id="05sgw"></center>

溫馨提示×

溫馨提示×

您好，登錄后才能下訂單哦！

密碼登錄×

忘記密碼？

登錄注冊×

獲取短信驗證碼

其他方式登錄

點擊登錄注冊即表示同意《億速云用戶服務條款》

用戶登錄×

賬戶密碼登錄

請使用微信掃描上方二維碼

使用幫助

請求超時！

請點擊重新獲取二維碼

fastText和GloVe怎么使用

發(fā)布時間：2021-12-27 14:06:59 來源：億速云閱讀：249 作者：iii 欄目：大數(shù)據(jù)

這篇文章主要講解了“fastText和GloVe怎么使用”，文中的講解內(nèi)容簡單清晰，易于學習與理解，下面請大家跟著小編的思路慢慢深入，一起來研究和學習“fastText和GloVe怎么使用”吧！

數(shù)據(jù)+預處理

數(shù)據(jù)包括7613條tweet（Text列）和label（Target列），不管他們是否在談論真正的災難。有3271行通知實際災難，有4342行通知非實際災難。

fastText和GloVe怎么使用

文本中真實災難詞的例子：

“ Forest fire near La Ronge Sask. Canada “

使用災難詞而不是關于災難的例子：

“These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittens”

數(shù)據(jù)將被分成訓練（6090行）和測試（1523行）集，然后進行預處理。我們將只使用文本列和目標列。

from sklearn.model_selection import train_test_split

data = pd.read_csv('train.csv', sep=',', header=0)

train_df, test_df = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)

此處使用的預處理步驟：

小寫
清除停用詞
標記化

from sklearn.utils import shuffle

raw_docs_train = train_df['text'].tolist()
raw_docs_test = test_df['text'].tolist()
num_classes = len(label_names)

processed_docs_train = []

for doc in tqdm(raw_docs_train):
  tokens = word_tokenize(doc)
  filtered = [word for word in tokens if word not in stop_words]
  processed_docs_train.append(" ".join(filtered))

processed_docs_test = []

for doc in tqdm(raw_docs_test):
  tokens = word_tokenize(doc)
  filtered = [word for word in tokens if word not in stop_words]
  processed_docs_test.append(" ".join(filtered))

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs_train + processed_docs_test)  

word_seq_train = tokenizer.texts_to_sequences(processed_docs_train)
word_seq_test = tokenizer.texts_to_sequences(processed_docs_test)
word_index = tokenizer.word_index

word_seq_train = sequence.pad_sequences(word_seq_train, maxlen=max_seq_len)

word_seq_test = sequence.pad_sequences(word_seq_test, maxlen=max_seq_len)

詞嵌入

第1步：下載預訓練模型

使用fastText和Glove的第一步是下載每個預訓練過的模型。我使用google colab來防止我的筆記本電腦使用大內(nèi)存，所以我用request library下載了它，然后直接在notebook上解壓。

我使用了兩個詞嵌入中最大的預訓練模型。fastText模型給出了200萬個詞向量，而GloVe給出了220萬個單詞向量。

fastText預訓練模型下載

import requests, zipfile, io

zip_file_url = “https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip"

r = requests.get(zip_file_url)

z = zipfile.ZipFile(io.BytesIO(r.content))

z.extractall()

GloVe預訓練模型下載

import requests, zipfile, io

zip_file_url = “http://nlp.stanford.edu/data/glove.840B.300d.zip"

r = requests.get(zip_file_url)

z = zipfile.ZipFile(io.BytesIO(r.content))

z.extractall()

第2步：下載預訓練模型

FastText提供了加載詞向量的格式，需要使用它來加載這兩個模型。

embeddings_index = {}

f = codecs.open(‘crawl-300d-2M.vec’, encoding=’utf-8')
# Glove
# f = codecs.open(‘glove.840B.300d.txt’, encoding=’utf-8')

for line in tqdm(f):

    values = line.rstrip().rsplit(‘ ‘)

    word = values[0]

    coefs = np.asarray(values[1:], dtype=’float32')

    embeddings_index[word] = coefs

f.close()

第3步：嵌入矩陣

采用嵌入矩陣來確定訓練數(shù)據(jù)中每個詞的權重。

但是有一種可能性是，有些詞不在向量中，比如打字錯誤、縮寫或用戶名。這些單詞將存儲在一個列表中，我們可以比較處理來自fastText和GloVe的詞的性能

words_not_found = []

nb_words = min(MAX_NB_WORDS, len(word_index)+1)
embedding_matrix = np.zeros((nb_words, embed_dim))

for word, i in word_index.items():
  if i >= nb_words:
     continue
  embedding_vector = embeddings_index.get(word)
  
  if (embedding_vector is not None) and len(embedding_vector) > 0:
     embedding_matrix[i] = embedding_vector
  else:
     words_not_found.append(word)

print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

fastText上的null word嵌入數(shù)為9175，GloVe 上的null word嵌入數(shù)為9186。

LSTM

你可以對超參數(shù)或架構進行微調(diào)，但我將使用非常簡單的一個架構，它包含嵌入層、LSTM層、Dense層和Dropout層。

from keras.layers import BatchNormalization
import tensorflow as tf

model = tf.keras.Sequential()

model.add(Embedding(nb_words, embed_dim, input_length=max_seq_len, weights=[embedding_matrix],trainable=False))

model.add(Bidirectional(LSTM(32, return_sequences= True)))
model.add(Dense(32,activation=’relu’))

model.add(Dropout(0.3))
model.add(Dense(1,activation=’sigmoid’))

model.summary()

fastText和GloVe怎么使用

from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

es_callback = EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(word_seq_train, y_train, batch_size=256, epochs=30, validation_split=0.3, callbacks=[es_callback], shuffle=False)

結果

fastText的準確率為83%，而GloVe的準確率為81%。與沒有詞嵌入的模型（68%）的性能比較，可以看出詞嵌入對性能有顯著的影響。

fastText 嵌入的準確度

fastText和GloVe怎么使用

GloVe 嵌入的準確度

fastText和GloVe怎么使用

沒有詞嵌入的準確度

fastText和GloVe怎么使用

感謝各位的閱讀，以上就是“fastText和GloVe怎么使用”的內(nèi)容了，經(jīng)過本文的學習后，相信大家對fastText和GloVe怎么使用這一問題有了更深刻的體會，具體使用情況還需要大家實踐驗證。這里是億速云，小編將為大家推送更多相關知識點的文章，歡迎關注！

向AI問一下細節(jié)

推薦閱讀：

免責聲明：本站發(fā)布的內(nèi)容（圖片、視頻和文字）以原創(chuàng)、轉(zhuǎn)載和分享為主，文章觀點不代表本網(wǎng)站立場，如果涉及侵權請聯(lián)系站長郵箱：is@yisu.com進行舉報，并提供相關證據(jù)，一經(jīng)查實，將立刻刪除涉嫌侵權內(nèi)容。

上一篇新聞：
linux中grep怎么用
下一篇新聞：
Android如何自定View實現(xiàn)滑動驗證效果

猜你喜歡

AI
助
手

產(chǎn)品服務

地區(qū)劃分

專題活動

幫助支持

關于我們

售后咨詢

7*24小時在線電話：400-100-2938

7*24小時在線 QQ：800811969

關注億速云

億速云公眾號

手機網(wǎng)站二維碼

<th id="r0z61"><dd id="r0z61"><dfn id="r0z61"></dfn></dd></th>

<table id="r0z61"></table>