How to Use LlamaIndex to Train on PDFs

Published: 2023-03-28 14:48:55 | Source: Yisu Cloud | Views: 251 | Author: iii | Category: Development

This article explains how to use LlamaIndex to train on PDFs, working through a practical example step by step. The procedure is short and easy to reproduce.

What is LlamaIndex


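LlamaIndex is a simple, flexible interface between your external data and LLMs. It provides the following tools in an easy-to-use form:

Data connectors for your existing data sources and data formats (APIs, PDFs, documents, SQL, and so on).

Indices over your unstructured and structured data for use with LLMs. These indices help abstract away the common boilerplate and pain points of in-context learning:

    Storing context in an easy-to-access format for quick insertion into prompts.
    Dealing with prompt limits (for example, 4096 tokens for Davinci) when the context is too large.
    Dealing with text splitting.
An interface for users to query the index (feed in an input prompt) and obtain a knowledge-augmented output.
A comprehensive toolset for trading off cost and performance.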
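To make the prompt-limit and text-splitting points concrete, here is a minimal sketch of chunking a long text so that each piece fits under a size budget. It is an illustration only, not LlamaIndex's own splitter: the function name `split_into_chunks` is hypothetical, and whitespace-separated words stand in for real model tokens, which a tokenizer would count differently.

```python
def split_into_chunks(text, max_tokens=4096, overlap=0):
    """Split text into chunks of at most max_tokens words.

    Assumption: one whitespace-separated word approximates one token.
    Consecutive chunks share `overlap` words so context is not cut abruptly.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once the current window has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```

Each chunk can then be embedded or inserted into a prompt on its own, which is roughly the kind of bookkeeping an index takes off your hands.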

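This is only the tip of the iceberg of what LlamaIndex can do; there are many more features worth exploring.

The steps below walk through the implementation.

Step 1: Install the dependencies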

requirements.txt

Flask==2.2.3
Flask-Cors==3.0.10
langchain==0.0.115
llama-index==0.4.30
PyPDF2==3.0.1

We need to deploy a web service. Flask is used here, but FastAPI or Django would work just as well. llama-index provides the index used to query the PDFs.

Step 2: The server that trains on the data and builds the index

index_server.py

import os
import pickle

# Put your own key here, but avoid committing it to GitHub
os.environ['OPENAI_API_KEY'] = ""

from multiprocessing import Lock
from multiprocessing.managers import BaseManager
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, Document

index = None
stored_docs = {}
lock = Lock()

# JSON file in which the index is saved
index_name = "./index.json"
# Pickle file that stores document IDs and text so clients can list the documents
pkl_name = "stored_documents.pkl"


def initialize_index():
    """Load the saved index if one exists; otherwise create a new one."""
    global index, stored_docs
    with lock:
        if os.path.exists(index_name):
            # Reuse the index that was already built
            index = GPTSimpleVectorIndex.load_from_disk(index_name)
        else:
            # Create a new index. Because of a llama_index bug, an empty list
            # must be passed in, otherwise an error is raised.
            index = GPTSimpleVectorIndex([])
            index.save_to_disk(index_name)
        if os.path.exists(pkl_name):
            with open(pkl_name, "rb") as f:
                stored_docs = pickle.load(f)


def query_index(query_text):
    """Query the index with the given text and return the response."""
    global index
    response = index.query(query_text)
    return response


def insert_into_index(doc_file_path, doc_id=None):
    """Load the file at doc_file_path and insert it into the index.

    If doc_id is given it replaces the document's ID; otherwise the document
    keeps the ID it was assigned when it was loaded.
    """
    global index, stored_docs
    document = SimpleDirectoryReader(input_files=[doc_file_path]).load_data()[0]
    if doc_id is not None:
        document.doc_id = doc_id
    # Keep track of stored docs -- llama_index doesn't make this easy
    stored_docs[document.doc_id] = document.text[0:200]  # only take the first 200 chars
    with lock:
        index.insert(document)
        index.save_to_disk(index_name)
        with open(pkl_name, "wb") as f:
            pickle.dump(stored_docs, f)
    return


def get_documents_list():
    """Return the stored documents as a list of {"id", "text"} dicts."""
    global stored_docs
    documents_list = []
    for doc_id, doc_text in stored_docs.items():
        documents_list.append({"id": doc_id, "text": doc_text})
    return documents_list


if __name__ == "__main__":
    # Load the existing index, or create a new one if none has been saved yet
    print("initializing index...")
    initialize_index()
    # Start the server, listening on port 5602
    manager = BaseManager(('127.0.0.1', 5602), b'123456')
    # Register the functions so that clients can call them
    manager.register('query_index', query_index)
    manager.register('insert_into_index', insert_into_index)
    manager.register('get_documents_list', get_documents_list)
    server = manager.get_server()
    print("server started...")
    server.serve_forever()

Note that OPENAI_API_KEY above must be changed to your own key; otherwise calling initialize_index raises an error.
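One way to keep the key out of the source file is to read it from the environment and fail early with a clear message when it is missing. The sketch below is a suggestion rather than part of the original server; the helper name `require_openai_api_key` is hypothetical, and it assumes you export OPENAI_API_KEY in the shell before starting the server.

```python
import os


def require_openai_api_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise if it is unset.

    `env` defaults to the process environment; passing a plain dict makes the
    function easy to exercise without touching real environment variables.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting index_server.py"
        )
    return key
```

Calling `require_openai_api_key()` at the top of index_server.py, in place of the hard-coded assignment, would then stop the server before any index is built if the key is absent.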

Finally, the server starts successfully:

$ python index_server.py
initializing index...
server started...
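With the server running, another process (for example the web service) can reach the registered functions through the same BaseManager mechanism. The following is a minimal client sketch under stated assumptions: the address and authkey defaults mirror the values in index_server.py, while `IndexClientManager`, `connect_to_index_server`, and `list_documents` are hypothetical names introduced here for illustration.

```python
from multiprocessing.managers import BaseManager


class IndexClientManager(BaseManager):
    """Client-side manager; a subclass keeps registrations off BaseManager itself."""


# Register names only. The implementations live in the server process.
for _name in ("query_index", "insert_into_index", "get_documents_list"):
    IndexClientManager.register(_name)


def connect_to_index_server(address=("127.0.0.1", 5602), authkey=b"123456"):
    """Connect to a running index server and return the connected manager."""
    manager = IndexClientManager(address, authkey)
    manager.connect()
    return manager


def list_documents(manager):
    """Fetch the stored document list as a plain Python list.

    Remote calls return proxy objects; _getvalue() copies the underlying
    value into this process.
    """
    return manager.get_documents_list()._getvalue()
```

A caller would run `manager = connect_to_index_server()` once and then use `list_documents(manager)`, or call `manager.insert_into_index(path)` with the path of a PDF that the server process can read.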

