
Zhang Gaoxing's Hands-On Model Development (II): Building a Local Knowledge Base Application with LangChain


Table of contents
  • Basic concepts
    • What is LangChain
    • What is Ollama
  • Environment construction and configuration
    • Install Ollama
    • Install LangChain
  • Document loading
    • Loading JSON data
    • Loading documents in folders
    • Text vectorization
  • Implement Q&A application

Retrieval-Augmented Generation (RAG) is a technique for improving the output of large language models: before generating an answer, the model retrieves relevant information from an external knowledge base instead of relying solely on the knowledge learned during training. By citing information from an external knowledge base, it produces more accurate, up-to-date, and reliable content, and mitigates the problems of stale knowledge and hallucination. The following sections walk through building a local knowledge base application with LangChain and Ollama.

Basic concepts

What is LangChain

LangChain is a programming framework for large language models (LLMs). Its goal is to simplify the development of LLM-based applications and to unify the way different models are called, so developers do not need to worry about differences in the underlying APIs.

What is Ollama

Ollama is an open source framework for running large language models locally. Its goal is to simplify deploying and running LLMs on local devices: through containerized management and standardized interfaces, users can quickly run mainstream models such as Qwen and DeepSeek without complex configuration.

Environment construction and configuration

Install Ollama

Visit https://ollama.com/download and download the installer for your system. On Windows, run the .exe installer; on Linux, install with the following command:

curl -fsSL https://ollama.com/install.sh | sh

Visit https://ollama.com/search to view the models supported by Ollama. Use the command line to download and run a model, for example the Qwen2.5 7B model:

ollama run qwen2.5
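
You can also check the installation from the command line before moving on. The following commands list the locally downloaded models and pre-download the embedding model that is used later in this article:

ollama list
ollama pull nomic-embed-text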

Install LangChain

After creating a Python virtual environment, execute the following command:

pip install langchain langchain-ollama langchain-chroma langchain-community chromadb unstructured jq
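
A quick way to confirm that the environment is ready is to import the main packages; a minimal check, assuming the command above completed successfully:

# Minimal import check for the freshly installed packages
import langchain
import langchain_ollama
import langchain_chroma
import langchain_community

print("LangChain version:", langchain.__version__)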

Document loading

Documents in the knowledge base can be JSON, TXT, PDF, Markdown, web pages, and other formats; you only need to choose the appropriate loader in LangChain. Common loaders include the following (a brief sketch of the last two appears after the list):

  • DirectoryLoader: batch-loads the documents in a folder.
  • JSONLoader: loads documents in JSON format.
  • PDFPlumberLoader: extracts content from PDF files.
  • WebBaseLoader: fetches the content of web pages.
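
Loading a web page or a PDF differs from the examples below only in the loader class. A minimal sketch, assuming WebBaseLoader and PDFPlumberLoader from langchain-community; the URL and file path are placeholders, and these two loaders additionally require the beautifulsoup4 and pdfplumber packages, which are not in the install command above:

from langchain_community.document_loaders import WebBaseLoader, PDFPlumberLoader

# Fetch a web page and parse it into Document objects (placeholder URL; needs beautifulsoup4)
web_docs = WebBaseLoader("https://example.com/news").load()

# Extract the text of a local PDF file (placeholder path; needs pdfplumber)
pdf_docs = PDFPlumberLoader("./documents/sample.pdf").load()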

The following uses the news data crawled in the previous blog post as an example to show how to load text data into the program.

# Data Example
 {"date": "February 27, 2024", "author": "Party and Government Office", "title": "Land-Table List of Departments", "content": "...", "url": "/2018/0710/c2031a53936/"}
 {"date": "March 6, 2025", "author": "Security Department", "title": "Our school held the 2025 Safety Work Conference", "content": "...", "url": "/2025/0305/c275a56681/"}
 {"date": "March 3, 2025", "author": "Party Committee Propaganda Department", "title": "Our school held the flag-raising ceremony for the opening of the second semester of 2024-2025 and the "first lesson of the opening of the school", "content": "...", "url": "/2025/0303/c275a56667/"}

Loading JSON data

JSON data can be loaded with JSONLoader; json_lines=True indicates that the JSON file is in .jsonl format.

from langchain_community.document_loaders import JSONLoader

# file_path points to the crawled .jsonl data file
documents = JSONLoader(file_path='./documents/', jq_schema='.', text_content=False, json_lines=True).load()

Print the documents variable to inspect the result; page_content holds the loaded document content. You can see that every record and field from the JSON file has been loaded into the documents variable.

>>> documents[0]
Document(metadata={'source': 'D:\\GitHub\\langchain-ollama\\documents\\', 'seq_num': 1}, page_content='{"date": "2024\\u5e742\\u670827\\u65e5", "author": "\\u515a\\u653f\\u529e", "title": "\\u5404\\u90e8\\u95e8\\u56fa\\u5b9a\\u7535\\u8bdd\\u4e00\\u89c8\\u8868", "content": "...", "url": "/2018/0710/c2031a53936/"}')

Sometimes not all fields are needed; the jq_schema parameter can be used to specify the required fields. Here are several common examples:

JSON        -> [{"text": ...}, {"text": ...}, {"text": ...}]
jq_schema   -> ".[].text"

JSON        -> {"key": [{"text": ...}, {"text": ...}, {"text": ...}]}
jq_schema   -> ".key[].text"

JSON        -> ["...", "...", "..."]
jq_schema   -> ".[]"

JSON        -> {"key1": "...", "key2": "...", "key3": "..."}
jq_schema   -> "."

In the documents variable printed above, the metadata field stores the document's metadata. In some advanced retrieval scenarios, besides similarity search on the vector representation of the text, metadata can also be used as auxiliary information for weighting or filtering. The following example adds the title, author, date, and url fields from the JSON into metadata.

def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy selected JSON fields into the document metadata"""
    metadata["title"] = record.get("title")
    metadata["author"] = record.get("author")
    metadata["date"] = record.get("date")
    metadata["source"] = record.get("url")
    return metadata

documents = JSONLoader(file_path='documents/', jq_schema='.', content_key='content', metadata_func=metadata_func, json_lines=True).load()

Print documents again and you can see that metadata now contains the new title, author, date, and source fields, while page_content holds the content field from the JSON.

>>> documents[0]
Document(metadata={'source': '/2018/0710/c2031a53936/', 'seq_num': 1, 'title': 'Department Landline Phone Directory', 'author': 'Party and Government Office', 'date': 'February 27, 2024'}, page_content='...')

Loading documents in folders

More commonly, the data is not in a single document. If the documents are scattered across a folder, DirectoryLoader can be used; it recursively loads all matching files in the folder.

from langchain_community.document_loaders import DirectoryLoader, TextLoader

text_loader_kwargs = {'encoding': 'utf-8'}  # Parameters passed to the text loader
documents = DirectoryLoader(path='./documents', glob='**/*.txt', show_progress=True, use_multithreading=True, loader_cls=TextLoader, loader_kwargs=text_loader_kwargs).load()
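
The same pattern extends to other file types by changing the glob pattern and the loader class. A sketch for Markdown files, assuming UnstructuredMarkdownLoader from langchain-community, which parses files through the unstructured package installed above (it may require additional optional dependencies such as markdown):

from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

# Recursively load all Markdown files in the folder
md_documents = DirectoryLoader(path='./documents', glob='**/*.md', loader_cls=UnstructuredMarkdownLoader).load()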

If a single loaded document is too long, the text also needs to be split into chunks of reasonable length to avoid exceeding the LLM's context window. RecursiveCharacterTextSplitter splits text using a series of separators in order of priority, which helps preserve the logical structure of the text, such as keeping sentences or paragraphs intact. A list of separators is set here, such as line breaks, spaces, and English and Chinese punctuation marks. chunk_size specifies the length of each chunk as 1000 characters, and chunk_overlap sets the overlap between adjacent chunks to 20, so that information is not lost by being cut off at a boundary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200B",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ], chunk_size=1000, chunk_overlap=20, add_start_index=True)
split_docs = text_splitter.split_documents(documents)
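
To get a feel for the result, you can check how many chunks were produced and how long they are; since add_start_index=True was set, each chunk's metadata also records its starting offset in the original document:

# Inspect the number of chunks, their lengths, and their start offsets
print(len(split_docs))
for doc in split_docs[:3]:
    print(len(doc.page_content), doc.metadata.get("start_index"))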

Text vectorization

Large models cannot directly use the text in the knowledge base for retrieval; we need to convert human-readable text into a numerical form that machines can understand and process. This process is called text vectorization. If looking up content by entry is like consulting a dictionary, text vectorization is writing the dictionary. The vectorization method used here is embedding, which lets the vectors capture semantic relationships, grammatical similarities, and other information between words. OllamaEmbeddings is an embedding wrapper around an Ollama model that converts text into vector representations for later retrieval and generation. Before using it, you also need to download an embedding model, for example by running ollama pull nomic-embed-text on the command line. After vectorization, the vectors need to be stored in a vector database, such as FAISS or Chroma.

from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_chroma.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma.from_documents(documents, embeddings, persist_directory="./embeddings")

After this process completes, try retrieving from the database; it returns the documents most similar to the query.

results = db.similarity_search("Recruitment Announcement")

for result in results:
    print("Title:", result.metadata["title"])
    print("Author:", result.metadata["author"])
    print("Date:", result.metadata["date"])
    print("Content:", result.page_content)
    print("------\n")

You can also search with a vector directly. k specifies the number of results to return, and filter sets search criteria on metadata fields.

query_vector = embeddings.embed_query("Recruitment Announcement")
results = db.similarity_search_by_vector(query_vector, k=1, filter={"author": "Personnel Office"})

Implement Q&A application

First, import the relevant packages and load the vector database and models.

from langchain_ollama.llms import OllamaLLM
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model = OllamaLLM(model="qwen2.5:7b")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(embedding_function=embeddings, persist_directory='./embeddings')
retriever = db.as_retriever(search_kwargs={"k": 3})
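
Before wiring everything into a chain, it can help to check what the retriever returns for a query. Retrievers are Runnables, so invoke() returns the k most relevant Document objects; a quick check against the vector store built earlier:

# Preview the documents the retriever would hand to the prompt
docs = retriever.invoke("Recruitment Announcement")
for doc in docs:
    print(doc.metadata.get("title"))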

Then use ChatPromptTemplate to define the prompt template used when interacting with the user. The template contains two placeholders, {context} and {question}, which represent the relevant content retrieved from the knowledge base and the user's question, respectively.

template = """You are a professional assistant at Xuzhou Industrial Vocational and Technical College and can answer questions based on press releases in the knowledge base:
 {context}
 Question: {question}
 The answer should be concise and accurate, and avoid fabricating information.
 """
 prompt = ChatPromptTemplate.from_template(template)
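
To preview what the model will actually receive, the template can be filled in by hand; a small illustration with placeholder values:

# Fill the placeholders manually to preview the final prompt text
print(prompt.format(context="(retrieved press releases go here)", question="When does the new semester start?"))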

Then a key LangChain concept, chain calls, is used to build a data-processing pipeline: the retriever first fetches the relevant context, the context and the question are formatted into a prompt, the prompt is sent to the language model to generate an answer, and finally StrOutputParser parses the output into a string.

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
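
Note that the retriever's output, a list of Document objects, is inserted into the {context} placeholder as-is. A common refinement is to join the documents' text into a single string first; an optional sketch, where format_docs is a helper introduced here rather than part of the original code:

def format_docs(docs):
    """Join the retrieved documents into one context string"""
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)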

Finally, call the invoke() method to execute the pipeline, passing the question as an argument to get the answer.

response = ("When was the last recruitment of the school?")
 print(f"AI: {response}")

The complete procedure is as follows:

from langchain_ollama.llms import OllamaLLM
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model = OllamaLLM(model="qwen2.5:0.5b")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(embedding_function=embeddings, persist_directory='./embeddings')
retriever = db.as_retriever(search_kwargs={"k": 3})

template = """You are a professional assistant at Xuzhou Industrial Vocational and Technical College and can answer questions based on press releases in the knowledge base:
{context}
Question: {question}
The answer should be concise and accurate, and avoid fabricating information.
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

def chat_with_system(question):
    """Run the RAG chain for one question and print the answer"""
    response = chain.invoke(question)
    print(f"AI: {response}")
    return response

# Sample dialogue
if __name__ == "__main__":
    while True:
        user_input = input("You: ")
        if user_input.lower() in ["exit", "quit"]:
            break
        response = chat_with_system(user_input)