
LangChain Basics (05)


LangChain Core Module: Data Connection - Document Transformers

Once the document is loaded, you will usually want to convert it to better suit your application.

The simplest example is that you might want to split a long document into smaller chunks to fit the context window of your model. LangChain has many built-in document transformers that make it easy to split, merge, filter, and otherwise manipulate documents.

Text Splitters

When you work with long text, you need to divide it into chunks. This sounds simple, but there is hidden complexity: ideally, you want to keep semantically related pieces of text together.

At a high level, a text splitter works as follows:

  1. Divide the text into small, semantically meaningful pieces (usually sentences).
  2. Combine these small pieces into larger chunks until they reach a certain size (as measured by some function).
  3. Once that size is reached, treat the chunk as its own piece of text and start building a new chunk with some overlap (to preserve context between chunks).

This means you can customize your text splitter along two different axes (a rough sketch of the process follows the list below):

1. How to split text
2. How to measure chunk size
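
To make these steps concrete, here is a rough, hand-rolled sketch of the principle (not LangChain's actual implementation), splitting on sentences and measuring size by character count:

def naive_split(text, chunk_size=200, chunk_overlap=40):
    # 1. Divide the text into small, meaningful pieces (here: sentences).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # 2. Keep combining sentences until the chunk reaches chunk_size.
        if current and len(current) + len(sentence) + 2 > chunk_size:
            chunks.append(current)
            # 3. Start the next chunk with some overlap from the previous one.
            current = current[-chunk_overlap:]
        current = (current + " " + sentence + ".").strip()
    if current:
        chunks.append(current)
    return chunks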

Using the RecursiveCharacterTextSplitter text splitter

This text splitter takes a list of characters as a parameter. It tries to split on the first character in the list, and if any of the resulting chunks are still too large, it moves on to the next character, and so on. By default, the characters it tries to split on are ["\n\n", "\n", " ", ""].

In addition to controlling which characters it splits on, you can configure a few other things (a sketch follows this list):

  • length_function: how the length of a chunk is calculated. By default it just counts characters, but a token counter is often passed here instead.
  • chunk_size: the maximum size of your chunks (as measured by the length function).
  • chunk_overlap: the maximum overlap between chunks. Keeping some overlap preserves continuity between adjacent chunks (e.g., a sliding window).
  • add_start_index: whether to include each chunk's starting position within the original document in the metadata.
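
A minimal sketch of these options using RecursiveCharacterTextSplitter (the sample text and sizes here are placeholders; swap in your own document and, if needed, a token-based length function):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,           # maximum chunk size, measured by length_function
    chunk_overlap=20,         # maximum overlap between adjacent chunks
    length_function=len,      # character count by default; a token counter can be passed instead
    add_start_index=True,     # record each chunk's start position in the metadata
)
docs = text_splitter.create_documents(["Put a long piece of text here ..."])
print(docs[0].page_content)
print(docs[0].metadata)       # e.g. {'start_index': 0}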

LangChain Core Module: Data Connection - Text Embedding Models

The Embeddings class is designed specifically for interacting with text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

An embedding turns a piece of text into a vector representation. This is useful because it means we can reason about text in a vector space and perform operations such as semantic search to find the most similar pieces of text in that space.

The base Embeddings class in LangChain exposes two methods: one for embedding documents and one for embedding queries. The former takes multiple texts as input, while the latter takes a single text. They are separate methods because some embedding providers embed the documents to be searched differently from the search query itself.

Calling OpenAI Embeddings with OpenAIEmbeddings

Embed a list of texts using the embed_documents method

from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings()
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
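
Each input text comes back as one vector; a quick sanity check (the exact dimensionality depends on the embedding model used):

print(len(embeddings))      # 5 — one vector per input text
print(len(embeddings[0]))   # dimensionality of each vector (model-dependent, e.g. 1536)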

Embed a query using the embed_query method

Embed a piece of text for comparison with other embeddings:

embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
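
For illustration, the query embedding can be compared with the document embeddings from above using a hand-rolled cosine similarity (numpy is assumed here; this is not a LangChain API):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every embedded document against the embedded query and pick the best match
scores = [cosine_similarity(embedded_query, doc_vec) for doc_vec in embeddings]
best = scores.index(max(scores))
print(best, scores[best])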

LangChain Core Module: Data Connection - Vector Stores

One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors, then embed the unstructured query at query time and retrieve the embedding vectors that are "most similar" to the embedded query.

The vector store takes care of storing the embedded data and performing vector search for you.

Below, Chroma is used as an example to show the functionality and usage.

## Use Chroma as a vector database to implement semantic search
from langchain.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load long text
raw_documents = TextLoader('../tests/state_of_the_union.txt', encoding='utf-8').load()
# Instantiate text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
# Split text
documents = text_splitter.split_documents(raw_documents)
embeddings_model = OpenAIEmbeddings()
# Use the OpenAI embedding model to obtain the embedding vectors and store them in Chroma
db = Chroma.from_documents(documents, embeddings_model)

Semantic similarity search using text

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

Semantic similarity search using an embedding vector

embedding_vector = embeddings_model.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)
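
Both approaches return the same ranked documents. If you also want a score for each result, Chroma's LangChain wrapper exposes similarity_search_with_score (shown here as a sketch; the score is a distance, so lower means more similar):

# Top 2 results together with their distance scores
results = db.similarity_search_with_score(query, k=2)
for doc, score in results:
    print(score, doc.page_content[:80])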