
LangChain Basics (04)

2025-02-07 19:55:28

LangChain Core Module: Data Connection - Document Loaders

A document loader loads data from a source as Documents. A Document is a piece of text together with its associated metadata.

For example, there are document loaders for loading a simple .txt file, for loading ArXiv papers, and for loading the text content of any web page.

Document Class

This code defines a class named Document, which lets users interact with a document's content: viewing its paragraphs and summary, and using a lookup function to search for specific strings within it.

# Document class based on BaseModel definition.
from typing import List

from pydantic import BaseModel, Field


class Document(BaseModel):
    """Interface for interacting with documents."""

    # The main content of the document.
    page_content: str
    # The string currently being looked up.
    lookup_str: str = ""
    # The index of the current match; 0 on the first lookup.
    lookup_index = 0
    # Used to store any document-related metadata.
    metadata: dict = Field(default_factory=dict)

    @property
    def paragraphs(self) -> List[str]:
        """List of paragraphs on the page."""
        # Split the content into paragraphs on blank lines ("\n\n").
        return self.page_content.split("\n\n")

    @property
    def summary(self) -> str:
        """Summary of the page (i.e. the first paragraph)."""
        # Return the first paragraph as the summary.
        return self.paragraphs[0]

    # This method mimics the find function (Cmd-F) on the command line.
    def lookup(self, string: str) -> str:
        """Look up a term in the page, imitating the Cmd-F function."""
        # If the input string differs from the current lookup string,
        # reset the lookup string and index.
        if string.lower() != self.lookup_str:
            self.lookup_str = string.lower()
            self.lookup_index = 0
        else:
            # Same string as the current lookup: advance to the next match.
            self.lookup_index += 1
        # Find all paragraphs containing the search string.
        lookups = [p for p in self.paragraphs if self.lookup_str in p.lower()]
        # Return the corresponding result based on the matches found.
        if len(lookups) == 0:
            return "No Results"
        elif self.lookup_index >= len(lookups):
            return "No More Results"
        else:
            result_prefix = f"(Result {self.lookup_index + 1}/{len(lookups)})"
            return f"{result_prefix} {lookups[self.lookup_index]}"
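
The lookup behaviour can be exercised with a short standalone sketch. It reproduces only the paragraph-splitting and matching logic from the class above, with made-up sample text, so it runs without langchain or pydantic installed:

```python
# Standalone sketch of the lookup logic above (hypothetical sample text,
# no langchain/pydantic dependency).
page_content = (
    "LangChain provides loaders.\n\n"
    "Loaders return Documents.\n\n"
    "Each Document holds text and metadata."
)

# paragraphs: split on blank lines, exactly as the Document property does.
paragraphs = page_content.split("\n\n")
print(paragraphs[0])  # the summary is just the first paragraph

# lookup: collect every paragraph containing the (lowercased) query.
query = "loaders"
matches = [p for p in paragraphs if query in p.lower()]
for i, p in enumerate(matches):
    print(f"(Result {i + 1}/{len(matches)}) {p}")
```

Repeated calls to the real lookup method step through this same matches list one result at a time, like pressing Cmd-F again and again.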

BaseLoader class definition

The BaseLoader class defines how documents are loaded from different data sources, and provides an optional method for splitting the loaded documents. Using this class as a base, developers can create custom loaders for specific data sources while ensuring that every loader provides a load method. The load_and_split method adds a further capability: splitting loaded documents into smaller chunks as needed.

# Base loader class.
from abc import ABC, abstractmethod
from typing import List, Optional

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter


class BaseLoader(ABC):
    """Base loader class definition."""

    # Abstract method; all subclasses must implement it.
    @abstractmethod
    def load(self) -> List[Document]:
        """Load the data and convert it into Document objects."""

    # This method loads the documents and splits them into smaller chunks.
    def load_and_split(
        self, text_splitter: Optional[TextSplitter] = None
    ) -> List[Document]:
        """Load the documents and split them into chunks."""
        # If no text splitter is provided, use the default recursive character splitter.
        if text_splitter is None:
            _text_splitter: TextSplitter = RecursiveCharacterTextSplitter()
        else:
            _text_splitter = text_splitter
        # Load the documents first.
        docs = self.load()
        # Then use _text_splitter to split each document.
        return _text_splitter.split_documents(docs)
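
To make the contract concrete, here is a minimal custom-loader sketch. The Document stand-in and the StringLoader class are hypothetical, defined here only so the example runs without langchain installed; a real custom loader would subclass langchain's BaseLoader and return its Document objects:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


# Minimal stand-in for langchain's Document (hypothetical, for a self-contained demo).
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseLoader(ABC):
    @abstractmethod
    def load(self) -> List[Document]:
        """Subclasses implement this to produce Documents."""


# A custom loader only has to implement load(); this one emits one Document per line.
class StringLoader(BaseLoader):
    def __init__(self, text: str):
        self.text = text

    def load(self) -> List[Document]:
        return [
            Document(page_content=line, metadata={"line": i})
            for i, line in enumerate(self.text.splitlines())
        ]


docs = StringLoader("first line\nsecond line").load()
print(len(docs), docs[0].page_content)
```

Because splitting lives in the base class, any such loader gets load_and_split for free: it simply calls load() and hands the result to a text splitter.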

Loading a .txt file using TextLoader

from langchain.document_loaders import TextLoader

docs = TextLoader('../tests/state_of_the_union.txt', encoding='utf-8').load()
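
In essence, TextLoader reads the whole file into a single Document, with the file path recorded in the metadata under "source". A plain-Python equivalent, sketched with a temporary file and hypothetical content instead of the state_of_the_union.txt sample:

```python
import os
import tempfile

# Write a throwaway .txt file so the example is self-contained (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("Madam Speaker, Madam Vice President...")
    path = f.name

# What TextLoader does, in essence: read the whole file into one "document".
with open(path, encoding="utf-8") as f:
    doc = {"page_content": f.read(), "metadata": {"source": path}}

print(len(doc["page_content"]))
os.remove(path)
```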

Loading ArXiv papers using ArxivLoader

ArxivLoader class definition

The ArxivLoader class is used specifically to fetch documents from the arXiv platform. The user provides a search query, the loader interacts with the arXiv API to retrieve a list of documents matching that query, and the documents are returned in the standard Document format.

# Loader class for the arXiv platform.
class ArxivLoader(BaseLoader):
    """Load documents from `Arxiv` based on a search query.

    This loader converts arXiv's original PDF documents to plain text
    format for easy processing.
    """

    # Initialization method.
    def __init__(
        self,
        query: str,
        load_max_docs: Optional[int] = 100,
        load_all_available_meta: Optional[bool] = False,
    ):
        self.query = query
        """Query or keyword passed to the arXiv API for searching."""
        self.load_max_docs = load_max_docs
        """Upper limit on the number of documents retrieved."""
        self.load_all_available_meta = load_all_available_meta
        """Flag that determines whether to load all metadata associated with the document."""

    # Load method: obtains documents based on the query.
    def load(self) -> List[Document]:
        arxiv_client = ArxivAPIWrapper(
            load_max_docs=self.load_max_docs,
            load_all_available_meta=self.load_all_available_meta,
        )
        docs = arxiv_client.load(self.query)
        return docs

ArxivLoader has the following parameters:

  • query: the text used to search for documents on ArXiv
  • load_max_docs: defaults to 100; limits the number of documents downloaded. Downloading all 100 documents takes time, so use a smaller number when experimenting.
  • load_all_available_meta: defaults to False. By default only the most important fields are downloaded: published date (the date the document was published or last updated), title, authors, and summary. If set to True, additional fields are also downloaded.

Taking the GPT-3 paper (Language Models are Few-Shot Learners) as an example, here is how to use ArxivLoader.

arXiv link to the GPT-3 paper: /abs/2005.14165

from langchain.document_loaders import ArxivLoader
query = "2005.14165"
docs = ArxivLoader(query=query, load_max_docs=5).load()
len(docs)
docs[0].metadata  # meta-information of the Document
{'Published': '2020-07-22',
 'Title': 'Language Models are Few-Shot Learners',
 'Authors': 'Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei',
 'Summary': "Recent work has demonstrated substantial gains on many NLP tasks and\nbenchmarks by pre-training on a large corpus of text followed by fine-tuning on\na specific task. While typically task-agnostic in architecture, this method\nstill requires task-specific fine-tuning datasets of thousands or tens of\nthousands of examples. By contrast, humans can generally perform a new language\ntask from only a few examples or from simple instructions - something which\ncurrent NLP systems still largely struggle to do. Here we show that scaling up\nlanguage models greatly improves task-agnostic, few-shot performance, sometimes\neven reaching competitiveness with prior state-of-the-art fine-tuning\napproaches. Specifically, we train GPT-3, an autoregressive language model with\n175 billion parameters, 10x more than any previous non-sparse language model,\nand test its performance in the few-shot setting. For all tasks, GPT-3 is\napplied without any gradient updates or fine-tuning, with tasks and few-shot\ndemonstrations specified purely via text interaction with the model. GPT-3\nachieves strong performance on many NLP datasets, including translation,\nquestion-answering, and cloze tasks, as well as several tasks that require\non-the-fly reasoning or domain adaptation, such as unscrambling words, using a\nnovel word in a sentence, or performing 3-digit arithmetic. At the same time,\nwe also identify some datasets where GPT-3's few-shot learning still struggles,\nas well as some datasets where GPT-3 faces methodological issues related to\ntraining on large web corpora. Finally, we find that GPT-3 can generate samples\nof news articles which human evaluators have difficulty distinguishing from\narticles written by humans. We discuss broader societal impacts of this finding\nand of GPT-3 in general."}

Using UnstructuredURLLoader to load web content

Unstructured partitioning functions are used to detect MIME types and route files to the appropriate partitioner.

The loader supports two modes: "single" and "elements". In "single" mode, the document is returned as a single langchain Document object. In "elements" mode, the unstructured library splits the document into elements such as titles and narrative text. You can pass additional unstructured kwargs after mode to apply different unstructured settings.

UnstructuredURLLoader main parameters:

  • urls: list of URLs of the web pages to load
  • continue_on_failure: defaults to True; whether to continue after a URL fails to load
  • mode: defaults to "single"

Take the ReAct web page (/) as an example to show the usage.

from langchain.document_loaders import UnstructuredURLLoader
urls = [
    "/",
]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
data[0].metadata
print(data[0].page_content)

Output:

{'source': '/'}
ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao,

Jeffrey Zhao,

Dian Yu,

Nan Du,

Izhak Shafran,

Karthik Narasimhan,

Yuan Cao

[Paper]

[Code]

[Blogpost]

[BibTex]

Language models are getting better at reasoning (e.g. chain-of-thought prompting) and acting (e.g. WebGPT, SayCan, ACT-1), but these two directions have remained separate. 
                ReAct asks, what if these two fundamental capabilities are combined?

Abstract

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.

ReAct Prompting

A ReAct prompt consists of few-shot task-solving trajectories, with human-written text reasoning traces and actions, as well as environment observations in response to actions (see examples in paper appendix!) 
                ReAct prompting is intuitive and flexible to design, and achieves state-of-the-art few-shot performances across a variety of tasks, from question answering to online shopping!

HotpotQA Example

The reason-only baseline (e.g. chain-of-thought) suffers from misinformation (in red) as it is not grounded to external environments to obtain and update knowledge, and has to rely on limited internal knowledge. 
                The act-only baseline suffers from the lack of reasoning, unable to synthesize the final answer despite having the same actions and observation as ReAct in this case. 
                In contrast, ReAct solves the task with a interpretable and factual trajectory.

ALFWorld Example

For decision making tasks, we design human trajectories with sparse reasoning traces, letting the LM decide when to think vs. act. 
                ReAct isn't perfect --- below is a failure example on ALFWorld. However, ReAct format allows easy human inspection and behavior correction by changing a couple of model thoughts, an exciting novel approach to human alignment!

ReAct Finetuning: Initial Results

Prompting has limited context window and learning support. 
                    Initial finetuning results on HotpotQA using ReAct prompting trajectories suggest:
                    (1) ReAct is the best fintuning format across model sizes; 
                    (2) ReAct finetuned smaller models outperform prompted larger models!

loader = UnstructuredURLLoader(urls=urls, mode="elements")
new_data = loader.load()
new_data[0].page_content

Output:

'ReAct: Synergizing Reasoning and Acting in Language Models'