Extracting insights from documents and data is crucial in making informed decisions. However, privacy concerns arise when dealing with sensitive information. LangChain, in combination with the OpenAI API, allows you to analyze your local documents without the need to upload them online.

This approach keeps your data local, uses embeddings and vectorization for analysis, and runs every step within your own environment. OpenAI also does not use data submitted through its API to train its models or improve its services.


Setting Up Your Environment

Create a new Python virtual environment. This will ensure there are no library version conflicts. Then run the following terminal command to install the required libraries.
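The exact package set is an assumption based on the libraries this project uses (LangChain, the OpenAI client, PyPDF for loading PDFs, and FAISS for the vector store), plus tiktoken, which OpenAIEmbeddings relies on for tokenization:

pip install langchain openai pypdf faiss-cpu tiktoken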

Here is a breakdown of how you will use each library: LangChain orchestrates the document loading, text splitting, embedding, and question-answering pipeline; openai gives you access to OpenAI's language models and embeddings; pypdf loads PDF documents; and faiss-cpu stores the document vectors so they can be searched efficiently.


Once all the libraries are installed, your environment is ready.

Getting an OpenAI API Key

When you make requests to the OpenAI API, you need to include an API key as part of the request. This key allows the API provider to verify that the requests are coming from a legitimate source and that you have the necessary permissions to access its features.

To obtain an OpenAI API key, proceed to the OpenAI platform.

Then, under your account’s profile in the top-right, click on View API keys. The API keys page will appear.

Click on the Create new secret key button. Name your key and click on Create new secret key. OpenAI will generate your API key which you should copy and keep somewhere safe. For security reasons, you won’t be able to view it again through your OpenAI account. If you lose this secret key, you’ll need to generate a new one.


The full source code is available in a GitHub repository.

Importing the Required Libraries

To be able to use the libraries installed in your virtual environment, you need to import them.
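A plausible set of imports, assuming an older (pre-0.1) LangChain release in which these classes live in the langchain package itself (newer releases move them into langchain-community and langchain-openai), looks like this:

import os

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI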

Notice that you import the dependency libraries from LangChain. This allows you to use specific features of the LangChain framework.


Loading the Document for Analysis

Start by creating a variable that holds your API key. You will use this variable later in the code for authentication.

It is not recommended to hard code your API key if you plan to share your code with third parties. For production code that you aim to distribute, use an environment variable instead.
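A minimal sketch of both options, using OPENAI_API_KEY as an illustrative variable name:

# Hard-coded key: acceptable for local experiments only
OPENAI_API_KEY = "your-api-key-here"

# Safer alternative for code you distribute: read the key from an environment variable
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")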

Next, create a function that loads a document. The function should load a PDF or a text file. If the document is neither, the function should raise a ValueError.
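A sketch of such a loader; the function name and error message are illustrative, not taken from the original code:

def load_document(filename):
    # Choose a loader based on the file extension
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(filename)
    elif filename.endswith(".txt"):
        loader = TextLoader(filename)
    else:
        raise ValueError("Only PDF and text files are supported.")
    return loader.load()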

After loading the documents, create a CharacterTextSplitter. This splitter will split the loaded documents into smaller chunks based on characters.
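For example, with chunk size and overlap values that are illustrative defaults rather than the article's exact settings:

def split_document(documents):
    # Break the loaded documents into overlapping, character-based chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=30)
    return text_splitter.split_documents(documents)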

Splitting the document ensures that the chunks are of a manageable size and are still connected with some overlapping context. This is useful for tasks like text analysis and information retrieval.

Querying the Document

You need a way to query the uploaded document to derive insights from it. To do so, create a function that takes a query string and a retriever as input. It then creates a RetrievalQA instance using the retriever and an instance of the OpenAI language model.
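A sketch of that function, assuming the query_pdf name used later in the article and the RetrievalQA.from_chain_type constructor with the simple "stuff" chain type:

def query_pdf(query, retriever):
    # Build a retrieval-augmented QA chain from the OpenAI LLM and the retriever
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(openai_api_key=OPENAI_API_KEY),
        chain_type="stuff",
        retriever=retriever,
    )
    result = qa.run(query)
    print(result)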

This function uses the created QA instance to run the query and print the result.

Creating the Main Function

The main function will control the overall program flow. It will take user input for a document filename and load that document. Then create an OpenAIEmbeddings instance for embeddings and construct a vector store based on the loaded documents and embeddings. Save this vector store to a local file.

Next, load the persisted vector store from the local file. Then enter a loop where the user can input queries. The main function passes these queries to the query_pdf function along with the persisted vector store’s retriever. The loop will continue until the user enters “exit”.
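Putting these pieces together, a main function along the following lines would match that flow; the prompt strings and the "faiss_index" folder name are placeholders:

def main():
    filename = input("Enter the name of the document to analyze: ")
    # Load the document and split it into chunks
    documents = load_document(filename)
    docs = split_document(documents)

    # Embed the chunks, index them in a FAISS vector store, and persist it locally
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local("faiss_index")

    # Reload the persisted store and answer queries until the user types "exit"
    persisted_store = FAISS.load_local("faiss_index", embeddings)
    while True:
        query = input("Enter a query (type 'exit' to quit): ")
        if query.lower() == "exit":
            break
        query_pdf(query, persisted_store.as_retriever())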

Embeddings capture semantic relationships between words and phrases, and vectors are the numerical form used to represent pieces of text so they can be compared by meaning.

This code converts the text data in the document into vectors using the embeddings generated by OpenAIEmbeddings. It then indexes these vectors using FAISS, for efficient retrieval and comparison of similar vectors. This is what allows for the analysis of the uploaded document.

Finally, use the if __name__ == “__main__” construct to call the main function when a user runs the program standalone:
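if __name__ == "__main__":
    main()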

This app is a command-line application. As an extension, you can use Streamlit to add a web interface to the app.

Performing Document Analysis

To perform document analysis, store the document you want to analyze in the same folder as your project, then run the program. It will ask for the name of the document you want to analyze. Enter its full name, then enter queries for the program to analyze.

The screenshot below shows the results of analyzing a PDF.

The following output shows the results of analyzing a text file containing source code.

Ensure the files you want to analyze are in either PDF or text format. If your documents are in other formats, you can convert them to PDF format using online tools.

Understanding the Technology Behind Large Language Models

LangChain simplifies the creation of applications using large language models. This also means it abstracts what is going on behind the scenes. To understand exactly how the application you are creating works, you should familiarize yourself with the technology behind large language models.