Empowering Users with Advanced Question-Answering Systems

Posted 10 months ago by Milos Svana

For a long time, we have dreamt about systems able to answer questions related to a set of text documents — a next-gen search engine. As developers, we spend a significant portion of our time reading through documentation, trying to solve a specific problem. We are not alone. People in many other fields face similar problems. Addressing this issue could save an immense amount of time.

The problem of answering a question from a set of text documents has been studied for quite some time. However, only the recent improvements in large language models have led to practically useful solutions. Most of them are based on the same basic sequence of steps:

  1. Create a numerical representation of each document
  2. Create a numerical representation of the question
  3. Use these numerical representations to find documents similar to the question
  4. Select the most similar documents and build a prompt for a language model such as ChatGPT. This prompt includes both the question and the actual text of the most similar documents.
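
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration only: it assumes simple word-count vectors in place of real embeddings or BM25, and the helper names (vectorize, cosine, build_prompt) and the sample documents are made up for this example. No language model is called; the sketch stops once the prompt is built.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Steps 1 and 2: represent a text as word counts (a stand-in for real embeddings)."""
    return Counter(re.findall(r'[a-z0-9_]+', text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Steps 3 and 4: rank documents by similarity and put the best ones in a prompt."""
    question_vector = vectorize(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(vectorize(doc), question_vector),
        reverse=True)
    context = '\n'.join(ranked[:top_k])
    return (f'Answer the question using the context.\n'
            f'Question: {question}\nContext:\n{context}')


# Made-up sample documents
documents = [
    'A model class maps to a database table.',
    'Fields on a model map to table columns.',
    'Use pip to install the package.',
]
prompt = build_prompt('How do I define a model class?', documents, top_k=1)
```

With top_k=1, only the document that shares the most words with the question ends up in the prompt.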

Let’s have a look at three different classes of solutions to the question-answering problem: ready-made commercial offerings, Python libraries designed specifically for this purpose, and a custom solution built on top of well-known Python libraries. We will compare these three approaches on a use case typical for software engineers: answering questions about a software library, using its documentation as our knowledge base. Our test subject will be Peewee, a simple ORM library for Python. But it shouldn’t be hard to imagine how the presented solutions could be used with other types of documents such as legal documents or various reports.

Commercial solutions

It’s not surprising that there are already multiple commercial question-answering systems able to work with a user-provided set of documents. Many of these tools take the form of a chatbot.

Let’s take ChatNode as an example. When creating a new chatbot, we first need to define our knowledge base. ChatNode lets us upload documents, enter text directly, or provide a URL for automated scraping. Given our use case, the third option seems most appropriate. After entering the URL of Peewee’s documentation, ChatNode automatically detects all subpages. We select subpages that should be included in our bot’s knowledge base and start training the bot.

 

 

After the bot is trained, we can start chatting:

 

 

 

As we just saw, creating a chatbot with ChatNode is extremely easy. The bot can be added to any webpage or used via Slack. ChatNode, however, does not provide an API. Moreover, the number of tokens that can be added to the chatbot’s knowledge base is limited, and so is the number of messages.

ChatNode is representative of all question-answering chatbot solutions we’ve examined, both in terms of features and limitations. If you are interested in exploring other services yourself, this list of AI tools for document search is a good starting point.

Building a question-answering system with Haystack

If commercial solutions seem too limiting for your use case, there are some excellent libraries for creating a custom question-answering system, such as LangChain or Haystack. We slightly prefer Haystack because its documentation is better structured and much more complete.

The process of creating a question-answering system is very similar to ChatNode’s. But instead of configuring everything in a UI, we have to write Python code. We start by installing Haystack with a web crawler plugin:

$ pip install 'farm-haystack[crawler]'

Next, we create a simple script for scraping the docs. Haystack’s Crawler makes the task very easy:

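import sys
from haystack.nodes.connector import Crawler

# Usage: python scrape.py <docs-url> <output-dir>
crawler = Crawler(output_dir=sys.argv[2])
# filter_urls restricts crawling to URLs that match the documentation URL
crawler.crawl(urls=[sys.argv[1]], filter_urls=[sys.argv[1]])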

To scrape Peewee’s docs, we run the script with the documentation URL and the output directory as command line arguments. For example:

$ python scrape.py https://docs.peewee-orm.com/en/latest/ docs/

Now we are ready to create a simple question-answering system. It will consist of several main components:

  • Document store: As the name suggests, this component is responsible for storing all documents and their numerical representation. This numerical representation enables us to quickly find candidate documents that could potentially contain the answer to our question. Haystack supports many document storage types. In our example, we will use the simplest in-memory store.
  • Text Indexing Pipeline: This component iterates over the scraped documentation files, calculates their numerical representation, and stores both the original text and the numerical representation in the document store. In our case, we chose to use BM25 as our numerical representation. You can also use embeddings such as those generated by OpenAI’s text-embedding-ada-002 model.
  • A prompt pipeline: Finally, we define a pipeline for answering our question. This pipeline consists of two basic components: a Retriever responsible for finding candidate documents in the document store and a PromptNode that takes the documents found by the retriever and asks ChatGPT to generate an answer to our question.
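
To make the BM25 step less abstract, here is a minimal, self-contained sketch of BM25 scoring. It is not Haystack’s implementation: the tokenizer, the bm25_scores helper, the sample documents, and the commonly used parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r'[a-z0-9_]+', text.lower())


def bm25_scores(query: str, documents: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(
                (len(documents) - doc_freq[term] + 0.5)
                / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up sample documents
documents = [
    'Models are defined by subclassing Model.',
    'Queries are built with select and where.',
    'Connect to the database before running queries.',
]
scores = bm25_scores('select queries', documents)
best = max(range(len(documents)), key=scores.__getitem__)
```

The second document matches both query terms and therefore receives the highest score; the first matches none and scores zero.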

Here is the code of the whole question-answering system:

import os
import sys

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, PromptNode, PreProcessor
from haystack.nodes.file_converter.json import JsonConverter
from haystack.pipelines import Pipeline
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

API_KEY = '<your-open-ai-api-key>'
question = sys.argv[1]

# Indexing: convert the scraped JSON files, split them into chunks,
# and store the chunks in an in-memory document store with BM25 enabled
document_store = InMemoryDocumentStore(use_bm25=True)
files_to_index = ['docs/' + d for d in os.listdir('docs/')]
indexing_pipeline = TextIndexingPipeline(
    document_store, JsonConverter(), PreProcessor(
        split_length=250,
        split_overlap=5,
        split_respect_sentence_boundary=False))
indexing_pipeline.run_batch(file_paths=files_to_index)

# Querying: the retriever finds candidate chunks, the prompt node asks the model
retriever = BM25Retriever(document_store=document_store)
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo", api_key=API_KEY,
    default_prompt_template="question-answering",
    stop_words=['<|endoftext|>'], max_length=250)

pipeline = Pipeline()
pipeline.add_node(component=retriever, name='retriever', inputs=['Query'])
pipeline.add_node(component=prompt_node, name='prompt', inputs=['retriever'])

# Pass the seven best-matching chunks to the language model
prediction = pipeline.run(query=question, params={'retriever': {'top_k': 7}})
print(prediction['answers'][0].answer)

There are several things to note. If we look at the TextIndexingPipeline, we see that it contains a JsonConverter and a PreProcessor. Haystack’s crawler stores downloaded data as JSON files. JsonConverter transforms them into an internal document representation. The PreProcessor then splits each document into shorter chunks. Splitting gives us more flexibility when looking for documents that potentially contain the answer to our question. It also protects us from exceeding OpenAI’s input size limits.

There are many parameters we can configure to tweak how our question-answering system works. We can choose among different language models and prompt templates, set the maximum response length, or change the number of top candidate documents passed to the language model.
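
The effect of the PreProcessor settings can be illustrated with a small stand-alone function. This is a simplified sketch, not Haystack’s code: it assumes that chunks are measured in whitespace-separated words and that sentence boundaries are ignored, mirroring split_length=250, split_overlap=5, and split_respect_sentence_boundary=False. The name split_words is made up for this example.

```python
def split_words(text: str, split_length: int = 250,
                split_overlap: int = 5) -> list[str]:
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words.
    """
    if split_overlap >= split_length:
        raise ValueError('split_overlap must be smaller than split_length')
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# A synthetic 600-word document: w0 w1 ... w599
text = ' '.join(f'w{i}' for i in range(600))
chunks = split_words(text)
```

The overlap means that a sentence cut at a chunk boundary still appears partly in both neighboring chunks, which gives the retriever a slightly better chance of finding it.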

We can run our Python script with the question as a command line argument. For example:

$ python qa.py "How can I define a model? Can you give me an example?"

To define a model in Peewee, you can create a subclass of the Model class and define fields as class attributes. Here’s an example:

from peewee import *
database = SqliteDatabase('my_database.db')
class User(Model):
    username = CharField(unique=True)
    email = CharField()
    password = CharField()

    class Meta:
        database = database

In this example, we define a User model with three fields: username (a unique string), email (a string), and password (a string). We also specify a Meta class with a database attribute, which tells Peewee which database to use for this model.

One disadvantage of our implementation is that there is no chat history. If needed, though, we could implement a simple workaround by explicitly recording our questions and ChatGPT’s answers and then adding these records to the prompt text.
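
A minimal sketch of this workaround could look as follows, assuming we keep the history as a list of (question, answer) pairs. The function name, the turn limit, and the "User:" / "Assistant:" labels are illustrative choices, not part of Haystack.

```python
def prompt_with_history(history: list[tuple[str, str]], question: str,
                        max_turns: int = 3) -> str:
    """Prefix the new question with the most recent question/answer pairs."""
    lines = []
    for past_question, past_answer in history[-max_turns:]:
        lines.append(f'User: {past_question}')
        lines.append(f'Assistant: {past_answer}')
    lines.append(f'User: {question}')
    return '\n'.join(lines)


# Made-up history with a single earlier exchange
history = [('How do I define a model?', 'Subclass Model and add fields.')]
query = prompt_with_history(history, 'How do I add a unique field?')
```

Keep in mind that in the script above the query also drives retrieval, so text added to it would influence which documents the retriever finds.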

There are, however, several advantages. Besides more customization options, Haystack provides a list of documents used to answer the question. Thanks to this feature, you can easily verify that the answer is indeed correct. In our implementation, these documents are stored in prediction['documents']. What’s more, there are no limitations on the number of tokens you can put into your knowledge base or on the number of queries you can make. You can also easily replace generative question answering with other tools, such as extractive question answering or summarization.

Custom solution with established Python libraries

Finally, to prove that you can build a simple question-answering system without any special tools (besides OpenAI’s API), let’s have a short look at a custom solution that uses well-known Python libraries: numpy, requests, beautifulsoup4, and openai.

Just as before, we start with a scraper script, which is built around the following function:

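import os
from hashlib import md5
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape_page(base_url: str, output_dir: str) -> None:
    urls_to_scrape = [base_url]
    scraped_urls = []

    while len(urls_to_scrape) > 0:
        current_url = urls_to_scrape.pop()
        response = requests.get(current_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        scraped_urls.append(current_url)
        # Name each output file after the MD5 hash of its URL
        file_name = f'{md5(current_url.encode()).hexdigest()}.txt'
        file_path = os.path.join(output_dir, file_name)
        with open(file_path, 'w') as file_fd:
            file_fd.write(text)
        # Only consider anchors that actually have an href attribute
        for link in soup.find_all('a', href=True):
            # Resolve relative links against the current page, drop fragments
            subpage_url = urljoin(
                current_url, link['href']).split('#')[0]
            # Stay under base_url and skip pages already queued or scraped
            if subpage_url.startswith(base_url) \
                    and subpage_url not in urls_to_scrape \
                    and subpage_url not in scraped_urls:
                urls_to_scrape.append(subpage_url)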

Compared to just invoking Haystack’s crawler, the code is longer, but not too complicated. The function maintains a list of URLs to scrape (initialized with the provided base URL). It downloads the contents of each URL, saves the extracted text into a file, and collects additional URLs to scrape. Document splitting, which happens later when we create the embeddings, uses a somewhat naive approach: we split each document into chunks based on a predetermined number of characters. This means that chunk boundaries will often fall in the middle of a sentence or code sample.

Let’s now move on to the actual question answering. The process starts by creating a numerical representation of each document. Instead of BM25, we will now use OpenAI’s embeddings:

import os
import re
import textwrap

import numpy as np
import openai  # the calls below use the pre-1.0 openai package interface


def create_embeddings(input_dir: str) -> tuple[np.ndarray, list[str]]:
    embeddings = []
    texts = []

    for document_name in os.listdir(input_dir):
        document_path = os.path.join(input_dir, document_name)
        with open(document_path, 'r') as document:
            text = document.read()
        # Collapse runs of whitespace into single spaces
        text = re.sub(r'\s+', ' ', text)
        # Split the document into chunks of at most 3000 characters
        for chunk in textwrap.wrap(text, 3000):
            embedding_response = openai.Embedding.create(
                input=chunk, model='text-embedding-ada-002')
            embedding = embedding_response['data'][0]['embedding']
            embeddings.append(embedding)
            texts.append(chunk)
    return np.array(embeddings), texts

Next, we need a function to find candidate documents. This function first asks OpenAI to create an embedding from our question. We then use numpy to calculate the cosine similarity between the question embedding and each document embedding. We sort the documents by similarity and return the most similar candidates:

def filter_embeddings_by_question(embeddings: np.ndarray, question: str) -> np.ndarray:
    embedding_response = openai.Embedding.create(
        input=question, model='text-embedding-ada-002')
    question_embedding = embedding_response['data'][0]['embedding']
    # Cosine similarity of each document embedding (one per row) with the question
    similarity = np.dot(embeddings, question_embedding) \
        / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(question_embedding))
    # Indices of the five most similar documents, best first
    return np.argsort(similarity)[::-1][:5]
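
A small worked example may make the ranking step easier to follow. The vectors below are made-up three-dimensional stand-ins for real embeddings, which have many more dimensions. The sketch takes the norm of the document matrix per row (axis=1), so that each document is normalized by its own length.

```python
import numpy as np

# Three made-up document embeddings (one per row) and one question embedding
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 2.0],
    [3.0, 3.0, 0.0],
])
question_embedding = np.array([1.0, 1.0, 0.0])

# Cosine similarity: dot product divided by the product of the vector lengths
similarity = embeddings @ question_embedding / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(question_embedding))

# Indices of the documents, most similar first
ranking = np.argsort(similarity)[::-1]
```

The third document points in exactly the same direction as the question, so its similarity is 1 and it is ranked first, even though its vector is longer than the others.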

Finally, let’s implement a main() function that makes everything work together. We read the input directory and the question from command line arguments, create embeddings, find candidate documents, and ask ChatGPT to generate an answer:

import sys


def main():
    # Usage: python custom_qa.py <input-dir> <question>
    input_dir = sys.argv[1]
    question = sys.argv[2]
    embeddings, texts = create_embeddings(input_dir)
    document_candidates_idx = filter_embeddings_by_question(embeddings, question)
    document_candidates = np.array(texts)[document_candidates_idx]
    prompt = f'''
        Act as a question answering system. Using the following text and your general
        knowledge as context, answer the question: {question}
        Context: {' '.join(document_candidates)}
    '''
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=512, temperature=0.2)
    print(response['choices'][0]['message']['content'].strip())


if __name__ == '__main__':
    main()

Having everything implemented, we can now run the script and ask a question:

$ python custom_qa.py docs/ "How do I define a model? Give me an example."

To create a model in Peewee, you need to define a model class for each table. The model class defines one or more field attributes which correspond to the table’s columns. For example, in the Twitter-like app, there are three models: User, Relationship, and Message. To create these models, you need to instantiate a SqliteDatabase object and define the model classes, specifying the columns as Field instances on the class. Here’s an example:

from peewee import *

database = SqliteDatabase('my_database.db')

class BaseModel(Model):
    class Meta:
        database = database

class User(BaseModel):
    username = CharField(unique=True)
    password = CharField()
    email = CharField()
    join_date = DateTimeField()

...

We can see that our custom solution takes quite a long time to execute. Asking OpenAI to create embeddings takes much longer than using BM25. For production use, we should probably create a separate process for creating embeddings and storing them permanently. Creating document embeddings from scratch each time we ask a question doesn’t make much sense.
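
One way to avoid recomputing embeddings is to cache them on disk, keyed by a hash of the chunk text. The sketch below is an assumption about how this could look, not part of the script above: embed stands for any function that turns a text into a list of floats (for example a wrapper around the embedding API call), and the JSON cache file and the fake_embed stand-in are hypothetical choices for the demonstration.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable


def cached_embeddings(chunks: list[str], cache_path: str,
                      embed: Callable[[str], list[float]]) -> list[list[float]]:
    """Return one embedding per chunk, calling embed() only for unseen chunks."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, 'r', encoding='utf-8') as cache_file:
            cache = json.load(cache_file)
    result = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.encode('utf-8')).hexdigest()
        if key not in cache:
            cache[key] = embed(chunk)  # the expensive call happens only here
        result.append(cache[key])
    with open(cache_path, 'w', encoding='utf-8') as cache_file:
        json.dump(cache, cache_file)
    return result


# Demonstration with a fake embedding function that records its calls
calls = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text))]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'embeddings.json')
    first = cached_embeddings(['alpha', 'beta'], path, fake_embed)
    second = cached_embeddings(['alpha', 'beta', 'gamma'], path, fake_embed)
```

On the second run only the new chunk is embedded; the other two are read from the cache file.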

If we wanted, we could relatively easily extend our implementation into a chatbot. Having direct access to OpenAI’s API, we can record chat history and send it to ChatGPT as the messages argument of the ChatCompletion.create() function.
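
As a sketch of that idea, the chat history can be kept as a list of role/content dictionaries, the same shape the messages argument takes in the main() function. The helper name add_turn and the trimming strategy are assumptions for illustration, and the actual API call is left out so the snippet runs without an API key.

```python
def add_turn(messages: list[dict], role: str, content: str,
             max_messages: int = 20) -> list[dict]:
    """Append a message and keep only the most recent max_messages entries."""
    messages.append({'role': role, 'content': content})
    return messages[-max_messages:]


messages = []
messages = add_turn(messages, 'user', 'How do I define a model?')
# In a real chatbot, this reply would come from the ChatCompletion response
messages = add_turn(messages, 'assistant', 'Subclass Model and add fields.')
messages = add_turn(messages, 'user', 'And how do I make a field unique?')
# messages can now be sent as the messages argument of the next request
```

Trimming old messages keeps the request within the model’s input size limit at the cost of forgetting the earliest turns.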

Which solution is the best?

We’ve explored three different approaches to building question-answering systems: commercial solutions, Haystack (a library designed specifically for this purpose), and a custom solution built on top of well-known Python libraries. But which approach is the best? In terms of answer quality, we think that all three categories of solutions provide decent results. The choice comes down mainly to integration and customization options, convenience, and cost.

If you want to quickly build a question-answering chatbot for your website or if you don’t have enough technical knowledge, commercial solutions might be the right fit. However, it’s important to also consider various usage limitations.

If you need more customization options or if you want to integrate question-answering into other products, Haystack or similar libraries seem like the best option. They provide many tools that might come in handy when developing a question-answering system. Identification of documents used to generate the answer might also be an important advantage.

We recommend building a custom solution only if you want the maximum level of control and customization. When building from scratch, you have to deal with many small issues (such as crawling or document splitting) that libraries like Haystack solve for you, so creating a good implementation might take much longer.

Finally, a note of caution: all solutions presented in this article talk to OpenAI. They send the actual contents of candidate documents to its API. You should consider this factor especially when working with very sensitive information. In the future, we plan to explore open language models that can be installed on your company’s hardware and run in a secure environment.

Milos Svana
