Create README.md #1 by manasvinid - opened
README.md CHANGED
@@ -1,13 +1,53 @@
## BCS_PDF_READER
A language processing pipeline for retrieval-based question answering, built with the LangChain library.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [How to use the Gradio interface](#how-to-use-the-gradio-interface)
- [Prerequisites](#prerequisites)

## Overview

This repository contains a Python script that builds a retrieval-based question-answering system with the LangChain library. The pipeline answers questions about text stored in PDF documents and has the following key components:

1. **PDF Document Loader**: The pipeline uses LangChain's `OnlinePDFLoader` to load PDF documents and extract their text for further processing.

2. **Text Splitter**: The `RecursiveCharacterTextSplitter` splits long text into manageable chunks. It tries a list of separators in order, so that chunks break at natural boundaries where possible.

3. **Embeddings Model**: The chatbot uses the `HuggingFaceHubEmbeddings` model to compute embeddings for the text chunks. These embeddings capture the semantic content of the text, which retrieval-based question answering depends on.

4. **Vector Store**: To store and retrieve embeddings, the chatbot uses `FAISS`, a high-performance similarity search library from Facebook AI. It offers fast, approximate similarity search, enabling quick retrieval of relevant documents.

5. **Question Answering (QA)**: A `RetrievalQA` component lets users ask questions about the text. It uses the embeddings to retrieve and rank the documents most relevant to the user's query.
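
To make the text-splitting step in item 2 concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain's implementation: the function name `recursive_split`, the default separator list, and the absence of chunk overlap are simplifying assumptions made for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used
    only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
pieces = recursive_split(sample, chunk_size=12)
```

Here the middle paragraph is longer than 12 characters, so only that paragraph is split again on spaces, while the two short paragraphs stay whole.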

The primary use case for this pipeline is to process PDF documents, generate embeddings, and enable users to ask questions about the document content. Whether you're conducting research, analyzing reports, or searching for information in a large document collection, this system can assist in extracting meaningful answers efficiently.
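
The retrieval idea behind this use case can be sketched without any external library. The example below is a toy stand-in, not the repository's code: `embed` uses word counts instead of a `HuggingFaceHubEmbeddings` model, and a linear scan with cosine similarity takes the place of `FAISS`. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "FAISS stores vectors for similarity search.",
    "Gradio builds the chat interface.",
    "The splitter cuts text into chunks.",
]
best = top_k("Which library handles similarity search?", chunks, k=1)
```

In a pipeline like the one described above, the retrieved chunks would then be handed to the question-answering step rather than returned directly.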

## Features

- Interact with PDF documents using a user-friendly interface.
- Ask questions and receive answers from PDF content.
- Load PDF documents or select from pre-loaded books.

## How to use the Gradio interface

- In the Gradio interface, load a PDF document or choose from the pre-loaded books.
- Type your questions in the chat window and press Enter.
- The chatbot will provide answers based on the content of the PDF document.

## Prerequisites

If you wish to use this chatbot on your own machine, ensure you have the following dependencies installed:

- Python 3.7+
- Pip
- Required Python packages (install them using `pip`):
  - LangChain
  - PyPDF
  - Sentence-Transformers
  - Faiss-CPU
  - NumPy
  - Pandas
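
A quick way to check the Python packages above is to test whether each one can be imported. The sketch below assumes the usual PyPI and import names for these packages (for example, `faiss-cpu` imports as `faiss`, and PyPDF is assumed to mean the `pypdf` distribution); adjust the hypothetical `REQUIRED` mapping if the script depends on different distributions. Running it as a script prints a `pip install` command for anything that is missing.

```python
import importlib.util

# Assumed mapping: pip distribution name -> import name.
REQUIRED = {
    "langchain": "langchain",
    "pypdf": "pypdf",
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "numpy": "numpy",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages. Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```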

---
License: mit
---