LINER PDF Chat Tutorial

https://github.com/liner-engineering/liner-pdf-chat-tutorial/tree/main

Attached sample file: transformer.pdf (2.10MB)

This tutorial walks through the code for a question-answering chatbot that uses ChatGPT to answer questions grounded in a PDF file.

The tutorial proceeds in three main stages.

PDF-to-Text
Text Preprocessing
Vector Search

1. PDF-to-Text

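This stage extracts plain text that a language model can understand from the PDF file. It splits into two parts: PDF-to-Image, which converts the PDF into page images, and Image-to-Text, which extracts text from those images.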

1.1. PDF-to-Image

This step uses the pdf2image library, which requires poppler to be installed.

# Install pdf2image
pip install pdf2image

# Install poppler (macOS: install Homebrew first, then poppler)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
brew install poppler

# Convert each PDF page to an image
from pdf2image import convert_from_path
FILE_NAME = "transformer.pdf"
images = convert_from_path(FILE_NAME)

# Save the page images locally for the next step
import os
DIC = "image"
os.makedirs(DIC, exist_ok=True)

for i, image in enumerate(images):
    image.save(f"{DIC}/page_{i}.jpg", "JPEG")

1.2. Image-to-Text

The tutorial uses Google OCR, and notes that other OCR options (e.g. HuggingFace, Tesseract, ...) can be used instead if preferred.

To use Google OCR, you first need to register for the Vision API on Google Cloud Platform.

pip install --upgrade google-cloud-vision

# Set the Vision API credentials (replace 'JSON_FILE' with the path to your service-account key file)
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'JSON_FILE'


# Import the Google OCR library
from tqdm import tqdm
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Extract text from an image file with Google OCR
def detect_text(path: str):
    with open(path, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    response = client.text_detection(image=image)
    return response.full_text_annotation

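Using the raw Google OCR result as-is yields text in which the break information at the end of each line (spaces, newlines, and so on) is lost, for example an unnecessary newline after "Numerous". A post-processing step therefore has to apply the Break Detection results to reassemble the text correctly.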

# Post-processing method that applies the Break Detection results
breaks = vision.TextAnnotation.DetectedBreak.BreakType

def postprocess_ocr(annotation) -> str:
    text = ""

    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        detected_break = symbol.property.detected_break
                        detected_break_type = detected_break.type_

                        if detected_break_type == breaks.UNKNOWN:
                            text += symbol.text
                        elif detected_break_type == breaks.SPACE:
                            text += f"{symbol.text} "
                        elif detected_break_type == breaks.SURE_SPACE:
                            text += f"{symbol.text} "
                        elif detected_break_type == breaks.EOL_SURE_SPACE:
                            text += f"{symbol.text} "
                        elif detected_break_type == breaks.HYPHEN:
                            text += f"{symbol.text}-"
                        elif detected_break_type == breaks.LINE_BREAK:
                            text += f"{symbol.text}\n"

    return text.strip()

# Apply OCR and Break Detection post-processing to every page
documents = []
for i in tqdm(range(len(images))):
    documents.append(
        {
            "page": i + 1,
            # Read from the directory the page images were saved to in step 1.1
            "text": postprocess_ocr(detect_text(f"{DIC}/page_{i}.jpg")),
        }
    )
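
To see how the break types turn into separators without calling the Vision API, the sketch below replays the same mapping on hand-built data. `BreakType`, `Symbol`, `word`, and `join_symbols` are hypothetical stand-ins written for this illustration; they are not part of google-cloud-vision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class BreakType(Enum):
    """Hypothetical stand-in for the Vision API break-type enum."""
    UNKNOWN = 0
    SPACE = 1
    SURE_SPACE = 2
    EOL_SURE_SPACE = 3
    HYPHEN = 4
    LINE_BREAK = 5


# Separator appended after a symbol, mirroring the branches in postprocess_ocr.
SEPARATORS = {
    BreakType.UNKNOWN: "",
    BreakType.SPACE: " ",
    BreakType.SURE_SPACE: " ",
    BreakType.EOL_SURE_SPACE: " ",
    BreakType.HYPHEN: "-",
    BreakType.LINE_BREAK: "\n",
}


@dataclass
class Symbol:
    """Hypothetical stand-in for one OCR symbol and the break that follows it."""
    text: str
    break_type: BreakType = BreakType.UNKNOWN


def word(chars: str, last_break: BreakType) -> List[Symbol]:
    """Build a word whose final symbol carries the given break type."""
    symbols = [Symbol(c) for c in chars]
    symbols[-1].break_type = last_break
    return symbols


def join_symbols(symbols: List[Symbol]) -> str:
    """Concatenate symbols, appending the separator implied by each break."""
    return "".join(s.text + SEPARATORS[s.break_type] for s in symbols).strip()


# "Numerous" ends a visual line but the sentence continues, so it gets a space.
symbols = word("Numerous", BreakType.EOL_SURE_SPACE) + word("efforts", BreakType.LINE_BREAK)
rebuilt = join_symbols(symbols)
print(repr(rebuilt))
```

Only the last symbol of each word carries a break in this toy input, which is why the separator is attached per symbol rather than per word.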

2. Text Preprocessing

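This stage refines the text into units a language model can understand more easily. It includes Text Cleansing, which removes unnecessary text, and Text Chunking, which splits the text into smaller semantic units. Because service quality generally depends heavily on document preprocessing, it is worth putting more effort into this stage than the tutorial code does.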

2.1. Text Cleansing

Text Cleansing logic can differ by domain. The tutorial performs only minimal cleansing.

import re
from typing import List, Optional

citation_pattern = r"\[\d+\]"

def cleanse_text(text: str) -> Optional[str]:
    # Filter out very short lines
    if len(text) <= 5:
        return None

    # Remove citation markers such as [12]
    text = re.sub(citation_pattern, "", text)

    # Collapse runs of multiple spaces
    text = re.sub(" +", " ", text)
    return text
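
As a quick check of what this minimal cleansing does and does not handle, the sketch below runs a copy of the function on a few made-up lines. The sample strings are invented for illustration and do not come from the PDF.

```python
import re
from typing import Optional

CITATION_PATTERN = r"\[\d+\]"


def cleanse_text(text: str) -> Optional[str]:
    """Copy of the tutorial's cleansing logic, repeated so the sketch runs on its own."""
    if len(text) <= 5:
        return None
    text = re.sub(CITATION_PATTERN, "", text)
    text = re.sub(" +", " ", text)
    return text


# Made-up sample lines: a stray page number, a citation before a period,
# and a line with repeated spaces and a citation in the middle.
samples = [
    "3",
    "Attention is all you need [12].",
    "Recurrent   models  factor computation [1] along positions.",
]
cleaned = [cleanse_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The second sample keeps a space before the period after the citation is removed, which is the kind of domain-specific detail a fuller cleansing step would need to address.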

2.2. Text Chunking

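Common approaches split by paragraph or by token count; for convenience, this tutorial implements splitting by token count. It is best to use whichever splitting logic suits your purpose.

OpenAI provides the tiktoken library, a tokenizer that can be used to count the tokens in a text, and the tutorial uses it for token-count-based chunking.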

import tiktoken

# Use `cl100k_base`, the encoding used by ChatGPT, as the default encoding
tokenizer = tiktoken.get_encoding("cl100k_base")

# Token count at which a chunk is closed (a chunk may slightly exceed it)
CHUNK_SIZE = 256

# Count the tokens in the input text
def num_tokens_from_text(text: str) -> int:
    num_tokens = len(tokenizer.encode(text))
    return num_tokens

# Split a document into chunks by token count
def chunkify(text: str) -> List[str]:
    lines = text.split("\n")

    chunks = []

    chunk = ""
    for line in lines:
        line = cleanse_text(line)
        if line is None:
            continue

        chunk += f" {line}"

        if num_tokens_from_text(chunk) >= CHUNK_SIZE:
            chunks.append(chunk.strip())
            chunk = ""

    # Append the final chunk if anything is left over
    if chunk:
        chunks.append(chunk.strip())

    return chunks

# Apply the Text Chunking logic to every document
chunked_documents = []
for document in documents:
    chunks = chunkify(document["text"])
    for chunk in chunks:
        chunked_documents.append(
            {
                "page": document["page"],
                "text": chunk,
            }
        )
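
For comparison with the token-count approach, here is a minimal sketch of paragraph-based splitting. It is not part of the tutorial. It assumes paragraphs are separated by blank lines and uses a character budget instead of a token budget to avoid extra dependencies; `chunk_by_paragraph` and `max_chars` are hypothetical names.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> List[str]:
    """Split on blank lines, then merge neighbouring paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate

    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph is longer than the others."
chunks = chunk_by_paragraph(sample, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

A paragraph longer than the budget is kept whole here rather than cut mid-sentence; whether that trade-off is acceptable depends on the context limit of the model in use.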

3. Vector Search

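This stage adds the documents to a vector search engine and queries it to retrieve documents that match the user's question. It includes Embedding, which vectorizes the documents, and Hybrid Search, which retrieves the embedded documents.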

3.1. Embedding

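This step converts the text of the documents to be registered in the search engine into vectors.

The tutorial uses the Ada V2 Embedding, but an error led me to look for another approach.

(Tutorial method)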

pip install openai==0.28

# Import the openai library
import openai

# Set the OpenAI API key
openai.api_key = "YOUR_OPENAI_KEY"

# Iterate over the prepared documents, extract a vector for each, and attach it to the document object
for chunked_document in tqdm(chunked_documents):
    embedding = openai.Embedding.create(
        input=chunked_document["text"],
        model="text-embedding-ada-002",
    )["data"][0]["embedding"]

    chunked_document["embedding"] = embedding

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

(New vector embedding method)

pip install sentence-transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')

for chunked_document in chunked_documents:
    chunked_document['page'] = str(chunked_document['page'])
    # encode() returns a numpy array; convert it to a plain list
    chunked_document['embedding'] = model.encode(chunked_document['text']).tolist()

** The embedding cannot be stored as a numpy array, so it must be converted with tolist().

3.2. Hybrid Search

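This step retrieves documents that can serve as references for the user's question. The tutorial uses Pinecone.

First, create the index to use in Pinecone. The sentence-transformers model used above returns 768-dimensional vectors, so set Dimensions to 768 and set Metric to the similarity metric you want to use for search.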
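
To make the Metric choice concrete, the sketch below compares three common similarity measures (cosine similarity, dot product, Euclidean distance) on toy 2-dimensional vectors. The vectors and helper names are made up for illustration; real embeddings here would have 768 dimensions.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: ignores vector length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # same direction as the query, three times longer
orthogonal = [0.0, 1.0]       # unrelated direction

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        "cosine:", cosine_similarity(query, vec),
        "dot:", dot(query, vec),
        "euclidean:", euclidean_distance(query, vec),
    )
```

Cosine similarity treats `same_direction` as a perfect match even though it is longer than the query, while Euclidean distance penalizes the length difference, so the metric should match how the embedding model was trained to be compared.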

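# pinecone.init() used below belongs to the v2 client; v3 and later replaced it with a Pinecone class
pip install -U "pinecone-client<3"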

# Import the pinecone library
import pinecone

# Set the Pinecone API key
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# List the registered Pinecone indexes
active_indexes = pinecone.list_indexes()

# Check the result
active_indexes
>>> ['test--pdf-chat']

# Build the vector data as tuples to register the documents in Pinecone
vectors = [
    (
        f"vec{i}",                       # document ID
        chunked_document["embedding"],   # vector
        {                                # document metadata dictionary
            "text": chunked_document["text"],
            "page": chunked_document["page"],
            "file": FILE_NAME,
        },
    )
    for i, chunked_document in enumerate(chunked_documents)
]

# Select the index
index = pinecone.Index("test--pdf-chat")

# Upsert the vector data created above into the index
index.upsert(
    vectors=vectors,
    namespace="pdf-chat",
)
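
For larger PDFs, a single upsert request may become too large, so it can help to send the vectors in batches. The sketch below shows only the batching helper on toy data; `batched` is a hypothetical helper and the batch size of 100 is an assumption, not a value from the tutorial.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy stand-in for the `vectors` list built above: (id, values, metadata) tuples.
toy_vectors = [(f"vec{i}", [0.0, 0.0], {"page": "1"}) for i in range(250)]
batch_sizes = [len(batch) for batch in batched(toy_vectors, 100)]
print(batch_sizes)

# With the real index, each batch would be sent separately, for example:
# for batch in batched(vectors, 100):
#     index.upsert(vectors=list(batch), namespace="pdf-chat")
```

Because upsert overwrites vectors that share an ID, re-sending a failed batch is safe as long as the IDs stay the same.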

# Check that the search engine retrieves documents matching the user query
# Method for vectorizing the user query
def query_embed(text: str) -> List[float]:
    return model.encode(text).tolist()

# Vectorize the user query
query_vector = query_embed("What advantages do transformers have over RNNs?")

# Hybrid Search using the user query vector together with the `filter` logic
query_response = index.query(
    namespace="pdf-chat",
    top_k=10,
    include_values=True,
    include_metadata=True,
    vector=query_vector,
    filter={
        "file": {"$in": [FILE_NAME]},
    }
)
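
To check the results, each match can be printed with its score, page, and text. The sketch below runs on a hand-written mock, assuming the response exposes a `matches` list whose items carry `score` and `metadata` and that dict-style access works on it. `summarize_matches`, `min_score`, and the mock texts and scores are invented for illustration.

```python
from typing import Any, Dict, List


def summarize_matches(response: Dict[str, Any], min_score: float = 0.0) -> List[str]:
    """Format each match at or above min_score as '[score] page N: text'."""
    lines = []
    for match in response["matches"]:
        if match["score"] < min_score:
            continue
        meta = match["metadata"]
        lines.append(f'[{match["score"]:.2f}] page {meta["page"]}: {meta["text"]}')
    return lines


# Mock response with made-up scores and texts, shaped like the assumed query result.
mock_response = {
    "matches": [
        {
            "id": "vec3",
            "score": 0.83,
            "metadata": {
                "text": "Self-attention connects all positions with a constant number of operations.",
                "page": "6",
                "file": "transformer.pdf",
            },
        },
        {
            "id": "vec7",
            "score": 0.41,
            "metadata": {
                "text": "Recurrent models factor computation along symbol positions.",
                "page": "2",
                "file": "transformer.pdf",
            },
        },
    ]
}

lines = summarize_matches(mock_response, min_score=0.5)
for line in lines:
    print(line)
```

Filtering by a minimum score is optional; a sensible threshold depends on the metric chosen for the index and on the embedding model.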
