# Knowledge Extraction

Knowledge Extraction pulls FAQ content from external sources—web pages, PDFs, and CSV files—and lets you review and add it to your Knowledge Graph.

**Workflow:**

1. **Extract** — Pull Q\&A pairs from a supported source (PDF, URL, or CSV).
2. **Edit** — Review and edit extracted questions and answers.
3. **Move** — Drag extracted content into KG nodes. If no KG exists, one is created automatically.

**Two ways to add extracted content:**

* **Add to Knowledge Graph** — Moves selected questions to the root node.
* **Add to Specific Term** — Drag and drop to a specific node (requires an existing KG).

***

## Extract from a URL

1. Go to **Automation AI > Knowledge AI > FAQs > ⋯ > Manage Extracts**.
2. Click **Extract from URL**.
3. Enter a **Name** and the **URL**, then click **Proceed**.
4. After extraction completes, click **Review & Add** to add questions to the KG.

***

## Extract from a File

File size limit: 5 MB. Supported formats: PDF, CSV.

1. Go to **Automation AI > Knowledge AI > FAQs > ⋯ > Manage Extracts**.
2. Click **Extract from file**.
3. Click **Browse** and select your file.
4. Click **Proceed**.
5. For PDFs, you can optionally annotate before extraction. See [Annotate & Extract](#annotate-&-extract-pdf-only).
6. After extraction, click **Review & Add**.

### Annotate & Extract (PDF only)

Use this when your PDF is not in a format the XO Platform can extract from directly. Annotating teaches the KG engine where the questions and answers are.

1. Select a PDF (new, or previously extracted with no questions added to the KG yet).

2. Click **Annotate & Extract**.

3. The PDF loads in the Annotation Tool. Select text and tag it:

   | Tag             | Effect                                                                                 |
   | --------------- | -------------------------------------------------------------------------------------- |
   | **Heading**     | Marks the question. Content between two consecutive headings is treated as the answer. |
   | **Header**      | Ignored during extraction. Trains the model to recognize and skip headers.             |
   | **Footer**      | Ignored during extraction. Trains the model to recognize and skip footers.             |
   | **Exclude**     | Not used for extraction.                                                               |
   | **Ignore Page** | Entire page is skipped.                                                                |

4. Annotate a few pages, then click **Extract** to review. Re-annotate if results are unsatisfactory.

5. After extraction, click **Review & Add** to add questions to the KG.

<Note>
  You can re-annotate only if no questions from the file have been added to the KG yet. If questions were already added, create a copy of the annotated document to work with.
</Note>
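
The **Heading** rule in the tag table can be pictured with a small sketch. This is only an illustration of the rule as described, not the platform's extraction code: the `pair_up` function and the `(tag, text)` block representation are hypothetical, and treating the text after the last heading as that heading's answer is an assumption.

```python
def pair_up(blocks):
    """Turn annotated blocks into (question, answer) pairs.

    blocks: list of (tag, text) tuples. tag is "Heading", "Header",
    "Footer", "Exclude", or None for untagged text (hypothetical model).
    """
    pairs = []
    question = None
    answer_parts = []
    for tag, text in blocks:
        if tag == "Heading":
            # A new heading closes the previous question, if there was one.
            if question is not None:
                pairs.append((question, " ".join(answer_parts)))
            question, answer_parts = text, []
        elif tag is None and question is not None:
            # Untagged text after a heading belongs to that heading's answer.
            answer_parts.append(text)
        # Header, Footer, and Exclude blocks are skipped.
    if question is not None:
        # Assumption: text after the last heading is its answer.
        pairs.append((question, " ".join(answer_parts)))
    return pairs


blocks = [
    ("Header", "Acme FAQ"),
    ("Heading", "How do I reset my password?"),
    (None, "Open Settings."),
    (None, "Click Reset."),
    ("Footer", "Page 1"),
    ("Heading", "Where is my invoice?"),
    (None, "See Billing."),
]
pairs = pair_up(blocks)
```

In this example the header and footer text never reach an answer, and the two untagged lines between the headings are joined into one answer for the first question.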

***

## Edit Extracted Content

1. Go to **Automation AI > Knowledge AI > FAQs > ⋯ > Manage Extracts**.
2. Click a successful extract.
3. Hover over a Q\&A pair and click **Edit**.
4. Make changes and click **Save**.

***

## Add Extracted Content to the KG

**From Manage Extracts:**

1. Go to **Manage Extracts** and open a successful extract.
2. Drag and drop Q\&A pairs to the target node (child nodes expand during drag).
3. To move several pairs at once, multi-select them before dragging.

**From the Knowledge Graph:**

1. Select the target node in the KG.
2. Click **Add from Extraction**.
3. Select a successful extract.
4. Check the Q\&A pairs to add and click **Add**.

<Note>
  Once a Q\&A pair is moved to the KG, it cannot be moved again. If the question is later modified or deleted from the KG, you can re-add it from the extract.
</Note>

***

## Supported Formats

### CSV

* Column 1: question; Column 2: answer.
* No header row is allowed. Columns beyond the second are ignored.
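
As a rough pre-upload check, the sketch below writes a two-column CSV without a header row and then looks for obvious problems. The function names are hypothetical, the header check is only a heuristic, and reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption; the platform's own validation may differ.

```python
import csv
import os

# Assumption: the 5 MB upload limit is read here as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024


def write_faq_csv(pairs, path):
    """Write (question, answer) pairs as a two-column CSV with no header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for question, answer in pairs:
            writer.writerow([question, answer])


def check_faq_csv(path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 5 MB")
    with open(path, newline="", encoding="utf-8") as f:
        for row_number, row in enumerate(csv.reader(f), start=1):
            if len(row) < 2 or not row[0].strip() or not row[1].strip():
                problems.append(f"row {row_number}: needs a question and an answer")
            elif row_number == 1 and row[0].strip().lower() == "question":
                # Heuristic only: a first cell of "question" suggests a header row.
                problems.append("row 1: looks like a header row")
    return problems
```

Calling `check_faq_csv("faq.csv")` on a file produced by `write_faq_csv` should return an empty list as long as every pair has both a question and an answer.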

### PDF

* **With table of contents:** The extraction service uses the ToC to derive heading hierarchy (`heading | subheading | sub-subheading`).
* **Without table of contents:** The extraction service uses a pre-trained ML model to detect headings by font style or size.
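
To make the `heading | subheading | sub-subheading` notation concrete, here is a minimal sketch that builds such paths from a list of ToC entries. The `toc_paths` function and the `(level, title)` input shape are hypothetical; how the extraction service actually reads a PDF's table of contents is not described here.

```python
def toc_paths(entries):
    """Build 'heading | subheading | ...' paths from (level, title) entries.

    level 1 is a top-level heading, level 2 a subheading, and so on.
    """
    stack = []
    paths = []
    for level, title in entries:
        # Drop any titles at this depth or deeper, then add the new one.
        del stack[level - 1:]
        stack.append(title)
        paths.append(" | ".join(stack))
    return paths


entries = [
    (1, "Billing"),
    (2, "Invoices"),
    (3, "Download an invoice"),
    (2, "Refunds"),
]
paths = toc_paths(entries)
# paths[2] is "Billing | Invoices | Download an invoice"
# paths[3] is "Billing | Refunds"
```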

### Web Pages

Supported page layouts:

* Linear Q\&A pairs.
* Questions with hyperlinks pointing to answers on the same page.
* Questions with hyperlinks pointing to answers on a different page.

**Extraction fails for a question when:**

* Question text spans multiple HTML tags.
* The answer tag is not a child or sibling of the question in the DOM.
* The question has no hyperlink to its answer (for hyperlink-based layouts).
* The linked answer page does not repeat the question above the answer.

**Entire page extraction fails when:**

* The page mixes more than one of the supported page layouts listed above.
