Configure Datasets

Setting up Datasets for your Flow (RAG)

Mira Flows supports Retrieval-Augmented Generation (RAG) to enhance your flows with specific knowledge. This section outlines how to create and manage datasets for RAG.

Supported File Formats

Mira Flows accepts the following file formats for creating datasets:

| File Type | Processing Method |
| --- | --- |
| PDF (.pdf) | Textual content is extracted from the PDF document. |
| Markdown (.md) | Textual content is extracted from the Markdown document. |
| URL | The specified webpage's content is scraped and extracted for textual information. |
| CSV (.csv) | All URLs contained within the CSV are identified and extracted, and their respective web content is scraped. |
| Text (.txt) | Textual content is directly extracted from the plain text file. |
| Zip files (.zip) | The Zip file is decompressed and processed, provided it contains only supported file types (PDF, MD, CSV, TXT). Each file within the zip is then individually processed according to its respective file type's method. |
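
Before uploading, it can help to check locally that a file, or the contents of a zip archive, matches the formats listed above. The following is a minimal sketch using only the Python standard library; the helper names `is_supported_file` and `unsupported_zip_members` are hypothetical and not part of the Mira SDK, and the check looks only at file extensions, not at file contents.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions taken from the supported-formats table above.
# These constant and function names are illustrative, not Mira SDK names.
FILE_EXTENSIONS = {".pdf", ".md", ".csv", ".txt", ".zip"}
ZIP_MEMBER_EXTENSIONS = {".pdf", ".md", ".csv", ".txt"}


def is_supported_file(path: str) -> bool:
    """Return True if the file extension appears in the supported list."""
    return PurePosixPath(path).suffix.lower() in FILE_EXTENSIONS


def unsupported_zip_members(zip_source) -> list[str]:
    """List archive members whose extension is not PDF, MD, CSV, or TXT.

    zip_source may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower()
            not in ZIP_MEMBER_EXTENSIONS
        ]
```

An empty list from `unsupported_zip_members` means every file in the archive has one of the four extensions the table allows inside a zip.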

Creating a Dataset

To create a dataset for RAG:

python
from mira_sdk import MiraClient

client = MiraClient(config={"API_KEY": "YOUR_API_KEY"})

# Create dataset
client.dataset.create("author/dataset_name", "Optional description")

# Add a URL to your dataset (the dataset must already exist)
client.dataset.add_source("author/dataset_name", url="example.com")

# Add a file to your dataset (the dataset must already exist)
client.dataset.add_source("author/dataset_name", file_path="path/to/my/file.csv")

Link a Dataset with your Flow

Once you have created a dataset, associate it with your flow by adding the following configuration to your flow.yaml file:

yaml
# Datasets configuration

dataset:
  source: "author/dataset_name"

Best Practices

  • Ensure your PDFs are text-based and not scanned images for optimal text extraction.
  • When using URLs, make sure they are accessible and do not require authentication.
  • For CSV files, use a single column for URLs to ensure proper processing.
  • Consider the size and relevance of your dataset to optimize performance and accuracy.
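
The single-column CSV recommendation can be followed by generating the file programmatically. The sketch below uses Python's standard `csv` module; `write_url_csv` is a hypothetical helper, and writing no header row is an assumption, since the source does not say whether a header is expected.

```python
import csv


def write_url_csv(urls, csv_path):
    """Write one URL per row into a single-column CSV file.

    Surrounding whitespace is stripped; blank entries and duplicates are
    dropped, keeping first-seen order. Returns the number of rows written.
    Assumption: no header row is written.
    """
    seen = set()
    rows = []
    for url in urls:
        cleaned = url.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            rows.append([cleaned])

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)
```

The resulting file can then be passed to `client.dataset.add_source` through `file_path`, as in the dataset creation example.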