Configure Datasets
Setting up Datasets for your Flow (RAG)
Mira Flows supports Retrieval-Augmented Generation (RAG), which lets a flow draw on knowledge from datasets you provide. This section explains how to create and manage datasets for RAG.
Supported File Formats
Mira Flows accepts the following file formats for creating datasets:
File Type | Processing Method |
---|---|
PDF (.pdf) | Textual content is extracted from the PDF document |
Markdown (.md) | Textual content is extracted from the Markdown document |
URL | The webpage at the specified URL is scraped and its textual content is extracted |
CSV (.csv) | All URLs in the CSV are identified, and the web content of each URL is scraped |
Text (.txt) | Textual content is directly extracted from the plain text file |
Zip files (.zip) | The archive is decompressed, provided it contains only supported file types (PDF, MD, CSV, TXT), and each file inside is processed according to its own file type's method |
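Checking files locally before uploading can catch unsupported types early. The following is a minimal sketch based only on the extensions listed in the table above; the helper names `is_supported_file` and `unsupported_zip_members` are hypothetical and not part of the Mira SDK, and the case-insensitive extension matching is an assumption about how files are recognized.

```python
import zipfile
from pathlib import Path

# Extensions taken from the supported-formats table; helper names are hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".csv", ".txt"}


def is_supported_file(path: str) -> bool:
    """Return True if the extension is a supported file type or a .zip archive."""
    suffix = Path(path).suffix.lower()  # assumption: matching is case-insensitive
    return suffix in SUPPORTED_EXTENSIONS or suffix == ".zip"


def unsupported_zip_members(zip_path: str) -> list[str]:
    """List archive members whose extensions are not PDF, MD, CSV, or TXT."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
            and Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS
        ]
```

An empty list from `unsupported_zip_members` suggests the archive meets the "only supported file types" condition; any names returned are the files to remove before uploading.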
Creating a Dataset
To create a dataset for RAG:
```python
from mira_sdk import MiraClient

client = MiraClient(config={"API_KEY": "YOUR_API_KEY"})

# Create the dataset
client.dataset.create("author/dataset_name", "Optional description")

# Add a URL to your dataset (the dataset must already exist)
client.dataset.add_source("author/dataset_name", url="example.com")

# Add a file to your dataset (the dataset must already exist)
client.dataset.add_source("author/dataset_name", file_path="path/to/my/file.csv")
```
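When a dataset has several sources, the `add_source` call shown above can be wrapped in a loop. This is a minimal sketch, not part of the SDK: `add_sources` is a hypothetical helper, and it relies only on `client.dataset.add_source` being called with the `url` and `file_path` keyword arguments used in the example above.

```python
from typing import Iterable


def add_sources(
    client,
    dataset_id: str,
    urls: Iterable[str] = (),
    file_paths: Iterable[str] = (),
) -> int:
    """Add each URL and file to an existing dataset; return how many were added."""
    added = 0
    for url in urls:
        client.dataset.add_source(dataset_id, url=url)
        added += 1
    for file_path in file_paths:
        client.dataset.add_source(dataset_id, file_path=file_path)
        added += 1
    return added


# Usage (assumes the dataset was already created with client.dataset.create):
# add_sources(
#     client,
#     "author/dataset_name",
#     urls=["example.com"],
#     file_paths=["path/to/my/file.csv"],
# )
```

The helper does no error handling, so an exception from any single `add_source` call stops the loop at that source.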
Link a Dataset with your Flow
Once you have created a dataset, you can associate it with your flow by adding the following configuration to your flow.yaml file.
```yaml
# Datasets configuration
dataset:
  source: "author/dataset_name"
```
Best Practices
- Ensure your PDFs are text-based and not scanned images for optimal text extraction.
- When using URLs, make sure they are accessible and do not require authentication.
- For CSV files, use a single column for URLs to ensure proper processing.
- Keep your dataset focused: its size and relevance to the flow's task both affect performance and accuracy.
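The single-column CSV recommended above can be generated with Python's standard `csv` module. This is a minimal sketch under stated assumptions: `write_url_csv` is a hypothetical helper, the file is written without a header row, and entries are kept only if they are full `http` or `https` URLs. That filter is a local sanity check chosen for this example, not a documented Mira requirement.

```python
import csv
from urllib.parse import urlparse


def write_url_csv(urls, csv_path):
    """Write http(s) URLs to a single-column CSV, one per row; return the URLs kept."""
    kept = []
    for url in urls:
        candidate = url.strip()
        parsed = urlparse(candidate)
        # Assumption: keep only absolute http(s) URLs with a host part.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            kept.append(candidate)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for url in kept:
            writer.writerow([url])  # one URL per row, single column
    return kept
```

The resulting file can then be passed as `file_path` when adding a source to an existing dataset.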