databonsai: a Python library for data cleaning with LLMs
databonsai - clean & curate your data with LLMs. GitHub: github.com/databonsai/databonsai
databonsai is a Python library that uses LLMs to perform data cleaning tasks.
- Suite of tools for data processing using LLMs, including categorization, transformation, and extraction
- Validation of LLM outputs
- Batch processing for token savings
- Retry logic with exponential backoff for handling rate limits and transient errors
pip install databonsai
Store your API keys in a .env file at the root of your project, or pass them as arguments when initializing the provider.
OPENAI_API_KEY=xxx # if you use OpenAIProvider
ANTHROPIC_API_KEY=xxx # if you use AnthropicProvider
Set up the LLM provider and categories (as a dictionary):
from databonsai.categorize import MultiCategorizer, BaseCategorizer
from databonsai.llm_providers import OpenAIProvider, AnthropicProvider
provider = OpenAIProvider() # or AnthropicProvider(). Haiku (the default AnthropicProvider() model) is highly recommended, as it is cheap and effective for these tasks
categories = {
"Weather": "Insights and remarks about weather conditions.",
"Sports": "Observations and comments on sports events.",
"Politics": "Political events related to governments, nations, or geopolitical issues.",
"Celebrities": "Celebrity sightings and gossip",
"Others": "Comments that do not fit into any of the above categories",
"Anomaly": "Data that does not look like comments or natural language",
}
few_shot_examples = [
{"example": "Big stormy skies over city", "response": "Weather"},
{"example": "The team won the championship", "response": "Sports"},
{"example": "I saw a famous rapper at the mall", "response": "Celebrities"},
]
Categorize your data:
categorizer = BaseCategorizer(
categories=categories,
llm_provider=provider,
examples=few_shot_examples,
# strict=False,  # defaults to True; set to False to allow categories not in the provided dict
)
category = categorizer.categorize("It's been raining outside all day")
print(category)
Output:
Weather
Use categorize_batch to categorize a batch. This saves tokens, because the schema and few-shot examples are sent only once. (Works best with stronger models; ideally, provide at least three few-shot examples.)
categories = categorizer.categorize_batch([
"Massive Blizzard Hits the Northeast, Thousands Without Power",
"Local High School Basketball Team Wins State Championship After Dramatic Final",
"Celebrated Actor Launches New Environmental Awareness Campaign",
])
print(categories)
Output:
['Weather', 'Sports', 'Celebrities']
If you have a pandas DataFrame or list, use apply_to_column_autobatch.
- Batching data for LLM API calls saves tokens by not repeating the prompt for every row. However, too large a batch size or too complex a task can lead to errors; naturally, the better the LLM model, the larger the batch size you can use.
- Batching is handled adaptively: the batch size increases while responses are valid, and is reduced (with a decay factor) when they are not.
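The adaptive strategy described above can be sketched in plain Python. This is a simplified illustration, not databonsai's actual implementation; `process_batch` is a hypothetical stand-in for the LLM call.

```python
def autobatch(items, process_batch, start_size=10, decay=0.5, min_size=1):
    """Process items in batches, shrinking the batch size on failure.

    process_batch(batch) should return one output per input, or raise
    ValueError on an invalid LLM response. Simplified sketch only.
    """
    results = []
    size = start_size
    i = 0
    while i < len(items):
        batch = items[i : i + size]
        try:
            out = process_batch(batch)
            if len(out) != len(batch):
                raise ValueError("output count mismatch")
            results.extend(out)
            i += len(batch)
            size += 1  # cautiously grow the batch after a success
        except ValueError:
            if size <= min_size:
                raise  # even a single item fails: give up
            size = max(min_size, int(size * decay))  # shrink and retry
    return results
```

Here a successful batch nudges the size up and a failed one multiplies it by the decay factor; databonsai's real heuristics may differ.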
Other features:
- Progress bar
- Returns the last successful index, so you can resume from there if max_retries is exceeded
- Modifies your output list in place, so you don't lose any progress
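The resume pattern those last two features enable looks roughly like this. The function and parameter names here are hypothetical stand-ins, not databonsai's real API; check the databonsai docs for the actual signature.

```python
def apply_with_resume(inputs, outputs, process, max_retries=3):
    """Append results to `outputs` in place and return the index of the
    last successfully processed input, so a caller can resume from there.
    Simplified stand-in for the behaviour described above, not real code.
    """
    last_ok = len(outputs) - 1                # skip prior progress
    for i in range(len(outputs), len(inputs)):
        for attempt in range(max_retries):
            try:
                outputs.append(process(inputs[i]))
                last_ok = i
                break
            except ValueError:
                if attempt == max_retries - 1:
                    return last_ok            # give up, keep progress so far
    return last_ok
```

Because `outputs` is mutated in place, a second call after fixing the offending rows picks up exactly where the first one stopped.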
Retry Logic:
- LLM providers have built-in retry logic for API-related errors; this can be configured on the provider.
- The retry logic in apply_to_column_autobatch handles invalid responses (e.g. an unexpected category, or a different number of outputs than inputs).
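Exponential backoff for transient API errors, as mentioned in the feature list, follows a generic pattern like the one below. This is not databonsai's actual code; `ConnectionError` stands in for whatever rate-limit or transient exception the provider raises.

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on transient errors, doubling the wait each time and
    adding jitter. Generic backoff sketch, not databonsai's implementation.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except ConnectionError:  # stand-in for rate-limit / transient API errors
            if attempt == max_retries - 1:
                raise            # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The doubling delay (1s, 2s, 4s, ...) gives a rate-limited API time to recover, while the jitter prevents many clients from retrying in lockstep.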