Show HN: A Comprehensive AI Data Quality Evaluation Tool


👋 join us on Discord and WeChat

2024/12/27: Project Initialization

Dingo is a data quality evaluation tool that automatically detects quality issues in your datasets. It provides a variety of built-in rules and model-based evaluation methods, and also supports custom evaluation methods. Dingo works with commonly used text and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. It can be used through a local CLI or an SDK, which makes it easy to integrate into evaluation platforms such as OpenCompass.

1. Evaluate LLM chat data

from dingo.config.config import DynamicLLMConfig
from dingo.io.input.Data import Data
from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
from dingo.model.rule.rule_common import RuleEnterAndSpace

# A single chat sample: the prompt and the model's response
data = Data(
    data_id='123',
    prompt="hello, introduce the world",
    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
)

# Evaluate the sample with an LLM-based quality model
def llm():
    LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
        key='YOUR_API_KEY',
        api_url='https://api.openai.com/v1/chat/completions',
        model='gpt-4o',
    )
    res = LLMTextQualityModelBase.eval(data)
    print(res)

# Evaluate the sample with a built-in rule
def rule():
    res = RuleEnterAndSpace().eval(data)
    print(res)

from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate a dataset from Hugging Face
input_data = {
    "eval_group": "sft",               # Rule set for SFT data
    "input_path": "tatsu-lab/alpaca",  # Dataset from Hugging Face
    "data_format": "plaintext",        # Format: plaintext
    "save_data": True                  # Save evaluation results
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)

python -m dingo.run.cli --input_path data.txt --dataset local -e sft --data_format plaintext --save_data True
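The same command can be assembled from a script, which helps when several files need to be evaluated in a loop. The sketch below only builds the argument list from the flags shown above. `build_rule_eval_command` is a hypothetical helper, not part of Dingo, and actually running the result assumes Dingo is installed in the active Python environment.

```python
import sys


def build_rule_eval_command(input_path, eval_group="sft", data_format="plaintext"):
    """Return the argv list for the rule-based CLI call shown above.

    Hypothetical helper: it mirrors the documented flags and does not
    validate them against Dingo itself.
    """
    return [
        sys.executable, "-m", "dingo.run.cli",
        "--input_path", str(input_path),
        "--dataset", "local",
        "-e", eval_group,
        "--data_format", data_format,
        "--save_data", "True",
    ]


command = build_rule_eval_command("data.txt")
# Show the command without the interpreter path
print(" ".join(command[1:]))
```

The list can then be passed to `subprocess.run(command, check=True)` to launch the evaluation.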

Evaluate with LLM (e.g., GPT-4o)

python -m dingo.run.cli --input_path data.json --dataset local -e openai --data_format json --column_content text --custom_config config_gpt.json --save_data True

Example config_gpt.json:

{
  "llm_config": {
    "openai": {
      "model": "gpt-4o",
      "key": "YOUR_API_KEY",
      "api_url": "https://api.openai.com/v1/chat/completions"
    }
  }
}
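To avoid committing an API key to version control, the config file can be generated at run time. The following is a minimal sketch that writes a file with the same structure as the example above. `write_gpt_config` is a hypothetical helper, and the `OPENAI_API_KEY` environment variable name is an assumption, not something Dingo requires.

```python
import json
import os
from pathlib import Path


def write_gpt_config(path, api_key, model="gpt-4o",
                     api_url="https://api.openai.com/v1/chat/completions"):
    """Write a config file shaped like the config_gpt.json example."""
    config = {
        "llm_config": {
            "openai": {
                "model": model,
                "key": api_key,
                "api_url": api_url,
            }
        }
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# OPENAI_API_KEY is an assumed variable name; fall back to the placeholder.
api_key = os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY")
config = write_gpt_config("config_gpt.json", api_key)

# Print only the field names so the key itself never reaches the logs.
print(sorted(config["llm_config"]["openai"]))
```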

After evaluation (with save_data=True), a frontend page will be automatically generated. To manually start the frontend:

python -m dingo.run.vsl --input output_directory

Where output_directory contains the evaluation results with a summary.json file.
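The results can also be inspected without the frontend. The sketch below loads summary.json from an output directory and prints its top-level fields. The file's schema is not described here, so the code assumes only that it holds a JSON object; `load_summary` and `print_summary` are hypothetical helpers rather than Dingo functions.

```python
import json
from pathlib import Path


def load_summary(output_directory):
    """Load summary.json from an evaluation output directory."""
    summary_path = Path(output_directory) / "summary.json"
    if not summary_path.is_file():
        raise FileNotFoundError(f"no summary.json in {output_directory}")
    with summary_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def print_summary(summary):
    """Print each top-level field; assumes the summary is a JSON object."""
    for key, value in summary.items():
        print(f"{key}: {value}")
```

A typical call would be `print_summary(load_summary("output_directory"))`, using the same directory that is passed to `--input` above.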

Try Dingo on our online demo: (Hugging Face)🤗

Try Dingo locally:

cd app_gradio
python app.py

You can also try Dingo interactively in a Google Colab notebook.

Dingo includes an experimental Model Context Protocol (MCP) server. For details on running the server and integrating it with clients like Cursor, please see the dedicated documentation:

English · 简体中文 · 日本語

To help you get started quickly with Dingo MCP, we've created a video walkthrough:

mcp_demo.mp4

This video demonstrates step-by-step how to use Dingo MCP server with Cursor.

Dingo classifies data quality issues into 7 dimensions of Quality Metrics. Each dimension can be evaluated using both rule-based methods and LLM-based prompts:

Quality Metric	Description	Rule Examples	LLM Prompt Examples
COMPLETENESS	Checks if data is incomplete or missing	RuleColonEnd, RuleContentNull	Evaluates if text abruptly ends with a colon or ellipsis, has mismatched parentheses, or missing critical components
EFFECTIVENESS	Checks if data is meaningful and properly formatted	RuleAbnormalChar, RuleHtmlEntity, RuleSpecialCharacter	Detects garbled text, words stuck together without spaces, and text lacking proper punctuation
FLUENCY	Checks if text is grammatically correct and reads naturally	RuleAbnormalNumber, RuleNoPunc, RuleWordStuck	Identifies excessively long words, text fragments without punctuation, or content with chaotic reading order
RELEVANCE	Detects irrelevant content within the data	RuleHeadWord variants for different languages	Examines for irrelevant information like citation details, headers/footers, entity markers, HTML tags
SECURITY	Identifies sensitive information or value conflicts	RuleIDCard, RuleUnsafeWords	Checks for personal information, and content related to gambling, pornography, political issues
SIMILARITY	Detects repetitive or highly similar content	RuleDocRepeat	Evaluates text for consecutive repeated content or multiple occurrences of special characters
UNDERSTANDABILITY	Assesses how easily data can be interpreted	RuleCapitalWords	Ensures LaTeX formulas and Markdown are correctly formatted, with proper segmentation and line breaks
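To make the rule-based side of the table concrete, here is a rough sketch of what checks of this kind might look like. These functions are illustrative stand-ins written for this example, not Dingo's actual implementations of RuleColonEnd, RuleContentNull, or RuleWordStuck, and the 45-character threshold is an arbitrary assumption.

```python
def ends_with_colon(text):
    """Flag text that stops at a colon (a COMPLETENESS-style check)."""
    return text.rstrip().endswith((":", "："))


def is_empty(text):
    """Flag missing or whitespace-only content (a COMPLETENESS-style check)."""
    return not text or not text.strip()


def has_long_word(text, max_len=45):
    """Flag very long tokens, which may be words stuck together (FLUENCY-style)."""
    return any(len(word) > max_len for word in text.split())


# Hypothetical registry mapping a check name to its function
CHECKS = {
    "ends_with_colon": ends_with_colon,
    "is_empty": is_empty,
    "has_long_word": has_long_word,
}


def run_checks(text):
    """Return the names of all checks that flag the text."""
    return [name for name, check in CHECKS.items() if check(text)]


print(run_checks("The steps are as follows:"))  # ['ends_with_colon']
print(run_checks("A complete sentence."))       # []
```

Each check returns a boolean for a single sample, so a dataset-level report is just a tally of which checks fired across samples.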

Dingo provides several LLM-based assessment methods defined by prompts in the dingo/model/prompt directory. These prompts are registered using the prompt_register decorator and can be combined with LLM models for quality evaluation: