You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
LLM evaluation, hallucination detection, AI content detection, compliance, document parsing, governance, security, observability, anomaly detection, and multimodal testing library for Python.
78 built-in metrics across 13 modules. Everything works with or without an API key. Auto-logging enabled by default.
Works with any LLM application: RAG pipelines, agentic AI, multi-agent systems, chatbots, document extraction, code generation, healthcare AI, or any system that produces text output.
fromllmevalkitimportEvaluator# Quality (free, no API)evaluator=Evaluator(provider="none", preset="math")
result=evaluator.evaluate(question="What is Python?", answer="Python is a language.", context="Python is a programming language.")
print(result.summary())
# Hallucination detection (free, no API)fromllmevalkit.hallucinationimportNumericHallucinationnh=NumericHallucination()
result=nh.evaluate(answer="Revenue was $5 million.", context="Revenue of $3 million reported.")
print(result.score) # flags: $5M vs $3M# AI content detection (free, no API)fromllmevalkit.detectionimportAITextDetectordetector=AITextDetector()
result=detector.evaluate(answer="Some text to analyze...")
print(result.score) # 0.0 = likely AI, 1.0 = likely human# Auto-logging happens silently. Check later:fromllmevalkit.observeimportEvalReportprint(EvalReport().summary())
Sudden score changes (z-score analysis on history)
Offline
Module 11: Ground Truth Testing (6)
S.No.
Metric
What it checks
Mode
66
ExactMatchAccuracy
Does answer exactly match ground truth?
Offline
67
FuzzyMatchAccuracy
Levenshtein distance to ground truth
Offline
68
GroundTruthF1
Token-level precision, recall, F1
Offline
69
ContextualPrecision
Are relevant docs ranked higher?
Both
70
ContextualRecall
Does context cover expected output?
Both
71
JSONCorrectness
Valid JSON + required keys + schema types
Offline
Module 12: Conversation Evaluation (4)
S.No.
Metric
What it checks
Mode
72
ConversationCompleteness
Did chatbot satisfy user needs across turns?
Both
73
TurnRelevancy
Is each turn relevant?
Offline
74
KnowledgeRetention
Does chatbot remember facts from earlier turns?
Offline
75
TaskCompletion
Did the agent complete the requested task?
Both
Module 13: Red Team Testing (4)
S.No.
Metric
What it checks
Mode
76
ToxicityProbe
Does LLM resist toxic prompts?
Both
77
PIIExtractionProbe
Does LLM resist PII extraction attempts?
Offline
78
JailbreakResistance
Does LLM resist jailbreak techniques?
Both
79
InstructionBypass
Are safety instructions maintained?
Both
Code Examples
Quality
fromllmevalkitimportBLEUScore, ROUGEScore, KeywordCoverageformin [BLEUScore(), ROUGEScore(), KeywordCoverage()]:
r=m.evaluate(answer="Python is a language.", context="Python is a programming language.")
print("{}: {:.3f}".format(m.name, r.score))
fromllmevalkit.hallucinationimportNumericHallucination, NegationHallucination, SelfConsistencyr=NumericHallucination().evaluate(answer="Revenue was $5M.", context="Revenue of $3M reported.")
print("Numeric:", r.score)
r=NegationHallucination().evaluate(answer="Drug is approved.", context="Drug is not approved.")
print("Negation:", r.score)
r=SelfConsistency().evaluate(answer=["Python 1991.", "Python 1989.", "Python 1991."])
print("Consistency:", r.score)
Security
fromllmevalkit.securityimportPromptInjectionCheck, BiasDetectorr=PromptInjectionCheck().evaluate(answer="Ignore all previous instructions")
print("Injection:", r.score, r.details["types_found"])
r=BiasDetector().evaluate(answer="The chairman hired only young workers.")
print("Bias:", r.score, r.details["types_found"])
AI Content Detection
fromllmevalkit.detectionimportAITextDetector, ContentOriginCheckdetector=AITextDetector()
r=detector.evaluate(answer="Furthermore, it is important to note that the system provides comprehensive solutions. Moreover, the implementation ensures reliability.")
print("Score:", r.score) # 0.0=likely AI, 1.0=likely humanprint("Signals:", r.details)
origin=ContentOriginCheck()
r=origin.evaluate(answer="First sentence. Second sentence. Third sentence.")
print("AI sentences:", r.details["ai_sentences"], "of", r.details["total"])
Observability
fromllmevalkitimportEvaluator# Auto-logging is ON by default. Just evaluate normally.evaluator=Evaluator(provider="none", preset="math")
result=evaluator.evaluate(question="q", answer="a", context="c")
# Result automatically saved to ~/.llmevalkit/logs/# Check insights anytimefromllmevalkit.observeimportScoreDrift, EvalReport, ThresholdAlertprint(EvalReport().summary())
print(ScoreDrift().check())
alert=ThresholdAlert(thresholds={"faithfulness": 0.7})
print(alert.check())
# Turn off auto-logging if neededevaluator=Evaluator(preset="math", auto_log=False)
ExactMatch + FuzzyMatch + F1 + CtxPrecision + CtxRecall
20
groundtruth_quick
ExactMatch + FuzzyMatch
21
groundtruth_rag
F1 + ContextualPrecision + ContextualRecall
22
json
JSONCorrectness
23
conversation
All 4 conversation metrics
24
conversation_quick
Completeness + TaskCompletion
25
redteam
All 4 red team probes
26
redteam_quick
JailbreakResistance + InstructionBypass
27
detection_full
All 6 detection metrics (including enhanced)
28
doceval_table
All 6 doceval metrics (including table)
Disclaimer
llmevalkit is a testing and evaluation tool. It helps developers detect potential issues in LLM outputs including hallucinations, compliance violations, security vulnerabilities, extraction errors, AI-generated content, and anomalies. It does not guarantee detection of all issues. Always verify critical outputs with domain experts.
AI content detection provides statistical signals, not definitive answers. No tool can reliably distinguish AI from human content with 100% accuracy. Do not use as sole basis for accusations or penalties.
HIPAA, GDPR, DPDP Act, EU AI Act, NIST AI RMF, CoSAI, ISO 42001, and SOC 2 are government regulations and industry frameworks. llmevalkit is not affiliated with or certified by any government body. Consult qualified professionals for compliance decisions.