AI Solutions
Discover and compare the best AI tools, rated by the community
A benchmark designed to evaluate large language models (LLMs) on their ability to answer real-world coding-related questions.
A benchmark evaluating QA methods that operate over a mixture of heterogeneous input sources (knowledge bases, text, tables, and infoboxes).
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
CompassRank is dedicated to exploring the most advanced language and visual models, offering a comprehensive, objective, and neutral evaluation reference for industry and research.
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures; it produces highly accurate model rankings (0.96 correlation with Chatbot Arena) while running locally and quickly (about 6% of the time and cost of running MMLU).
A benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
A benchmark focused on understanding how models perform in various scenarios and on analyzing results from an interpretability perspective.
A meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
A benchmark for evaluating the performance of large language models (LLMs) on various tasks related to both textual and visual imagination.
A multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.
A biomedical question-answering benchmark designed for answering research-related questions using PubMed abstracts.
A benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
A large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.
A Swedish language understanding benchmark that evaluates natural language processing (NLP) models on tasks such as argumentation analysis, semantic similarity, and textual entailment.
A benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains such as chemistry, physics, and mathematics.
A benchmark platform designed for evaluating large language models (LLMs) across a range of tasks, with a particular focus on natural language understanding, reasoning, and generalization.
A benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
A benchmark designed to assess the performance of multimodal web agents on realistic, visually grounded tasks.
A benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.
A playground for developers to fine-tune and deploy LLMs.