AI Solutions
Discover and compare the best AI tools, rated by the community
A benchmark designed to evaluate large language models (LLMs) on their ability to answer real-world coding-related questions.
A benchmark evaluating QA methods that operate over a mixture of heterogeneous input sources (knowledge bases, text, tables, and infoboxes).
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
CompassRank is dedicated to exploring the most advanced language and visual models, offering a comprehensive, objective, and neutral evaluation reference for industry and research.
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures; it produces highly accurate model rankings (0.96 correlation with Chatbot Arena) while running locally and quickly (about 6% of the time and cost of running MMLU).
A benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
A benchmark focused on understanding how models perform in various scenarios and on analyzing results from an interpretability perspective.
A meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
A benchmark for evaluating the performance of large language models (LLMs) on various tasks related to both textual and visual imagination.
A multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.
A biomedical question-answering benchmark designed for answering research-related questions using PubMed abstracts.
A benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
A large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.
A Swedish language understanding benchmark that evaluates natural language processing (NLP) models on tasks such as argumentation analysis, semantic similarity, and textual entailment.
A benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains such as chemistry, physics, and mathematics.
A benchmark platform designed for evaluating large language models (LLMs) across a range of tasks, with a particular focus on natural language understanding, reasoning, and generalization.
A benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
A benchmark designed to assess the performance of multimodal web agents on realistic, visually grounded tasks.
A benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.
A playground for developers to fine-tune and deploy LLMs.