What is OpenCompass?

OpenCompass is a platform focused on understanding AGI, including large language models and multi-modality models.

We aim to:

  • develop high-quality libraries that reduce the difficulty of evaluation
  • provide convincing leaderboards that improve the understanding of large models
  • create powerful toolchains targeting a variety of abilities and tasks
  • build solid benchmarks to support large model research
  • research inference with large models (analysis, reasoning, prompt engineering)

Toolkit

OpenCompass

VLMEvalKit

Models

CompassVerifier

CompassJudger

Benchmarks and Methods

| Project | Topic | Paper |
| --- | --- | --- |
| DevBench | Automated Software Development | DevBench: Towards LLMs based Automated Software Development |
| CriticBench | Critic Reasoning | CriticBench: Evaluating Large Language Models as Critic |
| ANAH | Hallucination Annotation | ANAH: Analytical Annotation of Hallucinations in Large Language Models |
| MathBench | Mathematical Reasoning | MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark |
| T-Eval | Tool Utilization | T-Eval: Evaluating the Tool Utilization Capability Step by Step |
| MMBench | Multi Modality | MMBench: Is Your Multi-modal Model an All-around Player? |
| BotChat | Subjective Evaluation | BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues |
| LawBench | Domain Evaluation | LawBench: Benchmarking Legal Knowledge of Large Language Models |

Pinned

  1. opencompass Public

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Python · 6.9k stars · 756 forks

  2. VLMEvalKit Public

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

    Python · 4k stars · 678 forks

  3. MMBench Public

    Official repo of "MMBench: Is Your Multi-modal Model an All-around Player?"

    296 stars · 17 forks

  4. CompassVerifier Public

    [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

    Jupyter Notebook · 68 stars · 2 forks

  5. CompassJudger Public

    The all-in-one judge models introduced by OpenCompass

    119 stars · 6 forks

  6. MMBench-GUI Public

    Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, includi…

    Python · 103 stars · 6 forks

Repositories

Showing 10 of 49 repositories
  • GenEditEvalKit Public

    The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.

    Jupyter Notebook · 41 stars · MIT · 4 forks · 0 issues · 0 pull requests · Updated Apr 12, 2026
  • SWE-bench-server

    Python · 1 star · 0 forks · 0 issues · 0 pull requests · Updated Apr 11, 2026
  • SearchAgentService

    Python · 0 stars · 1 fork · 0 issues · 0 pull requests · Updated Apr 11, 2026
  • Terminal-Bench-server

    Shell · 0 stars · 0 forks · 0 issues · 0 pull requests · Updated Apr 11, 2026
  • VLMEvalKit Public

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

    Python · 4,026 stars · Apache-2.0 · 678 forks · 203 issues · 27 pull requests · Updated Apr 10, 2026
  • CNFinBench Public

    CNFinBench is the first comprehensive benchmark for high-stakes financial scenarios. It spans 29 subtasks grounded in authoritative financial corpora and real business contexts, reconstructing end-to-end agent execution chains from requirement parsing and path planning through tool invocation to result verification.

    Python · 1 star · 0 forks · 0 issues · 0 pull requests · Updated Apr 10, 2026
  • opencompass Public

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Python · 6,850 stars · Apache-2.0 · 756 forks · 373 issues (1 issue needs help) · 71 pull requests · Updated Apr 9, 2026
  • pinchbench_server

    Python · 0 stars · 0 forks · 0 issues · 0 pull requests · Updated Apr 3, 2026
  • TextEdit Public

    We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models.

    Python · 19 stars · MIT · 0 forks · 0 issues · 0 pull requests · Updated Mar 16, 2026
  • GTA Public

    [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

    Python · 138 stars · Apache-2.0 · 9 forks · 0 issues · 0 pull requests · Updated Feb 16, 2026
