Overview
| Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %) | ASR · FR | ASR · EN | ASR · DE | ASR · ES | ASR · IT | ASR · PT | ASR · NL | AST (METEOR %) | ES-FR · Multilingual_TEDx | ES-IT · Multilingual_TEDx | FR-EN · Multilingual_TEDx | FR-ES · Multilingual_TEDx | Question Answering (FLOW_JUDGE) | QA · FR | QA · EN | Others (avg) | Emotion Recognition (FLOW_JUDGE) | Gender Recognition (FLOW_JUDGE) | Age Recognition (FLOW_JUDGE) | Dialogue Summarization (FLOW_JUDGE) | Spoken Language Identification (FLOW_JUDGE) | Music (avg) | Music Question Answering (FLOW_JUDGE) | Music Captioning (FLOW_JUDGE) | Sound (avg) | Audio Captioning (FLOW_JUDGE) | Audio Question Answering (FLOW_JUDGE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.863 | 0.95 | 2.3 | 25.23 ±1.67 | 33.56 ±3.68 | 11.00 ±1.15 | 32.65 ±4.70 | 16.57 ±2.64 | 21.20 ±7.11 | 19.15 ±2.94 | 40.68 ±3.75 | 48.00 ±1.15 | 46.22 ±2.08 | 45.46 ±3.11 | 51.06 ±2.30 | 49.24 ±2.04 | 63.04 ±0.05 | 50.99 ±0.09 | 68.21 ±0.05 | 55.99 | 30.20 ±0.02 | 85.80 ±0.02 | 23.60 ±0.03 | 51.93 ±0.08 | 88.40 ±0.03 | 59.99 | 60.23 ±0.07 | 59.76 ±0.04 | 57.35 | 53.70 ±0.06 | 60.99 ±0.09 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.613 | 0.27 | 3.7 | 37.96 ±5.64 | 44.41 ±9.42 | 41.77 ±10.57 | 48.09 ±21.47 | 28.24 ±23.62 | 18.58 ±12.24 | 35.15 ±16.46 | 31.38 ±4.68 | 45.93 ±1.20 | 42.80 ±2.25 | 39.45 ±3.23 | 53.66 ±2.01 | 47.82 ±2.26 | 66.86 ±0.04 | 53.78 ±0.07 | 72.47 ±0.05 | 32.44 | 10.00 ±0.02 | 16.87 ±0.02 | 0.20 ±0.00 | 55.33 ±0.08 | 79.80 ±0.04 | 37.80 | 38.20 ±0.05 | 37.40 ±0.07 | 56.93 | 54.68 ±0.07 | 59.19 ±0.09 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.608 | 0.25 | 3.8 | 36.83 ±6.09 | 43.81 ±16.23 | 10.29 ±1.80 | 41.56 ±6.57 | 34.85 ±16.84 | 24.70 ±9.17 | 31.95 ±6.49 | 103.10 ±22.76 | 42.31 ±1.41 | 31.28 ±2.46 | 40.85 ±3.57 | 57.03 ±2.36 | 40.09 ±2.66 | 65.30 ±0.05 | 52.16 ±0.09 | 70.93 ±0.05 | 35.67 | 18.60 ±0.02 | 48.53 ±0.03 | 4.00 ±0.01 | 56.40 ±0.09 | 50.80 ±0.04 | 44.59 | 44.93 ±0.07 | 44.24 ±0.06 | 53.18 | 47.34 ±0.06 | 59.02 ±0.09 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.564 | 0.03 | 4.5 | 72.44 ±5.63 | 73.74 ±9.09 | 12.58 ±8.94 | 112.01 ±22.08 | 67.31 ±10.68 | 106.41 ±20.15 | 88.70 ±22.54 | 74.85 ±6.63 | 23.97 ±1.22 | 17.21 ±1.99 | 12.67 ±2.35 | 43.68 ±2.03 | 22.34 ±2.25 | 55.98 ±0.05 | 44.63 ±0.07 | 60.84 ±0.06 | 59.31 | 52.73 ±0.03 | 75.80 ±0.02 | 19.50 ±0.02 | 55.53 ±0.09 | 93.00 ±0.02 | 56.73 | 57.97 ±0.07 | 55.48 ±0.11 | 58.74 | 57.76 ±0.06 | 59.73 ±0.08 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.431 | -0.22 | 4.7 | 125.88 ±18.69 | 127.95 ±36.03 | 81.49 ±20.97 | 163.93 ±69.63 | 125.22 ±69.41 | 135.14 ±50.01 | 135.23 ±47.91 | 134.67 ±40.94 | 25.60 ±0.78 | 21.31 ±1.35 | 25.62 ±2.25 | 29.81 ±1.36 | 25.65 ±1.50 | 73.85 ±0.04 | 61.04 ±0.07 | 79.34 ±0.04 | 35.65 | 15.33 ±0.02 | 18.27 ±0.02 | 2.40 ±0.01 | 66.67 ±0.08 | 75.60 ±0.04 | 48.93 | 47.27 ±0.07 | 50.60 ±0.07 | 48.80 | 38.90 ±0.05 | 58.69 ±0.07 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.424 | -0.27 | 5.5 | 65.99 ±7.27 | 82.32 ±12.68 | 68.47 ±13.86 | 71.28 ±25.51 | 55.52 ±24.49 | 44.98 ±19.24 | 52.58 ±24.86 | 39.77 ±7.22 | 44.77 ±1.17 | 39.10 ±2.19 | 41.91 ±3.00 | 51.84 ±2.00 | 46.21 ±2.27 | 64.97 ±0.04 | 51.26 ±0.08 | 70.84 ±0.05 | 28.05 | 11.00 ±0.02 | 7.93 ±0.01 | 2.90 ±0.01 | 55.20 ±0.09 | 63.20 ±0.04 | 34.67 | 33.05 ±0.06 | 36.28 ±0.07 | 44.41 | 35.88 ±0.06 | 52.93 ±0.08 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.410 | -0.26 | 4.8 | 18.17 ±7.32 | 21.70 ±13.80 | 9.43 ±1.71 | 28.29 ±36.38 | 12.58 ±32.37 | 12.59 ±7.25 | 19.23 ±7.46 | 23.10 ±2.25 | 40.85 ±1.11 | 34.56 ±2.00 | 36.04 ±3.14 | 47.57 ±1.93 | 45.23 ±2.00 | 66.45 ±0.05 | 59.84 ±0.08 | 69.28 ±0.05 | 27.10 | 10.93 ±0.02 | 32.07 ±0.02 | 2.50 ±0.01 | 49.40 ±0.09 | 40.60 ±0.04 | 35.33 | 36.77 ±0.06 | 33.88 ±0.07 | 32.89 | 22.38 ±0.03 | 43.41 ±0.09 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.253 | -0.75 | 6.7 | 27.98 ±1.73 | 20.80 ±1.78 | 12.09 ±0.95 | 33.12 ±7.10 | 13.93 ±6.25 | 18.95 ±7.88 | 80.39 ±6.12 | 49.77 ±3.54 | 33.41 ±1.11 | 27.86 ±1.93 | 30.12 ±3.08 | 41.47 ±1.94 | 34.19 ±2.08 | 59.86 ±0.05 | 54.45 ±0.08 | 62.17 ±0.06 | 26.09 | 10.73 ±0.02 | 33.33 ±0.02 | 2.00 ±0.01 | 41.60 ±0.09 | 42.80 ±0.04 | 31.56 | 34.84 ±0.06 | 28.28 ±0.06 | 32.61 | 21.74 ±0.03 | 43.49 ±0.09 |
Overview (FR/EN — ASR, AST, QA)
| Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %) | ASR · FR | ASR · EN | AST (METEOR %) | ES-FR · Multilingual_TEDx | FR-EN · Multilingual_TEDx | FR-ES · Multilingual_TEDx | Question Answering (FLOW_JUDGE) | QA · FR | QA · EN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.770 | 0.57 | 3.0 | 17.61 ±9.23 | 21.70 ±13.80 | 9.43 ±1.71 | 42.45 ±1.18 | 34.56 ±2.00 | 47.57 ±1.93 | 45.23 ±2.00 | 66.45 ±0.05 | 59.84 ±0.08 | 69.28 ±0.05 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.769 | 0.50 | 3.3 | 26.04 ±2.52 | 33.56 ±3.68 | 11.00 ±1.15 | 48.84 ±1.24 | 46.22 ±2.08 | 51.06 ±2.30 | 49.24 ±2.04 | 63.04 ±0.05 | 50.99 ±0.09 | 68.21 ±0.05 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.768 | 0.54 | 3.0 | 43.53 ±7.21 | 44.41 ±9.42 | 41.77 ±10.57 | 48.09 ±1.28 | 42.80 ±2.25 | 53.66 ±2.01 | 47.82 ±2.26 | 66.86 ±0.04 | 53.78 ±0.07 | 72.47 ±0.05 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.701 | 0.34 | 4.0 | 32.64 ±10.86 | 43.81 ±16.23 | 10.29 ±1.80 | 42.80 ±1.54 | 31.28 ±2.46 | 57.03 ±2.36 | 40.09 ±2.66 | 65.30 ±0.05 | 52.16 ±0.09 | 70.93 ±0.05 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.578 | -0.05 | 5.0 | 77.70 ±9.65 | 82.32 ±12.68 | 68.47 ±13.86 | 45.72 ±1.27 | 39.10 ±2.19 | 51.84 ±2.00 | 46.21 ±2.27 | 64.97 ±0.04 | 51.26 ±0.08 | 70.84 ±0.05 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.532 | -0.19 | 5.0 | 17.90 ±1.24 | 20.80 ±1.78 | 12.09 ±0.95 | 34.50 ±1.18 | 27.86 ±1.93 | 41.47 ±1.94 | 34.19 ±2.08 | 59.86 ±0.05 | 54.45 ±0.08 | 62.17 ±0.06 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.333 | -0.61 | 5.7 | 112.46 ±25.06 | 127.95 ±36.03 | 81.49 ±20.97 | 25.59 ±0.83 | 21.31 ±1.35 | 29.81 ±1.36 | 25.65 ±1.50 | 73.85 ±0.04 | 61.04 ±0.07 | 79.34 ±0.04 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.239 | -1.10 | 7.0 | 53.35 ±6.83 | 73.74 ±9.09 | 12.58 ±8.94 | 27.74 ±1.34 | 17.21 ±1.99 | 43.68 ±2.03 | 22.34 ±2.25 | 55.98 ±0.05 | 44.63 ±0.07 | 60.84 ±0.06 |
Tasks · ASR
Tasks · ASR — WER (%)
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR (avg) | FR · CommonVoice | FR · Fleurs | FR · Multilingual_TEDx | FR · SUMM-RE | FR · VoxPopuli | FR · YouTubeFr | EN (avg) | EN · CommonVoice | EN · Fleurs | EN · VoxPopuli | DE (avg) | DE · Fleurs | DE · Multilingual_TEDx | ES (avg) | ES · Fleurs | ES · Multilingual_TEDx | IT (avg) | IT · Fleurs | IT · Multilingual_TEDx | PT (avg) | PT · Fleurs | PT · Multilingual_TEDx | NL · Fleurs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.999 | 0.91 | 1.3 | 18.13 | 21.70 ±13.80 | 10.65 ±1.84 | 10.15 ±1.40 | 22.07 ±55.31 | 42.83 ±55.71 | 13.37 ±24.17 | 31.16 ±8.81 | 9.43 ±1.71 | 10.89 ±1.54 | 7.49 ±0.98 | 9.92 ±4.78 | 28.29 ±36.38 | 11.23 ±1.05 | 45.35 ±72.58 | 12.58 ±32.37 | 5.66 ±0.85 | 19.50 ±64.61 | 12.59 ±7.25 | 9.10 ±1.03 | 16.08 ±14.39 | 19.23 ±7.46 | 12.14 ±2.25 | 26.31 ±14.69 | 23.10 ±2.25 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.938 | 0.73 | 2.9 | 24.97 | 33.56 ±3.68 | 18.92 ±2.45 | 27.90 ±2.52 | 26.52 ±15.79 | 39.09 ±3.37 | 37.35 ±4.38 | 51.58 ±13.46 | 11.00 ±1.15 | 10.30 ±1.65 | 7.16 ±1.22 | 15.54 ±2.72 | 32.65 ±4.70 | 22.40 ±3.06 | 42.91 ±8.72 | 16.57 ±2.64 | 14.55 ±2.27 | 18.59 ±4.76 | 21.20 ±7.11 | 20.92 ±2.51 | 21.47 ±13.97 | 19.15 ±2.94 | 14.99 ±2.14 | 23.31 ±5.43 | 40.68 ±3.75 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.871 | 0.53 | 3.4 | 32.72 | 20.80 ±1.78 | 13.01 ±1.94 | 11.92 ±1.34 | 18.06 ±8.22 | 32.69 ±4.36 | 17.08 ±1.58 | 32.04 ±3.48 | 12.09 ±0.95 | 12.26 ±1.89 | 11.05 ±1.12 | 12.97 ±1.80 | 33.12 ±7.10 | 16.66 ±1.57 | 49.57 ±13.82 | 13.93 ±6.25 | 8.33 ±0.99 | 19.52 ±12.34 | 18.95 ±7.88 | 12.38 ±1.20 | 25.53 ±15.59 | 80.39 ±6.12 | 58.79 ±3.43 | 101.99 ±11.44 | 49.77 ±3.54 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.826 | 0.41 | 4.0 | 35.37 | 44.41 ±9.42 | 25.86 ±17.61 | 12.16 ±2.18 | 36.79 ±35.85 | 82.09 ±20.51 | 31.48 ±7.55 | 78.10 ±31.14 | 41.77 ±10.57 | 75.07 ±20.40 | 20.60 ±3.87 | 29.65 ±23.53 | 48.09 ±21.47 | 18.30 ±3.04 | 77.88 ±42.03 | 28.24 ±23.62 | 13.29 ±2.63 | 43.19 ±46.66 | 18.58 ±12.24 | 13.79 ±2.83 | 23.36 ±24.20 | 35.15 ±16.46 | 15.68 ±3.65 | 54.62 ±32.25 | 31.38 ±4.68 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.793 | 0.29 | 4.3 | 41.47 | 43.81 ±16.23 | 26.18 ±5.77 | 21.29 ±3.57 | 55.24 ±41.81 | 63.68 ±72.63 | 27.56 ±7.56 | 68.93 ±47.91 | 10.29 ±1.80 | 12.42 ±4.75 | 6.36 ±1.06 | 12.09 ±2.27 | 41.56 ±6.57 | 26.78 ±5.05 | 56.33 ±11.83 | 34.85 ±16.84 | 15.44 ±3.23 | 54.25 ±33.36 | 24.70 ±9.17 | 18.83 ±4.24 | 30.58 ±17.70 | 31.95 ±6.49 | 29.88 ±9.76 | 34.03 ±8.53 | 103.10 ±22.76 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.601 | -0.25 | 5.7 | 59.27 | 82.32 ±12.68 | 97.54 ±33.55 | 28.89 ±7.09 | 93.08 ±41.98 | 128.57 ±25.32 | 47.17 ±14.03 | 98.67 ±41.83 | 68.47 ±13.86 | 120.36 ±27.27 | 29.73 ±7.38 | 55.32 ±29.51 | 71.28 ±25.51 | 28.98 ±5.83 | 113.59 ±49.14 | 55.52 ±24.49 | 26.69 ±5.24 | 84.34 ±47.31 | 44.98 ±19.24 | 25.84 ±4.56 | 64.12 ±37.65 | 52.58 ±24.86 | 22.88 ±5.74 | 82.28 ±48.63 | 39.77 ±7.22 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.504 | -0.57 | 6.4 | 76.51 | 73.74 ±9.09 | 48.31 ±7.41 | 57.50 ±6.22 | 76.10 ±46.43 | 73.64 ±9.60 | 94.59 ±6.21 | 92.27 ±23.11 | 12.58 ±8.94 | 13.22 ±2.87 | 11.98 ±1.63 | 12.53 ±26.61 | 112.01 ±22.08 | 97.16 ±10.60 | 126.86 ±42.67 | 67.31 ±10.68 | 64.86 ±6.38 | 69.77 ±20.29 | 106.41 ±20.15 | 120.32 ±10.00 | 92.50 ±39.02 | 88.70 ±22.54 | 96.60 ±9.93 | 80.80 ±43.95 | 74.85 ±6.63 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.000 | -2.05 | 8.0 | 129.09 | 127.95 ±36.03 | 71.61 ±29.42 | 75.16 ±3.59 | 185.12 ±176.73 | 201.50 ±52.98 | 103.83 ±38.14 | 130.46 ±98.40 | 81.49 ±20.97 | 54.35 ±37.70 | 73.73 ±14.57 | 116.40 ±47.54 | 163.93 ±69.63 | 97.87 ±7.06 | 229.99 ±137.44 | 125.22 ±69.41 | 83.45 ±6.91 | 167.00 ±137.18 | 135.14 ±50.01 | 80.32 ±3.53 | 189.97 ±98.55 | 135.23 ±47.91 | 96.91 ±16.09 | 173.54 ±93.23 | 134.67 ±40.94 |
Tasks · AST
Tasks · AST — BLEU
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR-EN · Multilingual_TEDx (FR→EN) | FR-ES · Multilingual_TEDx (FR→ES) | ES-FR · Multilingual_TEDx (ES→FR) | ES-IT · Multilingual_TEDx (ES→IT) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.901 | 0.92 | 1.8 | 26.86 | 29.15 ±2.03 | 24.60 ±1.71 | 28.09 ±1.79 | 25.60 ±2.55 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.861 | 0.82 | 2.0 | 26.47 | 35.51 ±2.08 | 27.14 ±1.94 | 23.90 ±1.80 | 19.34 ±3.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.846 | 0.76 | 2.8 | 26.37 | 39.13 ±2.25 | 22.38 ±2.14 | 22.90 ±1.92 | 21.06 ±2.88 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.710 | 0.30 | 4.2 | 22.14 | 25.53 ±1.71 | 22.45 ±1.83 | 23.56 ±1.87 | 17.03 ±2.49 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.643 | 0.07 | 5.2 | 20.29 | 23.17 ±1.95 | 21.08 ±1.95 | 18.32 ±1.76 | 18.60 ±2.34 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.597 | -0.07 | 5.8 | 19.61 | 27.71 ±1.83 | 18.19 ±1.67 | 17.13 ±1.49 | 15.40 ±2.60 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.385 | -0.74 | 6.2 | 14.86 | 29.01 ±1.94 | 13.53 ±1.62 | 12.43 ±1.48 | 4.48 ±1.97 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.000 | -2.06 | 8.0 | 3.67 | 4.26 ±0.52 | 3.64 ±0.60 | 3.19 ±0.51 | 3.61 ±0.99 |
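
The page does not state how the Min-Max and Avg Rank columns are derived. The sketch below is an assumption that happens to reproduce the BLEU table above: each dataset column is min-max normalized across models (higher is better), models are ranked within each column, and both quantities are averaged over the four columns. Model names are shortened, the function name `aggregate` is hypothetical, and tie handling is not addressed because no column in this table contains ties.

```python
# BLEU scores copied from the table above.
# Column order: FR->EN, FR->ES, ES->FR, ES->IT (all Multilingual_TEDx).
bleu = {
    "Qwen2-Audio-7B-Instruct": [29.15, 24.60, 28.09, 25.60],
    "Canary-Qwen3-4B": [35.51, 27.14, 23.90, 19.34],
    "Phi-4-multimodal-instruct": [39.13, 22.38, 22.90, 21.06],
    "Qwen2.5-Omni-7B": [25.53, 22.45, 23.56, 17.03],
    "Qwen2.5-Omni-3B": [23.17, 21.08, 18.32, 18.60],
    "Canary-Qwen3-1.7B": [27.71, 18.19, 17.13, 15.40],
    "audio-flamingo-3-hf": [29.01, 13.53, 12.43, 4.48],
    "Voxtral-Mini-3B-2507": [4.26, 3.64, 3.19, 3.61],
}


def aggregate(scores):
    """Average per-column min-max scores and ranks (assumed scheme)."""
    models = list(scores)
    n_cols = len(next(iter(scores.values())))
    min_max = {m: 0.0 for m in models}
    avg_rank = {m: 0.0 for m in models}
    for j in range(n_cols):
        col = [scores[m][j] for m in models]
        lo, hi = min(col), max(col)
        ordered = sorted(col, reverse=True)  # rank 1 = highest score
        for m in models:
            v = scores[m][j]
            min_max[m] += (v - lo) / (hi - lo) / n_cols
            avg_rank[m] += (ordered.index(v) + 1) / n_cols
    return min_max, avg_rank


min_max, avg_rank = aggregate(bleu)
for model in bleu:
    print(f"{model:28s} min-max={min_max[model]:.3f} avg-rank={avg_rank[model]:.2f}")
```

Under this assumption, Qwen2-Audio-7B-Instruct comes out at a Min-Max of about 0.901 and an average rank of 1.75, which matches the 0.901 and 1.8 shown in the table after rounding.
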
Tasks · AST — METEOR (%)
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR-EN · Multilingual_TEDx (FR→EN) | FR-ES · Multilingual_TEDx (FR→ES) | ES-FR · Multilingual_TEDx (ES→FR) | ES-IT · Multilingual_TEDx (ES→IT) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.945 | 1.04 | 1.8 | 48.00 | 51.06 ±2.30 | 49.24 ±2.04 | 46.22 ±2.08 | 45.46 ±3.11 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.880 | 0.84 | 2.5 | 45.93 | 53.66 ±2.01 | 47.82 ±2.26 | 42.80 ±2.25 | 39.45 ±3.23 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.836 | 0.71 | 2.8 | 44.77 | 51.84 ±2.00 | 46.21 ±2.27 | 39.10 ±2.19 | 41.91 ±3.00 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.751 | 0.48 | 3.5 | 42.31 | 57.03 ±2.36 | 40.09 ±2.66 | 31.28 ±2.46 | 40.85 ±3.57 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.703 | 0.28 | 4.5 | 40.85 | 47.57 ±1.93 | 45.23 ±2.00 | 34.56 ±2.00 | 36.04 ±3.14 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.442 | -0.51 | 6.2 | 33.41 | 41.47 ±1.94 | 34.19 ±2.08 | 27.86 ±1.93 | 30.12 ±3.08 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.165 | -1.38 | 7.2 | 25.60 | 29.81 ±1.36 | 25.65 ±1.50 | 21.31 ±1.35 | 25.62 ±2.25 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.127 | -1.46 | 7.5 | 23.97 | 43.68 ±2.03 | 22.34 ±2.25 | 17.21 ±1.99 | 12.67 ±2.35 |
Tasks · QA
Math Question Answering
Tasks · QA — ACC (%)
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | EN · spoken-mqa_short_digit |
|---|---|---|---|---|---|---|
| Qwen/Qwen2.5-Omni-7B | 11B | 1.000 | 1.28 | 1.0 | 89.00 | 89.00 ±6.13 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.939 | 1.09 | 2.0 | 85.00 | 85.00 ±7.00 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.697 | 0.31 | 3.0 | 69.00 | 69.00 ±9.06 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.652 | 0.16 | 4.0 | 66.00 | 66.00 ±9.28 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.652 | 0.16 | 5.0 | 66.00 | 66.00 ±9.28 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.621 | 0.06 | 6.0 | 64.00 | 64.00 ±9.41 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.258 | -1.11 | 7.0 | 40.00 | 40.00 ±9.60 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.000 | -1.94 | 8.0 | 23.00 | 23.00 ±8.25 |
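
Because this table has a single dataset column, it is a convenient place to check what the Min-Max and Z-Score columns mean. The sketch below assumes plain min-max normalization across the eight models and a z-score using the population standard deviation (divisor N, not N-1); that choice is inferred from the numbers, not documented on the page. Model names are shortened.

```python
import statistics

# Accuracy (%) on spoken-mqa_short_digit, copied from the table above.
acc = {
    "Qwen2.5-Omni-7B": 89.0,
    "Qwen2.5-Omni-3B": 85.0,
    "Voxtral-Mini-3B-2507": 69.0,
    "Canary-Qwen3-4B": 66.0,
    "Qwen2-Audio-7B-Instruct": 66.0,
    "audio-flamingo-3-hf": 64.0,
    "Canary-Qwen3-1.7B": 40.0,
    "Phi-4-multimodal-instruct": 23.0,
}

values = list(acc.values())
lo, hi = min(values), max(values)
mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population std; assumed, see lead-in

# Min-max maps the worst model to 0 and the best to 1.
min_max = {m: (v - lo) / (hi - lo) for m, v in acc.items()}
# Z-score expresses each model's distance from the mean in std units.
z_score = {m: (v - mean) / std for m, v in acc.items()}

for m in acc:
    print(f"{m:28s} min-max={min_max[m]:.3f} z={z_score[m]:.2f}")
```

With these assumptions the computed values round to the ones in the table, for example 0.939 and 1.09 for Qwen2.5-Omni-3B. Using the sample standard deviation instead would give a z-score of about 1.20 for the top model, not the 1.28 shown.
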
Question Answering
Tasks · QA — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · CohereLabs-Aya_collection | FR · Vigogne--Alpaca | FR · VoxPopuli-QA | EN · NationalSpeechCorpus_SQA | EN · OpenHermes_audio | EN · SLUE-P2-SQA5 | EN · SpokenWOZ_AIR-Bench | EN · alpaca_audio | EN · fisher_AIR-Bench | EN · public-sg-speech |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 1.000 | 1.69 | 1.0 | 70.19 | 48.39 ±0.41 | 57.08 ±0.10 | 77.64 ±0.08 | 74.00 ±0.07 | 77.40 ±0.20 | 88.28 ±0.09 | 77.20 ±0.15 | 78.20 ±0.19 | 81.10 ±0.11 | 79.20 ±0.06 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.692 | 0.65 | 3.5 | 64.56 | 48.39 ±0.33 | 52.56 ±0.09 | 78.56 ±0.10 | 59.70 ±0.09 | 67.40 ±0.25 | 86.13 ±0.12 | 65.18 ±0.20 | 70.40 ±0.25 | 69.80 ±0.17 | 66.36 ±0.08 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.593 | 0.32 | 3.0 | 63.12 | 42.58 ±0.38 | 51.08 ±0.10 | 67.68 ±0.10 | 63.05 ±0.09 | 68.40 ±0.19 | 90.10 ±0.09 | 71.30 ±0.19 | 69.40 ±0.22 | 73.70 ±0.16 | 71.32 ±0.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.502 | 0.01 | 4.0 | 61.54 | 35.48 ±0.32 | 41.76 ±0.09 | 79.24 ±0.10 | 63.75 ±0.09 | 61.80 ±0.27 | 91.76 ±0.08 | 74.72 ±0.17 | 53.40 ±0.29 | 76.90 ±0.15 | 74.16 ±0.08 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.473 | -0.09 | 5.0 | 61.05 | 39.35 ±0.39 | 48.84 ±0.10 | 65.60 ±0.11 | 64.95 ±0.09 | 64.80 ±0.18 | 88.97 ±0.10 | 67.56 ±0.20 | 68.60 ±0.23 | 71.10 ±0.17 | 69.92 ±0.08 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.393 | -0.36 | 6.5 | 59.60 | 34.84 ±0.42 | 43.28 ±0.10 | 74.84 ±0.10 | 57.45 ±0.08 | 62.60 ±0.27 | 87.94 ±0.11 | 68.91 ±0.19 | 63.80 ±0.28 | 72.00 ±0.16 | 64.76 ±0.08 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.335 | -0.55 | 5.0 | 58.31 | 42.58 ±0.31 | 48.20 ±0.10 | 72.56 ±0.11 | 54.95 ±0.09 | 57.00 ±0.22 | 77.35 ±0.15 | 57.62 ±0.20 | 60.60 ±0.26 | 63.90 ±0.18 | 63.80 ±0.08 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.000 | -1.68 | 8.0 | 52.74 | 37.42 ±0.32 | 38.12 ±0.09 | 58.36 ±0.10 | 63.60 ±0.08 | 48.80 ±0.27 | 78.48 ±0.13 | 69.22 ±0.17 | 24.00 ±0.41 | 71.70 ±0.13 | 70.08 ±0.07 |
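
The Average column in this table differs from the Question Answering score in the Overview (70.19 versus 73.85 for Voxtral-Mini-3B-2507). One reading that reproduces both numbers, offered here as an assumption, is that the task table averages the per-language means (FR and EN weighted equally), while the Overview averages over all ten datasets directly. The sketch below checks this on the Voxtral row.

```python
# FLOW_JUDGE scores for mistralai/Voxtral-Mini-3B-2507, copied from the table above.
fr = [48.39, 57.08, 77.64]                               # 3 French datasets
en = [74.00, 77.40, 88.28, 77.20, 78.20, 81.10, 79.20]   # 7 English datasets

fr_mean = sum(fr) / len(fr)
en_mean = sum(en) / len(en)

# Macro average: each language counts once, regardless of dataset count.
macro = (fr_mean + en_mean) / 2
# Micro average: each dataset counts once.
micro = sum(fr + en) / len(fr + en)

print(f"FR mean = {fr_mean:.2f}, EN mean = {en_mean:.2f}")
print(f"macro (per language) = {macro:.2f}")
print(f"micro (per dataset)  = {micro:.2f}")
```

The per-language means (61.04 and 79.34) also match the FR and EN columns of the Overview, which supports this reading. English has more datasets and higher scores here, so the per-dataset average is pulled upward relative to the per-language one.
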
Tasks · Others
Age Recognition
Tasks · Others — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · CommonVoice_Age | EN · CommonVoice_Age |
|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.950 | 2.02 | 1.5 | 23.60 | 15.00 ±0.03 | 32.20 ±0.04 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.596 | 0.80 | 3.0 | 19.50 | 3.20 ±0.02 | 35.80 ±0.04 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.199 | -0.17 | 3.0 | 4.00 | 5.00 ±0.02 | 3.00 ±0.01 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.165 | -0.26 | 5.0 | 2.90 | 4.80 ±0.02 | 1.00 ±0.01 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.129 | -0.39 | 5.0 | 2.50 | 3.60 ±0.02 | 1.40 ±0.01 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.079 | -0.56 | 5.5 | 2.00 | 1.80 ±0.01 | 2.20 ±0.01 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.061 | -0.64 | 5.5 | 2.40 | 0.40 ±0.01 | 4.40 ±0.02 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.000 | -0.80 | 7.5 | 0.20 | 0.40 ±0.01 | 0.00 ±0.00 |
Dialogue Summarization
Tasks · Others — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | EN · NationalSpeechCorpus_SDS |
|---|---|---|---|---|---|---|
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 1.000 | 1.91 | 1.0 | 66.67 | 66.67 ±0.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.590 | 0.36 | 2.0 | 56.40 | 56.40 ±0.09 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.556 | 0.23 | 3.0 | 55.53 | 55.53 ±0.09 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.548 | 0.20 | 4.0 | 55.33 | 55.33 ±0.08 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.543 | 0.18 | 5.0 | 55.20 | 55.20 ±0.09 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.412 | -0.31 | 6.0 | 51.93 | 51.93 ±0.08 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.311 | -0.70 | 7.0 | 49.40 | 49.40 ±0.09 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.000 | -1.87 | 8.0 | 41.60 | 41.60 ±0.09 |
Emotion Recognition
Tasks · Others — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · MELD_Emotion | EN · IEMOCAP-Emotion | EN · MELD_Emotion |
|---|---|---|---|---|---|---|---|---|
| nvidia/audio-flamingo-3-hf | 8.2B | 1.000 | 2.36 | 1.0 | 52.80 | 53.00 ±0.04 | 43.20 ±0.04 | 62.00 ±0.04 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.448 | 0.61 | 2.0 | 26.70 | 16.20 ±0.03 | 35.80 ±0.04 | 38.60 ±0.04 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.214 | -0.09 | 3.0 | 17.00 | 12.20 ±0.03 | 22.20 ±0.04 | 21.40 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.142 | -0.31 | 4.0 | 13.80 | 9.20 ±0.03 | 27.20 ±0.04 | 9.60 ±0.03 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.051 | -0.59 | 6.5 | 9.90 | 6.80 ±0.02 | 18.80 ±0.03 | 7.20 ±0.02 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.040 | -0.63 | 6.5 | 9.25 | 4.80 ±0.02 | 22.00 ±0.04 | 5.40 ±0.02 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.038 | -0.64 | 6.0 | 8.90 | 2.60 ±0.01 | 24.80 ±0.04 | 5.60 ±0.02 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.015 | -0.71 | 7.0 | 7.90 | 1.60 ±0.01 | 22.60 ±0.04 | 5.80 ±0.02 |
Gender Recognition
Tasks · Others — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · CommonVoice_Gender | EN · CommonVoice_Gender | EN · IEMOCAP-gender |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 1.000 | 1.77 | 1.0 | 81.10 | 67.00 ±0.04 | 92.40 ±0.02 | 98.00 ±0.01 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.824 | 1.20 | 2.0 | 68.95 | 48.40 ±0.04 | 93.00 ±0.02 | 86.00 ±0.03 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.499 | 0.23 | 3.0 | 44.15 | 31.00 ±0.04 | 58.40 ±0.04 | 56.20 ±0.04 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.347 | -0.20 | 5.0 | 31.85 | 27.40 ±0.04 | 40.40 ±0.04 | 32.20 ±0.04 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.344 | -0.20 | 4.5 | 31.35 | 29.20 ±0.04 | 17.60 ±0.03 | 49.40 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.216 | -0.56 | 6.0 | 20.75 | 28.20 ±0.04 | 26.20 ±0.04 | 0.40 ±0.01 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.102 | -0.97 | 6.5 | 14.25 | 6.40 ±0.02 | 8.80 ±0.02 | 35.40 ±0.04 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.000 | -1.27 | 8.0 | 6.35 | 1.60 ±0.01 | 9.20 ±0.03 | 13.00 ±0.03 |
Spoken Language Identification
Tasks · Others — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | OTHER · CoVost2_AIR-Bench |
|---|---|---|---|---|---|---|
| nvidia/audio-flamingo-3-hf | 8.2B | 1.000 | 1.37 | 1.0 | 93.00 | 93.00 ±0.02 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.912 | 1.13 | 2.0 | 88.40 | 88.40 ±0.03 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.748 | 0.68 | 3.0 | 79.80 | 79.80 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.668 | 0.46 | 4.0 | 75.60 | 75.60 ±0.04 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.431 | -0.19 | 5.0 | 63.20 | 63.20 ±0.04 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.195 | -0.83 | 6.0 | 50.80 | 50.80 ±0.04 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.042 | -1.25 | 7.0 | 42.80 | 42.80 ±0.04 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.000 | -1.37 | 8.0 | 40.60 | 40.60 ±0.04 |
Tasks · Music
Music Captioning
Tasks · Music — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | OTHER · MusicCaps |
|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 1.000 | 1.58 | 1.0 | 59.76 | 59.76 ±0.04 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.864 | 1.17 | 2.0 | 55.48 | 55.48 ±0.11 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.709 | 0.71 | 3.0 | 50.60 | 50.60 ±0.07 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.507 | 0.10 | 4.0 | 44.24 | 44.24 ±0.06 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.290 | -0.56 | 5.0 | 37.40 | 37.40 ±0.07 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.254 | -0.67 | 6.0 | 36.28 | 36.28 ±0.07 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.178 | -0.90 | 7.0 | 33.88 | 33.88 ±0.07 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.000 | -1.43 | 8.0 | 28.28 | 28.28 ±0.06 |
Music Question Answering
Tasks · Music — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · MusicCaps-QA | EN · MTJ-Jamendo_AIR-Bench | EN · MusicCaps-QA |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.970 | 1.40 | 1.5 | 59.00 | 55.32 ±0.07 | 63.20 ±0.04 | 62.16 ±0.08 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.906 | 1.23 | 2.5 | 57.16 | 54.72 ±0.09 | 62.60 ±0.04 | 56.60 ±0.08 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.684 | 0.59 | 2.0 | 49.57 | 56.48 ±0.07 | 28.00 ±0.04 | 57.32 ±0.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.560 | 0.28 | 4.0 | 46.87 | 52.68 ±0.08 | 31.80 ±0.04 | 50.32 ±0.09 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.205 | -0.63 | 5.5 | 38.48 | 43.60 ±0.07 | 24.80 ±0.04 | 41.92 ±0.08 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.143 | -0.75 | 6.0 | 38.22 | 38.28 ±0.09 | 41.00 ±0.04 | 35.32 ±0.08 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.069 | -0.96 | 6.5 | 35.77 | 38.56 ±0.07 | 23.80 ±0.04 | 42.16 ±0.08 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.000 | -1.14 | 8.0 | 34.06 | 37.08 ±0.08 | 21.00 ±0.04 | 41.08 ±0.08 |
Tasks · Sound
Audio Captioning
Tasks · Sound — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | OTHER · AudioCaps | OTHER · WavCaps |
|---|---|---|---|---|---|---|---|
| nvidia/audio-flamingo-3-hf | 8.2B | 1.000 | 1.22 | 1.0 | 57.76 | 64.00 ±0.06 | 51.52 ±0.10 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.914 | 0.99 | 2.0 | 54.68 | 56.60 ±0.09 | 52.76 ±0.10 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.887 | 0.91 | 3.0 | 53.70 | 54.36 ±0.09 | 53.04 ±0.10 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.711 | 0.44 | 4.0 | 47.34 | 48.40 ±0.08 | 46.28 ±0.09 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.476 | -0.20 | 5.0 | 38.90 | 38.56 ±0.07 | 39.24 ±0.07 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.393 | -0.43 | 6.0 | 35.88 | 37.48 ±0.09 | 34.28 ±0.09 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.018 | -1.44 | 7.0 | 22.38 | 21.24 ±0.03 | 23.52 ±0.05 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.000 | -1.49 | 8.0 | 21.74 | 21.04 ±0.03 | 22.44 ±0.05 |
Audio Question Answering
Tasks · Sound — FLOW_JUDGE
| Model | Size | Min-Max | Z-Score | Avg Rank | Average | EN · AudioCaps-QA | EN · Clotho-AQA | EN · WavCaps-QA |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 1.000 | 0.92 | 1.0 | 60.99 | 63.26 ±0.15 | 57.88 ±0.14 | 61.84 ±0.16 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.928 | 0.74 | 2.0 | 59.73 | 63.71 ±0.11 | 52.64 ±0.13 | 62.83 ±0.14 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.897 | 0.66 | 3.0 | 59.19 | 58.08 ±0.16 | 65.92 ±0.14 | 53.55 ±0.17 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.888 | 0.63 | 4.0 | 59.02 | 56.10 ±0.15 | 63.72 ±0.15 | 57.24 ±0.17 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.869 | 0.59 | 5.0 | 58.69 | 60.13 ±0.12 | 58.72 ±0.11 | 57.24 ±0.14 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.542 | -0.25 | 6.0 | 52.93 | 53.42 ±0.15 | 58.28 ±0.13 | 47.11 ±0.16 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.005 | -1.63 | 7.0 | 43.49 | 38.34 ±0.14 | 52.72 ±0.15 | 39.41 ±0.15 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.000 | -1.64 | 8.0 | 43.41 | 39.68 ±0.14 | 49.56 ±0.15 | 40.99 ±0.15 |
Languages · French
French — Models × Tasks
| Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %) | CommonVoice | Fleurs | Multilingual_TEDx | SUMM-RE | VoxPopuli | YouTubeFr | AST (METEOR %) | Multilingual_TEDx (FR→EN) | Multilingual_TEDx (FR→ES) | QUESTION ANSWERING (FLOW_JUDGE) | CohereLabs-Aya_collection | Vigogne--Alpaca | VoxPopuli-QA | MUSIC QUESTION ANSWERING (FLOW_JUDGE) | EMOTION RECOGNITION (FLOW_JUDGE) | GENDER RECOGNITION (FLOW_JUDGE) | AGE RECOGNITION (FLOW_JUDGE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.781 | 0.94 | 2.6 | 33.56 ±3.68 | 18.92 ±2.45 | 27.90 ±2.52 | 26.52 ±15.79 | 39.09 ±3.37 | 37.35 ±4.38 | 51.58 ±13.46 | 50.15 ±1.54 | 51.06 ±2.30 | 49.24 ±2.04 | 50.99 ±0.09 | 34.84 ±0.42 | 43.28 ±0.10 | 74.84 ±0.10 | 55.32 ±0.07 | 16.20 ±0.03 | 67.00 ±0.04 | 15.00 ±0.03 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.561 | 0.23 | 3.6 | 43.81 ±16.23 | 26.18 ±5.77 | 21.29 ±3.57 | 55.24 ±41.81 | 63.68 ±72.63 | 27.56 ±7.56 | 68.93 ±47.91 | 48.56 ±1.85 | 57.03 ±2.36 | 40.09 ±2.66 | 52.16 ±0.09 | 35.48 ±0.32 | 41.76 ±0.09 | 79.24 ±0.10 | 52.68 ±0.08 | 12.20 ±0.03 | 31.00 ±0.04 | 5.00 ±0.02 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.544 | 0.24 | 3.9 | 21.70 ±13.80 | 10.65 ±1.84 | 10.15 ±1.40 | 22.07 ±55.31 | 42.83 ±55.71 | 13.37 ±24.17 | 31.16 ±8.81 | 46.40 ±1.39 | 47.57 ±1.93 | 45.23 ±2.00 | 59.84 ±0.08 | 48.39 ±0.33 | 52.56 ±0.09 | 78.56 ±0.10 | 43.60 ±0.07 | 6.80 ±0.02 | 29.20 ±0.04 | 3.60 ±0.02 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.507 | 0.09 | 4.6 | 73.74 ±9.09 | 48.31 ±7.41 | 57.50 ±6.22 | 76.10 ±46.43 | 73.64 ±9.60 | 94.59 ±6.21 | 92.27 ±23.11 | 33.01 ±1.65 | 43.68 ±2.03 | 22.34 ±2.25 | 44.63 ±0.07 | 37.42 ±0.32 | 38.12 ±0.09 | 58.36 ±0.10 | 54.72 ±0.09 | 53.00 ±0.04 | 48.40 ±0.04 | 3.20 ±0.02 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.381 | -0.24 | 4.9 | 20.80 ±1.78 | 13.01 ±1.94 | 11.92 ±1.34 | 18.06 ±8.22 | 32.69 ±4.36 | 17.08 ±1.58 | 32.04 ±3.48 | 37.83 ±1.44 | 41.47 ±1.94 | 34.19 ±2.08 | 54.45 ±0.08 | 42.58 ±0.31 | 48.20 ±0.10 | 72.56 ±0.11 | 38.56 ±0.07 | 4.80 ±0.02 | 27.40 ±0.04 | 1.80 ±0.01 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.365 | -0.35 | 5.0 | 127.95 ±36.03 | 71.61 ±29.42 | 75.16 ±3.59 | 185.12 ±176.73 | 201.50 ±52.98 | 103.83 ±38.14 | 130.46 ±98.40 | 27.73 ±1.02 | 29.81 ±1.36 | 25.65 ±1.50 | 61.04 ±0.07 | 48.39 ±0.41 | 57.08 ±0.10 | 77.64 ±0.08 | 56.48 ±0.07 | 9.20 ±0.03 | 28.20 ±0.04 | 0.40 ±0.01 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.353 | -0.37 | 5.6 | 44.41 ±9.42 | 25.86 ±17.61 | 12.16 ±2.18 | 36.79 ±35.85 | 82.09 ±20.51 | 31.48 ±7.55 | 78.10 ±31.14 | 50.74 ±1.52 | 53.66 ±2.01 | 47.82 ±2.26 | 53.78 ±0.07 | 42.58 ±0.38 | 51.08 ±0.10 | 67.68 ±0.10 | 38.28 ±0.09 | 1.60 ±0.01 | 6.40 ±0.02 | 0.40 ±0.01 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.297 | -0.54 | 6.0 | 82.32 ±12.68 | 97.54 ±33.55 | 28.89 ±7.09 | 93.08 ±41.98 | 128.57 ±25.32 | 47.17 ±14.03 | 98.67 ±41.83 | 49.03 ±1.52 | 51.84 ±2.00 | 46.21 ±2.27 | 51.26 ±0.08 | 39.35 ±0.39 | 48.84 ±0.10 | 65.60 ±0.11 | 37.08 ±0.08 | 2.60 ±0.01 | 1.60 ±0.01 | 4.80 ±0.02 |
Languages · English
English — Models × Tasks
| Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %) | CommonVoice | Fleurs | VoxPopuli | QUESTION ANSWERING (FLOW_JUDGE) | NationalSpeechCorpus_SQA | OpenHermes_audio | SLUE-P2-SQA5 | SpokenWOZ_AIR-Bench | alpaca_audio | fisher_AIR-Bench | public-sg-speech | MUSIC QUESTION ANSWERING (FLOW_JUDGE) | MTJ-Jamendo_AIR-Bench | MusicCaps-QA | EMOTION RECOGNITION (FLOW_JUDGE) | IEMOCAP-Emotion | MELD_Emotion | GENDER RECOGNITION (FLOW_JUDGE) | CommonVoice_Gender | IEMOCAP-gender | AGE RECOGNITION (FLOW_JUDGE) | AUDIO QUESTION ANSWERING (FLOW_JUDGE) | AudioCaps-QA | Clotho-AQA | WavCaps-QA | DIALOGUE SUMMARIZATION (FLOW_JUDGE) | MATH QUESTION ANSWERING (ACC %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.772 | 0.82 | 3.0 | 11.00 ±1.15 | 10.30 ±1.65 | 7.16 ±1.22 | 15.54 ±2.72 | 68.21 ±0.05 | 57.45 ±0.08 | 62.60 ±0.27 | 87.94 ±0.11 | 68.91 ±0.19 | 63.80 ±0.28 | 72.00 ±0.16 | 64.76 ±0.08 | 62.68 ±0.09 | 63.20 ±0.04 | 62.16 ±0.08 | 37.20 ±0.03 | 35.80 ±0.04 | 38.60 ±0.04 | 95.20 ±0.01 | 92.40 ±0.02 | 98.00 ±0.01 | 32.20 ±0.04 | 60.99 ±0.09 | 63.26 ±0.15 | 57.88 ±0.14 | 61.84 ±0.16 | 51.93 ±0.08 | 66.00 ±9.28 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.766 | 0.79 | 3.3 | 12.58 ±8.94 | 13.22 ±2.87 | 11.98 ±1.63 | 12.53 ±26.61 | 60.84 ±0.06 | 63.60 ±0.08 | 48.80 ±0.27 | 78.48 ±0.13 | 69.22 ±0.17 | 24.00 ±0.41 | 71.70 ±0.13 | 70.08 ±0.07 | 59.60 ±0.08 | 62.60 ±0.04 | 56.60 ±0.08 | 52.60 ±0.03 | 43.20 ±0.04 | 62.00 ±0.04 | 89.50 ±0.02 | 93.00 ±0.02 | 86.00 ±0.03 | 35.80 ±0.04 | 59.73 ±0.08 | 63.71 ±0.11 | 52.64 ±0.13 | 62.83 ±0.14 | 55.53 ±0.09 | 64.00 ±9.41 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.469 | 0.11 | 3.9 | 81.49 ±20.97 | 54.35 ±37.70 | 73.73 ±14.57 | 116.40 ±47.54 | 79.34 ±0.04 | 74.00 ±0.07 | 77.40 ±0.20 | 88.28 ±0.09 | 77.20 ±0.15 | 78.20 ±0.19 | 81.10 ±0.11 | 79.20 ±0.06 | 42.66 ±0.09 | 28.00 ±0.04 | 57.32 ±0.08 | 18.40 ±0.02 | 27.20 ±0.04 | 9.60 ±0.03 | 13.30 ±0.02 | 26.20 ±0.04 | 0.40 ±0.01 | 4.40 ±0.02 | 58.69 ±0.07 | 60.13 ±0.12 | 58.72 ±0.11 | 57.24 ±0.14 | 66.67 ±0.08 | 69.00 ±9.06 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.465 | -0.03 | 3.7 | 10.29 ±1.80 | 12.42 ±4.75 | 6.36 ±1.06 | 12.09 ±2.27 | 70.93 ±0.05 | 63.75 ±0.09 | 61.80 ±0.27 | 91.76 ±0.08 | 74.72 ±0.17 | 53.40 ±0.29 | 76.90 ±0.15 | 74.16 ±0.08 | 41.06 ±0.08 | 31.80 ±0.04 | 50.32 ±0.09 | 21.80 ±0.03 | 22.20 ±0.04 | 21.40 ±0.04 | 57.30 ±0.03 | 58.40 ±0.04 | 56.20 ±0.04 | 3.00 ±0.01 | 59.02 ±0.09 | 56.10 ±0.15 | 63.72 ±0.15 | 57.24 ±0.17 | 56.40 ±0.09 | 23.00 ±8.25 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.446 | -0.02 | 4.6 | 41.77 ±10.57 | 75.07 ±20.40 | 20.60 ±3.87 | 29.65 ±23.53 | 72.47 ±0.05 | 63.05 ±0.09 | 68.40 ±0.19 | 90.10 ±0.09 | 71.30 ±0.19 | 69.40 ±0.22 | 73.70 ±0.16 | 71.32 ±0.08 | 38.16 ±0.06 | 41.00 ±0.04 | 35.32 ±0.08 | 14.20 ±0.02 | 22.60 ±0.04 | 5.80 ±0.02 | 22.10 ±0.03 | 8.80 ±0.02 | 35.40 ±0.04 | 0.00 ±0.00 | 59.19 ±0.09 | 58.08 ±0.16 | 65.92 ±0.14 | 53.55 ±0.17 | 55.33 ±0.08 | 89.00 ±6.13 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.314 | -0.38 | 5.8 | 68.47 ±13.86 | 120.36 ±27.27 | 29.73 ±7.38 | 55.32 ±29.51 | 70.84 ±0.05 | 64.95 ±0.09 | 64.80 ±0.18 | 88.97 ±0.10 | 67.56 ±0.20 | 68.60 ±0.23 | 71.10 ±0.17 | 69.92 ±0.08 | 31.04 ±0.07 | 21.00 ±0.04 | 41.08 ±0.08 | 15.20 ±0.02 | 24.80 ±0.04 | 5.60 ±0.02 | 11.10 ±0.02 | 9.20 ±0.03 | 13.00 ±0.03 | 1.00 ±0.01 | 52.93 ±0.08 | 53.42 ±0.15 | 58.28 ±0.13 | 47.11 ±0.16 | 55.20 ±0.09 | 85.00 ±7.00 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.311 | -0.44 | 5.6 | 9.43 ±1.71 | 10.89 ±1.54 | 7.49 ±0.98 | 9.92 ±4.78 | 69.28 ±0.05 | 59.70 ±0.09 | 67.40 ±0.25 | 86.13 ±0.12 | 65.18 ±0.20 | 70.40 ±0.25 | 69.80 ±0.17 | 66.36 ±0.08 | 33.36 ±0.07 | 24.80 ±0.04 | 41.92 ±0.08 | 13.00 ±0.02 | 18.80 ±0.03 | 7.20 ±0.02 | 33.50 ±0.03 | 17.60 ±0.03 | 49.40 ±0.04 | 1.40 ±0.01 | 43.41 ±0.09 | 39.68 ±0.14 | 49.56 ±0.15 | 40.99 ±0.15 | 49.40 ±0.09 | 66.00 ±9.28 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.193 | -0.85 | 6.2 | 12.09 ±0.95 | 12.26 ±1.89 | 11.05 ±1.12 | 12.97 ±1.80 | 62.17 ±0.06 | 54.95 ±0.09 | 57.00 ±0.22 | 77.35 ±0.15 | 57.62 ±0.20 | 60.60 ±0.26 | 63.90 ±0.18 | 63.80 ±0.08 | 32.98 ±0.07 | 23.80 ±0.04 | 42.16 ±0.08 | 13.70 ±0.02 | 22.00 ±0.04 | 5.40 ±0.02 | 36.30 ±0.03 | 40.40 ±0.04 | 32.20 ±0.04 | 2.20 ±0.01 | 43.49 ±0.09 | 38.34 ±0.14 | 52.72 ±0.15 | 39.41 ±0.15 | 41.60 ±0.09 | 40.00 ±9.60 |
Languages · Others
Others — Models × Tasks
| Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %) | Fleurs | Multilingual_TEDx | AST (METEOR %) | Multilingual_TEDx (ES→FR) | Multilingual_TEDx (ES→IT) | AUDIO CAPTIONING (FLOW_JUDGE) | AudioCaps | WavCaps | MUSIC CAPTIONING (FLOW_JUDGE) | SPOKEN LANGUAGE IDENTIFICATION (FLOW_JUDGE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.950 | 1.15 | 1.8 | 24.43 ±2.19 | 14.99 ±2.14 | 23.31 ±5.43 | 45.84 ±1.73 | 46.22 ±2.08 | 45.46 ±3.11 | 53.70 ±0.06 | 54.36 ±0.09 | 53.04 ±0.10 | 59.76 ±0.04 | 88.40 ±0.03 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.737 | 0.50 | 3.0 | 32.39 ±8.72 | 15.68 ±3.65 | 54.62 ±32.25 | 41.12 ±1.85 | 42.80 ±2.25 | 39.45 ±3.23 | 54.68 ±0.07 | 56.60 ±0.09 | 52.76 ±0.10 | 37.40 ±0.07 | 79.80 ±0.04 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.652 | 0.18 | 3.8 | 91.53 ±8.97 | 96.60 ±9.93 | 80.80 ±43.95 | 14.94 ±1.55 | 17.21 ±1.99 | 12.67 ±2.35 | 57.76 ±0.06 | 64.00 ±0.06 | 51.52 ±0.10 | 55.48 ±0.11 | 93.00 ±0.02 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.582 | 0.07 | 4.6 | 41.02 ±5.26 | 29.88 ±9.76 | 34.03 ±8.53 | 36.07 ±2.05 | 31.28 ±2.46 | 40.85 ±3.57 | 47.34 ±0.06 | 48.40 ±0.08 | 46.28 ±0.09 | 44.24 ±0.06 | 50.80 ±0.04 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.522 | -0.10 | 5.2 | 54.28 ±10.89 | 22.88 ±5.74 | 82.28 ±48.63 | 40.50 ±1.77 | 39.10 ±2.19 | 41.91 ±3.00 | 35.88 ±0.06 | 37.48 ±0.09 | 34.28 ±0.09 | 36.28 ±0.07 | 63.20 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.426 | -0.45 | 5.4 | 139.30 ±27.78 | 96.91 ±16.09 | 173.54 ±93.23 | 23.46 ±1.18 | 21.31 ±1.35 | 25.62 ±2.25 | 38.90 ±0.05 | 38.56 ±0.07 | 39.24 ±0.07 | 50.60 ±0.07 | 75.60 ±0.04 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.371 | -0.51 | 5.6 | 18.72 ±11.42 | 12.14 ±2.25 | 26.31 ±14.69 | 35.30 ±1.69 | 34.56 ±2.00 | 36.04 ±3.14 | 22.38 ±0.03 | 21.24 ±0.03 | 23.52 ±0.05 | 33.88 ±0.07 | 40.60 ±0.04 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.267 | -0.84 | 6.6 | 38.06 ±3.23 | 58.79 ±3.43 | 101.99 ±11.44 | 28.99 ±1.64 | 27.86 ±1.93 | 30.12 ±3.08 | 21.74 ±0.03 | 21.04 ±0.03 | 22.44 ±0.05 | 28.28 ±0.06 | 42.80 ±0.04 |