Overview

Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %): Average, FR, EN, DE, ES, IT, PT, NL | AST (METEOR %): Average, ES-FR · Multilingual_TEDx, ES-IT · Multilingual_TEDx, FR-EN · Multilingual_TEDx, FR-ES · Multilingual_TEDx | Question Answering (FLOW_JUDGE): Average, FR, EN | Others (FLOW_JUDGE): Average, Emotion Recognition, Gender Recognition, Age Recognition, Dialogue Summarization, Spoken Language Identification | Music (FLOW_JUDGE): Average, Music Question Answering, Music Captioning | Sound (FLOW_JUDGE): Average, Audio Captioning, Audio Question Answering
Qwen/Qwen2-Audio-7B-Instruct 8.4B 0.863 0.95 2.3 25.23 ±1.67 33.56 ±3.68 11.00 ±1.15 32.65 ±4.70 16.57 ±2.64 21.20 ±7.11 19.15 ±2.94 40.68 ±3.75 48.00 ±1.15 46.22 ±2.08 45.46 ±3.11 51.06 ±2.30 49.24 ±2.04 63.04 ±0.05 50.99 ±0.09 68.21 ±0.05 55.99 30.20 ±0.02 85.80 ±0.02 23.60 ±0.03 51.93 ±0.08 88.40 ±0.03 59.99 60.23 ±0.07 59.76 ±0.04 57.35 53.70 ±0.06 60.99 ±0.09
Qwen/Qwen2.5-Omni-7B 11B 0.613 0.27 3.7 37.96 ±5.64 44.41 ±9.42 41.77 ±10.57 48.09 ±21.47 28.24 ±23.62 18.58 ±12.24 35.15 ±16.46 31.38 ±4.68 45.93 ±1.20 42.80 ±2.25 39.45 ±3.23 53.66 ±2.01 47.82 ±2.26 66.86 ±0.04 53.78 ±0.07 72.47 ±0.05 32.44 10.00 ±0.02 16.87 ±0.02 0.20 ±0.00 55.33 ±0.08 79.80 ±0.04 37.80 38.20 ±0.05 37.40 ±0.07 56.93 54.68 ±0.07 59.19 ±0.09
microsoft/Phi-4-multimodal-instruct 5.6B 0.608 0.25 3.8 36.83 ±6.09 43.81 ±16.23 10.29 ±1.80 41.56 ±6.57 34.85 ±16.84 24.70 ±9.17 31.95 ±6.49 103.10 ±22.76 42.31 ±1.41 31.28 ±2.46 40.85 ±3.57 57.03 ±2.36 40.09 ±2.66 65.30 ±0.05 52.16 ±0.09 70.93 ±0.05 35.67 18.60 ±0.02 48.53 ±0.03 4.00 ±0.01 56.40 ±0.09 50.80 ±0.04 44.59 44.93 ±0.07 44.24 ±0.06 53.18 47.34 ±0.06 59.02 ±0.09
nvidia/audio-flamingo-3-hf 8.2B 0.564 0.03 4.5 72.44 ±5.63 73.74 ±9.09 12.58 ±8.94 112.01 ±22.08 67.31 ±10.68 106.41 ±20.15 88.70 ±22.54 74.85 ±6.63 23.97 ±1.22 17.21 ±1.99 12.67 ±2.35 43.68 ±2.03 22.34 ±2.25 55.98 ±0.05 44.63 ±0.07 60.84 ±0.06 59.31 52.73 ±0.03 75.80 ±0.02 19.50 ±0.02 55.53 ±0.09 93.00 ±0.02 56.73 57.97 ±0.07 55.48 ±0.11 58.74 57.76 ±0.06 59.73 ±0.08
mistralai/Voxtral-Mini-3B-2507 4.68B 0.431 -0.22 4.7 125.88 ±18.69 127.95 ±36.03 81.49 ±20.97 163.93 ±69.63 125.22 ±69.41 135.14 ±50.01 135.23 ±47.91 134.67 ±40.94 25.60 ±0.78 21.31 ±1.35 25.62 ±2.25 29.81 ±1.36 25.65 ±1.50 73.85 ±0.04 61.04 ±0.07 79.34 ±0.04 35.65 15.33 ±0.02 18.27 ±0.02 2.40 ±0.01 66.67 ±0.08 75.60 ±0.04 48.93 47.27 ±0.07 50.60 ±0.07 48.80 38.90 ±0.05 58.69 ±0.07
Qwen/Qwen2.5-Omni-3B 5.9B 0.424 -0.27 5.5 65.99 ±7.27 82.32 ±12.68 68.47 ±13.86 71.28 ±25.51 55.52 ±24.49 44.98 ±19.24 52.58 ±24.86 39.77 ±7.22 44.77 ±1.17 39.10 ±2.19 41.91 ±3.00 51.84 ±2.00 46.21 ±2.27 64.97 ±0.04 51.26 ±0.08 70.84 ±0.05 28.05 11.00 ±0.02 7.93 ±0.01 2.90 ±0.01 55.20 ±0.09 63.20 ±0.04 34.67 33.05 ±0.06 36.28 ±0.07 44.41 35.88 ±0.06 52.93 ±0.08
LINAGORA/Canary-Qwen3-4B_data-v1_8h 4.8B 0.410 -0.26 4.8 18.17 ±7.32 21.70 ±13.80 9.43 ±1.71 28.29 ±36.38 12.58 ±32.37 12.59 ±7.25 19.23 ±7.46 23.10 ±2.25 40.85 ±1.11 34.56 ±2.00 36.04 ±3.14 47.57 ±1.93 45.23 ±2.00 66.45 ±0.05 59.84 ±0.08 69.28 ±0.05 27.10 10.93 ±0.02 32.07 ±0.02 2.50 ±0.01 49.40 ±0.09 40.60 ±0.04 35.33 36.77 ±0.06 33.88 ±0.07 32.89 22.38 ±0.03 43.41 ±0.09
LINAGORA/Canary-Qwen3-1.7B_data-v1_8h 2.5B 0.253 -0.75 6.7 27.98 ±1.73 20.80 ±1.78 12.09 ±0.95 33.12 ±7.10 13.93 ±6.25 18.95 ±7.88 80.39 ±6.12 49.77 ±3.54 33.41 ±1.11 27.86 ±1.93 30.12 ±3.08 41.47 ±1.94 34.19 ±2.08 59.86 ±0.05 54.45 ±0.08 62.17 ±0.06 26.09 10.73 ±0.02 33.33 ±0.02 2.00 ±0.01 41.60 ±0.09 42.80 ±0.04 31.56 34.84 ±0.06 28.28 ±0.06 32.61 21.74 ±0.03 43.49 ±0.09

Overview (FR/EN — ASR, AST, QA)

Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %): Average, FR, EN | AST (METEOR %): Average, ES-FR · Multilingual_TEDx, FR-EN · Multilingual_TEDx, FR-ES · Multilingual_TEDx | Question Answering (FLOW_JUDGE): Average, FR, EN
LINAGORA/Canary-Qwen3-4B_data-v1_8h 4.8B 0.770 0.57 3.0 17.61 ±9.23 21.70 ±13.80 9.43 ±1.71 42.45 ±1.18 34.56 ±2.00 47.57 ±1.93 45.23 ±2.00 66.45 ±0.05 59.84 ±0.08 69.28 ±0.05
Qwen/Qwen2-Audio-7B-Instruct 8.4B 0.769 0.50 3.3 26.04 ±2.52 33.56 ±3.68 11.00 ±1.15 48.84 ±1.24 46.22 ±2.08 51.06 ±2.30 49.24 ±2.04 63.04 ±0.05 50.99 ±0.09 68.21 ±0.05
Qwen/Qwen2.5-Omni-7B 11B 0.768 0.54 3.0 43.53 ±7.21 44.41 ±9.42 41.77 ±10.57 48.09 ±1.28 42.80 ±2.25 53.66 ±2.01 47.82 ±2.26 66.86 ±0.04 53.78 ±0.07 72.47 ±0.05
microsoft/Phi-4-multimodal-instruct 5.6B 0.701 0.34 4.0 32.64 ±10.86 43.81 ±16.23 10.29 ±1.80 42.80 ±1.54 31.28 ±2.46 57.03 ±2.36 40.09 ±2.66 65.30 ±0.05 52.16 ±0.09 70.93 ±0.05
Qwen/Qwen2.5-Omni-3B 5.9B 0.578 -0.05 5.0 77.70 ±9.65 82.32 ±12.68 68.47 ±13.86 45.72 ±1.27 39.10 ±2.19 51.84 ±2.00 46.21 ±2.27 64.97 ±0.04 51.26 ±0.08 70.84 ±0.05
LINAGORA/Canary-Qwen3-1.7B_data-v1_8h 2.5B 0.532 -0.19 5.0 17.90 ±1.24 20.80 ±1.78 12.09 ±0.95 34.50 ±1.18 27.86 ±1.93 41.47 ±1.94 34.19 ±2.08 59.86 ±0.05 54.45 ±0.08 62.17 ±0.06
mistralai/Voxtral-Mini-3B-2507 4.68B 0.333 -0.61 5.7 112.46 ±25.06 127.95 ±36.03 81.49 ±20.97 25.59 ±0.83 21.31 ±1.35 29.81 ±1.36 25.65 ±1.50 73.85 ±0.04 61.04 ±0.07 79.34 ±0.04
nvidia/audio-flamingo-3-hf 8.2B 0.239 -1.10 7.0 53.35 ±6.83 73.74 ±9.09 12.58 ±8.94 27.74 ±1.34 17.21 ±1.99 43.68 ±2.03 22.34 ±2.25 55.98 ±0.05 44.63 ±0.07 60.84 ±0.06

Tasks · ASR

Tasks · ASR — WER (%)
Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR: Average, CommonVoice, Fleurs, Multilingual_TEDx, SUMM-RE, VoxPopuli, YouTubeFr | EN: Average, CommonVoice, Fleurs, VoxPopuli | DE: Average, Fleurs, Multilingual_TEDx | ES: Average, Fleurs, Multilingual_TEDx | IT: Average, Fleurs, Multilingual_TEDx | PT: Average, Fleurs, Multilingual_TEDx | NL: Fleurs
LINAGORA/Canary-Qwen3-4B_data-v1_8h 4.8B 0.999 0.91 1.3 18.13 21.70 ±13.80 10.65 ±1.84 10.15 ±1.40 22.07 ±55.31 42.83 ±55.71 13.37 ±24.17 31.16 ±8.81 9.43 ±1.71 10.89 ±1.54 7.49 ±0.98 9.92 ±4.78 28.29 ±36.38 11.23 ±1.05 45.35 ±72.58 12.58 ±32.37 5.66 ±0.85 19.50 ±64.61 12.59 ±7.25 9.10 ±1.03 16.08 ±14.39 19.23 ±7.46 12.14 ±2.25 26.31 ±14.69 23.10 ±2.25
Qwen/Qwen2-Audio-7B-Instruct 8.4B 0.938 0.73 2.9 24.97 33.56 ±3.68 18.92 ±2.45 27.90 ±2.52 26.52 ±15.79 39.09 ±3.37 37.35 ±4.38 51.58 ±13.46 11.00 ±1.15 10.30 ±1.65 7.16 ±1.22 15.54 ±2.72 32.65 ±4.70 22.40 ±3.06 42.91 ±8.72 16.57 ±2.64 14.55 ±2.27 18.59 ±4.76 21.20 ±7.11 20.92 ±2.51 21.47 ±13.97 19.15 ±2.94 14.99 ±2.14 23.31 ±5.43 40.68 ±3.75
LINAGORA/Canary-Qwen3-1.7B_data-v1_8h 2.5B 0.871 0.53 3.4 32.72 20.80 ±1.78 13.01 ±1.94 11.92 ±1.34 18.06 ±8.22 32.69 ±4.36 17.08 ±1.58 32.04 ±3.48 12.09 ±0.95 12.26 ±1.89 11.05 ±1.12 12.97 ±1.80 33.12 ±7.10 16.66 ±1.57 49.57 ±13.82 13.93 ±6.25 8.33 ±0.99 19.52 ±12.34 18.95 ±7.88 12.38 ±1.20 25.53 ±15.59 80.39 ±6.12 58.79 ±3.43 101.99 ±11.44 49.77 ±3.54
Qwen/Qwen2.5-Omni-7B 11B 0.826 0.41 4.0 35.37 44.41 ±9.42 25.86 ±17.61 12.16 ±2.18 36.79 ±35.85 82.09 ±20.51 31.48 ±7.55 78.10 ±31.14 41.77 ±10.57 75.07 ±20.40 20.60 ±3.87 29.65 ±23.53 48.09 ±21.47 18.30 ±3.04 77.88 ±42.03 28.24 ±23.62 13.29 ±2.63 43.19 ±46.66 18.58 ±12.24 13.79 ±2.83 23.36 ±24.20 35.15 ±16.46 15.68 ±3.65 54.62 ±32.25 31.38 ±4.68
microsoft/Phi-4-multimodal-instruct 5.6B 0.793 0.29 4.3 41.47 43.81 ±16.23 26.18 ±5.77 21.29 ±3.57 55.24 ±41.81 63.68 ±72.63 27.56 ±7.56 68.93 ±47.91 10.29 ±1.80 12.42 ±4.75 6.36 ±1.06 12.09 ±2.27 41.56 ±6.57 26.78 ±5.05 56.33 ±11.83 34.85 ±16.84 15.44 ±3.23 54.25 ±33.36 24.70 ±9.17 18.83 ±4.24 30.58 ±17.70 31.95 ±6.49 29.88 ±9.76 34.03 ±8.53 103.10 ±22.76
Qwen/Qwen2.5-Omni-3B 5.9B 0.601 -0.25 5.7 59.27 82.32 ±12.68 97.54 ±33.55 28.89 ±7.09 93.08 ±41.98 128.57 ±25.32 47.17 ±14.03 98.67 ±41.83 68.47 ±13.86 120.36 ±27.27 29.73 ±7.38 55.32 ±29.51 71.28 ±25.51 28.98 ±5.83 113.59 ±49.14 55.52 ±24.49 26.69 ±5.24 84.34 ±47.31 44.98 ±19.24 25.84 ±4.56 64.12 ±37.65 52.58 ±24.86 22.88 ±5.74 82.28 ±48.63 39.77 ±7.22
nvidia/audio-flamingo-3-hf 8.2B 0.504 -0.57 6.4 76.51 73.74 ±9.09 48.31 ±7.41 57.50 ±6.22 76.10 ±46.43 73.64 ±9.60 94.59 ±6.21 92.27 ±23.11 12.58 ±8.94 13.22 ±2.87 11.98 ±1.63 12.53 ±26.61 112.01 ±22.08 97.16 ±10.60 126.86 ±42.67 67.31 ±10.68 64.86 ±6.38 69.77 ±20.29 106.41 ±20.15 120.32 ±10.00 92.50 ±39.02 88.70 ±22.54 96.60 ±9.93 80.80 ±43.95 74.85 ±6.63
mistralai/Voxtral-Mini-3B-2507 4.68B 0.000 -2.05 8.0 129.09 127.95 ±36.03 71.61 ±29.42 75.16 ±3.59 185.12 ±176.73 201.50 ±52.98 103.83 ±38.14 130.46 ±98.40 81.49 ±20.97 54.35 ±37.70 73.73 ±14.57 116.40 ±47.54 163.93 ±69.63 97.87 ±7.06 229.99 ±137.44 125.22 ±69.41 83.45 ±6.91 167.00 ±137.18 135.14 ±50.01 80.32 ±3.53 189.97 ±98.55 135.23 ±47.91 96.91 ±16.09 173.54 ±93.23 134.67 ±40.94
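WER is the count of word-level substitutions, deletions, and insertions divided by the reference word count, which is why values above 100% appear in the table: an over-generating model accrues insertions beyond the reference length. A minimal sketch of the metric (the `wer` helper is illustrative, not the harness used for these runs):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in %: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

Because insertions count in the numerator but only the reference length in the denominator, a hypothesis twice the reference length can yield WER near 200%, as with some rows above.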

Tasks · AST

Tasks · AST — BLEU

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR→EN · Multilingual_TEDx | FR→ES · Multilingual_TEDx | ES→FR · Multilingual_TEDx | ES→IT · Multilingual_TEDx |
|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.901 | 0.92 | 1.8 | 26.86 | 29.15 ±2.03 | 24.60 ±1.71 | 28.09 ±1.79 | 25.60 ±2.55 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.861 | 0.82 | 2.0 | 26.47 | 35.51 ±2.08 | 27.14 ±1.94 | 23.90 ±1.80 | 19.34 ±3.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.846 | 0.76 | 2.8 | 26.37 | 39.13 ±2.25 | 22.38 ±2.14 | 22.90 ±1.92 | 21.06 ±2.88 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.710 | 0.30 | 4.2 | 22.14 | 25.53 ±1.71 | 22.45 ±1.83 | 23.56 ±1.87 | 17.03 ±2.49 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.643 | 0.07 | 5.2 | 20.29 | 23.17 ±1.95 | 21.08 ±1.95 | 18.32 ±1.76 | 18.60 ±2.34 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.597 | -0.07 | 5.8 | 19.61 | 27.71 ±1.83 | 18.19 ±1.67 | 17.13 ±1.49 | 15.40 ±2.60 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.385 | -0.74 | 6.2 | 14.86 | 29.01 ±1.94 | 13.53 ±1.62 | 12.43 ±1.48 | 4.48 ±1.97 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.000 | -2.06 | 8.0 | 3.67 | 4.26 ±0.52 | 3.64 ±0.60 | 3.19 ±0.51 | 3.61 ±0.99 |
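The BLEU scores above are on a 0–100 scale. A simplified sentence-level BLEU-4 (clipped n-gram precisions combined by geometric mean, times a brevity penalty) shows the shape of the metric; production evaluations normally use sacreBLEU at corpus level with its standard tokenization, so treat this as an illustration only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified BLEU-4: geometric mean of clipped n-gram precisions x brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count at its count in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # real implementations smooth instead of returning 0
        log_precisions.append(math.log(clipped / total))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

The near-zero BLEU for mistralai/Voxtral-Mini-3B-2507 is the signature of output that shares almost no 4-grams with the references (e.g. answering instead of translating), which this clipped-precision formulation makes explicit.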
Tasks · AST — METEOR (%)

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR→EN · Multilingual_TEDx | FR→ES · Multilingual_TEDx | ES→FR · Multilingual_TEDx | ES→IT · Multilingual_TEDx |
|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.945 | 1.04 | 1.8 | 48.00 | 51.06 ±2.30 | 49.24 ±2.04 | 46.22 ±2.08 | 45.46 ±3.11 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.880 | 0.84 | 2.5 | 45.93 | 53.66 ±2.01 | 47.82 ±2.26 | 42.80 ±2.25 | 39.45 ±3.23 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.836 | 0.71 | 2.8 | 44.77 | 51.84 ±2.00 | 46.21 ±2.27 | 39.10 ±2.19 | 41.91 ±3.00 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.751 | 0.48 | 3.5 | 42.31 | 57.03 ±2.36 | 40.09 ±2.66 | 31.28 ±2.46 | 40.85 ±3.57 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.703 | 0.28 | 4.5 | 40.85 | 47.57 ±1.93 | 45.23 ±2.00 | 34.56 ±2.00 | 36.04 ±3.14 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.442 | -0.51 | 6.2 | 33.41 | 41.47 ±1.94 | 34.19 ±2.08 | 27.86 ±1.93 | 30.12 ±3.08 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.165 | -1.38 | 7.2 | 25.60 | 29.81 ±1.36 | 25.65 ±1.50 | 21.31 ±1.35 | 25.62 ±2.25 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.127 | -1.46 | 7.5 | 23.97 | 43.68 ±2.03 | 22.34 ±2.25 | 17.21 ±1.99 | 12.67 ±2.35 |

Tasks · QA

Math Question Answering — ACC (%)

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | EN · spoken-mqa_short_digit |
|---|---|---|---|---|---|---|
| Qwen/Qwen2.5-Omni-7B | 11B | 1.000 | 1.28 | 1.0 | 89.00 | 89.00 ±6.13 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.939 | 1.09 | 2.0 | 85.00 | 85.00 ±7.00 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.697 | 0.31 | 3.0 | 69.00 | 69.00 ±9.06 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.652 | 0.16 | 4.0 | 66.00 | 66.00 ±9.28 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.652 | 0.16 | 5.0 | 66.00 | 66.00 ±9.28 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.621 | 0.06 | 6.0 | 64.00 | 64.00 ±9.41 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.258 | -1.11 | 7.0 | 40.00 | 40.00 ±9.60 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.000 | -1.94 | 8.0 | 23.00 | 23.00 ±8.25 |
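The aggregate columns can be reproduced from the per-model averages: Min-Max is min-max normalization of each model's average, Z-Score standardizes against the population standard deviation across the eight models, and the ± margins on accuracy match 95% Wald intervals at n = 100 (the sample size is inferred from the margins, not stated on the page). A sketch:

```python
import math

def min_max(scores):
    """Min-max normalize to [0, 1]; best model -> 1.000, worst -> 0.000."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def z_scores(scores):
    """Standardize with the population (not sample) standard deviation."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / std for s in scores]

def wald_margin(acc_pct, n=100, z=1.96):
    """95% Wald half-width for an accuracy given as a percentage."""
    p = acc_pct / 100
    return 100 * z * math.sqrt(p * (1 - p) / n)

# Per-model averages from the Math Question Answering table, best to worst
accs = [89.0, 85.0, 69.0, 66.0, 66.0, 64.0, 40.0, 23.0]
```

For example, `min_max(accs)[1]` gives 0.939 and `z_scores(accs)[0]` gives 1.28, matching the Min-Max and Z-Score columns, and `wald_margin(66.0)` reproduces the ±9.28 margin.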
Question Answering — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · CohereLabs-Aya_collection | FR · Vigogne--Alpaca | FR · VoxPopuli-QA | EN · NationalSpeechCorpus_SQA | EN · OpenHermes_audio | EN · SLUE-P2-SQA5 | EN · SpokenWOZ_AIR-Bench | EN · alpaca_audio | EN · fisher_AIR-Bench | EN · public-sg-speech |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 1.000 | 1.69 | 1.0 | 70.19 | 48.39 ±0.41 | 57.08 ±0.10 | 77.64 ±0.08 | 74.00 ±0.07 | 77.40 ±0.20 | 88.28 ±0.09 | 77.20 ±0.15 | 78.20 ±0.19 | 81.10 ±0.11 | 79.20 ±0.06 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.692 | 0.65 | 3.5 | 64.56 | 48.39 ±0.33 | 52.56 ±0.09 | 78.56 ±0.10 | 59.70 ±0.09 | 67.40 ±0.25 | 86.13 ±0.12 | 65.18 ±0.20 | 70.40 ±0.25 | 69.80 ±0.17 | 66.36 ±0.08 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.593 | 0.32 | 3.0 | 63.12 | 42.58 ±0.38 | 51.08 ±0.10 | 67.68 ±0.10 | 63.05 ±0.09 | 68.40 ±0.19 | 90.10 ±0.09 | 71.30 ±0.19 | 69.40 ±0.22 | 73.70 ±0.16 | 71.32 ±0.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.502 | 0.01 | 4.0 | 61.54 | 35.48 ±0.32 | 41.76 ±0.09 | 79.24 ±0.10 | 63.75 ±0.09 | 61.80 ±0.27 | 91.76 ±0.08 | 74.72 ±0.17 | 53.40 ±0.29 | 76.90 ±0.15 | 74.16 ±0.08 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.473 | -0.09 | 5.0 | 61.05 | 39.35 ±0.39 | 48.84 ±0.10 | 65.60 ±0.11 | 64.95 ±0.09 | 64.80 ±0.18 | 88.97 ±0.10 | 67.56 ±0.20 | 68.60 ±0.23 | 71.10 ±0.17 | 69.92 ±0.08 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.393 | -0.36 | 6.5 | 59.60 | 34.84 ±0.42 | 43.28 ±0.10 | 74.84 ±0.10 | 57.45 ±0.08 | 62.60 ±0.27 | 87.94 ±0.11 | 68.91 ±0.19 | 63.80 ±0.28 | 72.00 ±0.16 | 64.76 ±0.08 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.335 | -0.55 | 5.0 | 58.31 | 42.58 ±0.31 | 48.20 ±0.10 | 72.56 ±0.11 | 54.95 ±0.09 | 57.00 ±0.22 | 77.35 ±0.15 | 57.62 ±0.20 | 60.60 ±0.26 | 63.90 ±0.18 | 63.80 ±0.08 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.000 | -1.68 | 8.0 | 52.74 | 37.42 ±0.32 | 38.12 ±0.09 | 58.36 ±0.10 | 63.60 ±0.08 | 48.80 ±0.27 | 78.48 ±0.13 | 69.22 ±0.17 | 24.00 ±0.41 | 71.70 ±0.13 | 70.08 ±0.07 |

Tasks · Others

Age Recognition — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · CommonVoice_Age | EN · CommonVoice_Age |
|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.950 | 2.02 | 1.5 | 23.60 | 15.00 ±0.03 | 32.20 ±0.04 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.596 | 0.80 | 3.0 | 19.50 | 3.20 ±0.02 | 35.80 ±0.04 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.199 | -0.17 | 3.0 | 4.00 | 5.00 ±0.02 | 3.00 ±0.01 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.165 | -0.26 | 5.0 | 2.90 | 4.80 ±0.02 | 1.00 ±0.01 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.129 | -0.39 | 5.0 | 2.50 | 3.60 ±0.02 | 1.40 ±0.01 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.079 | -0.56 | 5.5 | 2.00 | 1.80 ±0.01 | 2.20 ±0.01 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.061 | -0.64 | 5.5 | 2.40 | 0.40 ±0.01 | 4.40 ±0.02 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.000 | -0.80 | 7.5 | 0.20 | 0.40 ±0.01 | 0.00 ±0.00 |
Dialogue Summarization — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | EN · NationalSpeechCorpus_SDS |
|---|---|---|---|---|---|---|
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 1.000 | 1.91 | 1.0 | 66.67 | 66.67 ±0.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.590 | 0.36 | 2.0 | 56.40 | 56.40 ±0.09 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.556 | 0.23 | 3.0 | 55.53 | 55.53 ±0.09 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.548 | 0.20 | 4.0 | 55.33 | 55.33 ±0.08 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.543 | 0.18 | 5.0 | 55.20 | 55.20 ±0.09 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.412 | -0.31 | 6.0 | 51.93 | 51.93 ±0.08 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.311 | -0.70 | 7.0 | 49.40 | 49.40 ±0.09 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.000 | -1.87 | 8.0 | 41.60 | 41.60 ±0.09 |
Emotion Recognition — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · MELD_Emotion | EN · IEMOCAP-Emotion | EN · MELD_Emotion |
|---|---|---|---|---|---|---|---|---|
| nvidia/audio-flamingo-3-hf | 8.2B | 1.000 | 2.36 | 1.0 | 52.80 | 53.00 ±0.04 | 43.20 ±0.04 | 62.00 ±0.04 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.448 | 0.61 | 2.0 | 26.70 | 16.20 ±0.03 | 35.80 ±0.04 | 38.60 ±0.04 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.214 | -0.09 | 3.0 | 17.00 | 12.20 ±0.03 | 22.20 ±0.04 | 21.40 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.142 | -0.31 | 4.0 | 13.80 | 9.20 ±0.03 | 27.20 ±0.04 | 9.60 ±0.03 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.051 | -0.59 | 6.5 | 9.90 | 6.80 ±0.02 | 18.80 ±0.03 | 7.20 ±0.02 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.040 | -0.63 | 6.5 | 9.25 | 4.80 ±0.02 | 22.00 ±0.04 | 5.40 ±0.02 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.038 | -0.64 | 6.0 | 8.90 | 2.60 ±0.01 | 24.80 ±0.04 | 5.60 ±0.02 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.015 | -0.71 | 7.0 | 7.90 | 1.60 ±0.01 | 22.60 ±0.04 | 5.80 ±0.02 |
Gender Recognition — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · CommonVoice_Gender | EN · CommonVoice_Gender | EN · IEMOCAP-gender |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 1.000 | 1.77 | 1.0 | 81.10 | 67.00 ±0.04 | 92.40 ±0.02 | 98.00 ±0.01 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.824 | 1.20 | 2.0 | 68.95 | 48.40 ±0.04 | 93.00 ±0.02 | 86.00 ±0.03 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.499 | 0.23 | 3.0 | 44.15 | 31.00 ±0.04 | 58.40 ±0.04 | 56.20 ±0.04 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.347 | -0.20 | 5.0 | 31.85 | 27.40 ±0.04 | 40.40 ±0.04 | 32.20 ±0.04 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.344 | -0.20 | 4.5 | 31.35 | 29.20 ±0.04 | 17.60 ±0.03 | 49.40 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.216 | -0.56 | 6.0 | 20.75 | 28.20 ±0.04 | 26.20 ±0.04 | 0.40 ±0.01 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.102 | -0.97 | 6.5 | 14.25 | 6.40 ±0.02 | 8.80 ±0.02 | 35.40 ±0.04 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.000 | -1.27 | 8.0 | 6.35 | 1.60 ±0.01 | 9.20 ±0.03 | 13.00 ±0.03 |
Spoken Language Identification — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | OTHER · CoVost2_AIR-Bench |
|---|---|---|---|---|---|---|
| nvidia/audio-flamingo-3-hf | 8.2B | 1.000 | 1.37 | 1.0 | 93.00 | 93.00 ±0.02 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.912 | 1.13 | 2.0 | 88.40 | 88.40 ±0.03 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.748 | 0.68 | 3.0 | 79.80 | 79.80 ±0.04 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.668 | 0.46 | 4.0 | 75.60 | 75.60 ±0.04 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.431 | -0.19 | 5.0 | 63.20 | 63.20 ±0.04 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.195 | -0.83 | 6.0 | 50.80 | 50.80 ±0.04 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.042 | -1.25 | 7.0 | 42.80 | 42.80 ±0.04 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.000 | -1.37 | 8.0 | 40.60 | 40.60 ±0.04 |

Tasks · Music

Music Captioning — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | OTHER · MusicCaps |
|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 1.000 | 1.58 | 1.0 | 59.76 | 59.76 ±0.04 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.864 | 1.17 | 2.0 | 55.48 | 55.48 ±0.11 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.709 | 0.71 | 3.0 | 50.60 | 50.60 ±0.07 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.507 | 0.10 | 4.0 | 44.24 | 44.24 ±0.06 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.290 | -0.56 | 5.0 | 37.40 | 37.40 ±0.07 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.254 | -0.67 | 6.0 | 36.28 | 36.28 ±0.07 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.178 | -0.90 | 7.0 | 33.88 | 33.88 ±0.07 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.000 | -1.43 | 8.0 | 28.28 | 28.28 ±0.06 |
Music Question Answering — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | FR · MusicCaps-QA | EN · MTJ-Jamendo_AIR-Bench | EN · MusicCaps-QA |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.970 | 1.40 | 1.5 | 59.00 | 55.32 ±0.07 | 63.20 ±0.04 | 62.16 ±0.08 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.906 | 1.23 | 2.5 | 57.16 | 54.72 ±0.09 | 62.60 ±0.04 | 56.60 ±0.08 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.684 | 0.59 | 2.0 | 49.57 | 56.48 ±0.07 | 28.00 ±0.04 | 57.32 ±0.08 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.560 | 0.28 | 4.0 | 46.87 | 52.68 ±0.08 | 31.80 ±0.04 | 50.32 ±0.09 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.205 | -0.63 | 5.5 | 38.48 | 43.60 ±0.07 | 24.80 ±0.04 | 41.92 ±0.08 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.143 | -0.75 | 6.0 | 38.22 | 38.28 ±0.09 | 41.00 ±0.04 | 35.32 ±0.08 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.069 | -0.96 | 6.5 | 35.77 | 38.56 ±0.07 | 23.80 ±0.04 | 42.16 ±0.08 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.000 | -1.14 | 8.0 | 34.06 | 37.08 ±0.08 | 21.00 ±0.04 | 41.08 ±0.08 |

Tasks · Sound

Audio Captioning — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | OTHER · AudioCaps | OTHER · WavCaps |
|---|---|---|---|---|---|---|---|
| nvidia/audio-flamingo-3-hf | 8.2B | 1.000 | 1.22 | 1.0 | 57.76 | 64.00 ±0.06 | 51.52 ±0.10 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.914 | 0.99 | 2.0 | 54.68 | 56.60 ±0.09 | 52.76 ±0.10 |
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 0.887 | 0.91 | 3.0 | 53.70 | 54.36 ±0.09 | 53.04 ±0.10 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.711 | 0.44 | 4.0 | 47.34 | 48.40 ±0.08 | 46.28 ±0.09 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.476 | -0.20 | 5.0 | 38.90 | 38.56 ±0.07 | 39.24 ±0.07 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.393 | -0.43 | 6.0 | 35.88 | 37.48 ±0.09 | 34.28 ±0.09 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.018 | -1.44 | 7.0 | 22.38 | 21.24 ±0.03 | 23.52 ±0.05 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.000 | -1.49 | 8.0 | 21.74 | 21.04 ±0.03 | 22.44 ±0.05 |
Audio Question Answering — FLOW_JUDGE

| Model | Size | Min-Max | Z-Score | Avg Rank | Average | EN · AudioCaps-QA | EN · Clotho-AQA | EN · WavCaps-QA |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 8.4B | 1.000 | 0.92 | 1.0 | 60.99 | 63.26 ±0.15 | 57.88 ±0.14 | 61.84 ±0.16 |
| nvidia/audio-flamingo-3-hf | 8.2B | 0.928 | 0.74 | 2.0 | 59.73 | 63.71 ±0.11 | 52.64 ±0.13 | 62.83 ±0.14 |
| Qwen/Qwen2.5-Omni-7B | 11B | 0.897 | 0.66 | 3.0 | 59.19 | 58.08 ±0.16 | 65.92 ±0.14 | 53.55 ±0.17 |
| microsoft/Phi-4-multimodal-instruct | 5.6B | 0.888 | 0.63 | 4.0 | 59.02 | 56.10 ±0.15 | 63.72 ±0.15 | 57.24 ±0.17 |
| mistralai/Voxtral-Mini-3B-2507 | 4.68B | 0.869 | 0.59 | 5.0 | 58.69 | 60.13 ±0.12 | 58.72 ±0.11 | 57.24 ±0.14 |
| Qwen/Qwen2.5-Omni-3B | 5.9B | 0.542 | -0.25 | 6.0 | 52.93 | 53.42 ±0.15 | 58.28 ±0.13 | 47.11 ±0.16 |
| LINAGORA/Canary-Qwen3-1.7B_data-v1_8h | 2.5B | 0.005 | -1.63 | 7.0 | 43.49 | 38.34 ±0.14 | 52.72 ±0.15 | 39.41 ±0.15 |
| LINAGORA/Canary-Qwen3-4B_data-v1_8h | 4.8B | 0.000 | -1.64 | 8.0 | 43.41 | 39.68 ±0.14 | 49.56 ±0.15 | 40.99 ±0.15 |

Languages · French

French — Models × Tasks
Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %): Average, CommonVoice, Fleurs, Multilingual_TEDx, SUMM-RE, VoxPopuli, YouTubeFr | AST (METEOR %): Average, Multilingual_TEDx (FR→EN), Multilingual_TEDx (FR→ES) | Question Answering (FLOW_JUDGE): Average, CohereLabs-Aya_collection, Vigogne--Alpaca, VoxPopuli-QA | Music Question Answering (FLOW_JUDGE) | Emotion Recognition (FLOW_JUDGE) | Gender Recognition (FLOW_JUDGE) | Age Recognition (FLOW_JUDGE)
Qwen/Qwen2-Audio-7B-Instruct 8.4B 0.781 0.94 2.6 33.56 ±3.68 18.92 ±2.45 27.90 ±2.52 26.52 ±15.79 39.09 ±3.37 37.35 ±4.38 51.58 ±13.46 50.15 ±1.54 51.06 ±2.30 49.24 ±2.04 50.99 ±0.09 34.84 ±0.42 43.28 ±0.10 74.84 ±0.10 55.32 ±0.07 16.20 ±0.03 67.00 ±0.04 15.00 ±0.03
microsoft/Phi-4-multimodal-instruct 5.6B 0.561 0.23 3.6 43.81 ±16.23 26.18 ±5.77 21.29 ±3.57 55.24 ±41.81 63.68 ±72.63 27.56 ±7.56 68.93 ±47.91 48.56 ±1.85 57.03 ±2.36 40.09 ±2.66 52.16 ±0.09 35.48 ±0.32 41.76 ±0.09 79.24 ±0.10 52.68 ±0.08 12.20 ±0.03 31.00 ±0.04 5.00 ±0.02
LINAGORA/Canary-Qwen3-4B_data-v1_8h 4.8B 0.544 0.24 3.9 21.70 ±13.80 10.65 ±1.84 10.15 ±1.40 22.07 ±55.31 42.83 ±55.71 13.37 ±24.17 31.16 ±8.81 46.40 ±1.39 47.57 ±1.93 45.23 ±2.00 59.84 ±0.08 48.39 ±0.33 52.56 ±0.09 78.56 ±0.10 43.60 ±0.07 6.80 ±0.02 29.20 ±0.04 3.60 ±0.02
nvidia/audio-flamingo-3-hf 8.2B 0.507 0.09 4.6 73.74 ±9.09 48.31 ±7.41 57.50 ±6.22 76.10 ±46.43 73.64 ±9.60 94.59 ±6.21 92.27 ±23.11 33.01 ±1.65 43.68 ±2.03 22.34 ±2.25 44.63 ±0.07 37.42 ±0.32 38.12 ±0.09 58.36 ±0.10 54.72 ±0.09 53.00 ±0.04 48.40 ±0.04 3.20 ±0.02
LINAGORA/Canary-Qwen3-1.7B_data-v1_8h 2.5B 0.381 -0.24 4.9 20.80 ±1.78 13.01 ±1.94 11.92 ±1.34 18.06 ±8.22 32.69 ±4.36 17.08 ±1.58 32.04 ±3.48 37.83 ±1.44 41.47 ±1.94 34.19 ±2.08 54.45 ±0.08 42.58 ±0.31 48.20 ±0.10 72.56 ±0.11 38.56 ±0.07 4.80 ±0.02 27.40 ±0.04 1.80 ±0.01
mistralai/Voxtral-Mini-3B-2507 4.68B 0.365 -0.35 5.0 127.95 ±36.03 71.61 ±29.42 75.16 ±3.59 185.12 ±176.73 201.50 ±52.98 103.83 ±38.14 130.46 ±98.40 27.73 ±1.02 29.81 ±1.36 25.65 ±1.50 61.04 ±0.07 48.39 ±0.41 57.08 ±0.10 77.64 ±0.08 56.48 ±0.07 9.20 ±0.03 28.20 ±0.04 0.40 ±0.01
Qwen/Qwen2.5-Omni-7B 11B 0.353 -0.37 5.6 44.41 ±9.42 25.86 ±17.61 12.16 ±2.18 36.79 ±35.85 82.09 ±20.51 31.48 ±7.55 78.10 ±31.14 50.74 ±1.52 53.66 ±2.01 47.82 ±2.26 53.78 ±0.07 42.58 ±0.38 51.08 ±0.10 67.68 ±0.10 38.28 ±0.09 1.60 ±0.01 6.40 ±0.02 0.40 ±0.01
Qwen/Qwen2.5-Omni-3B 5.9B 0.297 -0.54 6.0 82.32 ±12.68 97.54 ±33.55 28.89 ±7.09 93.08 ±41.98 128.57 ±25.32 47.17 ±14.03 98.67 ±41.83 49.03 ±1.52 51.84 ±2.00 46.21 ±2.27 51.26 ±0.08 39.35 ±0.39 48.84 ±0.10 65.60 ±0.11 37.08 ±0.08 2.60 ±0.01 1.60 ±0.01 4.80 ±0.02

Languages · English

English — Models × Tasks
Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %): Average, CommonVoice, Fleurs, VoxPopuli | Question Answering (FLOW_JUDGE): Average, NationalSpeechCorpus_SQA, OpenHermes_audio, SLUE-P2-SQA5, SpokenWOZ_AIR-Bench, alpaca_audio, fisher_AIR-Bench, public-sg-speech | Music Question Answering (FLOW_JUDGE): Average, MTJ-Jamendo_AIR-Bench, MusicCaps-QA | Emotion Recognition (FLOW_JUDGE): Average, IEMOCAP-Emotion, MELD_Emotion | Gender Recognition (FLOW_JUDGE): Average, CommonVoice_Gender, IEMOCAP-gender | Age Recognition (FLOW_JUDGE) | Audio Question Answering (FLOW_JUDGE): Average, AudioCaps-QA, Clotho-AQA, WavCaps-QA | Dialogue Summarization (FLOW_JUDGE) | Math Question Answering (ACC %)
Qwen/Qwen2-Audio-7B-Instruct 8.4B 0.772 0.82 3.0 11.00 ±1.15 10.30 ±1.65 7.16 ±1.22 15.54 ±2.72 68.21 ±0.05 57.45 ±0.08 62.60 ±0.27 87.94 ±0.11 68.91 ±0.19 63.80 ±0.28 72.00 ±0.16 64.76 ±0.08 62.68 ±0.09 63.20 ±0.04 62.16 ±0.08 37.20 ±0.03 35.80 ±0.04 38.60 ±0.04 95.20 ±0.01 92.40 ±0.02 98.00 ±0.01 32.20 ±0.04 60.99 ±0.09 63.26 ±0.15 57.88 ±0.14 61.84 ±0.16 51.93 ±0.08 66.00 ±9.28
nvidia/audio-flamingo-3-hf 8.2B 0.766 0.79 3.3 12.58 ±8.94 13.22 ±2.87 11.98 ±1.63 12.53 ±26.61 60.84 ±0.06 63.60 ±0.08 48.80 ±0.27 78.48 ±0.13 69.22 ±0.17 24.00 ±0.41 71.70 ±0.13 70.08 ±0.07 59.60 ±0.08 62.60 ±0.04 56.60 ±0.08 52.60 ±0.03 43.20 ±0.04 62.00 ±0.04 89.50 ±0.02 93.00 ±0.02 86.00 ±0.03 35.80 ±0.04 59.73 ±0.08 63.71 ±0.11 52.64 ±0.13 62.83 ±0.14 55.53 ±0.09 64.00 ±9.41
mistralai/Voxtral-Mini-3B-2507 4.68B 0.469 0.11 3.9 81.49 ±20.97 54.35 ±37.70 73.73 ±14.57 116.40 ±47.54 79.34 ±0.04 74.00 ±0.07 77.40 ±0.20 88.28 ±0.09 77.20 ±0.15 78.20 ±0.19 81.10 ±0.11 79.20 ±0.06 42.66 ±0.09 28.00 ±0.04 57.32 ±0.08 18.40 ±0.02 27.20 ±0.04 9.60 ±0.03 13.30 ±0.02 26.20 ±0.04 0.40 ±0.01 4.40 ±0.02 58.69 ±0.07 60.13 ±0.12 58.72 ±0.11 57.24 ±0.14 66.67 ±0.08 69.00 ±9.06
microsoft/Phi-4-multimodal-instruct 5.6B 0.465 -0.03 3.7 10.29 ±1.80 12.42 ±4.75 6.36 ±1.06 12.09 ±2.27 70.93 ±0.05 63.75 ±0.09 61.80 ±0.27 91.76 ±0.08 74.72 ±0.17 53.40 ±0.29 76.90 ±0.15 74.16 ±0.08 41.06 ±0.08 31.80 ±0.04 50.32 ±0.09 21.80 ±0.03 22.20 ±0.04 21.40 ±0.04 57.30 ±0.03 58.40 ±0.04 56.20 ±0.04 3.00 ±0.01 59.02 ±0.09 56.10 ±0.15 63.72 ±0.15 57.24 ±0.17 56.40 ±0.09 23.00 ±8.25
Qwen/Qwen2.5-Omni-7B 11B 0.446 -0.02 4.6 41.77 ±10.57 75.07 ±20.40 20.60 ±3.87 29.65 ±23.53 72.47 ±0.05 63.05 ±0.09 68.40 ±0.19 90.10 ±0.09 71.30 ±0.19 69.40 ±0.22 73.70 ±0.16 71.32 ±0.08 38.16 ±0.06 41.00 ±0.04 35.32 ±0.08 14.20 ±0.02 22.60 ±0.04 5.80 ±0.02 22.10 ±0.03 8.80 ±0.02 35.40 ±0.04 0.00 ±0.00 59.19 ±0.09 58.08 ±0.16 65.92 ±0.14 53.55 ±0.17 55.33 ±0.08 89.00 ±6.13
Qwen/Qwen2.5-Omni-3B 5.9B 0.314 -0.38 5.8 68.47 ±13.86 120.36 ±27.27 29.73 ±7.38 55.32 ±29.51 70.84 ±0.05 64.95 ±0.09 64.80 ±0.18 88.97 ±0.10 67.56 ±0.20 68.60 ±0.23 71.10 ±0.17 69.92 ±0.08 31.04 ±0.07 21.00 ±0.04 41.08 ±0.08 15.20 ±0.02 24.80 ±0.04 5.60 ±0.02 11.10 ±0.02 9.20 ±0.03 13.00 ±0.03 1.00 ±0.01 52.93 ±0.08 53.42 ±0.15 58.28 ±0.13 47.11 ±0.16 55.20 ±0.09 85.00 ±7.00
LINAGORA/Canary-Qwen3-4B_data-v1_8h 4.8B 0.311 -0.44 5.6 9.43 ±1.71 10.89 ±1.54 7.49 ±0.98 9.92 ±4.78 69.28 ±0.05 59.70 ±0.09 67.40 ±0.25 86.13 ±0.12 65.18 ±0.20 70.40 ±0.25 69.80 ±0.17 66.36 ±0.08 33.36 ±0.07 24.80 ±0.04 41.92 ±0.08 13.00 ±0.02 18.80 ±0.03 7.20 ±0.02 33.50 ±0.03 17.60 ±0.03 49.40 ±0.04 1.40 ±0.01 43.41 ±0.09 39.68 ±0.14 49.56 ±0.15 40.99 ±0.15 49.40 ±0.09 66.00 ±9.28
LINAGORA/Canary-Qwen3-1.7B_data-v1_8h 2.5B 0.193 -0.85 6.2 12.09 ±0.95 12.26 ±1.89 11.05 ±1.12 12.97 ±1.80 62.17 ±0.06 54.95 ±0.09 57.00 ±0.22 77.35 ±0.15 57.62 ±0.20 60.60 ±0.26 63.90 ±0.18 63.80 ±0.08 32.98 ±0.07 23.80 ±0.04 42.16 ±0.08 13.70 ±0.02 22.00 ±0.04 5.40 ±0.02 36.30 ±0.03 40.40 ±0.04 32.20 ±0.04 2.20 ±0.01 43.49 ±0.09 38.34 ±0.14 52.72 ±0.15 39.41 ±0.15 41.60 ±0.09 40.00 ±9.60

Languages · Others

Others — Models × Tasks
Model | Size | Min-Max | Z-Score | Avg Rank | ASR (WER %): Average, Fleurs, Multilingual_TEDx | AST (METEOR %): Average, Multilingual_TEDx (ES→FR), Multilingual_TEDx (ES→IT) | Audio Captioning (FLOW_JUDGE): Average, AudioCaps, WavCaps | Music Captioning (FLOW_JUDGE) | Spoken Language Identification (FLOW_JUDGE)
Qwen/Qwen2-Audio-7B-Instruct 8.4B 0.950 1.15 1.8 24.43 ±2.19 14.99 ±2.14 23.31 ±5.43 45.84 ±1.73 46.22 ±2.08 45.46 ±3.11 53.70 ±0.06 54.36 ±0.09 53.04 ±0.10 59.76 ±0.04 88.40 ±0.03
Qwen/Qwen2.5-Omni-7B 11B 0.737 0.50 3.0 32.39 ±8.72 15.68 ±3.65 54.62 ±32.25 41.12 ±1.85 42.80 ±2.25 39.45 ±3.23 54.68 ±0.07 56.60 ±0.09 52.76 ±0.10 37.40 ±0.07 79.80 ±0.04
nvidia/audio-flamingo-3-hf 8.2B 0.652 0.18 3.8 91.53 ±8.97 96.60 ±9.93 80.80 ±43.95 14.94 ±1.55 17.21 ±1.99 12.67 ±2.35 57.76 ±0.06 64.00 ±0.06 51.52 ±0.10 55.48 ±0.11 93.00 ±0.02
microsoft/Phi-4-multimodal-instruct 5.6B 0.582 0.07 4.6 41.02 ±5.26 29.88 ±9.76 34.03 ±8.53 36.07 ±2.05 31.28 ±2.46 40.85 ±3.57 47.34 ±0.06 48.40 ±0.08 46.28 ±0.09 44.24 ±0.06 50.80 ±0.04
Qwen/Qwen2.5-Omni-3B 5.9B 0.522 -0.10 5.2 54.28 ±10.89 22.88 ±5.74 82.28 ±48.63 40.50 ±1.77 39.10 ±2.19 41.91 ±3.00 35.88 ±0.06 37.48 ±0.09 34.28 ±0.09 36.28 ±0.07 63.20 ±0.04
mistralai/Voxtral-Mini-3B-2507 4.68B 0.426 -0.45 5.4 139.30 ±27.78 96.91 ±16.09 173.54 ±93.23 23.46 ±1.18 21.31 ±1.35 25.62 ±2.25 38.90 ±0.05 38.56 ±0.07 39.24 ±0.07 50.60 ±0.07 75.60 ±0.04
LINAGORA/Canary-Qwen3-4B_data-v1_8h 4.8B 0.371 -0.51 5.6 18.72 ±11.42 12.14 ±2.25 26.31 ±14.69 35.30 ±1.69 34.56 ±2.00 36.04 ±3.14 22.38 ±0.03 21.24 ±0.03 23.52 ±0.05 33.88 ±0.07 40.60 ±0.04
LINAGORA/Canary-Qwen3-1.7B_data-v1_8h 2.5B 0.267 -0.84 6.6 38.06 ±3.23 58.79 ±3.43 101.99 ±11.44 28.99 ±1.64 27.86 ±1.93 30.12 ±3.08 21.74 ±0.03 21.04 ±0.03 22.44 ±0.05 28.28 ±0.06 42.80 ±0.04