Improving the CNVD Severity Classifier: Honest Metrics and Data Leakage Fixes

We recently made significant improvements to our CNVD severity classifier and the underlying Vulnerability-CNVD dataset, prompted by a thorough independent review from Eric Romang. These changes ship in VulnTrain v3.0.0, released today.

What happened

Eric opened VulnTrain#19 with a detailed technical analysis of the dataset and model. His key findings:

  • Data leakage: CNVD reuses boilerplate descriptions across different vulnerability IDs. Our train/test split was done on IDs, not on description text, so 15.6% of the test set contained descriptions identical to training data. This inflated the reported accuracy by ~1.7pp.
  • Low-class recall at 38.4%: 60% of Low-severity entries were misclassified as Medium. The dataset is heavily imbalanced (Low ~9%, Medium ~55%, High ~36%).
  • Keyword dependency: the model predicts severity based on vulnerability-type keywords rather than actual impact. Accuracy drops from ~89% to ~55% on entries whose severity deviates from the type’s typical level.
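A keyword-driven baseline of the kind this finding implies can be sketched as follows. The keyword table and function name here are purely illustrative assumptions, not the heuristic used in Eric's analysis:

```python
# Hypothetical keyword-to-severity table; illustrative only.
# The real model learns these associations implicitly, which is
# exactly why it fails when severity deviates from the type's norm.
KEYWORD_SEVERITY = {
    "sql injection": "High",
    "buffer overflow": "High",
    "cross-site scripting": "Medium",
    "weak password": "Low",
}

def keyword_baseline(description: str) -> str:
    """Predict severity from vulnerability-type keywords alone,
    falling back to the majority class (Medium)."""
    text = description.lower()
    for keyword, severity in KEYWORD_SEVERITY.items():
        if keyword in text:
            return severity
    return "Medium"
```

A classifier that behaves like this table will score well on typical entries and collapse on atypical ones, which matches the observed ~89% to ~55% accuracy drop.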

His full analysis, code, and data are available at eromang/researches/CNVD-Dataset-Validation.

What we fixed

Data leakage

We implemented a deduplicate_split function that groups entries by description text before splitting. All entries sharing a description land in the same split. The result: our retrained model scores 76.8% accuracy on the deduplicated test set, matching Eric’s independently measured unleaked accuracy of 76.6%. The model quality was always ~77% — we just have honest metrics now.
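A minimal sketch of such a group-wise split follows. Only the grouping-by-description behaviour comes from the release; the internals (shuffling, target-size logic) are assumptions for illustration:

```python
import random
from collections import defaultdict

def deduplicate_split(entries, test_fraction=0.2, seed=42):
    """Split entries so that all entries sharing a description land
    on the same side of the split, preventing description-level leakage."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["description"]].append(entry)

    group_keys = list(groups)
    random.Random(seed).shuffle(group_keys)

    train, test = [], []
    target = test_fraction * len(entries)
    for key in group_keys:
        # Fill the test split group by group until it reaches its target size.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test
```

Splitting on whole description groups rather than IDs is what removes the 15.6% of test entries that were previously identical to training data.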

Class imbalance experiments

We tested four loss strategies to improve Low-class recall:

| Strategy | Low recall | Medium recall | Overall acc. |
| --- | --- | --- | --- |
| Uniform (baseline) | 41.0% | 81.7% | 76.8% |
| Sqrt-dampened weights | 49.0% | 74.8% | 74.6% |
| Balanced weights | 60.8% | 70.2% | 73.2% |
| Focal loss (gamma=2) | 63.3% | 64.4% | 71.1% |

Every strategy that improved Low recall caused disproportionate Medium recall loss. The Low/Medium vocabulary overlap in CNVD descriptions makes this a data-level ceiling, not a loss-function problem. Eric’s own experience with the CyberScale Phase 1 project — predicting 4-class CVSS bands from CVE descriptions using ModernBERT-base — reached the same conclusion: nothing moved the needle beyond ~2pp. Adjacent severity classes share vocabulary because vulnerability descriptions are formulaic.

We defaulted to uniform loss and documented the Low class limitation.
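The weighting schemes in the table above can be sketched in isolation. This is an illustrative stand-alone version, not the actual training code; it uses the class distribution quoted earlier (Low ~9%, Medium ~55%, High ~36%):

```python
import math

def class_weights(counts, mode="uniform"):
    """Per-class loss weights from raw class counts.
    'balanced' is inverse frequency; 'sqrt' dampens that correction."""
    total, n = sum(counts.values()), len(counts)
    if mode == "balanced":
        return {c: total / (n * k) for c, k in counts.items()}
    if mode == "sqrt":
        return {c: math.sqrt(total / (n * k)) for c, k in counts.items()}
    return {c: 1.0 for c in counts}

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: the (1 - p_t)**gamma factor
    down-weights confident predictions, shifting gradient toward
    hard (often minority-class) examples."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

All of these reshape the loss, not the data, which is why none of them could break through the vocabulary-overlap ceiling between Low and Medium.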

Dataset improvements

The Vulnerability-CNVD dataset now includes:

  • A cve_id field cross-referencing CVE equivalents. Approximately 81% of CNVD entries have a corresponding CVE (68-69% in 2020-2021, rising to 91-97% after 2022). The ~19% CNVD-only entries are concentrated in Chinese domestic software (PHP CMS, ERP systems). Western vendors (Adobe, Microsoft, IBM, Cisco) are largely absent from the CNVD-only subset.
  • A dataset card documenting severity distribution, CVE overlap rates, and the coverage decline: CNVD published details for 94% of reserved IDs in 2015 but only 4% in 2023. This drop coincides with China’s Regulations on the Management of Security Vulnerabilities (RMSV), effective September 2021.
  • A warning about duplicate descriptions and the need to split on description text rather than IDs.
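As a sketch of how the new cve_id field can be used, the per-year CVE overlap rate can be recomputed directly from the records. The cve_id field is documented above; the cnvd_id field name and its CNVD-YYYY-NNNNN year format are assumptions made for this example:

```python
from collections import defaultdict

def cve_overlap_by_year(rows):
    """Fraction of entries per year carrying a cve_id cross-reference.
    Assumes IDs of the form CNVD-YYYY-NNNNN."""
    totals = defaultdict(int)
    with_cve = defaultdict(int)
    for row in rows:
        year = row["cnvd_id"].split("-")[1]
        totals[year] += 1
        if row.get("cve_id"):
            with_cve[year] += 1
    return {year: with_cve[year] / totals[year] for year in totals}
```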

The RMSV effect

The RMSV regulations deserve attention. Before September 2021, CNVD published vulnerability details for most of the IDs it reserved. After the regulations took effect, publication rates dropped sharply. As a result, the CNVD dataset is increasingly sparse for recent years and the model’s training data is concentrated in pre-2022 entries. Users should be aware of this temporal bias.

CNVD reserves 50,000–100,000 vulnerability IDs per year but publishes full details for only a fraction. As noted above, the publication rate has declined significantly:

  • 2015: ~94% of reserved IDs have published details
  • 2023: ~4% of reserved IDs have published details

Model card

The model card is now dynamically generated from actual training metrics and documents the known limitations: Low-class recall, keyword dependency, negation blindness, and CVE overlap.

Acknowledgments

Thanks to Eric Romang for his detailed and constructive analysis. His work directly led to these improvements and confirmed that the model adds real value (+12pp over a keyword heuristic baseline) despite its limitations.

Funding

AIPITCH aims to create advanced AI-based tools supporting key operational services in cyber defense, including early threat detection, automatic malware classification, and improved analytical processes through the integration of Large Language Models (LLMs). The project has the potential to set new standards in the cybersecurity industry.

The project leader is NASK National Research Institute, working with an international consortium.

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Cybersecurity Competence Centre. Neither the European Union nor the European Cybersecurity Competence Centre can be held responsible for them.