Mitigating Distribution Shift in Android Malware Classification

Abstract

Graph-based malware classifiers achieve high accuracy on standard benchmarks but degrade significantly under distribution shift, when new malware variants emerge. Our research shows that existing structural features fail to capture the deeper semantic patterns needed for robust generalization.

We introduce two new benchmarks, MalNet-Tiny-Common and MalNet-Tiny-Distinct, designed to evaluate performance under realistic covariate and domain shifts. We propose a semantic enrichment framework that augments Function Call Graphs (FCGs) with function-level metadata and code embeddings derived from Large Language Models (LLMs).

Our experiments demonstrate that this approach improves classification performance by up to 14.2% under distribution shift and enhances the robustness of adaptation-based methods.

Architecture & Method

Our framework integrates semantic signals from function metadata and LLM-based embeddings into the graph structure. See the README.md files for detailed information. An overview of the proposed pipeline is shown below.

Dataset

Curation

The dataset curation process involved several stages to ensure high-quality, semantically enriched benchmarks:

Sample Selection: We selected malware samples from the MalNet database, focusing on families that represent shifts in distribution with respect to the original MalNet-Tiny benchmark:

MalNet-Tiny-Common: Malware samples from the same families but of different malware types, simulating covariate shift.
MalNet-Tiny-Distinct: Malware samples from completely different families, simulating a more drastic domain shift.

APK Acquisition: Raw binaries (.apk files) were retrieved from AndroZoo using SHA256 hashes provided by MalNet.
Graph Extraction: For each APK, we extracted the FCGs using static analysis tools, identifying all function nodes and their call relationships.
Semantic Enrichment: Each node in the FCG was augmented with:
- Function Metadata: Function names, access modifiers, return types, and parameters.
- LLM Embeddings: High-dimensional vector representations of the function's bytecode/source code, computed with state-of-the-art LLMs (e.g., CodeXEmbed, Qwen 3).
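The enrichment step above can be sketched roughly as follows. This is an illustrative sketch, not the repository's implementation: `FunctionInfo`, `toy_embed`, and `enrich_nodes` are hypothetical names, and `toy_embed` is a deterministic stand-in for a real LLM embedding call so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FunctionInfo:
    """Hypothetical record for one FCG node before enrichment."""
    name: str
    access: str            # access modifier, e.g. "public"
    return_type: str
    params: List[str]
    code: str              # bytecode or source text of the function


def toy_embed(code: str, dim: int = 8) -> List[float]:
    """Stand-in for an LLM embedding: a normalized byte-bucket histogram."""
    vec = [0.0] * dim
    for b in code.encode("utf-8"):
        vec[b % dim] += 1.0
    total = sum(vec) or 1.0  # avoid division by zero for empty code
    return [v / total for v in vec]


def enrich_nodes(
    functions: List[FunctionInfo],
    embed: Callable[[str], List[float]] = toy_embed,
) -> Dict[str, dict]:
    """Attach metadata and an embedding vector to every function node."""
    enriched = {}
    for fn in functions:
        enriched[fn.name] = {
            "access": fn.access,
            "return_type": fn.return_type,
            "params": list(fn.params),
            "embedding": embed(fn.code),
        }
    return enriched


# Usage: two nodes of a tiny call graph; the call edges stay unchanged.
functions = [
    FunctionInfo("onCreate", "public", "void", ["Bundle"], "invoke sendSms"),
    FunctionInfo("sendSms", "private", "boolean", ["String"], "return true"),
]
edges = [("onCreate", "sendSms")]
node_attrs = enrich_nodes(functions)
```

In practice the `embed` argument would be replaced by a call to the embedding model; the graph topology itself is left as extracted.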

Download

The pre-processed attributed graphs (including all semantic embeddings) will be released as downloadable files, so researchers can train and evaluate models without running the expensive LLM inference step themselves. Because the dataset is large, it is split across multiple platforms; download every part from the links below and merge them in the datasets directory.

You can download the datasets from the following links:
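Merging the downloaded parts could look like the sketch below. It assumes each download has already been extracted into its own directory with the same relative layout; `merge_parts` and the directory names in the usage note are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path
from typing import Iterable, Union


def merge_parts(part_dirs: Iterable[Union[str, Path]], target: Union[str, Path]) -> int:
    """Copy every file from each part directory into target.

    Relative paths are preserved and existing files are never overwritten.
    Returns the number of files copied.
    """
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for part in part_dirs:
        part = Path(part)
        for src in sorted(part.rglob("*")):
            if not src.is_file():
                continue
            dst = target / src.relative_to(part)
            if dst.exists():
                continue  # keep the copy that is already in place
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied += 1
    return copied
```

For example, `merge_parts(["downloads/part1", "downloads/part2"], "datasets")` would gather both parts under `datasets/`.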

Dataset Construction

We provide code to construct the dataset from any set of APKs, which keeps the release compliant with AndroZoo's redistribution policies.

1. Setup

Install the necessary dependencies:

pip install -r requirements.txt

2. LLM Embedding Server

Start the inference server to handle code embedding requests (supports various backends):

# Example: Start server on port 8080
python llm_inference_server_cxe.py --port 8080

3. Graph Construction

Run the construction script to process APKs into attributed graphs:

python create_graph.py --apk_dir ./path/to/apks --n_jobs 8 --port 8080
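Since graph construction sends requests to the embedding server, it can help to confirm that the server is accepting connections before launching the script. The helper below is a minimal sketch under the assumption that the server listens on a plain TCP port on localhost; `wait_for_port` is a hypothetical name and not part of the repository.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Retries every `interval` seconds and returns False after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("127.0.0.1", 8080)` before `create_graph.py` avoids failing on the first embedding request while the model is still loading.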

The splits directory contains the definitions for MalNet-Tiny-Common and MalNet-Tiny-Distinct.

Training & Evaluation

The training directory contains the Exphormer-based model implementation and evaluation scripts.

Data Structure

Organize your datasets as follows in the datasets/ folder:

datasets/
├── [dataset_name]/
│   ├── raw/
│   │   ├── malnet-graph-tiny/  (Graph structures)
│   │   └── split_info_tiny/    (Train/Val/Test splits)
│   └── processed/              (Generated automatically)

If you download the dataset from our repository to the processed directory, you can skip the graph construction step.
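A quick way to check the layout before training is sketched below. The required subdirectories are taken from the tree above; `missing_dirs` is a hypothetical helper, and `processed/` is deliberately not checked because it is generated automatically.

```python
from pathlib import Path
from typing import List, Union

# Subdirectories expected under datasets/[dataset_name]/, per the layout above.
REQUIRED = ("raw/malnet-graph-tiny", "raw/split_info_tiny")


def missing_dirs(root: Union[str, Path], dataset_name: str) -> List[str]:
    """Return the required subdirectories that do not exist yet."""
    base = Path(root) / dataset_name
    return [rel for rel in REQUIRED if not (base / rel).is_dir()]
```

An empty list from `missing_dirs("datasets", dataset_name)` means the raw inputs are in place.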

Running Experiments

To reproduce the results, use the provided configuration files:

python main.py --cfg config_file.yaml

Citation

@misc{tran2025mitigatingdistributionshiftgraphbased,
      title={Mitigating Distribution Shift in Graph-Based Android Malware Classification via Function Metadata and LLM Embeddings},
      author={Ngoc N. Tran and Anwar Said and Waseem Abbas and Tyler Derr and Xenofon D. Koutsoukos},
      year={2025},
      eprint={2508.06734},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2508.06734},
}