This repository contains the code and data for the paper “Hypergraph-Based High-Order Correlation Analysis for Large-Scale Long-Tailed Data.”
- HSMOTE: lightweight dual encoder–decoder (`models.HSMOTE`) trained with BCE reconstruction and negative sampling on node–hyperedge incidence.
- Tail oversampling via synthesis: interpolation in embedding space plus Bernoulli edge-membership sampling over class-aware candidate sets (`synthesize_vertices_bernoulli`), with Top-M per-class popular edges, candidate capping, and class-specific degree targets.
- Global structure aggregation: build a row-normalized Top-k PPR matrix on the item–user bipartite graph, keep the item→item mass, and fine-tune `encoder_node` + `LinearHead` with fixed `P`.
- Inference-time diffusion: 3-step lightweight diffusion (`utils.diffuse_3steps`) to smooth representations before classification (see the sketch after this list).
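To make the last bullet concrete, here is a minimal sketch of a 3-step diffusion in the style of `utils.diffuse_3steps`, assuming a dense representation matrix `Z` and a sparse row-normalized `P_hat`; the signature and mixing rule are illustrative assumptions, not the repo's exact API.

```python
import torch

def diffuse_3steps(Z, P_hat, alpha=0.5, steps=3):
    """Sketch: blend each node's representation with its PPR neighborhood.

    Z:     [N, d] dense node representations.
    P_hat: [N, N] sparse row-normalized Top-k PPR matrix.
    alpha: mixing weight (assumed to correspond to `infer_alpha`).
    """
    for _ in range(steps):
        Z = (1.0 - alpha) * Z + alpha * torch.sparse.mm(P_hat, Z)
    return Z
```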
```
repo/
├── config.py       # Hyperparameters & paths (edit to your data locations)
├── main.py         # End-to-end: load → hypergraph → HSMOTE → synth → PPR → finetune → eval
├── models.py       # HSMOTE and LinearHead
├── load_data.py    # Loading/alignment, subgraph builder, BCE w/ negative sampling (sketched below), helpers
├── utils.py        # Training utils, metrics, PPR builders, diffusion, memmap, etc.
├── ppr.py          # Personalized PageRank (PPR) computation
├── data/           # Raw features/labels/user JSON
├── cache/          # Runtime caches
└── README.md
```
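The `load_data.py` entry above mentions BCE with negative sampling; the sketch below shows one plausible form of that objective, assuming a hypothetical incidence scorer `scores_fn(v, e)` that returns logits for (node, hyperedge) pairs. It is not the repo's actual implementation.

```python
import torch
import torch.nn.functional as F

def bce_with_negative_sampling(scores_fn, pos_v, pos_e, n_nodes, n_edges,
                               neg_per_pos=5):
    """Sketch: BCE over observed node-hyperedge incidences (label 1) plus
    uniformly sampled pairs treated as negatives (label 0). A real
    implementation would likely mask accidental positives."""
    n_neg = pos_v.numel() * neg_per_pos
    neg_v = torch.randint(0, n_nodes, (n_neg,))
    neg_e = torch.randint(0, n_edges, (n_neg,))
    logits = torch.cat([scores_fn(pos_v, pos_e), scores_fn(neg_v, neg_e)])
    labels = torch.cat([torch.ones(pos_v.numel()), torch.zeros(n_neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```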
`config.py` points to your files (update paths as needed):

```python
# Example fields - adapt to your setup
data_dir  = "Datasets/subset"
asins_pkl = "asins_**.pkl"
feat_pkl  = "features_**.pkl"
lab_pkl   = "labels_**.pkl"
user_json = "user_products_1000.json"
```

Expected formats:
- `features_**.pkl`: `{'features': float32 [N, d], 'asins': List[str]}` (see the loading sketch after this list)
- `labels_**.pkl`: `{'labels': int64 [N]}`
- `asins_**.pkl`: several schemas are supported; `load_idx2node_from_asins_pkl` auto-parses and aligns them
- `user_products_1000.json`: user–item interactions, e.g.:

```json
[
  {"user": "u1", "items": [{"prefix": "P", "asin": "A1"}, {"prefix": "P", "asin": "A2"}]},
  {"user": "u2", "items": [{"asin": "B3"}]}
]
```
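A quick sanity check for these files is to load them and verify alignment; the sketch below assumes only the key names listed above and is not taken from `load_data.py`.

```python
import pickle

# Keep your actual filenames; the wildcards below mirror the config fields.
with open("features_**.pkl", "rb") as f:
    feats = pickle.load(f)
X, asins = feats["features"], feats["asins"]   # float32 [N, d], List[str]

with open("labels_**.pkl", "rb") as f:
    y = pickle.load(f)["labels"]               # int64 [N]

assert X.shape[0] == len(asins) == len(y), "features/asins/labels misaligned"
print(f"N={X.shape[0]}, d={X.shape[1]}, classes={len(set(y.tolist()))}")
```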
Hypergraph construction & cache
- Datasets download: https://drive.google.com/file/d/1v8nXKoIrd7bmfGZyW6N3_WT0dZUASDrw/view?usp=sharing
- `load_hypergraph` builds CSR for item→user (`offsets_i`, `indices_users`) and user→item (`offsets_u`, `indices_items`); a sketch of this layout follows the list.
- A stable cache key (derived from `kept_asin2row`) is used to write `.csr-<hash>.npz`, `.users.json`, and `.meta.json` for fast reloads.
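For reference, below is a hedged sketch of how the dual CSR layout above can be built from the user JSON; `build_bipartite_csr` and `asin2row` are illustrative names, and the actual construction (and caching) in `load_data.py` may differ.

```python
import json
import numpy as np

def build_bipartite_csr(user_json_path, asin2row):
    """Sketch: build user->item and item->user CSR arrays from the
    interaction JSON. asin2row maps ASIN strings to item row indices."""
    with open(user_json_path) as f:
        users = json.load(f)

    item_lists = []
    for u in users:
        rows = sorted({asin2row[it["asin"]] for it in u["items"]
                       if it["asin"] in asin2row})
        if rows:
            item_lists.append(np.asarray(rows, dtype=np.int64))

    # user -> item CSR (offsets_u, indices_items)
    offsets_u = np.zeros(len(item_lists) + 1, dtype=np.int64)
    offsets_u[1:] = np.cumsum([len(r) for r in item_lists])
    indices_items = (np.concatenate(item_lists) if item_lists
                     else np.zeros(0, dtype=np.int64))

    # item -> user CSR (offsets_i, indices_users) via a counting-sort transpose
    n_items = len(asin2row)
    deg_i = np.bincount(indices_items, minlength=n_items)
    offsets_i = np.zeros(n_items + 1, dtype=np.int64)
    offsets_i[1:] = np.cumsum(deg_i)
    indices_users = np.empty(int(offsets_i[-1]), dtype=np.int64)
    cursor = offsets_i[:-1].copy()
    for u_idx, rows in enumerate(item_lists):
        for r in rows:
            indices_users[cursor[r]] = u_idx
            cursor[r] += 1

    return offsets_i, indices_users, offsets_u, indices_items
```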
Run:

```bash
python main.py
```

Pipeline in `main.py`:

- Load & align: `load_and_align` aligns features/labels to `idx2node` → `X`, `y`.
- Build hypergraph: `load_hypergraph` → CSR `H`; compute full edge features `Xe_full`.
- Split: `stratified_split_ratio` per class.
- Train HSMOTE: `train_hsmote_on_tails` on tail classes via batch subgraphs (BCE + MSE).
- Synthesize: `compute_ze_full_memmap` encodes edge embeddings `Ze` (memmap); `synthesize_vertices_bernoulli` creates new tail nodes and merges them into the dataset.
- Build PPR: `build_P_items_topk_bipartite` on the bipartite graph, keep the item→item rows, re-normalize → `P_hat` (see the PPR sketch after this list).
- Fine-tune & evaluate: `finetune_with_P_fixed` (fixed `P_hat`) on train/val; inference uses the 3-step diffusion.
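The PPR step admits a compact reference form. The sketch below approximates row-normalized Top-k PPR with a truncated power series over the item→user→item walk; it is an assumption about what `build_P_items_topk_bipartite` computes (the push-based code in `ppr.py` is presumably more scalable), and `topk_ppr_items` with its helpers are illustrative names.

```python
import numpy as np
import scipy.sparse as sp

def _normalize_rows(M):
    deg = np.asarray(M.sum(axis=1)).ravel()
    deg[deg == 0.0] = 1.0
    return sp.diags(1.0 / deg) @ M

def _keep_topk_rows(P, k):
    """Zero all but the k largest entries per row (ties may keep extras)."""
    P = P.tocsr()
    for i in range(P.shape[0]):
        s, e = P.indptr[i], P.indptr[i + 1]
        if e - s > k:
            row = P.data[s:e]
            cut = np.partition(row, -k)[-k]
            row[row < cut] = 0.0
    P.eliminate_zeros()
    return P

def topk_ppr_items(B, alpha=0.15, topk=32, n_iter=10):
    """B: (n_items x n_users) sparse incidence. Returns P_hat, a
    row-normalized Top-k approximation of item->item PPR."""
    B = sp.csr_matrix(B, dtype=np.float64)
    T = _normalize_rows(B) @ _normalize_rows(B.T.tocsr())  # item->item step
    # PPR = alpha * sum_t (1 - alpha)^t * T^t, truncated at n_iter terms.
    walk = sp.identity(T.shape[0], format="csr")
    P = alpha * walk
    coef = alpha
    for _ in range(n_iter):
        walk = (walk @ T).tocsr()
        coef *= (1.0 - alpha)
        P = P + coef * walk
    return _normalize_rows(_keep_topk_rows(P, topk)).tocsr()
```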
Console logs include `[HSMOTE]`, `[PPR]`, `[FT]` (fine-tuning), and final `[TEST]` metrics.
Adjust in `config.py`:

- Model: `d_in`, `d_hid`, `d_embed`
- HSMOTE pretraining: `lr_hsmote`, `wd_hsmote`, `beta_mse`, `neg_per_pos`, `hsmote_batches`, `per_class_train`, `log_every`
- Tail threshold: `tail_threshold` (relative to the largest class; see the sketch after this list)
- Synthesis: `target_tail_ratio`, `topM`, `cand_cap_per_syn`, `max_edges_per_syn`, `ze_score_chunk`, `block_new_nodes`
- PPR: `ppr_alpha`, `ppr_eps`, `ppr_topk`, `ppr_chunk`
- Fine-tuning: `lr_enc`, `wd_enc`, `lr_head`, `wd_head`, `ft_epochs`, `precls_patience`, `bs_seeds`
- Inference: `infer_alpha`, `infer_steps`, `encode_bs`
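As a small illustration of the `tail_threshold` semantics above, a class can be treated as tail when its count falls below the threshold times the largest class count; `tail_classes` is an illustrative helper, not the repo's.

```python
import numpy as np

def tail_classes(y, tail_threshold):
    """Return class ids whose count is below tail_threshold * max count."""
    counts = np.bincount(y)
    return np.flatnonzero(counts < tail_threshold * counts.max())
```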
- Maintainer: Xiangmin Han / Tsinghua University
- Email: [email protected]