AsymmeTree: Interactive Asymmetric Decision Trees for Business-Ready Imbalanced Classification

Authored by Aoxue Chen

AsymmeTree is an interactive decision tree classifier specifically designed for highly imbalanced datasets. Unlike traditional decision trees that optimize for node purity, AsymmeTree focuses on maximizing precision while capturing sufficient recall, making it ideal for fraud detection, anomaly detection, and other rare event prediction tasks.

🚀 Key Features

  • Imbalanced-Optimized Algorithm: Novel splitting strategy where left child = "positive node" (higher positive ratio), right child = "neutral node" (lower positive ratio)
  • F-Score Based Optimization: Optimizes splits using F-score of the positive node for better precision-recall balance
  • Multiple Splitting Criteria: Supports both imbalanced-focused (f_score) and traditional purity-based approaches (information gain, information gain ratio, information value)
  • Hybrid Interaction Modes: Interactive, automatic, or hybrid modes allowing domain expertise integration
  • Business-Aligned Design: Built with real-world business constraints and interpretability in mind
  • Comprehensive Feature Support: Handles categorical and numerical features with custom operator constraints
  • Programmable Metrics: Extensible metrics system for custom evaluation criteria

📦 Installation

# Clone the repository
git clone https://github.com/azurechen97/asymmetree.git

# Install asymmetree
pip install -e asymmetree/

🔧 Quick Start

Basic Usage

import pandas as pd
from asymmetree import AsymmeTree

# Load your imbalanced dataset
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv')['target']

# Initialize AsymmeTree
tree = AsymmeTree(
    max_depth=5,
    sorted_by='f_score',  # Optimized for imbalanced data
    node_min_recall=0.05,
    leaf_min_precision=0.15
)

# Fit the model
tree.fit(X, y, auto=True)  # Automatic mode

# Make predictions
predictions = tree.predict(X)

# View performance
tree.performance()

# Export to SQL for business rules
sql_rules = tree.to_sql()
print(sql_rules)

Interactive Mode

# Interactive tree building with domain expertise
tree = AsymmeTree(max_depth=5)
tree.import_data(X, y)

# Start interactive splitting
tree.split(id=0)  # Split root node interactively

# Continue building specific branches
tree.continue_fit(id=2)  # Continue from node 2

# Quick split with custom condition
tree.quick_split(id=4, sql="age >= 25", overwrite=True)

Advanced Configuration

# Configure for specific use case
tree = AsymmeTree(
    max_depth=7,
    max_cat_unique=20, # Maximum unique values for categorical features
    cat_value_min_recall=0.005,  # Minimum recall for categorical values
    num_bin=25, # Number of bins for numerical feature discretization
    node_max_precision=0.3, # Maximum precision threshold for node splitting
    node_min_recall=0.05, # Minimum recall threshold for nodes
    leaf_min_precision=0.15, # Minimum precision threshold for leaf nodes
    feature_shown_num=5, # Number of features to show in interactive mode
    condition_shown_num=5, # Number of conditions to show in interactive mode
    sorted_by='f_score', # Metric to sort splits by ('f_score', 'ig', 'igr', 'iv')
    pos_weight=1, # Weight for positive class in calculations
    beta=1,  # F-beta score parameter
    knot=1,  # Precision scaling threshold
    factor=1,  # Scaling factor for precision above knot
    ignore_null=True, # Whether to ignore null values
    show_metrics=False, # Whether to show metrics in tree display
    verbose=False # Whether to print verbose output
)

tree.import_data(
    X,
    y,
    weights, # Sample weights
    cat_features=['category', 'region'], # List of categorical feature names
    lt_only_features=['age'],  # Age can only use < operator
    gt_only_features=['income'],  # Income can only use > operator
    pinned_features=['risk_score'],  # Prioritize this feature
    extra_metrics=extra_metrics, # Additional metrics to calculate
    extra_metrics_data=extra_metrics_data, # Data for extra metrics
    total_pos=total_pos, # Total positive samples for recall calculation
)

🎯 Use Cases

AsymmeTree excels in scenarios with significant class imbalance (e.g., less than 5% positive class):

Fraud Detection

# Fraud detection with high precision requirements
fraud_tree = AsymmeTree(
    sorted_by='f_score',
    node_max_precision=0.4,  # Stop splitting at 40% precision
    leaf_min_precision=0.2,  # Require 20% precision for positive prediction
    pos_weight=10  # Weight positive cases more heavily
)

Medical Diagnosis

# Medical diagnosis with interpretable rules
# Patient outcome data aligned with feature matrix
patient_data = pd.DataFrame({
    'cost': patient_costs,
    'severity_score': severity_scores
}, index=X.index)

medical_tree = AsymmeTree(
    max_depth=4,  # Keep rules simple
    show_metrics=True,
    verbose=True
)

medical_tree.fit(
    X,
    y,
    extra_metrics={
        'avg_cost': lambda x: x['cost'].mean(),
        'avg_severity': lambda x: x['severity_score'].mean()
    },
    extra_metrics_data=patient_data
)

Quality Control

# Manufacturing defect detection
qc_tree = AsymmeTree(
    sorted_by='igr',  # Use information gain ratio
    num_bin=20,  # Fine-grained numerical splits
    ignore_null=False  # Include null as category
)

qc_tree.fit(X, y, cat_features=['machine_id', 'shift', 'operator'])

📊 Performance Metrics

AsymmeTree provides comprehensive metrics for imbalanced classification:

# Built-in metrics
metrics = tree.metrics()
print(f"Precision: {metrics['Precision']:.2%}")
print(f"Recall: {metrics['Recall']:.2%}")
print(f"Positives Captured: {metrics['Positives']:.0f}")

# Custom metrics (requires extra_metrics_data with same index as X)
def business_value(data):
    return data['revenue'].sum() - data['cost'].sum()

def avg_transaction_size(data):
    return data['amount'].mean()

# extra_metrics_data must have same index as X
extra_data = pd.DataFrame({
    'revenue': [...],  # Revenue values for each sample
    'cost': [...],     # Cost values for each sample  
    'amount': [...]    # Transaction amounts
}, index=X.index)

tree = AsymmeTree()
tree.import_data(
    X, y, 
    extra_metrics={
        'business_value': business_value,
        'avg_transaction': avg_transaction_size
    },
    extra_metrics_data=extra_data
)

πŸ” Tree Visualization and Export

# Print tree structure
tree.print(show_metrics=True)

# Export to SQL for deployment
sql_where_clause = tree.to_sql()

# Save/load models
tree.save('fraud_model.json')
new_tree = AsymmeTree()
new_tree.load('fraud_model.json')

# Export to dictionary
tree_dict = tree.to_dict()

# Toggle leaf node predictions manually
tree.toggle_prediction(id=5)  # Toggle prediction for node 5

# Relabel nodes based on thresholds
tree.relabel(min_precision=0.2, min_recall=0.05)

# Prune tree to remove redundant splits
tree.prune()

# Clear children of a specific node
tree.clear_children(id=3)

🧠 Algorithm Details

Imbalanced Splitting Strategy

  1. Positive Node Assignment: Left child always gets higher positive ratio samples
  2. Neutral Node Assignment: Right child gets lower positive ratio samples
  3. F-Score Optimization: Maximizes the F-score of the positive node for a better precision-recall balance (see the sketch after this list)
  4. Adaptive Thresholds: Uses configurable precision/recall thresholds to control splitting
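
To make step 3 concrete, here is a minimal sketch of how a candidate split could be scored by the F-beta score of its positive (left) child. It is a didactic reimplementation for intuition, not AsymmeTree's internal code; the beta argument mirrors the constructor parameter of the same name.

# Illustrative F-beta scoring of a candidate split's positive node
# (not AsymmeTree's internal implementation)
def f_beta_of_positive_node(pos_in_node, n_in_node, total_pos, beta=1.0):
    """Score the left ("positive") child of a candidate split.

    pos_in_node: positive samples routed to the positive node
    n_in_node:   total samples routed to the positive node
    total_pos:   positive samples in the whole dataset (recall denominator)
    """
    if n_in_node == 0 or total_pos == 0:
        return 0.0
    precision = pos_in_node / n_in_node
    recall = pos_in_node / total_pos
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A positive node capturing 40 of 100 total positives with 200 samples:
# precision = 0.20, recall = 0.40, F1 ≈ 0.267
print(f_beta_of_positive_node(40, 200, 100))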

Supported Split Criteria

  • f_score: F-score based (default for imbalanced data)
  • ig: Information Gain
  • igr: Information Gain Ratio
  • iv: Information Value (Weight of Evidence)
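
For reference, the sketch below spells out the standard textbook formulas behind two of the purity-based criteria listed above, information gain and information value, for a binary split. It is illustrative only and not the library's code.

import numpy as np

# Textbook formulas for two purity-based split criteria (illustrative only)
def entropy(pos, n):
    """Binary Shannon entropy of a node with pos positives out of n samples."""
    if n == 0 or pos in (0, n):
        return 0.0
    p = pos / n
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(pos, n, pos_left, n_left):
    """Entropy reduction from splitting a (pos, n) node into two children."""
    pos_right, n_right = pos - pos_left, n - n_left
    children = (n_left / n) * entropy(pos_left, n_left) \
             + (n_right / n) * entropy(pos_right, n_right)
    return entropy(pos, n) - children

def information_value(pos, n, pos_left, n_left, eps=1e-9):
    """Weight-of-Evidence based information value summed over the two children."""
    neg = n - pos
    iv = 0.0
    for p_i, n_i in ((pos_left, n_left), (pos - pos_left, n - n_left)):
        pct_pos = max(p_i / pos, eps)
        pct_neg = max((n_i - p_i) / neg, eps)
        iv += (pct_pos - pct_neg) * np.log(pct_pos / pct_neg)
    return iv

# 100 positives in 1000 samples; the left child gets 40 positives of 200:
print(information_gain(100, 1000, 40, 200))   # ≈ 0.017 bits
print(information_value(100, 1000, 40, 200))  # ≈ 0.25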

πŸ› οΈ API Reference

Core Classes

  • AsymmeTree: Main classifier class
  • Node: Individual tree node with split conditions and metrics

Key Methods

Model Training

  • fit(X, y, auto=False): Train the model
  • import_data(X, y, ...): Import training data with configuration
  • continue_fit(id, auto=False): Continue building from specific node

Interactive Splitting

  • split(id, auto=False): Interactive node splitting
  • quick_split(id, sql, overwrite=False): Quick split with SQL condition

Prediction and Evaluation

  • predict(X): Generate predictions
  • performance(): Display model metrics
  • metrics(): Return metrics dictionary

Tree Manipulation

  • toggle_prediction(id): Toggle leaf node prediction
  • relabel(min_precision, min_recall): Relabel nodes based on thresholds
  • prune(): Remove redundant splits
  • clear_children(id): Remove all children of a node

Model Export/Import

  • to_sql(): Export as SQL WHERE clause
  • to_dict(): Export as dictionary
  • to_json(): Export as JSON string
  • save(file_path): Save model to file
  • load(file_path): Load model from file
  • from_dict(nodes): Load from dictionary
  • from_json(json_str): Load from JSON string
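
A quick round trip through the JSON and dictionary methods listed above, assuming from_json and from_dict accept the output of to_json and to_dict respectively:

# Serialize a fitted tree and restore it into fresh instances
# (assumes from_json/from_dict accept the output of to_json/to_dict)
json_str = tree.to_json()
restored = AsymmeTree()
restored.from_json(json_str)

restored_from_dict = AsymmeTree()
restored_from_dict.from_dict(tree.to_dict())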

Visualization

  • print(show_metrics=False): Display tree structure

Configuration Parameters

  • max_depth (int): Maximum tree depth (default: 5)
  • max_cat_unique (int): Maximum unique values for categorical features (default: 50)
  • cat_value_min_recall (float): Minimum recall threshold for categorical values (default: 0.005)
  • num_bin (int): Number of bins for numerical discretization (default: 25)
  • node_max_precision (float): Maximum precision threshold for splitting (default: 0.3)
  • node_min_recall (float): Minimum recall threshold for nodes (default: 0.05)
  • leaf_min_precision (float): Minimum precision for positive leaf prediction (default: 0.15)
  • feature_shown_num (int): Number of features shown in interactive mode (default: 5)
  • condition_shown_num (int): Number of conditions shown in interactive mode (default: 5)
  • sorted_by (str): Split criterion - 'f_score', 'ig', 'igr', 'iv' (default: 'f_score')
  • pos_weight (float): Weight for positive class in calculations (default: 1)
  • beta (float): Beta parameter for F-beta score (default: 1)
  • knot (float): Threshold for precision scaling (default: 1)
  • factor (float): Scaling factor for precision above knot (default: 1)
  • ignore_null (bool): Whether to ignore null values (default: True)
  • show_metrics (bool): Whether to show metrics in tree display (default: False)
  • verbose (bool): Whether to print verbose output (default: False)

Feature Constraints

  • cat_features (list): Categorical feature names
  • lt_only_features (list): Features restricted to '<' operators
  • gt_only_features (list): Features restricted to '>' operators
  • pinned_features (list): Features to prioritize in splitting

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use AsymmeTree in your research, please cite:

@software{asymmetree,
  author = {Aoxue Chen},
  title = {AsymmeTree: Interactive Asymmetric Decision Trees for Business-Ready Imbalanced Classification},
  url = {https://github.com/azurechen97/asymmetree},
  year = {2025}
}

πŸ† Acknowledgments

  • Thanks to the scikit-learn team for inspiration
  • Built on NumPy and Pandas, with performance in mind
