diff --git a/samples/Sagemaker-to-Snowflake/Computer_Vision/0007.jpg b/samples/Sagemaker-to-Snowflake/Computer_Vision/0007.jpg new file mode 100644 index 00000000..5505e697 Binary files /dev/null and b/samples/Sagemaker-to-Snowflake/Computer_Vision/0007.jpg differ diff --git a/samples/Sagemaker-to-Snowflake/Computer_Vision/0009.jpg b/samples/Sagemaker-to-Snowflake/Computer_Vision/0009.jpg new file mode 100644 index 00000000..1a3e8fe1 Binary files /dev/null and b/samples/Sagemaker-to-Snowflake/Computer_Vision/0009.jpg differ diff --git a/samples/Sagemaker-to-Snowflake/Computer_Vision/0016.jpg b/samples/Sagemaker-to-Snowflake/Computer_Vision/0016.jpg new file mode 100644 index 00000000..4a7fc55a Binary files /dev/null and b/samples/Sagemaker-to-Snowflake/Computer_Vision/0016.jpg differ diff --git a/samples/Sagemaker-to-Snowflake/Computer_Vision/Readme.md b/samples/Sagemaker-to-Snowflake/Computer_Vision/Readme.md new file mode 100644 index 00000000..d981b14a --- /dev/null +++ b/samples/Sagemaker-to-Snowflake/Computer_Vision/Readme.md @@ -0,0 +1,62 @@ +# CIFAR-10 Computer Vision Model Training and Evaluation + +This repository contains a Jupyter notebook that implements a Convolutional Neural Network (CNN) for image classification using the **CIFAR-10 dataset**. It provides a comprehensive machine learning pipeline, from data loading and preprocessing to model training, evaluation, and prediction. + +----- + +## 🚀 Getting Started + +To use this notebook, you'll need to install the necessary libraries: + +```bash +pip install tensorflow scikit-learn +``` + +These commands will install **TensorFlow**, the deep learning framework used for building and training the CNN, and **Scikit-learn**, which is used for calculating evaluation metrics. + +----- + +## 📋 Key Components + +The notebook is structured into several key sections to guide you through the process: + +### 1\. Data Preprocessing + +The code loads the CIFAR-10 dataset, which consists of 60,000 32x32 color images across 10 classes. The pixel values of these images are then normalized from the range `[0, 255]` to `[0, 1]` to enhance training performance. The dataset uses sparse categorical labels, which are integers from 0 to 9, instead of one-hot encoding. + +### 2\. CNN Architecture + +The model is a sequential CNN with the following layers: + + * **Input Layer**: Accepts images of size 32x32 with 3 color channels (RGB). + * **Convolutional Layers**: Two `Conv2D` layers (32 and 64 filters) with ReLU activation, designed to extract features from the images. + * **Pooling Layers**: Two `MaxPooling2D` layers that reduce the spatial dimensions of the feature maps, helping to decrease computational complexity and prevent overfitting. + * **Dense Layers**: Fully connected layers that interpret the features and output a final prediction using a 10-class softmax activation function. + +### 3\. Optimizer Comparison + +The notebook includes three optimizer options with different performance characteristics: + + * **Adam**: Offers the fastest convergence, typically achieving an accuracy of 60-65% in 20-30 epochs. It is also the least sensitive to the learning rate. + * **RMSprop**: Has medium convergence speed, reaching 60-65% accuracy in about 30 epochs. + * **SGD**: Has the slowest convergence, requiring more than 50 epochs to achieve an accuracy of 65-70%. It is very sensitive to the learning rate. 
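+
+As a quick reference, here is a minimal sketch of the model and training setup described above. It mirrors the `TRAIN_MODEL` cell in the accompanying notebook (two Conv2D/MaxPooling2D blocks, a 64-unit dense layer, and Adam at the notebook's default learning rate of 0.001); treat it as an illustration rather than a drop-in replacement for the notebook code:
+
+```python
+from tensorflow import keras
+
+# CNN for 32x32 RGB CIFAR-10 images, ending in a 10-class softmax
+model = keras.Sequential([
+    keras.layers.Input(shape=(32, 32, 3)),
+    keras.layers.Conv2D(32, (3, 3), activation="relu"),
+    keras.layers.MaxPooling2D((2, 2)),
+    keras.layers.Conv2D(64, (3, 3), activation="relu"),
+    keras.layers.MaxPooling2D((2, 2)),
+    keras.layers.Flatten(),
+    keras.layers.Dense(64, activation="relu"),
+    keras.layers.Dense(10, activation="softmax"),
+])
+
+# Labels are sparse integers (0-9), so sparse_categorical_crossentropy is used
+model.compile(
+    optimizer=keras.optimizers.Adam(learning_rate=0.001),
+    loss="sparse_categorical_crossentropy",
+    metrics=["accuracy"],
+)
+```
+
+With this setup, `model.fit(x_train, y_train, epochs=..., batch_size=..., validation_data=(x_test, y_test))` reproduces the training loop described in the Model Evaluation and Prediction section below.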
+ +----- + +## 📊 Model Evaluation and Prediction + +### Training and Saving + +The model is compiled with `sparse_categorical_crossentropy` loss and trained with validation on a test set to monitor for overfitting. After training, the model is saved to the `/saved_model/cifar10_model.keras` file. + +### Comprehensive Evaluation + +After training, the model is evaluated using a variety of metrics provided by Scikit-learn, including: + + * **Accuracy**: Measures the overall correctness of the model's predictions. + * **Precision, Recall, and F1-Score**: These metrics are calculated with weighted averaging to provide a comprehensive view of performance across all classes. + * **Confusion Matrix**: Provides a detailed breakdown of the classification results, showing which classes are being confused with others. + +### Image Prediction + +The notebook also includes a function to load the trained model and make predictions on new, single images. This process requires a crucial preprocessing step: any new image must be resized to 32x32 pixels to match the dimensions the model was trained on. This ensures consistency and compatibility. diff --git a/samples/Sagemaker-to-Snowflake/Computer_Vision/V1_ADMIN_SAGE_VIS_SNOW.ipynb b/samples/Sagemaker-to-Snowflake/Computer_Vision/V1_ADMIN_SAGE_VIS_SNOW.ipynb new file mode 100644 index 00000000..65852bed --- /dev/null +++ b/samples/Sagemaker-to-Snowflake/Computer_Vision/V1_ADMIN_SAGE_VIS_SNOW.ipynb @@ -0,0 +1,438 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "c695373e-ac74-4b62-a1f1-08206cbd5c81", + "metadata": { + "language": "python", + "name": "cell3" + }, + "outputs": [], + "source": [ + "!pip install tensorflow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a1b67244-3b3f-4f30-b770-745a2b10cb0a", + "metadata": { + "language": "python", + "name": "cell4" + }, + "outputs": [], + "source": [ + "!pip install sklearn" + ] + }, + { + "cell_type": "markdown", + "id": "3fa1318d-0e3a-4fce-88c9-bb791b8c5b52", + "metadata": { + "collapsed": false, + "name": "cell21" + }, + "source": [ + "# CIFAR-10 Computer Vision Model Training and Evaluation\n", + "\n", + "This notebook implements a Convolutional Neural Network (CNN) for image classification using the CIFAR-10 dataset. The code demonstrates a complete machine learning pipeline from data loading to model evaluation.\n", + "\n", + "## Key Components:\n", + "\n", + "### 1. **Library Imports**\n", + "- **TensorFlow/Keras**: Deep learning framework for building and training the CNN\n", + "- **NumPy**: Numerical operations and array handling\n", + "- **Scikit-learn**: Evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix)\n", + "- **OS**: File system operations for saving the trained model\n", + "- **Logging**: Training progress and debugging information\n", + "\n", + "### 2. **Data Preprocessing**\n", + "- Loads CIFAR-10 dataset (60,000 32x32 color images in 10 classes)\n", + "- Normalizes pixel values from [0,255] to [0,1] range for better training performance\n", + "- Uses sparse categorical labels (integers 0-9) instead of one-hot encoding\n", + "\n", + "### 3. **CNN Architecture**\n", + "- **Input Layer**: 32x32x3 (RGB images)\n", + "- **Convolutional Layers**: Two Conv2D layers (32 and 64 filters) with ReLU activation\n", + "- **Pooling Layers**: MaxPooling2D for dimensionality reduction\n", + "- **Dense Layers**: Fully connected layers ending with 10-class softmax output\n", + "\n", + "### 4. 
**Optimizer Comparison**\n", + "The code includes three optimizer options with performance characteristics:\n", + "- **SGD**: Slower convergence, needs 50+ epochs, achieves ~65-70% accuracy\n", + "- **RMSprop**: Medium convergence, ~30 epochs, achieves ~60-65% accuracy \n", + "- **Adam**: Fast convergence, ~20-30 epochs, achieves ~60-65% accuracy\n", + "\n", + "### 5. **Training Process**\n", + "- Compiles model with sparse categorical crossentropy loss\n", + "- Trains with validation on test set to monitor overfitting\n", + "- Saves trained model to `/saved_model/cifar10_model.keras`\n", + "\n", + "### 6. **Model Evaluation**\n", + "Comprehensive evaluation using multiple metrics:\n", + "- **Accuracy**: Overall classification correctness\n", + "- **Precision**: True positives / (True positives + False positives)\n", + "- **Recall**: True positives / (True positives + False negatives)\n", + "- **F1-Score**: Harmonic mean of precision and recall\n", + "- **Confusion Matrix**: Detailed breakdown of classification results\n", + "\n", + "### 7. **Key Technical Details**\n", + "- Uses `np.argmax()` to convert probability predictions to class labels\n", + "- Handles label format conversion (flattening for sparse labels)\n", + "- Implements weighted averaging for multi-class metrics\n", + "- Includes TensorBoard logging for hyperparameter visualization\n", + "\n", + "This implementation provides a solid foundation for image classification tasks and demonstrates best practices for CNN training and evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1dc3905-7f5e-4e6e-b2c9-3916566a2da4", + "metadata": { + "language": "python", + "name": "TRAIN_MODEL" + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "import os # Add this import since you're using os.makedirs\n", + "import numpy as np\n", + "import sklearn.metrics\n", + "#from tensorflow.keras.layers import Input\n", + "#from tensorflow.keras.models import Sequential\n", + "from tensorflow.keras.optimizers import SGD, Adam, RMSprop\n", + "import logging\n", + "\n", + "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix\n", + "\n", + "logging.getLogger().setLevel(logging.INFO)\n", + "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)\n", + "\n", + "METRIC_ACCURACY = \"accuracy\"\n", + "validation = 'validation'\n", + "logging.getLogger().setLevel(logging.INFO)\n", + "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)\n", + "\n", + "\n", + "def train_cifar_model(epochs, batch_size, optimizer):\n", + " \"\"\"\n", + " This function contains the logic from your keras_cifar10.py script.\n", + " \"\"\"\n", + " # 1. Load data\n", + " (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()\n", + "\n", + " # Normalize pixel values\n", + " x_train = x_train.astype(\"float32\") / 255.0\n", + " x_test = x_test.astype(\"float32\") / 255.0\n", + "\n", + " # 2. 
Define the Keras model - Use Input layer instead of input_shape\n", + " model = keras.Sequential([\n", + " keras.layers.Input(shape=(32, 32, 3)), # Add Input layer first\n", + " keras.layers.Conv2D(32, (3, 3), activation='relu'), # Remove input_shape parameter\n", + " keras.layers.MaxPooling2D((2, 2)),\n", + " keras.layers.Conv2D(64, (3, 3), activation='relu'),\n", + " keras.layers.MaxPooling2D((2, 2)),\n", + " keras.layers.Flatten(),\n", + " keras.layers.Dense(64, activation='relu'),\n", + " keras.layers.Dense(10, activation='softmax')\n", + " ])\n", + "\n", + " # Typical results for CIFAR-10:\n", + " ##SGD: Slower convergence, may need 50+ epochs, final accuracy ~65-70%\n", + " ##RMSprop: Medium convergence, ~30 epochs, final accuracy ~60-65% \n", + " ##Adam: Fast convergence, ~20-30 epochs, final accuracy ~60-65%\n", + " ##Learning Rate Sensitivity:\n", + " ##SGD: Very sensitive - wrong LR can break training\n", + " ##RMSprop: Moderately sensitive - usually works with default\n", + " ##Adam: Least sensitive - 0.001 works for most problems\n", + "\n", + "\n", + " learning_rate = 0.001\n", + "\n", + " \n", + " loss_param=\"categorical_crossentropy\"\n", + " loss_param='sparse_categorical_crossentropy'\n", + "\n", + " #optimizer = \"sgd\"\n", + " #optimizer = \"rmsprop\"\n", + " optimizer = \"adam\"\n", + " #opt = None\n", + " if optimizer == \"sgd\":\n", + " opt = SGD(learning_rate=learning_rate)\n", + " elif optimizer == \"rmsprop\":\n", + " opt = RMSprop(learning_rate=learning_rate)\n", + " elif optimizer == \"adam\":\n", + " opt = Adam(learning_rate=learning_rate)\n", + " else:\n", + " raise Exception(\"Unknown optimizer\", optimizer)\n", + "\n", + " # 3. Compile the model\n", + " model.compile(#optimizer=optimizer,\n", + " loss=loss_param,\n", + " metrics=['accuracy'],\n", + " optimizer=opt\n", + " )\n", + "\n", + " # 4. Train the model\n", + " history = model.fit(x_train, y_train,\n", + " epochs=epochs,\n", + " batch_size=batch_size,\n", + " validation_data=(x_test, y_test))\n", + " print(f\"Model History: {history}\")\n", + "\n", + " # 5. 
Save the trained model\n",
+    "    os.makedirs('/saved_model', exist_ok=True)\n",
+    "    model_save_path = '/saved_model/cifar10_model.keras'\n",
+    "    model.save(model_save_path)\n",
+    "    print(f\"Model saved to {model_save_path}\")\n",
+    "\n",
+    "    # -- model.summary()\n",
+    "\n",
+    "    # The CIFAR-10 test split is already a NumPy array\n",
+    "    test_x = x_test\n",
+    "    test_y = y_test\n",
+    "\n",
+    "    # Use the model to predict the labels\n",
+    "    test_predictions = model.predict(test_x)\n",
+    "    test_y_pred = np.argmax(test_predictions, axis=1)\n",
+    "    # Labels are sparse integers of shape (n, 1), so flatten them;\n",
+    "    # applying np.argmax to the labels would be incorrect here.\n",
+    "    test_y_true = test_y.flatten()\n",
+    "\n",
+    "    # Evaluating model accuracy and logging it as a scalar for TensorBoard hyperparameter visualization.\n",
+    "    accuracy = sklearn.metrics.accuracy_score(test_y_true, test_y_pred)\n",
+    "    tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)\n",
+    "    #logging.info(\"Test accuracy:{}\".format(accuracy))\n",
+    "    print(\"Test accuracy: {}\".format(accuracy))\n",
+    "\n",
+    "    y_true = test_y_true\n",
+    "    y_pred = test_y_pred\n",
+    "\n",
+    "    # Calculate metrics\n",
+    "    metrics = {\n",
+    "        \"accuracy\": accuracy_score(y_true, y_pred),\n",
+    "        \"precision\": precision_score(y_true, y_pred, average='weighted'),  # or 'macro', 'micro'\n",
+    "        \"recall\": recall_score(y_true, y_pred, average='weighted'),\n",
+    "        \"f1_score\": f1_score(y_true, y_pred, average='weighted'),\n",
+    "        \"confusion_matrix\": confusion_matrix(y_true, y_pred).tolist()\n",
+    "    }\n",
+    "\n",
+    "    return f\"The Score for the computer vision model:\\n {metrics}\"  # history\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "438c3e30-d61a-43a3-9a6b-d03da3db6734",
+   "metadata": {
+    "language": "python",
+    "name": "RUN_TRAIN_MODEL"
+   },
+   "outputs": [],
+   "source": [
+    "# Run to train the model\n",
+    "\n",
+    "# Typical results for CIFAR-10:\n",
+    "##SGD: Slower convergence, may need 50+ epochs, final accuracy ~65-70%\n",
+    "##RMSprop: Medium convergence, ~30 epochs, final accuracy ~60-65%\n",
+    "##Adam: Fast convergence, ~20-30 epochs, final accuracy ~60-65%\n",
+    "##Learning Rate Sensitivity:\n",
+    "##SGD: Very sensitive - wrong LR can break training\n",
+    "##RMSprop: Moderately sensitive - usually works with default\n",
+    "##Adam: Least sensitive - 0.001 works for most problems\n",
+    "\n",
+    "hyperparameters = {\"epochs\": 30, \"batch-size\": 256, \"optimizer\": 'adam'}\n",
+    "\n",
+    "train_cifar_model(hyperparameters[\"epochs\"], hyperparameters[\"batch-size\"], hyperparameters[\"optimizer\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6226010-dc28-407a-9a0e-2988ffc9bf12",
+   "metadata": {
+    "collapsed": false,
+    "name": "Mark_IMG"
+   },
+   "source": [
+    "## Image Preprocessing - Essential Requirements\n",
+    "\n",
+    "This preprocessing step is **essential** because:\n",
+    "\n",
+    "- **Standardization**: All images must be the same size for batch processing\n",
+    "- **Model 
Compatibility**: The CNN was trained on 32×32 images, so new images must match \n", + "- **Memory Efficiency**: Smaller images reduce computational requirements\n", + "- **Consistency**: Ensures the model receives data in the expected format\n", + "\n", + "> **Note**: This resizing step is critical before feeding images to our trained CIFAR-10 model for accurate predictions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95dc3140-a431-47b5-9e0c-32360162ead0", + "metadata": { + "language": "python", + "name": "Check_IMG" + }, + "outputs": [], + "source": [ + "from PIL import Image\n", + "image_path = '0009.jpg' \n", + "img = Image.open(image_path).resize((32, 32))\n", + "\n", + "img" + ] + }, + { + "cell_type": "markdown", + "id": "848c2467-2705-439e-8aaf-b75e0ff90e72", + "metadata": { + "collapsed": false, + "name": "Mark_Test" + }, + "source": [ + "## CIFAR-10 Image Prediction System\n", + "\n", + "- This notebook demonstrates how to load a trained CIFAR-10 model and make predictions on new images using Snowflake's Python environment.\n", + "- Load a saved Keras model and predict which of the 10 CIFAR-10 classes a new image belongs to, with complete preprocessing pipeline.\n", + "- Predict single image\n", + "- Batch Processing predict images" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d3ec134d-e84e-4580-b474-49dd4ad12b5c", + "metadata": { + "codeCollapsed": false, + "language": "python", + "name": "TEST_1" + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "import numpy as np\n", + "# You might need to install Pillow: pip install Pillow\n", + "from PIL import Image\n", + "\n", + "# Define the human-readable class names for CIFAR-10\n", + "CLASS_NAMES = ['airplane', 'automobile', 'bird', 'cat', 'deer', \n", + " 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "local_model_path = '/tmp/keras_cifar10_model'\n", + "\n", + "def predict_single_image(model_path, image_path):\n", + " \"\"\"\n", + " Loads a saved model, preprocesses a single image,\n", + " and returns the predicted class name.\n", + " \n", + " Args:\n", + " model_path (str): The path to the saved Keras model directory.\n", + " image_path (str): The path to the new image file.\n", + " \n", + " Returns:\n", + " str: The predicted class name.\n", + " \"\"\"\n", + " # --- 1. Load the saved model ---\n", + " # Use tf.keras.models.load_model to load the entire model.\n", + " print(f\"Loading model from: {model_path}\")\n", + " model = tf.keras.models.load_model(model_path)\n", + " \n", + " # --- 2. Load and Preprocess the Image ---\n", + " # The model was trained on 32x32 images, so we must resize our new image.\n", + " img = Image.open(image_path).resize((32, 32))\n", + " \n", + " # Convert the image to a NumPy array and normalize pixel values to [0, 1]\n", + " img_array = np.array(img) / 255.0\n", + " \n", + " # The model.predict method expects a \"batch\" of images.\n", + " # We add a new axis to turn our (32, 32, 3) image into (1, 32, 32, 3).\n", + " img_batch = np.expand_dims(img_array, axis=0)\n", + " \n", + " # --- 3. Make the Prediction ---\n", + " predictions = model.predict(img_batch)\n", + " \n", + " # --- 4. 
Interpret the Results ---\n", + " # The output is an array of probabilities for each class.\n", + " # We find the index of the highest probability using np.argmax.\n", + " predicted_class_index = np.argmax(predictions[0])\n", + " \n", + " # Map the index to its corresponding class name.\n", + " predicted_class_name = CLASS_NAMES[predicted_class_index]\n", + " \n", + " return predicted_class_name\n", + "\n", + "# --- Example Usage ---\n", + "if __name__ == '__main__':\n", + " # First, run your training process\n", + " ##main()\n", + " \n", + " # Now, use the function to predict on a new image\n", + " # IMPORTANT: Replace these paths with your actual paths\n", + " SAVED_MODEL_PATH = '/saved_model/cifar10_model.keras' # '/tmp/keras_cifar10_model'\n", + " # Download or find a sample image of a truck, car, etc.\n", + " NEW_IMAGE_PATH = '0009.jpg' #plane\n", + " #NEW_IMAGE_PATH = '0016.jpg' # cat\n", + " #NEW_IMAGE_PATH = '0007.jpg' # ship\n", + "\n", + " # Create a list of all images you want to test\n", + " images_to_test = ['0007.jpg', '0009.jpg', '0016.jpg'] # ship, plane, cat\n", + " \n", + " for image_file in images_to_test:\n", + " try:\n", + " print(f\"\\n--- Predicting for image: {image_file} ---\")\n", + " prediction = predict_single_image(SAVED_MODEL_PATH, image_file)\n", + " print(f\"📸 The model predicts this image is a: {prediction.upper()}\")\n", + " except FileNotFoundError:\n", + " print(f\"Error: Image file not found at '{image_file}'.\")\n", + "\n", + " try:\n", + " prediction = predict_single_image(SAVED_MODEL_PATH, NEW_IMAGE_PATH)\n", + " print(\"\\n\" + \"=\"*30)\n", + " print(f\"📸 The model predicts the image is a: {prediction.upper()}\")\n", + " print(\"=\"*30)\n", + " except FileNotFoundError:\n", + " print(f\"\\nError: Image file not found at '{NEW_IMAGE_PATH}'.\")\n", + " print(\"Please update NEW_IMAGE_PATH with a valid path to an image.\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Streamlit Notebook", + "name": "streamlit" + }, + "lastEditStatus": { + "authorEmail": "ihor.karbovskyy@snowflake.com", + "authorId": "26807344418", + "authorName": "IKARBOV", + "lastEditTime": 1757002350893, + "notebookId": "azkuerkjgkaq6evt5sll", + "sessionId": "eff06377-08f9-408e-8f7e-774e6f67b0c7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/samples/Sagemaker-to-Snowflake/README.md b/samples/Sagemaker-to-Snowflake/README.md new file mode 100644 index 00000000..1a371585 --- /dev/null +++ b/samples/Sagemaker-to-Snowflake/README.md @@ -0,0 +1,52 @@ +# SageMaker ➜ Snowflake Migration Examples + +This repository provides simple examples to help you migrate machine learning workloads from **AWS SageMaker** to **Snowflake ML**. + +### Included Examples + +* **XGBoost Classifier** +* **PyTorch Classifier** +* **Image Classification** + +### Why Migrate? + +* Eliminate data movement between platforms +* Use Snowflake’s built‑in governance and security +* Deploy models directly as SQL functions + +### Quick Start + +1. Clone the repo: + +```bash +git clone https://github.com/Snowflake-Labs/sf-samples.git +cd sf-samples/samples/ml-sagemaker-to-snowflake +``` + +2. Install requirements: + +```bash +pip install -r requirements.txt +``` + +3. 
Run an example (e.g., XGBoost): + +```bash +cd xgboost_classifier +python train.py +``` + +### Repo Structure + +``` +ml-sagemaker-to-snowflake/ +├─ xgboost_classifier/ +├─ pytorch_classifier/ +├─ image_classification/ +└─ README.md +``` + +### License + +Apache 2.0 + diff --git a/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/About_Data.md b/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/About_Data.md new file mode 100644 index 00000000..6ad76c1e --- /dev/null +++ b/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/About_Data.md @@ -0,0 +1,27 @@ +# About the Data + +The dataset used in this project is publicly available and cited in *Discovering Knowledge in Data* by Daniel T. Larose. +It originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php), as acknowledged by the author. + +To ensure an **apples-to-apples comparison**, I’ve chosen the **same dataset** featured in the official [AWS SageMaker example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html#Data). +Let’s go ahead and download it. + +--- + +### Feature Descriptions + +- **State**: The U.S. state in which the customer resides (two-letter abbreviation, e.g., `OH`, `NJ`) +- **Account Length**: Number of days the account has been active +- **Area Code**: The three-digit area code of the customer's phone number +- **Phone**: Remaining seven digits of the phone number (non-informative) +- **Int’l Plan**: Whether the customer has an international calling plan (`yes`/`no`) +- **VMail Plan**: Whether the customer has a voice mail feature (`yes`/`no`) +- **VMail Message**: Average number of voice mail messages per month +- **Day Mins**: Total calling minutes used during the day +- **Day Calls**: Total number of calls placed during the day +- **Day Charge**: Billed cost of daytime calls +- **Eve Mins / Eve Calls / Eve Charge**: Evening call usage and billing +- **Night Mins / Night Calls / Night Charge**: Nighttime call usage and billing +- **Intl Mins / Intl Calls / Intl Charge**: International call usage and billing +- **CustServ Calls**: Number of calls placed to Customer Service +- **Churn?**: Whether the customer left the service (`true`/`false`) diff --git a/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/XGBOOST_CUSTOMER_CHURN_OSS.ipynb b/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/XGBOOST_CUSTOMER_CHURN_OSS.ipynb new file mode 100644 index 00000000..1ea3d814 --- /dev/null +++ b/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/XGBOOST_CUSTOMER_CHURN_OSS.ipynb @@ -0,0 +1,265 @@ +{ + "metadata": { + "kernelspec": { + "display_name": "Streamlit Notebook", + "name": "streamlit" + }, + "lastEditStatus": { + "notebookId": "i3hoemszsycuifmde67o", + "authorId": "6149508575120", + "authorName": "RPEGU", + "authorEmail": "ranjeeta.pegu@snowflake.com", + "sessionId": "5756ca3a-7443-4e3f-82fb-ed44f88d3597", + "lastEditTime": 1756821000333 + } + }, + "nbformat_minor": 5, + "nbformat": 4, + "cells": [ + { + "cell_type": "markdown", + "id": "3c2528dc-25c7-4442-8cb9-144c02f127c2", + "metadata": { + "name": "Introduction", + "collapsed": false + }, + "source": "## Introduction ##\n\nCustomer loss can significantly impact a business’s bottom line. By detecting at-risk customers early, companies can proactively engage them with retention strategies. 
In this workshop, we'll explore how to use machine learning capabilities to automate the identification of dissatisfied customers—commonly referred to as churn prediction\n\n ** Internal** [aws - example ](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html#Data)\n\n### Configuring the environment " + }, + { + "cell_type": "markdown", + "id": "af05cc8a-7a47-4ea3-812b-554343ab2260", + "metadata": { + "name": "prerequisite", + "collapsed": false + }, + "source": "I have download the data and uploaded it into snowflake using the **COPY** Command" + }, + { + "cell_type": "code", + "id": "3775908f-ca36-4846-8f38-5adca39217f2", + "metadata": { + "language": "python", + "name": "Libraries" + }, + "source": "# Import python packages\nimport streamlit as st\nimport pandas as pd\n\n# We can also use Snowpark for our analyses!\nfrom snowflake.snowpark.context import get_active_session\nsession = get_active_session()\n\n#Snowflake libraries \nfrom snowflake import snowpark\nfrom snowflake.ml import dataset\nfrom snowflake.snowpark.functions import col\nfrom snowflake.snowpark.types import *\n\n\n# python libraries \nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport time\nimport json\nfrom IPython.display import display\n\n## set the database and schema\nsession.use_database('ml_models')\nsession.use_schema('ml_models.ds')\n", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "id": "8d50cbf4-0c8d-4950-86cb-114990437ac9", + "metadata": { + "language": "python", + "name": "downloadData" + }, + "source": "#download the data \nchurn = session.table(\"CHURN\")\n\nchurn.head(5)", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "beaacb4a-a9ef-4c2b-9bfe-e3ba6067fb5c", + "metadata": { + "name": "cell2", + "collapsed": false + }, + "source": "## EDA\n\nLet’s explore the dataset further and uncover additional insights." + }, + { + "cell_type": "code", + "id": "c695373e-ac74-4b62-a1f1-08206cbd5c81", + "metadata": { + "language": "python", + "name": "EDA" + }, + "source": "# get the numerical and categorical features\nnumerical_columns = churn.select_dtypes(include=['number']).columns.tolist()\ncategorical_columns = churn.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()\n\nprint(\"Numerical Columns:\", numerical_columns)\nprint(\"Categorical Columns:\", categorical_columns)", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "id": "de9b1cea-3344-4117-8989-b0af6eaa36be", + "metadata": { + "language": "python", + "name": "Hist" + }, + "outputs": [], + "source": "pd.set_option(\"display.max_columns\", 500)\ndf = churn.describe()\ndf\nhist = churn.hist(bins=30, sharey=True, figsize=(10, 10))", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "bd05e5bd-de68-429c-bfe3-95695392996c", + "metadata": { + "name": "cell1", + "collapsed": false + }, + "source": "We can see immediately that: - State appears to be quite evenly distributed. - Phone takes on too many unique values to be of any practical use. It’s possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it. - Most of the numeric features are surprisingly nicely distributed, with many showing bell-like gaussianity. VMail Message is a notable exception (and Area Code showing up as a feature we should convert to non-numeric)." 
+ }, + { + "cell_type": "code", + "id": "e166619a-c6af-4f75-aeac-0281e928a2df", + "metadata": { + "language": "python", + "name": "drop_phone" + }, + "outputs": [], + "source": "churn = churn.drop(\"PHONE\", axis=1)\nchurn[\"AREA_CODE\"] = churn[\"AREA_CODE\"].astype(object)\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "41e13eb5-4b31-43a8-ac9e-a241e60b4666", + "metadata": { + "language": "python", + "name": "histplot", + "collapsed": false, + "codeCollapsed": true + }, + "outputs": [], + "source": "import matplotlib.pyplot as plt\n\n# Histograms of numeric features by CHURN class\nfor column in churn.select_dtypes(include=[\"number\"]).columns:\n hist = churn[[column, \"CHURN\"]].hist(by=\"CHURN\", bins=30, edgecolor='black', figsize=(4, 3))\n plt.suptitle(f\"{column} by CHURN\", y=1) # Add title\n plt.tight_layout()\n plt.show()\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "4c410412-42f3-4972-8068-b93d40f8e5e1", + "metadata": { + "language": "python", + "name": "corr" + }, + "outputs": [], + "source": "df_corr = churn.select_dtypes(include=['number']).corr()\ndf_corr", + "execution_count": null + }, + { + "cell_type": "code", + "id": "cd41666f-1de9-40bb-b6b0-e2390b736b5f", + "metadata": { + "language": "python", + "name": "matrix", + "collapsed": false, + "codeCollapsed": true + }, + "outputs": [], + "source": "# Scatter matrix only on numeric columns\npd.plotting.scatter_matrix(churn.select_dtypes(include=['number']), figsize=(12, 12), diagonal='hist', alpha=0.5)\nplt.suptitle(\"Scatter Matrix of Numeric Features\", y=1)\nplt.show()", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "7dc5ec61-0ef1-494a-af01-65c39e1e3423", + "metadata": { + "name": "cell8", + "collapsed": false + }, + "source": "We see several features that essentially have 100% correlation with one another. Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, Intl Charge from the pair with Intl Mins:" + }, + { + "cell_type": "code", + "id": "534eedd3-31ea-4b44-8292-5a9073aa6521", + "metadata": { + "language": "python", + "name": "cell9", + "collapsed": true, + "codeCollapsed": false + }, + "outputs": [], + "source": "#churn.columns\nchurn = churn.drop([\"Day Charge\", \"Eve Charge\", \"Night Charge\", \"Intl Charge\"], axis=1)", + "execution_count": null + }, + { + "cell_type": "code", + "id": "15b86be3-987d-426d-a3d6-634a1dac6f5b", + "metadata": { + "language": "python", + "name": "cell16" + }, + "outputs": [], + "source": "\n# Make a copy to avoid modifying the original\ndf = churn.copy()\n\n# Step 1: Convert bool columns to string so they are treated as categorical\nbool_cols = df.select_dtypes(include='bool').columns\ndf[bool_cols] = df[bool_cols].astype(str)\n\n# Step 2: One-hot encode object and bool columns, dropping first level\ncategorical_cols = df.select_dtypes(include='object').columns.union(bool_cols)\ndf_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)\n\n# Step 3: Check result\n\ndf_encoded.head()", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "32673c44-0071-4f4a-95e3-cae5f2f1fcd6", + "metadata": { + "name": "cell11", + "collapsed": false + }, + "source": "But first, let’s convert our categorical features into numeric features." 
+ }, + { + "cell_type": "markdown", + "id": "f6383ab1-65bf-401e-b917-06ce75c739b2", + "metadata": { + "name": "cell7", + "collapsed": false + }, + "source": "Let’s split the data into training, validation, and test sets." + }, + { + "cell_type": "code", + "id": "760188b7-c8fe-4156-938d-9137ae9430ff", + "metadata": { + "language": "sql", + "name": "run_if_needed" + }, + "outputs": [], + "source": "ALTER DATASET CHURN_TRAIN_DF DROP VERSION 'v1';\nALTER DATASET CHURN_TEST_DF DROP VERSION 'v1';\nALTER DATASET CHURN_VALIDATION_DF DROP VERSION 'v1';\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "a012985c-77cd-4459-a5a5-94d069535901", + "metadata": { + "language": "python", + "name": "snfdataset" + }, + "outputs": [], + "source": "train_data, validation_data, test_data = np.split(\n df_encoded.sample(frac=1, random_state=1729),\n [int(0.7 * len(df_encoded)), int(0.9 * len(df_encoded))],\n)\n\n\n## we will keep the dataset in snowflake for future use\nfrom snowflake.ml import dataset\n\ntrain_df = session.create_dataframe(train_data)\nvalidation_df =session.create_dataframe(validation_data)\ntest_df = session.create_dataframe(test_data)\n\n# Materialize DataFrame contents into a Dataset\nds1 = dataset.create_from_dataframe(\n session,\n \"churn_train_df\",\n \"v1\",\n input_dataframe=train_df)\nds2 = dataset.create_from_dataframe(\n session,\n \"churn_test_df\",\n \"v1\",\n input_dataframe=train_df)\nds3 = dataset.create_from_dataframe(\n session,\n \"churn_validation_df\",\n \"v1\",\n input_dataframe=train_df)", + "execution_count": null + }, + { + "cell_type": "code", + "id": "963dd88e-2e9e-4ede-995f-7c596e0557ea", + "metadata": { + "language": "python", + "name": "download_traindata" + }, + "outputs": [], + "source": "# Create a DataConnector from a Snowflake Dataset\nds_train = dataset.load_dataset(session, \"churn_train_df\", \"v1\")\n# Get a Snowpark DataFrame\ndf_train = ds_train.read.to_snowpark_dataframe().to_pandas()\n\nds_validation = dataset.load_dataset(session, \"churn_validation_df\", \"v1\")\ndf_validation = ds_validation.read.to_snowpark_dataframe().to_pandas()\n\n\nds_test = dataset.load_dataset(session, \"churn_test_df\", \"v1\")\ndf_test = ds_test.read.to_snowpark_dataframe().to_pandas()\n\n\n\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "e320dcce-015c-4aab-8384-eff783010070", + "metadata": { + "language": "python", + "name": "cell14" + }, + "outputs": [], + "source": "df_train.columns", + "execution_count": null + }, + { + "cell_type": "code", + "id": "521da291-f095-49c9-8cd1-09b5790f111f", + "metadata": { + "language": "python", + "name": "cell15" + }, + "outputs": [], + "source": "import xgboost as xgb # pre-install with snowflake container runtime notebook \nfrom sklearn.metrics import accuracy_score, classification_report\nimport matplotlib.pyplot as plt\n\n# Assuming 'CHURN_Yes' is your target\nX = df_encoded.drop(columns=['CHURN_True.'])\ny = df_encoded['CHURN_True.']\n\n\n# Step 1: Define feature and target columns\ntarget_col = 'CHURN_True.'\nX_train = df_train.drop(columns=['CHURN_True.'])\ny_train = df_train['CHURN_True.']\n\nX_val = df_validation.drop(columns=['CHURN_True.'])\ny_val = df_validation['CHURN_True.']\n\nX_test = df_test.drop(columns=['CHURN_True.'])\ny_test = df_test['CHURN_True.']\n\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "51214f91-08f1-434d-9fc3-3332f22841ef", + "metadata": { + "language": "python", + "name": "cell13" + }, + "outputs": [], + "source": "from xgboost import 
XGBClassifier\nfrom sklearn.metrics import accuracy_score, classification_report\n\n# Instantiate the classifier (hyperparameters here are illustrative defaults)\nmodel = XGBClassifier(\n    objective=\"binary:logistic\",\n    n_estimators=100,\n    eval_metric=\"logloss\"\n)\n\nmodel.fit(\n    X_train,\n    y_train,\n    eval_set=[(X_val, y_val)],\n    verbose=True\n)\n\n\n# Predict on test\ny_pred = model.predict(X_test)\n\n# Evaluate\nprint(\"Test Accuracy:\", accuracy_score(y_test, y_pred))\nprint(\"\\nClassification Report:\\n\", classification_report(y_test, y_pred))\n",
+   "execution_count": null
+  }
+ ]
+}
\ No newline at end of file
diff --git a/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/XGBOOST_CUSTOMER_CHURN_SNF.ipynb b/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/XGBOOST_CUSTOMER_CHURN_SNF.ipynb
new file mode 100644
index 00000000..0158f428
--- /dev/null
+++ b/samples/Sagemaker-to-Snowflake/XGBoost_Classifier/XGBOOST_CUSTOMER_CHURN_SNF.ipynb
@@ -0,0 +1,320 @@
+{
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Streamlit Notebook",
+   "name": "streamlit"
+  },
+  "lastEditStatus": {
+   "notebookId": "nasuateia7jyqpmuwhj6",
+   "authorId": "6149508575120",
+   "authorName": "RPEGU",
+   "authorEmail": "ranjeeta.pegu@snowflake.com",
+   "sessionId": "78f8c5f2-6a29-440b-98da-67fb24ac8c4a",
+   "lastEditTime": 1756827515044
+  }
+ },
+ "nbformat_minor": 5,
+ "nbformat": 4,
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "3c2528dc-25c7-4442-8cb9-144c02f127c2",
+   "metadata": {
+    "name": "Introduction",
+    "collapsed": false
+   },
+   "source": "## Introduction ##\n\nCustomer loss can significantly impact a business’s bottom line. By detecting at-risk customers early, companies can proactively engage them with retention strategies. In this workshop, we'll explore how to use Snowflake’s native [machine learning](https://docs.snowflake.com/de/developer-guide/snowpark-ml/reference/1.5.3/modeling) capabilities to automate the identification of dissatisfied customers, commonly referred to as churn prediction.\n\n**Internal reference:** [AWS example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html#Data)\n\n### Configuring the environment"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "af05cc8a-7a47-4ea3-812b-554343ab2260",
+   "metadata": {
+    "name": "prerequisite",
+    "collapsed": false
+   },
+   "source": "I have downloaded the data and uploaded it into Snowflake using the **COPY** command."
+  },
+  {
+   "cell_type": "code",
+   "id": "3775908f-ca36-4846-8f38-5adca39217f2",
+   "metadata": {
+    "language": "python",
+    "name": "Libraries"
+   },
+   "source": "# Import python packages\nimport streamlit as st\nimport pandas as pd\n\n# We can also use Snowpark for our analyses!\nfrom snowflake.snowpark.context import get_active_session\nsession = get_active_session()\n\n# Snowflake libraries\nfrom snowflake import snowpark\nfrom snowflake.ml import dataset\nfrom snowflake.snowpark.functions import col,when,lit\nfrom snowflake.snowpark.types import *\n\n## Snowflake ml libraries\nfrom snowflake.ml.modeling.xgboost import XGBClassifier\nfrom snowflake.ml.modeling.preprocessing import MinMaxScaler , OneHotEncoder\n\n# snowpark ML metrics\nfrom snowflake.ml.modeling.metrics import accuracy_score,f1_score,precision_score,roc_auc_score,roc_curve,recall_score\n\n\n# python libraries\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport time\nimport json\nfrom IPython.display import display\n\n## set the database and schema\nsession.use_database('ml_models')\nsession.use_schema('ml_models.ds')\n",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "code",
+   "id": 
"8d50cbf4-0c8d-4950-86cb-114990437ac9", + "metadata": { + "language": "python", + "name": "downloadData" + }, + "source": "#download the data \nchurn = session.table(\"CHURN\")\n\nchurn.show(5)", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "beaacb4a-a9ef-4c2b-9bfe-e3ba6067fb5c", + "metadata": { + "name": "cell2", + "collapsed": false + }, + "source": "## EDA\n\nLet’s explore the dataset further and uncover additional insights." + }, + { + "cell_type": "code", + "id": "c695373e-ac74-4b62-a1f1-08206cbd5c81", + "metadata": { + "language": "python", + "name": "EDA" + }, + "source": "# get the numerical and categorical features\n#get the schema\nschema = churn.schema\n\nnumerical_types = (IntegerType, FloatType, DecimalType, LongType, ShortType, DoubleType)\nnumerical_columns =[f.name for f in schema if isinstance(f.datatype, numerical_types)]\n\n\ncategorical_types = (StringType, VariantType, BooleanType)\ncategorical_columns = [f.name for f in schema if isinstance(f.datatype, categorical_types)]\n\nprint(\"Numerical Columns:\", numerical_columns)\nprint(\"Categorical Columns:\", categorical_columns)", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "id": "de9b1cea-3344-4117-8989-b0af6eaa36be", + "metadata": { + "language": "python", + "name": "describe" + }, + "outputs": [], + "source": "pd.set_option(\"display.max_columns\", 500)\ndf = churn.describe()\ndf\n", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "bd05e5bd-de68-429c-bfe3-95695392996c", + "metadata": { + "name": "cell1", + "collapsed": false + }, + "source": "We can see immediately that: - State appears to be quite evenly distributed. - Phone takes on too many unique values to be of any practical use. It’s possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it. - Most of the numeric features are surprisingly nicely distributed, with many showing bell-like gaussianity. VMail Message is a notable exception (and Area Code showing up as a feature we should convert to non-numeric)." 
+ }, + { + "cell_type": "code", + "id": "e166619a-c6af-4f75-aeac-0281e928a2df", + "metadata": { + "language": "python", + "name": "drop_phone_col" + }, + "outputs": [], + "source": "#drop column phone from the snowprk dataframe\nchurn = churn.drop(\"PHONE\")\n\n#convert to a string column\nchurn = churn.with_column(\"AREA_CODE\", col(\"AREA_CODE\").cast(StringType()))\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "41e13eb5-4b31-43a8-ac9e-a241e60b4666", + "metadata": { + "language": "python", + "name": "Hist", + "collapsed": false, + "codeCollapsed": true + }, + "outputs": [], + "source": "import matplotlib.pyplot as plt\ndf = churn.to_pandas()\n\n# Histograms of numeric features by CHURN class\nfor column in df.select_dtypes(include=[\"number\"]).columns:\n hist = df[[column, \"CHURN\"]].hist(by=\"CHURN\", bins=30, edgecolor='black', figsize=(4, 3))\n plt.suptitle(f\"{column} by CHURN\", y=1) # Add title\n plt.tight_layout()\n plt.show()\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "4c410412-42f3-4972-8068-b93d40f8e5e1", + "metadata": { + "language": "python", + "name": "corr" + }, + "outputs": [], + "source": "#df_corr = churn.select_dtypes(include=['number']).corr()\n#df_corr\nnumerical_columns =[f.name for f in churn.schema if isinstance(f.datatype, numerical_types)]\n # Initialize an empty DataFrame to store the correlation matrix\ncorr_matrix = pd.DataFrame(index=numerical_columns, columns=numerical_columns, dtype=float)\n\n\n# For each pair of numerical columns, calculate the correlation\nfor col1 in numerical_columns:\n for col2 in numerical_columns:\n correlation_value = churn.stat.corr(col1, col2)\n corr_matrix.loc[col1, col2] = correlation_value\n \n \nprint(\"\\nCorrelation Matrix calculated with df.stat.corr():\")\nprint(corr_matrix)\n \n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "cd41666f-1de9-40bb-b6b0-e2390b736b5f", + "metadata": { + "language": "python", + "name": "corrmatrix" + }, + "outputs": [], + "source": "import seaborn as sns\nplt.figure(figsize=(10, 8))\n\nsns.heatmap(\n corr_matrix,\n annot = True,\n cmap ='coolwarm',\n fmt = \".2f\",\n linewidths =-.5,\n cbar_kws={'label': 'Correlation Coefficient'}\n)\n\n", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "7dc5ec61-0ef1-494a-af01-65c39e1e3423", + "metadata": { + "name": "cell8", + "collapsed": false + }, + "source": "We see several features that essentially have 100% correlation with one another. Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias. 
Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, Intl Charge from the pair with Intl Mins:" + }, + { + "cell_type": "code", + "id": "1df86bf2-6fe6-4a97-bad2-ba40107d4c99", + "metadata": { + "language": "python", + "name": "rename_bool_col", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "churn= churn.with_column_renamed(\"Int'l Plan\",\"INTL_PLAN\")\n#churn.columns\n#Cat_cols =['STATE','INTL_PLAN', 'VMAIL_PLAN']\n\n\n\nchurn= (\n churn\n .with_column(\"INTL_PLAN\", \n when(col(\"INTL_PLAN\")== True,1).otherwise(0))\n .with_column(\"VMAIL_PLAN\", when(col(\"VMAIL_PLAN\")== True,1).otherwise(0))\n)\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "534eedd3-31ea-4b44-8292-5a9073aa6521", + "metadata": { + "language": "python", + "name": "drop_correlate_cols" + }, + "outputs": [], + "source": "#drop \nchurn = churn.drop(\"Day Charge\", \"Eve Charge\", \"Night Charge\", \"Intl Charge\")", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "32673c44-0071-4f4a-95e3-cae5f2f1fcd6", + "metadata": { + "name": "cell11", + "collapsed": false + }, + "source": "But first, let’s convert our categorical features into numeric features." + }, + { + "cell_type": "code", + "id": "15b86be3-987d-426d-a3d6-634a1dac6f5b", + "metadata": { + "language": "python", + "name": "Onehotencoding" + }, + "outputs": [], + "source": "\n\ncat_cols =['STATE','INTL_PLAN', 'VMAIL_PLAN','AREA_CODE']\nohe = OneHotEncoder(input_cols=cat_cols,\n output_cols=cat_cols,\n drop_input_cols=True,\n drop=\"first\",\n handle_unknown=\"ignore\")\n#fit & Transform\ndf = ohe.fit(churn).transform(churn)\ndf= df.with_column(\n \"CHURN\",\n when(col(\"CHURN\") == \"True.\", 1).otherwise(0)\n)", + "execution_count": null + }, + { + "cell_type": "code", + "id": "27a5b8f6-2fa7-47e0-80ed-ab1d3393a8ab", + "metadata": { + "language": "python", + "name": "cell4" + }, + "outputs": [], + "source": "#df", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "f6383ab1-65bf-401e-b917-06ce75c739b2", + "metadata": { + "name": "Train_test_split", + "collapsed": false + }, + "source": "# Train Test Split \nLet’s split the data into training, validation, and test sets." + }, + { + "cell_type": "code", + "id": "760188b7-c8fe-4156-938d-9137ae9430ff", + "metadata": { + "language": "sql", + "name": "optional_runifneeded", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "ALTER DATASET CHURN_TRAIN_DF DROP VERSION 'snf';\nALTER DATASET CHURN_TEST_DF DROP VERSION 'snf';\nALTER DATASET CHURN_VALIDATION_DF DROP VERSION 'snf';\n", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "2bb5965b-4563-4a1c-b3a3-80b33b0cc8c3", + "metadata": { + "name": "cell3", + "collapsed": false + }, + "source": "# Dataset\nAfter splitting the data into training, validation, and test sets, I will store them as Snowflake datasets (tables or views). 
\nThis ensures I can reuse the splits in future runs without repeating the preprocessing steps.\n"
+  },
+  {
+   "cell_type": "code",
+   "id": "a012985c-77cd-4459-a5a5-94d069535901",
+   "metadata": {
+    "language": "python",
+    "name": "snfdataset"
+   },
+   "outputs": [],
+   "source": "\ntrain_df, validation_df, test_df = df.random_split(weights=[0.70, 0.20, 0.1], seed=62)\n\n## we will keep the datasets in Snowflake for future use\nfrom snowflake.ml import dataset\n\n# Materialize each DataFrame's contents into its own Dataset\nds1 = dataset.create_from_dataframe(\n    session,\n    \"churn_train_df\",\n    \"snf\",\n    input_dataframe=train_df)\nds2 = dataset.create_from_dataframe(\n    session,\n    \"churn_test_df\",\n    \"snf\",\n    input_dataframe=test_df)\nds3 = dataset.create_from_dataframe(\n    session,\n    \"churn_validation_df\",\n    \"snf\",\n    input_dataframe=validation_df)",
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "id": "963dd88e-2e9e-4ede-995f-7c596e0557ea",
+   "metadata": {
+    "language": "python",
+    "name": "download_traindata"
+   },
+   "outputs": [],
+   "source": "# Create a DataConnector from a Snowflake Dataset\nds_train = dataset.load_dataset(session, \"churn_train_df\", \"snf\")\n# Get a Snowpark DataFrame\ndf_train = ds_train.read.to_snowpark_dataframe()\n\nds_validation = dataset.load_dataset(session, \"churn_validation_df\", \"snf\")\ndf_validation = ds_validation.read.to_snowpark_dataframe()\n\nds_test = dataset.load_dataset(session, \"churn_test_df\", \"snf\")\ndf_test = ds_test.read.to_snowpark_dataframe()\n",
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "id": "1733e471-9628-4adf-9334-32183ee349be",
+   "metadata": {
+    "language": "python",
+    "name": "CastDouble"
+   },
+   "outputs": [],
+   "source": "\n# the snowflake ml libraries are sensitive to datatypes, so make sure to cast them properly\ninput_cols = [c for c in df.columns if c != \"CHURN\"]\n\nfor c in input_cols:\n    df_train = df_train.with_column(c, col(c).cast(\"double\"))\n\nfor c in input_cols:\n    df_test = df_test.with_column(c, col(c).cast(\"double\"))\n\nfor c in input_cols:\n    df_validation = df_validation.with_column(c, col(c).cast(\"double\"))\n",
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "id": "53070813-c44f-47b4-81e5-cb4a403c718b",
+   "metadata": {
+    "language": "python",
+    "name": "input_label_cols"
+   },
+   "outputs": [],
+   "source": "#df_train.columns\n# Filter out the target column to get the feature columns\ninput_cols = [col_name for col_name in df_train.columns if col_name != \"CHURN\"]\nOUTPUT_COLUMNS = \"PREDICTED_CHURN\"\nlabel_col = \"CHURN\"",
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "id": "5eeead4d-241b-4d23-b72e-f5e3be79dd72",
+   "metadata": {
+    "language": "python",
+    "name": "modelTrain",
+    "collapsed": false,
+    "codeCollapsed": false
+   },
+   "outputs": [],
+   "source": "model1 = XGBClassifier(\n    objective=\"binary:logistic\",\n    n_estimators=100,\n    learning_rate=0.1,\n    max_depth=5,\n    gamma=4,\n    min_child_weight=6,\n    subsample=0.8,\n    use_label_encoder=False,\n    eval_metric=\"logloss\",\n    input_cols=input_cols,\n    label_cols=label_col,\n    output_cols=OUTPUT_COLUMNS\n)\n\n#fit\nmodel1.fit(df_train)\npredict_df_train = model1.predict(df_train)\n",
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "id": "521da291-f095-49c9-8cd1-09b5790f111f",
+   "metadata": {
+    "language": "python",
+    "name": "Test_Xgbclassifier"
+   },
+   "outputs": [],
+   "source": "predict_on_test_data = model1.predict(df_test)\n\ntest_accuracy = accuracy_score(df=predict_on_test_data, \n
y_true_col_names=[\"CHURN\"],\n y_pred_col_names=[\"PREDICTED_CHURN\"]\n )\n\n\n\n# Evaluate\nprint(\"Test Accuracy:\", test_accuracy)\n#print(\"\\nClassification Report:\\n\", classification_report(predict_on_test_data[\"CHURN\"], predict_on_test_data[\"PREDICTED_CHURN\"]))\n", + "execution_count": null + }, + { + "cell_type": "code", + "id": "51214f91-08f1-434d-9fc3-3332f22841ef", + "metadata": { + "language": "python", + "name": "xgboostMetrics" + }, + "outputs": [], + "source": "from snowflake.ml.modeling.metrics import confusion_matrix\nresult = model.predict(df_validation)\n\n\nmetrics = {\n\"accuracy\":accuracy_score(df=result, \n y_true_col_names=\"CHURN\", \n y_pred_col_names=\"PREDICTED_CHURN\"),\n\n\"precision\":precision_score(df=result,\n y_true_col_names=\"CHURN\", \n y_pred_col_names=\"PREDICTED_CHURN\"),\n\n\n\"recall\": recall_score(df=result, \n y_true_col_names=\"CHURN\",\n y_pred_col_names=\"PREDICTED_CHURN\"),\n\n\n\n\"f1_score\":f1_score(df=result,\n y_true_col_names=\"CHURN\",\n y_pred_col_names=\"PREDICTED_CHURN\"),\n\"confusion_matrix\":confusion_matrix(df=result, \n y_true_col_name=\"CHURN\",\n y_pred_col_name=\"PREDICTED_CHURN\").tolist()\n}\n\nprint(f\" The Score for the xgboost model :\\n {metrics}\")\nprint(f\" The Score for the xgboost model :\\n {metrics}\")", + "execution_count": null + } + ] +} \ No newline at end of file