This is a Mini-Project for the module SC1015 (Introduction to Data Science and Artificial Intelligence), focusing on online-shopping activities.
Source data is retrieved from the "eCommerce behavior data from multi category store" dataset by Michael Kechinov.
For more details on the datasets, please view: The Dataset Collections.
For a detailed walkthrough of the code, please view the notebooks for each step of our analysis and the conclusion at the end of the final notebook. The notebooks are as follows (please read them in order):
- Data Cleaning
- Data Mining and Visualization
- Classification Tree Model
- Random Forest Classification Model
- XGBoost Classification Model
- @NguyenPhamMinhQuan
- @dnyk7
- @Jwong611
- To predict the probability that a product already added to the cart will actually be purchased by the user.
- Which model is best suited to predict this?
- Category of the product
- Sub-category of the product (since this information is provided in our dataset).
- Weekday: weekday of the event
- Day of month: day of the month on which the event occurred
- Time period of the event: Morning (6am-2pm), Afternoon (2pm-10pm), and Night (10pm-6am)
- User access history: number of sessions accessed by the user before this event. This reflects the customer's loyalty to, or familiarity with, the online shop.
- Activity count: number of activities the user engages in within one session.
- Brand of product
- Price of product
- Weekday: weekday of the event
- Day of month: day of the month on which the event occurred
- User access history: number of sessions accessed by the user before this event. This reflects the customer's loyalty to, or familiarity with, the online shop.
- Activity count: number of activities the user engages in within one session (a sketch of how these features can be derived is given after this list).
- Brand of product
- Price of product
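The snippet below is a minimal, hypothetical sketch of how the time-based and activity features above could be derived from the raw event log. The column names (`event_time`, `event_type`, `user_id`, `user_session`) follow the Kaggle dataset's schema; the file name is only an example, and the access-history feature is simplified to the user's total number of distinct sessions rather than only those before the event.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: column names follow the Kaggle dataset's schema;
# the file name is an example, not necessarily the file used in our notebooks.
df = pd.read_csv("2019-Nov.csv", parse_dates=["event_time"])

# Weekday (0 = Monday) and day of month of each event
df["weekday"] = df["event_time"].dt.dayofweek
df["day_of_month"] = df["event_time"].dt.day

# Time period: Morning (6am-2pm), Afternoon (2pm-10pm), Night (10pm-6am)
hour = df["event_time"].dt.hour
df["time_period"] = np.select(
    [hour.between(6, 13), hour.between(14, 21)],
    ["Morning", "Afternoon"],
    default="Night",
)

# Activity count: number of events in the same session
df["activity_count"] = df.groupby("user_session")["event_type"].transform("count")

# User access history, simplified here to the user's total number of distinct
# sessions rather than only those before the current event
df["access_history"] = df.groupby("user_id")["user_session"].transform("nunique")
```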
- There is a limit to the classification accuracy achievable with a single classification tree, regardless of its depth (illustrated in the sketch after this list).
- Thus, to enhance our model, we need to adopt a different algorithm or engineer better predictors (features).
- The number of activities in a session (how active a user is on the online shopping platform) is the best predictor of whether cart items are purchased. The longer users browse and the more items they see and interact with, the more likely they are to consider buying something, resulting in an eventual purchase.
- Surprisingly, the day of the week (Mon-Fri) is also a good predictor. However, we cannot deduce a logical reason behind this due to limited information about the online shopping platform; it could be due to customer demographics or a particular pattern in the shop's operations (such as offering sales on Sunday and Monday).
- Price, brand, and the user's familiarity (or history) with the platform do not contribute significantly to predicting purchases of cart items, presumably because once an item is in the cart, users no longer treat these variables as important considerations.
- XGBoost is one of the most efficient models to date and is the most suitable for building models on tabular data such as online-shopping logs.
- Yes, it is possible to predict customers' behaviour using machine learning with sufficient accuracy. Online shopping platforms should employ machine-learning models to boost their revenue.
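To make the first point concrete, the depth ceiling of a single tree can be seen by sweeping `max_depth` and comparing against a boosted ensemble. This is an illustrative sketch on synthetic stand-in data (via `make_classification`), not our actual features or results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for the engineered predictors and the purchased / not-purchased label
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A single tree's test accuracy stops improving (or degrades) beyond a certain depth
for depth in (2, 4, 8, 16, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: accuracy={tree.score(X_test, y_test):.3f}")

# A boosted ensemble of shallow trees typically pushes past that ceiling on tabular data
xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb.fit(X_train, y_train)
print(f"XGBoost: accuracy={xgb.score(X_test, y_test):.3f}")
```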
- Explore and classify the view-to-cart process. Views represent 94% of activities on the platform, so converting even a small fraction of views into cart additions or purchases would boost the platform's sales significantly. XGBoost can be applied to build a multi-class classifier directly, without resorting to a One-vs-Rest approach (see the first sketch after this list).
- Scale our project by employing Dask to read and work on larger data files (see the second sketch after this list).
- Perform cross-validation and tuning for all models.
- Engineer and examine more features, such as bounce rate, promotions, and sales, given more available data.
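A minimal sketch of the multi-class idea, assuming a three-way view / cart / purchase label and synthetic stand-in features (the rough 94% share of views is taken from the point above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in with three classes mirroring event types: 0 = view, 1 = cart, 2 = purchase
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.94, 0.04, 0.02], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# The softmax objective handles all three classes at once, so no One-vs-Rest wrapper is needed
clf = XGBClassifier(objective="multi:softprob", n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])  # per-event probabilities for view / cart / purchase
```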
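And a hedged sketch of the Dask idea: `dask.dataframe` mirrors the pandas API but reads the monthly CSVs lazily in chunks, so files larger than memory can still be processed. The file pattern and column names below assume the Kaggle dataset's layout.

```python
import dask.dataframe as dd

# Lazily read all monthly CSVs at once; nothing is loaded until .compute() is called
events = dd.read_csv("2019-*.csv", parse_dates=["event_time"])

# Most pandas-style operations work unchanged on the lazy frame
events = events.assign(day=events["event_time"].dt.date)
daily_revenue = (
    events[events["event_type"] == "purchase"]
    .groupby("day")["price"]
    .sum()
)
print(daily_revenue.compute().head())
```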
- Memory handling: Stack Overflow tips, `del`, and `inplace=True` (a small sketch is included after this list)
- Ensemble Modelling: Bagging and Boosting
- Tuning Hyperparameters with GridSearch (a tuning and feature-importance sketch follows this list)
- Extreme Gradient Boosting (XGBoost) Classifier
- Tuning of XGBoost using cross-validation (cv)
- Concepts of F-score and Feature Importance analysis (based on 'gain')
- Collaborating using GitHub
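The tuning and gain-based importance ideas above, sketched on synthetic stand-in data; the parameter grid is purely illustrative, not the grid we actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Synthetic stand-in data for the cart-purchase classification task
X, y = make_classification(n_samples=3000, n_features=8, random_state=42)

param_grid = {
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}

# Exhaustive search over the grid with 5-fold cross-validation
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print("best params:", search.best_params_, "| cv accuracy:", round(search.best_score_, 3))

# Feature importance ranked by 'gain': average loss reduction from splits on each feature
gain = search.best_estimator_.get_booster().get_score(importance_type="gain")
print(sorted(gain.items(), key=lambda kv: kv[1], reverse=True))
```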
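And a tiny, illustrative sketch of the memory-handling habits mentioned above (mutating frames in place and deleting large intermediates); the file and column names mirror the Kaggle dataset but are otherwise assumptions:

```python
import gc
import pandas as pd

# Illustrative only: file and column names mirror the Kaggle dataset
df = pd.read_csv("2019-Nov.csv")

# Mutate the frame in place instead of keeping a second, modified copy around
df.drop(columns=["category_id"], inplace=True)
df.dropna(subset=["brand"], inplace=True)

# Explicitly release large intermediates once they are no longer needed
sample = df.sample(frac=0.1, random_state=42)
del df
gc.collect()
```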
- https://www.kaggle.com/code/tshephisho/ecommerce-behaviour-using-xgboost/input
- https://slidesgo.com/theme/minimalist-business-slides#search-simple&position-6&results-3926&rs=search
- https://www.markdownguide.org/cheat-sheet/
- https://docs.github.com/en/repositories
- https://pandas.pydata.org/docs/reference/index.html
- https://numpy.org/doc/
- https://xgboost.readthedocs.io/en/stable/
- https://scikit-learn.org/stable/
- https://matplotlib.org/stable/index.html
- https://seaborn.pydata.org/
- https://towardsdatascience.com/ensemble-learning-bagging-boosting-3098079e5422
- https://medium.com/@thedatabeast/adaboost-gradient-boosting-xg-boost-similarities-differences-516874d644c6
- https://arxiv.org/pdf/2106.03253.pdf
- https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
- https://docs.dask.org/en/stable/
- https://www.sciencedirect.com/topics/computer-science/ensemble-modeling
- https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a