Maintaining large-scale product catalogs and category trees poses unique challenges for major online marketplaces and price comparison services. Conventional methods for handling vast amounts of unstructured e-commerce data frequently fall short, causing operational inefficiencies that lead to poor customer experiences and reduced ROI. We were part of a team of consultants hired to address these issues by transforming the catalog of an international online marketplace: 12 million products that had grown organically and had therefore become inconsistent and virtually unmanageable. As a novel approach, I developed a machine learning-based prototype that manages raw product data automatically with minimal human intervention, setting the stage for significant cost savings.
The marketplace faced a complex set of challenges stemming from a disordered product taxonomy. Redundant sub-categories with varying naming conventions were dispersed across different branches, creating confusion for both users and internal search algorithms. This disarray had cascading impacts: weakened SEO, difficulties for merchants in product matching, and a fractured user experience that negatively affected both conversions and profits. On the brink of sinking significant resources into a laborious, costly manual overhaul, the company changed direction when we suggested a machine learning-based, scalable, and automated alternative.
Instead of resorting to manual cleanup, our proposed machine learning-based solution centered on the novel idea of product vectorization. In close collaboration with the client, we developed a prototype that was:
- Fast & Efficient: Significantly reduces the manual effort typically required.
- Robust & Scalable: Built to adapt to future needs and scale effortlessly with the business.
- Automated & Semi-Automated: Offers both fully automated and semi-automated methods for category and product management, creating a balanced environment for human-machine collaboration.
Key Features & Value-Adds
By pivoting to this machine learning-driven strategy, we offered more than just a solution to an immediate issue. We delivered a comprehensive, scalable framework that stands to revolutionize multiple facets of e-commerce operations such as:
- Rebuilding Taxonomies: Creates efficient new category trees that substantially boost both SEO and onsite search capabilities.
- Duplicate Category Identification: Speeds up the listing process and enhances customer experience by removing redundant categories.
- Attribute Extraction: Adds depth to product categories by identifying new sub-categories or filters.
- Automatic Categorization: Further reduces the manual effort required in listing products.
- Data Enrichment: Elevates listing quality, thereby positively impacting both internal search and SEO.
- Duplicate Management: Ensures the integrity of the product data by detecting duplicate products, contributing to higher conversion rates.
The project was structured in distinct, iterative phases, each fine-tuned together with the client’s data and infrastructure teams to align with mutual expectations. Initially, we identified the most effective methods for converting product data into numerical ‘product vectors,’ making them computationally comparable. We then applied appropriate clustering methods to group these vectors into logically coherent categories. This set the stage for the use of similarity detection algorithms in multidimensional product-category mapping. The final phase rigorously tested the solution’s scalability and its readiness for seamless integration into existing production systems.
This project demonstrated the substantial impact machine learning can have on improving data quality and management in e-commerce. Our prototype provides a systematic approach to tackling persistent, critical issues in the online retail space. Rather than being a mere quick fix, it acts as a catalyst for a broader, strategic approach to digital transformation. It encourages business leaders to move from labor-intensive, messy manual methods to human-machine augmented, data-driven technologies. In doing so, it addresses current challenges while laying the groundwork for further growth.
- Data Vectorization
We started with the transformation of messy product data into a machine-readable format, which we termed “product vectors.” Utilizing standard techniques in machine learning, we created a data representation that encapsulated the critical attributes of each product, making it possible to apply clustering algorithms effectively.
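To make the idea of a product vector more concrete, here is a minimal sketch. It assumes a plain bag-of-words representation built from the product title and attribute values; the field names (`title`, `attributes`), the helper names, and the sample products are hypothetical, and the actual prototype may well have used a different representation.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def product_text(product):
    """Join title and attribute values into one string (field names are assumed)."""
    values = [product.get("title", "")] + list(product.get("attributes", {}).values())
    return " ".join(str(v) for v in values)

def build_vocabulary(products):
    """Map every token seen in the catalog to a fixed vector position."""
    tokens = sorted({tok for p in products for tok in tokenize(product_text(p))})
    return {tok: i for i, tok in enumerate(tokens)}

def product_vector(product, vocabulary):
    """Count tokens per vocabulary slot, then scale the vector to unit length."""
    vec = [0.0] * len(vocabulary)
    for tok in tokenize(product_text(product)):
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(a, b):
    """For unit-length vectors, cosine similarity is simply the dot product."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical sample data, for illustration only.
catalog = [
    {"title": "Red running shoes", "attributes": {"size": "42"}},
    {"title": "Blue running shoes", "attributes": {"size": "43"}},
    {"title": "Stainless steel chef knife", "attributes": {"length": "20cm"}},
]
vocabulary = build_vocabulary(catalog)
vectors = [product_vector(p, vocabulary) for p in catalog]
shoe_vs_shoe = cosine(vectors[0], vectors[1])
shoe_vs_knife = cosine(vectors[0], vectors[2])
```

In this toy catalog the two shoe listings share tokens and therefore score higher than the shoe and knife pair, which is the property that makes the vectors "computationally comparable."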
- Clustering for Category Refinement
With the vectors in place, we applied unsupervised machine learning techniques to cluster products based on similarities. This step was crucial in creating an entirely new and highly consistent product hierarchy. The clustering algorithms were tailored to the specific category requirements of an e-commerce platform, ensuring that the resulting tree was not only lean but also highly relevant.
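As an illustration of grouping product vectors without labels, the sketch below uses a simple greedy, single-pass clustering with a cosine threshold. This is a stand-in chosen for brevity, not the tailored algorithms used in the project; the function names and the threshold of 0.8 are assumptions, and the result is a flat grouping rather than a full hierarchy.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_products(vectors, threshold=0.8):
    """Greedy single-pass clustering; returns lists of indices into `vectors`."""
    clusters = []   # member indices per cluster
    centroids = []  # mean vector per cluster
    for idx, vec in enumerate(vectors):
        scores = [cosine(vec, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            # Close enough to an existing cluster: join it and update its centroid.
            clusters[best].append(idx)
            centroids[best] = mean_vector([vectors[i] for i in clusters[best]])
        else:
            # Nothing similar enough yet: start a new candidate category.
            clusters.append([idx])
            centroids.append(list(vec))
    return clusters

# Hypothetical three-dimensional product vectors, for illustration only.
sample_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.95, 0.05]]
sample_clusters = cluster_products(sample_vectors)
```

Each resulting cluster is a candidate category. One way to obtain deeper levels of a tree would be to repeat the grouping inside each cluster with a stricter threshold, although that is only one of several possible designs.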
- Duplicate Leaf Node Identification
We developed a unique algorithmic approach to identify near-duplicate leaf nodes in category trees. This technique helped clear up heavy category overlap and inconsistencies, allowing for faster product listings, which is especially beneficial for third-party data providers.
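The project's own algorithm is not described here, so the following is only a rough sketch of one possible way to flag near-duplicate leaf nodes: build a token profile from the product titles under each leaf and compare profiles pairwise. The category paths, sample titles, function names, and the 0.7 threshold are all hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations

def leaf_profile(titles):
    """Token counts over all product titles listed under one leaf category."""
    return Counter(tok for t in titles for tok in re.findall(r"[a-z0-9]+", t.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count profiles."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def near_duplicate_leaves(leaves, threshold=0.7):
    """Return (path_a, path_b, score) for leaf pairs whose products look alike."""
    profiles = {path: leaf_profile(titles) for path, titles in leaves.items()}
    pairs = []
    for left, right in combinations(sorted(profiles), 2):
        score = cosine(profiles[left], profiles[right])
        if score >= threshold:
            pairs.append((left, right, round(score, 3)))
    return pairs

# Hypothetical leaf nodes from different branches of a category tree.
leaves = {
    "Electronics > Phones > Smartphones": ["Apple iPhone 13 128GB", "Samsung Galaxy S21 smartphone"],
    "Mobile > Cell Phones": ["Apple iPhone 13 256GB", "Samsung Galaxy S21 5G smartphone"],
    "Home > Kitchen > Knives": ["Chef knife stainless steel", "Bread knife"],
}
duplicate_pairs = near_duplicate_leaves(leaves)
```

Here the two phone categories are flagged as merge candidates even though they sit in different branches and carry different names, while the knife category is left alone.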
- Product-Category Matching
To address the challenge of correctly assigning products to categories, we leveraged the product vectors. Our system presented either the best matching leaf node for confirmation or automatically slotted the item into the optimal category, without requiring manual intervention.
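The two modes described above (suggest for confirmation, or assign automatically) can be sketched as follows. The sketch assumes that each leaf node is represented by a centroid vector and that a confidence threshold decides between the two modes; the names `match_category` and `auto_threshold`, the value 0.9, and the sample centroids are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_category(product_vec, centroids, auto_threshold=0.9):
    """Pick the most similar leaf node and decide how to handle the match.

    Returns (leaf_name, score, mode), where mode is "auto" for a confident
    match and "confirm" when a human should review the suggestion.
    """
    scored = [(name, cosine(product_vec, c)) for name, c in centroids.items()]
    best_name, best_score = max(scored, key=lambda pair: pair[1])
    mode = "auto" if best_score >= auto_threshold else "confirm"
    return best_name, best_score, mode

# Hypothetical two-dimensional centroids, for illustration only.
leaf_centroids = {"Shoes": [1.0, 0.0], "Knives": [0.0, 1.0]}
confident = match_category([0.99, 0.05], leaf_centroids)
uncertain = match_category([0.7, 0.6], leaf_centroids)
```

Raising `auto_threshold` shifts more items into the review queue, while lowering it automates more assignments at the cost of more misclassifications.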
- Product Enrichment
During the initial restructuring of the product catalog, our prototype augmented product details with available attributes or filters from its matching category. This process was automated and data-driven, contributing to higher quality listings and better internal linking.
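One simple reading of this enrichment step is sketched below: the matching category supplies a set of filters with known values, and any value found in the product title fills a missing attribute. The data layout, the function name `enrich_product`, and the sample filters are assumptions made for illustration; the prototype's actual logic is not documented here.

```python
import re

def enrich_product(product, category_filters):
    """Return a copy of `product` with missing attributes filled from category filters.

    Only single-token filter values are matched in this sketch.
    """
    tokens = set(re.findall(r"[a-z0-9]+", product["title"].lower()))
    attributes = dict(product.get("attributes", {}))
    for name, allowed_values in category_filters.items():
        if name in attributes:
            continue  # never overwrite data that is already present
        hits = [v for v in allowed_values if v.lower() in tokens]
        if len(hits) == 1:  # skip missing or ambiguous matches
            attributes[name] = hits[0]
    return {**product, "attributes": attributes}

# Hypothetical filters of a "Shoes" leaf category and a sparse listing.
shoe_filters = {"color": ["red", "blue", "black"], "size": ["42", "43"], "material": ["leather", "mesh"]}
listing = {"title": "Red running shoes 42", "attributes": {"size": "42"}}
enriched = enrich_product(listing, shoe_filters)
```

The listing gains a `color` attribute from its title, keeps its existing `size`, and gets no `material` because nothing in the title supports one.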
- Data Deduplication
We tackled the issue of similar or duplicate items within a product pool by employing a deduplication algorithm. It worked on both structured data, typical of large online shops, and unstructured data repositories, which are more common in merchant-driven marketplaces.
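As a rough sketch of deduplication on unstructured data, the example below groups product titles whose token sets overlap strongly, using Jaccard similarity and a small union-find structure. The technique, the 0.8 threshold, and the sample titles are assumptions chosen for illustration rather than a description of the algorithm used in the project.

```python
import re
from itertools import combinations

def token_set(title):
    """Lowercased alphanumeric tokens of a title, ignoring order and punctuation."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def duplicate_groups(titles, threshold=0.8):
    """Group indices of titles that are pairwise or transitively similar."""
    sets = [token_set(t) for t in titles]
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(titles)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merchant-supplied titles, for illustration only.
titles = [
    "Apple iPhone 13 128GB Black",
    "apple iphone 13 (128gb) - black",
    "Apple iPhone 13 128GB Black Smartphone",
    "Chef knife 20cm",
]
groups = duplicate_groups(titles)
```

Comparing every pair of items grows quadratically with catalog size, so at the scale of millions of products a step that narrows down candidate pairs first (for example, comparing only items within the same category) would be needed.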
- What Is Also Possible With Our Technology
Our current implementation also lays the groundwork for additional applications, such as automated product image matching using vectorized data, enrichment of sparse product taxonomies with external content-rich data, and SEO and conversion optimization through semi-structured data connections within weakly structured product catalogs.
Whether it’s e-commerce or any other data-driven business, modern challenges require modern solutions. Machine learning-based approaches, like the prototype we developed, can be invaluable components in your arsenal for digital transformation.