How to get rid of duplicate products in your catalog once and for all?
Larisa Shishkova
Copywriter Elbuz
Duplicate products in an online store's catalog are a critical issue that impacts conversion, SEO, and user experience. According to statistics, in the average e-commerce catalog, 8% to 25% of products are duplicates, resulting in a loss of up to 15% of potential revenue.
Duplicate products create numerous problems: customers can't determine which product page is relevant, search engines don't know which page to index, and managers waste time managing multiple versions of the same product. In this article, we'll explore a systematic approach to identifying, eliminating, and preventing duplicates in your product catalog.
Why do duplicate products appear in the catalog?
Understanding the causes of duplicates is the first step to eliminating them. The main sources of duplication are:
1. Multiple data sources
When products are loaded from different supplier price lists, Excel files, or warehouse management systems, the same product may appear multiple times under different SKUs or with minor differences in description.
2. Human factor
Managers manually create product cards without checking for duplicates. This is especially common in large teams where several employees work with the catalog simultaneously.
3. Product variations without proper grouping
Products with different colors, sizes, or configurations are created as separate items rather than variants of a single product. For example, "Nike T-shirt red M" and "Nike T-shirt blue L" should be variants, not separate products.
4. Changes in article numbers by suppliers
Manufacturers and distributors periodically change their SKU system. The old SKU remains in the catalog, while the new one is imported as a separate product.
5. Technical failures during import
Errors in synchronization logic cause the system to miss an existing product and create a duplicate instead of updating the existing card.
Types of Duplicates: How to Classify the Problem
Full duplicates
Identical products with the same article number, name, and specifications. These typically occur due to technical issues during import.
Partial duplicates
Products with minor differences in name or description but the same functionality. For example: "iPhone 15 Pro 256GB" and "Apple iPhone 15 Pro 256GB Smartphone."
Semantic duplicates
Products that are essentially the same but described differently. Identifying them requires intelligent analysis.
Cross-variant duplicates
Products that should be grouped together as variants (color, size, volume), but exist as separate items.
Algorithms and methods for finding duplicates
Exact match search
The simplest method is to search for products with identical key fields:
- By SKU: identifies complete duplicates when one SKU appears several times
- By EAN/UPC/GTIN: International product identifiers, especially important for products from well-known brands
- By name: search for products with absolutely identical names
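As a minimal sketch of exact-match search, the snippet below groups product records by one key field and keeps only the values shared by two or more products. The record layout (dicts with `id`, `sku`, `ean`, `name`) and the function name `find_exact_duplicates` are assumptions made for illustration, not part of any particular platform.

```python
from collections import defaultdict

def find_exact_duplicates(products, field):
    """Return {value: [product ids]} for every value of `field` shared by 2+ products."""
    groups = defaultdict(list)
    for product in products:
        value = (product.get(field) or "").strip()
        if value:  # skip empty keys so products without an EAN are not grouped together
            groups[value].append(product["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}

# Hypothetical sample records
products = [
    {"id": 1, "sku": "A-100", "ean": "5901234123457", "name": "Phone X"},
    {"id": 2, "sku": "A-100", "ean": "5901234123457", "name": "Phone X"},
    {"id": 3, "sku": "B-200", "ean": "", "name": "Phone Y"},
    {"id": 4, "sku": "C-300", "ean": None, "name": "Phone Y"},
]

for field in ("sku", "ean", "name"):
    print(field, find_exact_duplicates(products, field))
```

Empty and missing values are skipped on purpose: otherwise every product without an EAN would be reported as a duplicate of every other.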
Fuzzy matching
Algorithms for calculating the degree of similarity of strings:
- Levenshtein distance: calculating the minimum number of operations to transform one string into another
- Jaro-Winkler algorithm: takes into account the matches of characters and their positions, effective for short names
- N-grams: splitting text into sequences of N characters for comparison
Example of application
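The products "Samsung Galaxy S24 Ultra 256GB Black" and "Samsung Galaxy S24 Ultra 256 GB Black" differ by a single inserted space, so their Levenshtein distance is 1. With 37 characters in the longer name, that is a similarity of about 97%, well above an 80% threshold, so they are flagged as duplicates.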
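The Levenshtein approach can be sketched in a few lines of plain Python. This is an illustrative implementation, not the one used by any specific product; the similarity formula (1 minus distance divided by the longer length) and the default 0.8 threshold are assumptions chosen to match the 80% threshold mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest

def is_probable_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Case-insensitive comparison against a similarity threshold."""
    return similarity(a.lower(), b.lower()) >= threshold

print(is_probable_duplicate(
    "Samsung Galaxy S24 Ultra 256GB Black",
    "Samsung Galaxy S24 Ultra 256 GB Black",
))
```

Character-level distance has a known weak spot: "iPhone 15 Pro 256GB" and "Apple iPhone 15 Pro 256GB Smartphone" are 17 insertions apart, a similarity of roughly 0.53, so they fall below the threshold even though they describe the same product. Names that differ by whole added words are better handled by token-based or attribute-based comparison.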
Search by combination of attributes
Creating a unique product fingerprint based on a combination of characteristics:
- Brand + model + key characteristics
- Category + manufacturer + main parameters
- Hashing normalized data
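One possible way to build such a fingerprint is to normalize the chosen fields and hash the result. In the sketch below, the choice of fields, the normalization rules, and the function names `normalize` and `fingerprint` are all assumptions for illustration; a real catalog would tune them per category.

```python
import hashlib
import re

def normalize(value: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    value = re.sub(r"[^\w\s]", " ", value.lower())
    return " ".join(value.split())

def fingerprint(brand: str, model: str, attributes: dict) -> str:
    """SHA-256 over normalized brand, model, and sorted attribute pairs."""
    pairs = sorted((normalize(k), normalize(str(v))) for k, v in attributes.items())
    parts = [normalize(brand), normalize(model)] + [f"{k}={v}" for k, v in pairs]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

a = fingerprint("Samsung", "Galaxy S24 Ultra", {"storage": "256GB", "color": "Black"})
b = fingerprint(" SAMSUNG ", "galaxy  s24-ultra", {"Color": "black", "Storage": "256gb"})
print(a == b)  # same product, different spelling of the same fields
```

Products with equal fingerprints can then be found with an ordinary exact-match grouping. The hash is only as good as the normalization: with these rules "256GB" and "256 GB" still produce different fingerprints, so unit formats need their own normalization step.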
Machine learning for duplicate detection
The modern approach uses ML algorithms for intelligent recognition:
- Vectorization of descriptions: Converting text to numeric vectors using word2vec or BERT
- Clustering: grouping similar products for visual analysis
- Training on labeled data: creation of a model based on expert labeling of duplicates
Strategies for merging duplicates
Definition of the Master Record
Criteria for selecting the main version of a product:
- Oldest card: saves sales history and SEO weight
- The most complete: contains a maximum of completed characteristics and high-quality photographs
- With the best indicators: More views, reviews, conversions
- With the correct URL: meets SEO requirements and contains keywords
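The criteria above can be combined into a simple ranking. The sketch below is one hypothetical ordering (completeness first, then number of reviews, then age); the field names and the priority of the criteria are assumptions, and a store that values SEO history more might put age or traffic first.

```python
from datetime import date

TRACKED_FIELDS = ("description", "images", "ean", "brand")  # hypothetical field set

def completeness(card: dict) -> int:
    """Count how many tracked fields are filled in."""
    return sum(1 for field in TRACKED_FIELDS if card.get(field))

def choose_master(cards: list) -> dict:
    """Pick the most complete card; break ties by review count, then by age."""
    return max(
        cards,
        key=lambda card: (
            completeness(card),
            card.get("reviews", 0),
            -card["created"].toordinal(),  # the older card wins the final tie-break
        ),
    )

cards = [
    {"id": 1, "description": "Short", "images": [], "ean": "", "brand": "Nike",
     "reviews": 12, "created": date(2021, 3, 1)},
    {"id": 2, "description": "Full text", "images": ["a.jpg"], "ean": "5901234123457",
     "brand": "Nike", "reviews": 3, "created": date(2023, 6, 15)},
    {"id": 3, "description": "Full text", "images": ["b.jpg"], "ean": "5901234123457",
     "brand": "Nike", "reviews": 3, "created": date(2022, 1, 10)},
]
print(choose_master(cards)["id"])
```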
Data fusion methods
1. Complete merger
All duplicates are deleted and their data is transferred to the main card:
- Combining descriptions (selecting the most complete one)
- Image consolidation
- Transfer all reviews and ratings
- Summing stock balances
- Order history is redirected to the main card
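A complete merge along these lines might look like the following sketch. The card structure and the function name `merge_into_master` are hypothetical, and "most complete description" is approximated here by the longest one, which is a simplifying assumption.

```python
def merge_into_master(master: dict, duplicates: list) -> dict:
    """Return a new card that combines the master with data from its duplicates."""
    merged = dict(master)
    merged["images"] = list(master.get("images", []))
    merged["reviews"] = list(master.get("reviews", []))
    merged["stock"] = master.get("stock", 0)
    merged["merged_ids"] = []  # kept so orders and redirects can be remapped later
    for dup in duplicates:
        # Longest description is used as a rough proxy for "most complete".
        if len(dup.get("description", "")) > len(merged.get("description", "")):
            merged["description"] = dup["description"]
        for image in dup.get("images", []):
            if image not in merged["images"]:  # consolidate without repeating images
                merged["images"].append(image)
        merged["reviews"].extend(dup.get("reviews", []))
        merged["stock"] += dup.get("stock", 0)
        merged["merged_ids"].append(dup["id"])
    return merged

master = {"id": 1, "description": "Jacket", "images": ["a.jpg"], "reviews": ["Good"], "stock": 3}
duplicate = {"id": 2, "description": "Warm winter jacket", "images": ["a.jpg", "b.jpg"],
             "reviews": ["Fits well"], "stock": 2}
print(merge_into_master(master, [duplicate]))
```

Summing stock is only correct when the duplicates track separate physical stock. If both cards mirror the same warehouse balance, the master's figure should be kept as-is.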
2. Creating product variants
When duplicates represent different modifications:
- Definition of the base product
- Converting duplicates into variants (color, size, configuration)
- Setting up a matrix of options with individual prices and stock levels
3. Setting up redirects
Critical for SEO:
- 301 redirect from all deleted cards to the main page
- Updating internal links in the catalog
- Redirecting external links
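When cards are merged in several rounds, redirects can end up chained (A to B, then B to C). The sketch below builds a flat redirect map so that each deleted URL points straight at its final target. Collapsing chains is an extra precaution assumed here, not a step described above, and the function name and URL format are hypothetical.

```python
def build_redirect_map(merges: dict) -> dict:
    """merges maps each deleted URL to the URL it was merged into.

    Returns a map where every deleted URL points at its final target,
    so a chain like A -> B -> C becomes A -> C and B -> C.
    """
    resolved = {}
    for old_url, target in merges.items():
        seen = {old_url}
        while target in merges:  # the target was itself deleted later
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = merges[target]
        resolved[old_url] = target
    return resolved

merges = {
    "/product/jacket-red-m": "/product/jacket-red",
    "/product/jacket-red": "/product/jacket",
}
for source, destination in build_redirect_map(merges).items():
    print(f"301 {source} -> {destination}")
```

The resulting pairs can be exported in whatever format the web server or platform expects for permanent (301) redirects.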
Automatic Deduplication: Tools and Technologies
Built-in mechanisms of e-commerce platforms
- Shopify: Bulk Editor for bulk editing, an app for combining products
- WooCommerce: Product Merger and Bulk Edit Products plugins
- Magento: automatic deduplication and attribute merging modules
- OpenCart: extensions for finding and removing duplicates
Specialized PIM systems
Product Information Management systems with advanced capabilities:
- Akeneo: deduplication rules, automatic data enrichment
- Pimcore: fuzzy search algorithms, ML models for identifying duplicates
- Salsify: intelligent fusion of data from multiple sources
Elbuz Deduplication Solution
The Elbuz platform offers a comprehensive approach to automatic processing of duplicates:
- Automatic detection of duplicates when importing price lists
- Customizable matching rules (by article number, EAN, name, field combination)
- Intelligent data merging with priority field selection
- Preview changes before applying them
- Audit log of all merge operations
- API for integration with external systems
Case Study: Online Electronics Store
A company with a catalog of 50,000 products detected 8,500 duplicates after migrating from another platform. Using Elbuz allowed them to:
- Automatically detect 6,200 complete duplicates by EAN
- Find 1,800 partial duplicates using fuzzy search
- Combine cards while preserving all reviews and sales history
- Set up 301 redirects for 8,500 URLs
- Result: 12% increase in conversion, improved search rankings, saving 40 hours of manual work per month
SQL queries for finding duplicates
For technical specialists, here are some examples of database queries:
-- Find duplicates by SKU
SELECT sku, COUNT(*) AS duplicate_count
FROM products
GROUP BY sku
HAVING COUNT(*) > 1;

-- Find similar names (requires the pg_trgm extension for PostgreSQL)
SELECT p1.id AS id_1, p1.name AS name_1,
       p2.id AS id_2, p2.name AS name_2,
       similarity(p1.name, p2.name) AS similarity_score
FROM products p1
JOIN products p2 ON p1.id < p2.id
WHERE similarity(p1.name, p2.name) > 0.8;

-- Find products that share an EAN
SELECT ean, COUNT(*) AS duplicate_count,
       STRING_AGG(name, ' | ') AS product_names
FROM products
WHERE ean IS NOT NULL AND ean <> ''
GROUP BY ean
HAVING COUNT(*) > 1;

SEO consequences of duplicates and how to eliminate them correctly
Negative impact on search engine optimization
- Keyword cannibalization: several pages compete for the same queries, lowering the rankings of all of them
- Link weight dispersal: External links lead to different duplicates, reducing the authority of each page
- Indexing issues: Search engines don't know which version to show in search results
- Wasted crawl budget: crawlers spend time scanning duplicates instead of unique content
- Duplicate content filters: In extreme cases, the site may be subject to sanctions.
The Right Elimination Strategy for SEO
1. Audit of the current state
- Analyzing indexing in Google Search Console
- Checking for duplicates via site: operator
- Identifying Page Cannibalization in Ahrefs/Semrush
2. Page prioritization
- Selecting the main card based on traffic and positions
- Counting the number of external links
- Indexing history analysis
3. Technical implementation
- 301 redirect: mandatory for all deleted pages
- Updating sitemap.xml: removing old URLs
- Canonical tag: if you temporarily need to save multiple versions
- Updating internal linking
4. Post-cleanup monitoring
- Monitoring reindexing in Search Console
- Checking the correctness of redirects
- Monitoring position changes
- Traffic dynamics analysis
Duplicate prevention system
Organizational measures
1. Rules for working with the catalog
- Clear instructions for managers on how to check whether a product already exists before creating it
- Mandatory search by article number and name
- Appointment of a data quality officer
- Regular catalog audits
2. Personnel training
- Product Naming Rules
- Using variants instead of creating separate items
- Working with identifiers (EAN, UPC)
- Checking import results
Technical solutions
1. Validation when creating a product
- Automatic verification of article uniqueness
- Similar name warning
- Search by EAN in the database before saving
- Suggesting existing products when a match is found
2. Data import rules
- Clear definition of matching keys (SKU, EAN, supplier article number)
- Update-only mode for existing products
- Logging of all created items
- Quarantine for new goods with manual inspection
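The import rules above can be sketched as a small matching routine. Everything here is illustrative: the row layout, the order of matching keys (EAN first, then SKU), and the function name `import_rows` are assumptions rather than the behavior of any specific import tool.

```python
def import_rows(catalog: list, rows: list, update_only: bool = False) -> dict:
    """Update matched products in place; never create a card directly.

    Unmatched rows are quarantined for manual review, or skipped
    entirely when update_only is set. Returns a log of what happened.
    """
    by_ean = {p["ean"]: p for p in catalog if p.get("ean")}
    by_sku = {p["sku"]: p for p in catalog if p.get("sku")}
    log = {"updated": [], "quarantined": [], "skipped": []}
    for row in rows:
        existing = by_ean.get(row.get("ean"))   # primary matching key
        if existing is None:
            existing = by_sku.get(row.get("sku"))  # fallback matching key
        if existing is not None:
            existing.update(price=row["price"], stock=row["stock"])
            log["updated"].append(existing["id"])
        elif update_only:
            log["skipped"].append(row)
        else:
            log["quarantined"].append(row)
    return log

catalog = [{"id": 1, "sku": "A-100", "ean": "5901234123457", "price": 10.0, "stock": 1}]
rows = [
    {"sku": "X-1", "ean": "5901234123457", "price": 9.5, "stock": 4},   # matches by EAN
    {"sku": "NEW-1", "ean": "4006381333931", "price": 5.0, "stock": 7},  # no match
]
print(import_rows(catalog, rows))
```

Keeping unmatched rows in a quarantine list instead of creating cards immediately is what gives a reviewer the chance to catch a duplicate before it reaches the storefront.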
3. Regular automatic checking
- Weekly catalog scanning for duplicates
- Suspicious Product Reports
- Data quality metrics dashboard
- Alerts when the duplicate threshold is exceeded
Using Master Data Management (MDM)
Creating a single source of truth for product information:
- Centralized catalog: all products in one system with unique identifiers
- Enrichment rules: automatic data supplementation from verified sources
- Approval workflow: moderation of new products before publication
- API integration: all systems receive data from a single source
Case Study: Fashion Marketplace
A platform with 300 suppliers has implemented a duplicate prevention system:
- Mandatory indication of EAN for all products
- Automatic check on load: an item with an existing EAN is updated instead of being created anew
- Weekly report to suppliers on duplicates in their catalogues
- Penalty for exceeding the duplicate threshold (5% of the supplier's catalog)
- Result: Reducing duplicates from 18% to 2% in 6 months, improving user experience
International identifiers: EAN, UPC, GTIN
Using standardized codes is the most reliable way to prevent duplicates.
Types of identifiers
- EAN-13: European standard, 13 digits (e.g. 5901234123457)
- UPC-A: North American standard, 12 digits (e.g. 012345678905)
- GTIN: Global identifier, includes EAN and UPC
- ISBN: For books, 10 or 13 digits
- MPN: Manufacturer Part Number
Benefits of using
- Globally unique product identification
- Simplifying integration with marketplaces (Ozon, Wildberries, Amazon require EAN/UPC)
- Automatic enrichment of data from external databases
- Accurate comparison of products from different suppliers
- Improving the quality of shopping feeds for Google Shopping
Implementation into the catalog management process
- Audit of current data: checking which existing products have an EAN
- Enrichment: obtaining EAN from suppliers or from open databases (GS1, UPC Database)
- Validation: checking the correctness of codes (checksum, format)
- Setting up import rules: EAN as a primary matching key
- Quality monitoring: percentage of goods with EAN, reports of incorrect codes
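The validation step (format and checksum) can be illustrated with the standard GTIN check-digit rule: weights of 3 and 1 alternate starting from the digit next to the check digit, and the check digit brings the weighted sum up to a multiple of 10. The function name `is_valid_gtin` is hypothetical. The check only confirms that a code is well-formed, not that it has actually been assigned to a product.

```python
def is_valid_gtin(code: str) -> bool:
    """Validate length and check digit of a GTIN-8, UPC-A (12), EAN-13, or GTIN-14."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in code]
    body, check_digit = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check_digit

for code in ("5901234123457", "012345678905", "5901234123458", "59012341234"):
    print(code, is_valid_gtin(code))
```

The first two codes are the EAN-13 and UPC-A examples given earlier and both pass. The third has a wrong check digit and the fourth has an unsupported length, so both are rejected.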
Practical examples and solutions
Example 1: Duplicates due to different supplier SKUs
Situation: An online electronics store works with five distributors. The same product arrives with a different part number from each of them.
Solution:
- Transition to EAN as the primary identifier
- Creating a mapping table: EAN → article numbers of all suppliers
- Import settings: search for products by EAN, update prices from all suppliers
- Storing supplier part numbers in a separate field for working with orders
Result: Elimination of 4,200 duplicates, automatic selection of the best price among suppliers.
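A mapping table of this kind can be sketched as a nested dictionary keyed by EAN. The supplier names, row layout, and function names below are hypothetical; the point is that each supplier's part number is stored per EAN instead of creating a new product.

```python
from collections import defaultdict

def build_offers(price_lists: dict) -> dict:
    """price_lists: {supplier: [rows]} -> {ean: {supplier: {"sku": ..., "price": ...}}}."""
    offers = defaultdict(dict)
    for supplier, rows in price_lists.items():
        for row in rows:
            offers[row["ean"]][supplier] = {"sku": row["sku"], "price": row["price"]}
    return dict(offers)

def best_offer(offers_for_ean: dict) -> tuple:
    """Return (supplier, offer) with the lowest price for one EAN."""
    supplier = min(offers_for_ean, key=lambda name: offers_for_ean[name]["price"])
    return supplier, offers_for_ean[supplier]

price_lists = {
    "supplier_a": [{"ean": "5901234123457", "sku": "SA-778", "price": 1049.0}],
    "supplier_b": [{"ean": "5901234123457", "sku": "B/44-12", "price": 1019.0}],
}
offers = build_offers(price_lists)
print(best_offer(offers["5901234123457"]))
```

The supplier's own part number stays available in the offer, which is what the ordering process needs when the purchase is placed.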
Example 2: Duplicates due to product modifications
Situation: A clothing store created separate product cards for each size and color. For one jacket model, there were 30 separate items.
Solution:
- Analysis of products with similar names
- Group by model (extract base name without color/size)
- Creating master cards for each model
- Converting individual products into variants with a size×color matrix
- Setting up 301 redirects from old URLs
Result: Reducing the catalog from 15,000 to 3,500 products, improving navigation, and increasing conversion by 18%.
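The grouping step (extracting a base name without color and size) can be sketched with a simple token filter. The color and size vocabularies below are small hypothetical samples, and the approach assumes those words appear as separate tokens in the name; real catalogs usually need a larger dictionary and manual review of edge cases.

```python
from collections import defaultdict

COLORS = {"red", "blue", "black", "white", "green"}  # hypothetical vocabulary
SIZES = {"xs", "s", "m", "l", "xl", "xxl"}           # hypothetical vocabulary

def split_variant(name: str) -> tuple:
    """Return (base_name, color, size) by removing known color and size tokens."""
    base, color, size = [], None, None
    for token in name.split():
        lowered = token.lower()
        if lowered in COLORS and color is None:
            color = lowered
        elif lowered in SIZES and size is None:
            size = lowered.upper()
        else:
            base.append(token)
    return " ".join(base), color, size

def group_variants(names: list) -> dict:
    """Group product names by base name; each entry keeps its color and size."""
    groups = defaultdict(list)
    for name in names:
        base, color, size = split_variant(name)
        groups[base.lower()].append({"color": color, "size": size, "source": name})
    return dict(groups)

print(group_variants(["Nike T-shirt red M", "Nike T-shirt blue L"]))
```

Both names collapse into one group, "nike t-shirt", with two variants. Single-letter sizes are a risk: a token such as "S" or "M" can also be part of a model name, which is one reason a manual check of the resulting groups is worth keeping.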
Example 3: Duplicates after platform migration
Situation: After migrating from a custom CMS to Shopify, products were imported twice due to different IDs.
Solution:
- Export all products to CSV with the old system ID
- Creating an intermediate mapping table: old ID → new ID
- SQL script for finding duplicates by name and key characteristics
- Manual verification of 200 questionable cases
- Bulk removal of duplicates with data transfer to main cards
- Remapping old IDs in the order history to the new ones
Result: Cleaning 7,800 duplicates, saving the entire sales history and reviews.
Conclusion: A Systems Approach to Deduplication
Combating duplicate products isn't a one-time measure, but an ongoing data quality management process. An effective strategy includes three components:
- Detection: regular automated search using exact and fuzzy matching algorithms
- Elimination: Proper data aggregation, taking into account SEO implications, sales history, and user experience
- Prevention: implementation of technical and organizational measures to prevent the emergence of new duplicates
Investments in product data quality pay off with increased conversion, improved search rankings, and reduced catalog management operating costs. Modern tools, such as the Elbuz platform, allow you to automate most of the deduplication work.
Start with an audit of your current catalog, implement basic preventative measures, and gradually scale up your automation. A clean, duplicate-free catalog is the foundation of successful e-commerce.
Next steps
Learn more about comprehensive product data management in our import and synchronization guide.