How to set up a parser for any online store site - getting a catalog of goods with prices, descriptions and photos
How the e-commerce parser Elbuz works
Every online store page is written in HTML, the standardized page markup language of the World Wide Web, so all sites build their blocks from the same set of elements. The Elbuz parser relies on this standard to receive data from the online store site.
The HTML tags most commonly found on the pages of online stores:
- div tag. A universal block element that groups a section of content on the site. It could be a list of products.
- a tag. Displays a link to a page. These can be links to products in a particular category.
- h1 tag. Displays a first-level heading (there are also h2, h3, h4, h5, h6). It may be the name of the product.
- p tag. Displays a text paragraph. It could be a product description.
- table tag. Displays a table. It can be a product attribute table.
- ul tag. Displays a bulleted list. It can be a short description of the item.
- img tag. Displays an image on the page. These may be product photos.
Tags can carry the name of a style that controls how the information looks on the site; for example, a block style can make the text of any element bold or green. Because this markup is standardized, you can configure the Elbuz parser for any online store and get the information you need. The parser uses CSS selectors (based on the site's design styles) or XPath (a query language for site elements) to receive data.
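To illustrate how a CSS selector and an XPath query can point at the same elements, here is a minimal sketch using only the Python standard library. The page fragment, class names and links are hypothetical, and this is not how Elbuz itself is implemented; `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (real pages usually need an HTML parser).
PAGE = """
<html><body>
  <div class="links-list">
    <a class="link" href="/phones">Phones</a>
    <a class="link" href="/laptops">Laptops</a>
  </div>
  <a class="footer" href="/about">About</a>
</body></html>
"""

root = ET.fromstring(PAGE)

# CSS selector:  div.links-list a.link
# XPath with the same meaning for this fragment. Note that @class='...' is an
# exact string match, while a CSS class selector also matches multi-class values.
xpath = ".//div[@class='links-list']//a[@class='link']"
hrefs = [a.get("href") for a in root.findall(xpath)]
print(hrefs)  # ['/phones', '/laptops']
```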
To get started, install the extension for the Google Chrome browser by following this link. Parsing is possible only in the Google Chrome browser. If the Chrome Store link doesn't work, install the extension manually. Alternatively, you can work without the browser extension by activating the server parsing function.
Creating a new parser
To configure the parser, follow this order of operations:
- Open Price Lists.
- Click the Add counterparty button.
- Specify the name of the site.
- Select the Parser tab.
- Click the Add Site button.
- Provide a link to the main page of the site.
- Specify tag selectors.
To add a new site parser, open the "Price lists" window, click the "+" button and select "Add counterparty".
Specify the name of the counterparty (online store) and select a group. The following groups are available by default: Supplier, Competitor, Client.
After you add the counterparty, you will be prompted to choose where to download data from. In this list, select the "Website Parser" item and click the "Add site" button.
Specify the address of the online store to parse.
Setting up a parser to receive data from an online store
The parser downloads products from the online store in this order:
- Get links to product categories
- Get product links
- Get product cards and save the necessary information
After you add the online store parser, the settings window will open.
The setup table contains the operation types and, for each of them, the list of fields that store the data. Operation types are the stages the parser goes through to get data from the site.
For example, to get a list of products from the site, the parser first needs links to the product categories, so that it can then open pages and get information on each product. The first operation the parser uses is therefore "List of links to product categories".
Operation types:
- List of links to product categories. Used to get links to product categories.
- List of links to products. Used to get links to products.
- Product card. Used to get product information. When performing this operation, you can get the product name, manufacturer's SKU, model, warranty, manufacturer's name, photos, video reviews and other information from the site.
- Product attributes. Used to get product attributes.
Description of the grid columns for setting up the parser
- Operation selector. A flag that marks the main selector used to receive data from the site for this operation.
- Field name. The name of the operation or of the field in which the data is stored.
- Selector No. 1-4. The Elbuz parser uses CSS selectors (site styles) or XPath (query language for site elements) to receive data from site pages. The selector fields specify the conditions for finding the blocks you need on the site and getting information from them.
- Link for testing. A link to a site page used to test data retrieval. Each operation gets a link to its own section of the site. For example, for the "List of links to product categories" operation, specify a link to the main page of the site, where all product categories are listed. To test retrieval of product attributes for the "Product card" operation, specify a link to a product.
- Text to clean up. Keywords to remove from the received data. For example, if the product name on the site contains extra text that you do not want to receive, enter that text in the "Text to clean up" field to remove it.
- Find. The text to search for.
- Replace. The text that replaces the found text.
- Receive HTML. Set this flag if you need to preserve the HTML formatting of the text received from the site page.
- Regular expression. A regular expression applied to the text received through the selector, so that you can split the string into its components and keep only the part you need.
- XPath. Switches the selector to XPath query language mode.
- The maximum number of results. Limits how much data is downloaded, which is useful for download testing. Instead of waiting for the entire site to download, you can set the number of results for each operation, for example, 1 link to a category and 2 links to products.
- Products in this operation. You can receive products without opening product cards on the site. This mode is useful if you want to get only prices and other values that are available in the product listing of a category.
- Note. A note for the settings row; for example, a reminder of what the setting means.
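The post-processing columns above ("Text to clean up", "Find", "Replace", "Regular expression", "The maximum number of results") can be pictured as a small pipeline applied to the raw selector results. The sketch below is only an illustration: the `FieldRule` class, the `apply_rule` function, the sample strings and the order in which the steps run are all assumptions, not the documented behavior of Elbuz.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FieldRule:
    """Hypothetical model of one settings row (not the Elbuz storage format)."""
    cleanup: Tuple[str, ...] = ()        # "Text to clean up"
    find: str = ""                       # "Find"
    replace: str = ""                    # "Replace"
    regex: Optional[str] = None          # "Regular expression"
    max_results: Optional[int] = None    # "The maximum number of results"


def apply_rule(values: List[str], rule: FieldRule) -> List[str]:
    """Post-process raw selector results; the order of steps is an assumption."""
    if rule.max_results is not None:
        values = values[: rule.max_results]
    cleaned = []
    for value in values:
        for junk in rule.cleanup:
            value = value.replace(junk, "")
        if rule.find:
            value = value.replace(rule.find, rule.replace)
        if rule.regex:
            match = re.search(rule.regex, value)
            value = match.group(0) if match else ""
        cleaned.append(value.strip())
    return cleaned


names = apply_rule(
    ["Buy Phone X200 online", "Buy Phone X300 online", "Buy Phone X400 online"],
    FieldRule(cleanup=("Buy ", " online"), max_results=2),
)
prices = apply_rule(["Price: 1 299 USD"], FieldRule(find=" ", replace="", regex=r"\d+"))
print(names)   # ['Phone X200', 'Phone X300']
print(prices)  # ['1299']
```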
Stage number 1. Getting a list of links to product categories
To get a list of links to product categories, you need to find the selector of the links that lead to categories. To do this, copy the link from the site (usually the main page of the site) into the "Link for testing" field and click the "T" button.
The "Download testing" tab will open and display the page at the link you specified; product categories should be visible on it. The results of the parser's work are displayed on the left. Your task is to get a list of links to product categories from the site; if the parser is configured successfully, you will see a list of links to categories on the left side of the screen.
Attention! Download testing is possible only for sites that use the secure https protocol; only for such sites can you visually check data retrieval in the "Download testing" tab. You can still configure parsing for sites that use the non-secure http protocol, but the visual check will not work, so all tags and selectors must be entered "blindly".
To search for a link selector for product categories, right-click the name of any category and select "View code", after which a browser window will open with the source code of the site. You can position it as you like, for example, on the left or at the bottom of the screen.
You can also open the link in a separate browser tab if you need more screen space to search for the product link selector and do the same there.
We are looking for blocks of product categories and a link in them
Your task is to find the blocks that contain links to product categories. After you select the "View code" item, the browser opens the source code of the site at the place where you right-clicked. In this example we clicked a category name, and we can see that the links to the categories are located in the "div" and "a" tags.
As you can see, each product category has a "div" block that contains "a" links. The "div" block has the style name links-list (class="links-list"), and each "a" link has the style name link (class="link").
Write the selectors in the parser settings in this form: tag names separated by a space, with each style name attached to its tag by a dot, for example div.links-list a.link. If the style of the "a" tag is unique on the page to links that lead to product categories, you can specify just the "a" tag and its style (the first part, for the "div" block, is then not required).
To check the result, click the "T" button. As you can see in the example, we got 74 links to product categories, so the parser can now find categories on a third-party site.
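For readers who want to see what a selector such as div.links-list a.link does on real (not necessarily well-formed) HTML, here is a rough sketch built on Python's standard `html.parser`. The `CategoryLinkParser` class, the sample markup and the links are hypothetical; this only approximates the descendant matching a CSS engine performs and is not the Elbuz implementation.

```python
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collects href values roughly matching the CSS selector div.links-list a.link."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # one bool per open <div>: does it have class links-list?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self.div_stack.append("links-list" in classes)
        elif tag == "a" and any(self.div_stack) and "link" in classes:
            if attrs.get("href"):
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simplification: assumes <div> tags are properly balanced.
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


# Hypothetical page fragment with two category links and one unrelated link.
PAGE = """
<div class="menu">
  <div class="links-list">
    <a class="link" href="/catalog/phones/">Phones</a>
    <a class="link active" href="/catalog/laptops/">Laptops</a>
  </div>
</div>
<a class="link" href="/about/">About</a>
"""

parser = CategoryLinkParser()
parser.feed(PAGE)
category_links = parser.links
print(category_links)  # ['/catalog/phones/', '/catalog/laptops/']
```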
Stage number 2. Getting a list of product links
To get a list of product links, you need to find the product link selector on the product listing page of a category. To do this, open any product category on the site, copy its link into the "Link for testing" field, then click the "T" button.
The "Download testing" tab will open and display the page at the link you specified; a list of products should be visible on it. The results of the parser's work are displayed on the left. Your task is to get a list of links to products from the site; if the parser is configured successfully, you will see a list of links on the left side of the screen.
To search for a product link selector, right-click on the name of any product and select "View code", after which a browser window will open with the source code of the site.
We are looking for blocks of products and a link in them
Your task is to find product blocks with links to the product card. After you select the "View code" item, the browser opens the source code of the site at the place where you right-clicked. In this example we clicked a product name, and we can see that the product links are located in the "div" and "a" tags.
Each product in the list has a "div" block that contains an "a" link, and every such "div" block has the same style name, tile (class="tile"). We will use this information to get the link to each product.
Write the selectors in the parser settings in this form: the style name preceded by a dot, then the "a" tag separated by a space, for example .tile a
To check the result, click the "T" button. As you can see in the example, we got 28 links to products, so the parser can now find products on a third-party site.
Page navigation setup (pagination)
A product category page usually does not show all of its products. For example, only 28 products may be displayed, with the next products on page No. 2; this mode is called pagination. To get product links from the other pages, find the navigation block used to move between pages (the paginator) and the selector of the links in it. In the example below, the block has this selector:
ul[name="paginator"] li a
Specify the pagination selector you found in the "Selector No. 2" field.
On some sites the pagination links do not contain the address of the current page (the link to the product category), and the pagination may then be resolved incorrectly. For example, when a link contains only the page number, the resulting link leads to the main page of the site.
To solve this problem, the parser needs to know the current page address. Open the source code of the site and look for the address of the current page. If you find it, specify in the "Selector No. 3" field the selector for getting it, for example, from the breadcrumbs block: div.breadcrumbs a.active
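The effect of a pagination link that contains only the page number can be reproduced with standard URL resolution. The sketch below uses Python's `urllib.parse.urljoin`; the addresses are hypothetical, and whether Elbuz resolves links in exactly this way is an assumption.

```python
from urllib.parse import urljoin

# Hypothetical addresses used only for illustration.
main_page = "https://site.com/"
category_page = "https://site.com/catalog/phones/"
pagination_href = "?page=2"  # the link holds only the page number

# Resolved against the main page, the link loses the category.
wrong = urljoin(main_page, pagination_href)
# Resolved against the current page address, the link stays in the category.
right = urljoin(category_page, pagination_href)

print(wrong)  # https://site.com/?page=2
print(right)  # https://site.com/catalog/phones/?page=2
```

This is why the current page address (for example, taken from the breadcrumbs block) is needed as the base for such links.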
Stage 2 can also be used to get just a list of products. This mode is useful when you need only prices, without descriptions, technical characteristics and photos; data is retrieved many times faster because the parser does not have to open the product cards on the site. To activate this mode, set the "Products in this operation" flag for the "List of links to products" operation type, then specify the selectors for the fields to be filled in from the site. In this mode you do not need to fill in the selector for product links, only the one for pagination.
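A rough picture of this listing-only mode: each product block on the category page is read in place, and no product card is opened. The sketch below assumes a hypothetical well-formed listing in which every block has the style name tile and the price sits in a span with the invented style name price.

```python
import xml.etree.ElementTree as ET

# Hypothetical category page fragment; class "price" is an invented name.
LISTING = """
<div class="catalog">
  <div class="tile"><a href="/p/1">Phone X200</a><span class="price">199.00</span></div>
  <div class="tile"><a href="/p/2">Phone X300</a><span class="price">249.50</span></div>
</div>
"""

root = ET.fromstring(LISTING)
products = []
for tile in root.findall(".//div[@class='tile']"):
    link = tile.find("a")                      # product name and link
    price = tile.find("span[@class='price']")  # price shown in the listing
    products.append(
        {"name": link.text, "url": link.get("href"), "price": float(price.text)}
    )

print(products)
```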
Stage number 3. Getting data from the product card.
As with the product link selector, you need to find selectors for the fields you need in the product card. To do this, enter a link to a test product in the "Link for testing" field and open it.
Right-click the product name and select the "View code" item, after which a browser window will open with the source code of the site.
For example, the product name is in the h1 tag
Let's write the selector h1 in the settings table
Next, find the selector for the product price.
Write the selector like this:
div.main-price span.price-number span
Next, find the selector for the product description.
Write the selector like this:
div[itemprop="description"]
For links to photos, write this selector:
div.image img::attr(src)
Checking the result
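To show what the four selectors above would pick out, here is a minimal sketch on a hypothetical, well-formed product card. It uses XPath equivalents in Python's `xml.etree.ElementTree` instead of CSS selectors; the markup, product data and file names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical product card that matches the selectors from the text.
CARD = """
<html><body>
  <h1>Phone X200</h1>
  <div class="main-price"><span class="price-number"><span>199</span></span></div>
  <div itemprop="description">Compact phone with a 6-inch screen.</div>
  <div class="image"><img src="/img/x200-front.jpg"/><img src="/img/x200-back.jpg"/></div>
</body></html>
"""

card = ET.fromstring(CARD)
product = {
    # h1
    "name": card.findtext(".//h1"),
    # div.main-price span.price-number span
    "price": card.findtext(
        ".//div[@class='main-price']//span[@class='price-number']//span"
    ),
    # div[itemprop="description"]
    "description": card.findtext(".//div[@itemprop='description']"),
    # div.image img::attr(src)
    "photos": [img.get("src") for img in card.findall(".//div[@class='image']//img")],
}
print(product)
```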
Stage number 4. Getting product attributes.
To get product attributes, you must specify a selector for the attribute block (table) and a selector for the row that contains the attribute name and value.
Procedure:
- In the "Selector No. 1" field, specify the selector for the attribute block
- In the "Selector No. 2" field, specify the selector for the block that contains the name and value of the attribute (that is, for the row of the attribute table)
- In the "Attribute name" field, specify the selector where the attribute name is located
- In the "Attributes value" field, specify the selector where the attribute value is located
Setting example
An example of customization based on the source code of the site
The result of checking the receipt of product attributes (characteristics, properties)
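The four settings from the procedure (attribute block, row, name, value) map naturally onto nested lookups. The sketch below assumes a hypothetical attribute table with invented style names (attributes, name, value) and uses XPath in `xml.etree.ElementTree`; the real markup of a site will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page fragment with an attribute table.
CARD = """
<div class="product">
  <table class="attributes">
    <tr><td class="name">Screen</td><td class="value">6 in</td></tr>
    <tr><td class="name">Weight</td><td class="value">180 g</td></tr>
  </table>
</div>
"""

BLOCK = ".//table[@class='attributes']"  # role of "Selector No. 1": attribute block
ROW = ".//tr"                            # role of "Selector No. 2": one attribute row
NAME = "td[@class='name']"               # where the attribute name is located
VALUE = "td[@class='value']"             # where the attribute value is located

root = ET.fromstring(CARD)
attributes = {}
for block in root.findall(BLOCK):
    for row in block.findall(ROW):
        name, value = row.findtext(NAME), row.findtext(VALUE)
        if name and value:
            attributes[name.strip()] = value.strip()

print(attributes)  # {'Screen': '6 in', 'Weight': '180 g'}
```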
If the attributes are on a separate page
If the attributes are on a separate page (for example, clicking the "Features" tab opens a new page), there are two solutions. Both come down to getting a link to the page that holds the product attributes, so that the program can follow it and get the data.
Option number 1. The link is in the html source code.
For the "Product attributes" operation type, configure Selector No. 3 to get the link (or part of the link) to the attribute page.
For example, if the tab on the site is an "a" link with the style name nav-tabs-link, the selector for getting the link will be: a.nav-tabs-link
Option number 2. A prefix is added to the link to the product, which is not explicitly in the html source code.
For the "Product attributes" operation type, enter the link prefix in Selector No. 4; it will be added to the product link.
For example, if you enter tab=characteristics, the program opens the product link plus the prefix, which takes the parser to the product attributes page. What exactly to enter as the prefix is determined empirically, after a thorough analysis of the site.
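One plausible way to combine a product link with a value such as tab=characteristics is to append it as a query string. The helper below is a hypothetical sketch; the text does not say how Elbuz joins the two parts, so the choice of "?" or "&" as the separator is an assumption to verify against the actual site.

```python
def attribute_page_url(product_url: str, extra: str) -> str:
    """Append a query fragment such as 'tab=characteristics' to a product link.

    Assumption: the extra part is a query parameter, joined with '?' or '&'.
    """
    separator = "&" if "?" in product_url else "?"
    return product_url + separator + extra


# Hypothetical product links.
print(attribute_page_url("https://site.com/p/x200", "tab=characteristics"))
print(attribute_page_url("https://site.com/p?id=200", "tab=characteristics"))
```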
Starting parsing with loading a product catalog from a third-party site.
Downloading products from the website of the online store will be done in the following order:
- Getting links to product categories
- Getting links to products
- Getting product cards and saving the necessary information
For download testing, set the maximum number of results for each stage so that you can quickly check data parsing from the online store site. In this example, one link to a product category is loaded, and three product links are obtained from its list of products.
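The three stages and the per-stage result limits can be sketched as nested loops. The example below runs against a hypothetical in-memory "site" (the `SITE` and `CARDS` dictionaries and the `crawl` function are invented for illustration) instead of real pages, with the same limits as in the text: one category and three products.

```python
from typing import Dict, List

# Hypothetical in-memory "site": page address -> links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/cat/phones", "/cat/laptops"],
    "/cat/phones": ["/p/1", "/p/2", "/p/3", "/p/4"],
    "/cat/laptops": ["/p/5"],
}
CARDS = {f"/p/{i}": {"name": f"Product {i}"} for i in range(1, 6)}


def crawl(start: str, max_categories: int, max_products: int) -> List[dict]:
    """Run the three stages, limiting the number of results of stages 1 and 2."""
    saved = []
    for category in SITE[start][:max_categories]:            # stage 1: category links
        for product_url in SITE[category][:max_products]:    # stage 2: product links
            saved.append({"url": product_url, **CARDS[product_url]})  # stage 3: card
    return saved


result = crawl("/", max_categories=1, max_products=3)
print([item["url"] for item in result])  # ['/p/1', '/p/2', '/p/3']
```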
How to get product options
In the Elbuz program, option products are virtual products linked to one main product, whereas on the source site they are a single product card with a set of options. To get options, enter a selector that returns the option names, specify a comma as the separator and set the "Product option" flag.
When testing, the values are displayed separated by commas.
After products are loaded from the site, 1 main product is created, plus one option product for each value specified on the site.
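The split of one product card into a main product and option products can be illustrated as follows. The `expand_options` function, the record fields and the way option product names are composed are assumptions made for this sketch, not the Elbuz data model.

```python
from typing import List


def expand_options(main_name: str, raw_options: str, separator: str = ",") -> List[dict]:
    """Turn 'S, M, L' into one main product plus one option product per value."""
    options = [part.strip() for part in raw_options.split(separator) if part.strip()]
    main = {"name": main_name, "option": None, "parent": None}
    children = [
        # Naming the option product "<main name> <option>" is an assumption.
        {"name": f"{main_name} {option}", "option": option, "parent": main_name}
        for option in options
    ]
    return [main] + children


products = expand_options("T-shirt", "S, M, L")
for product in products:
    print(product)
```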
How to scrape a site
There are several modes of parsing:
- Specify links to the categories or products you need manually.
If you need to receive products only from certain categories, add links to those categories in the "List of links" tab. You can also receive information only on the products you need: specify a link to the product and check the "Link to product" flag.
- Load from a file a list of links for which you need information from the site. The file must be in CSV (text file) format.
- Upload your products to the base catalog and search the site for them. The program inserts each of your product names into the search bar of the site and saves the product it receives to the program database. In this mode, it is important that your product names are identical or very close to the names on the site, because search accuracy depends on the site's own algorithm and whether it can find the product you need.
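The text does not describe the layout of the CSV file with links, so the sketch below assumes the simplest case: one link per row, in the first column. The `read_links` function and the sample addresses are hypothetical; an in-memory string stands in for the file.

```python
import csv
import io
from typing import List, TextIO

# Hypothetical file content: one link per row, with a blank row at the end.
CSV_TEXT = "https://site.com/catalog/phones/\nhttps://site.com/p/x200\n\n"


def read_links(stream: TextIO) -> List[str]:
    """Return the first column of every row that looks like an http(s) link."""
    links = []
    for row in csv.reader(stream):
        if row and row[0].strip().startswith(("http://", "https://")):
            links.append(row[0].strip())
    return links


links = read_links(io.StringIO(CSV_TEXT))
print(links)
```

For a real file, the same function could be called with `open(path, newline="", encoding="utf-8")`, where the path and encoding depend on your export.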
To load a list of links from a file follow these steps
Automated search for your products on websites
- Specify a link to the search in the settings. The link is individual for each site. To get it, enter any text in the search bar on the site and copy the link from the browser without the text. An example of a link:
https://site.com/?search_text={NAME}
The program substitutes your keyword for {NAME} and generates links to search for your product on the source site. You can also use the {SKU} macro so that the search uses the value from the "Manufacturer's Article" field instead of the name.
- Activate the option "Search for your products".
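The macro substitution in the search link can be sketched in a few lines. The `build_search_url` function and the sample values are hypothetical, and URL-encoding the keyword with `quote_plus` is an assumption about what a site expects, not a statement about how Elbuz builds the link.

```python
from urllib.parse import quote_plus


def build_search_url(template: str, name: str = "", sku: str = "") -> str:
    """Fill the {NAME} and {SKU} macros of a search link template."""
    return (
        template.replace("{NAME}", quote_plus(name))
        .replace("{SKU}", quote_plus(sku))
    )


by_name = build_search_url("https://site.com/?search_text={NAME}", name="Phone X200")
by_sku = build_search_url("https://site.com/?search_text={SKU}", sku="X200-BLK")
print(by_name)  # https://site.com/?search_text=Phone+X200
print(by_sku)   # https://site.com/?search_text=X200-BLK
```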
This is just a brief presentation of the capabilities of the E-Trade Jumper program, which automates the processes of a modern online store.