Examples of selectors (CSS, XPath) for parsing data from online store sites
Examples of CSS and XPath selectors for extracting data from online store pages (web scraping).
Get the manufacturer's SKU from a list of values (ul and li tags)
CSS selector (the ::text pseudo-element is a Scrapy/parsel extension, not standard CSS):
ul.desc-list li:nth-child(7)::text
or XPath for the SKU, assuming each li holds the attribute name in its first span and the value in its second:
//ul/li[span/text() = "Article"]/span[2]/text()
Get the manufacturer's SKU from the product attribute table (filter by product attribute name)
XPath selector: //tr[td/text() = "Article"]/td[2]/text()
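The attribute-table lookup can be tried out with Python's standard library. This is only a minimal sketch: the HTML fragment, the value AB-1234 and the helper name get_attribute are made up for illustration. Because xml.etree.ElementTree supports only a subset of XPath, the predicate is written as td='Article' rather than td/text() = "Article", and the text is read from the .text attribute instead of a trailing /text() step.

```python
import xml.etree.ElementTree as ET

# Hypothetical product attribute table; the values are made up for illustration.
HTML = """
<table>
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Article</td><td>AB-1234</td></tr>
  <tr><td>Weight</td><td>2 kg</td></tr>
</table>
"""

def get_attribute(table_markup, name):
    """Return the text of the second cell in the row that has a cell equal to name."""
    root = ET.fromstring(table_markup)
    # ElementTree form of //tr[td/text() = name]/td[2]; name must not contain quotes.
    cell = root.find(f".//tr[td='{name}']/td[2]")
    return cell.text if cell is not None else None

sku = get_attribute(HTML, "Article")
print(sku)
```

Real store pages are rarely well-formed XML, so in practice an HTML-aware parser with full XPath support would usually replace ElementTree here.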
Get the product price by element id, where part of the id changes on every product page
XPath selector matching any span whose id contains "retail_price": //span[contains(@id,"retail_price")]
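The effect of contains(@id, ...) can be sketched in plain Python. The page fragment, the numeric id suffix and the helper name find_by_id_part are assumptions made for this example. ElementTree has no contains() function, so the filter is written as an ordinary substring test on the id attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page fragment; the numeric id suffix differs per product.
PAGE = """
<div>
  <span id="title_48213">Kettle</span>
  <span id="retail_price_48213">1 990</span>
  <span id="old_price_48213">2 490</span>
</div>
"""

def find_by_id_part(markup, tag, id_part):
    """Emulate //tag[contains(@id, id_part)] by filtering on the id attribute."""
    root = ET.fromstring(markup)
    return [el for el in root.iter(tag) if id_part in el.get("id", "")]

prices = [el.text for el in find_by_id_part(PAGE, "span", "retail_price")]
print(prices)
```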
Get the list of links to the other result pages (pagination), excluding the current page, because the product links on it have already been collected
CSS selector: ul.pagination li:not(.active) a
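What the pagination selector picks can be illustrated without a CSS engine. The sketch below walks a made-up pager with the standard library and skips the li that carries the active class. The markup, the URLs and the function name other_page_links are hypothetical, and a flat list without nested li elements is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; page 2 is the current page.
PAGER = """
<ul class="pagination">
  <li><a href="/catalog?page=1">1</a></li>
  <li class="active"><a href="/catalog?page=2">2</a></li>
  <li><a href="/catalog?page=3">3</a></li>
</ul>
"""

def other_page_links(markup):
    """Collect hrefs the way ul.pagination li:not(.active) a would."""
    root = ET.fromstring(markup)
    links = []
    for ul in root.iter("ul"):
        if "pagination" not in ul.get("class", "").split():
            continue
        for li in ul.iter("li"):
            if "active" in li.get("class", "").split():
                continue  # the current page has already been processed
            links.extend(a.get("href") for a in li.iter("a"))
    return links

links = other_page_links(PAGER)
print(links)
```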
CSS locators:
div#pocks — looking for a div whose ID is equal to pocks
div.perl - looking for a div whose class is perl
body[vlink="1"] - looking for a body tag whose vlink attribute equals 1 (a value that starts with a digit has to be quoted)
body[vlink*="1"] - looking for a body tag whose vlink attribute contains 1 anywhere in its value
body[vlink$="1"] - looking for a body tag whose vlink attribute ends with 1
body[vlink^="1"] - looking for a body tag whose vlink attribute starts with 1
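The four attribute operators correspond to simple string tests. The sketch below shows that correspondence in Python. The function name attribute_matches and the sample values are assumptions for illustration only, not part of any CSS library.

```python
def attribute_matches(value, operator, needle):
    """Mirror the CSS attribute operators =, *=, $= and ^= with string tests."""
    if operator == "=":
        return value == needle
    if needle == "":
        # In CSS the substring operators never match an empty string.
        return False
    if operator == "*=":
        return needle in value
    if operator == "$=":
        return value.endswith(needle)
    if operator == "^=":
        return value.startswith(needle)
    raise ValueError(f"unsupported operator: {operator}")

OPERATORS = ("=", "*=", "$=", "^=")

# Hypothetical vlink values, each tested against the needle "1".
results = {
    value: [op for op in OPERATORS if attribute_matches(value, op, "1")]
    for value in ("1", "21", "12", "212")
}
for value, ops in results.items():
    print(value, ops)
```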
A space (the descendant combinator) finds all descendants of an element, at any depth. Example:
div#ires a - finds all links from a div with ID ires
div#ires a:nth-of-type(1) - finds every link inside the div with ID ires that is the first a element among its siblings
div > a - all a elements that are direct children of a div
div + div - finds a div that comes immediately after another div (adjacent sibling)
div + a - all a elements (links) that come immediately after a div
div ~ div - finds every div that follows another div with the same parent, even with other elements in between
a:contains("ggdgdgd") - finds links whose text contains ggdgdgd (:contains is a jQuery extension, not standard CSS)
*.warning - any element with the warning class
div * p - looking for a p element that is a grandchild or deeper descendant of a div, so at least one element sits between them
h1.opener + h2 - looking for an h2 that comes immediately after an h1 with class opener
a[rel~="copyright"] - looking for a link whose rel attribute is a space-separated list of words, one of which is copyright
span[hello='Cleveland'][goodbye='Columbus'] - Looks for a span element that has a hello attribute set to Cleveland and a goodbye attribute set to Columbus
div.flyout > a - find all links that are direct children of the div with class flyout
div#action_list_body_current li:nth-of-type(1) - find the first task in the current list
#quick search a[accesskey="p"] - find the link with the accesskey attribute "p" in quick search
#context_list a:contains('line') - find the context in the Contexts table that contains the text "line"
XPath locators:
//body/.. - the parent of body, that is, the html tag
The difference between XPath and CSS: in XPath we can move from bottom to top (from an element to its parent or ancestors), while in CSS only from top to bottom.
//a[text()='some value'] — find a link with text some value
author[last-name[position()=1] = "Bob"] - find an author element whose first last-name child has the value Bob
//div[@id='header'] — div element with id header
//div[1] - every div that is the first div child of its parent
//div[position()=1] is the same as //div[1]
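//div[position()=2 or position()=3] - the second and third div (//div[2 and 3] does not do this: "2 and 3" evaluates to true, so every div matches)
In XPath, relationships between elements are expressed through axes
// - shorthand for the descendant-or-self axis: the search covers nested elements at any depth
/descendant::div[@id='header'] - finds every div with id header among the descendants of the document root
book/*/last-name - finds last-name elements that are grandchildren of a book element, with any element in between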
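Positional predicates count elements within each parent, not across the whole document. The sketch below shows this on a made-up fragment using xml.etree.ElementTree. ElementTree supports numeric predicates but has no position() function or boolean operators, so the second and third div are collected with two separate queries. The markup and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two sections, each with its own run of div elements.
DOC = """
<body>
  <section><div>a</div><div>b</div></section>
  <section><div>c</div><div>d</div><div>e</div></section>
</body>
"""

root = ET.fromstring(DOC)

# .//div[1] matches the first div under EACH parent, so two elements come back.
first_divs = [d.text for d in root.findall(".//div[1]")]

# Second and third div of each parent, gathered with one query per position.
second_and_third = [
    d.text for d in root.findall(".//div[2]") + root.findall(".//div[3]")
]

print(first_divs)
print(second_and_third)
```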
*[@specialty] - any element with a specialty attribute
author[first-name][3] - the third of the author elements that have a first-name child
author[not(degree or award) and publication] - find an author element that has no degree or award child but does have a publication child
ancestor::author[parent::book][1] - find the nearest ancestor named author whose parent is a book element
//a[text()="Preferences"][ancestor::*[@id='header']] - find the Preferences link in the top menu (going from bottom to top: first write the link with the text Preferences, then require an ancestor with id header)
//*[@id ='action_list_current']//span[@class='next_action_name'][following-sibling::*/a[contains(@href,'contexts') and text() ='Offline'] ] - Find all tasks in the current list with Offline context
Getting value from style
substring-before(substring-after(//div[@class="Header"]/div[@class="Header-jpeg"]/@style, "background-image: url("), ")")
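The nested substring-after / substring-before call simply cuts out the text between two markers. A rough Python equivalent is sketched below. The two helper functions are written here for illustration (they are not library functions), the style string is made up, and non-empty markers are assumed.

```python
def substring_after(text, marker):
    """Like XPath substring-after: text after the first marker, or "" if absent."""
    _, found, tail = text.partition(marker)
    return tail if found else ""

def substring_before(text, marker):
    """Like XPath substring-before: text before the first marker, or "" if absent."""
    head, found, _ = text.partition(marker)
    return head if found else ""

# Hypothetical value of the style attribute on div.Header-jpeg.
style = (
    "width: 100%; "
    "background-image: url(https://site.com/img/014/114288.jpg); "
    "height: 200px"
)

url = substring_before(substring_after(style, "background-image: url("), ")")
print(url)
```

If the style attribute has no background-image declaration, both helpers fall back to an empty string, which matches the behaviour of the XPath functions.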
Regular expression
\$\(".prod-img"\).css\("backgroundImage", "url\((.*?)\)
to get the image link from text such as
$(".prod-img").css("backgroundImage", "url(https://site.com/img/014/114288.jpg)");
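The regular expression can be checked against the sample line with Python's re module. This is a small sketch: the variable names are arbitrary, and the dots before prod-img and css are escaped here so that they match only a literal dot.

```python
import re

# Capture everything between "url(" and the first closing parenthesis.
PATTERN = re.compile(
    r'\$\("\.prod-img"\)\.css\("backgroundImage", "url\((.*?)\)'
)

text = (
    '$(".prod-img").css("backgroundImage", '
    '"url(https://site.com/img/014/114288.jpg)");'
)

match = PATTERN.search(text)
image_url = match.group(1) if match else None
print(image_url)
```

The lazy group (.*?) stops at the first closing parenthesis, so a URL that itself contains ")" would be cut short.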
Links:
w3.org/TR/selectors/
w3schools.com/css/css_examples.asp