Marie
August 9, 2022, 3:55pm
1
Hey guys.
I made a config that scrapes startpage.com search results and extracts the result URLs.
GIF of the issue: https://i.imgur.com/BY0w9iS.gif
LoliCode I used:
BLOCK:PuppeteerOpenBrowser
ENDBLOCK
BLOCK:PuppeteerNavigateTo
url = "https://www.startpage.com/do/mypage.pl?prfe=bac38c6c11849a35192e23eed03e5cd58ca7a0a992c7d66dde9b968457e6b8d1d7f6052df69df20e79ae9492d6295da9c9e9b0cef9ac1fcb337ca4f9701e590fad8f8c6ed796976708f95c8729"
referer = "https://www.startpage.com"
ENDBLOCK
BLOCK:PuppeteerTypeElement
findBy = XPath
identifier = "//*[@id=\"q\"]"
text = $"<input.USER>"
timeBetweenKeystrokes = 10
ENDBLOCK
BLOCK:PuppeteerClick
findBy = XPath
identifier = "/html/body/div[2]/section/div[2]/div[2]/div/form/button[2]/div/div"
ENDBLOCK
BLOCK:PuppeteerWaitForNavigation
ENDBLOCK
BLOCK:PuppeteerGetAttributeValueAll
findBy = Class
identifier = "w-gl__result-url result-link"
attributeName = "href"
=> VAR @puppeteerGetAttributeValueAllOutput
ENDBLOCK
BLOCK:Keycheck
banIfNoMatch = False
KEYCHAIN SUCCESS OR
STRINGKEY @puppeteerGetAttributeValueAllOutput Contains "https://"
KEYCHAIN FAIL OR
STRINGKEY @puppeteerGetAttributeValueAllOutput DoesNotExist "https://"
ENDBLOCK
BLOCK:Parse
input = @puppeteerGetAttributeValueAllOutput
RECURSIVE
MODE:LR
=> VAR @parseOutput
ENDBLOCK
BLOCK:RegexReplace
original = @parseOutput
pattern = "(\\[\\[|\\]\\])"
=> VAR @regexReplaceOutput
ENDBLOCK
BLOCK:FileAppendLines
path = "startpage.com.txt"
lines = @regexReplaceOutput
ENDBLOCK
BLOCK:Keycheck
banIfNoMatch = False
KEYCHAIN SUCCESS AND
STRINGKEY @regexReplaceOutput Contains "https"
STRINGKEY @regexReplaceOutput Contains ","
KEYCHAIN FAIL AND
STRINGKEY @regexReplaceOutput Contains "https"
STRINGKEY @regexReplaceOutput Contains ","
ENDBLOCK
Error
August 10, 2022, 1:36am
2
Check your datalist when you load it in the runner, and make sure you selected the Credentials type.
Marie
August 10, 2022, 1:44am
3
I already did. I also checked that the datalist is in UTF-8.
@Marie disable headless mode, then run the config to get an idea of what is causing the problem.
Your data type is Credentials, but the actual data is not in that format. Did you already account for that?
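To make that concrete, here is a minimal sketch of the typing block from the original config, adapted for a datalist loaded with the Default type. It assumes the Default type exposes each whole line as input.DATA, while the Credentials type expects user:pass lines and splits them into input.USER and input.PASS. Those are the usual defaults as far as I know, so check the wordlist types in your own environment before relying on this.

```lolicode
// Assumption: the datalist is loaded with the Default type,
// so the whole line is available as input.DATA (not input.USER).
BLOCK:PuppeteerTypeElement
findBy = XPath
identifier = "//*[@id=\"q\"]"
text = @input.DATA
timeBetweenKeystrokes = 10
ENDBLOCK
```

If the datalist stays on the Credentials type instead, lines that are not in user:pass form would be expected to fail validation before the config even runs.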
Marie
August 10, 2022, 3:28pm
6
In Stacker, the browser opens and <input.USER> is used as intended.
But when I run the config as a job, nothing happens apart from a massive number of invalid errors.
Please see the attached gif.
https://i.imgur.com/BY0w9iS.gif
You can try this, with your datalist set to the Default type.
BLOCK:RegexReplace
LABEL:DATA
original = @input.DATA
pattern = "\\s"
replacement = "+"
=> VAR @DATA
ENDBLOCK
BLOCK:HttpRequest
LABEL:SEARCH
url = "Startpage - Private Search Engine. No Tracking. No Search History. "
method = POST
httpLibrary = SystemNet
customHeaders = {("Host", "www.startpage.com"), ("Origin", "https://www.startpage.com"), ("Referer", "https://www.startpage.com/"), ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.47")}
TYPE:STANDARD
$"query=&language=italiano&lui=english&cat=web&sc=9Ud1SX69ew1c20&abp=-1"
"application/x-www-form-urlencoded"
ENDBLOCK
BLOCK:Parse
LABEL:URLS
input = @data.SOURCE
attributeName = ""
pattern = "class=\"w-gl__result-url.result-link\"\\s+?href=\"(.+?)\""
outputFormat = "[1]"
RECURSIVE
MODE:Regex
=> VAR @URLS
ENDBLOCK
BLOCK:Keycheck
LABEL:CHECK
banIfNoMatch = False
KEYCHAIN FAIL OR
STRINGKEY @URLS EqualTo " "
ENDBLOCK
BLOCK:FileAppend
LABEL:APPEND
path = "Startpage.txt"
content = @URLS
ENDBLOCK
Marie
August 10, 2022, 3:33pm
8
Sir, your brain must be enormous. You solved my issue. Thank you so much. I would lick your armpits for that.
Just drop my reply a like, pls don't lick anything.
You can add a SUCCESS key if you like.
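For example, here is a minimal sketch of the CHECK block with a SUCCESS keychain added. The Contains "https://" condition is only an assumption about what a good result looks like (it mirrors the check in the original Puppeteer config), so swap it for whatever you consider a hit.

```lolicode
// Assumption: any parsed result containing "https://" counts as a hit.
BLOCK:Keycheck
LABEL:CHECK
banIfNoMatch = False
KEYCHAIN SUCCESS OR
STRINGKEY @URLS Contains "https://"
KEYCHAIN FAIL OR
STRINGKEY @URLS EqualTo " "
ENDBLOCK
```

The keychains are listed with SUCCESS first, so a line that yields at least one URL should be marked as a hit before the FAIL condition is considered.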