Scrapy Getting Start_Urls
Date : March 29 2020, 07:55 AM
To fix this issue: item['link'], as opposed to item['title'], is just a string rather than a list, so index only the title when passing the values to the query:

self.cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'][0], item['link']))
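For context, this is the kind of item pipeline that execute call would sit in; a minimal sketch assuming a MySQLdb connection (any DB-API driver that uses the %s paramstyle works the same way), with the connection details and table name as placeholders:

import MySQLdb

class MySQLStorePipeline(object):
    def open_spider(self, spider):
        # placeholder connection details
        self.conn = MySQLdb.connect(host="localhost", user="scrapy",
                                    passwd="secret", db="scrapydb", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # item['title'] comes from .extract() and is a list; item['link'] is a plain string
        self.cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                            (item['title'][0], item['link']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()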
start_urls in Scrapy
Tag : python , By : user119413
Date : March 29 2020, 07:55 AM
What you could do is set start_urls to the main page and then, based on the number of pages shown in the footer pagination (in this case 3), use a loop to yield a Request for each of the pages:

allowed_domains = ["go-on.fi"]
start_urls = ["http://www.go-on.fi/tyopaikat"]

def parse(self, response):
    # number of page links in the footer pagination (in this case 3)
    pages = len(response.xpath('//ul[@class="pagination"][last()-1]/a/text()').extract())
    page = 1
    start = 0
    while page <= pages:
        url = "http://www.go-on.fi/tyopaikat?start=" + str(start)
        start += 20
        page += 1
        yield Request(url, callback=self.parse_page)

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    titles = hxs.select("//tr")
    for row in titles:
        item = JobData()
        item['header'] = row.select("./td[1]/a/text()").extract()
        item['link'] = row.select("./td[1]/a/@href").extract()
        items.append(item)
    return items
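For reference, a self-contained sketch of the same approach on current Scrapy, using response.xpath and plain dict items instead of the deprecated HtmlXPathSelector; the selectors and the 20-results-per-page offset come from the answer above, while the spider name is an assumption:

import scrapy


class GoOnSpider(scrapy.Spider):
    name = "go_on"
    allowed_domains = ["go-on.fi"]
    start_urls = ["http://www.go-on.fi/tyopaikat"]

    def parse(self, response):
        # one request per pagination link, 20 results per page
        pages = len(response.xpath('//ul[@class="pagination"][last()-1]/a/text()').getall())
        for page in range(pages):
            url = "http://www.go-on.fi/tyopaikat?start=" + str(page * 20)
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        for row in response.xpath("//tr"):
            yield {
                "header": row.xpath("./td[1]/a/text()").get(),
                "link": row.xpath("./td[1]/a/@href").get(),
            }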
Scrapy: How to set scrapy start_urls from a setting file?
Date : March 29 2020, 07:55 AM
Let's say you put your config files inside a configs directory under the spiders directory, so the overall path is scrapy_project -> spiders -> configs -> <spider name>.txt. Then you can override __init__ in your spiders to populate start_urls, something like this (os must be imported at the top of the module):

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    script_dir = os.path.dirname(__file__)
    abs_file_path = os.path.join(script_dir, "configs/%s.txt" % self.name)
    with open(abs_file_path) as f:
        self.start_urls = [line.strip() for line in f.readlines()]
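Put together, a minimal spider using that approach might look like the following sketch; the spider name, the example parse callback, and the file layout under spiders/configs/ are illustrative assumptions:

import os

import scrapy


class ConfigUrlsSpider(scrapy.Spider):
    name = "config_urls"  # reads start URLs from spiders/configs/config_urls.txt

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        script_dir = os.path.dirname(__file__)
        abs_file_path = os.path.join(script_dir, "configs/%s.txt" % self.name)
        # one URL per line; blank lines are skipped
        with open(abs_file_path) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}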
Python scrapy start_urls
Tag : python , By : Ambarish Singh
Date : March 29 2020, 07:55 AM
The question: is it possible to do something like below but with multiple URLs? Each link has about 50 pages to crawl and loop over, and the current solution works, but only with one URL instead of multiple. I recommend using start_requests for this:

def start_requests(self):
    base_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
        'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
    ]
    # pages 1-49 for each of the three categories
    for page in range(1, 50):
        for base_url in base_urls:
            url = base_url.format(page_number=page)
            yield scrapy.Request(url, callback=self.parse)
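If the exact page count per category is not known up front, a common alternative is to follow the site's own pagination from parse instead of hardcoding range(1, 50); a hedged sketch, assuming the listing pages expose a rel="next" link (the spider name and selector are assumptions):

import scrapy


class ListingSpider(scrapy.Spider):
    name = "listings"
    start_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-1/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-1/c21159',
        'https://www.xxxxxxx.com.au/garden/page-1/c25449',
    ]

    def parse(self, response):
        # ... extract items from the current listing page here ...
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)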