How to wrap the process of creating start_urls in scrapy?
Tag : python , By : Guyou
Date : November 25 2020, 03:01 PM

This is what the start_requests method of a spider is for: it serves the purpose of creating the initial set of requests. Building on your example, it would read as:
import urllib.request

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.timeout = 10

    def start_requests(self):
        url_nasdaq = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt"
        s = urllib.request.urlopen(url_nasdaq).read().decode('ascii')
        # drop the header row and the trailing "File Creation Time" line
        rows = s.split('\r\n')[1:-2]
        for row in rows:
            if "NASDAQ TEST STOCK" in row:
                continue  # skip test issues
            symbol = row.split('|')[0]
            if '.' in symbol:
                continue  # skip dotted share classes
            yield scrapy.Request(
                "https://finance.yahoo.com/quote/%s/financials?p=%s" % (symbol, symbol),
                callback=self.parse,
            )
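The symbol-filtering steps above can be isolated into a pure function, which makes them easy to test without hitting the FTP server. This is a sketch; the function name is illustrative, not part of the original answer:

```python
def nasdaq_symbols(raw_text):
    """Parse a nasdaqlisted.txt payload into plain ticker symbols."""
    # drop the header row and the trailing "File Creation Time" line
    rows = raw_text.split('\r\n')[1:-2]
    symbols = []
    for row in rows:
        if "NASDAQ TEST STOCK" in row:
            continue  # skip test issues
        symbol = row.split('|')[0]
        if '.' not in symbol:  # exclude dotted share classes
            symbols.append(symbol)
    return symbols
```

The spider's start_requests then only has to iterate over the returned list and yield one Request per symbol.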

Scrapy Getting Start_Urls


Tag : python , By : Mena
Date : March 29 2020, 07:55 AM
To fix this issue: item['link'], as opposed to item['title'], is just a string rather than a list, so it must not be indexed:
self.cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'][0], item['link']))

start_urls in Scrapy


Tag : python , By : user119413
Date : March 29 2020, 07:55 AM
What you could do is set start_urls to the main page, then read the number of pages from the footer pagination (3 in this case) and use a loop to yield a Request for each page:
allowed_domains = ["go-on.fi"]
start_urls = ["http://www.go-on.fi/tyopaikat"]

def parse(self, response):
    # read the page count from the footer pagination (3 in this case)
    pages = int(response.xpath(
        '//ul[@class="pagination"][last()-1]/a/text()').extract_first() or 1)
    start = 0
    for page in range(pages):
        url = "http://www.go-on.fi/tyopaikat?start=" + str(start)
        start += 20
        yield Request(url, callback=self.parse_page)

def parse_page(self, response):
    # each table row holds one job listing
    for row in response.xpath("//tr"):
        item = JobData()
        item['header'] = row.xpath("./td[1]/a/text()").extract()
        item['link'] = row.xpath("./td[1]/a/@href").extract()
        yield item
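The page-URL arithmetic in parse can be sketched as a standalone helper, assuming the site's ?start= offset grows by 20 per page (the helper name is illustrative):

```python
def page_urls(base_url, pages, step=20):
    # one URL per page, using the ?start= offset convention above
    return ["%s?start=%d" % (base_url, n * step) for n in range(pages)]
```

With pages=3 this yields offsets 0, 20 and 40, matching the loop in the spider.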

Scrapy: How to set scrapy start_urls from a setting file?


Tag : development , By : Stephen Judge
Date : March 29 2020, 07:55 AM
Let's say you put your config files inside a configs directory under the spiders directory, so the overall path is scrapy_project -> spiders -> configs -> <spider_name>.txt.
Then you can override __init__ of your spiders to populate start_urls, something like this:
import os

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # resolve the config file relative to this spider module
    script_dir = os.path.dirname(__file__)
    abs_file_path = os.path.join(script_dir, "configs/%s.txt" % self.name)
    with open(abs_file_path) as f:
        self.start_urls = [line.strip() for line in f if line.strip()]
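The file-reading part can be exercised on its own with a throwaway config file; this is a minimal sketch (load_start_urls is an illustrative name, not a Scrapy API):

```python
import os
import tempfile

def load_start_urls(path):
    # one URL per line; blank lines and surrounding whitespace are ignored
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# usage: write a throwaway config and read it back
with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "myspider.txt")
    with open(cfg, "w") as f:
        f.write("http://example.com/a\n\nhttp://example.com/b\n")
    urls = load_start_urls(cfg)
```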

Python scrapy start_urls


Tag : python , By : Ambarish Singh
Date : March 29 2020, 07:55 AM
The question: is it possible to do something like below, but with multiple URLs? Each link has about 50 pages to crawl and loop over; the current solution works, but only with one URL instead of several. I recommend using start_requests for this:
def start_requests(self):
    base_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
        'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
    ]

    # generate pages 1-49 for every base URL
    for page in range(1, 50):
        for base_url in base_urls:
            url = base_url.format(page_number=page)
            yield scrapy.Request(url, callback=self.parse)
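The template expansion in the nested loops can be sketched as a pure function, which makes the ordering easy to verify (the helper name and example URL are illustrative):

```python
def expand_page_urls(base_urls, last_page):
    # page-major order, matching the nested loops above:
    # all base URLs for page 1, then all for page 2, and so on
    return [base.format(page_number=page)
            for page in range(1, last_page + 1)
            for base in base_urls]
```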

Scrapy start_urls


Tag : python , By : Gazza
Date : March 29 2020, 07:55 AM