asyncio web scraping 101: fetching multiple urls with aiohttp
Date : March 29 2020, 07:55 AM
The following fixes the issue. I would use gather instead of wait, since gather can return exceptions as objects without raising them; you can then check whether each result is an instance of an exception.
import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is False, which would raise
    )
    # for testing purposes only
    # gather returns results in the order of the coros
    for idx, url in enumerate(urls):
        print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
$ python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
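Note that newer aiohttp releases (3.x) drop the aiohttp.Timeout helper in favour of aiohttp.ClientTimeout and deprecate passing loop= to ClientSession, which should itself be created inside a coroutine with async with. A minimal sketch of the same gather-based approach for aiohttp 3.x on Python 3.7+ (not the original answer's code):
import asyncio
import aiohttp

async def fetch(session, url):
    # per-request timeout replaces the old aiohttp.Timeout context manager
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
        return await response.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *[fetch(session, url) for url in urls],
            return_exceptions=True)

if __name__ == '__main__':
    urls = ['http://google.com', 'http://twitter.com']
    results = asyncio.run(fetch_all(urls))
    for url, res in zip(urls, results):
        print('{}: {}'.format(url, 'ERR' if isinstance(res, Exception) else 'OK'))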
|
Web Scraping with Python in combination with asyncio
Tag : python , By : Shrek Qian
Date : March 29 2020, 07:55 AM
I think the issue was the following. The asker had written a Python script to pull some information from a webpage; it runs flawlessly when taken out of asyncio, but because it runs synchronously they wanted to move it into an asynchronous process so the task finishes in the shortest possible time without blocking, and, never having worked with the asyncio library, they were unsure how to do it. The fix: processing_docs() is a coroutine, so it has to be called with await. Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
and replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
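Calling a coroutine function without await only creates a coroutine object; its body never runs and Python warns that the coroutine was never awaited. A minimal illustration (this processing_docs is a hypothetical stand-in, not the asker's actual function):
import asyncio

async def processing_docs(link):
    # hypothetical stand-in: pretend to fetch and process the page
    await asyncio.sleep(0)
    print('processed', link)

async def main():
    processing_docs('http://example.com/a')        # does nothing; RuntimeWarning: coroutine was never awaited
    await processing_docs('http://example.com/b')  # actually runs

asyncio.run(main())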
|
Python: Pyppeteer with asyncio
Date : March 29 2020, 07:55 AM
Hope that helps. According to the pyppeteer source code, it uses subprocess without pipes to manage the Chromium processes and websockets to communicate with them, which is why it is async. With 31 sites you will have 31+1 processes, so unless you have a CPU with 32 cores (threads, system processes, locks, hyper-threading and other factors also affect the result, so this is only a rough example), they will not all execute in parallel. The bottleneck, I think, is the CPU: opening browsers, rendering web pages and dumping them into images. Using an executor won't help.
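If the concern is keeping the machine responsive rather than raw speed, one option (not from the original answer) is to cap how many Chromium instances are alive at once with an asyncio.Semaphore. A rough sketch, where the limit of 4 and the screenshot filenames are arbitrary assumptions:
import asyncio
from pyppeteer import launch

async def snapshot(sem, url):
    # the semaphore caps how many browsers run concurrently
    async with sem:
        browser = await launch()
        try:
            page = await browser.newPage()
            await page.goto(url)
            await page.screenshot({'path': url.split('//')[-1].replace('/', '_') + '.png'})
        finally:
            await browser.close()

async def main(urls):
    sem = asyncio.Semaphore(4)  # at most 4 Chromium processes at a time
    await asyncio.gather(*(snapshot(sem, u) for u in urls))

asyncio.get_event_loop().run_until_complete(main(['https://example.org']))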
|
Python asyncio web scraping output not exporting in excel
Date : March 29 2020, 07:55 AM
Hope this helps. It looks like run_in_executor does not add a task to the loop by itself; its result has to be awaited, so you need to wrap it in a coroutine and create a task on the loop. Simpler example below.
import asyncio
from urllib.request import urlopen
import json

URLS = [
    "http://localhost:8000/a",
    "http://localhost:8000/b",
    "http://localhost:8000/c",
    "http://localhost:8000/d",
]

data = []

def load_html(url):
    print(url)
    res = urlopen(url)
    data.append(res.read().decode())

async def scrape(url, loop):
    # run the blocking urlopen call in the default thread pool executor
    await loop.run_in_executor(None, load_html, url)

def main():
    loop = asyncio.get_event_loop()
    for url in URLS:
        loop.create_task(scrape(url, loop))
    # wait for every task scheduled on the loop to finish
    loop.run_until_complete(asyncio.gather(*asyncio.all_tasks(loop)))
    with open('/tmp/j_dump', 'w') as fp:
        json.dump(data, fp)

if __name__ == '__main__':
    main()
Or, more simply, skip the task bookkeeping: run_in_executor already returns an awaitable future, so gather can collect the results directly and load_html can return its value instead of appending to a shared list.
def load_html(url):
    print(url)
    res = urlopen(url)
    return res.read().decode()

def main():
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(None, load_html, url) for url in URLS]
    data = loop.run_until_complete(asyncio.gather(*tasks))
    with open('/tmp/j_dump', 'w') as fp:
        json.dump(data, fp)
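Since the question title asks about Excel, the collected strings can also be written to a spreadsheet instead of JSON. This is not part of the original answer; it assumes pandas and openpyxl are installed and uses the second version above, where gather preserves the URL order:
import pandas as pd

# one row per URL, with the downloaded HTML in a single column
pd.DataFrame({'url': URLS, 'html': data}).to_excel('/tmp/j_dump.xlsx', index=False)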
|
How to set up Accept-Encoding to gzip in Python pyppeteer and print pyppeteer headers?
Tag : python , By : Joshua Johnson
Date : March 29 2020, 07:55 AM
Hope this helps. The asker wanted to set a request header in pyppeteer, for example Accept-Encoding: gzip, and to print the pyppeteer headers in Python.
import pyppeteer
import asyncio
from pyppeteer.network_manager import Request, Response

async def req_intercept(req: Request):
    print(f'Original header: {req.headers}')
    req.headers.update({'Accept-Encoding': 'gzip'})
    await req.continue_(overrides={'headers': req.headers})

async def resp_intercept(resp: Response):
    print(f"New header: {resp.request.headers}")

async def test():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.setRequestInterception(True)
    page.on('request', req_intercept)
    page.on('response', resp_intercept)
    resp = await page.goto('https://example.org/')
    print(resp.headers)

asyncio.get_event_loop().run_until_complete(test())
Original header: {'upgrade-insecure-requests': '1', 'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/69.0.3494.0 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'}
New header: {'upgrade-insecure-requests': '1', 'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/69.0.3494.0 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'Accept-Encoding': 'gzip'}
{'status': '200', 'content-encoding': 'gzip', 'accept-ranges': 'bytes', 'cache-control': 'max-age=604800', 'content-type': 'text/html; charset=UTF-8', 'date': 'Sat, 13 Apr 2019 03:07:49 GMT', 'etag': '"1541025663"', 'expires': 'Sat, 20 Apr 2019 03:07:49 GMT', 'last-modified': 'Fri, 09 Aug 2013 23:54:35 GMT', 'server': 'ECS (dcb/7F84)', 'vary': 'Accept-Encoding', 'x-cache': 'HIT', 'content-length': '606'}
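If the goal is only to add a header to every request (without inspecting each one), pyppeteer also exposes page.setExtraHTTPHeaders, mirroring Puppeteer's API. A minimal sketch, assuming a recent pyppeteer:
import asyncio
import pyppeteer

async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    # ask servers for gzip on every request made by this page
    await page.setExtraHTTPHeaders({'Accept-Encoding': 'gzip'})
    resp = await page.goto('https://example.org/')
    print(resp.headers)  # e.g. includes 'content-encoding': 'gzip'
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())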
|