How do I set up Scrapy to deal with a captcha
Tag : python , By : user185144
Date : March 29 2020, 07:55 AM
This is a deep topic with many possible solutions, but if you want to apply the logic you've described in your post you can use a Scrapy downloader middleware. Something like this (find_captcha and solve_captcha are placeholders for your own detection and solving logic):

import logging

import scrapy
from scrapy import Request
from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only solve requests marked with the meta key
        captcha = find_captcha(response)
        if not captcha:  # it might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        # retry the page for a new captcha, but prevent an endless loop
        if request.meta.get('captcha_retries', 0) == self.max_retries:
            logging.warning('max retries for captcha reached for {}'.format(request.url))
            raise IgnoreRequest
        request.dont_filter = True
        request.meta['captcha_retries'] = request.meta.get('captcha_retries', 0) + 1
        return request


class MySpider(scrapy.Spider):

    def parse(self, response):
        url = ''  # url that requires a captcha
        yield Request(url, callback=self.parse_captchad,
                      meta={'solve_captcha': True},
                      errback=self.parse_fail)

    def parse_captchad(self, response):
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to solve the captcha in 5 tries :(
        # do stuff
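For the middleware above to take effect, it also has to be enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical module myproject.middlewares (adjust the path and the priority number to your project):

```python
# settings.py -- the module path is an assumption; point it at wherever
# CaptchaMiddleware is actually defined in your project
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CaptchaMiddleware': 543,
}
```

The number controls the middleware's position in the chain; lower values run closer to the engine on responses.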
login from scrapy when website has captcha
Date : March 29 2020, 07:55 AM
Scrapy, login with captcha failed
Date : March 29 2020, 07:55 AM
The problem here is that the captcha image request needs to carry the cookies from the actual response, but you are using urllib2 to fetch the captcha, so Scrapy isn't handling those cookies. Use a Scrapy request to fetch the captcha instead, something like:

def parse(self, response):
    yield Request(url="http://tinyz.us/securimage/securimage_show.php",
                  callback=self.parse_captcha,
                  meta={'previous_response': response})

def parse_captcha(self, response):
    with open('captcha.png', 'wb') as f:
        f.write(response.body)
    captcha = raw_input("-----> Enter the captcha manually: ")
    return FormRequest.from_response(
        response=response.meta['previous_response'],
        formdata={"login_user": "myusername",
                  "login_password": "mypass",
                  "captcha_code": captcha},
        formxpath="//*[@id='login-form']",
        callback=self.after_login)
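After the form is submitted, it helps to check in after_login whether the captcha was actually accepted. A minimal sketch, assuming the site shows a "Logout" link only when authenticated (the marker string is hypothetical; adjust it to whatever the target page shows):

```python
def login_succeeded(body):
    # hypothetical marker: many sites render a "Logout" link once logged in
    return b'Logout' in body

def after_login(self, response):
    if login_succeeded(response.body):
        self.logger.info('logged in, captcha accepted')
    else:
        self.logger.error('login failed; the captcha may have been wrong')
```

If the check fails, you can re-issue the initial request (with dont_filter=True) to get a fresh captcha.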
Scrapy - simple captcha solving example
Tag : python , By : user179938
Date : March 29 2020, 07:55 AM
Solving the captcha itself is easy using Pillow and Python Tesseract. The hard part is realizing how to handle the session cookie (PHPSESSID). Here's a complete working example for your case (using Python 2):

# -*- coding: utf-8 -*-
import io
import urllib2

from PIL import Image
import pytesseract
import scrapy


class CaptchaSpider(scrapy.Spider):
    name = 'captcha'

    def start_requests(self):
        yield scrapy.Request('http://145.100.108.148/login3/',
                             cookies={'PHPSESSID': 'xyz'})

    def parse(self, response):
        img_url = response.urljoin(response.xpath('//img/@src').extract_first())
        # fetch the image with the same session cookie as the page,
        # otherwise the server generates a different captcha
        url_opener = urllib2.build_opener()
        url_opener.addheaders.append(('Cookie', 'PHPSESSID=xyz'))
        img_bytes = url_opener.open(img_url).read()
        img = Image.open(io.BytesIO(img_bytes))
        captcha = pytesseract.image_to_string(img)
        print 'Captcha solved:', captcha
        return scrapy.FormRequest.from_response(
            response, formdata={'captcha': captcha},
            callback=self.after_captcha)

    def after_captcha(self, response):
        print 'Result:', response.body
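Tesseract often struggles with noisy captcha images, so a common trick is to binarize the image with Pillow before handing it to pytesseract. A minimal sketch (the threshold value of 140 is an assumption; tune it for the actual captcha):

```python
import io

from PIL import Image


def preprocess_captcha(img_bytes, threshold=140):
    # convert raw image bytes to grayscale, then to pure black-and-white,
    # which usually improves pytesseract.image_to_string() accuracy
    img = Image.open(io.BytesIO(img_bytes)).convert('L')
    return img.point(lambda px: 255 if px > threshold else 0)
```

The returned image can be passed directly to pytesseract.image_to_string().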
Trying to response Amazon's Captcha with scrapy, strange behavior on spider generator
Date : March 29 2020, 07:55 AM
Currently, your form request object is never returned to Scrapy for handling. Replace self.solve_captcha(response, self.parse) with yield from self.solve_captcha(response, self.parse).
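The underlying issue is plain Python generator semantics: calling a generator method only creates a generator object and runs none of its body, so any Request it would yield is silently discarded unless the caller iterates it. A minimal stand-alone illustration (no Scrapy involved; the names are hypothetical):

```python
def make_requests():
    # stands in for a callback that yields scrapy.Request objects
    yield 'request-1'
    yield 'request-2'


def parse_wrong():
    make_requests()  # generator created but never consumed: nothing happens
    # this function contains no yield itself, so it just returns None


def parse_right():
    yield from make_requests()  # re-yields each item to the caller (Scrapy)
```

list(parse_right()) yields both items, while parse_wrong() returns None, so Scrapy would schedule nothing.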