This should still fix some issues. I was able to accomplish your homework in the following way (please take the time to learn this):
import urllib
from bs4 import BeautifulSoup
# This function will get the Nth link object from the given url.
# To be safe you should make sure the nth link exists (I did not)
def getNthLink(url, n):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    return tags[n - 1]  # n is 1-based: n=3 returns the third link
url = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html"
# This iterates 4 times, each time grabbing the 3rd link object
# For convenience it prints the url each time.
for i in xrange(4):
    tag = getNthLink(url, 3)
    url = tag.get('href')
    print url
# Finally, after 4 iterations, grab the content from the last tag
print tag.contents[0]
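As the comment in the function notes, getNthLink() assumes the page actually has at least n links. A minimal defensive variant, written here as a Python 3 sketch (returning None for a missing link is my own convention, not part of the assignment):

import urllib.request
from bs4 import BeautifulSoup

def get_nth_link(url, n):
    # Python 3 version of getNthLink() with a bounds check
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    if n < 1 or n > len(tags):
        return None  # signal a missing link instead of raising IndexError
    return tags[n - 1]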
This might help: How can I retrieve the links of a webpage and copy the url address of the links using Python? Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']
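The same idea works on Python 3; here is a sketch substituting requests for httplib2 (my substitution, not the original answer's choice). SoupStrainer still restricts parsing to anchor tags that carry an href, which keeps memory use down on large pages:

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.nytimes.com')
only_links = SoupStrainer('a', href=True)  # parse nothing but <a href=...> tags
soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_links)
for link in soup.find_all('a'):
    print(link['href'])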
Python BeautifulSoup - Possible to get X number of links from page?
I hope this helps you. There is no need to use BeautifulSoup for this. The site is returning perfectly valid XML that can be parsed with Python's included tools:
import requests
import xml.etree.ElementTree as et

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)

for i in root:
    print i[0].text  # the <loc> text, the first child of each <url> element
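If you only want the <loc> values, ElementTree's iter() with the sitemap namespace is a little more robust to nesting. A Python 3 sketch (the namespace URI below is the standard sitemap one, but it is worth verifying against the actual feed):

import requests
import xml.etree.ElementTree as et

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for loc in root.iter(SITEMAP_NS + 'loc'):  # finds every <loc>, however deeply nested
    print(loc.text)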
I'm attempting to extract some links from craigslist using beautifulsoup but it's pulling the links 100 times rather than once
This seems to work fine. You don't need a nested loop here. Other notes/improvements:
- the opener.open() result can be passed directly to the BeautifulSoup constructor; there is no need for read()
- the opener can be defined once and reused in the loop to follow links
- use find_all() instead of findAll()
- use urljoin() for concatenating url parts
- use the csv module for writing the delimited data
- use the with context manager while dealing with files
import csv
import re
import time
import urllib2
from urlparse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'C:/Python27/Folder/Folder/Folder/craigstvs.txt'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL))

with open(FILENAME, 'a') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile("hdrlnk")):
        timeset = time.strftime("%m-%d %H:%M")
        item_url = urljoin(BASE_URL, link['href'])
        item_soup = BeautifulSoup(opener.open(item_url))
        # do smth with the item_soup? or why did you need to follow this link?
        writer.writerow([timeset, link.text, item_url])

Sample output:
08-10 16:56;Dell Inspiron-15 Laptop;http://omaha.craigslist.org/sys/4612666460.html
08-10 16:56;macbook 13 inch 160 gig wifi dvdrw ;http://omaha.craigslist.org/sys/4612480237.html
08-10 16:56;MAC mini intel core wifi dvdrw great cond ;http://omaha.craigslist.org/sys/4612480593.html
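For what it's worth, on Python 3 the same approach would use urllib.request and open the csv file with newline=''. This rewrite is my own sketch, not part of the original answer, and the craigslist markup may have changed since; the output path is hypothetical:

import csv
import re
import time
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'craigstvs.txt'  # hypothetical path

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL), 'html.parser')

# newline='' keeps the csv module from writing blank rows on Windows
with open(FILENAME, 'a', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile('hdrlnk')):
        timeset = time.strftime('%m-%d %H:%M')
        item_url = urljoin(BASE_URL, link['href'])
        writer.writerow([timeset, link.text, item_url])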