This should still fix some issues. I was able to accomplish your homework in the following way (please take the time to learn this):
import urllib
from bs4 import BeautifulSoup
# This function will get the Nth link object from the given url.
# To be safe you should make sure the nth link exists (I did not)
def getNthLink(url, n):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    return tags[n - 1]  # n is 1-based: n=3 returns the third link
url = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html"
# This iterates 4 times, each time grabbing the 3rd link object
# For convenience it prints the url each time.
for i in xrange(4):
    tag = getNthLink(url, 3)
    url = tag.get('href')
    print url
# Finally, after 4 iterations, grab the content from the last tag
print tag.contents[0]
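As the comment in the function notes, getNthLink() assumes the page actually has at least n links. A minimal defensive variant, written here as a Python 3 sketch (returning None for a missing link is my own convention, not part of the assignment):

import urllib.request
from bs4 import BeautifulSoup

def get_nth_link(url, n):
    # Python 3 version of getNthLink() with a bounds check
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    if n < 1 or n > len(tags):
        return None  # signal a missing link instead of raising IndexError
    return tags[n - 1]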
This might help: How can I retrieve the links of a webpage and copy the url address of the links using Python? Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']
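The same idea works on Python 3; here is a sketch substituting requests for httplib2 (my substitution, not the original answer's choice). SoupStrainer still restricts parsing to anchor tags that carry an href, which keeps memory use down on large pages:

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.nytimes.com')
only_links = SoupStrainer('a', href=True)  # parse nothing but <a href=...> tags
soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_links)
for link in soup.find_all('a'):
    print(link['href'])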
Python BeautifulSoup - Possible to get X number of links from page?
I hope this helps you. There is no need to use BeautifulSoup for this. The site is returning perfectly valid XML that can be parsed with Python's included tools:
import requests
import xml.etree.ElementTree as et

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)

for i in root:
    print i[0].text  # the <loc> text, the first child of each <url> element
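If you only want the <loc> values, ElementTree's iter() with the sitemap namespace is a little more robust to nesting. A Python 3 sketch (the namespace URI below is the standard sitemap one, but it is worth verifying against the actual feed):

import requests
import xml.etree.ElementTree as et

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for loc in root.iter(SITEMAP_NS + 'loc'):  # finds every <loc>, however deeply nested
    print(loc.text)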
I'm attempting to extract some links from craigslist using beautifulsoup but it's pulling the links 100 times rather than once
This seems to work fine. You don't need a nested loop here. Other notes/improvements:
- the opener.open() result can be passed directly to the BeautifulSoup constructor; there is no need for read()
- the opener can be defined once and reused in the loop to follow links
- use find_all() instead of findAll()
- use urljoin() for concatenating url parts
- use the csv module for writing the delimited data
- use the with context manager while dealing with files
import csv
import re
import time
import urllib2
from urlparse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'C:/Python27/Folder/Folder/Folder/craigstvs.txt'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL))

with open(FILENAME, 'a') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile("hdrlnk")):
        timeset = time.strftime("%m-%d %H:%M")
        item_url = urljoin(BASE_URL, link['href'])
        item_soup = BeautifulSoup(opener.open(item_url))
        # do smth with the item_soup? or why did you need to follow this link?
        writer.writerow([timeset, link.text, item_url])

Sample output:
08-10 16:56;Dell Inspiron-15 Laptop;http://omaha.craigslist.org/sys/4612666460.html
08-10 16:56;macbook 13 inch 160 gig wifi dvdrw ;http://omaha.craigslist.org/sys/4612480237.html
08-10 16:56;MAC mini intel core wifi dvdrw great cond ;http://omaha.craigslist.org/sys/4612480593.html
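For what it's worth, on Python 3 the same approach would use urllib.request and open the csv file with newline=''. This rewrite is my own sketch, not part of the original answer, and the craigslist markup may have changed since; the output path is hypothetical:

import csv
import re
import time
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'craigstvs.txt'  # hypothetical path

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL), 'html.parser')

# newline='' keeps the csv module from writing blank rows on Windows
with open(FILENAME, 'a', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile('hdrlnk')):
        timeset = time.strftime('%m-%d %H:%M')
        item_url = urljoin(BASE_URL, link['href'])
        writer.writerow([timeset, link.text, item_url])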