Retrieve links from a web page using Python and BeautifulSoup, then select the 3rd link and repeat 4 times


This should fix the issue. I was able to accomplish your homework in the following way (please take the time to learn from this):
import urllib.request
from bs4 import BeautifulSoup

# This function will get the Nth link object from the given url.
# To be safe you should make sure the nth link exists (this version does not).
def getNthLink(url, n):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    return tags[n - 1]

url = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html"

# This iterates 4 times, each time grabbing the 3rd link object
# from the page the previous link pointed to.
# For convenience it prints the url each time.
for i in range(4):
    tag = getNthLink(url, 3)
    url = tag.get('href')
    print(url)

# Finally, after 4 hops, we grab the content from the last tag
print(tag.contents[0])
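
As the comment above notes, tags[n - 1] raises an IndexError when the page has fewer than n links. A minimal bounds-checked variant (the None return convention is my own choice, not part of the assignment):

def getNthLinkSafe(url, n):
    # Return the nth <a> tag, or None if the page has fewer than n links.
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    if n < 1 or n > len(tags):
        return None
    return tags[n - 1]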

Tag : python , By : sayuki288
Date : March 29 2020, 07:55 AM
This might help you with "How can I retrieve the links of a webpage and copy the url addresses of the links using Python?". Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# parse_only restricts parsing to <a> tags, which is faster on large pages
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
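
If you'd rather not depend on httplib2, the same idea works with requests (a sketch, assuming requests is installed; the target url just repeats the example above):

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.nytimes.com')

# Only the <a> tags are parsed; everything else is skipped
for link in BeautifulSoup(response.text, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])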

Tag : python , By : CyberGreg
Date : March 29 2020, 07:55 AM
I hope this helps you with "Is it somehow possible to only get, let's say, the first 10 links from a page with BeautifulSoup?". Would something like this work:
# find_all() returns a list, so ordinary slicing keeps the first 10 links
for tag in soupan.find_all('a')[:10]:
    print(tag.get('href'))
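
As a self-contained sketch (http://example.com is a placeholder, and soupan just mirrors the variable name above):

import requests
from bs4 import BeautifulSoup

# Build the soup from a placeholder page
soupan = BeautifulSoup(requests.get('http://example.com').text, 'html.parser')

for tag in soupan.find_all('a')[:10]:
    print(tag.get('href'))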

Tag : python , By : Anna
Date : March 29 2020, 07:55 AM
I hope this helps you. There is no need to use BeautifulSoup for this. The site is returning perfectly valid XML that can be parsed with Python's included tools:
import requests
import xml.etree.ElementTree as et

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
    print(i[0].text)  # the <loc> text
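
Indexing i[0] relies on <loc> being the first child of each entry. To look the element up by name instead, note that sitemap files declare the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, so a sketch (assuming the file is a standard <urlset> sitemap) would be:

# Map a prefix to the standard sitemap namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for loc in root.findall('sm:url/sm:loc', ns):
    print(loc.text)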

Tag : python , By : S Hall
Date : March 29 2020, 07:55 AM
This seems to work fine. You don't need a nested loop here. Other notes/improvements:
  • the opener.open() result can be passed directly to the BeautifulSoup constructor, no need for read()
  • the opener can be defined once and reused in the loop to follow links
  • use find_all() instead of findAll()
  • use urljoin() for concatenating url parts
  • use the csv module for writing the delimited data
  • use the with context manager when dealing with files
import csv
import re
import time
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'C:/Python27/Folder/Folder/Folder/craigstvs.txt'

# Build the opener once, with a browser-like User-Agent, and reuse it
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL), 'html.parser')

with open(FILENAME, 'a', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile("hdrlnk")):
        timeset = time.strftime("%m-%d %H:%M")

        # Build an absolute url for the listing and follow it
        item_url = urljoin(BASE_URL, link['href'])
        item_soup = BeautifulSoup(opener.open(item_url), 'html.parser')

        # do something with the item_soup? or why did you need to follow this link?

        writer.writerow([timeset, link.text, item_url])
Example output (craigstvs.txt):
08-10 16:56;Dell Inspiron-15 Laptop;http://omaha.craigslist.org/sys/4612666460.html
08-10 16:56;computer????;http://omaha.craigslist.org/sys/4612637389.html
08-10 16:56;macbook 13 inch 160 gig wifi dvdrw;http://omaha.craigslist.org/sys/4612480237.html
08-10 16:56;MAC mini intel core wifi dvdrw great cond;http://omaha.craigslist.org/sys/4612480593.html
...

Tag : python , By : mansoor
Date : March 29 2020, 07:55 AM
This seems to work fine. The problem could be that the tags are only extracted once, from the initial document:
# Retrieve all of the anchor tags
tags = soup('a')
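
If that's the cause, re-fetch and re-parse the page on every iteration so the tag list always reflects the page you just navigated to. A minimal sketch, reusing the assignment url from the answer above (the loop details are my own):

import urllib.request
from bs4 import BeautifulSoup

url = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html"

for _ in range(4):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Re-extract the anchor tags from the freshly fetched page
    tags = soup('a')
    url = tags[2].get('href')  # the 3rd link is at zero-based index 2
    print(url)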