How do I scrape pdf and html from search results without obvious url

Tag : python , By : Martin
Date : January 11 2021, 03:34 PM

The page makes a POST request with the search term, and the server returns a response: an HTML page with the results.
This script goes through all result pages and prints every .pdf link it finds. The search term is stored in the variable search_term, which in this example is set to health:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nas.gov.sg/archivesonline/speeches/search-result'

search_term = 'health'

data = {
    'keywords': search_term,
    'search-type': 'basic',
    'keywords-type': 'all',
    'page-num': 1
}

soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')

cnt = 1
while True:

    print('Page no. {}'.format(cnt))
    print('-' * 80)

    # Print the href of every link that ends in .pdf
    for a in soup.select('a[href$=".pdf"]'):
        print(a['href'])

    # Request the next page while a "next" control is present; otherwise stop
    if soup.select_one('span.next-10'):
        data['page-num'] += 10
        cnt += 1
        soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
    else:
        break

Output:

Page no. 1

...and so on.
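
The links on a results page may be relative, so they usually need to be resolved against the page URL before downloading. Below is a minimal sketch of that step. It swaps BeautifulSoup for the standard library's html.parser so it runs without third-party packages; the class PdfLinkParser, the function pdf_links, and the sample HTML are hypothetical and not taken from the real site. Unlike the CSS selector above, it matches .pdf case-insensitively.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags that end in .pdf (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.lower().endswith('.pdf'):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def pdf_links(html, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up result snippet, used only to exercise the parser
sample = '''
<a href="/docs/report.pdf">Report</a>
<a href="about.html">About</a>
<a href="http://example.com/files/talk.PDF">Talk</a>
'''
base = 'http://www.nas.gov.sg/archivesonline/speeches/search-result'

for link in pdf_links(sample, base):
    print(link)
```

Each absolute URL could then be fetched with requests.get and its content written to a file opened in binary mode.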

Store scrape results and search in results with Python and Pandas?

Tag : python-3.x , By : fayoh
Date : March 29 2020, 07:55 AM
You can download the content of the URLs and save each one to a separate file in a directory (e.g. 'links'):
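import os
import requests

def get_link(url):
    # Build a file name from the URL by replacing '/' and ':' with '_'
    file_name = os.path.join('/path/to/links', url.replace('/', '_').replace(':', '_'))
    try:
        r = requests.get(url)
    except Exception as e:
        print("Failed to get " + url)
    else:
        # Save the page only if the download succeeded
        with open(file_name, 'w') as f:
            f.write(r.text)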
A second function then checks whether a saved file contains any of the keywords:
def contains_keywords(link, keywords):
    file_name = os.path.join('/path/to/links', link.replace('/', '_').replace(':', '_'))
    try:
        with open(file_name) as f: 
            output = f.read()
        return int(any(x in output for x in keywords))
    except Exception as e:
        print("Can't access file: {}\n{}".format(file_name, e))
        return "Wrong/Missing URL"
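
To see the file naming and the keyword check working together without touching the network, here is a small sketch under stated assumptions: the helper names url_to_file_name and contains_any, the example URLs, and the page texts are all made up for illustration, and a temporary directory stands in for '/path/to/links'.

```python
import os
import tempfile


def url_to_file_name(directory, url):
    # Same naming rule as above: '/' and ':' become '_'
    return os.path.join(directory, url.replace('/', '_').replace(':', '_'))


def contains_any(text, keywords):
    # 1 if at least one keyword occurs in the text, else 0
    return int(any(k in text for k in keywords))


with tempfile.TemporaryDirectory() as links_dir:
    # Hypothetical pages standing in for downloaded content
    pages = {
        'http://example.com/a': 'Public health report for 2020',
        'http://example.com/b': 'Minutes of the annual meeting',
    }
    for url, text in pages.items():
        with open(url_to_file_name(links_dir, url), 'w') as f:
            f.write(text)

    # Read each saved file back and flag it if a keyword is present
    results = {}
    for url in pages:
        with open(url_to_file_name(links_dir, url)) as f:
            results[url] = contains_any(f.read(), ['health', 'medicine'])

print(results)
```

With pandas, a column of links could be run through a function like contains_keywords using Series.apply, so that the flag ends up as a new column next to each link.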

Beautifulsoup - scrape search results

Tag : python , By : walshtp
Date : March 29 2020, 07:55 AM
I am new to BeautifulSoup and am trying to learn how to scrape search results from websites.
import requests
from bs4 import BeautifulSoup

library_list = []

data = {'action' : 'LibSearch', 'termtype' : 'Keyword', 'libstate' : 'NSW', 'dosearch' : 'Search', 'libtype' : 'All', 'chunk' : 20}

page = requests.get("http://www.nla.gov.au/apps/libraries/", params=data)
soup = BeautifulSoup(page.content, 'html.parser')

libraries = soup.find_all("a")

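# Skip the first five links and print the text of the rest
# (the loop body is missing in the source; printing the link text is an assumption)
for library in libraries[5:]:
    print(library.get_text(strip=True))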

Output:

Design Centre Enmore Library
Sydney Institute

A.B. 'Banjo' Paterson Library
Sydney Grammar School

ANSTO Library
Australian Nuclear Science and Technology Organisation

Cannot click search result elements after submitting HTML web form with embedded results table - VBA web scrape

Tag : html , By : David B
Date : March 29 2020, 07:55 AM
It's an aspx page. You can perform the same GET and POST requests it makes, in a simplified form. I use the clipboard to write out sample tables; you can amend this as you choose.
Option Explicit

Public Sub GetPropertyInfo()
    Dim html As MSHTML.HTMLDocument, xhr As Object

    Application.ScreenUpdating = False

    Set html = New MSHTML.HTMLDocument
    Set xhr = CreateObject("MSXML2.ServerXMLHTTP")

    Dim body As String, propertyId As String

    propertyId = "R000001972"

    With xhr
        .Open "GET", "http://iswdataclient.azurewebsites.net/webSearchID.aspx?dbkey=parkercad&stype=id&sdata=" & propertyId, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
        If html.querySelectorAll("#dvPrimary table tr").Length <= 1 Then Exit Sub
        body = GetPostBody(html, propertyId)
        .Open "POST", "http://iswdataclient.azurewebsites.net/webProperty.aspx?dbkey=parkercad&stype=id&sdata=" _
                   & propertyId & "&id=" & propertyId, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send body
        html.body.innerHTML = .responseText
    End With

    Dim ws As Worksheet, clipboard As Object, i As Long

    Set ws = ThisWorkbook.Worksheets(1)
    Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")

    With ws.Cells
        .ClearContents   ' assumed: clear earlier output (the block body is missing in the source)
    End With

    With html.querySelectorAll("table")
        For i = 8 To .Length - 1
            clipboard.SetText .Item(i).outerHTML
            clipboard.PutInClipboard
            ws.Range("A" & GetLastRow(ws) + 2).PasteSpecial
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetPostBody(ByVal html As MSHTML.HTMLDocument, ByVal propertyId As String) As String
    Dim i As Long, result As String

    With html.querySelectorAll("input[type=hidden]")
        For i = 0 To .Length - 1
            result = result & .Item(i).ID & "=" & .Item(i).Value & "&"
        Next
    End With
    result = result & "__EVENTTARGET=ucResultsGrid$" & propertyId
    GetPostBody = result
End Function

Public Function GetLastRow(ByVal sh As Worksheet) As Long
    On Error Resume Next
    GetLastRow = sh.Cells.Find(What:="*", _
                               After:=sh.Range("A1"), _
                               Lookat:=xlPart, _
                               LookIn:=xlFormulas, _
                               SearchOrder:=xlByRows, _
                               SearchDirection:=xlPrevious, _
                               MatchCase:=False).Row
    On Error GoTo 0
End Function
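
For readers following the Python answers on this page, the idea behind GetPostBody can be sketched in Python too: collect the hidden form inputs from the first response and send them back along with __EVENTTARGET. This is only an illustration under assumptions. The class HiddenInputParser, the function get_post_body, and the sample form are hypothetical; the sketch keys fields by the name attribute (falling back to id) and URL-encodes the values, which the VBA above does not do.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect (name, value) pairs of hidden <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and (attrs.get('type') or '').lower() == 'hidden':
            # Prefer the name attribute, fall back to id
            name = attrs.get('name') or attrs.get('id')
            if name:
                self.fields.append((name, attrs.get('value') or ''))


def get_post_body(html, property_id):
    parser = HiddenInputParser()
    parser.feed(html)
    # Append the event target, mirroring the VBA function above
    fields = parser.fields + [('__EVENTTARGET', 'ucResultsGrid$' + property_id)]
    return urlencode(fields)


# Made-up form, used only to exercise the parser
sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc/123=">
  <input type="text" name="q" value="ignored">
</form>
'''

print(get_post_body(sample, 'R000001972'))
```

The resulting string could be passed as the body of a POST request with a Content-Type of application/x-www-form-urlencoded.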

Google search: Scrape results page in PHP for total results

Tag : php , By : Mark
Date : January 02 2021, 06:48 AM

Scrape Google-Search results page in PHP for total results and parse them

Tag : php , By : jonagh
Date : March 29 2020, 07:55 AM