
How do I scrape pdf and html from search results without obvious url


Tag : python , By : Martin
Date : January 11 2021, 03:34 PM

The page sends a POST request containing the search term, and the server responds with an HTML page of results.
This script walks through every results page and prints each .pdf link it finds. The search term is stored in the variable search_term, which in this example is set to health:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nas.gov.sg/archivesonline/speeches/search-result'

search_term = 'health'

data = {
    'keywords': search_term,
    'search-type': 'basic',
    'keywords-type': 'all',
    'page-num': 1
}

soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')

cnt = 1
while True:

    print()
    print('Page no. {}'.format(cnt))
    print('-' * 80)

    for a in soup.select('a[href$=".pdf"]'):
        print(a['href'])

    if soup.select_one('span.next-10'):
        data['page-num'] += 10
        cnt += 1
        soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
    else:
        break
Page no. 1
--------------------------------------------------------------------------------
http://www.nas.gov.sg/archivesonline/data/pdfdoc/20160727009/Speech%20for%20WSHC%20Chairman%20for%20WSH%20Awards%202016.pdf
http://www.nas.gov.sg/archivesonline/data/pdfdoc/20160727009/Annex%20A%20-%20Factsheet%20on%20WSH%20Awards%202016.pdf
http://www.nas.gov.sg/archivesonline/data/pdfdoc/20160727009/Annex%20B%20-%20Factsheet%20on%20Train-the-Trainer%20programme.pdf

...and so on.
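To save the files instead of only printing the links, each URL can be downloaded in turn. The following is a minimal sketch using only the standard library; `pdf_filename`, `download_pdf` and the `pdfs` output directory are hypothetical names chosen for illustration, not part of the original answer.

```python
import os
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def pdf_filename(url):
    """Derive a local file name from the last path segment of a URL."""
    path = urlsplit(url).path
    # Decode percent-escapes such as %20 so the saved name is readable.
    return unquote(path.rsplit('/', 1)[-1])


def download_pdf(url, out_dir='pdfs'):
    """Download one PDF into out_dir and return the path it was saved to."""
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, pdf_filename(url))
    with urlopen(url) as response, open(target, 'wb') as f:
        f.write(response.read())
    return target
```

Inside the loop above, `print(a['href'])` could then be replaced by `download_pdf(a['href'])`, assuming the links are absolute URLs as in the sample output.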


Store scrape results and search in results with Python and Pandas?


Tag : python-3.x , By : fayoh
Date : March 29 2020, 07:55 AM
You can download the content of each URL and save it as a separate file in a directory (e.g. 'links'). A second function can then check whether a saved file contains any of the given keywords:
import os
import requests

def get_link(url):
    # Build a flat file name from the URL by replacing '/' and ':'.
    file_name = os.path.join('/path/to/links', url.replace('/', '_').replace(':', '_'))
    try:
        r = requests.get(url)
    except Exception as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

def contains_keywords(link, keywords):
    # Look up the file saved by get_link() for this link.
    file_name = os.path.join('/path/to/links', link.replace('/', '_').replace(':', '_'))
    try:
        with open(file_name) as f:
            output = f.read()
        # 1 if any keyword occurs in the saved page, otherwise 0.
        return int(any(x in output for x in keywords))
    except Exception as e:
        print("Can't access file: {}\n{}".format(file_name, e))
        return "Wrong/Missing URL"
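
Both functions build the file name in the same way, so that step can be factored into one helper. The sketch below is one possible arrangement, not the original author's code: `link_to_path` and the `directory` parameter are hypothetical, and a temporary directory stands in for '/path/to/links' so the example can run without network access.

```python
import os
import tempfile


def link_to_path(link, directory):
    """Map a URL to the file path under which its content is stored."""
    return os.path.join(directory, link.replace('/', '_').replace(':', '_'))


def contains_keywords(link, keywords, directory):
    """Return True/False for a stored page, or None if the file is missing."""
    try:
        with open(link_to_path(link, directory), encoding='utf-8') as f:
            text = f.read()
    except OSError:
        return None
    return any(keyword in text for keyword in keywords)


# Small demonstration with a page saved by hand instead of downloaded.
with tempfile.TemporaryDirectory() as tmp:
    link = 'http://example.com/page'  # placeholder URL
    with open(link_to_path(link, tmp), 'w', encoding='utf-8') as f:
        f.write('public health policy')
    print(contains_keywords(link, ['health'], tmp))
```

Returning None for a missing file, rather than a string, keeps the result column easier to filter if the values are later placed in a pandas DataFrame.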

Beautifulsoup - scrape search results


Tag : python , By : walshtp
Date : March 29 2020, 07:55 AM
I am new to BeautifulSoup and I am trying to learn how to scrape search results from websites.
import requests
from bs4 import BeautifulSoup

library_list = []

data = {'action' : 'LibSearch', 'termtype' : 'Keyword', 'libstate' : 'NSW', 'dosearch' : 'Search', 'libtype' : 'All', 'chunk' : 20}

page = requests.get("http://www.nla.gov.au/apps/libraries/", params=data)
soup = BeautifulSoup(page.content, 'html.parser')


libraries = soup.find_all("a")


for library in libraries[5:]:
    print(library.text)
    library_list.append(library.text)
Design Centre Enmore Library
Sydney Institute

A.B. 'Banjo' Paterson Library
Sydney Grammar School
.
.

ANSTO Library
Australian Nuclear Science and Technology Organisation

.
.
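
Because the search is sent with requests.get and params=data, the search values travel in the URL's query string. The sketch below uses the standard library's urlencode to show roughly what that URL looks like; it is an approximation for illustration and does not call requests itself.

```python
from urllib.parse import urlencode

base_url = 'http://www.nla.gov.au/apps/libraries/'
data = {
    'action': 'LibSearch',
    'termtype': 'Keyword',
    'libstate': 'NSW',
    'dosearch': 'Search',
    'libtype': 'All',
    'chunk': 20,
}

# Join the base URL and the encoded parameters into one address.
search_url = base_url + '?' + urlencode(data)
print(search_url)
```

Pasting the printed URL into a browser is a quick way to check that the parameters return the page the script expects to parse.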

Cannot click search result elements after submitting HTML web form with embedded results table - VBA web scrape


Tag : html , By : David B
Date : March 29 2020, 07:55 AM
It's an aspx page. You can perform the same GET and POST requests that the page makes, in a simplified form. I use the clipboard to write the sample tables out to the worksheet; you can amend this as you choose.
Option Explicit

Public Sub GetPropertyInfo()
    Dim html As MSHTML.HTMLDocument, xhr As Object

    Application.ScreenUpdating = False

    Set html = New MSHTML.HTMLDocument
    Set xhr = CreateObject("MSXML2.ServerXMLHTTP")

    Dim body As String, propertyId As String

    propertyId = "R000001972"

    With xhr
        .Open "GET", "http://iswdataclient.azurewebsites.net/webSearchID.aspx?dbkey=parkercad&stype=id&sdata=" & propertyId, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
        If html.querySelectorAll("#dvPrimary table tr").Length <= 1 Then Exit Sub
        body = GetPostBody(html, propertyId)
        .Open "POST", "http://iswdataclient.azurewebsites.net/webProperty.aspx?dbkey=parkercad&stype=id&sdata=" _
                   & propertyId & "&id=" & propertyId, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send body
        html.body.innerHTML = .responseText
    End With

    Dim ws As Worksheet, clipboard As Object, i As Long

    Set ws = ThisWorkbook.Worksheets(1)
    Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")

    With ws.Cells
        .ClearContents
        .ClearFormats
    End With

    With html.querySelectorAll("table")
        For i = 8 To .Length - 1
            clipboard.SetText .Item(i).outerHTML
            clipboard.PutInClipboard
            ws.Range("A" & GetLastRow(ws) + 2).PasteSpecial
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetPostBody(ByVal html As MSHTML.HTMLDocument, ByVal propertyId As String) As String
    Dim i As Long, result As String

    With html.querySelectorAll("input[type=hidden]")
        For i = 0 To .Length - 1
            result = result & .Item(i).ID & "=" & .Item(i).Value & "&"
        Next
    End With
    result = result & "__EVENTTARGET=ucResultsGrid$" & propertyId
    GetPostBody = result
End Function

Public Function GetLastRow(ByVal sh As Worksheet) As Long
    On Error Resume Next
    GetLastRow = sh.Cells.Find(What:="*", _
                               After:=sh.Range("A1"), _
                               Lookat:=xlPart, _
                               LookIn:=xlFormulas, _
                               SearchOrder:=xlByRows, _
                               SearchDirection:=xlPrevious, _
                               MatchCase:=False).Row
    On Error GoTo 0
End Function
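
For readers working in Python, the idea behind GetPostBody (collect the hidden form fields, then add __EVENTTARGET) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not a port of the VBA above: `HiddenInputCollector` and `build_post_body` are hypothetical names, the fields are keyed by their name attribute (falling back to id), and the values are URL-encoded, which the VBA version does not do.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and (attrs.get('type') or '').lower() == 'hidden':
            name = attrs.get('name') or attrs.get('id')
            if name:
                self.fields[name] = attrs.get('value') or ''


def build_post_body(html, property_id):
    """Return a URL-encoded POST body for the results-grid postback."""
    collector = HiddenInputCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    # Same event target as in the VBA code above.
    fields['__EVENTTARGET'] = 'ucResultsGrid$' + property_id
    return urlencode(fields)
```

The returned string would be sent as the body of the POST request, typically with a Content-Type of application/x-www-form-urlencoded.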

Google search: Scrape results page in PHP for total results


Tag : php , By : Mark
Date : January 02 2021, 06:48 AM

Scrape Google-Search results page in PHP for total results and parse them


Tag : php , By : jonagh
Date : March 29 2020, 07:55 AM