
How do I scrape data from web pages using BeautifulSoup in Python?


Tag : python , By : MJRider
Date : January 11 2021, 05:14 PM

The problem is in the line movies = list(name.text): calling list() on a string creates a list in which each item is a single character of name.text.
Instead of list(), use a list comprehension: movies = [name.text for name in names.find_all('a')]:
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.imdb.com/chart/top'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
movie = soup.find_all(class_='titleColumn')

for names in movie:
    movies = [name.text for name in names.find_all('a')]
    # print(movies)

    # append this row to the CSV; the with block closes the file automatically
    with open('TopMovies.csv', 'a', newline='') as csvFile:
        writer = csv.writer(csvFile, delimiter=' ')
        writer.writerow(movies)
    print(movies)

print("Successfully inserted")
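The bug the answer describes can be seen in isolation, with no network access; the movie title below is just a made-up sample string:

```python
# list() on a string yields one item per CHARACTER, which is what the
# original movies = list(name.text) did; a list comprehension over the
# anchor tags keeps whole strings instead.
title = "The Shawshank Redemption"

wrong = list(title)           # one entry per character
right = [t for t in [title]]  # one entry per string

print(wrong[:3])  # ['T', 'h', 'e']
print(right)      # ['The Shawshank Redemption']
```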


xbmc/kodi python scrape data using BeautifulSoup


Tag : python , By : Kiltec
Date : March 29 2020, 07:55 AM
Hope that helps. The question asks how to convert a Kodi addon that uses re.compile to scrape data over to BeautifulSoup4; the original markup looks like this:
from bs4 import BeautifulSoup

html = """<div id="content">
  <span class="someclass">
    <span class="sec">
      <a class="frame" href="http://somlink.com/section/name-here" title="name here">
         <img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
      </a>
    </span>
    <h3 class="title">
        <a href="http://somlink.com/section/name-here">name here</a>
    </h3>
    <span class="details"><span class="length">Length: 99:99</span>
 </span>
</div>
"""

soup = BeautifulSoup(html, "lxml")
sec = soup.find("span", {"class": "someclass"})
# get a tag with frame class
fr = sec.find("a", {"class": "frame"})

# pull img src and href from the a/frame
url, img = fr["href"], fr.find("img")["src"]

# get h3 with title class and extract the text from the anchor
name =  sec.select("h3.title a")[0].text

# "size" is in the span with the details class; split(None, 1) already
# strips the "Length:" label, so only a strip() is needed when printing
size = sec.select("span.details")[0].text.split(None, 1)[-1]


print(url, img, name.strip(), size.strip())
# output: ('http://somlink.com/section/name-here', 'http://www.somlink.com/thumb/imgsection/thumbnail.jpg', u'name here', u'99:99')
def secs():
    # generator version: the same extraction applied to every "someclass" section
    soup = BeautifulSoup(html, "lxml")
    sections = soup.find_all("span", {"class": "someclass"})
    for sec in sections:
        fr = sec.find("a", {"class": "frame"})
        url, img = fr["href"], fr.find("img")["src"]
        name = sec.select("h3.title a")[0].text
        size = sec.select("span.details")[0].text.split(None, 1)[-1]
        yield url, name, img, size
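The size extraction above leans on str.split with a None separator, which splits on any run of whitespace. That step can be checked on its own; the details text below is a made-up stand-in for the span's .text:

```python
# The details span renders as "Length: 99:99" surrounded by whitespace.
# split(None, 1) splits once on the first whitespace run, so the last
# element is the bare value with the "Length:" label removed.
details_text = "  Length: 99:99\n "
size = details_text.split(None, 1)[-1]
print(size.strip())  # 99:99
```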

Scrape the article with Python 3.4 and BeautifulSoup, Requests


Tag : python , By : gbodunski
Date : March 29 2020, 07:55 AM
I hope this helps you. The desired data is not actually located inside the element with the status-list class. If you inspect the source, you will find an empty container instead:
<div class="status_bd">
    <div id="statusLists" class="allStatuses no-head"></div>
</div>
import json
import re
import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'
}
r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')

pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

data = json.loads(pattern.search(script.text).group(1))
for item in data["statuses"]:
    print(item["description"])
The best advice: Remember common courtesy and act toward others as you want them to act toward you.
Lighten up! It's the weekend. we're just having a little fun! Industrial Bank is expected to rise,next week...
...
点.点.点... 点到这个,学位、学历、成绩单翻译一下要50块、100块的...
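The core trick, pulling JSON out of an inline script block with a regex, can be demonstrated without any network access. The page snippet below is a hypothetical stand-in for how xueqiu.com embeds its data:

```python
import json
import re

# Hypothetical page source embedding data the same way the real site
# does: a <script> assigns a JSON object to SNB.data.statuses.
html = '<script>SNB.data.statuses = {"statuses": [{"description": "hello"}]};</script>'

# Non-greedy match up to the closing "};" captures just the JSON object
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
data = json.loads(pattern.search(html).group(1))
print(data["statuses"][0]["description"])  # hello
```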

Scrape data from a website to a csv file format using python and beautifulsoup


Tag : python , By : user169463
Date : March 29 2020, 07:55 AM
Hope that helps. This should get you started; I'll break it down a bit so you can modify and experiment while you're learning. I'm also suggesting Pandas, as it's a popular library for data manipulation and you'll be using it in the near future if you're not already.
I first initialize a results dataframe to store all the data you'll be parsing:
import bs4
import requests
import pandas as pd

results = pd.DataFrame()
my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'

response = requests.get(my_url)
html = response.text

soup = bs4.BeautifulSoup(html, 'html.parser')
Container_Main = soup.find_all("div",{"class":"item-container"})
for container in Container_Main:
    ...  # per-container parsing goes here; the full version follows

results.to_csv('path/file.csv', index=False)
import bs4
import requests
import pandas as pd

results = pd.DataFrame()

my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'

response = requests.get(my_url)
html = response.text

soup = bs4.BeautifulSoup(html, 'html.parser')

Container_Main = soup.find_all("div",{"class":"item-container"})
for container in Container_Main:

    item_features = container.find("ul",{"class":"item-features"})

    # if there are no item-fetures, move on to the next container
    if item_features == None:
        continue

    temp_df = pd.DataFrame(index=[0])
    features_list = item_features.find_all('li')
    for feature in features_list:
        split_str = feature.text.split(':')        
        header = split_str[0]
        data = split_str[1].strip()
        temp_df[header] = data

    promo = container.find_all("p",{"class":"item-promo"})[0].text
    temp_df['promo'] = promo

    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    results = pd.concat([results, temp_df], sort=False).reset_index(drop=True)


results.to_csv('path/file.csv', index=False)
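The per-item parsing step can be tested on its own; the feature strings below are invented examples. str.partition is used here instead of split(':') so a value that itself contains a colon survives intact:

```python
# Each <li> text looks like "Header: value"; build one row dict per item,
# using the text before the first colon as the column name.
features = ["Model: GV-N1660OC-6GD", "Max Resolution: 7680 x 4320"]

row = {}
for feature in features:
    header, _, value = feature.partition(":")
    row[header] = value.strip()

print(row)  # {'Model': 'GV-N1660OC-6GD', 'Max Resolution': '7680 x 4320'}
```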

scrape top 100 job results from Indeed using BeautifulSoup python


Tag : python , By : stu73
Date : March 29 2020, 07:55 AM
This should help you out. Fetch the results in batches of 10 by changing the start value in the URL; loop, incrementing start by 10 each time:
https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru%2C+Karnataka&start=0
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
results = []
url = 'https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start={}'
with requests.Session() as s:
    for page in range(5):
        res = s.get(url.format(10 * page))  # start steps through 0, 10, 20, ... one results page at a time
        soup = bs(res.content, 'lxml')
        titles = [item.text.strip() for item in soup.select('[data-tn-element=jobTitle]')]
        companies = [item.text.strip() for item in soup.select('.company')]
        data = list(zip(titles, companies))
        results.append(data)
newList = [item for sublist in results for item in sublist]
df = pd.DataFrame(newList)
df.to_json(r'C:\Users\User\Desktop\data.json')
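The flattening step at the end works on plain lists, so it can be shown with toy data; the rows below are placeholders for real (title, company) pairs:

```python
# results holds one list of (title, company) tuples per page; the nested
# comprehension merges the per-page lists into a single flat list.
results = [
    [("Dev I", "Acme"), ("Dev II", "Acme")],  # page 0
    [("Engineer", "Globex")],                 # page 1
]
newList = [item for sublist in results for item in sublist]
print(newList)  # [('Dev I', 'Acme'), ('Dev II', 'Acme'), ('Engineer', 'Globex')]
```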

how to scrape the data from the url in python using beautifulsoup


Tag : python-3.x , By : itsmegb
Date : March 29 2020, 07:55 AM
This should help you fix your problem. Using the following url:
https://desiopt.com/search-results-jobs/?action=search&page=&listings_per_page=&view=list
import requests
from bs4 import BeautifulSoup
import csv

links = []
try:
    for page in range(1, 372):
        print(f"Extraction Page# {page}")
        r = requests.get(
            f"https://desiopt.com/search-results-jobs/?action=search&page={page}&listings_per_page=100&view=list")
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'html.parser')
            for item in soup.findAll('span', attrs={'class': 'captions-field'}):
                for a in item.findAll('a'):
                    a = a.get('href')
                    if a not in links:
                        links.append(a)
except KeyboardInterrupt:
    print("Good Bye!")
    exit()

data = []
try:
    for link in links:
        r = requests.get(link)
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'html.parser')
            for item in soup.findAll('div', attrs={'class': 'compProfileInfo'}):
                a = [a.text.strip() for a in item.findAll('span')]
                if a[6] == '':
                    a[6] = 'N/A'
                data.append(a[0:7:2])
except KeyboardInterrupt:
    print("Good Bye!")
    exit()

while True:
    try:
        with open('output.csv', 'w+', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(['Name', 'Phone', 'Email', 'Website'])
            writer.writerows(data)
            print("Operation Completed")
    except PermissionError:
        print("Please Close The File")
        continue
    except KeyboardInterrupt:
        print("Good Bye")
        exit()
    break
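The slice data.append(a[0:7:2]) is doing the column selection: of the seven span texts, it keeps every second one, indices 0, 2, 4, and 6, matching the Name/Phone/Email/Website CSV header. The values below are placeholders for real profile spans:

```python
# Seven collected span texts alternate wanted value / unwanted label;
# a step of 2 over indices 0..6 keeps only the four wanted fields.
a = ["Acme Corp", "x", "555-0100", "x", "jobs@acme.test", "x", "acme.test"]
print(a[0:7:2])  # ['Acme Corp', '555-0100', 'jobs@acme.test', 'acme.test']
```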
    Privacy Policy - Terms - Contact Us © scrbit.com