How do I create a dataframe of jobs and companies that includes hyperlinks?


Tag : python , By : Dennizzz
Date : January 11 2021, 05:14 PM

I would use a regex to extract the required info from the returned HTML and construct each URL from the parameters the page's JavaScript uses to build the URLs dynamically. Interestingly, the total number of listings differs between requests and a browser. You can manually enter the number of listings, e.g. 6175 (currently), or use the number returned by the request (which is lower, so you miss some results). You could also use selenium to get the correct initial result count. You can then issue requests with offsets to get all listings.
Listings can be randomized in terms of ordering.
https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&limit=50&start=0
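
The offsets mentioned above can be sketched as follows. This is a minimal illustration, assuming 10 listings per page and a zero-based `start` parameter; `page_offsets` and `BASE_URL` are hypothetical names introduced for this example only.

```python
import math

# Assumed URL template; the start parameter is taken to be a zero-based offset.
BASE_URL = "https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&start={}"


def page_offsets(number_of_listings, listings_per_page=10):
    """Return the start offset of every results page (hypothetical helper)."""
    number_of_pages = math.ceil(number_of_listings / listings_per_page)
    return [listings_per_page * page for page in range(number_of_pages)]


offsets = page_offsets(25)                       # three pages for 25 listings
urls = [BASE_URL.format(offset) for offset in offsets]
for url in urls:
    print(url)
```

The first page starts at offset 0, so page n starts at 10*(n-1).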
import requests, re, hjson, math
import pandas as pd
from bs4 import BeautifulSoup as bs

p = re.compile(r"jobmap\[\d+\]= ({.*?})")
p1 = re.compile(r"var searchUID = '(.*?)';") 
counter = 0 
final = {}

with requests.Session() as s:
    r = s.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
    soup = bs(r.content, 'lxml')
    tk = p1.findall(r.text)[0] 
    listings_per_page = 10
    number_of_listings = int(soup.select_one('[name=description]')['content'].split(' ')[0].replace(',',''))
    #number_of_pages = math.ceil(number_of_listings/listings_per_page)
    number_of_pages =  math.ceil(6175/listings_per_page) #manually calculated
    for page in range(1, number_of_pages + 1):
        if page > 1:
            # page n starts at offset 10*(n-1): 10, 20, 30, ...
            r = s.get('https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&start={}'.format(10*(page-1)))
            soup = bs(r.content, 'lxml')
            tk = p1.findall(r.text)[0] 

        for item in p.findall(r.text):
            data = hjson.loads(item)
            jk = data['jk']
            row = {'title' : data['title']
               ,'company' : data['cmp']
               ,'url' : f'https://www.indeed.com/viewjob?jk={jk}&tk={tk}&from=serp&vjs=3'
              }
            final[counter] = row
            counter+=1

df = pd.DataFrame(final)
output_df = df.T
output_df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )

Using selenium to get the initial result count:
import requests, re, hjson, math
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
# Selenium 3 style; Selenium 4 expects the driver path via a Service object
d = webdriver.Chrome(r'C:\Users\HarrisQ\Documents\chromedriver.exe', options = options)
d.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
# the meta description starts with the total number of listings
number_of_listings = int(d.find_element(By.CSS_SELECTOR, '[name=description]').get_attribute('content').split(' ')[0].replace(',',''))
d.quit()

p = re.compile(r"jobmap\[\d+\]= ({.*?})")
p1 = re.compile(r"var searchUID = '(.*?)';") 
counter = 0 
final = {}

with requests.Session() as s:
    r = s.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
    soup = bs(r.content, 'lxml')
    tk = p1.findall(r.text)[0] 
    listings_per_page = 10
    number_of_pages = math.ceil(number_of_listings/listings_per_page) # count obtained via selenium above
    for page in range(1, number_of_pages + 1):
        if page > 1:
            # page n starts at offset 10*(n-1): 10, 20, 30, ...
            r = s.get('https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&start={}'.format(10*(page-1)))
            soup = bs(r.content, 'lxml')
            tk = p1.findall(r.text)[0] 

        for item in p.findall(r.text):
            data = hjson.loads(item)
            jk = data['jk']
            row = {'title' : data['title']
               ,'company' : data['cmp']
               ,'url' : f'https://www.indeed.com/viewjob?jk={jk}&tk={tk}&from=serp&vjs=3'
              }
            final[counter] = row
            counter+=1

df = pd.DataFrame(final)
output_df = df.T
output_df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
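
The extraction step can be tried offline on a small sample. This is a sketch only: `sample_html` is a made-up fragment imitating the inline JavaScript the regexes above target, and `field_pattern` is a hypothetical stand-in for `hjson.loads` that works only for simple quoted values. Building the DataFrame from a list of dicts also avoids the transpose.

```python
import re

import pandas as pd

# Made-up sample; the real page content may differ.
sample_html = """
var searchUID = 'tok123';
jobmap[0]= {jk:'aaa111',cmp:'Acme',title:'Software Developer'};
jobmap[1]= {jk:'bbb222',cmp:'Globex',title:'Backend Engineer'};
"""

job_pattern = re.compile(r"jobmap\[\d+\]= ({.*?})")
token_pattern = re.compile(r"var searchUID = '(.*?)';")
field_pattern = re.compile(r"(\w+):'(.*?)'")  # simple key:'value' pairs only

tk = token_pattern.search(sample_html).group(1)

rows = []
for item in job_pattern.findall(sample_html):
    data = dict(field_pattern.findall(item))
    rows.append({
        "title": data["title"],
        "company": data["cmp"],
        # URL built from the job key and the search token, as in the answer
        "url": f"https://www.indeed.com/viewjob?jk={data['jk']}&tk={tk}&from=serp&vjs=3",
    })

df = pd.DataFrame(rows)  # one row per job, columns: title, company, url
print(df)
```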


Tag : html , By : user98832
Date : March 29 2020, 07:55 AM
I am struggling to find the right label for the hyperlinks, currently called Companies. Right now it looks like the table below. Companies is a hyperlink to a page that shows all those companies (in this table, either 1, 12 or 26 companies). One option is this:
Industry_Name | Companies
-------------------------
Accounting    |   (1) 
Advertising   |   (1)  
Art           |   (1)  
Assets        |   (1)  
Audio         |   (12)  
Causes        |   (1)  
Clubs         |   (1)  
Consulting    |   (26)

Industry
-------------------------
Accounting (1) [details] 
Advertising (1) [details]
Art (1) [details]
Assets (1) [details]
Audio (12) [details]
Causes (1) [details]
Clubs (1) [details]
Consulting (26) [details]

Industry_Name |               Companies
------------------------------------------------------
Accounting    | [ABC Corp]
Advertising   | [AdCo]  
Art           | [ArtOGram]  
Assets        | [StuffCo]  
Audio         | [SoundCo, Bass Inc, and 10 others]  
Causes        | [WeHelp]  
Clubs         | [ClubNet]  
Consulting    | [Foo Cons, Baz Cons, and 24 others]

Why doesn't this Rails SQL query work: companies.status = 'active' AND (companies.status_override = '' OR companies.stat


Tag : mysql , By : mtnmuncher
Date : March 29 2020, 07:55 AM
Without knowing the DB structure, the = NULL is suspect. Try IS NULL instead.
Arithmetic comparison with NULL doesn't behave as you might expect; more details can be found in the MySQL Reference.
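
A small way to see the difference, sketched with Python's built-in sqlite3 rather than MySQL (the NULL comparison rule shown here is standard SQL and applies to both). The table layout and rows are hypothetical, guessed from the column names in the question title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema based on the column names in the question.
conn.execute("CREATE TABLE companies (name TEXT, status TEXT, status_override TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?, ?)",
    [("Acme", "active", None), ("Globex", "active", ""), ("Initech", "active", "paused")],
)

# "= NULL" never evaluates to true, so no row matches.
eq_null = conn.execute(
    "SELECT name FROM companies WHERE status_override = NULL"
).fetchall()

# "IS NULL" is the correct test for missing values.
is_null = conn.execute(
    "SELECT name FROM companies "
    "WHERE status = 'active' AND (status_override = '' OR status_override IS NULL) "
    "ORDER BY name"
).fetchall()

print(eq_null)
print(is_null)
conn.close()
```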

Tag : python , By : ranja
Date : March 29 2020, 07:55 AM
I don't think so. The HTMLFormatter used by DataFrame.to_html helps to pretty-render a DataFrame in an IPython HTML notebook, I think.
The method does not parse each element of your DataFrame, i.e. it does not recognize a URI pattern and write it as a link or anything else.
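
Since the cells are not parsed, one workaround is to write the anchor tags yourself and turn off escaping. This is a minimal sketch, not part of the original answer: the data and the `make_link` helper are hypothetical, while `escape=False` is an existing `to_html` parameter.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "title": ["Dev", "Ops"],
    "url": ["https://a.example/1", "https://a.example/2"],
})


def make_link(url, text):
    """Wrap text in an HTML anchor pointing at url (hypothetical helper)."""
    return f'<a href="{url}">{text}</a>'


# Replace the plain titles with anchor tags, then render without escaping
# so the tags reach the output as real HTML instead of &lt;a ...&gt;.
df["title"] = [make_link(u, t) for u, t in zip(df["url"], df["title"])]
html = df[["title"]].to_html(escape=False, index=False)
print(html)
```

Only disable escaping when the cell contents are trusted, because any markup in the data is then emitted as-is.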

Rails 5 How can I change the url from companies/:id/jobs/:id to jobs/:id


Tag : development , By : Roel van Dijk
Date : March 29 2020, 07:55 AM
I had the controllers companies and jobs; company has_many jobs, and jobs belong_to company. You just need to change your routes as below:
resources :companies do
  member do
    post :star
    delete :unstar
    get :destroys
    get :jobs
  end
  resources :jobs, except: [:index, :show]
end
resources :jobs, only: [:index, :show]

How to create a dataframe that includes all of the null values from an original dataframe?


Tag : python , By : ffmmjj
Date : March 29 2020, 07:55 AM
I am hoping to create a pandas dataframe from an original dataframe that contains just the rows with NA values in them. IIUC, use:
df[df.isna().any(axis=1)]

   A  B    C
1  1  2  NaN
3  2  1  NaN
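
A self-contained version of the same filter, as a sketch: rows 1 and 3 match the output shown above, while the values in rows 0 and 2 are made up, since the original frame is not shown.

```python
import numpy as np
import pandas as pd

# Rows 0 and 2 are hypothetical; rows 1 and 3 mirror the output above.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 2, 2, 1],
    "C": [5.0, np.nan, 7.0, np.nan],
})

# isna() marks missing cells; any(axis=1) flags rows with at least one.
rows_with_na = df[df.isna().any(axis=1)]
print(rows_with_na)
```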