logo
down
shadow

Extracting data from HTML and formatting the output


Extracting data from HTML and formatting the output

Content Index :

Extracting data from HTML and formatting the output
Tag : java , By : Dasharath Yadav
Date : November 29 2020, 04:01 AM

wish helps you To scrap data from HTML source I use a little method I created named getBetween() to carry out the task. Of course the data I personally want seems to always be between strings of some sort:
/**
 * Retrieves any string data located between the supplied string leftString
 * parameter and the supplied string rightString parameter.<br><br>
 * <p>
 * <p>
 * This method will return all instances of a substring located between the
 * supplied Left String and the supplied Right String which may be found
 * within the supplied Input String.<br>
 *
 * @param inputString (String) The string to look for substring(s) in.
 *
 * @param leftString  (String) What may be to the Left side of the substring
 *                    we want within the main input string. Sometimes the
 *                    substring you want may be contained at the very
 *                    beginning of a string and therefore there is no
 *                    Left-String available. In this case you would simply
 *                    pass a Null String ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param rightString (String) What may be to the Right side of the
 *                    substring we want within the main input string.
 *                    Sometimes the substring you want may be contained at
 *                    the very end of a string and therefore there is no
 *                    Right-String available. In this case you would simply
 *                    pass a Null String ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param options     (Optional - Boolean - 2 Parameters):<pre>
 *
 *      ignoreLetterCase    - Default is false. This option works against the
 *                            string supplied within the leftString parameter
 *                            and the string supplied within the rightString
 *                            parameter. If set to true then letter case is
 *                            ignored when searching for strings supplied in
 *                            these two parameters. If left at default false
 *                            then letter case is not ignored.
 *
 *      trimFound           - Default is true. By default this method will trim
 *                            off leading and trailing white-spaces from found
 *                            sub-string items. General sentences which obviously
 *                            contain spaces will almost always give you a white-
 *                            space within an extracted sub-string. By setting
 *                            this parameter to false, leading and trailing white-
 *                            spaces are not trimmed off before they are placed
 *                            into the returned Array.</pre>
 *
 * @return (1D String Array) Returns a Single Dimensional String Array
 *         containing all the sub-strings found within the supplied Input
 *         String which are between the supplied Left String and supplied
 *         Right String. You can shorten this method up a little by
 *         returning a List&lt;String&gt; ArrayList and removing the 'List
 *         to 1D Array' conversion code at the end of this method. This
 *         method initially stores its findings within a List object
 *         anyways.
 */
public String[] getBetween(String inputString, String leftString, 
                    String rightString, boolean... options) {
    // Return nothing if nothing was supplied.
    if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
        return null;
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false; // Default.
    boolean trimFound = true;   // Default.
    if (options.length > 0) {
        if (options.length >= 1) {
            ignoreCase = options[0];
        }
        if (options.length >= 2) {
            trimFound = options[1];
        }
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List String Array Object to hold
    // our found substrings between the supplied Left
    // String and supplied Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = Pattern.quote(leftString)
            + (!rightString.equals("") ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    String[] res;
    // Convert the ArrayList to a 1D String Array.
    // If the List contains something then convert
    if (list.size() > 0) {
        res = new String[list.size()];
        res = list.toArray(res);
    } // Otherwise return Null.
    else {
        res = null;
    }
    // Return the String Array.
    return res;
}
 <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
 000]</td>
 <td align="right" class="nowrap">10 000</td>
 <td align="right" class="nowrap">20.48</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$28.65</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$894.69</td>
 <td align="right" class="nowrap">10.11</td>
 <td align="right" class="nowrap">0.21</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return 
  doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
 000]</td>
 <td align="right" class="nowrap">10 000</td>
 <td align="right" class="nowrap">46.21</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$53.82</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$1 151.78</td>
 <td align="right" class="nowrap">8.01</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5 
 000]</td>
 <td align="right" class="nowrap">5 000</td>
 <td align="right" class="nowrap">22.51</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$222.53</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$2 399.92</td>
 <td align="right" class="nowrap">5.94</td>
 <td align="right" class="nowrap">0.01</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return
 <td align="center">
     <input type="text" tabindex="2" name="productData[price]       
     [{33013477}]" size="10" value="3000.00">    
 </td> 
// Let's assume the "desired output" you acquired 
// is contained within a Text file named "HtmlData.txt".

// Hold our scraped data in a 2D List inteface.
List<List<String>> list = new ArrayList<>();

// Read File using BufferedReader in a Try With Resources block...
try (BufferedReader reader = new BufferedReader(new FileReader("HtmlData.txt"))) {
    String line;
    List<String> numbers = null;
    while ((line = reader.readLine()) != null) {
        numbers = new ArrayList<>();
        line = line.trim();
        if (line.equals("")) {
            continue;
        }
        if (line.startsWith("onclick=\"doWindow(this.href,")) {
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.endsWith("return")) {
                    list.add(numbers);
                    break;
                }
                if (line.equals("")) {
                    continue;
                }
                if (line.startsWith("<td align=\"right\" class=\"nowrap\">")) {
                    numbers.add(getBetween(line, "<td align=\"right\" class=\"nowrap\">", "</td>", true, true)[0]);
                }
            }
        }
        if (line.contains("name=\"productData[price]")) {
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.equals("")) {
                    continue;
                }
                if (line.startsWith("[{33013477}]")) {
                    numbers.add("Product Price: " + getBetween(line, "value=\"", "\">", true, true)[0]);
                    list.add(numbers);
                    break;  // DONE
                }
            }
        }
    }
    if (numbers != null && !numbers.isEmpty()) {
        list.add(numbers);
    }
}
catch (IOException ex) {
    ex.printStackTrace();
}

// Display our findings to the Console Window in a 
// table style format:
for (int i = 0; i < list.size(); i++) {
    for (int j = 0; j < list.get(i).size(); j++) {
        System.out.printf("%-10s ", list.get(i).get(j));
    }
    System.out.println("");
}
<td align="center">
    <input type="text" tabindex="2" name="productData[price]       
    [{33013477}]" size="10" value="3000.00">    
</td>
10 000     20.48      0.00       $28.65     0.00 %     $894.69    10.11      0.21       
10 000     46.21      0.00       $53.82     0.00 %     $1 151.78  8.01       0.00       
5 000      22.51      0.00       $222.53    0.00 %     $2 399.92  5.94       0.01       
Product Price: 3000.00 

Comments
No Comments Right Now !

Boards Message :
You Must Login Or Sign Up to Add Your Comments .

Share : facebook icon twitter icon

Extracting data from a URL result with special formatting


Tag : python , By : kbrust
Date : March 29 2020, 07:55 AM
wish helps you It sounds like you can break this problem up into several subproblems.
Subproblems
url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
import urllib.request
data = urllib.request.urlopen(url) # url from previous section
with urllib.request.urlopen(url) as data:
    # do processing here
result = data.read()

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
import json

result_object = json.loads(result)
# Get the rows in the table, then get the second column's value for
# each row
terms = [row['c'][2]['v'] for row in result_object['table']['rows']]
#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][2]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            data = perform_request(limit, seedterm)
            result = data.read()
    except:
        print('Could not request data from server', file=sys.stderr)
        exit(E_OPERATION_ERROR)

    terms = parse_result(result)
    print(terms)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__'
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except:
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        exit(2)

    exit(main(limit, seedterm))
#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][2]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=2, seedterm='seedterm')

    try:
        with urllib2.urlopen(url) as data:
            data = perform_request(limit, seedterm)
            result = data.read()
    except:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        exit(E_OPERATION_ERROR)

    terms = parse_result(result)
    print terms

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__'
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except:
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        exit(2)

    exit(main(limit, seedterm))

Technique for selectively formatting data in a PowerShell pipeline and output as HTML


Tag : html , By : Tony Z
Date : March 29 2020, 07:55 AM
this will help A much faster way:
Ok, I keep promising myself that I won't spend time on solved problems anymore, but ... that switch statement in my second answer was taking over 10 seconds to run on my system, -- because it's doing the "where" stuff in PowerShell instead of in LINQ.
Add-Type -Language CSharpVersion3 -ReferencedAssemblies System.Xml, System.Xml.Linq -UsingNamespace System.Linq -Name XUtilities -Namespace Huddled -MemberDefinition @"    
    public static System.Collections.Generic.IEnumerable<System.Xml.Linq.XElement> GetElementByIndex( System.Xml.Linq.XContainer doc, System.Xml.Linq.XName element, int index) {
        return from e in doc.Descendants(element) where e.NodesBeforeSelf().Count() == index select e;
    }
    public static System.Collections.Generic.IEnumerable<System.Xml.Linq.XElement> GetElementByValue( System.Xml.Linq.XContainer doc, System.Xml.Linq.XName element, string value) {
        return from e in doc.Descendants(element) where e.Value == value select e;
    }
"@

# Get the running processes to x(ht)ml
$xml = [System.Xml.Linq.XDocument]::Parse( "$(Get-Process | ConvertTo-Html)" )

# Find the index of the column you want to format:
$wsIndex = [Huddled.XUtilities]::GetElementByValue( $xml, "{http://www.w3.org/1999/xhtml}th", "WS" ) | %{ ($_.NodesBeforeSelf() | Measure).Count }


switch([Huddled.XUtilities]::GetElementByIndex( $xml, "{http://www.w3.org/1999/xhtml}td", $wsIndex )) {
   {200MB -lt $_.Value } { $_.SetAttributeValue( "style", "background: red;"); continue } 
   {20MB  -lt $_.Value } { $_.SetAttributeValue( "style", "background: orange;"); continue } 
   {10MB  -lt $_.Value } { $_.SetAttributeValue( "style", "background: yellow;"); continue } 
}

# Save the html out to a file
$xml.Save("$pwd/procs2.html")

# Open the thing in your browser to see what we've wrought
ii .\procs2.html
Add-Type -AssemblyName System.Xml.Linq

$Process = $(Get-Process | Select Handles, NPM, PM, WS, VM, CPU, Id, ProcessName)

$xml = [System.Xml.Linq.XDocument]::Parse( "$($Process | ConvertTo-Html)" )
if($Namespace = $xml.Root.Attribute("xmlns").Value) {
    $Namespace = "{{{0}}}" -f $Namespace
}

# Find the index of the column you want to format:
$wsIndex = [Array]::IndexOf( $xml.Descendants("${Namespace}th").Value, "WS")

foreach($row in $xml.Descendants("${Namespace}tr")){
    switch(@($row.Descendants("${Namespace}td"))[$wsIndex]) {
       {200MB -lt $_.Value } { $_.SetAttributeValue( "style", "background: red;"); continue } 
       {20MB  -lt $_.Value } { $_.SetAttributeValue( "style", "background: orange;"); continue } 
       {10MB  -lt $_.Value } { $_.SetAttributeValue( "style", "background: yellow;"); continue } 
    }
}
# Save the html out to a file
$xml.Save("$pwd/procs1.html")

# Open the thing in your browser to see what we've wrought
ii .\procs2.html

Extracting multiple rows of data from mysql and formatting in a web page table using PDO


Tag : php , By : user179190
Date : March 29 2020, 07:55 AM
this will help $pdo->query()
 foreach ($result as $row) {
           echo
            "<tr>   <td>".$row['comment']." </td>                   
                    <td>".$row['name']." </td>
                    <td>".$row['entered']." </td>
            </tr>\n";
            }

Jsoup Trouble extracting formatting from html tables


Tag : java , By : cameron
Date : March 29 2020, 07:55 AM
around this issue You just need to parse the HTML code you want to scrap into JSOUP and then select the attributes of the HTML tags you want, using the attr selector from JSOUP Elements, and that gives you the value of that attribute for every th tag in the HTML. To retrieve also the text contained between the span tags you need to select the nested span in the th and get the .text().
    Document document = Jsoup.parse(YOUT HTML GOES HERE);
    System.out.println(document);
    Elements elements = document.select("tr > th");

    for (Element element : elements) {
        String align = element.attr("align");
        String color = element.attr("bgcolor");
        String spanText = element.select("span").text();

        System.out.println("Align is " + align +
                "\nBackground Color is " + color +
                "\nSpan Text is " + spanText);
    }
String fullText = element.text();

Extracting data in Excel and transposing/formatting it in another sheet


Tag : excel , By : Karina
Date : March 29 2020, 07:55 AM
Related Posts Related QUESTIONS :
  • Manage CompletionStage inside of Netty handler
  • Url Problem while Developing on Localhost and deploy on Remote Virtual Server
  • How to identify the missing type id in Jackson error?
  • android data binding error: cannot find symbol
  • Spring Boot application with a jar dependency does not run after maven build
  • Spring Data JPA query , filter ? search engine ? JPQL?
  • Why LiveData returns null in ViewModel?
  • what this line of code mean....new URLClassLoader(new URL[0],getClass().getClassLoader());
  • Why do need to use new Random() instead of just Random Randomnum?
  • I want to access zk components from the java file
  • How do I cast FieldValue.serverTimestamp() to Kotlin/Java Date Class
  • Insertion Sort Double Array with User Input - JAVA
  • Creating 2 dimesional array with user input and find sum of specific columns
  • can not get Advertising ID Provider in android
  • Convert list of Objects to map of properties
  • How to represent an undirected weighted graph in java
  • Return values as array from collection
  • ByteBuddy generic method return cast to concrete type
  • ImageView hides the round corners of the parent
  • Is there a way to find setter method by its getter method or vice versa in a class?
  • Get aggregated list of properties from list of Objects(Java 8)
  • Unable to find a document in Mongodb where exact date match in java
  • UsernamePasswordAuthenticationFilter skips success handler
  • Use Java filter on stream with in a stream filter
  • Default Login not successful in spring boot 2.1.7
  • Adding key value pairs from a file to a Hashmap
  • Rub regex: matching a char except when after by another char
  • Convert Base64 String to String Array
  • Escape Unicode Character 'POPCORN' to HTML Entity
  • An empty JSON field which is a boolean/nullable field in Java model, is getting converted as null
  • Mongo java driver cannot find public constructor for interface
  • How to unit test writing a file to AWS Lambda output stream?
  • How to make a GitHub GraphQL API Call from Java
  • What's the difference between @ComponentScan and @Bean in a context configuration?
  • Expected class or package adding a view using a class
  • can be delete of a element in a static array be O(1)?
  • Instance variable heap or stack ? ( with specific example)
  • Assert progress of ProgressBar in Espresso test
  • How to detect if gson.fromjson() has excess elements
  • I cant generate the proper code to select the a specific filter on a BI dashboard I am working on
  • How to Inject Dependencies into a Servlet Filter with Spring Boot Filter Registration Bean?
  • Thrift types as a Generic
  • Effective algorithm to random 4 unique integers less than a big max such as 100_000
  • Combining or and negation in Java regex?
  • Unable to instantiate default tuplizer Exception
  • Multi-tenant migration to work with quarkus
  • Ignite persisting a Set: Cannot find metadata for object with compact footer
  • Maven cannot resolve Jacob dependency using eclipse
  • testcontainers oracle database container starts before database user is created
  • Launching two spring boot apps in integration test
  • Is there a way to add a HashMap's value that is a integer array into a ArrayList?
  • Is there any way that I can get a parameter in paintComponent?
  • Empty stack with one recursive method and one iterative method
  • What's the behavior of onBackpressureBuffer in RxJava2
  • Java regex can only use 1 quantifier in a lookback (need 2)
  • How to fix error in native query : it is showing syntax error near or at
  • How to retrieve nested object from a document and display it in FirestoreRecyclerOptions?
  • Why not use ListIterator for full LinkedList Operation?
  • Android Webview EvaluateJavascript sometimes does not return a response
  • Matcher java doesn't work but regex seems to be good
  • shadow
    Privacy Policy - Terms - Contact Us © scrbit.com