
AutoScraper automatically deletes duplicates... how to keep them?

by essie98927 - 22 April, 2022 - 11:35 PM
#1
(This post was last modified: 22 April, 2022 - 11:37 PM by essie98927. Edited 2 times in total.)
I'm trying to use the AutoScraper module with Python 3.9.6 to scrape some information from websites, but it automatically removes duplicate values, and I need to keep them.
 
I'm trying to scrape the percentages in the March column.
 
Here's my code:
 
Code:
from autoscraper import AutoScraper

url = "https://store.steampowered.com/hwsurvey/videocard/"

# One sample value from the March column, used to learn the scraping rules
wanted_list = ["8.18%"]
scraper = AutoScraper()
result_per100 = scraper.build(url, wanted_list)

print(result_per100)
Output:
Code:
['92.00%', '1.94%', '1.12%', '0.03%', '0.00%', '4.91%', '8.18%', '6.08%', '5.62%', '5.53%', '2.87%', '2.71%', '2.48%', '2.45%', '2.35%', '2.22%', '2.11%', '2.06%', '1.98%', '1.65%', '1.55%', '1.54%', '1.46%', '1.41%', '1.32%', '1.27%', '1.25%', '1.19%', '1.11%', '1.08%', '1.04%', '1.02%', '0.95%', '0.94%', '0.89%', '0.86%', '0.83%', '0.77%', '0.72%', '0.70%', '0.69%', '0.68%', '0.64%', '0.63%', '0.56%', '0.54%', '0.51%', '0.50%', '0.47%', '0.46%', '0.45%', '0.44%', '0.42%', '0.39%', '0.37%', '0.36%', '0.35%', '0.34%', '0.33%', '0.30%', '0.27%', '0.26%', '0.25%', '0.24%', '0.23%', '0.22%', '0.21%', '0.19%', '0.18%', '0.17%', '0.16%', '10.02%', '8.64%', '6.40%', '5.93%', '5.83%', '3.02%', '2.86%', '2.61%', '2.58%', '2.49%', '2.33%', '2.17%', '2.09%', '1.74%', '1.63%', '1.62%', '1.48%', '1.39%', '1.34%', '1.31%', '1.16%', '1.15%', '1.10%', '1.00%', '0.99%', '0.93%', '0.91%', '0.87%', '0.81%', '0.76%', '0.66%', '0.59%', '0.57%', '0.53%', '0.48%', '0.41%', '0.38%', '0.31%', '14.83%', '24.09%', '17.74%', '15.10%', '6.56%', '6.15%', '2.93%', '2.24%', '2.20%', '2.10%', '1.84%', '1.73%', '1.14%', '1.07%', '0.79%', '0.74%', '0.73%', '0.65%', '0.58%', '0.52%', '0.43%', '0.32%', '1.58%', '24.30%', '14.36%', '6.70%', '6.32%', '2.84%', '2.26%', '2.02%', '2.00%', '1.69%', '0.92%', '0.84%', '0.62%', '0.60%', '0.55%', '10.69%', '22.32%', '14.34%', '13.79%', '7.08%', '5.08%', '2.90%', '2.72%', '2.36%', '1.81%', '1.45%', '1.09%', '6.72%', '21.62%', '16.22%', '10.81%', '8.11%', '5.41%', '2.70%']


As you can see, the output contains no duplicates, but the website does have repeated values, and I need them.
 
Does anyone know whether this can be fixed, and how?
 
Any help is much appreciated, thanks!
UberFuck  
#2
Not gonna lie...most people here will be using Beautiful Soup or Scrapy for scraping, not AutoScraper (I'd never even heard of it).

Here is code to get what you want w/ Beautiful Soup...
 
Code:
from bs4 import BeautifulSoup, Tag
from requests import Session

url = 'https://store.steampowered.com/hwsurvey/videocard/'
session = Session()


def GetResults(url: str) -> list[str]:
    # Fetch the page; the timeout keeps a stalled request from hanging forever
    resp = session.get(url, timeout=30)
    if not resp.ok:
        raise ConnectionError(f'Problems encountered retrieving {url}')
    soup = BeautifulSoup(resp.content, 'lxml')
    # Select the cells of the last-month percentage column by their class
    lastmo: list[Tag] = soup.find_all('div', class_='substats_col_month_last_pct')
    if not lastmo:
        return []
    # Drop the column header cell; repeated values are kept as they appear
    return [e.text for e in lastmo if 'col_header' not in e.attrs['class']]


res = GetResults(url)
print(res)
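
If you'd rather not install bs4 and lxml, roughly the same idea can be sketched with the standard library's html.parser. This is only an illustration: the sample markup below is made up (just the class names are borrowed from the code above), the class and function names are hypothetical, and it assumes each cell contains plain text with no nested tags.

```python
from html.parser import HTMLParser

# Made-up sample markup; only the class names mirror the code above.
SAMPLE_HTML = """
<div class="substats_col_month_last_pct col_header">MAR</div>
<div class="substats_col_month_last_pct">8.18%</div>
<div class="substats_col_month_last_pct">6.08%</div>
<div class="substats_col_month_last_pct">8.18%</div>
<div class="substats_col_month_last_chg">+0.10%</div>
"""


class PctCollector(HTMLParser):
    """Collects the text of <div> cells with a given class, duplicates included."""

    def __init__(self, wanted_class, skip_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.skip_class = skip_class
        self.values = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Capture only wanted cells that are not the header cell
        self._capturing = (
            tag == "div"
            and self.wanted_class in classes
            and self.skip_class not in classes
        )
        if self._capturing:
            self.values.append("")

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "div":
            self._capturing = False


def collect_pcts(html):
    parser = PctCollector("substats_col_month_last_pct", "col_header")
    parser.feed(html)
    parser.close()
    # A plain list keeps every occurrence in page order; nothing is deduplicated
    return [v.strip() for v in parser.values]


pcts = collect_pcts(SAMPLE_HTML)
print(pcts)
```

To run it against the real page you would pass the downloaded HTML text to collect_pcts instead of SAMPLE_HTML. Whether the live markup matches these assumptions is something to check yourself.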
essie98927
#3
(24 April, 2022 - 09:12 AM) foxegado wrote:
Not gonna lie...most people here will be using Beautiful Soup or Scrapy for scraping, not AutoScraper (I'd never even heard of it).

Here is code to get what you want w/ Beautiful Soup...
 

I'll test that, thank you very much!
