Writing a Screen Scraper in Python helped by AI

40 views

1 month ago

How text can be taken off a web page and saved to a text file and an audio file.
Written in Python, with help from ChatGPT tools.

Comments: 4

@johnsim3722 10 days ago

I thought it would have come out with an HTML package of images and other content within the page so that you could view it offline like it was a live page. I've seen some bits of software that do this, with varying degrees of success. The problem seems to be with scripts behind the scenes that serve up a page, and with content constantly changing on each view. Those programs can get into a mess constantly trying to update themselves. Creating a PDF is certainly one way to take a static copy, better if it does it as a continuous strip rather than paginated. As for the Daily Mail website, apart from the content being a load of BS, their adverts are dangerous. My sister-in-law visited on her work computer and got infected by a virus from that site. The machine shut down, and she then had to get technical support from work, who ended up buying a new one just to get her back up and running quickly. Daily Mail is a nasty bit of work.

@TJMoir 10 days ago

Yes you can do images too. That's a bit harder because it gives you everything on the page including logos etc and you need to sort it all out but it can be done. I tried it and got tons of images. So you need to keep track of where they are relevant to the text. So that's where AI is good and it can pop an image up relevant to a paragraph or whatever. But the text bit is the easiest.
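
A minimal sketch of the "keep track of where they are" idea, assuming it is enough to record which paragraph each image follows. It uses only the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the class name ImageTracker and the example.com URL are made up for illustration and are not from the video.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageTracker(HTMLParser):
    """Collect paragraph text and note which paragraph each image follows."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.paragraphs = []  # text of each <p>, in page order
        self.images = []      # (absolute image URL, index of last completed paragraph)
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # An index of -1 means the image comes before any paragraph,
                # which is typical of logos and banners.
                index = len(self.paragraphs) - 1
                self.images.append((urljoin(self.base_url, src), index))

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


# Usage with a small hand-written page (hypothetical content)
html = '<img src="/logo.png"><p>First.</p><img src="a.jpg"><p>Second.</p>'
tracker = ImageTracker("https://example.com/news/")
tracker.feed(html)
for image_url, paragraph_index in tracker.images:
    print(image_url, "follows paragraph", paragraph_index)
```

Images with index -1 are likely page furniture and could be dropped. Deciding whether an image is truly relevant to a paragraph still needs a human or an AI model, as described above.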

@johnsim3722 10 days ago

@TJMoir I've wondered if there was something smart enough to just get the main content with inline images and not all the junk they push up the sides of any of these sites. Even removing all the inline adverts. Some sites are almost impossible to read because they're so broken up with adverts. An ad blocker is essential, or they'll steal all the resources from your computer to run.
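
One rough way to approach "just the main content" is to ignore paragraphs that sit inside page-furniture tags. The sketch below is only a heuristic under that assumption: the tag list in SKIP_TAGS is a guess that would need tuning per site, adverts injected by scripts or wrapped in plain div tags will slip through, and the names MainTextParser and extract_main_text are invented here. It uses the standard library's html.parser so it runs without extra packages.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page furniture (an assumption; tune per site)
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class MainTextParser(HTMLParser):
    """Keep <p> text only when it is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._skip_depth = 0  # how many skipped tags are currently open
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return the text of paragraphs found outside the skipped tags."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Usage with a small hand-written page (hypothetical content)
page = (
    "<nav><p>Menu</p></nav>"
    "<p>Story text.</p>"
    "<aside><p>Advert</p></aside>"
    "<p>More story.</p>"
)
print(extract_main_text(page))
```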

@TJMoir 1 month ago

Code:

# Screen scraper from a URL
# T.J.Moir and chatGPT
import os
import tkinter as tk
from tkinter import simpledialog
import winsound  # Windows only

import requests
from bs4 import BeautifulSoup
from gtts import gTTS

# For the beep sound
frequency = 500  # Set frequency to 500 Hertz
duration = 500   # Set duration to 500 ms == 0.5 second

# Create the root window
root = tk.Tk()
root.geometry('500x100+1200+500')
root.title("Screen Scraper")  # Set window title

# Create a StringVar to associate with the label
text_var = tk.StringVar()
text_var.set("Screen Scraper")

# Create the label widget with all options
label = tk.Label(root,
                 textvariable=text_var,
                 anchor=tk.CENTER,
                 bg="white",
                 height=3,
                 width=30,
                 bd=3,
                 font=("Times", 16, "bold"),
                 cursor="hand2",
                 fg="Grey",
                 padx=15,
                 pady=15,
                 justify=tk.CENTER,
                 relief=tk.RAISED,
                 wraplength=250)

# Pack the label into the window
label.pack(pady=20)  # Add some padding to the top

# Hide root window
# root.withdraw()

winsound.Beep(frequency, duration)

# Ask for the URL of the web page to scrape
url = simpledialog.askstring("Input", "URL for screenscrape")

# Send a GET request to the URL and stop on an HTTP error status
response = requests.get(url)
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the desired data (example: all paragraphs)
paragraphs = soup.find_all('p')

# Get the user's home directory
home_dir = os.path.expanduser('~')

text_var.set("Please wait, doing audio first")
root.update()  # Redraw the label before the slow audio step

# Join the text of each paragraph, separated by spaces
text = ' '.join(p.get_text() for p in paragraphs)

# British English voice; recent gTTS releases use lang='en' with tld='co.uk'
# in place of the older 'en-uk' code
tts = gTTS(text, lang='en', tld='co.uk')

# Construct the path to the folder.
# I created a folder under Downloads called screenscraper.
# The mp3 directory is the same as the one for the text file.
download_dir = os.path.join(home_dir, 'Downloads', 'screenscraper')

# Ensure the screenscraper directory exists
if not os.path.exists(download_dir):
    raise FileNotFoundError(f"The directory {download_dir} does not exist.")

winsound.Beep(frequency, duration)

# Create the input dialog
mp3_name = simpledialog.askstring("Input", "Enter .mp3 filename:")

# Save the audio file straight into the screenscraper directory
tts.save(os.path.join(download_dir, mp3_name))

text_var.set("Finished audio part")
winsound.Beep(frequency, duration)

# Create the input dialog
file_name = simpledialog.askstring("Input", "Enter filename:")

# Destroy the root window
root.destroy()

# Full path to the file
file_path = os.path.join(download_dir, file_name)

# Save the text to a file
with open(file_path, 'w', encoding='utf-8') as file:
    for p in paragraphs:
        file.write(p.get_text() + ' ')
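
The script above stops with FileNotFoundError unless the screenscraper folder was created by hand, and it trusts whatever filename is typed into the dialog. A possible refinement, sketched here under those assumptions, is a small helper that creates the folder when it is missing and appends the extension if it was left off. The name prepare_output_path is hypothetical and not part of the original code.

```python
from pathlib import Path


def prepare_output_path(folder, name, extension):
    """Create the folder if needed and return a path ending in the extension."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    name = Path(name.strip()).name             # drop any directory part that was typed
    if not name.lower().endswith(extension):
        name += extension
    return folder / name


# Usage: same folder layout as the script above
download_dir = Path.home() / "Downloads" / "screenscraper"
mp3_path = prepare_output_path(download_dir, "daily news", ".mp3")
print(mp3_path)
```

The returned path could then be passed to tts.save(str(mp3_path)), and the same helper could be called with ".txt" for the text file.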