"But in the end scrappers are hard coded, Never rely on them."
--- Badshah Namdar Gaohar
This piece is an installment of the series I am currently writing on the basics of network programming. The unfinished drafts of the nine-part series sit on my HDD. Oh Aergia, leave me alone!! Oh Aergia, go on home!!!
However, this eighth installment is published under the auspices of the FOSSNation group, especially Samnan Rahee and Emon Shahriar, as part of HackToberFest.
In this segment we’ll write two Python scripts: one scrapes the Internet Movie Database (IMDb) for movie details, the other scrapes BBC for news updates.
The question is: why overdo a simple task instead of just using the browser? Well, my reasons are:
1. To learn the basics of how an HTTP request works and how to scrape basic data from the response. Straightforward.
2. Browsing the web, with its fascinating frontends, often distracts us and consumes a lot of time. In the CLI we get only the text. So, during work, a glimpse at a movie’s details or the news updates is refreshing, unless an asteroid hits planet Earth [ Don’t make me hopeful :’( ].
Let’s roll up our sleeves and get our hands dirty.
Requests and Beautiful Soup are two of my favourite Python libraries; they meet my geeky needs.
Their documentation is informative, and I highly recommend it to anyone interested in these FOSS scripts.
Before starting, note two things. Web scraping is HIGHLY HARD CODED to the site we are interested in. The rules of thumb are: 1) a script that works today will break tomorrow if the admin changes the site’s HTML or CSS, and 2) the author must inspect or query the content of interest manually and then script it. This is tedious, of course, and what worked in a previous script might not carry over.
For the experts, there’s a whole framework dedicated to web scraping called Scrapy (NOT Scapy), which would be overkill for our simple task.
Scraping IMDB
The script is straightforward: it queries IMDb for a movie title or a seven-digit title id, e.g. tt1234567. It takes input from the user in a pseudo-shell mode so you can work interactively.
The script is self-explanatory and well commented, I think. If not, please give feedback; the FOSS community is for working together, even on toy projects.
The script’s main request lives in get_raw_html() (line 29), a method that takes the user input and returns raw HTML, whether the user searched for a movie title or a seven-digit id.
Lines 131 and 132 declare the variables:
url = "https://www.imdb.com/title/" # the id is appended here when a tt1234567-style id is given
search = "https://www.imdb.com/find?ref_=nv_sr_fn&q=" # otherwise the input is searched as a string in the movie database
The search type is decided by the condition on line 140, if re.match("^tt[0-9]{7}",case), which checks whether the user input is a movie id or a title.
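The id-or-title decision can be sketched roughly as follows. This is not the script's actual code: build_request_url is a hypothetical helper, and URL-encoding the search string with quote_plus is my assumption about what a search request needs. Only the two base URLs and the regular expression come from the script.

```python
import re
from urllib.parse import quote_plus

# Base URLs quoted from the script.
URL = "https://www.imdb.com/title/"
SEARCH = "https://www.imdb.com/find?ref_=nv_sr_fn&q="

# Same pattern as the script's condition: "tt" followed by seven digits.
TITLE_ID = re.compile(r"^tt[0-9]{7}")


def build_request_url(case):
    """Return the URL to fetch for one line of user input (hypothetical helper)."""
    case = case.strip()
    if TITLE_ID.match(case):
        # Looks like a title id: append it to the title URL.
        return URL + case
    # Otherwise treat the input as a free-text search.
    # quote_plus escapes spaces and punctuation for the query string (assumption).
    return SEARCH + quote_plus(case)


print(build_request_url("tt1234567"))
print(build_request_url("The Matrix"))
```

Note that the pattern is anchored only at the start, so any input that begins with tt and seven digits is treated as an id.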
We’ll use the json (JavaScript Object Notation) library for simplicity and Beautiful Soup for parsing the raw HTML.
Line 33 parses the raw HTML from the request response, raw_html = BeautifulSoup(response.text,"html.parser"), while the JSON is returned by the function html_to_json(raw_html) on line 62.
The extraction itself is json_data = raw_html.find(attrs={"type":"application/ld+json"}) on line 65.
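To see what that find call is looking for, here is a minimal sketch of the same idea using only the standard library, so it runs without installing anything. It is a stand-in for the Beautiful Soup call, not the script's html_to_json: the class LdJsonFinder, the function ld_json_from_html and the SAMPLE page are all made up for illustration.

```python
import json
from html.parser import HTMLParser


class LdJsonFinder(HTMLParser):
    """Collect the text of the first <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside the target tag
        self._done = False    # True once the first matching tag has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if (tag == "script" and not self._done
                and dict(attrs).get("type") == "application/ld+json"):
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self._done = True


def ld_json_from_html(html_text):
    """Return the page's JSON-LD block as a dict, or None if there is none."""
    finder = LdJsonFinder()
    finder.feed(html_text)
    if not finder.chunks:
        return None
    return json.loads("".join(finder.chunks))


# A made-up page, far smaller than a real IMDb response.
SAMPLE = """<html><head>
<script type="application/ld+json">{"@type": "Movie", "name": "Example Film"}</script>
</head><body></body></html>"""

print(ld_json_from_html(SAMPLE))
```

With Beautiful Soup the whole class collapses into the single find call quoted above, which is exactly why the script uses it.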
If the user asks for a movie id, the script fetches the details of that specific movie. The JSON data, which carries the movie title, director, stars, poster and so on, is loaded with loaded_json = html_to_json(raw_html), and the output is parsed by the parse_loaded_json(loaded_json) routine (lines 145 and 146).
Now look at how the data is extracted from the loaded JSON in the parse_loaded_json(loaded_json) method on line 78.
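Since scrapers break whenever the site changes, it pays to read the loaded JSON defensively. The sketch below is one way to do that and is not the script's parse_loaded_json. The key names (name, director, actor, image) follow schema.org's Movie type; which of them IMDb actually sends on a given day is an assumption, and summarise_movie, names_of and the sample record are hypothetical.

```python
def names_of(value):
    """Normalise a JSON-LD person field (dict, list of dicts, or missing) to a list of names."""
    if value is None:
        return []
    if isinstance(value, dict):
        value = [value]
    return [item.get("name", "?") for item in value if isinstance(item, dict)]


def summarise_movie(loaded_json):
    """Pick a few fields out of a JSON-LD movie record without raising on missing keys."""
    return {
        "title": loaded_json.get("name", "N/A"),
        "directors": names_of(loaded_json.get("director")),
        "stars": names_of(loaded_json.get("actor")),
        "poster": loaded_json.get("image", "N/A"),
    }


# A made-up record; a real one carries many more keys.
sample = {
    "@type": "Movie",
    "name": "Example Film",
    "director": {"@type": "Person", "name": "A. Director"},
    "actor": [
        {"@type": "Person", "name": "First Star"},
        {"@type": "Person", "name": "Second Star"},
    ],
}

print(summarise_movie(sample))
```

Using .get with a default means a missing key shows up as N/A or an empty list rather than a traceback in the middle of the session.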
case = raw_input("[imdb_ia]% ") on line 138 drives the interactive session inside an infinite while loop. (raw_input is Python 2; in Python 3 the equivalent call is input.)
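A pseudo shell of this kind can be sketched in Python 3 as below. It is only an illustration under my own assumptions: the quit and exit commands, the read_line parameter and the repl function itself are hypothetical, and only the prompt string comes from the script.

```python
def repl(handle, read_line=input, prompt="[imdb_ia]% "):
    """Read commands forever; stop on quit/exit, Ctrl-D or Ctrl-C."""
    while True:
        try:
            case = read_line(prompt).strip()
        except (EOFError, KeyboardInterrupt):
            break
        if case in ("quit", "exit"):  # hypothetical exit commands
            break
        if case:
            # Empty lines are skipped; everything else goes to the handler.
            handle(case)


# Interactive use would be repl(print). Here the loop is fed from a list
# instead of the keyboard, so the sketch runs without waiting for input.
lines = iter(["tt1234567", "", "The Matrix", "quit", "never reached"])
seen = []
repl(seen.append, read_line=lambda prompt: next(lines))
print(seen)
```

Passing the reader in as a parameter is a design choice: it keeps the loop testable without typing at a prompt.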
This is the order I usually follow when reading someone else’s code.
Scraping BBC
The beauty and wonder of Python is its readability; the kiddie only has to read the documentation to learn the classes, objects and parameters. This script is also self-explanatory, with a simple tweak. I hope readers will need no further explanation in this case.
OK, enough for today. I hope to come back with more interesting clichés in the future; just keep me in your prayers. Oh yes… Greetings, everyone… Happy Durga Puja.
Faquir Foysol
A 90s script kiddie…