How Is Web Scraping Used to Extract Movie Details from YIFY Movies?

Web Scraping

YIFY Movies:

  • YIFY Movies is a website that provides free movie torrent links and has a large database of movies and documentaries.
  • For our project, we will extract movie information such as the title, year, genre, rating, movie link, synopsis, and number of downloads.

Tools

Outline

  1. Download the webpage using requests.
  2. Parse the HTML source code with Beautiful Soup.
  3. Search the <tags> that contain the movie title, year, genre, rating, movie URL, synopsis, and number of downloads.
  4. Scrape information from multiple pages (in our case, 20) and collect it into Python lists and dictionaries.
  5. Save the extracted features as a CSV file.
Movie,Year,Genre,Ratings,Url,Synopsis,Downloaded
Whale Hunting,1984,Drama,6.5 / 10,https://yts.rs/movie/whale-hunting-1984," A disillusioned student meets a eccentric beggar and a mute prostitute he falls in love with. Together, without money, they cross South Korea to help the girl go home. "," Downloaded 101 times Sep 27, 2021 at 09:08 PM "
...

Download the Webpage using Requests

import requests
from bs4 import BeautifulSoup

def get_doc(url):
    """Download a web page and return a Beautiful Soup doc"""
    # Download the page
    response = requests.get(url)
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # Create a bs4 doc
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

doc = get_doc(site_url)
doc.find('title')

Output:

<title>Search and Browse YIFY Movies Torrent Downloads - YTS</title>

Searching <tags> Containing Movie Data

  • Movie
  • Year
  • Genre
  • Rating
  • URL
  • Synopsis
  • Downloads
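Each of these fields lives in a tag with a distinctive class, which is what Beautiful Soup's find_all() keys on. A minimal sketch of the idea on an inline snippet — the HTML below is a simplified mock, not the real page markup, though the class names are the ones the article's selectors use (note that passing a multi-word string to class_ matches the exact class attribute value):

```python
from bs4 import BeautifulSoup

# simplified mock of one listing entry; the real page is more complex
html = """
<div class="browse-movie-bottom">
  <a href="/movie/whale-hunting-1984" class="text--bold palewhite title">Whale Hunting</a>
  <span class="text--gray year">1984</span>
  <h4 class="genre">Drama</h4>
  <h4 class="rating">6.5 / 10</h4>
</div>
"""

doc = BeautifulSoup(html, 'html.parser')

# find_all() with class_ narrows the search to tags carrying that exact class value
titles = [a.text for a in doc.find_all('a', class_='text--bold palewhite title')]
years = [s.text for s in doc.find_all('span', class_='text--gray year')]

print(titles, years)  # ['Whale Hunting'] ['1984']
```

The functions below apply exactly this pattern, one field at a time, to the downloaded doc.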

Extracting the Movie Titles from Web Page

def get_movie_titles(doc):
    # get all the <a> tags with a unique class
    movie_title_tags = doc.find_all('a', class_='text--bold palewhite title')
    # create an empty list
    movie_titles = []
    for tag in movie_title_tags:
        # append the title text of each <tag> to the list
        movie_titles.append(tag.text)
    # return the list
    return movie_titles

The get_movie_titles() function successfully returns a list of movie titles.

Extract Movie Years from the Web Page

def get_movie_years(doc):
    # get all the <span> tags with a unique class
    movie_year_tags = doc.find_all('span', class_='text--gray year')
    # create an empty list
    movie_years = []
    for tag in movie_year_tags:
        # append the year text of each <tag> to the list
        movie_years.append(tag.text)
    return movie_years

Extract Movie Genres from Web Page

def get_movie_genres(doc):
    # get all the <h4> tags with a unique class
    genre_tags = doc.find_all('h4', class_='genre')
    # create an empty list
    movie_genres = []
    for tag in genre_tags:
        # append the genre text of each <tag> to the list
        movie_genres.append(tag.text)
    return movie_genres

Extract Movie Ratings from Web Page

def get_movie_ratings(doc):
    # get all the <h4> tags with a unique class
    rating_tags = doc.find_all('h4', class_='rating')
    # create an empty list
    movie_ratings = []
    for tag in rating_tags:
        # append the rating text of each <tag> to the list
        movie_ratings.append(tag.text)
    return movie_ratings
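The ratings come back as text such as "6.5 / 10", which is awkward to sort or average. A small helper can convert them to floats; `parse_rating` is a hypothetical name, not part of the article's code, and it assumes the "score / 10" format seen in the sample output.

```python
def parse_rating(text):
    # "6.5 / 10" -> 6.5; returns None when the text is not in the expected form
    try:
        return float(text.split('/')[0].strip())
    except ValueError:
        return None

print(parse_rating('6.5 / 10'))  # 6.5
print(parse_rating('N/A'))       # None
```

Returning None instead of raising keeps a single malformed rating from aborting a whole page's scrape.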

Extract Movie URLs from the Web Page

def get_movie_urls(doc):
    # get all the <a> tags with a unique class
    movie_url_tags = doc.find_all('a', class_='text--bold palewhite title')
    # create an empty list
    movie_urls = []
    # the base url for the website
    base_url = 'https://yts.rs'
    for tag in movie_url_tags:
        # join the base_url with the href from each tag and append it to the list
        movie_urls.append(base_url + tag['href'])
    return movie_urls

The get_movie_urls() function successfully returns a list of movie URLs.
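String concatenation with base_url works here because the browse page's hrefs are root-relative. If a page ever mixes relative and absolute links, the standard library's urljoin handles both cases; a minimal sketch, using the same base URL as above:

```python
from urllib.parse import urljoin

base_url = 'https://yts.rs'

# a root-relative href, as on the browse page
print(urljoin(base_url, '/movie/whale-hunting-1984'))
# -> https://yts.rs/movie/whale-hunting-1984

# an already-absolute href passes through unchanged
print(urljoin(base_url, 'https://yts.rs/movie/whale-hunting-1984'))
# -> https://yts.rs/movie/whale-hunting-1984
```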

Extract Movie Synopsis from Web Page

def get_synopsis(doc):
    # create an empty list
    synopses = []
    # get all the movie urls from the page
    urls = get_movie_urls(doc)
    for url in urls:
        # for each url (page), get the Beautiful Soup doc object
        movie_doc = get_doc(url)
        # get all the <div> tags with a unique class
        div_tag = movie_doc.find_all('div', class_='synopsis col-sm-10 col-md-13 col-lg-12')
        # get all the <p> tags inside the first <div> tag
        p_tags = div_tag[0].find_all('p')
        # the text (i.e. the synopsis) is extracted from the first <p> tag using .text
        synopsis = p_tags[0].text
        # the synopsis is appended to the list synopses
        synopses.append(synopsis)
    return synopses

The get_synopsis() function gets a list of synopses, one for every movie on a web page, and returns it.

Extract the Movie Downloads from the Web Page

import re

def get_downloaded(doc):
    # create an empty list
    downloadeds = []
    # get all the movie urls on the page
    urls = get_movie_urls(doc)
    for url in urls:
        # for each url (page), create a Beautiful Soup doc object
        movie_doc = get_doc(url)
        # get all the <div> tags with a unique class
        div_tag = movie_doc.find_all('div', class_='synopsis col-sm-10 col-md-13 col-lg-12')
        # get all the <p> tags inside the first <div> tag
        p_tags = div_tag[0].find_all('p')
        # get all the <em> tags inside the second <p> tag
        em_tag = p_tags[1].find_all('em')
        # extract the text from the <em> tag using .text
        download = em_tag[0].text
        # use a regular expression to strip non-digit characters from the text
        regex = re.compile('[^0-9]')
        downloaded = regex.sub('', download)
        # append the result to the list downloadeds
        downloadeds.append(downloaded)
    return downloadeds

The get_downloaded() function retrieves and returns a list of download counts for each movie on a web page.
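One caveat with the `[^0-9]` pattern: it strips every non-digit, so if the extracted text also contains a date (as the sample CSV row suggests it can), the date's digits get glued onto the count. A stricter pattern that captures only the number between "Downloaded" and "times" avoids this; a sketch, using the text from the sample output above:

```python
import re

text = 'Downloaded 101 times Sep 27, 2021 at 09:08 PM'

# Stripping all non-digits keeps the date's digits too:
print(re.sub('[^0-9]', '', text))  # '1012720210908'

# Capturing only the count after "Downloaded" is safer:
match = re.search(r'Downloaded\s+(\d+)\s+times', text)
count = int(match.group(1)) if match else None
print(count)  # 101
```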
We used the re (regular expression operations) module to match and extract our string.

Extract Movie Details for a URL (Page)

def scrap_page(url):
    # get the Beautiful Soup doc object for the url
    doc = get_doc(url)
    # get the list of movie titles
    movies = get_movie_titles(doc)
    # get the list of years
    years = get_movie_years(doc)
    # get the list of genres
    genres = get_movie_genres(doc)
    # get the list of ratings
    ratings = get_movie_ratings(doc)
    # get the list of urls
    urls = get_movie_urls(doc)
    # get the list of synopses
    synopses = get_synopsis(doc)
    # get the list of downloads
    downloadeds = get_downloaded(doc)
    return movies, years, genres, ratings, urls, synopses, downloadeds

The scrap_page() function returns the lists of movies, years, genres, ratings, urls, synopses, and downloads for the page whose URL is passed as an argument to scrap_page(url).

Extract Movie Details for the Entire Website

import pandas as pd

def website_scrap():
    # create 7 empty lists, one per field, to collect the corresponding lists being returned
    all_movies, all_years, all_genres, all_ratings, all_urls, all_synopses, all_downloadeds = [], [], [], [], [], [], []
    for i in range(1, 21):
        url = 'https://yts.rs/browse-movies?page={}'.format(i)
        # get the lists of movie details for the page and append them to the final lists
        movies, years, genres, ratings, urls, synopses, downloadeds = scrap_page(url)
        all_movies += movies
        all_years += years
        all_genres += genres
        all_ratings += ratings
        all_urls += urls
        all_synopses += synopses
        all_downloadeds += downloadeds
    # create a dictionary from the final lists, with each movie detail as a 'key'
    movies_dict = {
        'Movie': all_movies,
        'Year': all_years,
        'Genre': all_genres,
        'Rating': all_ratings,
        'Url': all_urls,
        'Synopsis': all_synopses,
        'Downloads': all_downloadeds
    }

The website_scrap() function is the primary function from which all the other defined functions are executed. It collects the lists of details (movies, years, genres, ratings, urls, synopses, and downloadeds) from the individual pages and appends them to the corresponding larger lists (all_movies, all_years, all_genres, all_ratings, all_urls, all_synopses, and all_downloadeds). Finally, a dictionary movies_dict is created, with the larger lists serving as 'values' for the dictionary 'keys'.

Create a Pandas DataFrame using the Dictionary movies_dict

    # still inside website_scrap(): build the DataFrame and return it
    movies_df = pd.DataFrame(movies_dict, index=None)
    return movies_df

[DataFrame output: a table with the columns No, Movie, Year, Genre, Rating, Url, Synopsis, and Downloads]

Converting and Saving the DataFrame (Above Output) to a .csv File

movies_df.to_csv('movies_data.csv')  # converts the DataFrame object 'movies_df' to a csv file and saves it in .csv format
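A quick way to verify the saved file is to read it back with pandas. A minimal sketch of the round trip, using a tiny stand-in dictionary rather than the real scraped data; `index=False` (not used in the article's call) drops the extra row-number column:

```python
import pandas as pd

# a tiny stand-in for movies_dict, just to demonstrate the round trip
movies_dict = {
    'Movie': ['Whale Hunting'],
    'Year': ['1984'],
    'Rating': ['6.5 / 10'],
}

movies_df = pd.DataFrame(movies_dict)
movies_df.to_csv('movies_data.csv', index=False)

# read the file back to confirm the contents survived the round trip
check_df = pd.read_csv('movies_data.csv')
print(check_df.shape)  # (1, 3)
```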

When Opened in Notepad, the Contents of the .csv File Would Look Like This

Conclusion

  • Here is what we have accomplished in our project:
  • Downloaded a web page using requests.get() and the web page's URL.
  • Parsed the HTML source of the site with BeautifulSoup and created a doc object of type Beautiful Soup.
  • Defined a function to generate a doc object for each URL (page).
  • Defined functions to extract the movie details from each page: titles, years, genres, ratings, URLs, synopses, and downloads.
  • Created Python lists and dictionaries from the extracted data.
  • Created a pandas DataFrame to display the extracted information in tabular form.
  • Converted and saved the DataFrame as a .csv file.
If you are looking to scrape the movie details from YIFY Movies, contact iWeb Scraping today.

iWeb Scraping Company is a data scraping services provider offering web scraping services in the USA, India, Australia, UAE, UK, and more countries at affordable prices.
