Thursday, April 30, 2020

Python: test links, collect the path of broken links

I wrote a script that collects all the links from my website (internal and external) and reports the broken ones.

Here is my code; it works well:

import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
# number of urls visited so far will be stored here
total_urls_visited = 0
total_broken_link = set()
output = 'output.txt'

def is_valid(url):
    """
    Checks whether `url` is a valid URL.

    Every URL follows the format: <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Example: http://www.example.com/index?search=src
    Here, www.example.com is the netloc, index is the path,
    search is the query parameter, and src is the value passed to it.
    Requiring both `scheme` and `netloc` makes sure a proper protocol
    (e.g. http or https) and a domain name are present in the URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs found on `url`; a Python set() avoids duplicate links
    urls = set()
    # domain name of the URL without the protocol, to check if the link is internal or external
    domain_name = urlparse(url).netloc
    #Python library for pulling data out of HTML or XML files
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # print(soup.prettify()) #test if the html of the page is correctly displaying
    # print(soup.find_all('a')) #collect all the anchor tag

    for a_tag in soup.find_all("a"):
        href = a_tag.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        href = urljoin(url, href) # resolve relative URLs against the current page
        #print(internal_urls)
        # print('href:' + href)
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                # print("External link:" + href)
                # print((requests.get(href)).status_code)
                is_broken_link(href)
                external_urls.add(href)
            continue
        # print("Internal link:" + href)
        # print((requests.get(href)).status_code)
        is_broken_link(href)
        urls.add(href) #because it is not an external link
        internal_urls.add(href) #because it is not an external link 
    return urls

def is_broken_link(url):
    if requests.get(url).status_code != 200:
        # print("This link is broken")
        print(url)
        total_broken_link.add(url)
        return True
    else:
        # print("This link works well")
        return False


def crawl(url, max_urls=80):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): maximum number of URLs to crawl.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)


if __name__ == "__main__":
    crawl('https://www.example.com/')

    print('Total External links:' + str(len(external_urls)))
    print('Total Internal links:' + str(len(internal_urls)))
    print('Total:' + str(len(external_urls) + len(internal_urls)))
    print('Be careful: ' + str(len(total_broken_link)) + ' broken links found!')

When I run the script, it prints all the broken links and their total count.

But I also want to display the path of each broken link.

For example, I might find an internal broken link such as https://www.example.com/brokenlink, or an external one such as https://www.otherwebsite.com/brokenlink.

I want to know where those broken links are referenced, that is, on which page, so that I can solve the problem. If I know which pages contain these broken links, I can easily find and remove them so the issue does not come back.

So I want this script to display each broken link together with its path, and then the number of broken links.

I hope it was clear enough!
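One way to get this, as a sketch rather than part of the original script: instead of a plain set, keep a dictionary that maps each broken URL to the set of pages it was found on, and pass the current page into is_broken_link as an extra (hypothetical) `parent` parameter. The `timeout` and exception handling are additions, since a hung or failing request would otherwise stall the crawl.

```python
import requests

# Map each broken URL to the set of pages that link to it
# (this dict would replace the original `total_broken_link` set).
broken_links = {}

def is_broken_link(url, parent):
    """Check `url`; if broken, remember which page (`parent`) linked to it."""
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException:
        status = None  # treat network errors as broken too
    if status != 200:
        broken_links.setdefault(url, set()).add(parent)
        return True
    return False

def report_broken_links():
    """Print every broken link together with the pages that reference it."""
    for broken, parents in broken_links.items():
        print(broken)
        for page in sorted(parents):
            print('  found on: ' + page)
    print('Be careful: ' + str(len(broken_links)) + ' broken links found!')
```

In get_all_website_links, the two calls would then become `is_broken_link(href, parent=url)`, since `url` there is exactly the page being parsed, and report_broken_links() would replace the final print in the `__main__` block.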
