The parser supports both the Russian and English languages.
I wrote this parser as part of my Career Guidance pet project to collect open source data (wikipedia.org in this particular case).
The parser is designed so that anyone can extend its functionality by redefining the `parse_func` method. More details follow in the description below.
The parser works with Wikipedia category pages. For example, if you are interested in data about science teachers, you google "wiki science teachers category" and Google returns a link to that category. The category page usually contains an alphabetic list of profiles of persons belonging to it. Using this parser you can extract any data you need from these profiles.
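For orientation, here is a rough sketch of how profile links can be collected from such a category page. This is an illustrative assumption, not WikiParser's actual code; the real implementation may differ in details:

```python
# illustrative sketch: on a Wikipedia category page, the category members
# are listed inside the element with id 'mw-pages'
import urllib.request

from bs4 import BeautifulSoup

def collect_profiles(category_url: str) -> list:
    html = BeautifulSoup(urllib.request.urlopen(category_url), 'lxml')
    profiles = []
    for link in html.find(id='mw-pages').find_all('a'):
        href = link.get('href', '')
        # keep only links to article pages
        if href.startswith('/wiki/'):
            profiles.append((link.text, 'https://en.wikipedia.org' + href))
    return profiles
```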
- Get a URL for a category page with the structure `https://en.wikipedia.org/wiki/Category:x...x`. Such URLs mean that the page contains profiles of persons from some category. To get this URL you can just google something like "wiki category serial killers" and Google gives you `https://en.wikipedia.org/wiki/Category:Serial_killers`. Important: the URL should be the full version, not shortened. If the link is shortened, the parser will not work.
- Define a `parse_func`. This is an empty method in the base parser class which should be defined by the user. It defines what data should be parsed from a single personal profile. Here is an example of such a function for parsing birth dates from a person's profile. The parser uses multiprocessing, so it is important that the types of the function's outputs are defined directly in the function.
```python
import urllib.request

from bs4 import BeautifulSoup


def parse_bday(x: tuple):
    """Returns the name and birth date of a person derived from the person's Wikipedia profile page.

    Args:
        x (tuple): tuple of strings with the structure ('person name', 'url to wiki page')
    """
    # unpack tuple
    name, url = x
    # parse url
    html = BeautifulSoup(urllib.request.urlopen(url), 'lxml')
    # parse required data
    try:
        bday = html.find_all('span', {'class': 'bday'})
        bday = BeautifulSoup(str(bday), 'lxml')
        bday = bday.span.string
    # handle the case when bday is unavailable
    except AttributeError:
        bday = None
    # for multiprocessing it is very important to directly define the type of output vars
    return str(name), str(bday)
```
Explanation: As you can see, the function takes a tuple with a predefined structure as its input argument. The tuple has the following structure: ('John Smith', 'URL to his Wikipedia profile'). The purpose of this function is to define how to parse the data the user requires from a profile.
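A hypothetical call might look like this, assuming the profile page exposes a `bday` span:

```python
# hypothetical usage; assumes the page contains a <span class="bday"> element
name, bday = parse_bday(('Alan Turing', 'https://en.wikipedia.org/wiki/Alan_Turing'))
print(name, bday)  # -> Alan Turing 1912-06-23
```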
- Run the parser:

```
python wiki_parser.py --category_url='https://en.wikipedia.org/wiki/Category:Serial_killers' --category='serial_killers' --threads=10
```
`--category`: all parsed profiles will have the category name in the output csv file, just to record which category was parsed. `--threads`: this variable defines how many processes you want to use while parsing. Some categories, for example football players, contain 30K profiles, and parsing them takes a while.
Explanation: The parser will take the defined `parse_func`, parse data from each profile in the category using multiprocessing, and save the results in a csv file in the current folder.
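A minimal sketch of what happens under the hood; this is an assumption for illustration, and the actual implementation in wiki_parser.py may differ in details:

```python
import csv
from multiprocessing import Pool

def run_category(profiles, parse_func, category, threads):
    """profiles: list of ('person name', 'profile url') tuples collected
    from the category page."""
    # parse all profiles in parallel with the user-defined parse_func
    with Pool(processes=threads) as pool:
        rows = pool.map(parse_func, profiles)
    # save results, tagging each row with the category name
    with open(category + '.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(list(row) + [category])
```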
This will parse the URL of the personal image for each person in a category, e.g. the category 'engineers'.
Everything is as described above except the `parse_func`:
```python
def parse_image(x: tuple):
    """Parse image from person's url.

    Args:
        x (tuple): tuple of strings with the following structure ('person name', 'url to wiki page')

    Returns:
        name (str): the same as the input, i.e. just a copy. Important: should be directly defined as str()
        image (str): string with the url to an image of the person from wikipedia
    """
    name, url = x  # unpack tuple
    html = BeautifulSoup(urllib.request.urlopen(url), 'lxml')
    # check that the page contains a photo; 'Фотография' ('Photograph') is the
    # alt text used for profile photos on Russian Wikipedia
    images = html.find_all('img')
    img_exist = images[0]['alt'] == 'Фотография' or images[0]['src'].endswith('.jpg')
    if img_exist:
        # src is protocol-relative ('//upload.wikimedia.org/...'), so strip the leading '//'
        return str(name), str(images[0].attrs['src'][2:])
    else:
        return str(name), str('None')
```
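A hypothetical call; note that the alt-text check targets Russian Wikipedia, so on English pages only the `.jpg` fallback does the matching:

```python
# hypothetical usage; the returned url is host-relative because the
# leading '//' of the protocol-relative src has been stripped
name, image = parse_image(('Alan Turing', 'https://en.wikipedia.org/wiki/Alan_Turing'))
# image -> something like 'upload.wikimedia.org/wikipedia/commons/...jpg'
```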
Currently I'm parsing a list of professions' categories from a Jupyter Notebook, using the wiki_parser.py script in a loop. It iteratively returns professionX.csv, profession by profession. The list of professions should be a dict. Please see the example below:
```python
from wiki_parser import *

professions_url = {
    'mechanics': 'url_to_wiki_category_mechanics',
    'engineers': 'url_to_wiki_category_engineers',
}

# run wikiparser in a loop for a list of professions
for category_name, category_url in professions_url.items():
    parser = WikiParser(
        category_url=category_url,
        category=category_name,
        threads=8,
    )
    # define func for parsing; in this case I just want birth dates
    parser.parse_func = parse_bday
    # run parser
    parser()
    del parser
```
But for now there is a bug: you can't run all the professions in a row, since the multiprocessing module raises an error:

```
MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f9e7bd86b90>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")
```
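The message suggests that a worker raised an exception whose traceback holds an unpicklable file object (e.g. a failed urllib request). A possible workaround, under that assumption, is to catch all exceptions inside the `parse_func` so that only plain strings travel back through the pool:

```python
# a sketch of a defensive wrapper (assumption: the crash comes from an
# exception raised inside a worker process); only plain strings cross
# the process boundary, so the pool never has to pickle an exception
def parse_bday_safe(x: tuple):
    try:
        return parse_bday(x)
    except Exception:
        name, _ = x
        return str(name), 'None'
```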
Otherwise, when the error occurs you need to restart the kernel and re-run from the point where you stopped, commenting out the professions in the dict that have already been parsed. As output you will receive:
```
├── script_folder
│   ├── professionA.csv
│   ├── ...
│   ├── professionN.csv
│   ├── requirements.txt
│   ├── README.md
│   └── wiki_parser.py
```
- support english wikipedia (change prefix) DONE
- pre-define several methods for `parse_func` DONE
- add use case with parsing images DONE
- rework function for parsing image and from ENG wikipedia
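As a possible starting point for the last item, here is a sketch. It assumes (as is common for English Wikipedia biography articles, though not guaranteed) that the portrait is the first `<img>` inside the infobox table:

```python
# a sketch for English Wikipedia pages; parse_image_en is a hypothetical
# name, and the infobox assumption may not hold for every article
def parse_image_en(x: tuple):
    name, url = x
    html = BeautifulSoup(urllib.request.urlopen(url), 'lxml')
    infobox = html.find('table', {'class': 'infobox'})
    img = infobox.find('img') if infobox else None
    if img:
        return str(name), str(img['src'][2:])  # strip the protocol-relative '//'
    return str(name), 'None'
```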