Contents
Introduction
This repository contains the SoccerNews2018 dataset. It as database that contains statistics and news about the teams participating in the 2018 Brazillian Championship.
The content is separated into a set of JSON files.
Data
The data was divided into four parts: teams, matches, statistics and news. In the data directory are all data in JSON files.
The set o JSON files are structured exactly the same as the SQL. Thus, they follow a unique pattern for queries and use in any programming languages.
- Teams is the main file. He contains the twenty championship teams and each has a identifier from one to twenty, plus the twenty one wich is the draw.
- Matches is a set of files that contains all matches of the league. Each match contains a set of identifiers, the date, the game attendance and the income of the game.
- Statistics is a set of files that contains all basic statistics of the matches. Each match contains two statistics, one of the mandant team and other of the visitant team. Each group represents one statistic, so each match contains two statistic groups.
- News is a set of files that contains news about all teams of the league. The news comes from all teams participating in the championship, from April to December 2018.
Algorithms
The NewsCrawler algorithms used for create this database are in directory NewsCrawler. The Statistics_Crawler algorithms used for create this database are in directory crawler_statistics.
- NewsCrawler is the news crawler. This directory contains the entities that are responsible for extracting and storing the news about the teams.
- For execute this crawler, use the command:
python3 NewsCrawler.py
- For execute this crawler, use the command:
- GetStatistics is the statistics crawler. This crawler works by reading a file of links and visiting all of them to extract the HTML element that has a class called game_stats.
- For execute this crawler, use the command:
python3 get_statistics.py
- For execute this crawler, use the command:
For any question, send a email to juliocmalvares07@gmail.com