Webscraping HTML to CSV
Developed with Davide Costa
For the Football Players' Exercises we are trying to scrape data from Sports Reference's Football Reference website. The particular data that we want to scrape is player data from the Big 5 European Leagues table. This data consists of the stats of the players that played in the big 5 European football leagues in the most recent season. Since we need to scrape the data from a webpage and store it in a CSV file, we will need two type providers from FSharp.Data:
- We'll use the HTML Type Provider to scrape the HTML from the webpage and access the players' table.
- We'll use the CSV Type Provider to save the data to a CSV file.
Both type providers are located in the FSharp.Data assembly. To use them we need to load and open FSharp.Data.
#r "nuget: FSharp.Data, 5.0.2"
open FSharp.Data
Get the players' table using the HTML type provider
We need to define the type of data that we're going to access. We do this by providing a sample of the data. The type provider uses the sample to generate code to read data with that format.
type Big5EuropeanLeagues =
    HtmlProvider<"https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats">
The type Big5EuropeanLeagues contains information about the structure of the web page: it knows what the HTML is, what the HTML tables are, and so on.
Now that we have the Big5EuropeanLeagues HTML type defined, we can use GetSample() to load the sample webpage that we used to define the type.
let big5 = Big5EuropeanLeagues.GetSample()
Now we have the HTML of the webpage stored in the variable big5. Let's observe the first 200 characters of the HTML.
big5.Html.ToString()[..200]
From the tables available in the webpage, let's assign the player stats table to a variable called footballers.
let footballers = big5.Tables.``Player Standard Stats 2021-2022 Big 5 European Leagues``
Now let's observe the first 3 table rows. The Rows property gives us an array of rows, and we can slice the array with [..2] to limit it to the first 3 rows.
footballers.Rows[..2]
Let's look at the header fields in the table using the Headers property.
footballers.Headers
Let's look at the first 5 rows of the fields Player and Age.
[ for row in footballers.Rows[..4] do row.Player, row.Age]
Clean the data
The table data is not exactly what we need. To work with it, we need to clean it up.
Repeated headers
The table header line repeats after every 25 players. We can see this if we look at the 26th row of the table.
footballers.Rows[25]
We can remove these lines using Array.filter, which applies a function to each element of the array and returns only those elements where the function evaluates to true. We'll create a function that evaluates to true if the player's name does not equal "Player".
let footballersNoHeaders =
    footballers.Rows
    // 1. Rows where row.Player does not equal "Player"
    |> Array.filter (fun row -> row.Player <> "Player")
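As a quick sanity check (this step is not in the original walkthrough), we can confirm that no header rows remain after the filter:

// Count rows whose Player field still equals "Player"; this should be 0.
footballersNoHeaders
|> Array.filter (fun row -> row.Player = "Player")
|> Array.length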
Removing missing values
In order to do some arithmetic with age and goals scored, we need to convert the data into integers or floats/decimals. But some players' age and goals scored are missing, and converting empty strings ("") into these types throws an error.
We remove those lines by filtering to rows where age and goals are not "".
let footballersClean =
    footballersNoHeaders
    |> Array.filter (fun row ->
        row.Age <> "" &&
        row.``Performance - Gls`` <> "")
Now we have the data we need, but there are more than 2500 rows.
footballersClean.Length
For our exercises we don't need so many observations, so let's transform it to get the best players by position!
let playersByPosition =
    footballersClean
    // 1. filter out odd positions with few players
    |> Array.filter (fun x -> x.Pos <> "GK,FW" && x.Pos <> "FW,DF" && x.Pos <> "DF,FW")
    // 2. group players by position
    |> Array.groupBy (fun x -> x.Pos)
    // 3. sort positions
    |> Array.sortBy (fun (position, _players) -> position)
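To get a feel for the grouping (an optional, illustrative step), we can list each position together with the number of players in that group:

// Each element of playersByPosition is a (position, players) pair.
[ for (position, playersInPosition) in playersByPosition do
    position, playersInPosition.Length ]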
let bestPlayersByPosition =
    // This will get the top 25 players for each position.
    // We create a new array by looping through each position group.
    [| for (position, playersInPosition) in playersByPosition do
        // 4. sort best to worst
        let bestToWorst =
            playersInPosition
            |> Array.sortByDescending (fun player ->
                int player.``Performance - Gls``, player.Matches)
        // 5. truncate to the top 25
        let top25 = bestToWorst |> Array.truncate 25
        top25 |]
    // 6. concatenate to a single big array
    |> Array.concat
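As an optional check, we can count how many players were kept per position; each group should contain at most 25 players:

bestPlayersByPosition
|> Array.countBy (fun player -> player.Pos)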
let bestPlayers =
    // sort the best players by position to get the overall top goal scorers,
    // with ties broken by sorting on the player's name and squad.
    bestPlayersByPosition
    |> Array.sortByDescending (fun player ->
        int player.``Performance - Gls``,
        player.Player,
        player.Squad)
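For a quick, illustrative peek at the result, here are the first 3 players after sorting together with their goal counts:

[ for player in bestPlayers[..2] do player.Player, player.``Performance - Gls`` ]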
Create a CSV file and store the data using the CSV type provider
Now the data is scraped from the webpage and stored in the bestPlayers variable. We need to save the data to a CSV file in order to use it in the Exercises.
First we need to construct a sample of the data that will be stored in the comma-separated values (CSV) file.
By running footballers.Headers as we did previously, we can see that the table has fields that will not be used. The CSV file doesn't need to contain all of those fields; we only need:
- Player (the players' name)
- Nation (the players' nationality)
- Pos (the players' position)
- Squad (the players' team)
- Comp (the players' league)
- Age (the players' age)
- Playing Time - MP (the players' matches played)
- Performance - Gls (the players' goals scored)
The CSV type provider will infer CSV field types from a sample that you give it. It can infer the types from rows of the sample or from explicitly defined types added in parentheses after column names. We'll use explicit column type definitions in our sample.
[<Literal>]
let FootballPlayersCsvSample =
    "Player (string)," +
    "Nation (string)," +
    "Position (string)," +
    "Team (string)," +
    "League (string)," +
    "Age (int)," +
    "MatchesPlayed (int)," +
    "GoalsScored (int)"
// the sample csv file that we've created:
FootballPlayersCsvSample
With the sample created, now we define the type from the sample.
type FootballPlayersCsv = CsvProvider<FootballPlayersCsvSample,ResolutionFolder = __SOURCE_DIRECTORY__>
Now that we have the data and the CSV sample, let's create a "list of CSV rows".
let bestPlayersCsvRows =
    [ for player in bestPlayers do
        FootballPlayersCsv.Row(
            player = player.Player,
            nation = player.Nation,
            position = player.Pos,
            team = player.Squad,
            league = player.Comp,
            // For the age variable we just take the years (the first two characters).
            // Sometimes an age of 29 years and 185 days is given as "29-185", and we
            // want to throw away the days part.
            age = int player.Age.[0..1],
            matchesPlayed = int player.``Playing Time - MP``,
            goalsScored = int player.``Performance - Gls`` ) ]
Note that we use int to convert age, matchesPlayed, and goalsScored because those fields' values are strings in the HTML table and we want integers instead.
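As a small illustration (the "29-185" value below is just a made-up example of the years-days format mentioned above), slicing the first two characters before converting gives the age in years:

int ("29-185".[0..1]) // "29" -> 29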
Let's look at the first 5 CSV rows. We don't need to use bestPlayersCsvRows.Rows because the variable is already a list of CSV rows.
bestPlayersCsvRows[0..4]
Rather than a "list of Csv rows", we want a "Csv file". Here's how we do that.
let bestPlayersCsvFile = new FootballPlayersCsv(bestPlayersCsvRows)
Ok, let's write the file. Remember that __SOURCE_DIRECTORY__ is a magic variable that points to whatever folder this code file (aka the source code file) is contained in. So this will write the data to a CSV file named "FootballPlayers.csv" in the current directory.
let filePath = System.IO.Path.Combine(__SOURCE_DIRECTORY__,"FootballPlayers.csv")
bestPlayersCsvFile.Save(filePath)
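If you want to confirm that the file was written (an optional check), you can test that it exists on disk:

System.IO.File.Exists filePath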
And if you want to read the data back in from the file to see that it works:
let backIn = FootballPlayersCsv.Load(filePath)
backIn.Rows
|> Seq.truncate 5
|> Seq.iter (printfn "%A")