Header menu logo Teaching

BinderScriptNotebook

Webscraping HTML to CSV

Developed with Davide Costa

For the Football Players' Exercises we are trying to scrape data from Sports Reference's Football Reference website. The particular data that we want to scrape is player data from the Big 5 European Leauges table. This data consists on the stats of the players that played in the big 5 football european leagues in the most recent season. Since we need to scrape the data from a webpage and store it in a csv file we will need to use 2 type providers from FSharp.Data:

Both type providers are located in FSharp.Data assembly. To use it we need to load and open FSharp.Data.

#r "nuget: FSharp.Data, 5.0.2"
open FSharp.Data

Get the players' table using HTML type provider.

We need to define the type of data that we're going to access. We do this by providing a sample of the data. The type provider uses the sample to generate code to read data with that format.

type Big5EuropeanLeagues = 
    HtmlProvider<"https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats">

The type Big5EuropeanLeagues contains information about the structure of the web page. It knows what the html is, it knows what the html tables are, etc.

Now that we have the Big5EuropeanLeagues HTML type defined we can use GetSample() to load the sample webpage that we used to define the type.

let big5 = Big5EuropeanLeagues.GetSample()

Now, we have the HTML of the webpage stored in the variable big5. Let's observe the first 200 characters of the HTML.

big5.Html.ToString()[..200]
"<!DOCTYPE html>
<html data-version="klecko-" data-root="/home/fb/deploy/www/base" lang="en" class="no-js">
  <head>
    <meta charset="utf-8" /><meta http-equiv="x-ua-compatible" content="ie=edge" /"

From the tables available in the webpage, let's assign the player stats table to a variable called footballers.

let footballers = big5.Tables.``Player Standard Stats 2021-2022 Big 5 European Leagues``

Now let's observe the first 3 table rows. The Rows property gives us an array of rows. We can index into the array using [..2] to limit it to the first 3 rows.

footballers.Rows[..2]
[|("1", "Max Aarons", "eng ENG", "DF", "Norwich City", "engPremier League", "21",
   "2000", "34", "32", "2,881", "32.0", "0", "2", "2", "0", "0", "0", "8", "0",
   "0.8", "0.8", "1.7", "2.5", "78", "117", "85", "0.00", "0.06", "0.06", "0.00",
   "0.06", "0.03", "0.05", "0.08", "0.03", "0.08", "Matches");
  ("2", "Yunis Abdelhamid", "ma MAR", "DF", "Reims", "frLigue 1", "33", "1987",
   "34", "34", "2,983", "33.1", "2", "0", "2", "2", "0", "0", "5", "1", "1.2",
   "1.2", "0.8", "2.0", "23", "119", "14", "0.06", "0.00", "0.06", "0.06",
   "0.06", "0.04", "0.02", "0.06", "0.04", "0.06", "Matches");
  ("3", "Salis Abdul Samed", "gh GHA", "MF", "Clermont Foot", "frLigue 1", "21",
   "2000", "31", "29", "2,462", "27.4", "1", "0", "1", "1", "0", "0", "12", "3",
   "0.7", "0.7", "0.9", "1.6", "24", "108", "47", "0.04", "0.00", "0.04", "0.04",
   "0.04", "0.03", "0.03", "0.06", "0.03", "0.06", "Matches")|]

Let's look at the header fields in the table using the property Headers.

footballers.Headers
Some
  [|"Rk"; "Player"; "Nation"; "Pos"; "Squad"; "Comp"; "Age"; "Born";
    "Playing Time - MP"; "Playing Time - Starts"; "Playing Time - Min";
    "Playing Time - 90s"; "Performance - Gls"; "Performance - Ast";
    "Performance - G+A"; "Performance - G-PK"; "Performance - PK";
    "Performance - PKatt"; "Performance - CrdY"; "Performance - CrdR";
    "Expected - xG"; "Expected - npxG"; "Expected - xAG"; "Expected - npxG+xAG";
    "Progression - PrgC"; "Progression - PrgP"; "Progression - PrgR";
    "Per 90 Minutes - Gls"; "Per 90 Minutes - Ast"; "Per 90 Minutes - G+A";
    "Per 90 Minutes - G-PK"; "Per 90 Minutes - G+A-PK"; "Per 90 Minutes - xG";
    "Per 90 Minutes - xAG"; "Per 90 Minutes - xG+xAG"; "Per 90 Minutes - npxG";
    "Per 90 Minutes - npxG+xAG"; "Matches"|]

Let´s look at the first 5 rows of fields Player and Age.

[ for row in footballers.Rows[..4] do row.Player, row.Age]
[("Max Aarons", "21"); ("Yunis Abdelhamid", "33"); ("Salis Abdul Samed", "21");
 ("Laurent Abergel", "28"); ("Charles Abi", "21")]

Clean the data

The table data is not exactly what we need. To work with it, we need to clean it up.

Repeated headers

The table header line repeats after every 25 players. We can see this if we look at the 26th row of the table.

footballers.Rows[25]
("Rk", "Player", "Nation", "Pos", "Squad", "Comp", "Age", "Born", "MP", "Starts",
 "Min", "90s", "Gls", "Ast", "G+A", "G-PK", "PK", "PKatt", "CrdY", "CrdR", "xG",
 "npxG", "xAG", "npxG+xAG", "PrgC", "PrgP", "PrgR", "Gls", "Ast", "G+A", "G-PK",
 "G+A-PK", "xG", "xAG", "xG+xAG", "npxG", "npxG+xAG", "Matches")

We can remove these lines by using Array.filter. Array.filter applies a function to each element of the array and returns only those elements where the function evaluates to true. We'll create a function that evaluates to true if the player's name does not equal "Player"

let footballersNoHeaders = 
    footballers.Rows
    // 1. Rows where row.Player does not equal "Player"
    |> Array.filter (fun row -> row.Player <> "Player") 

Removing missing values

In order to make some arithmetic operations with age and goals scored, we need to convert the data into integers or floats/decimals. But some players' age and goals scored is missing, and converting empty strings ("") into these types returns an error.
We remove those lines by removing filtering to rows where age and goals are not "".

let footballersClean =
    footballersNoHeaders
    |> Array.filter (fun row -> 
        row.Age <> "" && 
        row.``Performance - Gls`` <> "")

Now we have the data we need, but there are more than 2500 rows.

footballersClean.Length

For our exercises we don't need so many observations, so let's transform it to get the best players by position!

let playersByPosition =
    footballersClean
    // 1. filter out odd positions with few players
    |> Array.filter(fun x -> x.Pos <> "GK,FW" && x.Pos <> "FW,DF" && x.Pos <> "DF,FW" )
    // 2. group players by position
    |> Array.groupBy(fun x -> x.Pos)
    // 3. sort positions.
    |> Array.sortBy (fun (position, _players) -> position)


let bestPlayersByPosition =
    // This will get top 25 players for each position
    // We create a new list by looping through each position group.
    [| for (position, playersInPosition) in playersByPosition do
        // 4. Sort best to worst (and convert to list at the end)
        let bestToWorst =
            playersInPosition
            |> Array.sortByDescending (fun player ->
                int player.``Performance - Gls``, player.Matches)
        // 5. Truncate to top 25
        let top25 = bestToWorst |> Array.truncate 25         
        top25 |]
    // 6. concatenate to a single big array
    |> Array.concat
    
let bestPlayers = 
    // sort best by position to overall top goal scorers,
    // with ties broken by alphabetical sorting on 
    // the players name and squad.
    bestPlayersByPosition
    |> Array.sortByDescending (fun player -> 
        int player.``Performance - Gls``,
        player.Player,
        player.Squad)

Create a Csv and store the Data using Csv Provider.

Now that the data is scraped from the webpage and stored to FootballPlayersParsedTable variable. We need to save the data to a Csv file in order to use it in the Exercises.

First we need to construct a sample of the data that will be stored in the comma separated (csv) File. By running footballers.Headers as done previously, we can easily observe that the table has fields that will not be used.
The csv file doesn't need to contain all those fields, we only need:

The csv type provider will infer csv field types from a sample that you give it. It can infer the types from rows of the sample or from explicitly defined types added in parentheses after column names. We'll use explicit column type definitions in our sample.

[<Literal>]
let FootballPlayersCsvSample = 
    "Player (string)," +
    "Nation (string)," +
    "Position (string)," +
    "Team (string)," +
    "League (string)," +
    "Age (int)," +
    "MatchesPlayed (int)," +
    "GoalsScored (int)"

// the sample csv file that we've created:
FootballPlayersCsvSample
"Player (string),Nation (string),Position (string),Team (string),League (string),Age (int),MatchesPlayed (int),GoalsScored (int)"

With the sample created, now we define the type from the sample.

type FootballPlayersCsv = CsvProvider<FootballPlayersCsvSample,ResolutionFolder = __SOURCE_DIRECTORY__>

Now that we have the data and the Csv sample let's create a "list of CSV rows".

let bestPlayersCsvRows = 
    [ for player in bestPlayers do
        FootballPlayersCsv.Row( 
            player = player.Player,
            nation = player.Nation,
            position = player.Pos,
            team = player.Squad,
            league = player.Comp,
            // For the age variable just take the year (first two digits).
            // Sometimes 29 and 185 days old is given as "29-185" and we
            // want to throw away the days part.
            age = int player.Age.[0..1], 
            matchesPlayed = int player.``Playing Time - MP``,
            goalsScored = int player.``Performance - Gls`` ) ]

Note that we use int to convert age, matchesPlayed and goalsScored because those fields' values are strings in the html table and we want integers instead.

Let's look at the first 5 csv rows. We don't need to use bestPlayersCsvRows.Rows because the variable is already a list of csv rows.

bestPlayersCsvRows[0..4]
[("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32, 34,
  35); ("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
 ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
 ("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27);
 ("Wissam Ben Yedder", "fr FRA", "FW", "Monaco", "frLigue 1", 30, 37, 25)]

Rather than a "list of Csv rows", we want a "Csv file". Here's how we do that.

let bestPlayersCsvFile = new FootballPlayersCsv(bestPlayersCsvRows)

Ok, let's write the file. Remember that __SOURCE_DIRECTORY__ is a magic variable that points to whatever folder this code file (aka the source code file) is contained in. So this will write the data to a csv file named "FootballPlayers.csv" in the current directory.

let filePath = System.IO.Path.Combine(__SOURCE_DIRECTORY__,"FootballPlayers.csv")
bestPlayersCsvFile.Save(filePath)

And if you want to read the data back in from the file to see that it works:

let backIn = FootballPlayersCsv.Load(filePath)

backIn.Rows
|> Seq.truncate 5
|> Seq.iter (printfn "%A")
("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32, 34,
 35)
("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28)
("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27)
("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27)
("Wissam Ben Yedder", "fr FRA", "FW", "Monaco", "frLigue 1", 30, 37, 25)
Multiple items
namespace FSharp

--------------------
namespace Microsoft.FSharp
Multiple items
namespace FSharp.Data

--------------------
namespace Microsoft.FSharp.Data
type Big5EuropeanLeagues = HtmlProvider<...>
type HtmlProvider
<summary>Typed representation of an HTML file.</summary> <param name='Sample'>Location of an HTML sample file or a string containing a sample HTML document.</param> <param name='PreferOptionals'>When set to true, inference will prefer to use the option type instead of nullable types, <c>double.NaN</c> or <c>""</c> for missing values. Defaults to false.</param> <param name='IncludeLayoutTables'>Includes tables that are potentially layout tables (with cellpadding=0 and cellspacing=0 attributes)</param> <param name='MissingValues'>The set of strings recognized as missing values. Defaults to <c>NaN,NA,N/A,#N/A,:,-,TBA,TBD</c>.</param> <param name='Culture'>The culture used for parsing numbers and dates. Defaults to the invariant culture.</param> <param name='Encoding'>The encoding used to read the sample. You can specify either the character set name or the codepage number. Defaults to UTF8 for files, and to ISO-8859-1 the for HTTP requests, unless <c>charset</c> is specified in the <c>Content-Type</c> response header.</param> <param name='ResolutionFolder'>A directory that is used when resolving relative file references (at design time and in hosted execution).</param> <param name='EmbeddedResource'>When specified, the type provider first attempts to load the sample from the specified resource (e.g. 'MyCompany.MyAssembly, resource_name.html'). This is useful when exposing types generated by the type provider.</param>
val big5: HtmlProvider<...>
HtmlProvider<...>.GetSample() : HtmlProvider<...>
property Runtime.BaseTypes.HtmlDocument.Html: HtmlDocument with get
override HtmlDocument.ToString: unit -> string
val footballers: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues
property HtmlProvider<...>.Tables: HtmlProvider<...>.TablesContainer with get
property Runtime.BaseTypes.HtmlTable.Rows: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array with get
property Runtime.BaseTypes.HtmlTable.Headers: string array option with get
val row: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Player: string with get
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Age: string with get
val footballersNoHeaders: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
module Array from Microsoft.FSharp.Collections
val filter: predicate: ('T -> bool) -> array: 'T array -> 'T array
val footballersClean: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
property System.Array.Length: int with get
<summary>Gets the total number of elements in all the dimensions of the <see cref="T:System.Array" />.</summary>
<exception cref="T:System.OverflowException">The array is multidimensional and contains more than <see cref="F:System.Int32.MaxValue">Int32.MaxValue</see> elements.</exception>
<returns>The total number of elements in all the dimensions of the <see cref="T:System.Array" />; zero if there are no elements in the array.</returns>
val playersByPosition: (string * HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array) array
val x: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Pos: string with get
val groupBy: projection: ('T -> 'Key) -> array: 'T array -> ('Key * 'T array) array (requires equality)
val sortBy: projection: ('T -> 'Key) -> array: 'T array -> 'T array (requires comparison)
val position: string
val _players: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
val bestPlayersByPosition: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
val playersInPosition: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
val bestToWorst: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
val sortByDescending: projection: ('T -> 'Key) -> array: 'T array -> 'T array (requires comparison)
val player: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row
Multiple items
val int: value: 'T -> int (requires member op_Explicit)

--------------------
type int = int32

--------------------
type int<'Measure> = int
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Matches: string with get
val top25: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
val truncate: count: int -> array: 'T array -> 'T array
val concat: arrays: 'T array seq -> 'T array
val bestPlayers: HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row array
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Squad: string with get
Multiple items
type LiteralAttribute = inherit Attribute new: unit -> LiteralAttribute

--------------------
new: unit -> LiteralAttribute
[<Literal>] val FootballPlayersCsvSample: string = "Player (string),Nation (string),Position (string),Team (string),League (string),Age (int),MatchesPlayed (int),GoalsScored (int)"
type FootballPlayersCsv = CsvProvider<...>
type CsvProvider
<summary>Typed representation of a CSV file.</summary> <param name='Sample'>Location of a CSV sample file or a string containing a sample CSV document.</param> <param name='Separators'>Column delimiter(s). Defaults to <c>,</c>.</param> <param name='InferRows'>Number of rows to use for inference. Defaults to <c>1000</c>. If this is zero, all rows are used.</param> <param name='Schema'>Optional column types, in a comma separated list. Valid types are <c>int</c>, <c>int64</c>, <c>bool</c>, <c>float</c>, <c>decimal</c>, <c>date</c>, <c>datetimeoffset</c>, <c>timespan</c>, <c>guid</c>, <c>string</c>, <c>int?</c>, <c>int64?</c>, <c>bool?</c>, <c>float?</c>, <c>decimal?</c>, <c>date?</c>, <c>datetimeoffset?</c>, <c>timespan?</c>, <c>guid?</c>, <c>int option</c>, <c>int64 option</c>, <c>bool option</c>, <c>float option</c>, <c>decimal option</c>, <c>date option</c>, <c>datetimeoffset option</c>, <c>timespan option</c>, <c>guid option</c> and <c>string option</c>. You can also specify a unit and the name of the column like this: <c>Name (type&lt;unit&gt;)</c>, or you can override only the name. If you don't want to specify all the columns, you can reference the columns by name like this: <c>ColumnName=type</c>.</param> <param name='HasHeaders'>Whether the sample contains the names of the columns as its first line.</param> <param name='IgnoreErrors'>Whether to ignore rows that have the wrong number of columns or which can't be parsed using the inferred or specified schema. Otherwise an exception is thrown when these rows are encountered.</param> <param name='SkipRows'>Skips the first n rows of the CSV file.</param> <param name='AssumeMissingValues'>When set to true, the type provider will assume all columns can have missing values, even if in the provided sample all values are present. Defaults to false.</param> <param name='PreferOptionals'>When set to true, inference will prefer to use the option type instead of nullable types, <c>double.NaN</c> or <c>""</c> for missing values. Defaults to false.</param> <param name='Quote'>The quotation mark (for surrounding values containing the delimiter). Defaults to <c>"</c>.</param> <param name='MissingValues'>The set of strings recognized as missing values specified as a comma-separated string (e.g., "NA,N/A"). Defaults to <c>NaN,NA,N/A,#N/A,:,-,TBA,TBD</c>.</param> <param name='CacheRows'>Whether the rows should be caches so they can be iterated multiple times. Defaults to true. Disable for large datasets.</param> <param name='Culture'>The culture used for parsing numbers and dates. Defaults to the invariant culture.</param> <param name='Encoding'>The encoding used to read the sample. You can specify either the character set name or the codepage number. Defaults to UTF8 for files, and to ISO-8859-1 the for HTTP requests, unless <c>charset</c> is specified in the <c>Content-Type</c> response header.</param> <param name='ResolutionFolder'>A directory that is used when resolving relative file references (at design time and in hosted execution).</param> <param name='EmbeddedResource'>When specified, the type provider first attempts to load the sample from the specified resource (e.g. 'MyCompany.MyAssembly, resource_name.csv'). This is useful when exposing types generated by the type provider.</param>
val bestPlayersCsvRows: CsvProvider<...>.Row list
type Row = inherit string * string * string * string * string * int * int * int new: player: string * nation: string * position: string * team: string * league: string * age: int * matchesPlayed: int * goalsScored: int -> Row member Age: int member GoalsScored: int member League: string member MatchesPlayed: int member Nation: string member Player: string member Position: string member Team: string
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Nation: string with get
property HtmlProvider<...>.PlayerStandardStats20212022Big5EuropeanLeagues.Row.Comp: string with get
val bestPlayersCsvFile: FootballPlayersCsv
val filePath: string
namespace System
namespace System.IO
type Path = static member ChangeExtension: path: string * extension: string -> string static member Combine: path1: string * path2: string -> string + 3 overloads static member EndsInDirectorySeparator: path: ReadOnlySpan<char> -> bool + 1 overload static member Exists: path: string -> bool static member GetDirectoryName: path: ReadOnlySpan<char> -> ReadOnlySpan<char> + 1 overload static member GetExtension: path: ReadOnlySpan<char> -> ReadOnlySpan<char> + 1 overload static member GetFileName: path: ReadOnlySpan<char> -> ReadOnlySpan<char> + 1 overload static member GetFileNameWithoutExtension: path: ReadOnlySpan<char> -> ReadOnlySpan<char> + 1 overload static member GetFullPath: path: string -> string + 1 overload static member GetInvalidFileNameChars: unit -> char array ...
<summary>Performs operations on <see cref="T:System.String" /> instances that contain file or directory path information. These operations are performed in a cross-platform manner.</summary>
System.IO.Path.Combine([<System.ParamArray>] paths: string array) : string
System.IO.Path.Combine(path1: string, path2: string) : string
System.IO.Path.Combine(path1: string, path2: string, path3: string) : string
System.IO.Path.Combine(path1: string, path2: string, path3: string, path4: string) : string
member Runtime.CsvFile.Save: path: string * [<System.Runtime.InteropServices.Optional>] ?separator: char * [<System.Runtime.InteropServices.Optional>] ?quote: char -> unit
member Runtime.CsvFile.Save: stream: System.IO.Stream * [<System.Runtime.InteropServices.Optional>] ?separator: char * [<System.Runtime.InteropServices.Optional>] ?quote: char -> unit
member Runtime.CsvFile.Save: writer: System.IO.TextWriter * [<System.Runtime.InteropServices.Optional>] ?separator: char * [<System.Runtime.InteropServices.Optional>] ?quote: char -> unit
val backIn: CsvProvider<...>
CsvProvider<...>.Load(uri: string) : CsvProvider<...>
Loads CSV from the specified uri
CsvProvider<...>.Load(reader: System.IO.TextReader) : CsvProvider<...>
Loads CSV from the specified reader
CsvProvider<...>.Load(stream: System.IO.Stream) : CsvProvider<...>
Loads CSV from the specified stream
property Runtime.CsvFile.Rows: CsvProvider<...>.Row seq with get
<summary> The rows with data </summary>
module Seq from Microsoft.FSharp.Collections
val truncate: count: int -> source: 'T seq -> 'T seq
val iter: action: ('T -> unit) -> source: 'T seq -> unit
val printfn: format: Printf.TextWriterFormat<'T> -> 'T

Type something to start searching.