(** [](https://mybinder.org/v2/gh/nhirschey/teaching/gh-pages?filepath=football-1-download-data.ipynb) [](/Teaching//football-1-download-data.fsx) [](/Teaching//football-1-download-data.ipynb) # Webscraping HTML to CSV > Developed with [Davide Costa](https://github.com/DavideGCosta) > For the **Football Players' Exercises** we are trying to scrape data from [Sports Reference's](https://www.sports-reference.com/) [Football Reference](https://fbref.com/en/) website. The particular data that we want to scrape is player data from the [Big 5 European Leauges table](https://fbref.com/en/comps/Big5/stats/players/Big-5-European-Leagues-Stats). This data consists on the stats of the players that played in the big 5 football european leagues in the most recent season. Since we need to scrape the data from a webpage and store it in a csv file we will need to use 2 type providers from [FSharp.Data](https://fsharp.github.io/FSharp.Data/): * We'll use the [HTML Type Provider](https://fsprojects.github.io/FSharp.Data/library/HtmlProvider.html) to scrape the html from the webpage and access the players' table. * We'll use the [CSV Type Provider](https://fsprojects.github.io/FSharp.Data/library/CsvProvider.html) to save the data to a csv file. Both type providers are located in `FSharp.Data` assembly. To use it we need to load and open `FSharp.Data`. *) #r "nuget: FSharp.Data, 5.0.2" open FSharp.Data (** ## Get the players' table using HTML type provider. We need to define the type of data that we're going to access. We do this by providing a sample of the data. The type provider uses the sample to generate code to read data with that format. *) type Big5EuropeanLeagues = HtmlProvider<"https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats"> (** The type `Big5EuropeanLeagues` contains information about the structure of the web page. It knows what the html is, it knows what the html tables are, etc. Now that we have the `Big5EuropeanLeagues` HTML type defined we can use `GetSample()` to load the sample webpage that we used to define the type. *) let big5 = Big5EuropeanLeagues.GetSample() (** Now, we have the HTML of the webpage stored in the variable `big5`. Let's observe the first 200 characters of the HTML. *) big5.Html.ToString()[..200](* output: "
Array.filter (fun row -> row.Player <> "Player") (** ### Removing missing values In order to make some arithmetic operations with age and goals scored, we need to convert the data into integers or floats/decimals. But some players' age and goals scored is missing, and converting empty strings ("") into these types returns an error. We remove those lines by removing filtering to rows where age and goals are not "". *) let footballersClean = footballersNoHeaders |> Array.filter (fun row -> row.Age <> "" && row.``Performance - Gls`` <> "") (** Now we have the data we need, but there are more than 2500 rows. *) footballersClean.Length (** For our exercises we don't need so many observations, so let's transform it to get the best players by position! *) let playersByPosition = footballersClean // 1. filter out odd positions with few players |> Array.filter(fun x -> x.Pos <> "GK,FW" && x.Pos <> "FW,DF" && x.Pos <> "DF,FW" ) // 2. group players by position |> Array.groupBy(fun x -> x.Pos) // 3. sort positions. |> Array.sortBy (fun (position, _players) -> position) let bestPlayersByPosition = // This will get top 25 players for each position // We create a new list by looping through each position group. [| for (position, playersInPosition) in playersByPosition do // 4. Sort best to worst (and convert to list at the end) let bestToWorst = playersInPosition |> Array.sortByDescending (fun player -> int player.``Performance - Gls``, player.Matches) // 5. Truncate to top 25 let top25 = bestToWorst |> Array.truncate 25 top25 |] // 6. concatenate to a single big array |> Array.concat let bestPlayers = // sort best by position to overall top goal scorers, // with ties broken by alphabetical sorting on // the players name and squad. bestPlayersByPosition |> Array.sortByDescending (fun player -> int player.``Performance - Gls``, player.Player, player.Squad) (** ## Create a Csv and store the Data using Csv Provider. Now that the data is scraped from the webpage and stored to `FootballPlayersParsedTable` variable. We need to save the data to a Csv file in order to use it in the Exercises. First we need to construct a sample of the data that will be stored in the comma separated (csv) File. By running `footballers.Headers` as done previously, we can easily observe that the table has fields that will not be used. The csv file doesn't need to contain all those fields, we only need: * `Player` (Players' Name) * `Nation` (Players' Nationality) * `Pos` (Players' Position) * `Squad` (Players' Team) * `Comp` (Players' League) * `Age` (Players' Age) * `Playing Time - MP` (Players' Matches Played) * `Performance - Gls` (Players' Goals Scored) The csv type provider will infer csv field types from a sample that you give it. It can infer the types from rows of the sample or from explicitly defined types added in parentheses after column names. We'll use explicit column type definitions in our sample. *) [