Using List collection functions and calculating summary statistics.

Developed with Davide Costa

You should now feel comfortable with the footballer dataset and with tuples, records, and anonymous records. You should also know how to perform simple transformations. With a large and heterogeneous dataset, it is useful to know how to sort, group, and filter the data, and to be familiar with the many other List functions.

It is a good idea to browse the documentation for lists at the F# language reference and the F# core library documentation sites before you start. For further discussion of collection functions, the related F# for fun and profit page is also useful.

Reference the needed NuGet packages and open the namespaces.

#r "nuget: FSharp.Data, 5.0.2"
#r "nuget: FSharp.Stats, 0.5.0"

open FSharp.Data
open FSharp.Stats
open FSharp.Stats.Correlation

Load the Csv file.

let [<Literal>] CsvPath = __SOURCE_DIRECTORY__ + "/FootballPlayers.csv"
type FootballPlayersCsv = CsvProvider<CsvPath>

let playerStatsTable = 
    FootballPlayersCsv.GetSample().Rows
    |> Seq.toList

List Functions.

1 List.take

List.take 5 takes the first 5 rows.
List.take 2 takes the first 2 rows.

Example: Take the first 4 rows from playerStatsTable with List.take.

playerStatsTable
|> List.take 4

val it: CsvProvider<...>.Row list =
  [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32,
    34, 35);
   ("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
   ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
   ("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27)]

Take the first 7 rows from playerStatsTable with List.take.

answer

playerStatsTable
|> List.take 7

val it: CsvProvider<...>.Row list =
  [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32,
    34, 35);
   ("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
   ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
   ("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27);
   ("Wissam Ben Yedder", "fr FRA", "FW", "Monaco", "frLigue 1", 30, 37, 25);
   ("Patrik Schick", "cz CZE", "FW", "Leverkusen", "deBundesliga", 25, 27, 24);
   ("Son Heung-min", "kr KOR", "MF,FW", "Tottenham", "engPremier League", 29,
    35, 23)]

2 List.truncate

List.truncate 5 takes the first 5 rows.
List.truncate 2 takes the first 2 rows.

You may have noticed that List.take and List.truncate return similar outputs, but they are not exactly the same. List.take gives you exactly the number of items that you specify, while List.truncate gives you at most that number. In most cases both return the same output. They differ only when you ask for more items than the List contains (its length). In that scenario List.truncate returns all the elements in the List, while List.take throws an exception, because it must return exactly the number of elements you asked for, which is impossible here.
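
The difference is easiest to see on a list that is shorter than the requested count. The following is a minimal sketch on a made-up three-element list (not the player dataset); the `try ... with` wrapper is only there so that the script keeps running after List.take fails.

```fsharp
let shortList = [1; 2; 3]

// List.truncate returns at most 5 items, so the 3-item list comes back whole.
let truncated = shortList |> List.truncate 5

// List.take needs exactly 5 items and throws because only 3 are available.
// The exception is turned into None so the script can continue.
let taken =
    try
        Some (shortList |> List.take 5)
    with _ ->
        None

printfn "truncate: %A" truncated
printfn "take: %A" taken
```

This is why the rest of this page uses List.truncate whenever it only wants to preview a few rows.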

Example: Take the first 4 rows from playerStatsTable with List.truncate.

playerStatsTable
|> List.truncate 4

val it: CsvProvider<...>.Row list =
  [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32,
    34, 35);
   ("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
   ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
   ("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27)]

Take the first 7 rows from playerStatsTable with List.truncate.

answer

playerStatsTable
|> List.truncate 7

val it: CsvProvider<...>.Row list =
  [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32,
    34, 35);
   ("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
   ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
   ("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27);
   ("Wissam Ben Yedder", "fr FRA", "FW", "Monaco", "frLigue 1", 30, 37, 25);
   ("Patrik Schick", "cz CZE", "FW", "Leverkusen", "deBundesliga", 25, 27, 24);
   ("Son Heung-min", "kr KOR", "MF,FW", "Tottenham", "engPremier League", 29,
    35, 23)]

3 List.distinct

List.distinct returns the unique elements from the List.
["hello"; "world"; "hello"; "hi"] |> List.distinct returns ["hello"; "world"; "hi"]

Example: From playerStatsTable Nation field find the unique elements with List.distinct.

playerStatsTable
|> List.map(fun x -> x.Nation)
|> List.distinct

val it: string list =
  ["pl POL"; "fr FRA"; "it ITA"; "cz CZE"; "kr KOR"; "eg EGY"; "no NOR";
   "ar ARG"; "es ESP"; "pt POR"; "br BRA"; "eng ENG"; "rs SRB"; "sn SEN";
   "tr TUR"; "dz ALG"; "be BEL"; "ca CAN"; "hr CRO"; "de GER"; "tn TUN";
   "ng NGA"; "co COL"; "ci CIV"; "jp JPN"; "at AUT"; "zw ZIM"; "nl NED";
   "sct SCO"; "uy URU"; "xk KVX"; "cm CMR"; "dk DEN"; "ml MLI"; "ch SUI";
   "ir IRN"; "pe PER"; ""; "se SWE"; "gq EQG"; "ro ROU"; "me MNE"]

From playerStatsTable League field find the unique elements with List.distinct.

answer

playerStatsTable
|> List.map(fun x -> x.League)
|> List.distinct

val it: string list =
  ["deBundesliga"; "frLigue 1"; "esLa Liga"; "itSerie A"; "engPremier League"]

4 List.countBy

List.countBy applies a key-generating function to each element and returns a list of (key, count) tuples, one for each unique key.
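
As a small sketch on made-up data, here is List.countBy applied to the same strings used in the List.distinct illustration above. The key function is `id`, so each string is its own key.

```fsharp
let greetings = ["hello"; "world"; "hello"; "hi"]

// Count how many times each distinct string appears.
let counts = greetings |> List.countBy id

printfn "%A" counts
// [("hello", 2); ("world", 1); ("hi", 1)]
```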

Example: From playerStatsTable Team field find the unique elements and their counts with List.countBy.

playerStatsTable
|> List.countBy(fun x -> x.Team)
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: (string * int) list =
  [("Bayern Munich", 3); ("Paris S-G", 3); ("Real Madrid", 3); ("Lazio", 2);
   ("Monaco", 3)]

From playerStatsTable League field find the unique elements and their counts with List.countBy.

answer

playerStatsTable
|> List.countBy(fun x -> x.League)

val it: (string * int) list =
  [("deBundesliga", 36); ("frLigue 1", 46); ("esLa Liga", 30);
   ("itSerie A", 52); ("engPremier League", 36)]

5 List.filter

List.filter allows you to extract a subset of the dataset based on one or multiple conditions.
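
The examples on this page filter on a single condition. To show what "multiple conditions" looks like, here is a hedged sketch that uses a hypothetical `Player` record and three made-up rows rather than the CSV rows; the conditions are combined with `&&` inside the predicate.

```fsharp
// Hypothetical miniature of the dataset, for illustration only.
type Player =
    { Name: string
      Nation: string
      Age: int
      GoalsScored: int }

let players =
    [ { Name = "A"; Nation = "pt POR"; Age = 36; GoalsScored = 18 }
      { Name = "B"; Nation = "pt POR"; Age = 24; GoalsScored = 11 }
      { Name = "C"; Nation = "fr FRA"; Age = 22; GoalsScored = 28 } ]

// Keep only Portuguese players younger than 30 with at least 10 goals.
let youngPortugueseScorers =
    players
    |> List.filter (fun p -> p.Nation = "pt POR" && p.Age < 30 && p.GoalsScored >= 10)

printfn "%A" (youngPortugueseScorers |> List.map (fun p -> p.Name))
// ["B"]
```

Use `||` instead of `&&` when a row should be kept if any one of the conditions holds.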

Example: Filter the playerStatsTable to get only Portuguese players. (Nation = "pt POR").
Remember that we have to look at the dataset to find the string that corresponds to Portuguese players, which in this case is "pt POR".

playerStatsTable
|> List.filter(fun x -> x.Nation = "pt POR")
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: CsvProvider<...>.Row list =
  [("Cristiano Ronaldo", "pt POR", "FW", "Manchester Utd", "engPremier League",
    36, 30, 18);
   ("Gonçalo Guedes", "pt POR", "FW,MF", "Valencia", "esLa Liga", 24, 36, 11);
   ("Bruno Fernandes", "pt POR", "MF", "Manchester Utd", "engPremier League",
    26, 36, 10);
   ("Bernardo Silva", "pt POR", "MF,FW", "Manchester City",
    "engPremier League", 26, 35, 8);
   ("Raphaël Guerreiro", "pt POR", "DF", "Dortmund", "deBundesliga", 27, 23, 4)]

Filter the playerStatsTable to get only 16-year-old players. (Age = 16).

answer

playerStatsTable
|> List.filter(fun x -> x.Age = 16)

val it: CsvProvider<...>.Row list = []

6 List.sort and List.sortDescending

[1; 4; 5; 3; 6] |> List.sort returns [1; 3; 4; 5; 6] (ascending sort).
[1; 4; 5; 3; 6] |> List.sortDescending returns [6; 5; 4; 3; 1] (descending sort).

Example: map playerStatsTable to get a list of Age and sort it (ascending).

Since we want to sort the age List we first use List.map to get only that List. Then we use List.sort to sort it.

playerStatsTable
|> List.map(fun x -> x.Age)
|> List.sort
|> List.truncate 60 //just to observe the first 60 values, not a part of the exercise.

val it: int list =
  [17; 17; 18; 18; 19; 19; 19; 19; 19; 20; 20; 20; 20; 21; 21; 21; 21; 21; 21;
   21; 21; 21; 21; 21; 21; 21; 21; 21; 21; 22; 22; 22; 22; 22; 22; 22; 22; 22;
   22; 22; 23; 23; 23; 23; 23; 23; 23; 23; 23; 23; 23; 23; 23; 23; 24; 24; 24;
   24; 24; 24]

map playerStatsTable to get a list of GoalsScored and sort it (ascending).
Hint: To sort the GoalsScored List you first need to use List.map to get only that List. Then use List.sort to sort it.

answer

playerStatsTable
|> List.map(fun x -> x.GoalsScored)
|> List.sort
|> List.truncate 60 //just to observe the first 60 values, not a part of the exercise.

val it: int list =
  [0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0;
   1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 2; 2; 2;
   2; 2; 2; 2; 2; 2; 2; 2; 2; 2]

Example: Map playerStatsTable to get a list of Age and sort it (descending).

Since we want to sort the age List we first use List.map to get only that List. Then we use List.sortDescending to sort it.

playerStatsTable
|> List.map(fun x -> x.Age)
|> List.sortDescending
|> List.truncate 60 //just to observe the first 60 values, not a part of the exercise.

val it: int list =
  [40; 36; 36; 36; 35; 34; 34; 34; 34; 34; 34; 34; 34; 33; 33; 33; 33; 33; 33;
   33; 33; 33; 33; 32; 32; 32; 32; 32; 32; 31; 31; 31; 31; 31; 31; 30; 30; 30;
   30; 30; 30; 30; 30; 30; 29; 29; 29; 29; 29; 29; 29; 29; 29; 29; 29; 29; 29;
   29; 28; 28]

Map playerStatsTable to get a list of GoalsScored and sort it (descending).
Hint: To sort the GoalsScored List you first need to use List.map to get only that List. Then use List.sortDescending to sort it.

answer

playerStatsTable
|> List.map(fun x -> x.GoalsScored)
|> List.sortDescending
|> List.truncate 60 //just to observe the first 60 values, not a part of the exercise.

val it: int list =
  [35; 28; 27; 27; 25; 24; 23; 23; 22; 21; 21; 21; 20; 20; 18; 18; 17; 17; 17;
   17; 17; 17; 16; 16; 16; 16; 16; 15; 15; 13; 13; 13; 13; 13; 12; 12; 12; 12;
   12; 12; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 10; 10; 10; 10; 10; 10;
   10; 10; 10]

7 List.sortBy and List.sortByDescending

List.sortBy is very useful for sorting the dataset according to a particular field.
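
A minimal sketch on made-up (name, age, goals) tuples, not the CSV rows. It also shows one way to break ties: return a tuple from the key function, here age ascending and then goals descending by negating the goals.

```fsharp
// Hypothetical (name, age, goals) tuples, for illustration only.
let players = [ ("A", 31, 27); ("B", 22, 28); ("C", 31, 35) ]

// Sort by age only. A and C share the same age and keep their original order.
let byAge = players |> List.sortBy (fun (_, age, _) -> age)

// Sort by age, then by goals descending within the same age.
let byAgeThenGoals = players |> List.sortBy (fun (_, age, goals) -> age, -goals)

printfn "%A" (byAge |> List.map (fun (name, _, _) -> name))
// ["B"; "A"; "C"]
printfn "%A" (byAgeThenGoals |> List.map (fun (name, _, _) -> name))
// ["B"; "C"; "A"]
```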

Example: sort (ascending) playerStatsTable by Age (List.sortBy).

playerStatsTable
|> List.sortBy(fun x -> x.Age)
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: CsvProvider<...>.Row list =
  [("Giorgio Scalvini", "it ITA", "DF,MF", "Atalanta", "itSerie A", 17, 18, 1);
   ("Alejandro Primo", "es ESP", "GK", "Levante", "esLa Liga", 17, 1, 0);
   ("Florian Wirtz", "de GER", "MF,FW", "Leverkusen", "deBundesliga", 18, 24,
    7); ("Destiny Udogie", "it ITA", "DF", "Udinese", "itSerie A", 18, 35, 5);
   ("Bukayo Saka", "eng ENG", "FW,MF", "Arsenal", "engPremier League", 19, 38,
    11)]

sort (ascending) playerStatsTable by GoalsScored (List.sortBy).

answer

playerStatsTable
|> List.sortBy(fun x -> x.GoalsScored)
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: CsvProvider<...>.Row list =
  [("Stefan Ortega", "de GER", "GK", "Arminia", "deBundesliga", 28, 33, 0);
   ("Rui Patrício", "pt POR", "GK", "Roma", "itSerie A", 33, 38, 0);
   ("Philipp Pentke", "de GER", "GK", "Hoffenheim", "deBundesliga", 36, 1, 0);
   ("Pavao Pervan", "at AUT", "GK", "Wolfsburg", "deBundesliga", 33, 6, 0);
   ("Nick Pope", "eng ENG", "GK", "Burnley", "engPremier League", 29, 36, 0)]

Example: sort (descending) playerStatsTable by Age (List.sortByDescending).

playerStatsTable
|> List.sortByDescending(fun x -> x.Age)
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: CsvProvider<...>.Row list =
  [("Gianluca Pegolo", "it ITA", "GK", "Sassuolo", "itSerie A", 40, 1, 0);
   ("Cristiano Ronaldo", "pt POR", "FW", "Manchester Utd", "engPremier League",
    36, 30, 18);
   ("Fernandinho", "br BRA", "MF,DF", "Manchester City", "engPremier League",
    36, 19, 2);
   ("Philipp Pentke", "de GER", "GK", "Hoffenheim", "deBundesliga", 36, 1, 0);
   ("Daniele Padelli", "it ITA", "GK", "Udinese", "itSerie A", 35, 3, 0)]

sort (descending) playerStatsTable by GoalsScored (List.sortByDescending).

answer

playerStatsTable
|> List.sortByDescending(fun x -> x.GoalsScored)
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: CsvProvider<...>.Row list =
  [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32,
    34, 35);
   ("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
   ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
   ("Ciro Immobile", "it ITA", "FW", "Lazio", "itSerie A", 31, 31, 27);
   ("Wissam Ben Yedder", "fr FRA", "FW", "Monaco", "frLigue 1", 30, 37, 25)]

8 List.splitInto

List.splitInto is very useful for splitting your dataset into multiple subsets. This function is commonly used to generate quantiles by splitting a sorted List. For instance, in investment strategies, financial assets are usually sorted by a certain signal and then split into quantiles. If the signal has a positive sign, the long strategy consists of going long on the first quantile stocks, and the long-short strategy consists of going long on the first quantile stocks and short on the last quantile stocks.

Note: List.splitInto takes one parameter in addition to the List: the number of groups you want to create out of the dataset.
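
When the List length is not a multiple of the number of groups, the groups cannot all be the same size. A short sketch on the made-up list `[1 .. 10]` shows how the elements are distributed:

```fsharp
// Split ten numbers into three groups.
let chunks = [1 .. 10] |> List.splitInto 3

// The extra element goes to the first group.
let sizes = chunks |> List.map List.length

printfn "%A" chunks
// [[1; 2; 3; 4]; [5; 6; 7]; [8; 9; 10]]
printfn "%A" sizes
// [4; 3; 3]
```

The order of the elements is preserved, which is why the List is sorted before it is split when building quantiles.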

Example: Sort the playerStatsTable by GoalsScored and then split the dataset into 4 groups using List.sortBy and List.splitInto.

playerStatsTable
|> List.sortBy(fun x -> x.GoalsScored)
|> List.splitInto 4
|> List.truncate 2 //just to observe the first 2 groups Lists, not a part of the exercise.
|> List.map(fun x -> x |> List.truncate 5) //just to observe the first 5 rows of each group List, not a part of the exercise.

val it: CsvProvider<...>.Row list list =
  [[("Stefan Ortega", "de GER", "GK", "Arminia", "deBundesliga", 28, 33, 0);
    ("Rui Patrício", "pt POR", "GK", "Roma", "itSerie A", 33, 38, 0);
    ("Philipp Pentke", "de GER", "GK", "Hoffenheim", "deBundesliga", 36, 1, 0);
    ("Pavao Pervan", "at AUT", "GK", "Wolfsburg", "deBundesliga", 33, 6, 0);
    ("Nick Pope", "eng ENG", "GK", "Burnley", "engPremier League", 29, 36, 0)];
   [("Quentin Merlin", "fr FRA", "DF,MF", "Nantes", "frLigue 1", 19, 28, 2);
    ("Pascal Groß", "de GER", "MF,DF", "Brighton", "engPremier League", 30, 29,
     2);
    ("Mads Pedersen", "dk DEN", "MF,DF", "Augsburg", "deBundesliga", 24, 29, 2);
    ("Lukas Kübler", "de GER", "DF,MF", "Freiburg", "deBundesliga", 28, 29, 2);
    ("Josan", "es ESP", "DF,MF", "Elche", "esLa Liga", 31, 31, 2)]]

Sort the playerStatsTable by Age and then split the dataset into 5 groups using List.sortBy and List.splitInto.

answer

playerStatsTable
|> List.sortBy(fun x -> x.Age)
|> List.splitInto 5
|> List.truncate 2 //just to observe the first 2 groups Lists, not a part of the exercise.
|> List.map(fun x -> x |> List.truncate 5) //just to observe the first 5 rows of each group List, not a part of the exercise.

val it: CsvProvider<...>.Row list list =
  [[("Giorgio Scalvini", "it ITA", "DF,MF", "Atalanta", "itSerie A", 17, 18, 1);
    ("Alejandro Primo", "es ESP", "GK", "Levante", "esLa Liga", 17, 1, 0);
    ("Florian Wirtz", "de GER", "MF,FW", "Leverkusen", "deBundesliga", 18, 24,
     7); ("Destiny Udogie", "it ITA", "DF", "Udinese", "itSerie A", 18, 35, 5);
    ("Bukayo Saka", "eng ENG", "FW,MF", "Arsenal", "engPremier League", 19, 38,
     11)];
   [("Lautaro Martínez", "ar ARG", "FW", "Inter", "itSerie A", 23, 35, 21);
    ("Christopher Nkunku", "fr FRA", "FW,MF", "RB Leipzig", "deBundesliga", 23,
     34, 20);
    ("Tammy Abraham", "eng ENG", "FW", "Roma", "itSerie A", 23, 37, 17);
    ("Ludovic Blas", "fr FRA", "MF,FW", "Nantes", "frLigue 1", 23, 35, 10);
    ("Emmanuel Dennis", "ng NGA", "FW,MF", "Watford", "engPremier League", 23,
     33, 10)]]

9 List.groupBy

List.groupBy allows you to group elements of a list. It takes a key-generating function and a list as inputs. The function is executed on each element of the List, returning a list of tuples where the first element of each tuple is the key and the second is a list of the elements for which the function produced that key.
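
As a small sketch of the shape of the result, here the made-up list `[1 .. 6]` is grouped by the remainder after dividing by 2. Each tuple holds the key and the elements that produced it.

```fsharp
// Key function: remainder after dividing by 2 (1 for odd, 0 for even).
let grouped = [1 .. 6] |> List.groupBy (fun x -> x % 2)

// Summarise each group, here by counting its elements.
let groupSizes = grouped |> List.map (fun (key, xs) -> key, List.length xs)

printfn "%A" grouped
// [(1, [1; 3; 5]); (0, [2; 4; 6])]
printfn "%A" groupSizes
// [(1, 3); (0, 3)]
```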

Example: Group the playerStatsTable by Nation using List.groupBy.

playerStatsTable
|> List.groupBy(fun x -> x.Nation)
|> List.truncate 2 //just to observe the first 2 groups Lists, not a part of the exercise.
|> List.map(fun (x, xs) -> x, xs |> List.truncate 5) //just to observe the first 5 rows of each group List, not a part of the exercise.

val it: (string * CsvProvider<...>.Row list) list =
  [("pl POL",
    [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga",
      32, 34, 35);
     ("Przemysław Frankowski", "pl POL", "DF", "Lens", "frLigue 1", 26, 37, 6);
     ("Matty Cash", "pl POL", "DF", "Aston Villa", "engPremier League", 23, 38,
      4)]);
   ("fr FRA",
    [("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
     ("Karim Benzema", "fr FRA", "FW", "Real Madrid", "esLa Liga", 33, 32, 27);
     ("Wissam Ben Yedder", "fr FRA", "FW", "Monaco", "frLigue 1", 30, 37, 25);
     ("Moussa Dembélé", "fr FRA", "FW", "Lyon", "frLigue 1", 25, 30, 21);
     ("Martin Terrier", "fr FRA", "FW,MF", "Rennes", "frLigue 1", 24, 37, 21)])]

Group the playerStatsTable by Age using List.groupBy.

answer

playerStatsTable
|> List.groupBy(fun x -> x.Age)
|> List.map(fun (x, xs) -> x, xs |> List.truncate 5) //just to observe the first 5 rows of each group List, not a part of the exercise.
|> List.truncate 2 //just to observe the first 2 groups Lists, not a part of the exercise.

val it: (int * CsvProvider<...>.Row list) list =
  [(32,
    [("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga",
      32, 34, 35);
     ("Marco Reus", "de GER", "MF,FW", "Dortmund", "deBundesliga", 32, 29, 9);
     ("Ivan Perišić", "hr CRO", "DF", "Inter", "itSerie A", 32, 35, 8);
     ("Axel Witsel", "be BEL", "MF,DF", "Dortmund", "deBundesliga", 32, 29, 2);
     ("Ivan Radovanović", "rs SRB", "DF,MF", "Salernitana", "itSerie A", 32,
      14, 1)]);
   (22,
    [("Kylian Mbappé", "fr FRA", "FW", "Paris S-G", "frLigue 1", 22, 35, 28);
     ("Gianluca Scamacca", "it ITA", "FW", "Sassuolo", "itSerie A", 22, 36, 16);
     ("Moussa Diaby", "fr FRA", "FW,MF", "Leverkusen", "deBundesliga", 22, 32,
      13);
     ("Randal Kolo Muani", "fr FRA", "FW,MF", "Nantes", "frLigue 1", 22, 36,
      12);
     ("Mason Mount", "eng ENG", "MF", "Chelsea", "engPremier League", 22, 32,
      11)])]

Statistics List Functions

1 List.max

[1; 4; 5; 3; 6] |> List.max returns 6 (the highest value in the List).

Example: Map playerStatsTable to get the Age List, and find the maximum (List.max).

playerStatsTable
|> List.map(fun x -> x.Age)
|> List.max

val it: int = 40

Map playerStatsTable to get the GoalsScored List, and find the maximum (List.max).

answer

playerStatsTable
|> List.map(fun x -> x.GoalsScored)
|> List.max

val it: int = 35

2 List.min

[1; 4; 5; 3; 6] |> List.min returns 1 (the lowest value in the List).

Example: Map playerStatsTable to get the Age List, and find the minimum (List.min).

playerStatsTable
|> List.map(fun x -> x.Age)
|> List.min

val it: int = 17

Map playerStatsTable to get the GoalsScored List, and find the minimum (List.min).

answer

playerStatsTable
|> List.map(fun x -> x.GoalsScored)
|> List.min

val it: int = 0

3 List.maxBy

Sometimes you want the element with the "maximum y" where "y" is the result of applying a particular function to a list element. This is what List.maxBy is for. This function is best understood by seeing an example.
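
Before the dataset example, here is a minimal sketch on made-up (name, age) tuples that contrasts List.max with List.maxBy. List.max on the mapped ages returns only the age, while List.maxBy returns the whole element that has that age.

```fsharp
// Hypothetical (name, age) tuples, for illustration only.
let players = [ ("A", 32); ("B", 40); ("C", 17) ]

// Map to the ages first, then take the maximum: the result is just a number.
let maxAge = players |> List.map snd |> List.max

// maxBy keeps the whole element whose age is the maximum.
let oldest = players |> List.maxBy snd

printfn "%d" maxAge
// 40
printfn "%A" oldest
// ("B", 40)
```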

Example: Find the player in playerStatsTable with the maximum Age using maxBy. What we need to do is write a function that takes a player as input and outputs the player's age. List.maxBy will then find the player for whom this function returns the maximum value.

playerStatsTable
|> List.maxBy(fun x -> x.Age)

val it: CsvProvider<...>.Row =
  ("Gianluca Pegolo", "it ITA", "GK", "Sassuolo", "itSerie A", 40, 1, 0)

Find the maximum playerStatsTable row by GoalsScored using maxBy.

answer

playerStatsTable
|> List.maxBy(fun x -> x.GoalsScored)

val it: CsvProvider<...>.Row =
  ("Robert Lewandowski", "pl POL", "FW", "Bayern Munich", "deBundesliga", 32,
   34, 35)

4 List.minBy

Sometimes you want the element with the "minimum y" where "y" is the result of applying a particular function to a list element. This is what List.minBy is for.

Example: Find the player in playerStatsTable with the minimum Age using minBy.

playerStatsTable
|> List.minBy(fun x -> x.Age)

val it: CsvProvider<...>.Row =
  ("Giorgio Scalvini", "it ITA", "DF,MF", "Atalanta", "itSerie A", 17, 18, 1)

Find the minimum playerStatsTable row by GoalsScored using minBy.

answer

playerStatsTable
|> List.minBy(fun x -> x.GoalsScored)

val it: CsvProvider<...>.Row =
  ("Stefan Ortega", "de GER", "GK", "Arminia", "deBundesliga", 28, 33, 0)

5 List.sum

[1; 4; 5; 3; 6] |> List.sum returns 19 (sum of the List elements).

Example: Calculate the total number of years lived by all players. Hint: transform (List.map) each element of playerStatsTable into an integer representing the player's Age and then get the sum (List.sum) of all the players' ages (the result should be an int).

playerStatsTable
|> List.map(fun x -> x.Age)
|> List.sum

val it: int = 5270

Calculate the total goals scored (GoalsScored) by all players in playerStatsTable.

answer

playerStatsTable
|> List.map(fun x -> x.GoalsScored)
|> List.sum

val it: int = 1470

6 List.sumBy

We are using a dataset that has multiple fields per List element. If you want to get the sum for a particular field, it is convenient to use List.sumBy. It takes a function, transforms each element using that function, and then sums all the transformed elements. It is like a List.map and a List.sum combined into one function.
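
The equivalence can be checked directly. This sketch uses made-up (name, goals) tuples rather than the CSV rows:

```fsharp
// Hypothetical (name, goals) tuples, for illustration only.
let players = [ ("A", 35); ("B", 28); ("C", 27) ]

// Two steps: extract the goals, then sum them.
let viaMapThenSum = players |> List.map snd |> List.sum

// One step: sumBy extracts and sums in a single pass.
let viaSumBy = players |> List.sumBy snd

printfn "%d %d" viaMapThenSum viaSumBy
// 90 90
```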

Example: Use List.sumBy to calculate the total number of years lived by all players in playerStatsTable. Remember that each player has lived Age years.

playerStatsTable
|> List.sumBy(fun x -> x.Age)

val it: int = 5270

Find the sum of the GoalsScored by all players in playerStatsTable using List.sumBy.

answer

playerStatsTable
|> List.sumBy(fun x -> x.GoalsScored)

val it: int = 1470

7 List.average

[1.0; 2.0; 5.0; 2.0] |> List.average returns 2.5 (the average of all the List elements).

Example: Transform playerStatsTable into a list of the players' ages (Age) and find the average Age (List.average).
The field x.Age needs to be transformed from int to float because List.average only works with floats or decimals.

playerStatsTable
|> List.map(fun x -> float x.Age)
|> List.average

val it: float = 26.35

Use List.map to transform playerStatsTable into a list of the players' GoalsScored and find the average GoalsScored (List.average).
Hint: The variable x.GoalsScored needs to be transformed from int to float since List.average only works with floats or decimals.

answer

playerStatsTable
|> List.map(fun x -> float x.GoalsScored)
|> List.average

val it: float = 7.35

8 List.averageBy

We are using a dataset that has multiple fields per List element. If you want to get the average for a particular field, it is convenient to use List.averageBy. It takes a function, transforms each element using that function, and then averages all the transformed elements. It is like a List.map and a List.average combined into one function.
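
The conversion to float matters for the result as well as for the types. As a sketch on a made-up list of goal counts, compare List.averageBy with an average computed by integer division, which discards the fractional part:

```fsharp
let goals = [1; 2; 5; 2]

// Convert each int to float inside averageBy: 10.0 / 4.0.
let mean = goals |> List.averageBy float

// Integer arithmetic for comparison: 10 / 4 is truncated to 2.
let truncatedMean = List.sum goals / List.length goals

printfn "%f" mean
// 2.500000
printfn "%d" truncatedMean
// 2
```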

Example: Find the average Age using List.averageBy.
The Age needs to be transformed from int to float since List.averageBy only works with floats or decimals.

playerStatsTable
|> List.averageBy(fun x -> float x.Age)

val it: float = 26.35

Find the average GoalsScored using List.averageBy.
Hint: The GoalsScored needs to be transformed from int to float since List.averageBy only works with floats or decimals.

answer

playerStatsTable
|> List.averageBy(fun x -> float x.GoalsScored)

val it: float = 7.35

9 Seq.stDev

For Seq.stDev to work, we referenced the FSharp.Stats NuGet package (#r "nuget: FSharp.Stats, 0.5.0"), which contains the standard deviation function. We also opened the FSharp.Stats namespace (open FSharp.Stats). See the FSharp.Stats documentation for details.
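
To make the calculation concrete without the package, here is a hand-written sketch of the sample standard deviation (the version that divides by n - 1). The helper `sampleStDev` and the input numbers are made up for illustration, and the sketch assumes that Seq.stDev computes this sample version; check the FSharp.Stats documentation if the distinction matters for your analysis.

```fsharp
// Hypothetical helper, not part of FSharp.Stats: sample standard deviation.
let sampleStDev (xs: float list) =
    let n = float (List.length xs)
    let mean = List.average xs
    // Sum of squared deviations from the mean.
    let sumSquares = xs |> List.sumBy (fun x -> (x - mean) ** 2.0)
    sqrt (sumSquares / (n - 1.0))

// Mean is 5.0 and the squared deviations sum to 32.0, so this is sqrt (32.0 / 7.0).
let sd = sampleStDev [2.0; 4.0; 4.0; 4.0; 5.0; 5.0; 7.0; 9.0]

printfn "%.4f" sd
// 2.1381
```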

Example: Use List.map to transform playerStatsTable by GoalsScored and find the standard deviation. (Seq.stDev).
Note that for Seq.stDev to work the values need to be floats or decimals, so we need to transform the GoalsScored from int to float.

playerStatsTable
|> List.map(fun x -> float x.GoalsScored)
|> Seq.stDev

val it: float = 6.733811781

Transform playerStatsTable into a list of the players' ages (Age) and find the standard deviation. (Seq.stDev).
Hint: You need to transform Age values from int to floats.

answer

playerStatsTable
|> List.map(fun x -> float x.Age)
|> Seq.stDev

val it: float = 4.343018426

10 Seq.pearsonOfPairs

In order to compute correlations we have to reference FSharp.Stats and open the FSharp.Stats namespace.
We also open FSharp.Stats.Correlation to allow easier access to the correlation functions.

It will be helpful to check the FSharp.Stats.Correlation documentation before starting the exercises.

Example: Test the correlation between MatchesPlayed and GoalsScored using pearsonOfPairs.

Seq.pearsonOfPairs expects a list of tuples (x1 * x2), computing the correlation between x1 and x2. So we use List.map to get a list of tuples with (MatchesPlayed, GoalsScored). Then we only need to pipe (|>) to Seq.pearsonOfPairs.

playerStatsTable
|> List.map(fun x -> x.MatchesPlayed, x.GoalsScored)
|> Seq.pearsonOfPairs

val it: float = 0.4641226145
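
To see what this coefficient measures, here is a hand-written sketch of the Pearson correlation for a list of float pairs. The helper `pearson` and the inputs are made up for illustration; it is not the FSharp.Stats implementation, only the textbook formula (covariance divided by the product of the standard deviations).

```fsharp
// Hypothetical helper, not part of FSharp.Stats: Pearson correlation of pairs.
let pearson (pairs: (float * float) list) =
    let xs, ys = List.unzip pairs
    let meanX = List.average xs
    let meanY = List.average ys
    // Sum of the products of the deviations from each mean.
    let covariation = pairs |> List.sumBy (fun (x, y) -> (x - meanX) * (y - meanY))
    let variationX = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
    let variationY = ys |> List.sumBy (fun y -> (y - meanY) ** 2.0)
    covariation / sqrt (variationX * variationY)

// y doubles x exactly, so the correlation is 1.
let positive = pearson [ (1.0, 2.0); (2.0, 4.0); (3.0, 6.0) ]

// y falls by one each time x rises by one, so the correlation is -1.
let negative = pearson [ (1.0, 3.0); (2.0, 2.0); (3.0, 1.0) ]

printfn "%.1f %.1f" positive negative
// 1.0 -1.0
```

Values near 0, like the ones involving Age in the exercises on this page, indicate little linear relationship.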

Test the correlation between MatchesPlayed and Age using pearsonOfPairs.
Hints: Seq.pearsonOfPairs expects a list of tuples (x1 * x2). Use List.map to get a list of tuples with (MatchesPlayed, Age). Then you only need to pipe (|>) to Seq.pearsonOfPairs.

answer

playerStatsTable
|> List.map(fun x -> x.MatchesPlayed, x.Age)
|> Seq.pearsonOfPairs

val it: float = -0.07750635099

Test the correlation between GoalsScored and Age using pearsonOfPairs.
Hints: Seq.pearsonOfPairs expects a list of tuples (x1 * x2). Use List.map to get a list of tuples with (GoalsScored, Age). Then you only need to pipe (|>) to Seq.pearsonOfPairs.

answer

playerStatsTable
|> List.map(fun x -> x.GoalsScored, x.Age)
|> Seq.pearsonOfPairs

val it: float = 0.01881518088

Further Statistics practice

Now that you feel comfortable with List.filter, List.groupBy, List.splitInto
and some F# statistics functions, let's combine those concepts.

1 List.countBy, List.filter and List.averageBy

Example: Find the average age of Portuguese players.

In order to find the average age of Portuguese players we know that we need to use List.filter. But first we need to know which string corresponds to Portuguese players. Using List.distinct or List.countBy we can observe all the Nation strings, which shows us that the Portuguese Nation string is "pt POR".

playerStatsTable
|> List.countBy(fun x -> x.Nation)

Now that we know the Portuguese string, we can filter on x.Nation = "pt POR" to get only the rows for Portuguese players. Then we can pipe the result (|>) to List.averageBy (fun x -> float x.Age) to get the average age of Portuguese players.

playerStatsTable
|> List.filter(fun x -> x.Nation = "pt POR")
|> List.averageBy(fun x -> float x.Age)

val it: float = 28.66666667

Find the average age of players playing in the Premier League. Hint: You'll first need to use List.filter to get only players from the Premier League (x.League = "engPremier League"). Then use List.averageBy to compute the average age. Don't forget to use float x.Age to convert the age values to float.

answer

playerStatsTable
|> List.filter(fun x -> x.League = "engPremier League")
|> List.averageBy(fun x -> float x.Age)

val it: float = 25.58333333

2 List.groupBy, List.map and transformations

Example: Group playerStatsTable by Team and compute the average number of GoalsScored.

//example using record:
type TeamAndAvgGls =
    { Team : string
      AvgGoalsScored : float }

playerStatsTable
|> List.groupBy(fun x -> x.Team)
|> List.map(fun (team, playerStats) -> 
    { Team = team
      AvgGoalsScored = playerStats |> List.averageBy(fun playerStats -> float playerStats.GoalsScored)})
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

type TeamAndAvgGls =
  {
    Team: string
    AvgGoalsScored: float
  }
val it: TeamAndAvgGls list =
  [{ Team = "Bayern Munich"
     AvgGoalsScored = 14.66666667 }; { Team = "Paris S-G"
                                       AvgGoalsScored = 15.33333333 };
   { Team = "Real Madrid"
     AvgGoalsScored = 18.0 }; { Team = "Lazio"
                                AvgGoalsScored = 19.0 };
   { Team = "Monaco"
     AvgGoalsScored = 11.66666667 }]

//example using tuple:
playerStatsTable
|> List.groupBy(fun x -> x.Team)
|> List.map(fun (team, playerStats) -> team, playerStats |> List.averageBy(fun playerStats -> float playerStats.GoalsScored))
|> List.truncate 5 //just to observe the first 5 rows, not a part of the exercise.

val it: (string * float) list =
  [("Bayern Munich", 14.66666667); ("Paris S-G", 15.33333333);
   ("Real Madrid", 18.0); ("Lazio", 19.0); ("Monaco", 11.66666667)]

Group playerStatsTable by League and then compute the average Age of each group.
Hint: Use List.groupBy to group by league (League). Then use List.map to turn each group into a record or tuple containing the League and the average Age, computing the average inside the mapping function with List.averageBy.

answer

//solution using record:
type LeagueAndAvgAge =
    { League : string 
      AverageAge : float }

playerStatsTable
|> List.groupBy(fun x -> x.League)
|> List.map(fun (leagues, playerStats) ->
    { League = leagues
      AverageAge = playerStats |> List.averageBy(fun playerStats -> float playerStats.Age) })

//solution using tuples:
playerStatsTable
|> List.groupBy(fun x -> x.League)
|> List.map(fun (leagues, playerStats) -> 
    leagues, 
    playerStats |> List.averageBy(fun playerStats -> float playerStats.Age) )

type LeagueAndAvgAge =
  {
    League: string
    AverageAge: float
  }
val it: (string * float) list =
  [("deBundesliga", 27.11111111); ("frLigue 1", 25.7173913);
   ("esLa Liga", 26.53333333); ("itSerie A", 26.80769231);
   ("engPremier League", 25.58333333)]

3 List.sortDescending, List.splitInto, List.map and Seq.stDev

From playerStatsTable sort the players' Age (descending), split the dataset into quartiles (4-quantiles) and compute the standard deviation for each quantile.
Hint: You only need the Age field from the dataset, so you can use map straight away to get the Age List. Sort that List with List.sortDescending, and then split it into 4 parts using List.splitInto. Finally use List.map to iterate through each quantile and apply the function Seq.stDev.

answer

playerStatsTable
|> List.map(fun x -> float x.Age)
|> List.sortDescending
|> List.splitInto 4
|> List.map(fun x -> x |> Seq.stDev)

val it: float list = [2.294714424; 0.9082389329; 0.9171829097; 1.59604102]

Defaults to false.</param> <param name='Quote'>The quotation mark (for surrounding values containing the delimiter). Defaults to <c>"</c>.</param> <param name='MissingValues'>The set of strings recognized as missing values specified as a comma-separated string (e.g., "NA,N/A"). Defaults to <c>NaN,NA,N/A,#N/A,:,-,TBA,TBD</c>.</param> <param name='CacheRows'>Whether the rows should be caches so they can be iterated multiple times. Defaults to true. Disable for large datasets.</param> <param name='Culture'>The culture used for parsing numbers and dates. Defaults to the invariant culture.</param> <param name='Encoding'>The encoding used to read the sample. You can specify either the character set name or the codepage number. Defaults to UTF8 for files, and to ISO-8859-1 the for HTTP requests, unless <c>charset</c> is specified in the <c>Content-Type</c> response header.</param> <param name='ResolutionFolder'>A directory that is used when resolving relative file references (at design time and in hosted execution).</param> <param name='EmbeddedResource'>When specified, the type provider first attempts to load the sample from the specified resource (e.g. 'MyCompany.MyAssembly, resource_name.csv'). This is useful when exposing types generated by the type provider.</param>

val playerStatsTable: CsvProvider<...>.Row list

CsvProvider<...>.GetSample() : CsvProvider<...>

Multiple items
module Seq from FSharp.Stats.Correlation
<summary> Contains correlation functions optimized for sequences </summary>

--------------------
module Seq from FSharp.Stats
<summary> Module to compute common statistical measure </summary>

--------------------
module Seq from Microsoft.FSharp.Collections

--------------------
type Seq = new: unit -> Seq static member geomspace: start: float * stop: float * num: int * ?IncludeEndpoint: bool -> float seq static member linspace: start: float * stop: float * num: int * ?IncludeEndpoint: bool -> float seq

--------------------
new: unit -> Seq

val toList: source: 'T seq -> 'T list

Multiple items
module List from FSharp.Stats
<summary> Module to compute common statistical measure on list </summary>

--------------------
module List from Microsoft.FSharp.Collections

--------------------
type List = new: unit -> List static member geomspace: start: float * stop: float * num: int * ?IncludeEndpoint: bool -> float list static member linspace: start: float * stop: float * num: int * ?IncludeEndpoint: bool -> float list

--------------------
type List<'T> = | op_Nil | op_ColonColon of Head: 'T * Tail: 'T list interface IReadOnlyList<'T> interface IReadOnlyCollection<'T> interface IEnumerable interface IEnumerable<'T> member GetReverseIndex: rank: int * offset: int -> int member GetSlice: startIndex: int option * endIndex: int option -> 'T list static member Cons: head: 'T * tail: 'T list -> 'T list member Head: 'T member IsEmpty: bool member Item: index: int -> 'T with get ...

--------------------
new: unit -> List

val take: count: int -> list: 'T list -> 'T list

val truncate: count: int -> list: 'T list -> 'T list

val map: mapping: ('T -> 'U) -> list: 'T list -> 'U list

val x: CsvProvider<...>.Row

property CsvProvider<...>.Row.Nation: string with get

val distinct: list: 'T list -> 'T list (requires equality)

property CsvProvider<...>.Row.League: string with get

val countBy: projection: ('T -> 'Key) -> list: 'T list -> ('Key * int) list (requires equality)

property CsvProvider<...>.Row.Team: string with get

val filter: predicate: ('T -> bool) -> list: 'T list -> 'T list

property CsvProvider<...>.Row.Age: int with get

val sort: list: 'T list -> 'T list (requires comparison)

property CsvProvider<...>.Row.GoalsScored: int with get

val sortDescending: list: 'T list -> 'T list (requires comparison)

val sortBy: projection: ('T -> 'Key) -> list: 'T list -> 'T list (requires comparison)

val sortByDescending: projection: ('T -> 'Key) -> list: 'T list -> 'T list (requires comparison)

val splitInto: count: int -> list: 'T list -> 'T list list

val x: CsvProvider<...>.Row list

val groupBy: projection: ('T -> 'Key) -> list: 'T list -> ('Key * 'T list) list (requires equality)

val x: string

val xs: CsvProvider<...>.Row list

val x: int

val max: list: 'T list -> 'T (requires comparison)

val min: list: 'T list -> 'T (requires comparison)

val maxBy: projection: ('T -> 'U) -> list: 'T list -> 'T (requires comparison)

val minBy: projection: ('T -> 'U) -> list: 'T list -> 'T (requires comparison)

val sum: list: 'T list -> 'T (requires member (+) and member Zero)

val sumBy: projection: ('T -> 'U) -> list: 'T list -> 'U (requires member (+) and member Zero)

Multiple items
val float: value: 'T -> float (requires member op_Explicit)

--------------------
type float = System.Double

--------------------
type float<'Measure> = float

val average: list: 'T list -> 'T (requires member (+) and member DivideByInt and member Zero)

val averageBy: projection: ('T -> 'U) -> list: 'T list -> 'U (requires member (+) and member DivideByInt and member Zero)

val stDev: items: 'T seq -> 'U (requires member (-) and member Zero and member DivideByInt and member (+) and member ( * ) and member (+) and member (/) and member Sqrt)
<summary> Computes the sample standard deviation </summary>
<param name="items">The input sequence.</param>
<remarks>Returns NaN if data is empty or if any entry is NaN.</remarks>
<returns>standard deviation of a sample (Bessel's correction by N-1)</returns>

property CsvProvider<...>.Row.MatchesPlayed: int with get

val pearsonOfPairs: seq: ('T * 'T) seq -> float (requires member op_Explicit and member Zero and member One)
<summary> Calculates the pearson correlation of two samples given as a sequence of paired values. Homoscedasticity must be assumed. </summary>
<param name="seq">The input sequence.</param>
<typeparam name="'T"></typeparam>
<returns>The pearson correlation.</returns>
<example><code> // Consider a sequence of paired x and y values: // [(x1, y1); (x2, y2); (x3, y3); (x4, y4); ... ] let xy = [(312.7, 315.5); (104.2, 101.3); (104.0, 108.0); (34.7, 32.2)] // To get the correlation between x and y: xy |> Seq.pearsonOfPairs // evaluates to 0.9997053729 </code></example>

type TeamAndAvgGls = { Team: string AvgGoalsScored: float }

val team: string

val playerStats: CsvProvider<...>.Row list

val playerStats: CsvProvider<...>.Row

type LeagueAndAvgAge = { League: string AverageAge: float }

val leagues: string

val x: float list