STATISTICAL LEARNING
Syllabus
Educational Objectives
Economic and business activity generates very large information sets that can be used to infer and validate regularities, associations and causal relationships among phenomena, knowledge of which offers a competitive advantage. The course examines in depth the main statistical methodologies of data mining and data-driven statistical learning. In particular, it addresses the problem of predicting quantitative or qualitative outputs on the basis of a potentially overabundant set of predictors, known as "supervised learning".
The course has the following educational objectives:
- to know the main, most advanced and modern data mining techniques;
- to be able to manage informational complexity, distilling the relevant information from a large volume of data;
- to be able to predict economic and business phenomena;
- to acquire the ability to select a predictive rule among those available;
- to be able to communicate the main empirical evidence emerging from the analysis;
- to carry out statistical analyses with the appropriate software;
- to critically appraise the potential and limits of the available methodologies, acquiring the ability to discriminate among them.
KNOWLEDGE AND UNDERSTANDING:
The course covers the logic and the fundamental methodologies of data mining and statistical learning, which constitute an approach to understanding the economy and markets based on the exploration of large volumes of data and on the discovery of relationships that are difficult to trace back to a priori knowledge of the phenomena.
The fundamental theme is the prediction of quantitative and qualitative variables by means of input variables.
Considerable emphasis is given to the problem of variable and model selection and to the criteria for optimizing the trade-off between model complexity and generalizability in validation samples.
APPLYING KNOWLEDGE AND UNDERSTANDING:
The knowledge acquired is applied to problems of credit scoring, sales forecasting and product pricing. Laboratory sessions, carried out using the software R and Matlab, are an integral part of the course. Students use their knowledge to analyse case studies, both in the laboratory and in the assignments.
MAKING JUDGEMENTS:
The course addresses topics in decision theory. In particular, it addresses the problem of classification (prediction of categorical outputs), discussing the Bayes decision rule. In addition, the underlying theme is the selection of the optimal predictor for a given problem. Students are encouraged to draw conclusions on the internal and external validity of the models considered by comparing the fit of the model in the training sample and in the validation sample.
COMMUNICATION SKILLS:
The course devotes considerable attention to communicating empirical evidence through graphs and summary statistics, and to the ability to present that evidence to non-experts effectively and concisely. To verify that this learning objective has been achieved, 4 assignments are planned, whose main deliverable is a written report highlighting the key interpretative elements of the applications. The software used (R and SAS) is strongly oriented towards the graphical communication of statistical evidence.
LEARNING SKILLS:
Students develop their learning skills in a self-directed or autonomous way by comparing the teaching material provided by the instructor with the readings the instructor suggests each week. In addition, they are encouraged to tackle and solve case studies concerning the selection of the best predictor and of the most relevant empirical evidence.
Learning Objectives
The course provides an introduction to Statistical Learning and Data Mining.
Advances in information technology have made very rich data sets available, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Most organizations today produce an electronic record of essentially every transaction in which they are involved. Firms collect terabytes of data over their operating periods (transaction data, e.g. credit card records). Most often these data are collected as secondary data, with no specific sampling design or research question in mind.
The course offers an insight into the main statistical methodologies for the visualisation and analysis of business and market data, providing the information required for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to empirical applications using modern software tools (RStudio).
The course has the following intended learning outcomes:
- to provide a thorough knowledge of data mining methods and statistical learning techniques;
- to provide the expertise to manage complexity in information and to be able to distil the stylized facts that are relevant for interpretation;
- to be able to predict business outcomes;
- to be able to select a predictive method among those available;
- to be able to communicate the statistical findings to a non-expert audience;
- to be able to perform sophisticated statistical analyses with the appropriate software;
- to critically appraise the potential and the limitations of the available methodologies.
KNOWLEDGE AND UNDERSTANDING:
The course covers the modern statistical methodologies for the visualisation and analysis of business and market data that are relevant for making decisions in a complex and rapidly changing business environment.
The fundamental theme is supervised statistical learning, which deals with the prediction of quantitative and qualitative outcomes using a potentially large set of inputs. The two problems, regression and classification, constitute the core of the course.
Emphasis is given to the problem of variable and model selection and to the generalizability of a prediction method outside the training sample, via the optimization of the trade-off between model complexity and in-sample goodness of fit.
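As an illustration of this trade-off, the following is a minimal sketch and not course material. It is written in Python (an assumption; the course tutorials use R), the data are simulated, and the names knn_predict, mse and the grid ks are hypothetical. It uses k-nearest-neighbour regression, where a smaller k means a more complex fit, and compares the error on the training sample with the error on a held-out validation sample.

```python
import math
import random


def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k


def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared prediction error of the k-NN rule on (x_eval, y_eval)."""
    errors = [
        (knn_predict(x_train, y_train, x0, k) - y0) ** 2
        for x0, y0 in zip(x_eval, y_eval)
    ]
    return sum(errors) / len(errors)


# Simulated data (an assumption): a smooth signal plus Gaussian noise.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [math.sin(3.0 * x) + random.gauss(0.0, 0.3) for x in xs]

# First half for training, second half held out for validation.
x_train, y_train = xs[:100], ys[:100]
x_val, y_val = xs[100:], ys[100:]

# Small k = flexible (complex) fit, large k = smooth (simple) fit.
ks = [1, 3, 5, 10, 20, 50]
train_mse = {k: mse(x_train, y_train, x_train, y_train, k) for k in ks}
val_mse = {k: mse(x_train, y_train, x_val, y_val, k) for k in ks}

# Complexity is chosen on the validation sample, not on the training sample.
best_k = min(ks, key=lambda k: val_mse[k])

for k in ks:
    print(f"k={k:2d}  train MSE={train_mse[k]:.3f}  validation MSE={val_mse[k]:.3f}")
print("k selected on the validation sample:", best_k)
```

With k = 1 each training point is its own nearest neighbour, so the training error is zero by construction while the validation error is not. In-sample goodness of fit alone therefore cannot guide the choice of complexity.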
APPLYING KNOWLEDGE AND UNDERSTANDING:
The methodologies presented during the course are applied to real-life datasets and case studies dealing with the prediction of sales, credit scoring and the pricing of goods.
Two hours per week are dedicated to tutorials in which statistical analyses are conducted in the laboratory and implemented in RStudio.
Students are expected to perform their statistical analyses in a group assignment.
MAKING JUDGEMENTS:
The prediction of an outcome is an informed decision based on the knowledge of covariates and antecedents. An important supervised learning problem is classification. We discuss the Bayes classification rule and how to select the prediction rule that is optimal for a particular target variable. The student is expected to be able to draw conclusions on the basis of the statistical evidence and to validate those conclusions on validation or test samples drawn from the same target population.
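The Bayes classification rule can be sketched as follows. This is a minimal illustration under strong assumptions, not the course's own material: it is written in Python (the tutorials use R), there is a single input, the class-conditional densities are taken to be Gaussian with known parameters, and the priors, means and class labels are invented for a hypothetical credit-scoring setting. The rule assigns an observation to the class with the largest posterior probability.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, priors, means, sds):
    """Posterior class probabilities P(class | x) obtained from Bayes' theorem."""
    joint = {c: priors[c] * gaussian_pdf(x, means[c], sds[c]) for c in priors}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}


def bayes_classify(x, priors, means, sds):
    """Bayes rule: pick the class with the largest posterior probability."""
    post = posterior(x, priors, means, sds)
    return max(post, key=post.get)


# Hypothetical setting: one standardized score per applicant, two classes.
# All numbers below are illustrative assumptions, not estimates from data.
priors = {"default": 0.2, "repay": 0.8}
means = {"default": -1.0, "repay": 1.0}
sds = {"default": 1.0, "repay": 1.0}

for score in (-2.0, -0.5, 0.0, 1.5):
    post = posterior(score, priors, means, sds)
    label = bayes_classify(score, priors, means, sds)
    print(f"score={score:+.1f}  P(default | score)={post['default']:.3f}  class={label}")
```

With equal variances the decision boundary is the point where the two prior-weighted densities cross, here x = -ln(2), roughly -0.69. The unequal priors shift it away from the midpoint between the two class means.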
COMMUNICATION SKILLS:
Particular attention is dedicated to the ability to communicate the statistical evidence in a systematic and concise way, using graphs and summaries, to a non-specialist audience.
The software used in the tutorials is oriented towards graphical displays and visualization of data. The student is asked to report on the statistical analysis carried out for a particular purpose in the individual assignments.
LEARNING SKILLS:
Students develop their learning skills by comparing the teaching material provided by the instructor and presented in the lectures with the readings suggested each week. The software tutorials and the analysis of case studies in the assignments will help build their applied skills and their autonomous progress towards the intended learning outcomes.
Prerequisites
Program
(Lectures 1-3)
2. The linear regression model. Estimation and prediction. (Lectures 4-6)
3. Model selection and evaluation: bias-variance trade-off, model complexity and goodness of fit. Cross-validation. Selection using information criteria. (Lectures 6-9)
4. Regularization and shrinkage methods: ridge regression, lasso, forward stagewise regression. Principal components regression. (Lectures 10-11)
5. Linear methods for classification: Bayes classification rule. Discriminant analysis. Canonical variates. Logistic regression. (Lectures 12-14)
6. Semiparametric regression: regression splines and smoothing splines. (Lecture 15)
7. Kernel smoothing methods: local polynomial regression. Density estimation. Nearest neighbor classification. (Lecture 16)
8. Additive models, tree-based methods. GAM, regression and classification trees. Boosting. (Lectures 17-18)
Adopted Texts
G. James, D. Witten, T. Hastie and R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, Springer Texts in Statistics, 2013.
Available at http://www-bcf.usc.edu/~gareth/ISL/
The instructor makes the following available on the course website: lecture slides, suggested readings, datasets and supplementary material (Matlab and R scripts).
Other useful references:
• T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, Springer Series in Statistics, 2009. Website: http://www-stat.stanford.edu/ElemStatLearn/
• G. Bekes and G. Kezdi. Data Analysis for Business, Economics, and Policy
Books
G. James, D. Witten, T. Hastie and R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, Springer Texts in Statistics, 2013.
Downloadable at http://www-bcf.usc.edu/~gareth/ISL/
The course material will be made available on the course website: slides, suggested readings, datasets, supplementary materials (Matlab, R and SAS scripts).
Additional useful references:
• T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, Springer Series in Statistics, 2009. Website: http://www-stat.stanford.edu/ElemStatLearn/
• G. Bekes and G. Kezdi. Data Analysis for Business, Economics, and Policy
Course Delivery
• Practice sessions
• Exercises
• Laboratories (Matlab, R)
Teaching methods
• Classes
• Exercises
• Tutorials (Matlab, R)
Exam Regulations
30% Individual or group assignments
70% Final written exam
The individual or group work (it can be carried out in either mode, at the student's choice, and counts for 30% of the final assessment) aims to assess data processing and statistical modelling skills, as well as the ability to communicate the main findings. The student tackles a case study concerning the prediction of a quantitative variable or classification into a category, and must produce a technical report describing the results obtained.
The final exam is a 120-minute written test that assesses the student's actual attainment of the educational objectives and expected learning outcomes. It consists of 5 open-ended questions with sub-questions that require working through the fundamental elements of the specification of supervised learning models for regression and classification, estimation using the training sample, empirical testing and predictive validation. The student must be able to critically evaluate the assumptions underlying the specification, to summarize the statistical properties of the methods used and to predict the underlying processes. The assessment of the exam papers is based on the ability to critically evaluate the scope of application of each methodology, on the understanding of the fundamental trade-offs between bias and variance, and on the foundations of statistical theory. Each question and sub-question carries a stated number of points that contributes to the final score.
Exam Rules
30% Individual and Group Assignments
70% Final Exam
The assignments contribute 30% of the final assessment and aim to evaluate data processing and statistical modelling skills, as well as the ability to communicate the relevant findings. Students face a case study and a real-life dataset; they are expected to produce a technical report which summarizes their statistical findings and provides the necessary insight into the solution of the case study.
The final exam is a two-hour written test that evaluates the learning of the program topics. Students face open questions with sub-questions that test the understanding of the techniques presented throughout the course and the ability to critically assess their scope. The questions deal with the specification, estimation and validation of models for the prediction of quantitative (regression) and qualitative (classification) variables. Students will have to prove their proficiency in understanding the basic assumptions that are made, how the data are used to learn about the model parameters, and finally how the external and predictive validity of the methods and models is diagnosed. The assessment criteria are based on the students' critical appraisal of the scope and applicability of the methods, on their deep understanding of the trade-offs between goodness of fit and complexity and between bias and variance, and on the rigour with which the properties are presented in the written exam paper. Main questions and items are scored according to difficulty. The score is disclosed to the students directly on the exam paper.
The final grade will be expressed in thirtieths with the following breakdown:
- Fail: significant gaps and/or inaccuracies in the knowledge and understanding of the topics; limited analytical and synthesis skills, frequent generalizations.
- 18-20: barely sufficient knowledge and understanding of the topics, with possible imperfections; sufficient analytical and synthesis skills and independence of judgement.
- 21-23: routine knowledge and understanding of the topics; sound analysis and synthesis with coherent logical reasoning.
- 24-26: good knowledge and understanding of the topics; good analytical and synthesis skills with rigorously expressed arguments.
- 27-29: excellent and comprehensive knowledge and understanding of the topics; notable analytical and synthesis skills and excellent independence of judgement.
- 30-30L: outstanding level of knowledge and understanding of the topics; notable analytical and synthesis skills and independence of judgement. Arguments are expressed in an original manner.
APPLYING KNOWLEDGE AND UNDERSTANDING:
The knowledge acquired is applied to problems of credit scoring, sales forecasting and product pricing. Laboratory sessions, carried out with the software R and Matlab, are an integral part of the course. Students use their knowledge to analyse case studies, both in the laboratory and in the assignments.
MAKING JUDGEMENTS:
The course deals with topics in decision theory. In particular, it addresses the classification problem (the prediction of categorical outputs) and discusses the Bayes decision rule. The underlying theme is the selection of the optimal predictor for a given problem. Students are encouraged to draw conclusions on the internal and external validity of the models considered by comparing the fit of the model in the training sample and in the validation sample.
COMMUNICATION SKILLS:
The course devotes considerable attention to communicating empirical evidence through graphs and summary statistics, and to presenting that evidence to non-experts effectively and concisely. To verify that this learning objective has been achieved, 4 assignments are set, whose main deliverable is a written report highlighting the key interpretive elements of the applications. The software used (R and SAS) is strongly oriented towards the graphical communication of statistical evidence.
LEARNING SKILLS:
Students develop their learning skills in a self-directed or autonomous way by comparing the teaching material provided by the instructor with the readings the instructor suggests each week. They are also encouraged to tackle and solve case studies on the selection of the best predictor and of the most relevant empirical evidence.
Learning Objectives
The course provides an introduction to Statistical Learning and Data Mining.
Advances in information technology have made very rich data sets available, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Most organizations today produce an electronic record of essentially every transaction in which they are involved. Firms collect terabytes of data over operating periods (transaction data, e.g. credit cards). Most often these data are collected as secondary data, with no specific sampling design or research question in mind.
The course offers an insight into the main statistical methodologies for the visualisation and the analysis of business and market data, providing the information requirements for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to empirical applications using modern software tools (RStudio).
The course has the following intended learning outcomes:
- to provide a thorough knowledge of data mining methods and statistical learning techniques;
- to provide the expertise to manage complexity in information and to be able to distil the stylized facts that are relevant for interpretation;
- to be able to predict business outcomes;
- to be able to select a predictive method among those available;
- to be able to communicate the statistical findings to a non-expert audience;
- to be able to perform sophisticated statistical analyses with the appropriate software;
- to critically appraise the potential and the limitations of the available methodologies.
KNOWLEDGE AND UNDERSTANDING:
The course covers the modern statistical methodologies for the visualisation and the analysis of business and market data, that are relevant for making decisions in a complex and rapidly changing business environment.
The fundamental theme is supervised statistical learning, which deals with the prediction of quantitative and qualitative outcomes using a potentially large set of inputs. The two problems, regression and classification, constitute the core of the course.
Emphasis is given to the problem of variable and model selection and to the generalizability of a prediction method outside the training sample, via the optimization of the trade-off between model complexity and in-sample goodness of fit.
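The trade-off described above can be illustrated with a minimal sketch. It is not course material: the course laboratories use R, and Python with NumPy is used here purely for illustration. The simulated data, the polynomial model family and the helper names `poly_mse` and `select_degree` are all assumptions made for the example. The sketch fits polynomials of increasing degree on a training sample and selects the degree with the smallest error on a separate validation sample.

```python
import numpy as np

def poly_mse(x_train, y_train, x_eval, y_eval, degree):
    """Fit a polynomial of the given degree by least squares on the training
    sample and return its mean squared error on the evaluation sample."""
    coefs = np.polyfit(x_train, y_train, degree)
    residuals = y_eval - np.polyval(coefs, x_eval)
    return float(np.mean(residuals ** 2))

def select_degree(x_train, y_train, x_val, y_val, max_degree):
    """Return (best_degree, train_errors, val_errors) for degrees 1..max_degree."""
    degrees = range(1, max_degree + 1)
    train_errors = [poly_mse(x_train, y_train, x_train, y_train, d) for d in degrees]
    val_errors = [poly_mse(x_train, y_train, x_val, y_val, d) for d in degrees]
    # The selected complexity minimises the validation error, not the training error.
    best_degree = 1 + int(np.argmin(val_errors))
    return best_degree, train_errors, val_errors

# Simulated data (hypothetical): a smooth signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=200)
x_train, y_train = x[:100], y[:100]
x_val, y_val = x[100:], y[100:]

best_degree, train_errors, val_errors = select_degree(x_train, y_train, x_val, y_val, 8)
print("degree selected on the validation sample:", best_degree)
```

Because the polynomial models are nested, the training error cannot increase with the degree, whereas the validation error typically stops improving once the extra complexity only fits noise. This is why the selection is made on data not used for estimation.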
APPLYING KNOWLEDGE AND UNDERSTANDING:
The methodologies presented during the course are applied to real-life datasets and case studies, dealing with the prediction of sales, credit scoring and the pricing of goods.
Two hours per week are dedicated to tutorials where statistical analyses are conducted in the Laboratory and implemented in the software RStudio.
Students are expected to perform their statistical analyses in a group assignment.
MAKING JUDGEMENTS:
The prediction of an outcome is an informed decision based on the knowledge of covariates and antecedents. An important supervised learning problem is classification. We discuss the Bayes classification rule and how to select the prediction rule that is optimal for a particular target variable. The student is expected to be able to draw conclusions on the basis of the statistical evidence and to validate those conclusions on validation or test samples drawn from the same target population.
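For reference, under 0-1 misclassification loss the Bayes classification rule assigns an observation to the most probable class given its inputs. The notation below (G for the class label, X for the vector of inputs, K for the number of classes) is an assumption borrowed from common textbook usage and does not appear in this syllabus:

```latex
\hat{G}(x) = \arg\max_{k \in \{1,\dots,K\}} \Pr(G = k \mid X = x)
```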
COMMUNICATION SKILLS:
Particular attention is dedicated to the ability to communicate the statistical evidence in a systematic and synthetic way, using graphs and summaries, to a non-specialist target audience.
The software used in the tutorials is oriented towards graphical displays and visualization of data. The student is asked to report on the statistical analysis carried out for a particular purpose in the individual assignments.
LEARNING SKILLS:
Students develop their learning skills by comparing the teaching material provided by the instructor and presented in the lectures with the readings suggested each week. The software tutorials and the analysis of case studies in the assignments will help build their applied skills and their autonomous progress towards the intended learning outcomes.
Prerequisites
Programme
1. Introduction to data mining. Tools for data analysis, visualisation and description.
2. The linear regression model.
3. Model selection and evaluation: bias-variance trade-off, model complexity and goodness of fit. Cross-validation. Selection using information criteria.
4. Regularization and shrinkage methods: ridge regression, lasso, forward stagewise regression. Principal components regression.
5. Linear methods for classification: Bayes Classification Rule. Discriminant analysis. Canonical variates. Logistic regression.
6. Semiparametric regression: Regression splines and smoothing splines.
7. Kernel smoothing methods: Local polynomial regression. Density estimation. Nearest neighbor classification.
8. Additive Models, tree-based methods. GAM, Regression and classification trees. Boosting.
Books
G James, D Witten, T Hastie and R Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, Springer Texts in Statistics, 2013.
Downloadable at http://www-bcf.usc.edu/~gareth/ISL/
The course material will be made available on the course website: slides, suggested readings, datasets, supplementary materials (Matlab, R and SAS scripts).
Additional useful references:
• T Hastie, R Tibshirani and J Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, Springer Series in Statistics, 2009. Website: http://www-stat.stanford.edu/ElemStatLearn/
• G. Bekes and G. Kezdi. Data Analysis for Business, Economics, and Policy
Teaching methods
• Classes
• Exercises
• Tutorials and laboratories (Matlab, R)
Exam Rules
30% Individual and Group Assignments
70% Final Exam
The assignments (worth 30% of the final assessment) aim at assessing skills in statistical processing and modelling, as well as the ability to communicate the relevant findings. Students are expected to produce a technical report of at most 10 pages.
The final exam is a written test of 120 minutes which assesses the learning of the programme topics.
Updated A.Y. 2021-2022
LEARNING OBJECTIVES
The course provides an introduction to Statistical Learning and Data Mining.
Advances in information technology have made very rich data sets available, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Most organizations today produce an electronic record of essentially every transaction in which they are involved. Firms collect terabytes of data over operating periods (transaction data, e.g. credit cards). Most often these data are collected as secondary data, with no specific sampling design or research question in mind.
Data Mining deals with inferring and validating patterns, structures and relationships in data, as a tool to support decisions in the business environment.
The course offers an insight into the main statistical methodologies for the visualisation and the analysis of business and market data, providing the information requirements for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to empirical applications using modern software tools (RStudio, Matlab).
The course has the following intended learning outcomes:
- to provide a thorough knowledge of data mining methods and statistical learning techniques;
- to provide the expertise to manage complexity in information and to be able to distill the stylized facts that are relevant for interpretation;
- to be able to predict business outcomes;
- to be able to select a predictive method among those available;
- to be able to communicate the statistical findings to a non-expert audience;
- to be able to perform sophisticated statistical analyses with the appropriate software;
- to critically appraise the potential and the limitations of the available methodologies.
PROGRAMME
Introduction to data mining. Tools for data analysis, visualisation and description. The linear regression model.
Model selection and evaluation: bias-variance trade-off, model complexity and goodness of fit. Cross-validation. Selection using information criteria.
Regularization and shrinkage methods: ridge regression, lasso, forward stagewise regression. Principal components regression.
Linear methods for classification: Bayes Classification Rule. Discriminant analysis. Canonical variates. Logistic regression.
Semiparametric regression: Regression splines and smoothing splines.
Kernel smoothing methods: Local polynomial regression. Density estimation. Nearest neighbor classification.
Additive Models, tree-based methods. GAM, Regression and classification trees. Boosting.
Knowledge and Understanding:
The course covers the modern statistical methodologies for the visualisation and the analysis of business and market data, that are relevant for making decisions in a complex and rapidly changing business environment.
The fundamental theme is supervised statistical learning, which deals with the prediction of quantitative and qualitative outcomes using a potentially large set of inputs. The two problems, regression and classification, constitute the core of the course.
Emphasis is given to the problem of variable and model selection and to the generalizability of a prediction method outside the training sample, via the optimization of the trade-off between model complexity and in-sample goodness of fit.
TEXTBOOK
G James, D Witten, T Hastie and R Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, Springer Texts in Statistics, 2013.
Downloadable at http://www-bcf.usc.edu/~gareth/ISL/
Updated A.Y. 2020-2021
LEARNING OBJECTIVES
The course provides an introduction to Statistical Learning and Data Mining.
Advances in information technology have made very rich data sets available, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Most organizations today produce an electronic record of essentially every transaction in which they are involved. Firms collect terabytes of data over operating periods (transaction data, e.g. credit cards). Most often these data are collected as secondary data, with no specific sampling design or research question in mind.
Data Mining deals with inferring and validating patterns, structures and relationships in data, as a tool to support decisions in the business environment.
The course offers an insight into the main statistical methodologies for the visualisation and the analysis of business and market data, providing the information requirements for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to empirical applications using modern software tools (RStudio, Matlab).
The course has the following intended learning outcomes:
- to provide a thorough knowledge of data mining methods and statistical learning techniques;
- to provide the expertise to manage complexity in information and to be able to distill the stylized facts that are relevant for interpretation;
- to be able to predict business outcomes;
- to be able to select a predictive method among those available;
- to be able to communicate the statistical findings to a non-expert audience;
- to be able to perform sophisticated statistical analyses with the appropriate software;
- to critically appraise the potential and the limitations of the available methodologies.
PROGRAMME
Introduction to data mining. Tools for data analysis, visualisation and description. The linear regression model.
Model selection and evaluation: bias-variance trade-off, model complexity and goodness of fit. Cross-validation. Selection using information criteria.
Regularization and shrinkage methods: ridge regression, lasso, forward stagewise regression. Principal components regression.
Linear methods for classification: Bayes Classification Rule. Discriminant analysis. Canonical variates. Logistic regression.
Semiparametric regression: Regression splines and smoothing splines.
Kernel smoothing methods: Local polynomial regression. Density estimation. Nearest neighbor classification.
Additive Models, tree-based methods. GAM, Regression and classification trees. Boosting.
Knowledge and Understanding:
The course covers the modern statistical methodologies for the visualisation and the analysis of business and market data, that are relevant for making decisions in a complex and rapidly changing business environment.
The fundamental theme is supervised statistical learning, which deals with the prediction of quantitative and qualitative outcomes using a potentially large set of inputs. The two problems, regression and classification, constitute the core of the course.
Emphasis is given to the problem of variable and model selection and to the generalizability of a prediction method outside the training sample, via the optimization of the trade-off between model complexity and in-sample goodness of fit.
TEXTBOOK
G James, D Witten, T Hastie and R Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, Springer Texts in Statistics, 2013.
Downloadable at http://www-bcf.usc.edu/~gareth/ISL/
Updated A.Y. 2019-2020
LEARNING OBJECTIVES
The course provides an introduction to Statistical Learning and Data Mining.
Advances in information technology have made very rich data sets available, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Most organizations today produce an electronic record of essentially every transaction in which they are involved. Firms collect terabytes of data over operating periods (transaction data, e.g. credit cards). Most often these data are collected as secondary data, with no specific sampling design or research question in mind.
Data Mining deals with inferring and validating patterns, structures and relationships in data, as a tool to support decisions in the business environment.
The course offers an insight into the main statistical methodologies for the visualisation and the analysis of business and market data, providing the information requirements for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to empirical applications using modern software tools (RStudio, Matlab, SAS).
The course has the following intended learning outcomes:
- to provide a thorough knowledge of data mining methods and statistical learning techniques;
- to provide the expertise to manage complexity in information and to be able to distill the stylized facts that are relevant for interpretation;
- to be able to predict business outcomes;
- to be able to select a predictive method among those available;
- to be able to communicate the statistical findings to a non-expert audience;
- to be able to perform sophisticated statistical analyses with the appropriate software;
- to critically appraise the potential and the limitations of the available methodologies.
PROGRAMME
Introduction to data mining. Tools for data analysis, visualisation and description. The linear regression model.
Model selection and evaluation: bias-variance trade-off, model complexity and goodness of fit. Cross-validation. Selection using information criteria.
Regularization and shrinkage methods: ridge regression, lasso, forward stagewise regression. Principal components regression.
Linear methods for classification: Bayes Classification Rule. Discriminant analysis. Canonical variates. Logistic regression.
Semiparametric regression: Regression splines and smoothing splines.
Kernel smoothing methods: Local polynomial regression. Density estimation. Nearest neighbor classification.
Additive Models, tree-based methods. GAM, Regression and classification trees. Boosting.
Knowledge and Understanding:
The course covers the modern statistical methodologies for the visualisation and the analysis of business and market data, that are relevant for making decisions in a complex and rapidly changing business environment.
The fundamental theme is supervised statistical learning, which deals with the prediction of quantitative and qualitative outcomes using a potentially large set of inputs. The two problems, regression and classification, constitute the core of the course.
Emphasis is given to the problem of variable and model selection and to the generalizability of a prediction method outside the training sample, via the optimization of the trade-off between model complexity and in-sample goodness of fit.
TEXTBOOK
G James, D Witten, T Hastie and R Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, Springer Texts in Statistics, 2013.
Downloadable at http://www-bcf.usc.edu/~gareth/ISL/
Updated A.Y. 2019-2020
LEARNING OBJECTIVES
The course provides an introduction to Statistical Learning and Data Mining.
The advances in information technology have made available very rich information data sets, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Most organizations today produce an electronic record of essentially every transaction in which they are involved. Firms collect terabytes data over operating periods (transactions data, e.g. credit cards). Most often these data are collected as secondary data, with no specific sampling design or research question on top.
Data Mining deals with inferring and validating patterns, structures and relationships in data, as a tool to support decisions in the business environment.
The course offers an insight into the main statistical methodologies for the visualisation and the analysis of business and market data, providing the information requirements for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to empirical applications using modern software tools (Rstudio, Matlab, SAS).
The course has the following intended learning outcomes:
- to provide a thorough knowledge of data mining methods and statistical learning techniques;
- to provide the expertise to manage complexity in information and to distill the stylized facts that are relevant for interpretation;
- to be able to predict business outcomes;
- to be able to select a predictive method among those available;
- to be able to communicate the statistical findings to a non-expert audience;
- to be able to perform sophisticated statistical analyses with the appropriate software;
- to critically appraise the potential and the limitations of the available methodologies.
PROGRAMME
Introduction to data mining. Tools for data analysis, visualisation and description. The linear regression model.
Model selection and evaluation: bias-variance trade-off, model complexity and goodness of fit. Cross-validation. Selection using information criteria.
Regularization and shrinkage methods: ridge regression, lasso, forward stagewise regression. Principal components regression.
Linear methods for classification: Bayes classification rule. Discriminant analysis. Canonical variates. Logistic regression.
Semiparametric regression: Regression splines and smoothing splines.
Kernel smoothing methods: Local polynomial regression.
Density estimation. Nearest neighbor classification.
Additive models, tree-based methods. GAM, regression and classification trees. Boosting.
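As a rough illustration of the shrinkage methods listed in the programme, the sketch below compares least squares, ridge and lasso in the simplest possible setting: a single centred predictor and no intercept. The toy data, the penalty values and the function names are assumptions chosen for illustration. In this special case the ridge penalty rescales the least-squares slope, while the lasso penalty soft-thresholds it and can set it exactly to zero.

```python
import math

def ols_slope(x, y):
    """Minimizes sum((y - b*x)^2)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / sxx

def ridge_slope(x, y, lam):
    """Minimizes sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def lasso_slope(x, y, lam):
    """Minimizes sum((y - b*x)^2) + lam * |b| via soft-thresholding."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Toy data (an assumption for illustration): x is centred, y is roughly x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.2, 0.1, 0.8, 2.2]

print("least squares:", ols_slope(x, y))
for lam in (10.0, 30.0):
    print(f"lambda={lam}: ridge={ridge_slope(x, y, lam):.3f}, "
          f"lasso={lasso_slope(x, y, lam):.3f}")
```

For a large enough penalty the lasso slope is exactly zero, whereas the ridge slope is only shrunk towards zero. This is the reason the lasso can also be used for variable selection.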