ELEN0062 - Introduction to machine learning (iML)

Random ML quote

With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… Big data isn’t about bits, it’s about talent.

Douglas Merrill

Information

Schedule

  • Installation (04 Oct. 2017): Python, NumPy, SciPy, and scikit-learn installation with Anaconda (a sanity-check snippet follows below).
  • TD / Assignment (11 Oct. 2017): A crash course on the Python ecosystem. Bring your laptop if you want to follow along. If you have questions regarding the exercises, you can email me.
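
To verify that the installation went smoothly, here is a minimal sanity check; it only imports the packages and prints their versions:

    # Minimal sanity check: import the scientific stack and print versions.
    import numpy
    import scipy
    import sklearn

    for package in (numpy, scipy, sklearn):
        print(package.__name__, package.__version__)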

First assignment

  • Q&A (18 Oct. 2017): Question/answer session regarding the first assignment.
  • Q&A (25 Oct. 2017): Question/answer session regarding the first assignment.
  • Deadline / Assignment / Q&A (31 Oct. 2017): Don't forget to submit your first assignment. You may use up to 5 pages (with 1 page for question 3.1).

Second assignment (Antonio Sutera is the reference TA for this assignment)

  • Deadline / Project / Feedback (24 Nov. 2017): Don't forget to submit your second assignment.

Assignment 3 (challenge)


  • Deadline (27 Nov. 2017): [Setup] Find a group, register on Kaggle, download the data, make the toy submission.
  • Deadline (15 Dec. 2017): End of the challenge.
  • Deadline (17 Dec. 2017): Don't forget to submit your report regarding the challenge.

Third assignment: the challenge

The third project is organized in the form of a challenge, where you will compete against each other. This year, the challenge is about speech recognition. More precisely, you will have to develop a model that can recognize the digits 0 to 9 as well as non-digits. All the relevant information can be found on the Kaggle platform, which will host the challenge.

The project is divided into four parts. All the deadlines can be found in the schedule section above.

  1. Setup for the project
    • Create an account on the Kaggle platform. Use your real name so that we can identify you.
    • Form groups of two (or three)
    • Test the toy example
  2. Propose the best model you can
  3. Submit an archive on the submission platform in tar.gz format, containing a report that describes the different steps of your approach and your main results, along with your source code. Use the same IDs as on the Kaggle platform. The report must contain the following information:
    • A detailed description of all the approaches that you have used to win the challenge, including the feature engineering you performed. The Kaggle winning model guidelines should be followed for each approach (you can disregard the background info, just specify your section).
    • A detailed description of your hyper-parameter optimization approach and your model validation technique (see the sketch after this list).
    • A table summarizing the performance of your different approaches, containing for each approach at least its name, the validation score, and the scores on the public and private leaderboards.
    • Any complementary information or figures that you want to mention.
  4. Succinctly present your approach to the rest of the class. (More information coming soon.)
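
As a reference point for the hyper-parameter optimization and validation mentioned above, here is a minimal sketch using scikit-learn's GridSearchCV. The estimator, parameter grid, and synthetic data are placeholder assumptions, not the expected challenge solution:

    # A minimal hyper-parameter search with cross-validation.
    # The estimator, grid, and synthetic data are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
        cv=5,                # 5-fold cross-validation
        scoring="accuracy",
    )
    search.fit(X_train, y_train)

    print(search.best_params_)           # the selected hyper-parameters
    print(search.best_score_)            # cross-validated score (report this)
    print(search.score(X_test, y_test))  # estimate on held-out data

The cross-validated score plays the role of the "validation score" in the summary table, next to the public and private leaderboard scores.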

Have fun!

Cheat sheet for ML in Python

Check out DataCamp for more.
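
For reference, the basic fit/predict/score pattern that these cheat sheets revolve around (the dataset and estimator choices here are arbitrary):

    # fit / predict / score: the core scikit-learn estimator API.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(clf.predict(X_test[:5]))    # predicted classes for a few samples
    print(clf.score(X_test, y_test))  # mean accuracy on the test set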

Supplementary material

Here is a short list of supplementary material related to the field of machine learning. I tend to update this section when I come across interesting material, but if you feel you need more on some topic, do not hesitate to ask!

Machine learning in general

There is plenty of accessible online material in the domain of machine learning:

Linear regression

The geometry of Least Squares (1 variable)

Note that ANOVA is a special case of linear models where the input variables are one-hot (dummy) class variables. Consequently, the basis vectors of the column space are orthogonal and the problem reduces to several one-variable least-squares problems.
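
To make this concrete, here is a small NumPy sketch (the data are made up): with a one-hot design matrix the columns are orthogonal, so the least-squares coefficients come out as the per-class means.

    # Least squares on a one-hot (dummy) design matrix: because the
    # columns are orthogonal, each coefficient is just a class mean.
    import numpy as np

    classes = np.array([0, 0, 1, 1, 2, 2, 2])
    y = np.array([1.0, 2.0, 4.0, 6.0, 3.0, 5.0, 7.0])

    X = np.eye(3)[classes]  # one-hot design matrix, shape (7, 3)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(beta)                                        # least-squares coefficients
    print([y[classes == k].mean() for k in range(3)])  # identical class means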

Artificial neural networks

There have been three waves of hype around ANNs. The first was about the perceptron in the 60s, until it was discovered that it could not solve the XOR problem. The second started with the discovery of backpropagation, but it soon became clear that large and/or deep neural nets were very hard to train. We are in the midst of the third one right now with "deep learning": neural nets with several (many) hidden layers. As a consequence, the internet is bursting with resources on the topic, from the simplest models (the multi-layer perceptron) to the most advanced architectures (such as GANs), going through more classical ones (such as ConvNets and LSTMs).
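
To illustrate the XOR point, here is a minimal sketch with scikit-learn (the architecture and hyper-parameters are arbitrary choices): a linear perceptron cannot fit XOR, while a single hidden layer can.

    # XOR: a linear perceptron cannot fit it; one hidden layer can.
    import numpy as np
    from sklearn.linear_model import Perceptron
    from sklearn.neural_network import MLPClassifier

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])  # XOR labels

    linear = Perceptron(max_iter=1000).fit(X, y)
    print("perceptron:", linear.score(X, y))  # < 1.0: not linearly separable

    mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                        solver="lbfgs", random_state=0,
                        max_iter=2000).fit(X, y)
    print("hidden layer:", mlp.score(X, y))   # should reach 1.0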

Learning theory (Bias/Variance...)

Support Vector Machines

Unsupervised learning

Misc.

There are many YouTube channels about ML. Here are a few:

Pre-requisites

Machine learning requires a solid background in maths, especially in linear algebra, (advanced) probability theory, and (multivariable) calculus. There are even more resources on these topics than on deep learning. Here is a short selection that emphasizes intuition.

Linear algebra

  • The 3Blue1Brown series on linear algebra
  • If you prefer paper (or PDF): Practical Linear Algebra: A Geometry Toolbox, 2nd Edition, by Gerald Farin and Dianne Hansford, A K Peters/CRC Press (2004)

Calculus

Last modified on November 23, 2017, at 10:46