Hello World, it’s Siraj! In this video, we’re going to use genetic programming to identify whether some energy is gamma radiation or not. I’m getting angry. Gamma rays! Augh! Nah, I wish.

Data science is a way of thinking about discovery. A data scientist needs to decide the right question to ask, like “Who’s the best candidate to vote for in the US election?,” then decide what dataset to use, like the tweet history and past endorsements of each candidate, and lastly decide what machine learning model to use on the data to discover the right answer. ♫ Life goes on! ♫

With the right data, computing power, and machine learning model, you can discover a solution to any problem, but knowing which model to use can be challenging for new data scientists. There are so many of them! That’s where genetic programming can help. Genetic algorithms are inspired by the Darwinian process of natural selection, and they’re used to generate solutions to optimization and search problems. They have three properties: selection, crossover, and mutation. You have a population of possible solutions to a given problem and a fitness function. Every iteration, we evaluate how fit each solution is with our fitness function. Then we select the fittest ones and perform crossover to create a new population. We take those children and mutate them with some random modification, and repeat the process until we get the fittest, or best, solution.

So take this problem, for instance. Let’s say you want to take a road trip across a bunch of cities. What’s the shortest possible path you could take to hit up each city once and then return back to your home city? This is popularly called the “traveling salesman problem” in computer science, and we can use a genetic algorithm to help us solve it. Let’s look at some high-level Python code. We have the number of generations set to 5,000 and the population size set to 100. So we start by initializing our population using our size parameter.
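The code itself isn’t pasted in the video, so here’s a minimal, runnable sketch of the loop being described. The city coordinates are made up, the helper functions (ordered crossover and a swap mutation) are common choices standing in for whatever the original notebook used, and the generation count is scaled down from 5,000 so it finishes quickly:

```python
import random

# Toy city coordinates, made up for illustration.
CITIES = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3), (2, 8)]
NUM_GENERATIONS = 500   # the video uses 5,000
POP_SIZE = 100

def tour_length(tour):
    """Fitness function: total round-trip distance, returning to the start city."""
    total = 0.0
    for i in range(len(tour)):
        x1, y1 = CITIES[tour[i]]
        x2, y2 = CITIES[tour[(i + 1) % len(tour)]]
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total

def crossover(a, b):
    """Ordered crossover: copy a random slice from parent a, fill the rest from b."""
    i, j = sorted(random.sample(range(len(a)), 2))
    child = a[i:j]
    child += [c for c in b if c not in child]
    return child

def mutate(tour, rate=0.1):
    """Random modification: swap two cities with probability `rate`."""
    tour = tour[:]
    if random.random() < rate:
        i, j = random.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour

def evolve():
    # Each individual is a random ordering of the cities.
    population = [random.sample(range(len(CITIES)), len(CITIES))
                  for _ in range(POP_SIZE)]
    for _ in range(NUM_GENERATIONS):
        # Selection: keep the top 10% (the shortest road trips).
        population.sort(key=tour_length)
        elite = population[: POP_SIZE // 10]
        # Crossover + mutation to refill the population with offspring.
        population = elite + [
            mutate(crossover(*random.sample(elite, 2)))
            for _ in range(POP_SIZE - len(elite))
        ]
    return min(population, key=tour_length)

best = evolve()
print(best, tour_length(best))
```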
Each individual in our population represents a different solution path. Then, for each generation, we compute the fitness of each solution and store it in our population fitness array. Now we’ll perform selection by only taking the top 10% of the population, which are our shortest road trips, and produce offspring from them by performing crossover. Then we mutate those offspring randomly and repeat the process. As you can see in the animation, eventually we will get an optimal solution using this process, unlike Apple Maps.

Alright, so how does this all fit into data science? Well, it turns out that choosing the right machine learning model and all the best hyperparameters for that model is itself an optimization problem. We’re going to use a Python library called TPOT, built on top of scikit-learn, that uses genetic programming to optimize our machine learning pipeline. So after formatting our data properly, we need to know what features to input to our model and how we should construct those features. Once we have those features, we’ll input them into our model to train on, and we’ll want to tune our hyperparameters, or tuning knobs, to get the optimal results. Instead of doing all this ourselves through trial and error, TPOT automates these steps for us with genetic programming, and it will output the optimal code for us when it’s done so we can use it later.

So we’re going to create a classifier for gamma radiation using TPOT after installing our dependencies, and then analyze the results. TPOT is built on the popular scikit-learn machine learning library, so we’ll want to make sure that we have that installed first. Then we’ll install pandas to help us analyze our data and numpy to perform math calculations. Our first step is to load our dataset. We’ll use pandas’ read_csv() method and set the parameter to the name of our saved CSV file.
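Here’s a hedged sketch of the loading and preprocessing steps (the shuffling, label mapping, and train/test split are walked through next). The real data is the telescope CSV you saved, loaded with read_csv(); since that filename is whatever you chose, this sketch builds a tiny stand-in DataFrame with the same kind of ‘Class’ column so it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# In the video the data comes from a saved CSV, e.g.:
#   tele = pd.read_csv('your_telescope_file.csv')
# Here we fake a tiny frame with the same shape so the sketch runs standalone.
tele = pd.DataFrame({
    'fLength': np.random.rand(40),
    'fWidth': np.random.rand(40),
    'Class': ['g'] * 20 + ['h'] * 20,   # rows grouped by class, like the real file
})

# Shuffle the rows (they arrive grouped by class), then reset the index
# so it counts 0, 1, 2, ... again while the data stays shuffled.
tele = tele.iloc[np.random.permutation(len(tele))]
tele = tele.reset_index(drop=True)

# Map the string labels to integers: 'g' (gamma) -> 0, 'h' (hadron) -> 1.
tele['Class'] = tele['Class'].map({'g': 0, 'h': 1})
tele_class = tele['Class'].values

# 75/25 split, stratified on the class labels so both sets keep the
# same gamma/hadron proportions.
training_indices, validation_indices = train_test_split(
    tele.index, stratify=tele_class, train_size=0.75, test_size=0.25)
print(len(training_indices), len(validation_indices))
```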
This is data collected from a scientific instrument called a “Cherenkov telescope” that measures radiation in the atmosphere, and these are a bunch of features of whatever type of radiation it picks up. Thanks, Putin! Since the rows are already grouped by class, we’ll shuffle our data to get a better result. The iloc indexer of the telescope variable is pandas’ way of selecting rows by their positions in the index, and we’ll generate a sequence of random indices the size of our data using the permutation function of numpy’s ‘random’ submodule. Since all the instances are now randomly rearranged, we’ll just reset all these indices so they are ordered even though the data is now shuffled, using the reset_index() method of pandas with the drop parameter set to “True.”

We’ll now let our ‘tele’ variable know what our two classes are by mapping both of them to an integer with the map() method. So ‘g’ for “gamma” is set to 0; ‘h’ for “hadron” is set to 1. Let’s store those ‘Class’ labels, which we’re going to predict, in a separate variable called ‘tele_class’ and use the ‘values’ attribute to retrieve it.

Before we train our model, we need to split our data into training and validation sets. We’ll use the train_test_split() method from scikit-learn that we imported to create the indices for both. The first parameter is the index of our dataset. We want both sets to keep the same ratio of gammas to hadrons, so we’ll set the ‘stratify’ parameter to our class label array. Then we’ll define what percent of our data we want to be training and testing with these last two parameters. We have a 75/25 split now in our data, and we’re ready to train our model. We’ll initialize the ‘tpot’ variable using the ‘TPOTClassifier’ class with the number of generations set to 5. On a standard laptop with 4 gigs of RAM, it takes five minutes per generation to run, so this will take about 25 minutes.
This is so TPOT’s genetic algorithm knows how many iterations to run for, and we’ll set ‘verbosity’ to 2, which just means “Show a progress bar in the terminal during the optimization process.” Then we can call the fit() method on our training data to let it perform optimization using genetic programming. The first parameter is the training feature set, which we’ll retrieve from our ‘tele’ variable along the first axis for every training index. The second parameter is our training class set, which we’ll retrieve from our ‘tele’ variable like so. We can compute the testing error for validation using TPOT’s score() method with the validation feature set as the first parameter and the validation class set as the second. We’ll export the computed Python code to a pipeline.py file using the export() method and name it in the parameter as a string.

Let’s demo this thing. After training, we’ll see that after five generations, TPOT chose the gradient boosting classifier as the most accurate machine learning model to use. It also shows the optimal hyperparameters, like the learning rate and number of estimators, for us. ♫ Yeah, boyyy! ♫

So, to break it down: with the right amount of data, computing power, and machine learning model, you can discover a solution to any problem. Genetic algorithms replicate evolution via selection, crossover, and mutation to find an optimal solution to a problem, and TPOT is a Python library that uses genetic programming to help you find the best model and hyperparameters for your use case.

The winner of the coding challenge from the last video is Peter Mitrano. He added some great Deep Dream samples to his repository, and even Deep Dream’d my own video. Badass of the week! And the runner-up is Kyle Jordaan. Good job stitching all the Deep Dream’d frames together with one line of code. The challenge for this video is to use TPOT and a climate change dataset that I’ll provide to predict the answer to a question you decide.
This will be great practice in learning to think like a data scientist. Post your GitHub link in the comments and I’ll announce the winner next time. For now, I’ve got to stay fit to reproduce, so thanks for watching.