Pairing two people with Genetic Algorithms

Pairing two people with Genetic Algorithms

Problem

I was recently selected to participate as a student leader during my PyTorch Facebook Challenge for the student mentor channel. The main purpose of this group is to pair students with a mentor that can advise them when necessary. One of the most time consuming part is to match the students and mentors since everyone has a different level skillset and speak different languages.

Solution

I attempted to solve this problem by using genetic algorithms to optimize the selection process. Genetic algorithms (GA) are powerful because they are extremely robust and adaptable. It can be implemented to any problem without much additional coding. Also, GA typically will provide some sort of solution but it will not be guaranteed to be a global minimum. However, most if not all of the solutions, for a properly constructed GA code will provide usable results. You can find more about Genetic Algorithms with python example code using the same toolbox I selected here.

You can find my final result notebook through my Github.

As with any optimization process, these three attributes has to be clearly defined:

  1. The system
  2. The cost function
  3. How the system should move according to the cost function

System

I defined the system through a relationship matrix. The relationship matrix defines how each person is related with each other. The higher the relationship, the higher probability that the two people will be paired with each other. I do this by iterating through every student (column) with every mentor (row). The weights for the relationship are the following:

  1. Language = 500
  2. Skillset Matching = 200

This is done in python with the following code:

relationship_m = np.zeros((number_of_students, number_of_mentors))

for index_s, student in students.iterrows():
    student_level = skill[student['Skill']]
    for index_m, mentor in mentors.iterrows():
        score = 0
        if student['Language'] == mentor['Language']:
            score += language_weight
        mentor_level = skill[mentor['Skill']]
        skill_difference = mentor_level - student_level
        if skill_difference > 0:
            score += skill_weights
        relationship_m[index_s,index_m] = score

Cost Function

The cost function is the sum of the relationship matrix for the selected pairs.

def evalrel(individual):
    cost = 0
    for gene1, gene2 in zip(individual[0], individual[1]):
        cost += relationship_m[gene1][gene2]
    return cost,

Minima movement

To reach the minima, GA mutates a set of the results and merges (crossover) with the current result to make a “stronger” child (mimicking our biological process). I decided to mutate the offspring through random index generation.

def myMutation(individual):
    
    shuffle(individual[0])
    shuffle(individual[1])
    return (individual,)

DEAP (Evolutionary Algorithms for Python)

The most widespread and flexible library for Python is DEAP. DEAP is powerful since you can define anything you want for the evolutionary process. Also, using it is extremely easy an straight forward. A good explanation and application can be found here. Any genetic algorithm follows these steps:

  1. Initialization
  2. Fitness Assignment
  3. Selection
  4. Crossover
  5. Mutation
  6. Rinse and repeat until cost function is satisfactory

Initialization and Fitness Assignment

The fundamental idea of DEAP is to track everything with objects. The documentation for DEAP is plentiful. I recommend reading their documentation and referencing the One Max Problem example to learn more about DEAP. The main idea behind DEAP is to create a object and adding/changing attributes to streamline the process. DEAP also has a set of useful preset tools that can be used to make the process even simpler.

First, you must create the objects. One for the fitness assignment and one for the individual. The individual is an extremely important idea for DEAP. The individual is the selected offspring for the given generation. This is normally used to evaluate the current generation by passing it through the fitness function. Note the weights=(1.0,) in the code. If you want to minimize the cost function, change the weight to -1. If you want to optimize multiple fitness functions, increase the weight length. For example, if you are trying to optimize two parameters, the weights will be declared as weights=(1.0, 1.0). In addition, the individual can be anything you want. It can be a list or it can be an array. You can change the type by changing the list in the code to array.array

toolbox = base.Toolbox()

creator.create("FitnessMin", base.Fitness, weights=(1.0,))
creator.create("Individual", list, typecode='i', fitness=creator.FitnessMin)

Now that individual is created, there must be some value associated with it. To assign the individual with values, indices needs to be created. In this case, the individual is two lists of randomly scrambled indexes with the value range of [0,4]. Note that the toolbox.register adds attributes to the toolbox class which keeps track of all the parameters and attributes you define. In addition, the values are added with the parameter tools.initCycle. Since the individual is two lists, this is the only tool that makes sense and works. If the individual was just one list, the tools.initIterate can be used. The created individual is then fed to the population.

toolbox.register("indices1", random.sample, range(IND_SIZE), IND_SIZE)
toolbox.register("indices2", random.sample, range(IND_SIZE), IND_SIZE)

toolbox.register("individual", tools.initCycle, creator.Individual,
            (toolbox.indices1, toolbox.indices2))
                
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

Selection, Crossover, Mutation

The remainder process is pretty straightforward since DEAP takes care of most of this part. They have to be defined similar to how the individual was defined. For this program, I used the mu + lambda solver. To keep track of the solving progress, the statistic of the population is tracked and outputted. The best offspring through the evolutionary process is kept tracked through the hall of fame tool that DEAP provides (hof). If multiple ideal solutions exist, the hof variable will have multiple lists.

toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("select", tools.selBest)
toolbox.register("evaluate", evalrel)
toolbox.register("mutate", myMutation)

random.seed(169)

NGEN = 50
MU = 50
LAMBDA = 100
CXPB = 0.7
MUTPB = 0.2

pop = toolbox.population(n=MU)
hof = tools.ParetoFront()
stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("avg", numpy.mean, axis=0)
stats.register("std", numpy.std, axis=0)
stats.register("min", numpy.min, axis=0)
stats.register("max", numpy.max, axis=0)

algorithms.eaMuPlusLambda(pop, toolbox, MU, LAMBDA, CXPB, MUTPB, NGEN, stats,
                            halloffame=hof)

Using the Results

The results are taken and ran through a check. I double check that all the pairs speak the same language and that the skillset matching are actually coherent. If everything looks good, the results are saved into a new pandas data frame.

    print("MATCHED RESULTS")
    student_list = []
    mentors_list = []
    language_list = []
    for student_idx,mentor_idx in zip(hof.items[-1][0],hof.items[-1][1]):
          flag_language = True if (students['Language'][student_idx] == mentors['Language'][mentor_idx]) else print("Language Error") 
          
          student_skill = skill[students['Skill'][student_idx]]
          mentor_skill = skill[mentors['Skill'][student_idx]]
          flag_skill = True if (mentor_skill >=student_skill) else print("Skill Error") 
          if flag_language and flag_skill:
            student_list.append(students['Name'][student_idx])
            mentors_list.append(mentors['Name'][mentor_idx])
            language_list.append(mentors['Language'][mentor_idx])
        
    result = pd.DataFrame({'Students':student_list, "Mentors":mentors_list, "Langugage":language_list})
    
    display(result)