RiskIntel - Auto Insurance | Taissir Boukrouba

Note: This project description is still being developed and will be updated soon. You may review the current content in the meantime.

Problem Definition

An auto-insurance company was having problems conserning sales and customer risk such :

Adverse Selection : high-risk drivers (those more likely to get into accidents) meaning younger and wreckless drivers are more likely to buy insurance, while safe drivers (older for instance) drop coverage because they find it expensive.
Fraudulent Claims : some policyholders file fake or exaggerated claims to get money from the insurance company. This increases overall costs for insurers, which can then be passed on to honest customers through higher premiums
Difficulty Assessing Risk : Traditionally, factors like driving history, age, and location have been used to assess risk. However, these don't always paint the whole picture. New factors like distracted driving due to phones or the increasing complexity of modern vehicles with advanced driver-assistance systems (ADAS) make risk assessment more challenging.

Problem	Solution
Low Sales	Descriptive dashboard containing the most important KPIs, for the company to know the current state of sales.
Customer Risk	A machine learning model that predicts the riskiness of the customer according to the presented data.

Data Description

The dataset comprises four CSV files containing auto-insurance data collected over a period from 2004 to 2008. In total, the dataset includes approximately 2,183,505 rows and 36 features (columns) across four distinct types. The features are categorized as follows:

Policyholder Information consists of 8 features related to the individuals who hold the insurance policies, such as age, gender, location, and other demographic details. Vehicle Information includes 14 features that encompass details about the vehicles insured, including make, model, year, mileage, and safety ratings. Policy Information contains 10 features regarding the insurance policies themselves, such as coverage type, premium amounts, deductibles, and policy duration.

Lastly, there are 2 additional features referred to as c_49 and year, which may serve specific analytical purposes within the dataset. This comprehensive dataset provides a rich resource for analyzing trends in auto insurance, assessing risk factors, and developing predictive models. This diagram summarises it perfectly :

The dataset consists of various data types, which include integers, strings, doubles, and timestamps (representing both date and hour). This diversity in data types is crucial for accurate analysis and modeling, as different features may require specific data handling approaches. The size of the data varies significantly across the years.

The CSV files collected by the client company for 2004 contained approximately 439K records, while 2005 saw a slight increase with around 452K records The years 2006 and 2007 had the highest volume, with a combined total of approximately 859K rows, indicating a significant amount of data collection during this period.

In contrast, the dataset for 2008 had the fewest records, totaling about 431K entries. This fluctuation in record counts across the years highlights the dynamic nature of data collection and provides valuable insights into trends in auto insurance during this time frame.All of this is summarised in the following bar chart :

Data Quality Assesement

Data quality was extremely poor, and that’s what our exploratory data analysis revealed. Numerous issues were identified, indicating significant inconsistencies and gaps within the dataset. These problems not only impacted the overall reliability of the data but also posed challenges for any subsequent analyses or machine learning applications. The following points summarize the key data quality concerns we encountered:

Issue	Description
Missing Data	Most, if not all, data is nullable, leading to potential missing data.
Datatype Mismatch	Many columns contained values of different types than the column's intended type.
Spelling Mismatch	Some columns had spelling errors for the same data point values, resulting in a higher number of unique values than expected.
Inaccurate Data	Several features had their numerical equivalents in the data, but inconsistencies existed in these value assignments.
Ambiguous Data	Some inputs were odd and ambiguous, especially with French accented letters or values that did not fit within the column's defined range.

Examples

Missing Data

Customer Information: The civility column had values such as null and ???.
Car Information: The Weight_when_empty and Weight_when_full_loaded columns had a lot of missing values.

Datatype Mismatch

Customer Class: The Customer_class column had invalid values like "ES" and "DI".

Spelling Mismatch

Auto Make: The Auto_Make column had about 9 different spellings for the brand "Chevrolet", such as "Chevrelli", "Chevroet", "Chevrolot", and "Chevrilet".

Inaccurate Data

Insured Occupation: The insured job of "Institrice" (meaning teacher) had both codes "14" and "2".

Ambiguous Data

Civility: The feature civility had values like "société" written as "soci?t?".
Year: The feature year had odd years such as "8006", "2105", and "4191".

Insurance KPIs

Insurance Key Performance Indicators (KPIs) are crucial measurements that insurance companies use to evaluate their performance and effectiveness across various domains. These metrics help organizations assess their operational efficiency, customer satisfaction, and financial health, ultimately guiding strategic decision-making. By analyzing KPIs, insurance providers can identify strengths and weaknesses within their processes, ensuring they remain competitive in the market. Some common insurance KPIs include:

Loss Ratio:

This KPI measures the amount of claims paid out by an insurance company in relation to the amount of premiums collected. A high loss ratio can indicate that the insurance company is taking on too much risk or not pricing their policies appropriately.

\[ \text{Loss Ratio} = \frac{\text{Total Claims Paid Out}}{\text{Total Premiums Earned}} \]

Combined Ratio:

This KPI measures the total expense of an insurance company, including the loss ratio, underwriting expenses, and other costs. A combined ratio below 100% indicates that the company is profitable, while a ratio above 100% indicates that the company is paying out more in claims and expenses than it is collecting in premiums.

\[ \text{Combined Ratio} = \text{Loss Ratio} + \text{Expense Ratio} \]

\[ \text{Expense Ratio} = \frac{\text{Operating Expenses}}{\text{Written Premiums}} \]

Customer Retention Rate:

This KPI measures the percentage of customers who renew their policies with the company. A high retention rate can indicate that the company is providing good customer service and value, while a low retention rate may indicate problems with customer satisfaction.

\[ \text{CRR} = \frac{\text{Number of Customers at End of Period} - \text{Number of New Customers}}{\text{Number of Customers at Beginning of Period}} \times 100 \, \% \]

Premium Growth Rate:

This KPI measures the rate at which the company is growing its premium revenue over time. A high growth rate can indicate that the company is attracting new customers and expanding its market share.

\[ \text{PGR} = \frac{\text{Premiums Earned In Current Period} - \text{Premiums Earned In Previous Period}}{\text{Premiums Earned In Previous Period}} \times 100 \, \% \]

By tracking and analyzing these KPIs, the insurance company can gain insights into their performance and make data-driven decisions to improve their operations and profitability.