This isn’t meant an an exhaustive list but just the most common errors I’ve seen over the years.
- Taking the average of values that are already an average
- Trying to join database tables on fields with different data types
- Processing data you will later discard, do filtering early on
- Using averages when mean or mode is better (the average person has 1,9999 legs)
- Forgetting seasonality in data
- Accidentally filtering out data by incorrect combining of AND/OR statements
- Errors caused by fixed length string in databases (often space padded like ‘DataPoint ‘)
- Accidentally missing data by being provided data from a source that already has hidden filters applied
- Losing data when joining tables in SQL due to missing data in one table (use outer joins)
- Poor data quality (countries stored as UK, GB, England etc. in the same field, phone number with 00 and + prefixes)
- Null data breaking report filters and SQL where statements
- Misusing null in Boolean fields (1 is true, 0 is false and null is unknown)
- Storing data as generic data types (dates as text, numbers as text, Boolean as integers and so on)
- Storing the same data point in different tables with different data types
- Being inconsistent in naming conventions (my_table, MyTable), pick one and stick to it
- Using cryptic names for fields instead of names that make sense (max_temp is OK m_tmp is ambiguous)