Do you slice your data to evaluate how well your model generalizes to different subsets?

How deep do you go when evaluating your model's performance? Do you slice your data and evaluate each slice separately? Do you try to measure fairness or accuracy for different subsets of your users or use cases?
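
To make the question concrete, here is a minimal sketch of what per-slice evaluation could look like. Everything in it is an assumption for illustration: the `accuracy_by_slice` helper, the `region` slice key, and the toy records are hypothetical, and accuracy is only one of many metrics you might compare across slices.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Group records by the value of slice_key and return accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        total[slice_value] += 1
        correct[slice_value] += int(record["label"] == record["prediction"])
    return {value: correct[value] / total[value] for value in total}


# Hypothetical toy evaluation set: true labels and model predictions,
# each tagged with the slice ("region") the example belongs to.
records = [
    {"region": "EU", "label": 1, "prediction": 1},
    {"region": "EU", "label": 0, "prediction": 0},
    {"region": "US", "label": 1, "prediction": 0},
    {"region": "US", "label": 0, "prediction": 0},
    {"region": "US", "label": 1, "prediction": 1},
    {"region": "APAC", "label": 1, "prediction": 0},
]

per_slice = accuracy_by_slice(records, "region")
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)

print(f"overall accuracy: {overall:.2f}")
for region, acc in sorted(per_slice.items()):
    print(f"{region}: {acc:.2f}")
```

In this toy data the overall number hides a large spread between slices, which is the kind of gap the question is about. Small slices (like the single APAC example here) give very noisy estimates, so slice sizes are worth reporting alongside the metric.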