Applying a Smart Text Classification System for Customer Support Tickets

I want to build a text classification system that can automatically categorize customer support tickets into predefined categories (billing, technical, account, product). The system should:

  1. Accept text input from customer tickets
  2. Process and clean the text data
  3. Classify the ticket into one of the predefined categories
  4. Return confidence scores for the classification

Requirements:

  • Use Python and relevant ML libraries
  • Handle common text preprocessing tasks
  • Support at least 4 categories
  • Include sample training data format
  • Provide evaluation metrics

What would be the best approach to implement this system, and what specific techniques should I use for optimal classification accuracy?

A recommended approach within the Google Cloud ecosystem would be:

  1. Data Preprocessing: Use Vertex AI Workbench with pandas and nltk, or the TextVectorization layer in TensorFlow, for cleanup and tokenization.
  2. Modelling Options: For a no-code/low-code option, use Vertex AI AutoML Text Classification, which trains directly on a labeled CSV in Cloud Storage. Otherwise, build a custom model in Vertex AI from a pretrained model such as BERT (from TensorFlow or Hugging Face), train it on Vertex AI Pipelines or Workbench, and deploy it via the Vertex AI Model Registry.
  3. Serving and Inference: Deploy the model to a Vertex AI Endpoint, then use the REST API to send a ticket string and receive the predicted category with its confidence score.
    Note: With Vertex AI AutoML the prediction score is returned by default; for custom models we may need to apply softmax to the raw model outputs to obtain probabilities.
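
To make the cleanup step and the softmax note concrete, here is a minimal sketch in plain Python (standard library only). It is not Vertex AI code: the function names, the cleaning rules, and the example logits are all assumptions for illustration, and a custom model would supply its own raw outputs. It also includes a small per-class precision/recall helper, since the question asks for evaluation metrics.

```python
import math
import re
from collections import Counter

# The four categories named in the question.
CATEGORIES = ["billing", "technical", "account", "product"]


def clean_ticket(text):
    """Illustrative cleanup: lowercase, drop URLs and punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep only letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()


def softmax(logits):
    """Convert raw model outputs into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels=CATEGORIES):
    """Return the top category and a label -> confidence mapping."""
    scores = dict(zip(labels, softmax(logits)))
    best = max(scores, key=scores.get)
    return best, scores


def per_class_precision_recall(y_true, y_pred, labels=CATEGORIES):
    """Compute precision and recall per label from paired true/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return report


if __name__ == "__main__":
    print(clean_ticket("I was CHARGED twice!! See https://example.com/invoice?id=1"))
    # Hypothetical logits, standing in for a custom model's raw output.
    label, scores = to_prediction([2.0, 0.5, 0.1, -1.0])
    print(label, scores)
```

With an AutoML model this conversion would not be needed, since the confidence score comes back with the prediction; the sketch only applies to the custom-model path.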