How to binary classify mixed structured data?

zenon · September 26, 2022, 10:30am

Hi, I have the following sample data and need to create a neural network to predict if a record will match (MATCH column). I have added the reason for a match in the Description column. I have looked at the structured data and text classification tutorials, but I don’t know how to proceed here. Can you help me? Thanks

| id | date | amount | name | bic | iban | subject | eref | customer_name | customer_meta | customer_verf | customer_amount | customer_date | MATCH | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8594 | 05.09.2022 | 12500 | Woody Stanley | BYLADEM1001 | DE02120300000000202051 | monthly rate member 123 |  | Woody Stanley | member 123 | XO4MFIZ2 | 12500 | 12.08.2022 | 1 | metadata match in subject |
| 8661 | 05.09.2022 | 5000 | Hazel Estrada | INGDDEFF | DE02500105170137075030 | monthly rate member 758 | HIYKRHS8 | William Estrada | member 758 | HIYKRHS8 | 5000 | 13.08.2022 | 1 | customer_verf equal to eref |
| 8657 | 05.09.2022 | 2500 | Bill Glover | BELADEBE | DE02100500000054540402 | monthly rate |  | Bill Glover | member 122 | YQ48CXBT | 2500 | 31.08.2022 | 1 | dates, amounts and names suits |
| 8650 | 05.09.2022 | 2500 | Rosalind Reynolds | CMCIDEDD | DE02300209000106531065 | monthly rate member 147 |  | Rosalind Reynolds | member 147 | 5AXP4WKC | 2500 | 15.08.2022 | 1 | metadata match in subject |
| 8649 | 05.09.2022 | 60000 | Isabella Wells | HASPDEHH | DE02200505501015871393 | rate 254 | YQ48CXAB | Isabella Wells | member 254 | YQ48CXAB | 60000 | 09.08.2022 | 1 | metadata match in subject |
| 8647 | 05.09.2022 | 5000 | Sabrina Woodward | PBNKDEFF | DE02100100100006820101 | monthly rate | AD7OX0OA | Sabrina Woodward | member 756 | AD7OX0OA | 5000 | 21.08.2022 | 1 | customer_verf equal to eref |
| 8645 | 05.09.2022 | 10000 | Lulu Moore | DAAEDEDD | DE02300606010002474689 | monthly rate | YR3L1C93 | Lulu Moore | member 635 | YR3L1C93 | 10000 | 30.08.2022 | 1 | customer_verf equal to eref |
| 8644 | 05.09.2022 | 40000 | Gilbert Hodgson | SOLADEST600 | DE02600501010002034304 | monthly rate H9219EYX |  | Gilbert Hodgson | member 654 | H92I9EYX | 40000 | 01.09.2022 | 1 | customer_verf match in subject with typo |
| 8643 | 05.09.2022 | 15000 | Milton Parker | HYVEDEMM | DE02700202700010108669 | monthly rate | SNLF2U1O | Milton Parker | member 962 | SNLF2U1O | 15000 | 20.08.2022 | 1 | customer_verf equal to eref |
| 8641 | 05.09.2022 | 30000 | Warren Webb | PBNKDEFF | DE02700100800030876808 | monthly rate member 356 |  | Warren Webb | member 356 | BP0CFA9R | 30000 | 20.08.2022 | 1 | metadata match in subject |
| 8633 | 05.09.2022 | 80000 | Emmett Todd | BEVODEBB | DE88100900001234567892 | monthly rate 426 | RZT9YRV8 | Emmett Todd | member 426 | RZT9YRV8 | 80000 | 01.09.2022 | 1 | customer_verf equal to eref |
| 8622 | 05.09.2022 | 10000 | Noah Wise | SSKMDEMM | DE02701500000000594937 | monthly rate member 444 |  | Noah Wise | member 444 | LPE7UL1Y | 10000 | 13.08.2022 | 1 | metadata match in subject |
| 8620 | 05.09.2022 | 2500 | Zoe Malcom | OPSKATWW | AT026000000001349870 | monthly member_number 765 | PDXYSV6F | Zoe Malcom | member 765 | PDXYSV6F | 2500 | 15.08.2022 | 1 | customer_verf equal to eref |
| 8794 | 05.09.2022 | 12500 | Woody Stanley | BYLADEM1001 | DE02120300000000202051 | monthly rate member 123 |  | Woody Stanley | member 123 | XO4MFIZ2 | 12500 | 06.09.2022 | 0 | date earlier than customer_date |
| 8761 | 05.09.2022 | 5000 | Hazel Estrada | INGDDEFF | DE02500105170137075030 | monthly rate member 758 | HIYKRHS8 | William Estrada | member 758 | HIYKRHS8 | 10000 | 13.08.2022 | 0 | amount differs |
| 8741 | 05.09.2022 | 30000 | Warren Webb | PBNKDEFF | DE02700100800030876808 | monthly rate member 536 |  | Barclay Richardson | member 356 | BP0CFA9R | 30000 | 20.08.2022 | 0 | no match |

Siva_Sravana_Kumar_N · September 26, 2022, 9:12pm

Hi @zenon,

I can see a clear pattern in the table: wherever the data mismatches, the record is classified as 0. First check whether the problem can be solved with a few if/else statements, because there is a fixed number of columns and MATCH being 0 or 1 depends directly on a mismatch in one column or another under some condition. I can see that you already have a predefined condition for the date column, so if you have a similar set of rules for each column, you can solve the problem easily and achieve 100% accuracy. Let me know if this helps; we can discuss any further questions.
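To make the if/else idea concrete, here is a minimal sketch. The rules and their order are only read off the description column of the sample table, so they are assumptions rather than a complete rule set, and `rule_match` and `parse_date` are hypothetical helper names.

```python
from datetime import datetime


def parse_date(text):
    """Parse a date written as DD.MM.YYYY, the format used in the sample table."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def rule_match(row):
    """Return 1 if the bank record matches the customer record, else 0.

    The rule order is an assumption derived from the 'description' column.
    """
    # Hard rejections: amounts differ, or date is earlier than customer_date.
    if row["amount"] != row["customer_amount"]:
        return 0
    if parse_date(row["date"]) < parse_date(row["customer_date"]):
        return 0
    # customer_verf equal to eref (only when eref is present).
    if row["eref"] and row["eref"] == row["customer_verf"]:
        return 1
    # Metadata match in subject.
    if row["customer_meta"] in row["subject"]:
        return 1
    # Dates and amounts already fit, so fall back to comparing names.
    if row["name"] == row["customer_name"]:
        return 1
    return 0


# Row 8661 from the sample table.
row_8661 = {
    "date": "05.09.2022", "amount": 5000, "name": "Hazel Estrada",
    "subject": "monthly rate member 758", "eref": "HIYKRHS8",
    "customer_name": "William Estrada", "customer_meta": "member 758",
    "customer_verf": "HIYKRHS8", "customer_amount": 5000,
    "customer_date": "13.08.2022",
}
result_8661 = rule_match(row_8661)
```

Exact comparisons like these break as soon as a reference contains a typo or lands in a different column, which is the limitation that matters for real data.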

Thanks.

zenon · September 27, 2022, 7:47am

Hi @Siva_Sravana_Kumar_N
thank you for your response. The current implementation uses exactly what you wrote. I check if eref matches customer_verf depending on the amounts and the dates. Unfortunately, some of the data is entered manually and some is converted from handwritten to text. Sometimes eref is missing and users write customer_verf in the subject.
Probably I have not selected the best data as an example. Currently I have a matching rate of about 40%. My goal is to make it a little smarter and thus increase the rate.
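One way to make the reference check tolerant of typos and of references written into the subject is a similarity score instead of strict equality. The sketch below uses `difflib.SequenceMatcher` from the Python standard library, which is my choice for illustration and not something named in this thread; `best_token_similarity` is a hypothetical helper, and any threshold applied to its score would need tuning on real data.

```python
from difflib import SequenceMatcher


def best_token_similarity(reference, subject):
    """Highest similarity (0.0 to 1.0) between `reference` and any
    whitespace-separated token of `subject`, ignoring case."""
    ref = reference.upper()
    tokens = subject.upper().split()
    return max(
        (SequenceMatcher(None, ref, token).ratio() for token in tokens),
        default=0.0,  # empty subject: nothing to compare against
    )


# Row 8644: customer_verf "H92I9EYX" appears in the subject as "H9219EYX".
typo_score = best_token_similarity("H92I9EYX", "monthly rate H9219EYX")

# Row 8741: customer_verf does not appear in the subject at all.
unrelated_score = best_token_similarity("BP0CFA9R", "monthly rate member 536")
```

Such a score can replace the exact comparison in a rule, or be passed to a model as one more numeric feature.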

My current idea is to turn the strings into tokens and then use them to train an RNN. Do you think this could work?
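As a rough sketch of that idea, the snippet below turns strings into character tokens and trains a small GRU on them with Keras. Character-level tokens, the `MAX_LEN` value, the layer sizes, and joining subject, eref and customer_verf into one string are all assumptions made for illustration; four rows from the sample table only show the wiring and are far too few to learn anything.

```python
import numpy as np
import tensorflow as tf

MAX_LEN = 48  # assumed maximum string length; longer strings are truncated


def build_vocab(texts):
    """Map every character seen in `texts` to an id (0 = padding, 1 = unknown)."""
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len=MAX_LEN):
    """Turn a string into a fixed-length list of character ids, right-padded with 0."""
    ids = [vocab.get(ch, 1) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


# "subject | eref | customer_verf" for rows 8661, 8647, 8644 and 8741.
texts = [
    "monthly rate member 758 | HIYKRHS8 | HIYKRHS8",
    "monthly rate | AD7OX0OA | AD7OX0OA",
    "monthly rate H9219EYX |  | H92I9EYX",
    "monthly rate member 536 |  | BP0CFA9R",
]
labels = np.array([1, 1, 1, 0], dtype="float32")  # MATCH column

vocab = build_vocab(texts)
x = np.array([encode(t, vocab) for t in texts], dtype="int32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),
    # +2 reserves ids 0 (padding) and 1 (unknown character).
    tf.keras.layers.Embedding(len(vocab) + 2, 16, mask_zero=True),
    tf.keras.layers.GRU(32),  # summarizes the token sequence as one vector
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of MATCH = 1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=2, verbose=0)
probs = model.predict(x, verbose=0)
```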
Thanks!

chunduriv · September 27, 2022, 11:38am

We noticed that your response to the above query contained an unusual hyperlink. Could you help us understand its purpose? We are here to help you resolve your problem. Thank you.

Siva_Sravana_Kumar_N · September 27, 2022, 2:56pm

Hi @zenon,

Even with an RNN, you cannot use text as your only input, because from the data I would guess that the date, name and amount features also affect the output. So, when building the model, use TF-IDF to turn the text into token features, keep the numerical values as they are, combine all the features, and feed them to a standard ML model such as Logistic Regression or an SVM. If you later want to improve accuracy beyond that, you can use an RNN to convert the text into a fixed-length feature vector, combine it with the numerical features, and then do the classification. I hope this helps you get started.
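A minimal sketch of the first step with scikit-learn could look like this. Six rows from the sample table stand in for real training data; the two derived numeric features (whether the amounts differ, and the number of days between the two dates) and the choice to join subject, eref and customer_verf into one text field are assumptions for illustration.

```python
from datetime import datetime

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def days_between(later, earlier):
    """Days from `earlier` to `later`, both written as DD.MM.YYYY."""
    fmt = "%d.%m.%Y"
    return (datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)).days


# (subject, eref, customer_verf, amount, customer_amount, date, customer_date, MATCH)
# Rows 8661, 8647, 8644, 8794, 8761 and 8741 of the sample table.
rows = [
    ("monthly rate member 758", "HIYKRHS8", "HIYKRHS8", 5000, 5000, "05.09.2022", "13.08.2022", 1),
    ("monthly rate", "AD7OX0OA", "AD7OX0OA", 5000, 5000, "05.09.2022", "21.08.2022", 1),
    ("monthly rate H9219EYX", "", "H92I9EYX", 40000, 40000, "05.09.2022", "01.09.2022", 1),
    ("monthly rate member 123", "", "XO4MFIZ2", 12500, 12500, "05.09.2022", "06.09.2022", 0),
    ("monthly rate member 758", "HIYKRHS8", "HIYKRHS8", 5000, 10000, "05.09.2022", "13.08.2022", 0),
    ("monthly rate member 536", "", "BP0CFA9R", 30000, 30000, "05.09.2022", "20.08.2022", 0),
]

texts = [" ".join(r[:3]) for r in rows]
numeric = np.array(
    [[float(r[3] != r[4]), days_between(r[5], r[6])] for r in rows], dtype=float
)
y = np.array([r[7] for r in rows])

# TF-IDF features for the text, stacked column-wise with the numeric features.
tfidf = TfidfVectorizer()
X = hstack([tfidf.fit_transform(texts), csr_matrix(numeric)]).tocsr()

clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = clf.predict(X)  # predictions on the training rows, wiring check only
```

On real data the rows would be split into training and test sets, and the numeric columns would usually be scaled before being combined with the TF-IDF features.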

Thanks & Regards,
Sravana Neeli.
