How to binary classify mixed structured data?

Hi, I have the following sample data and need to create a neural network to predict if a record will match (MATCH column). I have added the reason for a match in the Description column. I have looked at the structured data and text classification tutorials, but I don’t know how to proceed here. Can you help me? Thanks

| id | date | amount | name | bic | iban | subject | eref | customer_name | customer_meta | customer_verf | customer_amount | customer_date | MATCH | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8594 | 05.09.2022 | 12500 | Woody Stanley | BYLADEM1001 | DE02120300000000202051 | monthly rate member 123 | | Woody Stanley | member 123 | XO4MFIZ2 | 12500 | 12.08.2022 | 1 | metadata match in subject |
| 8661 | 05.09.2022 | 5000 | Hazel Estrada | INGDDEFF | DE02500105170137075030 | monthly rate member 758 | HIYKRHS8 | William Estrada | member 758 | HIYKRHS8 | 5000 | 13.08.2022 | 1 | customer_verf equal to eref |
| 8657 | 05.09.2022 | 2500 | Bill Glover | BELADEBE | DE02100500000054540402 | monthly rate | | Bill Glover | member 122 | YQ48CXBT | 2500 | 31.08.2022 | 1 | dates, amounts and names suits |
| 8650 | 05.09.2022 | 2500 | Rosalind Reynolds | CMCIDEDD | DE02300209000106531065 | monthly rate member 147 | | Rosalind Reynolds | member 147 | 5AXP4WKC | 2500 | 15.08.2022 | 1 | metadata match in subject |
| 8649 | 05.09.2022 | 60000 | Isabella Wells | HASPDEHH | DE02200505501015871393 | rate 254YQ48CXAB | | Isabella Wells | member 254 | YQ48CXAB | 60000 | 09.08.2022 | 1 | metadata match in subject |
| 8647 | 05.09.2022 | 5000 | Sabrina Woodward | PBNKDEFF | DE02100100100006820101 | monthly rate | AD7OX0OA | Sabrina Woodward | member 756 | AD7OX0OA | 5000 | 21.08.2022 | 1 | customer_verf equal to eref |
| 8645 | 05.09.2022 | 10000 | Lulu Moore | DAAEDEDD | DE02300606010002474689 | monthly rate | YR3L1C93 | Lulu Moore | member 635 | YR3L1C93 | 10000 | 30.08.2022 | 1 | customer_verf equal to eref |
| 8644 | 05.09.2022 | 40000 | Gilbert Hodgson | SOLADEST600 | DE02600501010002034304 | monthly rate H9219EYX | | Gilbert Hodgson | member 654 | H92I9EYX | 40000 | 01.09.2022 | 1 | customer_verf match in subject with typo |
| 8643 | 05.09.2022 | 15000 | Milton Parker | HYVEDEMM | DE02700202700010108669 | monthly rate | SNLF2U1O | Milton Parker | member 962 | SNLF2U1O | 15000 | 20.08.2022 | 1 | customer_verf equal to eref |
| 8641 | 05.09.2022 | 30000 | Warren Webb | PBNKDEFF | DE02700100800030876808 | monthly rate member 356 | | Warren Webb | member 356 | BP0CFA9R | 30000 | 20.08.2022 | 1 | metadata match in subject |
| 8633 | 05.09.2022 | 80000 | Emmett Todd | BEVODEBB | DE88100900001234567892 | monthly rate 426 | RZT9YRV8 | Emmett Todd | member 426 | RZT9YRV8 | 80000 | 01.09.2022 | 1 | customer_verf equal to eref |
| 8622 | 05.09.2022 | 10000 | Noah Wise | SSKMDEMM | DE02701500000000594937 | monthly rate member 444 | | Noah Wise | member 444 | LPE7UL1Y | 10000 | 13.08.2022 | 1 | metadata match in subject |
| 8620 | 05.09.2022 | 2500 | Zoe Malcom | OPSKATWW | AT026000000001349870 | monthly member_number 765 | PDXYSV6F | Zoe Malcom | member 765 | PDXYSV6F | 2500 | 15.08.2022 | 1 | customer_verf equal to eref |
| 8794 | 05.09.2022 | 12500 | Woody Stanley | BYLADEM1001 | DE02120300000000202051 | monthly rate member 123 | | Woody Stanley | member 123 | XO4MFIZ2 | 12500 | 06.09.2022 | 0 | date earlier than customer_date |
| 8761 | 05.09.2022 | 5000 | Hazel Estrada | INGDDEFF | DE02500105170137075030 | monthly rate member 758 | HIYKRHS8 | William Estrada | member 758 | HIYKRHS8 | 10000 | 13.08.2022 | 0 | amount differs |
| 8741 | 05.09.2022 | 30000 | Warren Webb | PBNKDEFF | DE02700100800030876808 | monthly rate member 536 | | Barclay Richardson | member 356 | BP0CFA9R | 30000 | 20.08.2022 | 0 | no match |

Hi @zenon,

I can see a clear pattern in the table: wherever the fields of a record disagree, it is labeled 0. First check whether the problem can be solved with a few if/else statements, because there is a fixed number of columns and MATCH being 0 or 1 follows directly from a mismatch in one column or another. For the date column you already have a predefined condition (a record whose date is earlier than customer_date does not match), so if you can define a similar rule for each column, you can solve the problem with rules alone and reach 100% accuracy on data like this sample. Let me know if this helps; we can discuss any further questions.
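As a rough illustration of the rule-based approach, here is a minimal Python sketch. The function name `is_match` and the exact rules are assumptions inferred from the Description column of the sample data, not a confirmed specification; the field names follow the table header.

```python
from datetime import datetime


def parse_date(text):
    """Parse a date in the DD.MM.YYYY format used in the sample data."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def is_match(rec):
    """Hypothetical rule set guessed from the Description column."""
    # "amount differs" -> no match
    if rec["amount"] != rec["customer_amount"]:
        return False
    # "date earlier than customer_date" -> no match
    if parse_date(rec["date"]) < parse_date(rec["customer_date"]):
        return False
    # "customer_verf equal to eref"
    if rec["eref"] and rec["eref"] == rec["customer_verf"]:
        return True
    # "metadata match in subject" or customer_verf written into the subject
    if rec["customer_meta"] in rec["subject"] or rec["customer_verf"] in rec["subject"]:
        return True
    # "dates, amounts and names suits"
    return rec["name"] == rec["customer_name"]


# Three rows taken from the sample table (ids 8594, 8761, 8741).
rows = [
    {"date": "05.09.2022", "amount": 12500, "name": "Woody Stanley",
     "subject": "monthly rate member 123", "eref": "",
     "customer_name": "Woody Stanley", "customer_meta": "member 123",
     "customer_verf": "XO4MFIZ2", "customer_amount": 12500,
     "customer_date": "12.08.2022"},
    {"date": "05.09.2022", "amount": 5000, "name": "Hazel Estrada",
     "subject": "monthly rate member 758", "eref": "HIYKRHS8",
     "customer_name": "William Estrada", "customer_meta": "member 758",
     "customer_verf": "HIYKRHS8", "customer_amount": 10000,
     "customer_date": "13.08.2022"},
    {"date": "05.09.2022", "amount": 30000, "name": "Warren Webb",
     "subject": "monthly rate member 536", "eref": "",
     "customer_name": "Barclay Richardson", "customer_meta": "member 356",
     "customer_verf": "BP0CFA9R", "customer_amount": 30000,
     "customer_date": "20.08.2022"},
]
results = [is_match(r) for r in rows]
```

Exact string comparisons like these only work while the data is clean; a typo in the subject or a missing eref falls through to the name rule or fails.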

Thanks.

Hi @Siva_Sravana_Kumar_N
thank you for your response. The current implementation uses exactly what you wrote. I check if eref matches customer_verf depending on the amounts and the dates. Unfortunately, some of the data is entered manually and some is converted from handwritten to text. Sometimes eref is missing and users write customer_verf in the subject.
Probably I have not selected the best data as an example. Currently I have a matching rate of about 40%. My goal is to make it a little smarter and thus increase the rate.
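To illustrate the kind of tolerance that is missing from exact comparisons, here is a minimal sketch using `difflib` from the Python standard library. The helper name `best_token_similarity` and the 0.8 threshold are assumptions for illustration only, not part of the current implementation.

```python
from difflib import SequenceMatcher


def best_token_similarity(reference, text):
    """Return the highest similarity (0.0 to 1.0) between `reference`
    and any whitespace-separated token of `text`, ignoring case."""
    tokens = text.upper().split()
    if not reference or not tokens:
        return 0.0
    ref = reference.upper()
    return max(SequenceMatcher(None, ref, tok).ratio() for tok in tokens)


# Row 8644 from the sample: customer_verf appears in the subject with a typo.
typo_score = best_token_similarity("H92I9EYX", "monthly rate H9219EYX")

# Row 8741 from the sample: customer_verf does not appear in the subject.
miss_score = best_token_similarity("BP0CFA9R", "monthly rate member 536")

THRESHOLD = 0.8  # assumed cut-off; would need tuning on real data
typo_is_match = typo_score >= THRESHOLD
miss_is_match = miss_score >= THRESHOLD
```

A score like this could either replace the exact comparison in the rules or be passed to a model as an additional numeric feature.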

My current idea is to turn the strings into tokens and then use them to train an RNN. Do you think this could work?
Thanks!

We noticed that your response to the above query contains an unusual hyperlink. Could you help us understand its purpose? We are here to help you resolve your problem. Thank you.

Hi @zenon,

Even with an RNN, you cannot use text as your only input, because the data suggests that the date, name and amount features also affect the output. So, when building the model, convert the text to TF-IDF features, keep the numerical values as they are, combine all the features, and feed them to a standard ML model such as Logistic Regression or an SVM. If you later want to improve accuracy beyond that, you can use an RNN to encode the text into a fixed-length feature vector, combine it with the numerical features, and then make the classification. I hope this helps you get started.
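A minimal sketch of that baseline with scikit-learn could look like the following. The toy records are adapted from the sample table; the way the text fields are joined, the character n-gram settings and the scaling of the amounts are assumptions made for illustration, and the model is fitted on far too little data to be meaningful.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# (subject, eref, customer_meta, customer_verf, amount, customer_amount, MATCH)
records = [
    ("monthly rate member 123", "", "member 123", "XO4MFIZ2", 12500, 12500, 1),
    ("monthly rate", "AD7OX0OA", "member 756", "AD7OX0OA", 5000, 5000, 1),
    ("monthly rate 426", "RZT9YRV8", "member 426", "RZT9YRV8", 80000, 80000, 1),
    ("monthly rate member 758", "HIYKRHS8", "member 758", "HIYKRHS8", 5000, 10000, 0),
    ("monthly rate member 536", "", "member 356", "BP0CFA9R", 30000, 30000, 0),
]

# Join the text fields of each record into one string (an assumed encoding).
texts = [f"{subj} {eref} | {meta} {verf}" for subj, eref, meta, verf, *_ in records]
numeric = np.array([[r[4], r[5]] for r in records], dtype=float)
labels = np.array([r[6] for r in records])

# Character n-grams cope better with short codes and typos than word tokens.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_text = vectorizer.fit_transform(texts)

# Scale the amounts so they do not dominate the TF-IDF values.
scaler = StandardScaler()
X_num = csr_matrix(scaler.fit_transform(numeric))

# Combine text and numeric features into one sparse matrix.
X = hstack([X_text, X_num]).tocsr()

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
pred = clf.predict(X)
```

On real data the split into training and test sets matters, and explicit comparison features (for example, whether amount equals customer_amount) are likely to help a linear model more than the raw values do.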

Thanks & Regards,
Sravana Neeli.