Data: lists of ordered sequences, some of which contain nested sequences (as in the example below).
My input sequences contain not just single elements but also lists.
E.g.: i1 = [2, 5, 10, [1, 7, 9, 20], 11, 32].
If it were just a flat sequence like [2, 4, 6, 7] with no nesting, I would pass it directly to the embedding layer. But in my case, that's not possible.
The elements in my sequences are ordered by their date/time of occurrence.
So, for each ID, I have a sequence of ordered events.
Sometimes, multiple events occur on the same day, which leads to nested lists.
For example, consider the sequence [A, B, [D, C, I, K], M].
This means event A occurred on day 1, event B on day 2, events [D, C, I, K] on day 3, and so on.
So, given a sequence of events for each unique ID, my goal is to predict the next event (or set of events) with an LSTM model.
I have converted these text events into integer tokens, and then obtained their count-vector/one-hot representations.
But I'm having trouble getting embeddings from such an input representation.
Embedding layers in TF/Keras accept only integer tokens as input, not one-hot vectors.
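(For context on why integer tokens suffice: an embedding lookup is mathematically the same as multiplying a one-hot vector by the embedding weight matrix, just computed by row indexing instead. A minimal numpy sketch, with illustrative sizes:)

```python
import numpy as np

# Illustrative vocabulary size and embedding dimension (assumed values).
vocab_size, embed_dim = 13, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))  # embedding weight matrix

tokens = np.array([1, 2, 6, 0])               # integer tokens, e.g. [A, B, L, pad]
one_hot = np.eye(vocab_size)[tokens]          # (4, 13) one-hot matrix

# One-hot rows times W select exactly the same rows as direct indexing,
# which is why Embedding layers take integer indices: the lookup is cheaper.
assert np.allclose(one_hot @ W, W[tokens])
```

So the one-hot vectors carry no extra information over the integer tokens themselves.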
So could someone please tell me how to get embeddings for such an input representation?
Could someone provide a simple working example for sample sequences like these?
- [A, B, [D, C, E], L]
- [[S, T, B], M]
- [M, N, [L, U]]
- [A, B, L]
Here A, B, C, … are events represented by text, and let's say I want to represent each event by an embedding vector of size 50 or 100, with padding length = 4. In that case, my input dim should be (None, 4) and the output of the embedding layer should be (None, 4, 50) or (None, 4, 100), depending on the vector size (None = batch size).
With integer tokens:
A - 1, B - 2, C - 3, D - 4, E - 5, L - 6, M - 7, N - 8, P - 9, S - 10, T - 11, U - 12
The padded sequences would look like this:
1. [1, 2, [4, 3, 5], 6]
2. [[10, 11, 2], 7, 0, 0]
3. [7, 8, [6, 12], 0]
4. [1, 2, 6, 0]
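(One way I could imagine handling this, sketched in numpy with assumed sizes: pad each day's event group to a fixed inner length as well, embed every token, then mean-pool over the inner axis so each day collapses to a single vector. In Keras this would correspond to an Embedding on input of shape (batch, seq_len, inner_len) followed by a masked average over the inner axis.)

```python
import numpy as np

# Assumed sizes for illustration: vocab of 13 tokens (0 = padding),
# 4 days per sequence, at most 3 events per day, embeddings of size 5.
vocab_size, embed_dim = 13, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))
W[0] = 0.0  # keep the padding row at zero

# Sequence [A, B, [D, C, E], L] -> tokens [1, 2, [4, 3, 5], 6],
# padded on the inner axis to shape (seq_len, inner_len) = (4, 3):
seq = np.array([
    [1, 0, 0],
    [2, 0, 0],
    [4, 3, 5],
    [6, 0, 0],
])

emb = W[seq]                                # (4, 3, embed_dim) embedding lookup
mask = (seq != 0)[..., None]                # ignore padding positions
counts = np.maximum(mask.sum(axis=1), 1)    # avoid division by zero on empty days
pooled = (emb * mask).sum(axis=1) / counts  # (4, embed_dim): one vector per day

print(pooled.shape)  # (4, 5)
```

The pooled tensor has the desired (seq_len, dim) shape per example, so a batch of such sequences gives (batch_size, seq_len, dim) for the LSTM. Sum- or max-pooling over the inner axis would work the same way.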
Now, could someone please help me get outputs from the embedding layer with shape (batch_size, seq_len, dim)?
Or are there better ways to represent LSTM input that contains nested sequences?
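(An alternative representation I've seen suggested, again as a hedged numpy sketch with assumed sizes: encode each day as a single multi-hot vector over the vocabulary. A plain matrix multiply, which is a bias-free Dense layer in Keras, then yields the sum of the embeddings of all events on that day, so nested lists need no inner padding at all.)

```python
import numpy as np

# Assumed sizes for illustration.
vocab_size, embed_dim = 13, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))

def multi_hot(day_tokens):
    # One vector per day: 1 at every token that occurred, 0 elsewhere.
    v = np.zeros(vocab_size)
    v[day_tokens] = 1.0
    return v

# [A, B, [D, C, E], L] -> one multi-hot row per day, shape (4, vocab_size)
days = [[1], [2], [4, 3, 5], [6]]
X = np.stack([multi_hot(d) for d in days])

out = X @ W  # (4, embed_dim): summed embeddings per day, LSTM-ready
assert np.allclose(out[2], W[4] + W[3] + W[5])
```

The trade-off is that summing loses the within-day order of events, which is arguably fine here since same-day events share a timestamp anyway.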