Machine Learning Training Statement
Purpose of Machine Learning Training
The purpose of the machine learning training is to study the impact of urban road distribution, population density distribution, and the distribution of different types of POI (Points of Interest) on the distribution of bus networks. By doing so, we aim to develop a model that can predict the potential bus network distribution in a given city.
Available Data
Input Data Sources
The input data consists of one npz file per city, stored as “npzs/{cityname}.npz”, where cityname is the name of the city. Each file contains that city's data, divided into independent variables (x) and a dependent variable (y):
- Dependent Variable: A single-channel numerical matrix indicating whether a pixel is traversed by a “bus” route. A value of 1 indicates the presence of a bus route, while 0 indicates absence. This is referred to as “checkbus” and represents the actual distribution of the bus network.
- Independent Variables: A 137-channel matrix where:
  - The first channel, “checkroad”, records the distribution of roads in the same way as the dependent variable (1 where a road is present, 0 otherwise).
  - The second channel, “popdensity”, records the average population density for each pixel, normalized to the range 0 to 1.
  - The remaining channels represent the distribution of different types of POI. For each POI category, each grid cell records the number of POI points within it, normalized to the range 0 to 1.
The correspondence between the channels in the npz array and their content is detailed in the independent variable channel list. This list includes the channel number in the npz array (channel), a code/name for easy identification of the channel content (e.g., checkroad for road networks, popdensity for population density, and four-digit codes for POI categories), and a Chinese explanation of the content within each channel (content).
For the same city, all channels in the npz array have consistent dimensions. However, for different cities, the dimensions of the npz arrays may vary, typically ranging from approximately 300x300 to 900x900. Each pixel corresponds to a real-world area of 200m x 200m. The number of channels is consistent across all cities (137 channels). If a city lacks a specific type of POI, the corresponding channel in that city’s data will be entirely zero.
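The layout described above can be checked at load time. The following is a minimal sketch, assuming each file stores the independent variables under a key named `x` with channel-first shape (137, H, W) and the dependent variable under a key named `y` with shape (H, W). The key names and the axis order are assumptions and should be adjusted to the actual files.

```python
import numpy as np

N_CHANNELS = 137  # checkroad + popdensity + POI channels


def load_city(path, x_key="x", y_key="y"):
    """Load one city's arrays and check the basic layout.

    x_key, y_key and the channel-first (C, H, W) layout are assumptions.
    """
    with np.load(path) as data:
        x = data[x_key].astype(np.float32)
        y = data[y_key].astype(np.float32)
    if x.ndim != 3 or x.shape[0] != N_CHANNELS:
        raise ValueError(f"expected ({N_CHANNELS}, H, W), got {x.shape}")
    if y.shape != x.shape[1:]:
        raise ValueError(f"x {x.shape} and y {y.shape} disagree on H, W")
    return x, y
```

A call would look like `x, y = load_city(f"npzs/{cityname}.npz")`. Failing early on a mismatched shape avoids silently training on a city whose channels are out of order or missing.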
Model Construction Ideas
Skeleton Recognition
The bus network can be abstractly viewed as a skeleton structure. In this structure, the intersections or corners of the bus network serve as the nodes or joints of the skeleton, and the lines represent the connections between these nodes. The process can begin by identifying these nodes and lines, followed by training. A similar approach can be applied to the checkroad channel in the independent variables.
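One simple way to find candidate nodes, sketched below, is to count the 8-connected neighbours of each line pixel: a pixel with one neighbour is a line end, and a pixel with three or more is a junction candidate. This is only an illustration, not a chosen method. It assumes the mask has already been thinned to one-pixel-wide lines, it does not detect corners, and pixels adjacent to a crossing are flagged as well, so junction candidates appear in small clusters. The function names are hypothetical.

```python
import numpy as np


def neighbour_count(mask):
    """Count the 8-connected neighbours of every pixel in a binary mask."""
    m = (mask > 0).astype(np.int32)
    p = np.pad(m, 1)  # zero border so edge pixels have full neighbourhoods
    h, w = m.shape
    total = np.zeros_like(m)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # skip the pixel itself
            total += p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return total


def candidate_nodes(mask):
    """Return boolean maps of line endpoints and junction candidates."""
    m = mask > 0
    n = neighbour_count(mask)
    endpoints = m & (n == 1)
    junctions = m & (n >= 3)
    return endpoints, junctions
```

The same functions can be applied to both checkbus and the checkroad channel, since both are binary line masks.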
Basic Logic
The output must satisfy a fundamental requirement: bus networks must be located on existing roads. This means that bus routes are necessarily subsets of the road network. In other words, a bus route can only exist where there is a road, but the presence of a road does not guarantee a bus route.
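The subset requirement can be enforced on any model output by multiplying the prediction by the road mask, and the same mask can be used to check the training labels. This is a minimal sketch, assuming `pred`, `checkbus`, and `checkroad` are arrays of the same (H, W) shape; the function names are hypothetical.

```python
import numpy as np


def mask_to_roads(pred, checkroad):
    """Zero out predicted bus pixels that do not lie on a road.

    pred:      (H, W) predicted bus probabilities or 0/1 labels
    checkroad: (H, W) road channel, 1 where a road exists, else 0
    """
    return pred * (checkroad > 0)


def off_road_bus_pixels(checkbus, checkroad):
    """Count label pixels marked as bus route where no road is recorded."""
    return int(np.sum((checkbus > 0) & (checkroad <= 0)))
```

A non-zero result from `off_road_bus_pixels` would indicate that the two layers are misaligned for that city and that the data should be inspected before training.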
Multi-Channel Classification of Independent Variables for Model Training
Given that each channel of the independent variables may have distinct distribution characteristics and may not represent the same type of geographical element, the following approach is proposed:
- For each channel or group of similar channels (e.g., POI types with the same first two digits), design a corresponding model architecture to enable targeted training. For example:
  - For the checkroad channel (road distribution), the bus network must be a subset of the road network: a bus route can only exist where there is a road, but a road does not necessarily carry a bus route. This rule should be incorporated into the model design for the checkroad channel.
  - For traffic node-related POI (e.g., airports, train stations), the total number is small and the distribution is concentrated, yet their impact on bus network distribution is significant. For instance, a city may have only one airport and a few train stations, which clearly require public transportation services. In the grid, however, these POI channels are zero in almost all cells, necessitating a targeted model design.
  - For residential areas, the distribution is relatively uniform and dispersed. This may require a different approach.
The above examples are illustrative and not exhaustive. All 137 input channels (checkroad, popdensity, and the POI distributions) need this kind of analysis, and the network design may differ for each channel or channel group, for example in convolutional kernel size or attention allocation. For cities missing certain POI types, the corresponding channels are filled with zeros to maintain uniform input data dimensions.
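As a first step in that per-channel analysis, the fraction of non-zero cells in each channel gives a rough measure of how sparse or dispersed it is. The sketch below assumes `x` is a channel-first array of shape (C, H, W); the function name is hypothetical.

```python
import numpy as np


def channel_sparsity(x):
    """Return the fraction of non-zero cells for each channel of x (C, H, W)."""
    cells = x.shape[1] * x.shape[2]
    return np.count_nonzero(x, axis=(1, 2)) / cells
```

Channels with a fraction near zero (such as airports) and channels with a much higher fraction (such as residential areas) could then be routed to different model branches. A fraction of exactly zero identifies a POI type that is absent from the city.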
Channels can be designed individually or grouped into categories for model design. The existing classification is indicated in the “code/name” column, where entries labeled “POI” are followed by a four-digit code: the first two digits give the major category and the last two digits give the subcategory. A code ending in ‘00’, such as 0500, denotes the major category itself (e.g., dining), while 05XX denotes a specific subcategory within that major category. You may also consider further classification and personalized design.
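Grouping by major category can be derived directly from the code/name column. The following is a minimal sketch, assuming the column has been read into a Python list ordered by channel and that each POI entry ends with its four-digit code. The exact string format of the column and the use of 0-based list positions as channel indices are assumptions.

```python
import re
from collections import defaultdict


def group_channels(channel_codes):
    """Group channel indices by POI major category (first two digits).

    channel_codes: list of code/name strings ordered by channel.
    Entries without a trailing four-digit code (e.g. checkroad,
    popdensity) each form their own group under their full name.
    """
    groups = defaultdict(list)
    for idx, code in enumerate(channel_codes):
        name = code.strip()
        match = re.search(r"(\d{4})$", name)
        key = match.group(1)[:2] if match else name
        groups[key].append(idx)
    return dict(groups)
```

For a list beginning `["checkroad", "popdensity", "0500", ...]`, the first two channels get their own groups and every 05XX channel is collected under the key `"05"`. Each group's index list can then be used to slice the 137-channel array for a category-specific branch.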
Input Data Processing
The current approach crops a 256x256 region centered on the npz array and uses it as training data. In the future, a 256x256 sliding window can be considered so that all regions of the city are included in the training process.
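Both cropping strategies can be sketched as follows, assuming arrays whose last two axes are height and width. The stride of 128 and the handling of edge tiles are assumptions, since the sliding window scheme has not been specified yet.

```python
import numpy as np


def center_crop(arr, size=256):
    """Crop a size x size window around the centre of the last two axes."""
    h, w = arr.shape[-2:]
    if h < size or w < size:
        raise ValueError(f"array {h}x{w} is smaller than crop {size}")
    top = (h - size) // 2
    left = (w - size) // 2
    return arr[..., top:top + size, left:left + size]


def sliding_windows(arr, size=256, stride=128):
    """Yield (top, left, window) tiles covering the last two axes.

    A final row/column of tiles is placed flush with the array edge,
    so edge tiles may overlap their neighbours more than `stride` implies.
    """
    h, w = arr.shape[-2:]
    if h < size or w < size:
        raise ValueError(f"array {h}x{w} is smaller than window {size}")
    tops = sorted(set(range(0, h - size + 1, stride)) | {h - size})
    lefts = sorted(set(range(0, w - size + 1, stride)) | {w - size})
    for top in tops:
        for left in lefts:
            yield top, left, arr[..., top:top + size, left:left + size]
```

Applying the same function to x and y with identical arguments keeps inputs and labels aligned. With city grids of roughly 300x300 to 900x900, the sliding window yields several tiles per city instead of one.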