更多技術交流、求職機會,歡迎關注字節跳動數據平臺微信公眾號,回復【1】進入官方交流群
Source Connector
本文將主要介紹負責數據讀取的組件SourceReader:
SourceReader
每個SourceReader都在獨立的線程中執行,只要我們保證SourceSplitCoordinator分配給不同SourceReader的切片沒有交集,在SourceReader的執行周期中,我們就可以不考慮任何有關并發的細節。
SourceReader接口
public interface SourceReader<T, SplitT extends SourceSplit> extends Serializable, AutoCloseable {void start();void pollNext(SourcePipeline<T> pipeline) throws Exception;void addSplits(List<SplitT> splits);/*** Check source reader has more elements or not.*/boolean hasMoreElements();/*** There will no more split will send to this source reader.* Source reader could be exited after process all assigned split.*/default void notifyNoMoreSplits() {}/*** Process all events which from {@link SourceSplitCoordinator}.*/default void handleSourceEvent(SourceEvent sourceEvent) {}/*** Store the split to the external system to recover when task failed.*/List<SplitT> snapshotState(long checkpointId);/*** When all tasks finished snapshot, notify checkpoint complete will be invoked.*/default void notifyCheckpointComplete(long checkpointId) throws Exception {}interface Context {TypeInfo<?>[] getTypeInfos();String[] getFieldNames();int getIndexOfSubtask();void sendSplitRequest();}
}
構造方法
這里需要完成和數據源訪問各種配置的提取,比如數據庫庫名表名、消息隊列cluster和topic、身份認證的配置等等。
示例
public RocketMQSourceReader(BitSailConfiguration readerConfiguration,Context context,Boundedness boundedness) {this.readerConfiguration = readerConfiguration;this.boundedness = boundedness;this.context = context;this.assignedRocketMQSplits = Sets.newHashSet();this.finishedRocketMQSplits = Sets.newHashSet();this.deserializationSchema = new RocketMQDeserializationSchema(readerConfiguration,context.getTypeInfos(),context.getFieldNames());this.noMoreSplits = false;cluster = readerConfiguration.get(RocketMQSourceOptions.CLUSTER);topic = readerConfiguration.get(RocketMQSourceOptions.TOPIC);consumerGroup = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_GROUP);consumerTag = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_TAG);pollBatchSize = readerConfiguration.get(RocketMQSourceOptions.POLL_BATCH_SIZE);pollTimeout = readerConfiguration.get(RocketMQSourceOptions.POLL_TIMEOUT);commitInCheckpoint = readerConfiguration.get(RocketMQSourceOptions.COMMIT_IN_CHECKPOINT);accessKey = readerConfiguration.get(RocketMQSourceOptions.ACCESS_KEY);secretKey = readerConfiguration.get(RocketMQSourceOptions.SECRET_KEY);
}
start方法
初始化數據源的訪問對象,例如數據庫的執行對象、消息隊列的consumer對象或者文件系統的連接。
示例
消息隊列
public void start() {try {if (StringUtils.isNotEmpty(accessKey) && StringUtils.isNotEmpty(secretKey)) {AclClientRPCHook aclClientRPCHook = new AclClientRPCHook(new SessionCredentials(accessKey, secretKey));consumer = new DefaultMQPullConsumer(aclClientRPCHook);} else {consumer = new DefaultMQPullConsumer();}consumer.setConsumerGroup(consumerGroup);consumer.setNamesrvAddr(cluster);consumer.setInstanceName(String.format(SOURCE_READER_INSTANCE_NAME_TEMPLATE,cluster, topic, consumerGroup, UUID.randomUUID()));consumer.setConsumerPullTimeoutMillis(pollTimeout);consumer.start();} catch (Exception e) {throw BitSailException.asBitSailException(RocketMQErrorCode.CONSUMER_CREATE_FAILED, e);}
}
數據庫
public void start() {this.connection = connectionHolder.connect();// Construct statement.String baseSql = ClickhouseJdbcUtils.getQuerySql(dbName, tableName, columnInfos);String querySql = ClickhouseJdbcUtils.decorateSql(baseSql, splitField, filterSql, maxFetchCount, true);try {this.statement = connection.prepareStatement(querySql);} catch (SQLException e) {throw new RuntimeException("Failed to prepare statement.", e);}LOG.info("Task {} started.", subTaskId);
}
FTP
public void start() {this.ftpHandler.loginFtpServer();if (this.ftpHandler.getFtpConfig().getSkipFirstLine()) {this.skipFirstLine = true;}
}
addSplits方法
將SourceSplitCoordinator給當前Reader分配的Splits列表添加到自己的處理隊列(Queue)或者集合(Set)中。
示例
public void addSplits(List<RocketMQSplit> splits) {LOG.info("Subtask {} received {}(s) new splits, splits = {}.",context.getIndexOfSubtask(),CollectionUtils.size(splits),splits);assignedRocketMQSplits.addAll(splits);
}
hasMoreElements方法
在無界的流計算場景中,會一直返回true保證Reader線程不被銷毀。
在批式場景中,分配給該Reader的切片處理完之后會返回false,表示該Reader生命周期的結束。
public boolean hasMoreElements() {if (boundedness == Boundedness.UNBOUNDEDNESS) {return true;}if (noMoreSplits) {return CollectionUtils.size(assignedRocketMQSplits) != 0;}return true;
}
pollNext方法
在addSplits方法添加完成切片處理隊列且hasMoreElements返回true時,該方法調用,開發者實現此方法真正和數據交互。
開發者在實現pollNext方法時候需要關注下列問題:
-
切片數據的讀取
-
從構造好的切片中去讀取數據。
-
-
數據類型的轉換
-
將外部數據轉換成BitSail的Row類型
-
示例
以RocketMQSourceReader為例:
從split隊列中選取split進行處理,讀取其信息,之后需要將讀取到的信息轉換成BitSail的Row類型,發送給下游處理。
public void pollNext(SourcePipeline<Row> pipeline) throws Exception {for (RocketMQSplit rocketmqSplit : assignedRocketMQSplits) {MessageQueue messageQueue = rocketmqSplit.getMessageQueue();PullResult pullResult = consumer.pull(rocketmqSplit.getMessageQueue(),consumerTag,rocketmqSplit.getStartOffset(),pollBatchSize,pollTimeout);if (Objects.isNull(pullResult) || CollectionUtils.isEmpty(pullResult.getMsgFoundList())) {continue;}for (MessageExt message : pullResult.getMsgFoundList()) {Row deserialize = deserializationSchema.deserialize(message.getBody());pipeline.output(deserialize);if (rocketmqSplit.getStartOffset() >= rocketmqSplit.getEndOffset()) {LOG.info("Subtask {} rocketmq split {} in end of stream.",context.getIndexOfSubtask(),rocketmqSplit);finishedRocketMQSplits.add(rocketmqSplit);break;}}rocketmqSplit.setStartOffset(pullResult.getNextBeginOffset());if (!commitInCheckpoint) {consumer.updateConsumeOffset(messageQueue, pullResult.getMaxOffset());}}assignedRocketMQSplits.removeAll(finishedRocketMQSplits);
}
轉換為BitSail Row類型的常用方式
自定義RowDeserializer類
對于不同格式的列應用不同converter,設置到相應Row的Field。
public class ClickhouseRowDeserializer {interface FiledConverter {Object apply(ResultSet resultSet) throws SQLException;}private final List<FiledConverter> converters;private final int fieldSize;public ClickhouseRowDeserializer(TypeInfo<?>[] typeInfos) {this.fieldSize = typeInfos.length;this.converters = new ArrayList<>();for (int i = 0; i < fieldSize; ++i) {converters.add(initFieldConverter(i + 1, typeInfos[i]));}}public Row convert(ResultSet resultSet) {Row row = new Row(fieldSize);try {for (int i = 0; i < fieldSize; ++i) {row.setField(i, converters.get(i).apply(resultSet));}} catch (SQLException e) {throw BitSailException.asBitSailException(ClickhouseErrorCode.CONVERT_ERROR, e.getCause());}return row;}private FiledConverter initFieldConverter(int index, TypeInfo<?> typeInfo) {if (!(typeInfo instanceof BasicTypeInfo)) {throw BitSailException.asBitSailException(CommonErrorCode.UNSUPPORTED_COLUMN_TYPE, typeInfo.getTypeClass().getName() + " is not supported yet.");}Class<?> curClass = typeInfo.getTypeClass();if (TypeInfos.BYTE_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getByte(index);}if (TypeInfos.SHORT_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getShort(index);}if (TypeInfos.INT_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getInt(index);}if (TypeInfos.LONG_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getLong(index);}if (TypeInfos.BIG_INTEGER_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> {BigDecimal dec = resultSet.getBigDecimal(index);return dec == null ? null : dec.toBigInteger();};}if (TypeInfos.FLOAT_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getFloat(index);}if (TypeInfos.DOUBLE_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getDouble(index);}if (TypeInfos.BIG_DECIMAL_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getBigDecimal(index);}if (TypeInfos.STRING_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getString(index);}if (TypeInfos.SQL_DATE_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getDate(index);}if (TypeInfos.SQL_TIMESTAMP_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getTimestamp(index);}if (TypeInfos.SQL_TIME_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getTime(index);}if (TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> resultSet.getBoolean(index);}if (TypeInfos.VOID_TYPE_INFO.getTypeClass() == curClass) {return resultSet -> null;}throw new UnsupportedOperationException("Unsupported data type: " + typeInfo);}
}
實現DeserializationSchema接口
相對于實現RowDeserializer,我們更希望大家去實現一個繼承DeserializationSchema接口的實現類,將一定類型格式的數據對數據比如JSON、CSV轉換為BitSail Row類型。?
在具體的應用時,我們可以使用統一的接口創建相應的實現類
public class TextInputFormatDeserializationSchema implements DeserializationSchema<Writable, Row> {private BitSailConfiguration deserializationConfiguration;private TypeInfo<?>[] typeInfos;private String[] fieldNames;private transient DeserializationSchema<byte[], Row> deserializationSchema;public TextInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration,TypeInfo<?>[] typeInfos,String[] fieldNames) {this.deserializationConfiguration = deserializationConfiguration;this.typeInfos = typeInfos;this.fieldNames = fieldNames;ContentType contentType = ContentType.valueOf(deserializationConfiguration.getNecessaryOption(HadoopReaderOptions.CONTENT_TYPE, HadoopErrorCode.REQUIRED_VALUE).toUpperCase());switch (contentType) {case CSV:this.deserializationSchema =new CsvDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames);break;case JSON:this.deserializationSchema =new JsonDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames);break;default:throw BitSailException.asBitSailException(HadoopErrorCode.UNSUPPORTED_ENCODING, "unsupported parser type: " + contentType);}}@Overridepublic Row deserialize(Writable message) {return deserializationSchema.deserialize((message.toString()).getBytes());}@Overridepublic boolean isEndOfStream(Row nextElement) {return false;}
}
也可以自定義當前需要解析類專用的DeserializationSchema:
public class MapredParquetInputFormatDeserializationSchema implements DeserializationSchema<Writable, Row> {private final BitSailConfiguration deserializationConfiguration;private final transient DateTimeFormatter localDateTimeFormatter;private final transient DateTimeFormatter localDateFormatter;private final transient DateTimeFormatter localTimeFormatter;private final int fieldSize;private final TypeInfo<?>[] typeInfos;private final String[] fieldNames;private final List<DeserializationConverter> converters;public MapredParquetInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration,TypeInfo<?>[] typeInfos,String[] fieldNames) {this.deserializationConfiguration = deserializationConfiguration;this.typeInfos = typeInfos;this.fieldNames = fieldNames;this.localDateTimeFormatter = DateTimeFormatter.ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_TIME_PATTERN));this.localDateFormatter = DateTimeFormatter.ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_PATTERN));this.localTimeFormatter = DateTimeFormatter.ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.TIME_PATTERN));this.fieldSize = typeInfos.length;this.converters = Arrays.stream(typeInfos).map(this::createTypeInfoConverter).collect(Collectors.toList());}@Overridepublic Row deserialize(Writable message) {int arity = fieldNames.length;Row row = new Row(arity);Writable[] writables = ((ArrayWritable) message).get();for (int i = 0; i < fieldSize; ++i) {row.setField(i, converters.get(i).convert(writables[i].toString()));}return row;}@Overridepublic boolean isEndOfStream(Row nextElement) {return false;}private interface DeserializationConverter extends Serializable {Object convert(String input);}private DeserializationConverter createTypeInfoConverter(TypeInfo<?> typeInfo) {Class<?> typeClass = typeInfo.getTypeClass();if (typeClass == TypeInfos.VOID_TYPE_INFO.getTypeClass()) {return field -> null;}if (typeClass == TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass()) {return this::convertToBoolean;}if (typeClass == TypeInfos.INT_TYPE_INFO.getTypeClass()) {return this::convertToInt;}throw BitSailException.asBitSailException(CsvFormatErrorCode.CSV_FORMAT_COVERT_FAILED,String.format("Csv format converter not support type info: %s.", typeInfo));}private boolean convertToBoolean(String field) {return Boolean.parseBoolean(field.trim());}private int convertToInt(String field) {return Integer.parseInt(field.trim());}
}
snapshotState方法
生成并保存State的快照信息,用于ckeckpoint。
示例
public List<RocketMQSplit> snapshotState(long checkpointId) {LOG.info("Subtask {} start snapshotting for checkpoint id = {}.", context.getIndexOfSubtask(), checkpointId);if (commitInCheckpoint) {for (RocketMQSplit rocketMQSplit : assignedRocketMQSplits) {try {consumer.updateConsumeOffset(rocketMQSplit.getMessageQueue(), rocketMQSplit.getStartOffset());LOG.debug("Subtask {} committed message queue = {} in checkpoint id = {}.", context.getIndexOfSubtask(),rocketMQSplit.getMessageQueue(),checkpointId);} catch (MQClientException e) {throw new RuntimeException(e);}}}return Lists.newArrayList(assignedRocketMQSplits);
}
hasMoreElements方法
每次調用pollNext方法之前會做sourceReader.hasMoreElements()的判斷,當且僅當判斷通過,pollNext方法才會被調用。
示例
public boolean hasMoreElements() {if (noMoreSplits) {return CollectionUtils.size(assignedHadoopSplits) != 0;}return true;
}
notifyNoMoreSplits方法
當Reader處理完所有切片之后,會調用此方法。
示例
public void notifyNoMoreSplits() {LOG.info("Subtask {} received no more split signal.", context.getIndexOfSubtask());noMoreSplits = true;
}
【關于BitSail】:
?? Star 不迷路 https://github.com/bytedance/bitsail
提交問題和建議:https://github.com/bytedance/bitsail/issues
貢獻代碼:https://github.com/bytedance/bitsail/pulls
BitSail官網:https://bytedance.github.io/bitsail/zh/
訂閱郵件列表:bitsail+subscribe@googlegroups.com
加入BitSail技術社群