Qihoo 360 team and Artificial Intelligence Research Institute jointly develop deep learning scheduling platform XLearning

In the past two years, artificial intelligence technology has developed rapidly, and deep learning frameworks, represented by Google's open source TensorFlow, have emerged one after another. To make these frameworks easier for algorithm engineers to use, to reduce tedious work such as deploying and maintaining runtime environments, to improve the utilization of hardware resources such as GPUs, and to save hardware costs, the big data team of Qihoo 360's System Department and the company's Artificial Intelligence Research Institute jointly developed a deep learning scheduling platform: XLearning.

The XLearning platform integrates big data with deep learning. Built on Hadoop Yarn, it integrates common deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost, and is a typical implementation of “AI on Hadoop”. XLearning officially went into production in April 2017 and has since gone through multiple iterations. It provides a unified and stable scheduling platform for users of each framework, enables resource sharing, greatly improves resource utilization, and offers good scalability and compatibility. It is widely used in business departments of the company such as search, the Artificial Intelligence Research Institute, commercialization, and the data center.

XLearning architecture


Client: XLearning client, responsible for starting the job and obtaining the job execution status;

ApplicationMaster (AM): responsible for splitting the input data, starting and managing Containers, storing execution logs, etc.;

Container: The actual executor of the job, responsible for starting the Worker or PS (Parameter Server) process, monitoring and reporting the status of the process to the AM, uploading the output of the job, and so on. For TensorFlow type jobs, it is also responsible for starting the TensorBoard service.
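
The article does not say how the AM splits the input data. As a minimal sketch, assuming a simple round-robin split of input files across Workers, the idea might look like the following. The function name `assign_input_splits` and the file names are hypothetical and are not part of XLearning.

```python
from typing import Dict, List


def assign_input_splits(files: List[str], num_workers: int) -> Dict[int, List[str]]:
    """Distribute input files across workers in round-robin order.

    Hypothetical illustration only: XLearning's real splitting logic
    is not described in this article.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    # Start every worker with an empty list so idle workers are visible.
    assignment: Dict[int, List[str]] = {i: [] for i in range(num_workers)}
    # Sort first so the assignment is deterministic across runs.
    for idx, path in enumerate(sorted(files)):
        assignment[idx % num_workers].append(path)
    return assignment


if __name__ == "__main__":
    # Made-up file names standing in for HDFS input files.
    print(assign_input_splits(["part-0", "part-1", "part-2"], 2))
```

Sorting the file list before assignment keeps the split stable if a job is restarted with the same inputs.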

XLearning features

Although its architecture is simple, XLearning offers a rich set of features for model training and relies on Yarn for unified management of job resources.

Support multiple deep learning frameworks

XLearning supports both distributed and stand-alone modes for TensorFlow and MXNet, and supports stand-alone deep learning frameworks such as Caffe, Theano, and PyTorch. For a given framework, it supports multiple versions as well as custom builds, so users are not limited to the framework versions installed on the cluster machines.

Unified data management based on HDFS

XLearning provides several modes for data input and output, including streaming reads and writes and direct HDFS reads and writes. Users can choose the read/write mode that suits the amount of data the job processes and the disk capacity of the cluster machines.

Visual interface

To make it easier for users to view job information, XLearning provides a visual interface that displays job execution progress and output logs. After a job completes, the logs remain available, which helps when analyzing how training progressed. For TensorFlow jobs, the TensorBoard service is also supported. The job page is roughly divided into three parts (as shown below):

All Containers: Displays the list of Containers in the current job and the information for each one, such as Container ID, Container Host, Container Role, Current Status, Start Time, Finish Time, and Reporter Progress (execution progress);

View TensorBoard: When the job type is TensorFlow, you can click on the link to jump directly to the TensorBoard page;

Save Model: During the execution of the job, the user can upload the current output of the model being trained to HDFS and view the list of models uploaded so far.


Native code compatibility

XLearning supports automatic allocation and construction of the ClusterSpec for TensorFlow distributed mode. Code written for stand-alone mode, and code for other deep learning frameworks, can be migrated to XLearning without any modification, so users can get started quickly.
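
To illustrate what automatic ClusterSpec allocation can mean for user code, here is a minimal sketch that reads a cluster layout handed to the process by the platform. It assumes the platform exports the layout as JSON in an environment variable; the variable names `TF_CLUSTER_DEF`, `TF_ROLE`, and `TF_INDEX`, the host names, and the helper `read_cluster_from_env` are assumptions for illustration and are not confirmed by this article.

```python
import json
import os
from typing import Dict, List, Mapping, Optional, Tuple


def read_cluster_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Dict[str, List[str]], str, int]:
    """Return (cluster layout, this task's role, this task's index).

    The environment variable names are assumptions for illustration.
    """
    env = os.environ if env is None else env
    # Example layout: {"ps": ["host1:2222"], "worker": ["host2:2222"]}
    cluster = json.loads(env["TF_CLUSTER_DEF"])
    role = env["TF_ROLE"]
    index = int(env["TF_INDEX"])
    # Fail early if the task identity does not match the layout.
    if role not in cluster or not 0 <= index < len(cluster[role]):
        raise ValueError(f"task {role}:{index} is not part of the cluster layout")
    return cluster, role, index


if __name__ == "__main__":
    # Made-up values standing in for what the platform would provide.
    demo_env = {
        "TF_CLUSTER_DEF": '{"ps": ["host1:2222"], "worker": ["host2:2222", "host3:2222"]}',
        "TF_ROLE": "worker",
        "TF_INDEX": "0",
    }
    print(read_cluster_from_env(demo_env))
```

In TensorFlow 1.x, a dictionary of this shape could then be passed to `tf.train.ClusterSpec`, with the role and index given to `tf.train.Server` as `job_name` and `task_index`. That step is left out here so the sketch runs without TensorFlow installed.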

Checkpoint function

By combining the deep learning framework's checkpoint mechanism with direct reads and writes of HDFS data, XLearning makes it easy for users to resume training and continue execution.
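
The resume logic behind such a checkpoint feature can be sketched roughly as follows. This is not XLearning's implementation: it uses a local directory as a stand-in for an HDFS path, and the file naming scheme `model-<step>.ckpt` and both function names are hypothetical.

```python
import os
import re
import tempfile
from typing import Optional, Tuple

# Hypothetical naming scheme: model-<step>.ckpt
_CKPT_PATTERN = re.compile(r"^model-(\d+)\.ckpt$")


def latest_checkpoint(ckpt_dir: str) -> Optional[Tuple[int, str]]:
    """Return (step, path) of the newest checkpoint, or None if there is none."""
    if not os.path.isdir(ckpt_dir):
        return None
    best: Optional[Tuple[int, str]] = None
    for name in os.listdir(ckpt_dir):
        match = _CKPT_PATTERN.match(name)
        if match is None:
            continue  # ignore files that are not checkpoints
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, os.path.join(ckpt_dir, name))
    return best


def resume_step(ckpt_dir: str) -> int:
    """Step to start training from: 0 for a fresh run, else last saved step + 1."""
    found = latest_checkpoint(ckpt_dir)
    return 0 if found is None else found[0] + 1


if __name__ == "__main__":
    # A temporary local directory stands in for the HDFS checkpoint path.
    with tempfile.TemporaryDirectory() as tmp:
        print(resume_step(tmp))  # fresh run, no checkpoints yet
        open(os.path.join(tmp, "model-100.ckpt"), "w").close()
        print(resume_step(tmp))  # continues after step 100
```

Comparing step numbers as integers rather than as strings avoids picking `model-9.ckpt` over `model-10.ckpt`.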

The open source version of XLearning runs directly on the community version of Hadoop; it is easy to use and has a low learning cost. The company's internal Yarn version adds a number of enhancements to the community version, such as GPU resource scheduling, GPU communication affinity, and Docker Container support. Building on these enhancements, the company's version of XLearning additionally offers GPU resource scheduling, Dockerized jobs, temporary GPU virtual machines, visual charts of Container Metrics, and other functions. We plan to share these features by providing Yarn patches or by making the Yarn version available. You are welcome to get in touch with us at any time.
