XuetangX Big Data Analysis (taught in English) Final Exam Answers - Beijing Institute of Technology

wangke, XuetangX Answers, 2025-03-17 18:58:01

Big Data Analysis (taught in English) - Beijing Institute of Technology - XuetangX

1. True/False (2 points)

The item-based collaborative filtering algorithm calculates item similarity according to item features. ( )

2. True/False (2 points)

The database provides the physical storage structure. ( )

3. True/False (2 points)

We can find a single kind of tool to deal with all the data management problems of a database. ( )

4. True/False (2 points)

We can find a single kind of tool to deal with all the data management problems of big data. ( )

5. True/False (2 points)

The data in HDFS is immutable. ( )

6. True/False (2 points)

In HDFS, each stored file is first divided into multiple data blocks whose length varies flexibly with the data size. ( )
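As background for this statement, here is a minimal sketch of how a file maps onto fixed-size blocks. It assumes a 128 MB block size (the default in Hadoop 2.x and later, configurable per cluster); the function name `split_into_blocks` is hypothetical and this is plain arithmetic, not an HDFS API call.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    Every block has the same fixed size except possibly the last one,
    which only holds the remaining bytes.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
print([b // MB for b in blocks])  # [128, 128, 44]
```

The block length is set by configuration, not by the size of the individual file; only the final block is shorter.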

7. True/False (2 points)

Hadoop is the only big data architecture. ( )

8. True/False (2 points)

HDFS supports batch reading, writing, and updating operations. ( )

9. True/False (2 points)

A distributed file system (DFS) provides the logical storage structure of the data. ( )

10. True/False (2 points)

In user-based (customer) collaborative filtering, similar users are defined by the common items they purchased. ( )
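To make "similar users" concrete, the sketch below scores user similarity by the overlap of their purchased items. It assumes Jaccard similarity as the measure (cosine or Pearson similarity are equally common choices); the purchase data and the names `jaccard` and `most_similar` are made up for illustration.

```python
def jaccard(items_a, items_b):
    """Share of items two users have in common, relative to all items either bought."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(user, purchases):
    """Return the other user whose purchase set overlaps most with `user`'s."""
    scored = [
        (jaccard(purchases[user], items), name)
        for name, items in purchases.items()
        if name != user
    ]
    return max(scored)[1]


# Hypothetical purchase histories
purchases = {
    "alice": {"book", "pen", "lamp"},
    "bob": {"book", "pen"},
    "carol": {"sofa"},
}
print(most_similar("alice", purchases))  # bob
```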

11. Open-ended question (10 points)

Recommendation System – matrix decomposition Task:

You are given a dataset collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. It includes 100,000 ratings (1-5) from 943 users on 1682 movies.

You need to use the matrix decomposition method to predict the missing values of the rating matrix to complete a recommendation system so that you can recommend movies to a user based on the predicted ratings.

Submission Requirement:

Submit 3 screenshots of the 3-fold cross-validation results of rating prediction in a single PDF file named ID_NAME.PDF.

matrix decomposition DATA.rar
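A minimal sketch of the approach, not the course's reference solution: it factorizes the rating matrix with stochastic gradient descent and reports RMSE over 3 folds. It runs on a small synthetic set of (user, item, rating) triples because the contents of the attached archive are not shown here; loading the real ratings into the same triple format is left out. The function names and all hyperparameters (`k`, `lr`, `reg`, `epochs`) are assumptions chosen for illustration.

```python
import math
import random


def train_mf(train, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=30, seed=0):
    """Learn user factors P and item factors Q so that rating ~ mu + P[u] . Q[i]."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    mu = sum(r for _, _, r in train) / len(train)  # global mean as baseline
    samples = list(train)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            err = r - (mu + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q


def predict(model, u, i):
    """Predicted rating, clipped to the 1-5 scale."""
    mu, P, Q = model
    raw = mu + sum(p * q for p, q in zip(P[u], Q[i]))
    return min(5.0, max(1.0, raw))


def rmse(model, test):
    return math.sqrt(sum((predict(model, u, i) - r) ** 2 for u, i, r in test) / len(test))


def cross_validate(ratings, n_users, n_items, folds=3, seed=0):
    """Shuffle once, then hold out every `folds`-th rating in turn."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    scores = []
    for f in range(folds):
        test = shuffled[f::folds]
        train = [x for j, x in enumerate(shuffled) if j % folds != f]
        scores.append(rmse(train_mf(train, n_users, n_items), test))
    return scores


# Synthetic stand-in data (random ratings, so the RMSE itself is not meaningful)
gen = random.Random(42)
n_users, n_items = 30, 20
ratings = [
    (u, i, gen.randint(1, 5))
    for u in range(n_users)
    for i in range(n_items)
    if gen.random() < 0.3
]
scores = cross_validate(ratings, n_users, n_items)
print([round(s, 3) for s in scores])
```

To recommend movies to a user, one would call `predict` for every movie the user has not rated and return the highest-scoring ones.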


12. Single choice (2 points)

The two main components of big data are ( ) and ( ). ( )

A. Distributed Storage, Distributed Processing

B. Distributed Collection, Distributed Processing

C. Distributed Collection, Distributed Storage

D. Distributed Collection, Distributed Application

13. Single choice (2 points)

MPP (Massively Parallel Processing) improves performance through ( ) parallelism. ( ) coordinates work with ( ); ( ) coordinates work with one or more ( ). ( ) process queries in parallel. ( ) have their own CPU, disk, and memory in a shared-nothing architecture. A high-speed interconnect allows continuous pipelining of data processing. ( )

A. segment hosts, Master, segment host, Segment host, segment instances, Segment instances, Segment hosts

B. segment instance, Master, segment host, Segment host, segment instances, Segment instances, Segment hosts

C. segment instance, Master, segment host, Segment host, segment instances, Segment instances, Segment instances

D. segment hosts, Master, segment host, Segment host, segment instances, Segment hosts, Segment hosts

14. Single choice (2 points)

Which of the following descriptions of the deep web's search interface is NOT correct? ( )

A. Has complex interfaces

B. Supports queries on several attributes

C. Extracts contents from databases

D. Easy to find

15. Single choice (2 points)

The historical progression of harnessing data is: ( )

1) ( ): reporting and human analysis can be performed on historical data

2) ( ): the current data can be analyzed to improve business transactions

3) ( ): Real-Time Analytics Processing is used to make real-time decisions and improve real-time business response

A. OLAP: Online Analytical Processing; OLTP: Online Transaction Processing; RTAP: Real-Time Analytics Processing

B. OLTP: Online Transaction Processing; OLAP: Online Analytical Processing; RTAP: Real-Time Analytics Processing

C. OLAP: Online Analytical Processing; RTAP: Real-Time Analytics Processing; OLTP: Online Transaction Processing

D. OLTP: Online Transaction Processing; RTAP: Real-Time Analytics Processing; OLAP: Online Analytical Processing

16. Single choice (2 points)

Based on the requirements, we can build the business model, which includes ( ) and ( ). ( )

A. Conceptual model, Logical model

B. Logical model, Physical model

C. Process model, Data model

D. Process model, Logical model

17. Single choice (2 points)

Which of the following big data characteristics best describes "Data in Many Forms"? ( )

A. Volume

B. Variety

C. Veracity

D. Velocity

18. Single choice (2 points)

The most often used internal data acquisition tool is ( )

A. Data warehouse

B. ETL (Extract, Transform, Load)

C. Data trigger

D. Incremental data extraction

19. Single choice (2 points)

Deep web content includes ( )

1) Pages that search engines do not reference because no direct links point to them

2) Non-web files accessible on the web, such as picture files, PDF and Word documents, etc.

3) Dynamic pages obtained by filling in a form that queries a back-end online database

4) Content that requires registration or is otherwise restricted

A. 1234

B. 124

C. 123

D. 234

20. Single choice (2 points)

TensorFlow allows developers to create ( ), structures that describe how data moves through a ( ), or a series of processing nodes. Each node in the graph represents a ( ), and each connection or edge between nodes is a ( ). ( )

A. Dataflow graphs, Graph (DAG), multidimensional data array or tensor, mathematical operation

B. Graph (DAG), Dataflow graphs, mathematical operation, multidimensional data array or tensor

C. Dataflow graphs, Graph (DAG), mathematical operation, multidimensional data array or tensor

D. Graph (DAG), Dataflow graphs, mathematical operation, multidimensional data array or tensor

21. Single choice (2 points)

Database connection programming interfaces such as ( ) let applications access the database through SQL, but they cannot provide complex functions such as transaction management, concurrent scheduling, buffer management, heterogeneous database conversion and inheritance in a distributed computing environment. This introduces the ( ), a layer of software that provides data exchange functions on top of the database. When the system is extended and needs to access cross-platform heterogeneous databases, the OS could be UNIX, Linux, or Windows; the data could take the form of mail, XML documents, EJB components, Web services, images, audio/video files, or other unstructured data; and the technologies of the big data application layer are also diverse and follow various standards. The design of the ( ) needs to be compatible with various standard technologies and products, which introduces the ( ). ( )

A. ODBC and JDBC; DAL (data access layer); Unified data access interface; Unified data access interface

B. ODBC and JDBC; DAL (data access layer); DAL (data access layer); Unified data access interface

C. DAL (data access layer); ODBC and JDBC; DAL (data access layer); Unified data access interface

D. ODBC and JDBC; DAL (data access layer); Unified data access interface; DAL (data access layer)

22. Single choice (2 points)

Which of the following is NOT a dimensionality reduction method? ( )

A. Wavelet transformation

B. Attribute subset selection

C. Principal component analysis

D. Data cube aggregation

23. Single choice (2 points)

The ( ) annotation transparently translates your Python programs into TensorFlow graphs. ( )

A. tf.keras

B. tf.function

C. Premade Estimators

D. tf.data

24. Single choice (2 points)

The web crawler's crawling process is (B)

a) Start from a list of uniform resource addresses called seed URLs and use it as the entry point for crawling. When the crawler visits these seed URLs, it identifies all the needed links on the page and adds them to the queue to be crawled.

b) Put the already downloaded URLs into the crawled URL list.

c) Extract the new URLs and put them into the to-be-crawled URL queue according to the crawling strategy.

d) Take webpage links out of the to-be-crawled queue, read each URL, perform DNS resolution, and download the web pages into the downloaded-page library.

e) The process continues until the queue to be crawled is empty.

A. abcde

B. adbce

C. acbde

D. adcbe
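The loop described in steps a) through e) can be sketched as follows. This is an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`FAKE_WEB`) instead of real HTTP and DNS calls, and the crawling strategy is assumed to be breadth-first.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page
FAKE_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}


def crawl(seed_urls, fetch_links):
    to_crawl = deque(seed_urls)   # a) seed URLs are the entry points
    seen = set(seed_urls)
    crawled = []
    while to_crawl:               # e) stop when the queue is empty
        url = to_crawl.popleft()  # d) take a URL from the queue and "download" it
        links = fetch_links(url)
        crawled.append(url)       # b) record it in the crawled list
        for link in links:        # c) enqueue newly discovered URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled


order = crawl(["seed"], lambda url: FAKE_WEB.get(url, []))
print(order)  # ['seed', 'a', 'b', 'c']
```

Swapping `popleft()` for `pop()` would turn the same loop into a depth-first crawl.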

25. Single choice (2 points)

Which of the following statements about data reduction is NOT correct? ( )

A. Data reduction technology is used to obtain a condensed data set from the original huge data set while maintaining the integrity of the original data set.

B. Data analysis on the condensed data set is clearly more efficient, and the results of the analysis are basically the same as those obtained from the original data set.

C. The time spent on data reduction could exceed or "offset" the time saved by analysis on the reduced data.

D. The data obtained by the reduction is much smaller than the original data, but can produce the same or almost the same analysis results.

26. Single choice (2 points)

The execution model is based on the BSP (Bulk Synchronous Processing) model. In this model, multiple processing units proceed in parallel through a sequence of "supersteps". Within each superstep, the processing sequence is ( )

a) Each processing unit first receives all messages delivered to it from the preceding superstep,

b) When all the processing units finish the message delivery (hence the synchronization point),

c) may queue up the messages that it intends to send to other processing units.

d) The queued-up messages will be delivered to the destination processing units but won't be seen until the next superstep.

e) manipulates its local data,

f) the next superstep can be started,

g) the cycle repeats until the termination condition has been reached.

A. aedcbfg

B. aecdbfg

C. acedbfg

D. adecbfg
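A toy, single-process sketch of the BSP cycle may help visualize the superstep mechanics. It assumes a classic example task (every vertex learns the maximum value in the graph); the function name `bsp_max` and the sample graph are hypothetical, and a real system would run the vertices on separate workers.

```python
def bsp_max(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> int."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    superstep = 0
    # Terminate once a superstep produces no messages at all
    while superstep == 0 or any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v in graph:
            received = inbox[v]                  # messages sent in the previous superstep
            best = max([values[v]] + received)   # work on local data
            if superstep == 0 or best != values[v]:
                values[v] = best
                for neighbour in graph[v]:
                    outbox[neighbour].append(best)  # queued, not yet visible
        # Synchronization point: queued messages become visible only now
        inbox = outbox
        superstep += 1
    return values, superstep


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result, steps = bsp_max(graph, {"a": 3, "b": 6, "c": 2})
print(result, steps)  # {'a': 6, 'b': 6, 'c': 6} 3
```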

27. Single choice (2 points)

( ) extracts the data that is new or modified in the database since the last extraction, and normally does not have a big impact on the running business system. ( )

A. Incremental data extraction

B. Full extraction

C. Timestamp extraction

D. Trigger
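One common way to pick up only new or modified rows is to compare a modification timestamp against the time of the previous run. The sketch below assumes each row carries an `updated_at` field; that field name, the sample rows, and the function `extract_changed_since` are hypothetical, and other mechanisms (triggers, log comparison) exist as well.

```python
from datetime import datetime


def extract_changed_since(rows, last_extracted_at):
    """Return only the rows created or modified after the previous extraction."""
    return [row for row in rows if row["updated_at"] > last_extracted_at]


# Hypothetical source table
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]
last_run = datetime(2024, 1, 1, 12, 0)
changed = extract_changed_since(rows, last_run)
print([row["id"] for row in changed])  # [2, 3]
```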

28. Single choice (2 points)

HANA improves data analysis performance in the data warehouse, but NOT because ( )

A. It eliminates unnecessary complexity and latency.

B. It accelerates through simplification.

C. Leveraging the power of in-memory computing allows HANA to bring OLTP (transaction processing) and OLAP (data analytics) back together in one database.

D. Specialized data warehouses for reporting and analytics require moving, transforming, and pre-processing transactional data, which introduces huge complexity: sometimes an enterprise may hold three different copies of the same data.

29. Single choice (2 points)

The process begins with the ( ) issuing a query that is then passed to the ( ). The ( ) contains information, such as the data dictionary and session information, which it uses to generate an ( ) designed to retrieve the needed information from each underlying node. Parallel execution represents the implementation of the ( ) through the parallel computing of Node 1 to Node n. The query results are then returned to the master node. ( )

A. Client, Master Node, Master Node, execution plan, execution plan

B. Master Node, Client, Master Node, execution plan, storing plan

C. Client, Master Node, Client, execution plan, execution plan

D. Master Node, Client, Master Node, execution plan, storing plan

30. Single choice (2 points)

Among the many components of Spark, which one is designed for machine learning? ( )

A. Spark SQL

B. Spark Streaming

C. MLlib

D. GraphX

31. Single choice (2 points)

Which description of Jim Gray is NOT true? ( )

A. Relational database founder

B. Nautical sport enthusiast

C. Divided scientific research into four types of paradigms

D. Big data scientist

32. Single choice (2 points)

What is the right order of the steps for reading data in HDFS? ( )

a) DistributedFileSystem makes an RPC call to the namenode to determine the locations of the datanodes where the file is stored in the form of blocks. For each block, the namenode returns the addresses of the datanodes (metadata of blocks and datanodes) that have a copy of the block. Datanodes are sorted according to proximity (depending on network topology information).

b) The client opens the file by calling the open() method on DistributedFileSystem.

c) The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.

d) DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

e) Data is streamed from the datanode back to the client (in the form of packets), and the client repeatedly calls read() on the stream.

f) When the client has finished reading, it calls close() on the FSDataInputStream.

g) When the end of the block is reached, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block.

[Image: 4-2-2.jpg]

A. ABDCEGF

B. BADCEGF

C. BADCEFG

D. BACDEGF

33. Single choice (2 points)

Which of the following stages is the main cause of big data? ( )

A. Operation and business system

B. User-generated content

C. Perception stage

D. Social media

34. Single choice (2 points)

Regarding the descriptions of data modeling design levels, which one is the correct matching? ( C )

1) Based on the user's data and functional requirements, functions and association relationships are obtained, along with entity classes corresponding to the business elements and functions.

2) More details of the data entities, including primary keys, foreign keys, attributes, indexes, relationships, constraints, and even views, described in the form of data tables, data columns, value ranges, object-oriented classes, XML tags, and so on.

3) The storage implementation of the data, including data partitions, data table spaces, and data integration.

A. 1-Conceptual model design, 2-Physical model design, 3-Logical model design

B. 1-Logical model design, 2-Physical model design, 3-Conceptual model design

C. 1-Conceptual model design, 2-Logical model design, 3-Physical model design

D. 1-Physical model design, 2-Conceptual model design, 3-Logical model design

35. Single choice (2 points)

Spark has several components that facilitate different types of computing tasks, such as streaming and graph processing. The components include ( )

1) Spark Core API

2) Resilient Distributed Dataset (RDD)

3) Spark SQL

4) Spark topology

5) Spark Streaming

6) MLlib (Machine Learning Library)

7) GraphX

8) Sklearn

A. 12345

B. 13456

C. 13567

D. 13578

36. Single choice (2 points)

The correct big data lifecycle is ( )

A. Data governance, data collecting, data storing, and data analyzing

B. Data collecting, data governance, data storing, and data analyzing

C. Data collecting, data storing, data governance, and data analyzing

D. Data collecting, data storing, data analyzing, and data governance

37. Single choice (2 points)

Data cleaning technology does not include ( )

A. Data transformation

B. Cleaning of missing data

C. Deduplication of data

D. Anomaly detection on the data set

38. Single choice (2 points)

Which of the following is NOT a numerosity reduction method? ( )

A. Principal component analysis

B. Data cube aggregation

C. Clustering

D. Sampling

39. Single choice (2 points)

In the execution of graph parallel computing, which of the following describe the roles of the master? ( )

1) Coordinates the execution of supersteps in sequence

2) Signals the beginning of a new superstep to all workers after learning that all of them have completed the previous one

3) Pings each worker to learn its processing status

4) Periodically issues a "checkpoint" command to all workers, which then save their partitions to a persistent graph store

A. 123

B. 134

C. 124

D. 1234

40. Single choice (2 points)

Which of the following is about ideas, learning, notions, and concepts that are synthesized, compared, thought out, and discussed? ( )

A. Data

B. Information

C. Wisdom

D. Knowledge

41. Single choice (2 points)

Which of the following is a shared-nothing architecture? ( )

A. SMP

B. NUMA

C. MPP

D. None of them

42. Single choice (2 points)

There are only two kinds of operations on an RDD (Resilient Distributed Dataset): ( ). In ( ), data can be filtered, joined, mapped, or reduced, but no calculation is executed; only in ( ) is the calculation performed and a result value generated. ( )

A. map and reduce, map, reduce

B. transformations and actions, action, transformation

C. transformations and actions, transformation, action

D. map and reduce, reduce, map
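The idea of recording operations first and computing only on demand can be illustrated without a Spark cluster. The class below is a hypothetical pure-Python stand-in (`LazyDataset` is not part of Spark); its method names merely mirror the RDD methods `map`, `filter`, `collect`, and `reduce`.

```python
from functools import reduce as fold


class LazyDataset:
    """Records map/filter steps and runs them only when a result is requested."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Lazy operations: only extend the recorded plan
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Eager operations: execute the plan and return a value
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []


def square(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * x


ds = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before = len(calls)                 # 0: nothing has been computed yet
total = ds.reduce(lambda a, b: a + b)     # 1 + 9 + 25
print(calls_before, total, len(calls))    # 0 35 5
```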

43. Single choice (2 points)

Which attribute subset selection method is shown in the diagram? ( )

A. Forward stepwise attribute subset selection

B. Backward stepwise attribute subset selection

C. Combined forward selection and backward deletion

D. Decision tree induction

44. Single choice (2 points)

Relational databases and NoSQL databases each have their own advantages and disadvantages and cannot replace each other. ( ) application scenarios: key business systems in telecommunications, banking, and other fields that need to ensure strong transaction consistency. ( ) application scenarios: non-critical business (such as data analysis) of Internet companies and traditional companies. ( )

A. NoSQL database; Relational database

B. Relational database; NoSQL database

C. NoSQL database; NoSQL database

D. Relational database; Relational database

45. Single choice (2 points)

( ) is a user-friendly API standard for machine learning and will be the central high-level API used to build and train models. ( )

A. SavedModel

B. TensorFlow Hub

C. Premade Estimators

D. tf.keras

46. Single choice (2 points)

Which of the following are attribute subset selection methods? ( C )

1) Forward stepwise attribute subset selection

2) Backward stepwise attribute subset selection

3) Combined forward selection and backward deletion

4) Principal component analysis

5) Reduction based on statistical analysis

6) Decision tree induction

A. 12346

B. 12345

C. 12356

D. 123456

47. Single choice (2 points)

According to Gartner, an estimated 20% of an organization's data is ( ) data, and the remaining majority is ( ) data. ( )

A. structured, unstructured

B. unstructured, structured

C. structured, semi-structured

D. unstructured, semi-structured

48. Single choice (2 points)

In the following picture, what is the right term for each number? ( )

[Image: test 1-6.jpg]

A. Data sources, Data storage, Data collection, Data processing, Data visualization, Report monitoring

B. Data sources, Data collection, Data storage, Data visualization, Data processing, Report monitoring

C. Data sources, Data collection, Data storage, Data processing, Data visualization, Report monitoring

D. Data sources, Data collection, Data storage, Data processing, Report monitoring, Data visualization

49. Single choice (2 points)

Dealing with the fan-out URLs of the seed URLs (that is, the links found on linked pages) involves web crawler crawling strategies. Which one is NOT a commonly used crawling strategy? ( )

A. Depth first

B. Breadth first

C. First in, first out

D. Partial PageRank strategy

50. Single choice (2 points)

( ) is responsible for resource monitoring and job scheduling. ( ) monitors the health status of all ( ) and jobs, and if it finds a failure, it transfers the corresponding tasks to other nodes. ( ) tracks the task execution progress, resource usage, and other information, and informs the ( ); ( ) selects the appropriate task to use these resources when resources become free. ( )

A. JobTracker, JobTracker, TaskTrackers, JobTracker, TaskScheduler, TaskScheduler

B. JobTracker, TaskTrackers, JobTracker, JobTracker, TaskScheduler, TaskScheduler

C. JobTracker, JobTracker, JobTracker, TaskTrackers, TaskScheduler, TaskScheduler

D. JobTracker, JobTracker, TaskTrackers, TaskScheduler, JobTracker, TaskScheduler

51. Single choice (2 points)

Which of the following is NOT a data transformation component? ( )

A. Field mapping

B. Data calculation

C. Data split

D. Duplicate elimination