---
Title: Alibaba Cloud DataWorks
URL Source: https://www.company-skill.com/p/alibabacloud
Language: en
Last-Modified: 2026-06-08T13:46:33.669960+00:00
Description: Alibaba Cloud DataWorks is a comprehensive big data development and governance platform.
---

# Alibaba Cloud DataWorks

> Alibaba Cloud DataWorks is a comprehensive big data development and governance platform.

## Featured GEO article

Alibaba Cloud DataWorks is a unified big data development and governance platform that orchestrates data synchronization, processing, and scheduling across Alibaba Cloud compute engines. It enables teams to visually design workflows, automate task execution, and enforce data quality and security controls through an integrated console.

## Key facts
- Batch synchronization tasks cost approximately 0.001 CNY per execution, while real-time sync tasks cost roughly 0.002 CNY per minute.
- The `StartWorkflowInstances` API supports a maximum of 1000 instances per batch execution.
- SQL processing nodes are restricted to a maximum code size of 128 KB and 200 statements per node.
- Serverless Resource Groups are billed at 0.01 CNY per hour based on compute unit consumption.
- `Baseline Priority` configuration accepts only the valid values 1, 3, 5, 7, and 8.
- Published scheduling configurations experience a 10-minute activation delay when using the Generate Immediately After Publishing option.
- DataStudio retains execution logs and task instances for exactly 3 days.
- The command-line integration path requires Java Runtime Environment 1.8 or later.
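
As a rough illustration of how the per-execution and per-minute sync rates above combine, the sketch below estimates a monthly sync cost. The rates are the approximate figures listed above; the function name and the workload numbers are hypothetical, and a real bill may include other items such as resource group charges.

```python
# Approximate rates quoted in the key facts above (CNY); real bills may differ.
BATCH_SYNC_CNY_PER_EXECUTION = 0.001
REALTIME_SYNC_CNY_PER_MINUTE = 0.002


def estimate_sync_cost(batch_executions: int, realtime_minutes: int) -> float:
    """Return an estimated sync cost in CNY for the given workload."""
    if batch_executions < 0 or realtime_minutes < 0:
        raise ValueError("workload figures must be non-negative")
    batch_cost = batch_executions * BATCH_SYNC_CNY_PER_EXECUTION
    realtime_cost = realtime_minutes * REALTIME_SYNC_CNY_PER_MINUTE
    return round(batch_cost + realtime_cost, 4)


if __name__ == "__main__":
    # Hypothetical workload: 24 batch runs per day for 30 days,
    # plus one real-time task running continuously for 30 days.
    print(estimate_sync_cost(24 * 30, 30 * 24 * 60))
```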

## How to build data sync pipelines
You can build data synchronization pipelines by registering data sources, establishing network connectivity, and creating batch or real-time sync tasks through the DataWorks console or APIs.
1. Select your integration approach: use the visual Data Import Wizard for codeless setup, the OpenAPI for automated bulk onboarding, or the `odpscmd` terminal for MaxCompute project-level configuration.
2. Register your target databases or storage systems, such as MaxCompute, Hologres, OSS, or relational databases, and verify network connectivity.
3. Map source and destination schemas using the drag-and-drop interface or configure DataX JSON scripts for programmatic deployments.
4. Deploy the synchronization node and monitor execution status through the Real-time Node O&M dashboard.

## How to configure task scheduling
Task scheduling and workflow dependencies are configured by designing directed acyclic graphs, setting recurrence intervals, and defining cross-node triggers within the scheduling console.
1. Open the workflow orchestration interface and drag logical control nodes like `Do-While Node` or `Branch Node` onto the canvas.
2. Define daily or hourly recurrence intervals and link upstream and downstream nodes to establish dependency chains.
3. Adjust time parameters and set `Baseline Priority` to 1, 3, 5, 7, or 8 to manage execution order for auto-triggered tasks.
4. Commit the workflow to production. If you select the Generate Immediately After Publishing option, expect a 10-minute delay before the scheduling configuration takes effect.
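
The dependency and priority rules in these steps can be checked locally before committing a workflow. The following is a minimal sketch using only the Python standard library; the function names and the node names in the usage note are hypothetical, and it does not call any DataWorks API.

```python
from graphlib import CycleError, TopologicalSorter

# Values this page lists as valid for Baseline Priority.
VALID_BASELINE_PRIORITIES = {1, 3, 5, 7, 8}


def plan_execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order nodes so every upstream node precedes its downstream nodes.

    `dependencies` maps each node to the set of upstream nodes it waits for.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"workflow is not acyclic: {exc.args[1]}") from exc


def check_baseline_priority(priority: int) -> int:
    """Reject values outside the documented set."""
    if priority not in VALID_BASELINE_PRIORITIES:
        allowed = sorted(VALID_BASELINE_PRIORITIES)
        raise ValueError(f"Baseline Priority must be one of {allowed}")
    return priority
```

For example, `plan_execution_order({"load": {"extract"}, "report": {"load"}, "extract": set()})` places `extract` before `load` and `load` before `report`, while a circular dependency raises `ValueError`.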

## How to develop and debug data nodes
Data processing nodes are developed by writing SQL, Python, or Spark scripts in the integrated development environment, running smoke tests, and deploying validated logic to production.
1. Select your compute engine target, such as MaxCompute, EMR, or Hologres, and create a new processing node in the browser-based IDE.
2. Write your transformation logic, ensuring SQL scripts stay under the 128 KB size limit and 200-statement threshold.
3. Run interactive smoke tests to verify scheduling parameters and syntax before committing the node.
4. Upload local JAR or Python UDF resources from the terminal with `odpscmd`, install PyODPS 3 dependencies with `/home/tops/bin/pip3` on serverless resource groups, then deploy the node to the production scheduler.
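
The size and statement limits in step 2 can be pre-checked before pasting a script into a node. This is a sketch under two assumptions that the source does not confirm: that size is measured in UTF-8 bytes, and that statements can be counted by splitting on semicolons. The function name is hypothetical.

```python
MAX_SQL_BYTES = 128 * 1024  # 128 KB limit quoted above
MAX_SQL_STATEMENTS = 200    # statement limit quoted above


def check_sql_node_limits(sql: str) -> list[str]:
    """Return human-readable limit violations (empty list if none)."""
    problems = []
    size = len(sql.encode("utf-8"))  # assumption: size is counted in UTF-8 bytes
    if size > MAX_SQL_BYTES:
        problems.append(f"code is {size} bytes; limit is {MAX_SQL_BYTES}")
    # Naive count: split on ';' and ignore empty fragments. Semicolons inside
    # string literals or comments would be miscounted.
    statements = [part for part in sql.split(";") if part.strip()]
    if len(statements) > MAX_SQL_STATEMENTS:
        problems.append(f"{len(statements)} statements; limit is {MAX_SQL_STATEMENTS}")
    return problems
```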

## Frequently Asked Questions

**Q: How do I build a data sync pipeline?**
A: Register your source and target systems, establish network connectivity, and use the Data Import Wizard or OpenAPI to create and deploy batch or real-time synchronization tasks.

**Q: What's the best way to synchronize data?**
A: The visual console is the recommended default for most users, offering drag-and-drop field mapping, form-based source configuration, and integrated O&M monitoring without requiring custom code.

**Q: How do I configure task scheduling?**
A: Design workflows in the orchestration canvas, set daily or hourly recurrence intervals, link upstream and downstream nodes, and publish the configuration to the production scheduler.

**Q: What's the best way to set up workflow dependencies?**
A: Use the visual DAG canvas to connect logical control nodes and define cross-node relationships, or leverage the `StartWorkflowInstances` API to batch-manage up to 1000 instances programmatically.

**Q: How do I develop a data node?**
A: Create a node in the web-based IDE targeting your preferred compute engine, write SQL or PyODPS transformation logic, and validate it using built-in smoke testing before deployment.

**Q: What's the best way to debug a data script?**
A: Run interactive smoke tests in the browser IDE to verify scheduling parameters and syntax, then review execution logs which are automatically retained for 3 days.

**Q: How do I manage workspace access?**
A: Administer workspaces by assigning RAM roles, managing member permissions, viewing operation history, and configuring workspace security modes through the platform governance console.

**Q: What's the best way to control data security?**
A: Identify sensitive data, configure masking rules, and monitor data risks using the integrated security guard and leak trace features to enforce compliance across your pipelines.

**Q: How do I monitor a pipeline?**
A: Track task instances, configure O&M alerts, and perform diagnostics through the Operations and Monitoring dashboard to maintain visibility over execution health and resource consumption.

**Q: What's the best way to monitor data quality?**
A: Configure data quality monitoring rules and validate outputs using the governance framework to enforce compliance, track data lineage, and trigger automated quality checks.

## Key terms
- **Do-While Node**: a logical control node used in visual workflow orchestration to run tasks in a loop governed by a condition.
- **Baseline Priority**: a scheduling parameter that determines execution order for auto-triggered tasks; it accepts only the values 1, 3, 5, 7, and 8.
- **PyODPS 3**: the Python 3 node type in DataWorks, built on PyODPS (the MaxCompute SDK for Python); on serverless resource groups its dependencies are installed with `/home/tops/bin/pip3`.
- **Serverless Resource Group**: a compute allocation model billed by consumed compute units (CU), enabling on-demand execution without manual cluster provisioning.
- **DataStudio**: the web-based integrated development environment for writing, debugging, and scheduling data tasks across multiple compute engines.

## Sources
The authoritative source for this guide is the official Alibaba Cloud DataWorks documentation.

Alibaba Cloud DataWorks is available as agent-callable skills via DaaS. Route any question to the best skill with `POST https://www.company-skill.com/api/route` `{"query": "...", "product": "alibabacloud"}`.
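
As a hedged illustration of the routing call described above, the sketch below builds the POST request with the Python standard library without sending it. The URL and payload fields come from the text; the `Content-Type` header and the function name are assumptions, and the response format is not documented here.

```python
import json
import urllib.request

ROUTE_URL = "https://www.company-skill.com/api/route"


def build_route_request(query: str, product: str = "alibabacloud") -> urllib.request.Request:
    """Build (but do not send) the POST request described above."""
    body = json.dumps({"query": query, "product": product}).encode("utf-8")
    return urllib.request.Request(
        ROUTE_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


if __name__ == "__main__":
    request = build_route_request("How do I create a real-time sync task?")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=10)
```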

## What you can do

### [Build pipeline](https://www.company-skill.com/p/alibabacloud/alibabacloud-build-pipeline.md)

## What You Want to Do

You need to move data into, out of, or between Alibaba Cloud data systems (like MaxCompute, Hologres, OSS, or relational databases) using DataWorks. This involves registering the underlying data sources, ensuring network connectivity, and building the actual synchronization tasks (batch or real-time).

**Typical User Questions**:
- How to upload data to MaxCompute using DataWorks?
- How to connect MySQL to DataWorks?
- How to create a real-time sync task?

## Decision Tree

Pick the best path for your situation:

- **If** you want to visually map fields, configure Schedule Settings, and build sync tasks via the web console using the Data Import Wizard → Use UI/ (go to *alibabacloud/alibabacloud-integration*)
- **If** you need to automate data source onboarding across multiple environments using DataWorks OpenAPI and DataX JSON scripts for hundreds of sources → Use API (go to *alibabacloud/alibabacloud-integration*)
- **If** you prefer command-line tools like `odpscmd` to configure MaxCompute project-level properties (e.g., enabling ACID semantics or Data Types 2.0) before syncing → Use MaxCompute (go to *alibabacloud/alibabacloud-integration*)
- **Otherwise (default)** → Use UI/. The visual console is the safest and most comprehensive starting point for most users setting up their first data integration pipelines.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| UI/ | Building batch or real-time sync tasks using the codeless drag-and-drop UI and form-based source configuration. | low | No | No | Mixed billing model: ~0.0001 CNY per request for standard nodes | `alibabacloud/guide/alibabacloud-integration` |
| API | Automating the registration and connectivity testing of hundreds of relational, NoSQL, or big data sources via code. | high | Yes | Yes | Serverless Resource Groups billed by compute units (CU) consumed | `alibabacloud/api/alibabacloud-integration` |
| MaxCompute | Terminal-preferring users who need to quickly configure and manage MaxCompute connections without leaving the CLI. | medium | No | Yes | Requires Java Runtime Environment (JRE) 1.8 or later | `alibabacloud/cli/alibabacloud-integration` |

## Path Details

### Path 1: UI/

**Best For**: Building batch or real-time sync tasks using the codeless drag-and-drop UI and form-based source configuration.

**Brief Description**: 
This path utilizes the DataWorks Data Studio console interface for creating and managing batch or real-time data synchronization tasks. It allows you to configure data sources and map schemas via drag-and-drop interfaces and form-based wizards, such as the Data Import Wizard and the Create Synchronization Node tool. It is highly visual and integrates directly with Real-time Node O&M for monitoring.

**Key technical facts**:
- Billing: Mixed billing model: ~0.0001 CNY per request for standard nodes, ~0.001 CNY per execution for batch sync, ~0.002 CNY per minute for real-time sync, plus compute CU/hour for metadata mapping.
- Runtimes: —
- Auth: Console SSO / Workspace Manager or Development role

**Prerequisites**:
- Data sources configured in the workspace
- Serverless resource group purchased and configured
- Network connectivity established between resource groups and data sources
- Workspace Manager or Development role assigned to the user

**When to Use**:
- User prefers visual workflow orchestration and form-based configuration for single-table or full-database batch/real-time synchronization.
- Need to monitor, start, stop, and configure alert rules (Business delay, Failover, Dirty Data) for real-time sync tasks via Real-time Node O&M in the Operation Center.

**When NOT to Use**:
- Need to automate the registration and connectivity testing of hundreds of data sources programmatically.
- Need to configure MaxCompute project-level properties like ACID semantics or Data Types 2.0 via terminal scripts.

**Known Limitations**:
- Schema for an external table mapped to MaxCompute is auto-generated and read-only in the Table structure design tab.
- Only GUC parameters are supported in the Advanced tab for Hologres destination; other SQL statements are not allowed.
- Scheduling dependencies for batch sync nodes require ancestor nodes to be defined in Schedule Settings, whereas real-time sync nodes associate directly with the root node.

### Path 2: API

**Best For**: Automating the registration and connectivity testing of hundreds of relational, NoSQL, or big data sources via code.

**Brief Description**: 
This path leverages the DataWorks OpenAPI and DataX JSON script pattern for programmatically defining reader/writer plugins, configuring data source credentials, and orchestrating synchronization tasks. It is designed for heavy automation and relies on a Serverless or Exclusive Data Integration Resource Group to handle production-grade workloads across diverse systems.

**Key technical facts**:
- Billing: Serverless Resource Groups billed by compute units (CU) consumed; Exclusive Resource Groups billed via subscription or pay-as-you-go. Internet traffic billed separately.
- Runtimes: —
- Auth: Alibaba Cloud AccessKey authentication via ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables
- Regions: cn-hangzhou, cn-shanghai, cn-beijing

**Prerequisites**:
- DataWorks control plane access
- Specific database credentials (JDBC, AccessKey, JSON tokens) configured in Data Source Management
- CIDR block of resource group whitelisted in target database firewalls

**When to Use**:
- Automating the configuration of DataX JSON scripts for complex data extraction with custom SQL (`querySql`), pre/post SQL statements, and sharding (`splitPk`).
- Integrating hundreds of diverse sources (Kafka, Redis, OSS, HDFS, relational DBs) via programmatic SDK calls and CI/CD pipelines.
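
To make the DataX JSON pattern more concrete, here is a minimal sketch that generates a job definition with a `splitPk` shard key. It is modeled on the open-source DataX job layout (`job.content[].reader/writer`); the exact schema that DataWorks script mode expects may differ, and the function name, placeholder credentials, and channel count are all assumptions.

```python
import json


def build_datax_job(jdbc_url: str, table: str, columns: list[str], split_pk: str) -> dict:
    """Build a MySQL-to-MaxCompute job dict in the open-source DataX layout."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            # Placeholders only; supply real credentials from a secret store.
            "username": "<username>",
            "password": "<password>",
            "column": columns,
            "splitPk": split_pk,  # column used to shard the extraction
            "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
        },
    }
    writer = {
        "name": "odpswriter",
        "parameter": {
            "project": "<maxcompute_project>",
            "table": table,
            "column": columns,
            "truncate": True,
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": 2}},  # assumed concurrency
            "content": [{"reader": reader, "writer": writer}],
        }
    }


if __name__ == "__main__":
    job = build_datax_job("jdbc:mysql://host:3306/shop", "orders", ["id", "amount"], "id")
    print(json.dumps(job, indent=2))
```

Generating the JSON from code rather than editing it by hand is what makes the pattern practical for many sources: the same function can be looped over a table inventory in a CI/CD job.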

**When NOT to Use**:
- User prefers codeless drag-and-drop UI and form-based source configuration without writing JSON scripts.
- Need to quickly enable MaxCompute project-level engine features like ACID semantics via terminal commands.

**Known Limitations**:
- REST API data source currently supports flat (single-layer) table schemas; nested JSON must be flattened or parsed downstream.
- Real-time synchronization (CDC/Binlog) is only supported for specific sources like MySQL, Oracle, PolarDB, and Hologres; FTP, OSS, and REST APIs only support offline batch.
- Shared resource groups are deprecated for production sync tasks; must use Serverless or Exclusive Data Integration Resource Group.

### Path 3: MaxCompute

**Best For**: Terminal-preferring users who need to quickly configure and manage MaxCompute connections without leaving the CLI.

**Brief Description**: 
This path uses the MaxCompute CLI (`odpscmd`) interface for configuring project-level properties, enabling ACID semantics, and activating Data Types 2.0 to prepare MaxCompute as a robust data integration destination. It requires local setup of the `odps_config.ini` file to authenticate and execute administrative commands.

**Key technical facts**:
- Billing: —
- Runtimes: Java Runtime Environment (JRE) 1.8 or later
- Auth: AccessKey pair configured in `odps_config.ini` (access_id, access_key)

**Prerequisites**:
- Java Runtime Environment (JRE) 1.8 or later installed locally
- Project Owner or Super_Administrator role in the MaxCompute project
- `odps_config.ini` configured with access_id, access_key, and end_point

**When to Use**:
- Need to enable MaxCompute Data Types 2.0 (`odps.sql.type.system.odps2=true`) for accurate mapping of TINYINT, SMALLINT, INT, FLOAT, VARCHAR, and TIMESTAMP.
- Automating MaxCompute project preparation for transactional consistency (`odps.sql.acid.table.enable=true`) via shell scripts before running Data Integration syncs.
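
The two project-level properties above lend themselves to scripting. The sketch below only renders the `setproject` statements and a shell command line; it does not run anything. The property names come from the text, while the `--config` and `-e` options are assumptions about the `odpscmd` CLI that should be checked against your installed version.

```python
import shlex

# Property names are taken from the text above; "true" enables each feature.
PROJECT_FLAGS = {
    "odps.sql.type.system.odps2": "true",  # Data Types 2.0
    "odps.sql.acid.table.enable": "true",  # ACID semantics
}


def build_setproject_statements(flags: dict[str, str]) -> list[str]:
    """Render one `setproject` statement per project-level property."""
    return [f"setproject {name}={value};" for name, value in flags.items()]


def build_odpscmd_command(statement: str, config_path: str = "odps_config.ini") -> str:
    """Return a shell command line; `--config` and `-e` are assumed odpscmd options."""
    parts = ["odpscmd", f"--config={shlex.quote(config_path)}", "-e", shlex.quote(statement)]
    return " ".join(parts)


if __name__ == "__main__":
    for statement in build_setproject_statements(PROJECT_FLAGS):
        print(build_odpscmd_command(statement))
```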

**When NOT to Use**:
- Need to configure network connectivity, resource groups, or visual workflow orchestration for non-MaxCompute sources like MySQL, Kafka, or Redis.
- User prefers managing data source credentials and sync tasks entirely within the DataWorks web console.

**Known Limitations**:
- Requires Project Owner or Super_Administrator role to execute `setproject` commands; standard users will get Access Denied or NoPermission errors.
- Only configures MaxCompute project-level properties; does not create or manage the actual DataWorks synchronization tasks or non-MaxCompute data sources.

## FAQ

Q: Which path should I start with?
A: Start with the Console UI path (UI/) if you are setting up your first few data sources and sync tasks. It provides visual wizards like the Data Import Wizard and handles Schedule Settings natively without requiring you to write code or manage local CLI configurations.

Q: What if I need to sync hundreds of databases but chose the Console UI path?
A: You'll hit a massive manual bottleneck. The UI is not designed for bulk automation; you will spend hours clicking through forms. You should use the API path to programmatically register sources and configure DataX JSON scripts via CI/CD pipelines to scale effectively.

Q: What if I need to configure MySQL or Kafka sources but chose the CLI (odpscmd) path?
A: You'll be unable to proceed. The `odpscmd` CLI only configures MaxCompute project-level properties (like ACID semantics) and cannot manage non-MaxCompute data sources, register external credentials, or configure network connectivity for systems like MySQL or Kafka.

Q: Can I use Shared Resource Groups for my production API sync tasks?
A: No, shared resource groups are deprecated for production sync tasks due to noisy-neighbor performance issues. You must provision a Serverless or Exclusive Data Integration Resource Group to ensure stability, guaranteed compute units, and proper network isolation.

Q: Why am I getting "Access Denied" when trying to enable Data Types 2.0 via the CLI?
A: Executing `setproject` commands to change critical properties like `odps.sql.type.system.odps2` or `odps.sql.acid.table.enable` strictly requires the Project Owner or Super_Administrator role. Standard development or guest roles will be blocked by the access control system.

Q: What happens if I try to map a nested JSON payload using the API path's REST data source?
A: The sync task will fail or drop data. The REST API data source currently only supports flat (single-layer) table schemas. You must flatten the nested JSON upstream or parse it downstream using MaxCompute SQL or a custom DataX transformer plugin.
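
Flattening upstream, as suggested in the last answer, can be as simple as the following sketch. It is a generic Python helper, not part of DataWorks; the function name and the underscore separator are arbitrary choices, and lists are left untouched.

```python
from typing import Any


def flatten_record(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Collapse nested dicts into a single-level dict with joined key names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name, sep))
        else:
            # Lists and scalars are kept as-is; arrays need their own handling.
            flat[name] = value
    return flat
```

For instance, `{"id": 1, "user": {"name": "a", "geo": {"city": "x"}}}` becomes a single-layer record with the keys `id`, `user_name`, and `user_geo_city`.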

### [Configure scheduling](https://www.company-skill.com/p/alibabacloud/alibabacloud-configure-scheduling.md)

## What You Want to Do

Configure task scheduling, dependencies, and triggers in DataWorks to orchestrate data pipelines, manage cross-node relationships, and automate executions.

**Typical User Questions**:
- How to schedule SQL tasks in DataWorks?
- How to configure scheduling dependencies in DataWorks?
- How do I configure cross-cycle dependencies?
- How to enable periodic scheduling in DataWorks?
- How to trigger DataWorks workflows via API?
- What to do when node commit prompts inconsistent input/output?

## Decision Tree

Pick the best path for your situation:

- **If** you need to visually design DAGs using logical nodes like `Do-While Node` or `Branch Node` and configure daily/hourly recurrences → Use DataStudio (go to [link](#path-1-datastudio))
- **If** you need to trigger workflows via HTTP or batch start up to 1000 instances using the `StartWorkflowInstances` API → Use API (go to [link](#path-2-api))
- **If** you encounter "input and output not consistent with data lineage" commit errors or need to adjust `Baseline Priority` for `auto-triggered tasks` → Use Console / Dashboard (go to [link](#path-3-))
- **Otherwise (default)** → Use DataStudio. The visual UI is the standard and safest starting point for designing, linking, and deploying scheduled workflows from scratch.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| DataStudio | Visual configuration of daily/hourly recurrences, cross-node dependencies, and time parameters via the web console. | Medium | No | No | 10-minute activation delay for Generate Immediately After Publishing | `alibabacloud/guide/alibabacloud-workflow` |
| API | Integrating DataWorks triggers with external CI/CD pipelines or managing scheduling parameters programmatically. | High | Yes | Yes | Max 1000 instances per batch for StartWorkflowInstances | `alibabacloud/api/alibabacloud-workflow` |
| Console / Dashboard | Resolving 'inconsistent input/output' errors, dependency resolution failures, and scheduling behavior FAQs. | Medium | No | No | Baseline Priority valid values are 1, 3, 5, 7, and 8 | `alibabacloud/troubleshooting/alibabacloud-scheduling` |

## Path Details

### Path 1: DataStudio

**Best For**: Visual configuration of daily/hourly recurrences, cross-node dependencies, and time parameters via the web console.

**Brief Description**: The DataStudio interface covers visual workflow orchestration, periodic scheduling attributes, and logical control nodes. It provides a drag-and-drop DAG canvas, and tasks are deployed to production through Scheduling Configuration and Operation Center.

**Key technical facts**:
- Billing: Billed based on compute resource usage (0.01 CNY/hour for Serverless Resource Group) or per request (0.001 CNY/request for Dependency Check Node). Zero-load nodes and basic Scheduling Configuration do not incur direct compute charges.
- Prerequisites: A DataWorks workspace is created, periodic scheduling switch is enabled, and DataWorks Standard Edition or higher is required for Do-While Node.

**When to Use**:
- Visual configuration of daily/hourly recurrences, cross-node dependencies, and time parameters via the web console.
- Need to implement loop logic (Do-While Node / For-Each Node) or conditional branching (Branch Node / Merge Node) using UI canvas.
- Coordinating workflow scheduling with Zero Load Node or waiting for external marker files using Check Node.

**When NOT to Use**:
- Need to integrate DataWorks triggers with external CI/CD pipelines or manage scheduling parameters programmatically.
- Running heavy big data computations directly on DataWorks scheduling resource groups (must route to dedicated compute engines like MaxCompute).

**Known Limitations**:
- Maximum limits apply, such as 1,024 loops for do-while/for-each nodes and a maximum nesting depth of 5 layers for SUB_PROCESS nodes.
- When 'Generate Immediately After Publishing' instance generation mode is selected, there is a 10-minute activation delay after deployment.
- A node is isolated when it has no upstream dependencies and is not connected to the workspace root node; it must be linked manually to a Zero Load Node.

### Path 2: API

**Best For**: Integrating DataWorks triggers with external CI/CD pipelines or managing scheduling parameters programmatically.

**Brief Description**: DataWorks OpenAPI for programmatic management of trigger-based workflows using TriggerSchedulerTaskInstance, configuring parameters, and batch creating/starting instances via HTTP trigger node or SDK.

**Key technical facts**:
- Billing: Per-request billing model. HTTP Trigger API, CreateWorkflowInstances, and StartWorkflowInstances cost 0.001 CNY/request. Failed requests are not charged.
- Auth: Alibaba Cloud AccessKey credentials via Authorization: Bearer header or SDK Config object.
- Max concurrency: Parallelism parameter for CreateWorkflowInstances: 1 for sequential, 2-10 for concurrent.
- Regions: China (Shanghai), International.
- Prerequisites: ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables, Python SDK alibabacloud_dataworks_public20240518==6.2.0 (Requires Python >= 3.7).

**When to Use**:
- Integrating DataWorks triggers with external CI/CD pipelines or event-driven data processing via HTTP trigger node.
- Automating data backfills for multiple dates programmatically using CreateWorkflowInstances API and SupplementData.
- Batch starting up to 1000 workflow instances simultaneously via StartWorkflowInstances API.

**When NOT to Use**:
- Visual configuration of daily/hourly recurrences and DAG design.
- Need to start more than 1000 instances in a single StartWorkflowInstances batch.

**Known Limitations**:
- Rate limits: Trigger Workflow 100 QPS per workspace; Create Workflow Instance 100 QPS per project; Start Workflow Instances 100 QPS per user, max 1000 instances per batch.
- BizDates array in CreateWorkflowInstances allows specifying up to seven data timestamps.
- HTTP Trigger API has no free tier available.

### Path 3: Console / Dashboard
**Best For**: Resolving 'inconsistent input/output' errors, dependency resolution failures, and scheduling behavior FAQs.

**Brief Description**: DataWorks scheduling troubleshooting guide for resolving Operation Center task instance issues, data lineage inconsistencies, and task priority configurations for recurring instances and data backfill instances.

**Key technical facts**:
- Prerequisites: Access to Operation Center and Intelligent Monitoring permissions.

**When to Use**:
- Resolving 'the input and output of the node are not consistent with the data lineage in the code' commit errors.
- Diagnosing why heavy big data computations fail or time out on default scheduling resource groups.
- Configuring execution priority for critical daily auto-triggered tasks using Intelligent Baseline.

**When NOT to Use**:
- Looking for API endpoints to programmatically trigger workflows.
- Designing DAGs or configuring cross-cycle dependencies from scratch.

**Known Limitations**:
- Scheduling resource groups in DataWorks are designed strictly for task scheduling and lightweight operations; they lack the computational capacity for heavy big data computations.
- Task priority for daily instances of auto-triggered tasks is managed through Baseline Priority (valid values: 1, 3, 5, 7, and 8), not directly on the task node.
- Instance records and operational logs are retained for a specific number of days (typically 30 to 90 days) before automatic cleanup.

## FAQ

Q: Which path should I start with?
A: Start with the DataStudio UI path (DataStudio) if you are designing a new workflow from scratch, as it provides the visual DAG canvas required to establish initial node dependencies and Scheduling Configuration before deployment.

Q: What if I need to run heavy big data computations but chose the UI scheduling path?
A: If you attempt to run heavy big data computations directly on DataWorks scheduling resource groups via the UI, the tasks will fail or time out. Scheduling resource groups are strictly for lightweight orchestration; you must route heavy compute to dedicated engines like MaxCompute.

Q: What if I need to backfill data for 30 days but chose the API path using CreateWorkflowInstances?
A: If you use the CreateWorkflowInstances API for a 30-day backfill, you will hit a limitation because the BizDates array only allows specifying up to seven data timestamps per request. You will need to chunk your API requests or use the UI's SupplementData feature for larger date ranges.
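
Chunking a long backfill into request-sized groups could look like the sketch below. It only prepares the date groups and makes no API call; the ISO `YYYY-MM-DD` string format and the function name are assumptions, since the format that `BizDates` expects is not stated here.

```python
from datetime import date, timedelta

MAX_BIZ_DATES_PER_REQUEST = 7  # limit quoted above for the BizDates array


def chunk_biz_dates(
    start: date, days: int, size: int = MAX_BIZ_DATES_PER_REQUEST
) -> list[list[str]]:
    """Split a run of consecutive dates into request-sized groups of ISO strings."""
    if days < 0 or size < 1:
        raise ValueError("days must be >= 0 and size must be >= 1")
    all_dates = [(start + timedelta(days=offset)).isoformat() for offset in range(days)]
    return [all_dates[i:i + size] for i in range(0, len(all_dates), size)]
```

A 30-day backfill therefore turns into five requests: four with seven dates each and one with the remaining two.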

Q: How do I handle "input and output not consistent" errors when committing a node?
A: This is a data lineage mismatch. You should use the troubleshooting path to verify that the node's configured inputs/outputs in the UI exactly match the tables read/written in your SQL or code.
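
A quick local check for this kind of mismatch is sketched below. It uses naive regular expressions rather than a real SQL parser, so subqueries, CTEs, comments, and quoted identifiers can produce false results; the function name and the table names in the usage note are hypothetical, and this is not how DataWorks itself parses lineage.

```python
import re
from typing import Iterable

# Naive patterns: a table name is any run of word characters and dots.
_READ = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
_WRITE = re.compile(
    r"\bINSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?([\w.]+)", re.IGNORECASE
)


def lineage_mismatches(
    sql: str, declared_inputs: Iterable[str], declared_outputs: Iterable[str]
) -> dict[str, list[str]]:
    """List tables used in the SQL that are missing from the declared inputs/outputs."""
    reads = {name.lower() for name in _READ.findall(sql)}
    writes = {name.lower() for name in _WRITE.findall(sql)}
    inputs = {name.lower() for name in declared_inputs}
    outputs = {name.lower() for name in declared_outputs}
    return {
        "missing_inputs": sorted(reads - inputs),
        "missing_outputs": sorted(writes - outputs),
    }
```

Given a statement that writes `dw.orders_daily` from `ods.orders` joined to `ods.users`, declaring only `ods.orders` as an input would report `ods.users` under `missing_inputs`.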

Q: Can I set the priority of an auto-triggered daily task directly on the node properties?
A: No. If you try to set priority directly on the node, it won't apply to auto-triggered tasks. You must use the troubleshooting/operations path to configure Intelligent Baseline and set the Baseline Priority (valid values: 1, 3, 5, 7, and 8).

Q: What happens if I select 'Generate Immediately After Publishing' for a critical workflow?
A: If you choose 'Generate Immediately After Publishing' (also shown as 'Generate Immediately After Deployment'), be aware that there is a 10-minute activation delay after deployment before the instances are actually generated and ready to run in the Operation Center.

### [Develop nodes](https://www.company-skill.com/p/alibabacloud/alibabacloud-develop-nodes.md)

## What You Want to Do

You need to write, test, deploy, or troubleshoot data processing logic (such as SQL, Python, or Spark tasks) within DataWorks, and you need to choose the right interface or tool for your specific workflow.

**Typical User Questions**:
- How to create a MaxCompute SQL node?
- How to debug PyODPS code in DataWorks?
- How to upload custom UDFs?

## Decision Tree

Pick the best path for your situation:

- **If** you are writing MaxCompute SQL or PyODPS code interactively and need to verify Scheduling Parameters via Smoke Testing → Use DataStudio (go to *alibabacloud/alibabacloud-development*)
- **If** you need to programmatically generate Hologres or ODPS MapReduce nodes in bulk via RESTful endpoints → Use APIHologresMapReduce (go to *alibabacloud/alibabacloud-development*)
- **If** you need to upload local JAR/Python UDF resources or manage PyODPS 3 dependencies via the terminal → Use PyODPS 3 (go to *alibabacloud/alibabacloud-development*)
- **If** you encounter runtime crashes like `DlfMetaStoreClientFactory not found` or `202:ERROR_GROUP_NOT_ENABLE` → Use the troubleshooting path (go to *alibabacloud/alibabacloud-development*)
- **Otherwise (default)** → Use DataStudio. This is the most common and safest default for interactive data development, visual DAG debugging, and utilizing AI code generation.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| DataStudio | Interactive development of SQL/PyODPS/Spark nodes, visual debugging, and running smoke tests in the browser. | medium | Yes | No | SQL code size is limited to 128 KB and maximum 200 statements per node. | `alibabacloud/guide/alibabacloud-development` |
| APIHologresMapReduce | Backend systems that need to programmatically generate and deploy specific compute nodes without UI interaction. | high | Yes | Yes | OpenAPI access is restricted to Enterprise and Flagship Editions of DataWorks. | `alibabacloud/api/alibabacloud-development` |
| PyODPS 3 | Uploading local UDF JARs, managing resources, and handling PyODPS 3 nodes directly from the terminal. | medium | No | Yes | PyODPS 3 nodes on serverless resource groups require using the specific path `/home/tops/bin/pip3`. | `alibabacloud/cli/alibabacloud-development` |
|  | Resolving syntax errors, environment issues, and common FAQs encountered during the data development lifecycle. | low | No | No | DataStudio retains execution logs and instances for only 3 days. | `alibabacloud/troubleshooting/alibabacloud-development` |

## Path Details

### Path 1: DataStudio

**Best For**: Interactive development of SQL/PyODPS/Spark nodes, visual debugging, and running smoke tests in the browser.

**Brief Description**: DataStudio is a web-based IDE in DataWorks for developing, debugging, and scheduling data tasks across compute engines like MaxCompute, EMR, and Hologres. It features the Code Structure panel for navigation, DataWorks Copilot for AI-assisted coding, Run Configuration for execution settings, and Scheduling Parameters for verifying dependencies before production.

**Key technical facts**:
- Billing: Serverless Resource Groups billed per instance hour based on CU usage; Exclusive Resource Groups billed per instance hour based on machine specification.
- Max resource size: 500 MB for MaxCompute resources
- Runtimes: MaxCompute SQL, PyODPS 2, PyODPS 3, EMR Hive, EMR Spark
- Custom Docker: Yes

**When to Use**:
- Interactive development and visual debugging of SQL/Python DAG workflows.
- Need to verify scheduling parameters and dependencies via Smoke Testing before production deployment.
- Using AI Copilot for natural language code generation, refactoring, and query optimization.

**When NOT to Use**:
- Need to programmatically generate and deploy nodes in bulk without UI interaction (use OpenAPI instead).
- Need to automate CI/CD pipelines for UDF JAR uploads and resource management (use MaxCompute CLI instead).

**Known Limitations**:
- SQL code size is limited to 128 KB and maximum 200 statements per node.
- Results in the UI are truncated at 10,000 rows or 10 MB; Tunnel Download must be used for large outputs.
- Workspace must not be in basic mode to perform Smoke Testing.

### Path 2: APIHologresMapReduce

**Best For**: Backend systems that need to programmatically generate and deploy specific compute nodes without UI interaction.

**Brief Description**: DataWorks OpenAPI provides RESTful endpoints and event-driven webhooks (OpenEvent/Extensions) to programmatically create, manage, and monitor Hologres and ODPS MapReduce nodes. It is designed for integrating DataWorks into external CI/CD pipelines or custom data governance approval flows.

**Key technical facts**:
- Billing: Pay-as-you-go based on the number of API calls; free quota available for basic calls.
- Runtimes: ODPS MapReduce
- Auth method: Alibaba Cloud AccessKey (AK/SK) signatures via V3 signature algorithm or STS tokens
- Max concurrency: Subject to QPS caps and daily call limits based on DataWorks edition
- Regions available: cn-shanghai, cn-beijing, cn-hangzhou

**When to Use**:
- Backend systems need to programmatically generate and deploy Hologres or MapReduce nodes without UI interaction.
- Integrating DataWorks into external CI/CD pipelines or custom data governance approval flows via Extensions.
- Event-driven monitoring of pipeline status changes using OpenEvent webhooks.

**When NOT to Use**:
- Using Basic or Standard editions of DataWorks (OpenAPI is restricted to Enterprise/Flagship).
- Need to upload local JAR/Python UDF resources directly from terminal (use MaxCompute CLI instead).

**Known Limitations**:
- OpenAPI access is restricted to Enterprise and Flagship Editions of DataWorks.
- API usage is subject to rate limits (QPS caps) and concurrency limits based on edition; requires exponential backoff.
- Long-running operations return a task ID requiring polling; no native synchronous wait for complex deployments.
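
The last two limitations (rate limits that call for exponential backoff, and task IDs that must be polled) can be handled with a small wrapper. This is a sketch under stated assumptions: `ThrottledError`, the status strings, and both function names are hypothetical placeholders, and the real SDK call is passed in as a callable rather than invented here.

```python
import random
import time
from typing import Callable


class ThrottledError(Exception):
    """Hypothetical marker for a rate-limited (QPS cap) response."""


def call_with_backoff(call: Callable[[], object], max_attempts: int = 5,
                      base_delay: float = 0.5, sleep=time.sleep) -> object:
    """Retry `call` on ThrottledError, doubling the delay on each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            # Exponential delay plus a small random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


def wait_for_task(get_status: Callable[[], str],
                  done=("SUCCESS", "FAILED"),  # hypothetical status names
                  interval: float = 2.0, max_polls: int = 30,
                  sleep=time.sleep) -> str:
    """Poll a long-running operation until it reaches a terminal status."""
    for _ in range(max_polls):
        status = call_with_backoff(get_status, sleep=sleep)
        if status in done:
            return status
        sleep(interval)
    raise TimeoutError("task did not finish within the polling budget")
```

Passing `sleep` as a parameter keeps the helpers testable without real waiting; in production the default `time.sleep` applies.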

### Path 3: PyODPS 3

**Best For**: Uploading local UDF JARs, managing resources, and handling PyODPS 3 nodes directly from the terminal.

**Brief Description**: Command-line interface using odpscmd (MaxCompute CLI) and pip3 to upload local JAR/Python resources, register custom UDFs via the `create function` command, and manage PyODPS 3 environments for DataWorks serverless resource groups. It is ideal for batch operations without navigating the web console.

**Key technical facts**:
- Runtimes: PyODPS 3, Java UDF
- Auth method: odps_config.ini configured with AccessKey ID, AccessKey Secret, Project Name, and Endpoint

**When to Use**:
- Automating CI/CD pipelines for uploading local UDF JARs and registering custom functions.
- Managing PyODPS 3 dependencies and third-party packages directly from the terminal.
- Batch operations for resource management without navigating the DataStudio web console.

**When NOT to Use**:
- Need to visually debug DAG dependencies or run interactive smoke tests (use DataStudio UI instead).
- Need to manage Hologres nodes or trigger event-driven webhooks (use OpenAPI instead).

**Known Limitations**:
- PyODPS 3 nodes on serverless resource groups require using the specific path `/home/tops/bin/pip3` for package installation.
- Requires local Java environment setup and manual configuration of `odps_config.ini` for authentication.
- Does not provide visual DAG debugging or interactive smoke testing capabilities.
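
Because `odps_config.ini` has to be configured manually, a CI job may prefer to render it from secrets instead of committing it. The sketch below assumes the key names `project_name`, `access_id`, `access_key`, and `end_point` from the template that ships with odpscmd; verify them against your client version. `render_odps_config` is a hypothetical helper.

```python
def render_odps_config(project: str, access_id: str,
                       access_key: str, endpoint: str) -> str:
    """Render odps_config.ini content from the four values named above.

    Key names are an assumption based on the odpscmd template file.
    """
    lines = [
        f"project_name={project}",
        f"access_id={access_id}",
        f"access_key={access_key}",
        f"end_point={endpoint}",
    ]
    return "\n".join(lines) + "\n"
```

Write the returned text to the `conf/odps_config.ini` file of the client at deploy time, and keep the AccessKey values in your CI secret store rather than in the repository.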

### Path 4: Console / Dashboard

**Best For**: Resolving syntax errors, environment issues, and common FAQs encountered during the data development lifecycle.

**Brief Description**: Diagnostic guide and FAQ for resolving common DataWorks Data Development errors. It covers fixing resource group unavailability, OpenAPI 403 Forbidden authentication failures, Spark DLF MetaStore class not found exceptions, and recovering nodes via the Recycle Bin or Operation History.

**Key technical facts**:
- Runtimes: Spark, PyODPS, MaxCompute SQL
- Auth method: Manual authentication for Spark jobs via spark.hadoop.dlf.catalog.akMode=MANUAL

**When to Use**:
- Resolving `202:ERROR_GROUP_NOT_ENABLE` by re-associating stopped or unlinked scheduling resource groups.
- Fixing `DlfMetaStoreClientFactory not found` crashes in Spark jobs by appending specific aliyun-java-sdk-dlf-shaded JARs via `--jars`.
- Restoring accidentally deleted nodes via the Recycle Bin or recovering 3-day historical logs via Operation History.
- Disabling MaxCompute Query Acceleration (MCQA) using `set odps.mcqa.disable=true;`.

**When NOT to Use**:
- Looking for proactive data quality monitoring and threshold alerting (use Data Quality Monitoring nodes instead).
- Need to write or schedule new SQL/Python DAG workflows (use DataStudio UI instead).

**Known Limitations**:
- DataStudio retains execution logs and instances for only 3 days; older logs cannot be viewed.
- OpenAPI 403 Forbidden errors occur if the workspace is not upgraded to Enterprise or Flagship Edition.
- Spark jobs in YARN-Cluster mode with Kerberos require manual injection of specific DLF MetaStore JAR paths via `--jars`.
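
The error signatures listed in this path lend themselves to a simple lookup when triaging logs. This is an illustrative sketch only: `KNOWN_FIXES` and `suggest_fixes` are hypothetical names, and the matching is a plain substring search over the signatures quoted in this section.

```python
# Signature -> remediation, taken from the guidance in this section.
KNOWN_FIXES = {
    "202:ERROR_GROUP_NOT_ENABLE":
        "Re-associate the stopped or unlinked scheduling resource group.",
    "DlfMetaStoreClientFactory not found":
        "Append the aliyun-java-sdk-dlf-shaded JARs via --jars.",
    "403 Forbidden":
        "Confirm the workspace is on Enterprise or Flagship Edition.",
}


def suggest_fixes(log_text: str) -> list[str]:
    """Return remediation hints for every known signature found in the log."""
    return [fix for signature, fix in KNOWN_FIXES.items()
            if signature in log_text]
```

Remember that logs older than 3 days are no longer available in DataStudio, so run any such triage soon after a failure.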

## FAQ

Q: Which path should I start with?
A: Start with the DataStudio UI path if you are writing MaxCompute SQL or PyODPS code interactively and need to verify Scheduling Parameters via Smoke Testing. It is the most common and safest default for visual DAG debugging and utilizing the Code Structure panel.

Q: What if I need to upload local UDF JARs but chose the DataStudio UI path?
A: The DataStudio UI does not support automated batch resource management, so it cannot drive a CI/CD pipeline for UDF JAR uploads. Use the MaxCompute CLI (odpscmd) path to run the `create function` command programmatically.

Q: What if I am using the Basic edition of DataWorks but chose the OpenAPI path?
A: You will get `403 Forbidden` errors, because OpenAPI access is restricted to Enterprise and Flagship Edition workspaces.

Q: How do I recover a node I accidentally deleted during development?
A: You can restore accidentally deleted nodes via the Recycle Bin or recover 3-day historical logs via Operation History. Note that DataStudio retains execution logs and instances for only 3 days; older logs cannot be viewed, so act quickly.

Q: Why is my Spark job failing with a MetaStore exception?
A: If your Spark job in YARN-Cluster mode with Kerberos fails with `DlfMetaStoreClientFactory not found`, you need to manually inject specific DLF MetaStore JAR paths using the `--jars` argument in your Run Configuration.

Q: Can I use the CLI to visually debug my DAG dependencies?
A: No. The CLI path (odpscmd) does not provide visual DAG debugging or interactive smoke testing capabilities. For visual debugging and verifying Scheduling Parameters, you must use the DataStudio UI.

### [Manage control](https://www.company-skill.com/p/alibabacloud/alibabacloud-manage-control.md)

## What You Want to Do

You need to control who can access your DataWorks workspaces, manage what specific resources they can interact with, and protect sensitive data from unauthorized exposure or leaks.

**Typical User Questions**:
- How does DataWorks manage user permissions?
- How to assign roles in DataWorks?
- How to restrict access to sensitive data?

## Decision Tree

Pick the best path for your situation:

- **If** you need to add members and assign standard Workspace Roles (Admin, Developer) or configure DataStudio Settings via the Management Center UI → Use RAM (go to *alibabacloud/alibabacloud-workspace*)
- **If** you need to restrict a specific UDF to a single RAM user using the Deny-then-Allow pattern via MaxCompute CLI (odpscmd) → Use UDF (go to *alibabacloud/alibabacloud-workspace-security*)
- **If** you need to automate Sensitive Data Detection across ODPS/EMR engines and configure dynamic masking via Data Security Guard → Use the Data Security Guard console guide (go to *alibabacloud/alibabacloud-security*)
- **If** you are diagnosing 403 Access denied or 404 Resource not found errors for RAM users → Use the workspace troubleshooting guide (go to *alibabacloud/alibabacloud-workspace-security*)
- **Otherwise (default)** → Use RAM. Standard UI onboarding is the safest starting point for general workspace management and foundational role assignment.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| RAM | Onboarding users, assigning standard workspace roles, and managing global RAM policies via UI. | low | No | No | Standard Mode Workspace costs 0.01 CNY/hour. | `alibabacloud/guide/alibabacloud-workspace` |
| UDF | Granting or revoking access to specific UDFs via command line for precise resource control. | medium | Yes (CLI) | Yes | Requires Java 8 or later to run the odpscmd client. | `alibabacloud/cli/alibabacloud-workspace-security` |
| Console / Dashboard (Data Security Guard) | Protecting PII/sensitive columns, configuring dynamic data masking, and monitoring abnormal data access risks. | high | No | No | Sensitive data detection costs 0.002 CNY per 1,000 detections. | `alibabacloud/guide/alibabacloud-security` |
| Console / Dashboard (troubleshooting) | Diagnosing 'Access Denied' errors, role misconfigurations, and cross-project access failures. | medium | No | No | Cross-account workspace membership is not supported. | `alibabacloud/troubleshooting/alibabacloud-workspace-security` |

## Path Details

### Path 1: RAM

**Best For**: Onboarding users, assigning standard workspace roles (Admin, Developer), and managing global RAM policies via UI.

**Brief Description**: This is the primary DataWorks console guide for creating and managing workspaces. It covers configuring DataStudio Settings, managing your Personal Directory, and handling Workspace Roles via the Management Center. It is essential for upgrading a workspace from Basic Mode to Standard Mode to isolate development and production environments.

**Key technical facts**:
- Billing: Standard Mode Workspace: 0.01 CNY/hour, Basic Mode Workspace: 0.005 CNY/hour, Personal Development Environment: 0.002 CNY/hour.
- Auth: Alibaba Cloud account or RAM user with AliyunDataWorksFullAccess permission.
- Prerequisites: Alibaba Cloud account or RAM user with AliyunDataWorksFullAccess permission; Workspace Administrator role for modifying workspace info.

**When to Use**:
- Need to onboard users and assign standard workspace roles (Admin, Developer) via UI.
- Need to configure Data Studio system settings like code templates and scheduling defaults.
- Upgrading a workspace from basic mode to standard mode to isolate development and production environments.

**When NOT to Use**:
- Need to restrict a specific UDF to a single RAM user within the same workspace (requires CLI and Deny-then-Allow pattern).
- Need to automate workspace creation or role assignment via scripts (use CLI/API instead).

**Known Limitations**:
- Workspace ID cannot be modified after creation.
- Deleting a workspace is irreversible and deletes all development assets.
- Schedule PAI Nodes cannot be disabled once enabled.
- Sandbox Whitelist requires public IP addresses or domain names; internal services should use exclusive resource groups.
- Up to 100 workspaces per Alibaba Cloud account.

### Path 2: UDF

**Best For**: Granting or revoking access to specific User Defined Functions (UDFs) via command line for precise resource control.

**Brief Description**: This MaxCompute CLI (odpscmd) reference guide implements fine-grained access control for specific resources. You will use commands like `create role`, `set project policy`, and `grant Execute on function` to manage precise resource-level permissions that the standard UI cannot handle.

**Key technical facts**:
- Auth: AccessKey ID and AccessKey Secret via odps_config.ini or -u/-p flags.
- Prerequisites: Java 8 or later installed; odpscmd package downloaded and configured with odps_config.ini.

**When to Use**:
- Need to restrict a sensitive UDF to a single specific RAM user using the Deny-then-Allow pattern.
- Need to script or automate fine-grained permission assignments for MaxCompute functions.
- Standard UI role assignment is too broad for specific resource-level access control.

**When NOT to Use**:
- Need to manage global DataWorks workspace roles or onboard users via UI (use Console guide instead).
- Need to configure data masking or sensitive data detection rules (use Data Security Guard instead).

**Known Limitations**:
- Standard package-based authorization in DataWorks does not support fine-grained, per-user restriction within the same workspace.
- Requires Java 8 or later to run the odpscmd client.
- Project policies apply to the entire MaxCompute project and affect all users and roles.

### Path 3: Console / Dashboard

**Best For**: Protecting PII/sensitive columns, configuring dynamic data masking, and monitoring abnormal data access risks.

**Brief Description**: This Data Security Guard console guide helps you configure Sensitive Data Detection and manage Data Discovery across multiple engines. It includes setting up Fraud Detection Management, monitoring Data Access activities, and configuring dynamic masking exemptions via User Groups.

**Key technical facts**:
- Billing: Per-request billing model for sensitive data detection: 0.002 CNY per 1,000 detections. 100 free detections per month.
- Max concurrency: 100 QPS.
- Auth: Alibaba Cloud account or RAM user with Data Security Guard permissions.
- Prerequisites: DataWorks Professional Edition or higher for content detection; Data Security Guard enabled and authorized.

**When to Use**:
- Need to automate sensitive data detection across ODPS, EMR, CDH_HIVE, and HOLO engines.
- Need to configure dynamic data masking exemptions via User Groups.
- Need to monitor access patterns and export volumes for sensitive data in MaxCompute and EMR.
- Need to trace data breach sources using data watermark files.

**When NOT to Use**:
- Need to restrict execution access to a specific UDF (use MaxCompute CLI Deny-then-Allow pattern instead).
- Using DataWorks Basic/Standard edition without Professional Edition (content detection will not work).

**Known Limitations**:
- Content detection requires DataWorks Professional Edition or higher; otherwise only field name and comment rules take effect.
- Maximum 1,000 tables per scan.
- Sampling quantity must be set to more than 100 rows per column for reliable content detection results.
- Data Access monitoring data becomes available one day after configuring sensitive data rules.
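
The billing figures in the key facts (0.002 CNY per 1,000 detections, 100 free detections per month) translate into a quick estimate. This sketch assumes linear proration per detection after the free quota; the actual bill may round to blocks of 1,000. `monthly_detection_cost` is a hypothetical helper.

```python
FREE_DETECTIONS_PER_MONTH = 100  # free quota stated in the key facts
PRICE_PER_1000_CNY = 0.002       # price stated in the key facts


def monthly_detection_cost(detections: int) -> float:
    """Estimate monthly cost in CNY, assuming linear proration."""
    billable = max(0, detections - FREE_DETECTIONS_PER_MONTH)
    return billable / 1000 * PRICE_PER_1000_CNY
```

Under that assumption, a month with 1,000,100 detections would cost about 2 CNY, and anything at or below the free quota costs nothing.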

### Path 4: Console / Dashboard

**Best For**: Diagnosing 'Access Denied' errors, role misconfigurations, and cross-project access failures.

**Brief Description**: This DataWorks workspace troubleshooting guide helps you diagnose 403 Access denied and 404 Resource not found errors. It also covers resolving billing or order issues requiring AliyunBSSOrderAccess and executing secure employee offboarding procedures as a Workspace Administrator or Account Owner.

**Key technical facts**:
- Auth: Alibaba Cloud account owner or Workspace Administrator credentials.
- Prerequisites: Alibaba Cloud account owner credentials for RAM policy attachment; Workspace Administrator credentials for workspace-level operations.

**When to Use**:
- Diagnosing '403 Access denied' or empty workspace list errors for new RAM users.
- Resolving '404 Resource not found' errors during user assignment or workspace modification.
- Executing secure employee offboarding procedures (transferring task ownership, updating alerts, revoking access).

**When NOT to Use**:
- Need to proactively configure data masking or sensitive data detection rules (use Data Security Guard guide instead).
- Need to implement fine-grained UDF access control via Deny-then-Allow (use CLI guide instead).

**Known Limitations**:
- Cross-account workspace membership is not supported; RAM user and workspace must belong to the same Alibaba Cloud account and region.
- Sensitive administrative tasks like reassigning task ownership and modifying global alert contacts are restricted to Workspace Administrator or Account Owner.
- Deleting a workspace or RAM user requires Account Owner credentials, not just Workspace Administrator.

## FAQ

Q: Which path should I start with?
A: Start with the Console guide (RAM) if you are setting up a new project. It covers the foundational Management Center setup, Workspace Roles, and upgrading from Basic Mode to Standard Mode, which are prerequisites for most other security configurations.

Q: What if I need to restrict a specific UDF to one user but chose the Console guide?
A: You will fail to achieve fine-grained control. The standard package-based authorization in the DataWorks UI does not support per-user restriction within the same workspace. You must use the MaxCompute CLI guide and apply a Deny-then-Allow pattern using `set project policy`.

Q: What if I use DataWorks Basic Edition but chose the Data Security Guard path for content detection?
A: Your content detection will not work. Data Security Guard's content detection requires DataWorks Professional Edition or higher; otherwise, only field name and comment rules will take effect, leaving your actual data content unscanned.

Q: How do I handle secure employee offboarding?
A: Use the Troubleshooting guide. It covers transferring task ownership, updating global alert contacts, and revoking access, all of which require Workspace Administrator or Account Owner credentials.

Q: Can I use the CLI path to manage global DataWorks workspace roles?
A: No. The MaxCompute CLI (odpscmd) is strictly for fine-grained MaxCompute resource permissions like `grant Execute on function`. To manage global DataWorks workspace roles or onboard users, you must use the Console guide.

### [Monitor quality](https://www.company-skill.com/p/alibabacloud/alibabacloud-monitor-quality.md)

## What You Want to Do

You need to ensure your data pipelines are running smoothly, alerting the right people when they fail or miss SLAs, and producing accurate, complete data. This involves tracking execution health, validating data content, and troubleshooting stuck instances.

**Typical User Questions**:
- How to configure monitoring rules in DataWorks?
- How to set up alerts for failed tasks?
- How to check data quality and completeness?

## Decision Tree

Pick the best path for your situation:

- **If** you need to configure custom alert rules for node instances, baselines, or exclusive resource groups with specific trigger conditions (e.g., Overtime, Error) and manage Shift Schedule for on-duty engineers → Use the Operation Center guide (go to [detail skill](skills/alibabacloud/guide/alibabacloud-operations/SKILL.md))
- **If** you need to validate data accuracy, completeness, or consistency using Data Quality Center (DQC) rules, Check Item configurations, and Data Standard templates → Use the Data Governance guide (go to [detail skill](skills/alibabacloud/guide/alibabacloud-governance/SKILL.md))
- **If** auto-triggered instances remain stuck in 'Pending' or 'Frozen' state, or data backfill instances fail because the target node is frozen in DataStudio → Use the Operations troubleshooting guide (go to [detail skill](skills/alibabacloud/troubleshooting/alibabacloud-operations/SKILL.md))
- **Otherwise (default)** → Use the Operation Center guide. Proactive execution monitoring and SLA alerting are the foundation to put in place before implementing deep data validation or troubleshooting specific runtime failures.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| Console / Dashboard (Operation Center) | Tracking pipeline execution status, setting up SLA baselines, and configuring custom alerts for task failures/delays. | medium | No | No | Custom Alert Rules: 0.001 CNY/request | `alibabacloud/guide/alibabacloud-operations` |
| Console / Dashboard (Data Governance) | Validating data accuracy, completeness, and consistency using Data Quality Center (DQC) rules and templates. | high | No | No | EMR Table Creation: 0.001 CNY/request (100 free/month) | `alibabacloud/guide/alibabacloud-governance` |
| Console / Dashboard (troubleshooting) | Investigating why specific task instances are stuck, frozen, or failing during runtime execution. | medium | No | No | Requires O&M or Admin role for audit logs | `alibabacloud/troubleshooting/alibabacloud-operations` |

## Path Details

### Path 1: Console / Dashboard

**Best For**: Tracking pipeline execution status, setting up SLA baselines, and configuring custom alerts for task failures/delays.

**Brief Description**: This is a console operations guide for the DataWorks Operation Center. It covers managing Auto Triggered Node O&M, configuring Rule Management and Alert Management, setting up Shift Schedule for on-call engineers, and subscribing to platform events via OpenEvent.

**Key technical facts**:
- Billing: Shared Scheduling Resource Group: 0.002 CNY/hour; Exclusive Scheduling Resource Group: 0.005 CNY/hour; Custom Alert Rules: 0.001 CNY/request; Built-in Monitoring Rule Templates: 0.0001 CNY/request
- Prerequisites: Node deployed, Workspace created, Enterprise Edition activated (for OpenEvent), EventBridge activated (for OpenEvent)

**When to Use**:
- Need to configure custom alert rules for node instances, baselines, or exclusive resource groups with specific trigger conditions (e.g., Overtime, Error).
- Need to manage shift schedules for on-duty engineers to receive alert notifications.
- Need to subscribe to DataWorks platform events (e.g., Node change event, Instance change event) via EventBridge using OpenEvent.

**When NOT to Use**:
- Need to validate data accuracy, completeness, or consistency using Data Quality Center (DQC) rules and templates (use Data Governance path instead).
- Need to retain operation audit logs for more than 30 days without using ActionTrail.

**Known Limitations**:
- Phone calls for alert notifications are only supported for Chinese mainland numbers.
- Operation records are retained for only 30 days in the Operation Center (ActionTrail required for 90 days).
- Logs larger than 3 MB for completed instances are cleared daily on a schedule.
- Stopping an EMR job from Engine O&M sets the entire DataWorks task instance to FAILED.

### Path 2: Console / Dashboard

**Best For**: Validating data accuracy, completeness, and consistency using Data Quality Center (DQC) rules and templates.

**Brief Description**: This guide covers Data Asset Governance and modeling. It includes configuring Data Quality monitoring rules, setting up Check Item validations, searching assets via Data Map, defining Data Standard templates, and reviewing Running Records in Quality O&M.

**Key technical facts**:
- Billing: Most Data Governance, Data Map, and Data Quality features included in standard subscription; EMR Table Creation: 0.001 CNY/request (100 free/month); Data Query and Analysis Control varies by edition (0 CNY/request but row/volume limits apply).
- Prerequisites: Workspace administrator role or tenant-level data governance administrator permissions, MaxCompute computing resource bound, Monitoring rules configured.

**When to Use**:
- Need to configure governance check items to enforce compliance and quality gates before data development pipeline deployment.
- Need to review Data Quality validation results, handle anomalies, and record handling decisions in Quality O&M.
- Need to safely retire tasks and tables with impact assessment and staged execution using Graceful Undeployment.
- Need to search and filter metadata, tables, and APIs across the organization using Data Map.

**When NOT to Use**:
- Need to configure custom alert rules for task execution failures, baselines, or resource groups (use the Operation Center path instead).
- Need to manage shift schedules for on-call O&M engineers.

**Known Limitations**:
- Disabling a governance check item via toggle switch makes it inactive only for the current workspace, not tenant-wide.
- Data Query and Analysis Control (Standard Edition) automatically truncates data if volume exceeds 1 GB, even if row limit allows more.
- Batch import for Data Standards is limited to .xlsx format with a maximum of 30,000 records and 10 MB file size.
- Lineage diagram depends on correct scheduling dependencies and valid SQL code execution; missing dependencies break the visual representation.
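
The batch import limits for Data Standards (.xlsx only, at most 30,000 records, at most 10 MB) can be checked before uploading. This is a sketch: `check_import_file` is a hypothetical helper, 10 MB is assumed to mean 10 × 1024 × 1024 bytes, and the caller supplies the size and record count rather than the code parsing the workbook.

```python
from pathlib import Path

MAX_RECORDS = 30_000
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 * 1024 * 1024 bytes


def check_import_file(path: str, size_bytes: int, record_count: int) -> list[str]:
    """Return the reasons a Data Standards import file would be rejected."""
    problems = []
    if Path(path).suffix.lower() != ".xlsx":
        problems.append("file must be in .xlsx format")
    if record_count > MAX_RECORDS:
        problems.append(f"{record_count} records exceed {MAX_RECORDS}")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceed {MAX_BYTES}")
    return problems
```

If the list is non-empty, split the workbook into several files before importing.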

### Path 3: Console / Dashboard

**Best For**: Investigating why specific task instances are stuck, frozen, or failing during runtime execution.

**Brief Description**: This is a troubleshooting guide for diagnosing frozen nodes and stuck instances. It utilizes the Operation Center Instance List, DAG view, and DataStudio to perform Unfreeze and Rerun actions, while leveraging Operation Log and Audit Log for state change tracking.

**Key technical facts**:
- Prerequisites: Access to DataWorks console, O&M or Admin role for audit logs.

**When to Use**:
- Auto-triggered instances remain stuck in 'Pending' or 'Frozen' state after a node was unfrozen in DataStudio.
- Data backfill instances are skipped or fail immediately because the target node is currently frozen.
- Need to audit who performed a specific freeze or unfreeze operation on a node or instance using Operation Logs.

**When NOT to Use**:
- Need to set up proactive alerting and SLA baselines for task execution (use the Operation Center path instead).
- Need to validate data accuracy or configure data quality monitoring rules (use Data Governance path instead).

**Known Limitations**:
- Unfreezing a node in DataStudio only affects the generation of future instances; it does not automatically change the state of already-generated instances in Operation Center.
- Freezing a node in DataStudio prevents the scheduling system from generating or executing any new instances, including manual data backfill instances.
- Standard node properties or instance details view do not prominently display operator details for state changes; must use Operation Log/Audit Log.
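
The first two limitations are easier to reason about as a toy state model. The code below is not a DataWorks API; it only encodes the behavior described above, with hypothetical class names and with "Pending" assumed as the state an instance returns to after an explicit unfreeze.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    state: str  # e.g. "Pending" or "Frozen"


@dataclass
class Node:
    frozen: bool = False
    instances: list = field(default_factory=list)

    def generate_instance(self) -> Optional[Instance]:
        """Scheduler or backfill request: a frozen node produces nothing."""
        if self.frozen:
            return None
        inst = Instance("Pending")
        self.instances.append(inst)
        return inst

    def unfreeze(self) -> None:
        """Affects future generation only; existing instances keep their state."""
        self.frozen = False


def unfreeze_instance(inst: Instance) -> None:
    """Separate, explicit action on an already-generated instance."""
    if inst.state == "Frozen":
        inst.state = "Pending"
```

The model shows why unfreezing the node in DataStudio is not enough: already-generated instances have to be unfrozen (and, where needed, rerun) individually in Operation Center.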

## FAQ

Q: Which path should I start with?
A: Start with Path 1 (the Operation Center guide). Establishing proactive execution monitoring, SLA baselines, and custom alert rules is the foundational step. Once your pipeline execution is stable and alerting is configured, you can layer on data content validation using the Data Governance path.

Q: What if I need to validate data accuracy but chose Path 1 (Operation Center)?
A: The Operation Center only monitors execution status (for example, Overtime or Error). It lacks the Data Quality Center (DQC) rules and Check Item configurations needed to inspect actual data content for nulls or duplicates, so use Path 2 (Data Governance) instead.

Q: What if I need to audit who froze a node but chose Path 2 (Data Governance)?
A: You won't find the answer there, because Data Governance focuses on Data Asset Governance and Data Standards. To trace state changes and operator details, use the Operation Log and Audit Log in the Operations Troubleshooting path (Path 3).

Q: What if I need to retain operation audit logs for more than 30 days but used Path 1?
A: The Operation Center retains operation records for only 30 days. If you require 90-day retention for compliance or auditing, use ActionTrail alongside your Operation Center setup.

Q: What if I need to safely retire tasks and tables but chose Path 1?
A: Path 1 is strictly for execution monitoring. To safely retire tasks and tables with impact assessment and staged execution, you must use the Graceful Undeployment feature found in Path 2 (Data Governance).

Q: What happens if my lineage diagram is broken or missing nodes in Path 2?
A: The lineage diagram in Data Governance depends entirely on correct scheduling dependencies and valid SQL code execution. If you have missing dependencies in your pipeline configuration, it will break the visual representation in the Data Map.


## Frequently asked questions

### How do I build a data synchronization pipeline?

You can build a data synchronization pipeline by configuring data sources, establishing network connectivity, and creating real-time or batch sync tasks. This process involves connecting databases and uploading data to establish your integration workflow.

### How do I configure task scheduling and workflow dependencies?

You configure task scheduling and workflow dependencies by setting up cron jobs, cross-cycle dependencies, and workflow triggers. This allows you to orchestrate workflows, manage time properties, and deploy nodes.

### How do I develop data nodes and debug scripts?

You develop data nodes and debug scripts by creating MaxCompute or EMR nodes and debugging PyODPS or SQL code within DataStudio. You can also manage UDFs and perform smoke testing to validate your data processing logic.

### How do I manage workspace access and permissions?

You manage workspace access by assigning RAM roles, managing workspace members, and configuring workspace modes. This enables you to administer permissions, view operation history, and restrict sensitive data access.

### How do I control data security and manage risks?

You control data security by identifying sensitive data, configuring masking rules, and monitoring data risks. These capabilities allow you to actively manage security policies and protect your information assets.

## Use with an AI agent

```bash
curl -s https://www.company-skill.com/api/route \
  -H 'Content-Type: application/json' \
  -d '{"query": "...", "product": "alibabacloud"}'
```
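
For agents written in Python, the same request can be built with the standard library. This is a sketch that mirrors the curl call above; `build_route_request` is a hypothetical helper, and the shape of the response is not documented here, so the code stops at constructing the request.

```python
import json
import urllib.request

ROUTE_URL = "https://www.company-skill.com/api/route"


def build_route_request(query: str,
                        product: str = "alibabacloud") -> urllib.request.Request:
    """Build the POST request equivalent to the curl example."""
    body = json.dumps({"query": query, "product": product}).encode("utf-8")
    return urllib.request.Request(
        ROUTE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(build_route_request("..."))` and inspect the raw response before relying on any particular field.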

MCP server: https://www.company-skill.com/api/mcp/alibabacloud.py

---
Machine-readable: https://www.company-skill.com/llms.txt · https://www.company-skill.com/sitemap.xml