Skip to content

Conversation

@parashardhapola
Copy link
Member

@parashardhapola parashardhapola commented Nov 20, 2025

PR Type

Enhancement


Description

  • Add ontology term IDs and cell states to obs columns

  • Separate ontology term names from IDs in API response

  • Update API domain to production endpoint

  • Improve documentation with visualization examples


Diagram Walkthrough

flowchart LR
  API["API Response"] -->|Extract cellOntologyTermName| OntologyName["ontologyTerm"]
  API -->|Extract cellOntologyTerm| OntologyID["ontologyTermID"]
  API -->|Extract cellState| CellState["cellState"]
  OntologyName --> OBS["adata.obs columns"]
  OntologyID --> OBS
  CellState --> OBS
Loading

File Walkthrough

Relevant files
Configuration changes
__init__.py
Version bump to 0.11.0                                                                     

cytetype/init.py

  • Version bump from 0.10.0 to 0.11.0
+1/-1     
config.py
Update API domain to production endpoint                                 

cytetype/config.py

  • Update DEFAULT_API_URL from Modal endpoint to production domain
  • Change from https://nygen-labs-prod--cytetype-api.modal.run to
    https://prod.cytetype.nygen.io
+1/-1     
Enhancement
api.py
Separate ontology term names and IDs                                         

cytetype/api.py

  • Split ontology term extraction into two fields: ontologyTerm (name)
    and ontologyTermID (ID)
  • Extract cellOntologyTermName for term names
  • Extract cellOntologyTerm for term IDs
+3/-0     
main.py
Store ontology IDs and cell states in obs                               

cytetype/main.py

  • Add ontology term ID mapping to adata.obs with key
    {results_prefix}_cellOntologyTermID_{group_key}
  • Add cell state mapping to adata.obs with key
    {results_prefix}_cellState_{group_key}
  • Update success log message to include new columns
+27/-0   
Documentation
README.md
Update docs with improved examples                                             

README.md

  • Update example API domain URL from Modal to production
  • Improve code example with variable group_key for reusability
  • Add UMAP visualization step to pipeline
  • Update visualization examples to show all four annotation columns
+13/-7   
examples.md
Update all report links to production domain                         

docs/examples.md

  • Replace all Modal API URLs with production domain across multiple
    dataset links
  • Update 16 report links to use https://prod.cytetype.nygen.io
+20/-20 

@qodo-code-review
Copy link

qodo-code-review bot commented Nov 20, 2025

PR Compliance Guide 🔍

(Compliance updated until commit 763984d)

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit logs: New data handling for ontology IDs and cell states lacks explicit audit logging of
critical actions (e.g., transformations, updates to obs), which may be handled elsewhere
but is not evident in the added code.

Referred Code
# Update ontology term IDs
ontology_id_map = {
    item["clusterId"]: item["ontologyTermID"]
    for item in result_data.get("annotations", [])
}
self.adata.obs[f"{results_prefix}_cellOntologyTermID_{self.group_key}"] = (
    pd.Series(
        [
            ontology_id_map.get(cluster_id, "Unknown")
            for cluster_id in self.clusters
        ],
        index=self.adata.obs.index,
    ).astype("category")
)

# Update cell states
cell_state_map = {
    item["clusterId"]: item.get("cellState", "")
    for item in result_data.get("annotations", [])
}
self.adata.obs[f"{results_prefix}_cellState_{self.group_key}"] = pd.Series(


 ... (clipped 4 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Edge cases unhandled: Mapping for new fields (ontologyTermID, cellState) uses defaults like "Unknown"
or empty strings without explicit validation or error handling for malformed inputs, which
may be acceptable but is not clearly justified in the diff.

Referred Code
# Update ontology term IDs
ontology_id_map = {
    item["clusterId"]: item["ontologyTermID"]
    for item in result_data.get("annotations", [])
}
self.adata.obs[f"{results_prefix}_cellOntologyTermID_{self.group_key}"] = (
    pd.Series(
        [
            ontology_id_map.get(cluster_id, "Unknown")
            for cluster_id in self.clusters
        ],
        index=self.adata.obs.index,
    ).astype("category")
)

# Update cell states
cell_state_map = {
    item["clusterId"]: item.get("cellState", "")
    for item in result_data.get("annotations", [])
}
self.adata.obs[f"{results_prefix}_cellState_{self.group_key}"] = pd.Series(


 ... (clipped 4 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Log content review: Success message logs mention results keys but do not appear to include sensitive data;
however, without visibility into the content of result_data, it's unclear if
downstream logging could inadvertently expose sensitive fields.

Referred Code
logger.success(
    f"Annotations successfully added to `adata.obs['{results_prefix}_annotation_{self.group_key}']`\n"
    f"Ontology terms added to `adata.obs['{results_prefix}_cellOntologyTerm_{self.group_key}']`\n"
    f"Ontology term IDs added to `adata.obs['{results_prefix}_ontologyTermID_{self.group_key}']`\n"
    f"Cell states added to `adata.obs['{results_prefix}_cellState_{self.group_key}']`\n"
    f"Full results added to `adata.uns['{results_prefix}_results']`."
)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Input validation: Transformation trusts external API fields (cellOntologyTermName, cellOntologyTerm,
cellState) without explicit validation/sanitization in the added code, which may be
performed elsewhere but is not evident here.

Referred Code
transformed_annotation = {
    "clusterId": annotation_data.get("clusterId", cluster_id),
    "annotation": annotation_data.get("annotation", "Unknown"),
    "ontologyTerm": annotation_data.get(
        "cellOntologyTermName", "Unknown"
    ),
    "ontologyTermID": annotation_data.get(
        "cellOntologyTerm", "Unknown"
    ),
    # Include additional fields from new format

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Previous compliance checks

Compliance check up to commit 6369c5a
Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit logs: New functionality updates adata.obs with ontology IDs and cell states without any
corresponding audit logging, which may be acceptable for a client library but cannot be
fully verified from the diff.

Referred Code
# Update ontology term IDs
ontology_id_map = {
    item["clusterId"]: item["ontologyTermID"]
    for item in result_data.get("annotations", [])
}
self.adata.obs[f"{results_prefix}_cellOntologyTermID_{self.group_key}"] = (
    pd.Series(
        [
            ontology_id_map.get(cluster_id, "Unknown")
            for cluster_id in self.clusters
        ],
        index=self.adata.obs.index,
    ).astype("category")
)

# Update cell states
cell_state_map = {
    item["clusterId"]: item.get("cellState", "")
    for item in result_data.get("annotations", [])
}
self.adata.obs[f"{results_prefix}_cellState_{self.group_key}"] = pd.Series(


 ... (clipped 4 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Edge case defaults: The transform uses default "Unknown" or empty strings for missing fields but no
explicit error handling or validation of external API inputs is shown in the new lines.

Referred Code
transformed_annotation = {
    "clusterId": annotation_data.get("clusterId", cluster_id),
    "annotation": annotation_data.get("annotation", "Unknown"),
    "ontologyTerm": annotation_data.get(
        "cellOntologyTermName", "Unknown"
    ),
    "ontologyTermID": annotation_data.get(
        "cellOntologyTerm", "Unknown"
    ),
    # Include additional fields from new format

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Log content check: The success log message references column names only; while no sensitive data is logged in
the new lines, full context of logging elsewhere is unknown.

Referred Code
logger.success(
    f"Annotations successfully added to `adata.obs['{results_prefix}_annotation_{self.group_key}']`\n"
    f"Ontology terms added to `adata.obs['{results_prefix}_cellOntologyTerm_{self.group_key}']`\n"
    f"Ontology term IDs added to `adata.obs['{results_prefix}_ontologyTermID_{self.group_key}']`\n"
    f"Cell states added to `adata.obs['{results_prefix}_cellState_{self.group_key}']`\n"
    f"Full results added to `adata.uns['{results_prefix}_results']`."
)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Input validation: New parsing of external API fields (cellOntologyTermName, cellOntologyTerm, cellState)
lacks explicit validation or sanitization beyond defaulting values.

Referred Code
"ontologyTerm": annotation_data.get(
    "cellOntologyTermName", "Unknown"
),
"ontologyTermID": annotation_data.get(
    "cellOntologyTerm", "Unknown"
),
# Include additional fields from new format

Learn more about managing compliance generic rules or creating your own custom rules

@qodo-code-review
Copy link

qodo-code-review bot commented Nov 20, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Consolidate annotation processing logic

Refactor the repetitive logic in cytetype/main.py for processing and storing
annotation fields. Use a configuration map and a single loop to handle all
fields, improving code scalability and maintainability.

Examples:

cytetype/main.py [197-244]
        annotation_map = {
            item["clusterId"]: item["annotation"]
            for item in result_data.get("annotations", [])
        }
        self.adata.obs[f"{results_prefix}_annotation_{self.group_key}"] = pd.Series(
            [annotation_map.get(cluster_id, "Unknown") for cluster_id in self.clusters],
            index=self.adata.obs.index,
        ).astype("category")

        # Update ontology terms

 ... (clipped 38 lines)

Solution Walkthrough:

Before:

def _store_results_and_annotations(...):
    annotations = result_data.get("annotations", [])

    # Process 'annotation'
    annotation_map = {item["clusterId"]: item["annotation"] for item in annotations}
    self.adata.obs[f"{results_prefix}_annotation_{self.group_key}"] = pd.Series(
        [annotation_map.get(c, "Unknown") for c in self.clusters], ...
    ).astype("category")

    # Process 'cellOntologyTerm'
    ontology_term_map = {item["clusterId"]: item["ontologyTerm"] for item in annotations}
    self.adata.obs[f"{results_prefix}_cellOntologyTerm_{self.group_key}"] = pd.Series(
        [ontology_term_map.get(c, "Unknown") for c in self.clusters], ...
    ).astype("category")

    # Process 'ontologyTermID' (new)
    ontology_id_map = {item["clusterId"]: item["ontologyTermID"] for item in annotations}
    self.adata.obs[f"{results_prefix}_cellOntologyTermID_{self.group_key}"] = pd.Series(
        [ontology_id_map.get(c, "Unknown") for c in self.clusters], ...
    ).astype("category")

    # ... and so on for 'cellState'

After:

def _store_results_and_annotations(...):
    annotations = result_data.get("annotations", [])

    annotation_fields = {
        "annotation": ("annotation", "Unknown"),
        "cellOntologyTerm": ("ontologyTerm", "Unknown"),
        "cellOntologyTermID": ("ontologyTermID", "Unknown"),
        "cellState": ("cellState", ""),
    }

    for obs_suffix, (api_key, default) in annotation_fields.items():
        field_map = {item["clusterId"]: item.get(api_key, default) for item in annotations}
        obs_key = f"{results_prefix}_{obs_suffix}_{self.group_key}"
        self.adata.obs[obs_key] = pd.Series(
            [field_map.get(cluster, default) for cluster in self.clusters],
            index=self.adata.obs.index,
        ).astype("category")
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies significant code duplication in cytetype/main.py for processing annotation fields, which was expanded in this PR, and proposes a valid refactoring that improves maintainability and scalability.

Medium
Possible issue
Handle potential None from API

Improve robustness by handling potential None values from the API for
cellOntologyTermName and cellOntologyTerm. Coalesce None to the "Unknown"
default string to ensure consistent data types.

cytetype/api.py [159-164]

 "ontologyTerm": annotation_data.get(
-    "cellOntologyTermName", "Unknown"
-),
+    "cellOntologyTermName"
+) or "Unknown",
 "ontologyTermID": annotation_data.get(
-    "cellOntologyTerm", "Unknown"
-),
+    "cellOntologyTerm"
+) or "Unknown",
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that get() with a default value does not handle an explicit None return from the API. The proposed or "Unknown" idiom is a robust and pythonic way to ensure a default string value in all cases (missing key or None value).

Medium
General
Ensure consistent types before conversion

To prevent potential type errors when converting to a pandas categorical type,
ensure all values from the cellState field are strings. This can be done by
explicitly casting values to strings, which also handles potential None values
from the API.

cytetype/main.py [236-244]

 # Update cell states
 cell_state_map = {
-    item["clusterId"]: item.get("cellState", "")
+    item["clusterId"]: item.get("cellState")
     for item in result_data.get("annotations", [])
 }
 self.adata.obs[f"{results_prefix}_cellState_{self.group_key}"] = pd.Series(
-    [cell_state_map.get(cluster_id, "") for cluster_id in self.clusters],
+    [str(cell_state_map.get(cluster_id, "")) for cluster_id in self.clusters],
     index=self.adata.obs.index,
 ).astype("category")
  • Apply / Chat
Suggestion importance[1-10]: 3

__

Why: The suggestion correctly identifies a potential type inconsistency with None values, but the proposed fix of using str() would convert None to the string 'None', which is likely not the desired behavior.

Low
  • Update

@parashardhapola parashardhapola merged commit 6c40ed4 into master Nov 20, 2025
1 check passed
@parashardhapola parashardhapola deleted the more_obs branch November 20, 2025 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants