Vertex Summer Intern 2026, Statistical Programming, AI

Vertex Pharmaceuticals•Boston, MA

$30 - $50•Hybrid

About The Position

Kickstart Your Career at Vertex!

Are you ready to make a real impact? At Vertex, our mission is to tackle serious diseases and to change lives, for the better, for the future. Our aim is to give you the skills, insights, and career guidance to be an important part of that future; to turn your potential into progression. As a Vertex intern or co-op, you’ll work on meaningful projects, collaborate with talented teams, and learn from industry leaders. We’re passionate about innovation, inclusion, and supporting your growth, both inside and outside the lab.

Why Vertex?

Real Projects: You’ll work on assignments that make a real impact, not just busy work.
Mentorship & Networking: Connect with leaders and peers who want to see you succeed through professional networks, connections, and collaborations that will shape your longer-term career.
Flexible & Supportive: We offer flexible work options with Flex @ Vertex and prioritize your wellbeing.
Inclusive Culture: Collaboration and inclusion are embedded in everything we do.
Career Launchpad: Build skills, explore career paths, and get guidance for your future career.

Ready to apply? Submit your application and let’s turn possibilities into reality!

Your Impact

The Vertex Statistical Programming internship program is a multi-week experiential training program for students currently working towards an advanced degree in Statistics, Biostatistics, Data Science, Computer Science, Applied Mathematics, Biomedical Engineering, or a related field. If you are passionate, collaborative, and growth-minded, an internship at Vertex will help you gain meaningful experience in our Statistical Programming functional areas and serve as a launchpad for your career.

Important Notice Regarding Internship and Co-op Inquiries

At Vertex Pharmaceuticals, we are committed to providing a fair and structured recruitment process for all students interested in internship and co-op opportunities.
To ensure consistency and equity, all student applications must go through our Early Talent Acquisition Team. Due to the high volume of interest, we are unable to respond to individual solicitations. Direct solicitation of Vertex employees, including senior leaders, via email will result in removal from the recruiting process. We appreciate your enthusiasm and interest in Vertex. To be considered for internship or co-op roles, please apply directly through our official application channels (https://www.vrtx.com/careers/career-growth-and-opportunities/internships/). Thank you for respecting our process and helping us maintain a fair experience for all candidates.

What you will be doing:

We are seeking an intern to contribute to the development of an LLM-based agent designed to enhance the automation of clinical Table, Figure, and Listing (TFL) generation workflows. The primary responsibilities of the intern will include:

Parsing Clinical TFL Shell Documents: Extracting structured specifications such as titles, population definitions, variables, footnotes, and programming notes from RTF/DOCX files.
Extracting Structured Specifications: Transforming unstructured text into structured formats for downstream processing.
Interpreting User Prompts: Understanding and processing user inputs to guide the automation workflow.
Mapping Specifications to R Function/SAS Macro Libraries: Matching extracted specifications to appropriate R functions/SAS macros within an existing library.
Triggering TFL Generation Workflows: Automating the execution of TFL generation workflows.

The system will leverage a combination of prompt-driven reasoning, structured document parsing, retrieval-augmented generation (RAG), and tool/function calling to integrate LLM outputs with deterministic R/SAS code execution.

The intern will be responsible for delivering the following technical deliverables:

Shell-to-Function Matching Agent: Takes clinical TFL shell text and user prompts as input and produces structured function calls to the R library.
Schema-Constrained LLM Output: Generate JSON mappings for table types, population definitions, variables, and grouping.
Validation Layer: Develop mechanisms to ensure LLM outputs align with predefined function signatures and constraints.
Prompt Optimization: Enhance prompt engineering to improve reliability and minimize hallucinations in LLM outputs.
System Integration: Integrate the LLM agent with the existing automation system to enable TFL generation workflows.

This role offers a unique opportunity to work at the intersection of statistical programming, machine learning, and clinical analytics. The successful candidate will gain hands-on experience in developing AI-driven automation tools that have a direct impact on the efficiency and accuracy of clinical reporting processes.

This role is not focused on:

Researching or training new LLMs.
Fine-tuning large foundation models.

Instead, this role emphasizes:

Building applied AI systems.
Developing production-oriented tools.
Designing hybrid workflows that combine deterministic and probabilistic methods.
Supporting statistical programmers in automating TFL generation.

Requirements

Programming proficiency in Python (primary language for LLM development).
Working knowledge of R/SAS for statistical programming (a plus).
Experience in integrating Python and R using tools such as reticulate, subprocess calls, or API-based communication.
Familiarity with Git and version control systems.
Proven ability to write modular, testable, and production-ready code (beyond Jupyter notebooks).
Candidates must have hands-on experience with at least one of the following frameworks or tools: OpenAI API (including function calling and tool calling), Azure OpenAI, LangChain, LlamaIndex, Semantic Kernel, AutoGen, CrewAI, Haystack.
Prompt engineering (e.g., few-shot prompting, system vs. user prompts).
Structured output generation (e.g., JSON schema enforcement).
Tool calling and function calling.
Multi-step reasoning chains and agent workflows.
Context window management for LLMs.
Output validation to ensure accuracy and reliability.
Experience in building RAG systems, including document chunking strategies for efficient processing, embedding generation for semantic search, and vector similarity search with top-k retrieval.
Familiarity with at least one of the following vector databases: FAISS, Pinecone, Milvus, Chroma, Weaviate.
Candidates should have a strong understanding of indexing technical documents and retrieving relevant context based on user queries.
Experience in parsing and extracting structured data from technical documents, including RTF and DOCX formats, structured reports, and TFL shell documents.
Proficiency with Python libraries such as python-docx, striprtf, and docx2txt, and with regular expressions for rule-based parsing.
Candidates should be familiar with the differences between rule-based extraction and LLM-assisted extraction.
Experience in integrating APIs and tools, including REST APIs and JSON schema validation.
Frameworks such as FastAPI or Flask (preferred).
Logging, error handling, and debugging.
Configuration-driven workflows for scalable systems.
Candidates must understand how to design workflows where LLM outputs trigger deterministic backend functions.
Enrolled in or have completed a Master's or Doctoral degree in Statistics, Biostatistics, Data Science, Computer Science, Applied Mathematics, Biomedical Engineering, or another related field.
Coursework completed in Machine Learning, Statistical Computing, Linear Algebra, Probability, and Data Structures and Algorithms.
Have experience building at least one end-to-end LLM application.
Possess a strong understanding of statistical reporting workflows.
Be comfortable debugging ambiguous LLM behavior and improving model reliability.
Show a keen interest in applying AI to clinical analytics and statistical programming.
Legal authorization to work in the United States, now and in the future.
You must be enrolled in an advanced degree program if graduating before August 2026.
You must be available to work full-time, 40 hours per week, from May – August 2026.

Responsibilities

Parsing Clinical TFL Shell Documents: Extracting structured specifications such as titles, population definitions, variables, footnotes, and programming notes from RTF/DOCX files.
Extracting Structured Specifications: Transforming unstructured text into structured formats for downstream processing.
Interpreting User Prompts: Understanding and processing user inputs to guide the automation workflow.
Mapping Specifications to R Function/SAS Macro Libraries: Matching extracted specifications to appropriate R functions/SAS macros within an existing library.
Triggering TFL Generation Workflows: Automating the execution of TFL generation workflows.
Shell-to-Function Matching Agent: Takes clinical TFL shell text and user prompts as input and produces structured function calls to the R library.
Schema-Constrained LLM Output: Generate JSON mappings for table types, population definitions, variables, and grouping.
Validation Layer: Develop mechanisms to ensure LLM outputs align with predefined function signatures and constraints.
Prompt Optimization: Enhance prompt engineering to improve reliability and minimize hallucinations in LLM outputs.
System Integration: Integrate the LLM agent with the existing automation system to enable TFL generation workflows.