Conversation
✅ Snyk checks have passed. No issues have been found so far.
Pull request overview
This PR adds a Knowledge Connector for Microsoft SharePoint that enables extracting and indexing documents from SharePoint sites into Cognigy's knowledge base system.
Changes:
- Added SharePoint Knowledge Connector with support for multiple file types (PDF, DOCX, TXT, CSV, JSON, JSONL, MD, PPTX)
- Implemented file extraction, text chunking, and knowledge source creation
- Updated dependencies to include LangChain libraries for document processing
Reviewed changes
Copilot reviewed 11 out of 14 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| package.json | Added base dependencies for extension tools and axios |
| extensions/microsoft-sharepoint/tsconfig.json | Added skipLibCheck configuration |
| extensions/microsoft-sharepoint/src/module.ts | Registered new connector and connection type |
| extensions/microsoft-sharepoint/src/knowledge-connectors/sharepoint-connector.ts | Main connector implementation with file processing logic |
| extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/utils/config.ts | Configuration utilities for chunk size management |
| extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/text_extractor.ts | Document text extraction with LangChain loaders |
| extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/text_chunker.ts | Text splitting functionality for document chunking |
| extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/list_files.ts | SharePoint file listing with filtering |
| extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/chunk_extractor.ts | File download and chunk extraction orchestration |
| extensions/microsoft-sharepoint/src/connections/connectorConnection.ts | Connection schema for SharePoint authentication |
| extensions/microsoft-sharepoint/package.json | Updated dependencies including LangChain packages |
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/chunk_extractor.ts (outdated comment, resolved)
```ts
fields: [
    { fieldName: "tenantId" },
    { fieldName: "clientId" },
    { fieldName: "clientSecret"},
```
Inconsistent spacing: missing space before the closing brace. Should be `{ fieldName: "clientSecret" },` to match the formatting of the other fields.
…rs/chunk_extractor.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/text_extractor.ts (two resolved comments)
```ts
    }
  } catch (error) {
    // Continue with next file even if this one fails
    console.error(`Failed to process file ${file.name}:`, error);
```
The error message doesn't provide actionable context. Consider including the drive name and file path, and suggesting next steps.
Suggested change:
```ts
console.error(
  `Failed to process file "${file.name}" (id: ${file.id}) in drive "${drive.name}" (id: ${driveId}). ` +
  `Check that the file exists in SharePoint, that the connector has access permissions, and that the file type is supported.`,
  error
);
```
@falak-asad I might just remove the logs again?
```ts
export const logger = {
  log: (level: string, context: any, message: string) => {
```
The custom logger object should have a more specific name such as `customLogger` or `textExtractorLogger` to avoid confusion with standard console logging.
Pull request overview
Copilot reviewed 11 out of 14 changed files in this pull request and generated 11 comments.
```ts
if (!pptx)
  return [];
return [
  new Document({
    pageContent: pptx,
```
The `parseOfficeAsync` function from the `officeparser` package is called without any error handling beyond the try-catch wrapper. According to the library documentation, if parsing fails it may return an empty string or undefined. The code checks for `!pptx`, but should also handle the case where `pptx` is an empty string to avoid creating empty knowledge chunks.
Suggested change:
```ts
// Handle cases where parsing fails or returns empty/meaningless content
if (!pptx) {
  return [];
}
const cleanedContent = removeUnnecessaryChars(pptx);
if (!cleanedContent) {
  return [];
}
return [
  new Document({
    pageContent: cleanedContent,
```
```ts
tags: sourceTags as string[],
chunkCount: chunks.length,
contentHashOrTimestamp: contentHash,
externalIdentifier: file.name,
```
Potential issue with external identifier uniqueness. Using file.name as the externalIdentifier may cause collisions if files with the same name exist in different drives or folders. Consider using a more unique identifier such as a combination of driveId and file.id, or the full file path, to prevent potential conflicts and ensure proper cleanup logic.
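As a sketch of the combined-identifier idea (the `buildExternalIdentifier` helper is illustrative and not part of the PR; `driveId` and `file.id` mirror names already used in the connector):

```typescript
// Hypothetical helper: build a collision-resistant external identifier
// from the drive ID and the file ID instead of the bare file name.
function buildExternalIdentifier(driveId: string, fileId: string): string {
  return `${driveId}/${fileId}`;
}

// Two files named "report.pdf" in different drives no longer collide,
// because the identifier is keyed on stable IDs rather than display names.
const a = buildExternalIdentifier("drive-A", "01ABC");
const b = buildExternalIdentifier("drive-B", "01ABC");
console.log(a !== b); // true
```

The same identifier would then need to be pushed into `newSources` for the cleanup logic to match.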
```ts
} finally {
  // Clean up temp file
  if (fs.existsSync(tempFilePath)) {
    try { fs.unlinkSync(tempFilePath); } catch (e) { /* ignore cleanup errors */ }
```
The temp file cleanup may fail silently in the catch block, potentially leaving temporary files on disk if the unlink operation fails. While the error is caught to prevent crashing, accumulated temp files could eventually consume disk space. Consider logging a warning when cleanup fails to help diagnose issues in production.
Suggested change:
```ts
try {
  fs.unlinkSync(tempFilePath);
} catch (e) {
  console.warn(`Warning: Failed to remove temp file at ${tempFilePath}:`, e);
}
```
```ts
// Convert split documents to ChunkContent format
const chunks: ChunkContent[] = splitDocuments
  .slice(0, MAX_CHUNKS_PER_FILE)
  .map((doc, index) => ({
    text: doc.pageContent,
    data: {
      title: `${fileName} - Part ${index + 1}`,
      source: fileName,
      fileType: fileExtension,
    },
  }));
```
Hard-coded chunk limit of 500 chunks per file may not be appropriate for all use cases. Large documents could be truncated without warning to the user. Consider either making this configurable through the connector fields or logging a warning when a file exceeds this limit so users are aware that content is being truncated.
Suggested change:
```ts
const totalChunks = splitDocuments.length;
if (totalChunks > MAX_CHUNKS_PER_FILE) {
  console.warn(
    `Truncating SharePoint file "${fileName}": ` +
    `${totalChunks} chunks generated, but only the first ` +
    `${MAX_CHUNKS_PER_FILE} will be kept. ` +
    `${totalChunks - MAX_CHUNKS_PER_FILE} chunks will be discarded.`
  );
}
// Convert split documents to ChunkContent format
const limitedDocuments = splitDocuments.slice(0, MAX_CHUNKS_PER_FILE);
const chunks: ChunkContent[] = limitedDocuments.map((doc, index) => ({
  text: doc.pageContent,
  data: {
    title: `${fileName} - Part ${index + 1}`,
    source: fileName,
    fileType: fileExtension,
  },
}));
```
```ts
// Process each file
for (const file of files) {
  try {
    const fileExtension = path.extname(file.name).slice(1).toLowerCase();

    // Get chunks for this file
    const chunks = await getSharePointFileChunks(
      accessToken,
      driveId,
      file.id,
      file.name,
      fileExtension
    );

    if (chunks.length === 0) continue;

    // Compute content hash to support safe upserts
    const contentHash = createContentHash(chunks);

    // Upsert knowledge source so we can skip re-ingestion if unchanged
    const knowledgeSource = await api.upsertKnowledgeSource({
      name: file.name,
      description: `Data from ${file.name} in SharePoint library ${drive.name}`,
      tags: sourceTags as string[],
      chunkCount: chunks.length,
      contentHashOrTimestamp: contentHash,
      externalIdentifier: file.name,
    });

    if (knowledgeSource) {
      // Create all chunks (the runtime may optimise if content unchanged)
      for (const chunk of chunks) {
        await api.createKnowledgeChunk({
          knowledgeSourceId: knowledgeSource.knowledgeSourceId,
          text: chunk.text,
          data: chunk.data,
        });
      }
    }

    // Track processed external identifiers for cleanup
    newSources.push(file.name);
  } catch (error) {
    // Continue with next file even if this one fails
    console.error(`Failed to process file ${file.name}:`, error);
    continue;
  }
```
Sequential processing of files and chunks with await in loops will cause significant performance issues for SharePoint libraries with many files. Each file is processed one at a time, and within each file, chunks are created one at a time. For a library with hundreds of files, this could take hours. Consider using Promise.all() or batch processing to process multiple files in parallel, and batch-create chunks to improve performance.
Suggested change:
```ts
// Process each file in small concurrent batches to avoid sequential bottlenecks
const FILE_BATCH_SIZE = 5;
for (let i = 0; i < files.length; i += FILE_BATCH_SIZE) {
  const batch = files.slice(i, i + FILE_BATCH_SIZE);
  await Promise.all(
    batch.map(async (file) => {
      try {
        const fileExtension = path.extname(file.name).slice(1).toLowerCase();
        // Get chunks for this file
        const chunks = await getSharePointFileChunks(
          accessToken,
          driveId,
          file.id,
          file.name,
          fileExtension
        );
        if (chunks.length === 0) {
          return;
        }
        // Compute content hash to support safe upserts
        const contentHash = createContentHash(chunks);
        // Upsert knowledge source so we can skip re-ingestion if unchanged
        const knowledgeSource = await api.upsertKnowledgeSource({
          name: file.name,
          description: `Data from ${file.name} in SharePoint library ${drive.name}`,
          tags: sourceTags as string[],
          chunkCount: chunks.length,
          contentHashOrTimestamp: contentHash,
          externalIdentifier: file.name,
        });
        if (knowledgeSource) {
          // Create all chunks concurrently (the runtime may optimise if content unchanged)
          await Promise.all(
            chunks.map((chunk) =>
              api.createKnowledgeChunk({
                knowledgeSourceId: knowledgeSource.knowledgeSourceId,
                text: chunk.text,
                data: chunk.data,
              })
            )
          );
        }
        // Track processed external identifiers for cleanup
        newSources.push(file.name);
      } catch (error) {
        // Continue with next file even if this one fails
        console.error(`Failed to process file ${file.name}:`, error);
        return;
      }
    })
  );
```
```ts
{ fieldName: "tenantId" },
{ fieldName: "clientId" },
{ fieldName: "clientSecret" },

```
Extra blank line before the closing bracket. Remove the blank line at line 10 to maintain consistent formatting with other connection schemas in the codebase.
```ts
import * as path from 'path';
import * as crypto from "crypto";


```
Extra blank line after imports. Remove the extra blank line at line 8 to follow consistent formatting patterns seen in other knowledge connectors in the codebase.
```diff
  "sourceMap": true,
  "moduleResolution": "node",
- "watch": false
+ "skipLibCheck": true
```
The 'watch' property was removed from tsconfig.json. While 'skipLibCheck' was added (which is good for build performance), removing 'watch: false' changes the default TypeScript behavior. If watch mode was explicitly disabled before, this change may cause unintended watch mode activation. Verify that this change is intentional and won't affect the build process.
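If watch mode should stay explicitly disabled, a minimal `compilerOptions` sketch (assuming the other options shown in the diff, and keeping both settings side by side) could look like:

```json
{
  "compilerOptions": {
    "sourceMap": true,
    "moduleResolution": "node",
    "skipLibCheck": true,
    "watch": false
  }
}
```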
```ts
} catch (error) {
  // Continue with next file even if this one fails
  console.error(`Failed to process file ${file.name}:`, error);
  continue;
```
The error handling silently continues on failure but error messages may contain sensitive information or stack traces that could expose internal details. While the continuation logic is correct to keep processing other files, ensure that error logging is controlled in production environments to avoid exposing sensitive data.
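One way to control this, sketched below with a hypothetical `formatProcessingError` helper (not part of the PR), is to include stack traces only outside production:

```typescript
// Hypothetical sketch: build a log message whose verbosity depends on the
// environment. In production only the error message is included; elsewhere
// the full stack trace is appended for debugging.
function formatProcessingError(fileName: string, error: unknown): string {
  const isProduction = process.env.NODE_ENV === "production";
  const message = error instanceof Error ? error.message : String(error);
  // Stack traces are appended only outside production to avoid leaking internals.
  const detail =
    !isProduction && error instanceof Error && error.stack
      ? `\n${error.stack}`
      : "";
  return `Failed to process file ${fileName}: ${message}${detail}`;
}
```

The catch block would then call `console.error(formatProcessingError(file.name, error))` instead of passing the raw error object through.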
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 11 out of 14 changed files in this pull request and generated 7 comments.
```diff
@@ -0,0 +1 @@
+{}
```
This root-level package.json is empty, but the root package-lock.json declares dependencies. This mismatch will break reproducible installs (npm will consider the lock out of sync) and is inconsistent with the repo’s per-extension package.json structure. Either remove the root package.json/package-lock.json entirely, or define the intended root dependencies and regenerate the lockfile from that package.json.
Suggested change:
```json
{
  "name": "root",
  "version": "1.0.0",
  "private": true
}
```
```ts
const url = `${baseUrl}?$top=1000`;

const response = await axios.get(url, {
  headers: {
    'Authorization': `Bearer ${accessToken}`,
    'Content-Type': 'application/json'
  }
});

if (!response.data.value) {
  return [];
}

// Filter for files only (not folders), non-empty, and supported types
const sharePointFiles: SharePointFile[] = response.data.value
```
The header comment claims pagination support, but the implementation only fetches the first page (?$top=1000) and ignores @odata.nextLink. Either implement pagination by following @odata.nextLink until exhausted, or update the comment to avoid promising behavior that doesn’t exist.
Suggested change:
```ts
let url = `${baseUrl}?$top=1000`;
const allItems: any[] = [];
while (url) {
  const response = await axios.get(url, {
    headers: {
      'Authorization': `Bearer ${accessToken}`,
      'Content-Type': 'application/json'
    }
  });
  if (!response.data || !response.data.value) {
    break;
  }
  allItems.push(...response.data.value);
  const nextLink = response.data['@odata.nextLink'];
  url = nextLink && typeof nextLink === 'string' ? nextLink : '';
}
if (allItems.length === 0) {
  return [];
}
// Filter for files only (not folders), non-empty, and supported types
const sharePointFiles: SharePointFile[] = allItems
```
```ts
// Log first few files for debugging
sharePointFiles.slice(0, 3).forEach((file, index) => {
  console.log(`File ${index + 1}: ${file.name} (${file.size} bytes)`);
});
```
This debug logging will run on every connector sync and can create noisy logs (and potentially leak document names). Consider removing it or gating it behind an explicit debug flag (e.g., env var) so production runs stay quiet by default.
Suggested change:
```ts
// Log first few files for debugging (enabled only when explicitly requested)
if (process.env.SHAREPOINT_CONNECTOR_DEBUG === 'true') {
  sharePointFiles.slice(0, 3).forEach((file, index) => {
    console.log(`File ${index + 1}: ${file.name} (${file.size} bytes)`);
  });
}
```
```ts
// split document into paragraphs according to specified or default splitter
const splitDocuments = (
  await splitDocs(docs)
).map((doc) => doc.pageContent);

// join the paragraphs into the format we want
const textParagraphs = splitDocuments.join('\n\n');

logger.log("info", null, "Successfully used langchain to extract content");

return textParagraphs;
```
lsExtractor already splits documents into chunks (via splitDocs) and then rejoins them into a single string, but getSharePointFileChunks immediately calls splitDocs again on that returned string. This duplicates work and makes it unclear where chunking is supposed to happen. Consider having lsExtractor return extracted text only (no chunking), or return split Document[] and let chunk_extractor be the single place that chunks.
Suggested change:
```ts
// Join cleaned document contents into a single string; downstream code is responsible for chunking
const fullText = docs.map((doc) => doc.pageContent).join('\n\n');

logger.log("info", null, "Successfully used langchain to extract content");

return fullText;
```
```ts
{ fieldName: "tenantId" },
{ fieldName: "clientId" },
{ fieldName: "clientSecret" },

```
Minor: there’s an extra blank line in the fields array, and indentation differs from the other connection schema files in this extension. Cleaning this up will keep formatting consistent and reduce diffs over time.
```ts
// Save to temp file (text_extractor needs file path)
const tempDir = os.tmpdir();
const tempFileName = `${Date.now()}_${fileName}`;
```
The temp file name is derived from fileName. To avoid filesystem issues (invalid characters/very long names) and to guarantee uniqueness, prefer generating the temp file name from stable IDs (e.g., fileId) and/or sanitize the original file name before using it on disk.
Suggested change:
```ts
const tempFileName = `${fileId}_${Date.now()}`;
```
```ts
const tempFileName = `${Date.now()}_${fileName}`;
const tempFilePath = path.join(tempDir, tempFileName);

fs.writeFileSync(tempFilePath, bodyContents as any);
```
This uses synchronous filesystem I/O (writeFileSync) inside an async workflow. When processing many files, this will block the Node.js event loop and can significantly reduce throughput. Prefer await fs.promises.writeFile(...) (and similar async APIs for cleanup) to keep the connector responsive.
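A minimal sketch of the non-blocking alternative, assuming a hypothetical `withTempFile` wrapper and treating `bodyContents` as a Buffer (the connector's actual surrounding code is not reproduced here):

```typescript
import { promises as fs } from "fs";
import * as os from "os";
import * as path from "path";

// Sketch of the temp-file round trip using non-blocking fs.promises APIs.
// `bodyContents` stands in for the downloaded SharePoint file body.
async function withTempFile(bodyContents: Buffer): Promise<string> {
  const tempFilePath = path.join(os.tmpdir(), `sp_${Date.now()}.tmp`);
  await fs.writeFile(tempFilePath, bodyContents); // does not block the event loop
  try {
    // ... hand tempFilePath to the text extractor here ...
    return await fs.readFile(tempFilePath, "utf8");
  } finally {
    // Async cleanup; a failure here surfaces as a warning instead of silence.
    await fs
      .unlink(tempFilePath)
      .catch((e) => console.warn(`Failed to remove ${tempFilePath}:`, e));
  }
}
```

`fs.promises` ships with Node.js, so switching from the synchronous calls adds no new dependency.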
Added the Knowledge Connector for SharePoint